AI Agent Development: End of Year Reflections
Patterns that have become obvious in AI agent development - from code-first approaches to continual learning and the convergence of RL, multi-agent systems, and context management.
I was recently describing to some friends how I’d summarize the ways AI agent development has been changing and where it’s going.
1. Code should be first class.
I saw smolagents quite a while ago and it’s been interesting to watch it progress. Anthropic’s programmatic tool calling lets Claude write code that invokes tools inside a code execution container, rather than requiring a round trip through the model for each tool call. Fewer API calls, lower latency, and Claude can filter or process data before it hits the context window. In the mid-term, approach problems by asking “how can I solve this with code?” instead of reaching for traditional tools. In the long term we’ll see much tighter integration, and AI assistants will answer that question autonomously, given your primary objective.
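The difference is easy to sketch. Assuming a hypothetical `search_tickets` tool exposed inside the execution container (stubbed here so the sketch runs), the model emits one script instead of many tool-call round trips:

```python
# Hypothetical tool exposed inside the code execution container;
# stubbed so the sketch is runnable.
def search_tickets(query: str) -> list[dict]:
    return [{"id": i, "status": "open" if i % 2 else "closed", "title": f"Ticket {i}"}
            for i in range(10)]

# Instead of ten round trips through the model, the model writes one
# script: fetch, filter, and reduce before anything reaches the
# context window.
tickets = search_tickets("billing")
open_tickets = [t for t in tickets if t["status"] == "open"]
report = f"{len(open_tickets)} open of {len(tickets)} total"
print(report)
```

Only the one-line `report` ever enters the context window; the raw ticket data stays in the container.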
2. “Just give it a file system” is beating complex external systems.
For knowledge work I’d take Claude Code with grep on a library of documents over vanilla RAG every single time. (At enterprise scale with millions of documents, RAG still wins, but for most repos and doc libraries, start simple.) This extends to memory: sometimes the best “memory system” is just letting the model write/read/update its own notes in the same workspace. LangChain’s How agents can use filesystems for context engineering covers this well.
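A minimal sketch of that pattern, using a hypothetical notes file in the agent’s workspace (the filename and helpers are invented for illustration):

```python
from pathlib import Path

NOTES = Path("agent_notes.md")  # hypothetical workspace file
NOTES.unlink(missing_ok=True)   # start fresh for this sketch

def remember(fact: str) -> None:
    # A memory write is just appending a line to a file the agent can re-read.
    with NOTES.open("a") as f:
        f.write(f"- {fact}\n")

def recall(keyword: str) -> list[str]:
    # A memory read is just grepping the same file; no vector store required.
    if not NOTES.exists():
        return []
    return [line.strip() for line in NOTES.read_text().splitlines()
            if keyword.lower() in line.lower()]

remember("User prefers TypeScript examples")
remember("Deploys happen on Fridays")
print(recall("typescript"))
```

The notes survive across sessions for free, because they’re just files in the repo.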
3. Systems are being built around AI-managed context.
This is the pattern underneath a lot of what’s happening. Context isn’t just memory or conversation history. It’s capabilities: which tools are available, what persona the agent is operating under, what skills it can invoke. And increasingly, agents are managing this themselves.
Claude Skills are file-based: a SKILL.md file with instructions, optional scripts, and resources that get loaded from the filesystem when needed. Kiro Powers take a function-based approach: MCP servers that activate dynamically based on context. Different implementations, same insight: don’t load everything upfront. Let the agent pull what it needs.
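For a sense of the shape, here’s a minimal, invented SKILL.md (the frontmatter fields mirror the published format; the skill itself is hypothetical):

```markdown
---
name: changelog-writer
description: Drafts changelog entries from commit messages. Use when the user asks for release notes.
---

# Changelog writer

1. Run `git log --oneline` since the last tag.
2. Group commits into Added / Changed / Fixed.
3. Draft entries in the project's existing changelog style.
```

Only the name and description sit in context by default; the body is read from disk when the skill is actually invoked.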
This connects directly to multi-agent thinking. A “specialist” is really just a persona plus a tool set. Dynamically loaded profiles (Skills, Powers) are basically sequential multi-agent: load a specialist, do the work, unload. The agent is managing its own embodiments.
I’m bullish on this happening through the file system. Claude Code is already fantastic at reading, writing, and navigating files. It’s been post-trained for this environment. We have a good harness. The “skill factory” pattern (agents creating new skills for themselves) is just file writes. Memory is file writes. Tool descriptions are file reads. The primitives are already there.
RL will make this better. Training agents to know when to load a skill vs create a new one, what context to pull for a given task, when to ask for clarification vs proceed. That’s the next level.
4. Plan-first is dominant and will stay that way.
The “interview the user” pattern will be deployed for many new domains and use cases. (Try Claude Code Plan Mode, or Kiro’s spec-driven development, and imagine how that can be applied to your processes.)
The agent can ask about your preferences and requirements, and it will learn which questions it needs to ask to gather enough detail to complete the task to your satisfaction (or to reach some verifiable end point).
Beads extends this to task management: a git-backed graph issue tracker where agents can break down work, track dependencies, and pick up the plan across sessions.
5. “Continual learning” is being mentioned by everyone, but not really defined by anybody.
Most current implementations are what I described above: AI-managed context engineering. Memory systems, skill creation, dynamic capability loading. Agents get better at their jobs by writing to the filesystem, not by updating weights.
Expect pushback from ML researchers who argue that “continual learning” should imply weight updates. Letta’s take on “learning in token space” is the best framing I’ve seen: today’s agents are weights plus context, and updates to learned context should be the primary mechanism for learning from experience.
6. RL remains central and is getting more sophisticated.
A few directions I’m watching:
Open environments are scaling up. The more environments and embodiments (think: persona + tool set) we design for post-training, the better. Prime Intellect’s Environments Hub is building an open community platform for this. On the robotics side, NVIDIA’s GEAR group (Jim Fan’s team) is pushing cross-embodiment policies: training that transfers across different robot forms.
Rewards are getting more granular. The original GRPO gave advantage signals to entire trajectories. PRMs (process reward models) were big for reasoning models, but have been less prominent in research discussions since GRPO showed their limitations at scale. Now, though, turn-level credit assignment is coming back. A recent paper built on Will Brown’s verifiers library extends GRPO to multi-turn settings with fine-grained rewards. Tree-GRPO derives step-level signals using tree search. The trend is finer-grained credit assignment, often modeling the agent as an MDP.
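To make the granularity point concrete: GRPO samples a group of trajectories per prompt and normalizes each trajectory’s reward against the group, so every token in a rollout inherits one scalar. A sketch of that trajectory-level computation (this is the standard group normalization, not any specific paper’s code):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    # Group-normalized advantage: each trajectory's reward is measured
    # relative to the other rollouts sampled for the same prompt.
    mu = mean(group_rewards)
    sigma = stdev(group_rewards)
    return [(r - mu) / (sigma + 1e-8) for r in group_rewards]

# Every token in trajectory i inherits advantages[i]: one signal for
# the whole rollout. Turn- and step-level methods refine exactly this.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```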
RL for context management. This ties back to section 3. We’re seeing RL applied to memory: mem-agent trains agents to manage markdown-based memory (inspired by Obsidian). MemAgent learns what notes to take via reward shaping on final task performance. Memory-R1 takes a more structured approach with explicit memory operations. Beyond memory: RL for learning to ask good questions, knowing when a task is executable vs when to gather more context.
7. Evals unlock everything else.
The optimization loop I’d recommend: start with a simple program, optimize context and prompts until you plateau. Then increase program complexity (add tools, agents, retrieval), optimize that. Repeat until gains stop. Finetuning and RL come last. But none of this works without effective evals. Evals are the foundation.
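That loop only closes if you can score outputs automatically. A minimal eval harness, with a hypothetical `run_agent` standing in for whatever program you’re optimizing:

```python
def run_agent(task: str) -> str:
    # Hypothetical system under test; in practice this calls your
    # prompt / program / agent pipeline.
    return "42" if "answer" in task else ""

# Each case pairs a task with a programmatic check, not a vibe.
EVAL_CASES = [
    {"task": "What is the answer?", "check": lambda out: out == "42"},
    {"task": "Summarize the doc",   "check": lambda out: len(out) > 0},
]

def run_evals() -> float:
    passed = sum(1 for case in EVAL_CASES
                 if case["check"](run_agent(case["task"])))
    return passed / len(EVAL_CASES)

print(f"pass rate: {run_evals():.0%}")
```

Once this exists, every change to context, tools, or model is a before/after number instead of an argument.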
Setting up evals forces you to define what “good” means. That alone leads to much better product thinking and user experiences. PMs will need to own this as AI adoption spreads. If you want to go deep, Hamel and Shreya’s course is the best resource I’ve seen. Eugene Yan’s Product Evals in Three Simple Steps is also worth reading. I’ll have a longer post on evals soon.
8. Protocols are maturing and specializing.
MCP (Anthropic) standardizes how agents connect to tools and data. It’s the most widely adopted today because it solves a narrow problem well. A2A (Google, now Linux Foundation) enables agent-to-agent collaboration across platforms: peer-to-peer task delegation when agents don’t share memory or context. ACP (IBM) focuses on local/edge coordination where latency and reliability are critical.
9. Infrastructure is catching up.
Running agents in production used to mean duct-taping together sandboxes, auth, and observability. Now we’re seeing real infrastructure emerge:
- E2B — sandboxes that start in <200ms, used by 88% of Fortune 100
- Modal — code-first with great ML ergonomics
- Docker MCP Gateway — 200+ curated tools with automatic security auditing
- AWS Bedrock AgentCore — 9 services covering runtime, memory, gateway, browser, code interpreter, and identity. Framework-agnostic, supports A2A protocol, and just went GA
The boring-but-important stuff (RBAC, secrets management, agent registries, observability) is finally getting built.
10. Multi-agent is splitting into obvious winners + frontiers.
Sequential multi-agent systems will continue to excel in shared editing spaces. This includes the dynamic context pattern from section 3: loading a specialist profile is basically sequential multi-agent. Another common pattern: generate → critique → revise loops.
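The generate → critique → revise shape is simple to sketch, with a hypothetical `llm` stub standing in for real model calls:

```python
def llm(prompt: str) -> str:
    # Stand-in for a real model call; stubbed so the sketch runs.
    if prompt.startswith("Critique"):
        return "OK" if "(final)" in prompt else "Add a conclusion."
    return prompt.rsplit(": ", 1)[-1] + " (final)"

def generate_with_critique(task: str, max_rounds: int = 3) -> str:
    draft = llm(f"Write: {task}")
    for _ in range(max_rounds):
        # Sequential, shared-workspace style: one agent's output is the
        # next agent's input, so there's no write contention.
        feedback = llm(f"Critique: {draft}")
        if feedback == "OK":
            break
        draft = llm(f"Revise per '{feedback}': {draft}")
    return draft

print(generate_with_critique("a release note"))
```

The same loop works with a real model behind `llm`, and with the critic given a different persona than the generator.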
Rule of thumb: parallel execution is amazing when agents aren’t sharing the same workspace (sub-agents doing research in parallel, running experiments, scouting docs, etc.). But when agents are editing the same surface, sequential tends to dominate unless you add real coordination.
For file-based work, tools like MCP Agent Mail solve this: identities, inbox/outbox, searchable threads, and advisory file reservation “leases” (TTL) so agents can signal intent and avoid clobbering each other.
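The leasing idea (the concept, not MCP Agent Mail’s actual implementation) is small enough to sketch: an agent records its intent plus an expiry in a sidecar file, and cooperating agents check it before editing:

```python
import json
import time
from pathlib import Path

def acquire_lease(target: Path, agent: str, ttl: float = 60.0) -> bool:
    # Advisory lease: a sidecar file recording who intends to edit
    # `target` and until when. Cooperating agents honor it; nothing
    # enforces it at the OS level.
    lease = target.with_suffix(target.suffix + ".lease")
    if lease.exists():
        info = json.loads(lease.read_text())
        if info["expires"] > time.time():
            return False  # someone else holds an unexpired lease
    lease.write_text(json.dumps({"agent": agent, "expires": time.time() + ttl}))
    return True

def release_lease(target: Path) -> None:
    target.with_suffix(target.suffix + ".lease").unlink(missing_ok=True)

doc = Path("report.md")
assert acquire_lease(doc, "researcher-1")
assert not acquire_lease(doc, "researcher-2")  # lease still held
release_lease(doc)
```

The TTL matters: if an agent crashes mid-edit, its lease simply expires instead of deadlocking the workspace.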
For browser-based work, we’re earlier. Claude in Chrome can orchestrate across multiple tabs (competitive analysis, multi-dashboard workflows) but there’s no inter-agent communication layer yet. You can run agent-per-tab, but they can’t coordinate. That’s a gap waiting to be filled.
These patterns are converging.
Some recent work sits right at the intersection. RLMs from Alex Zhang et al.: the model manages its own context through a Python REPL (code-first), learns to fold context end-to-end through RL, delegates to sub-LLMs when needed, and iteratively refines before signaling completion (plan-first). It’s the patterns from earlier, trained end-to-end. The Bitter Lesson applied to context management. I’m bullish on future iterations that are file-based instead of REPL-based.
Open source is consolidating these patterns too. LangChain/LangGraph’s Deep Agents, inspired by Claude Code, Manus, and Deep Research, bundles: detailed system prompts, a planning tool (write_todos), file system backend for context/memory, and sub-agent spawning. Always a few weeks behind frontier players, but a solid culmination of everything I’ve described here.
One gap: we don’t have great benchmarks for these systems. We need evals that are long-context, multi-turn, and dense (where context from early turns matters later). And beyond measuring performance, we need to measure improvement — how well does an agent’s context engineering help it get better over time? I’m working on something here.
Updated Sat Jan 10
Since publishing this, these patterns have continued to gain traction. Two more examples of the file system approach: Cursor’s Dynamic Context Discovery syncs MCP tool descriptions to folders so the agent looks them up on demand (46.9% token reduction). And Vercel’s How to build agents with filesystems and bash shows them replacing custom tooling with grep, cat, and find. Cost dropped from ~$1.00 to ~$0.25 per call on Opus.
Cite this post
If you found this useful, feel free to cite it:
Hitchcock, W. (2025). AI Agent Development: End of Year Reflections. williamhitchcockai.com. https://williamhitchcockai.com/blog/ai-agent-development-end-of-year
@article{hitchcock2025agents,
  title={AI Agent Development: End of Year Reflections},
  author={Hitchcock, William},
  year={2025},
  month={January},
  url={https://williamhitchcockai.com/blog/ai-agent-development-end-of-year}
}