Reliable Multi-Agent Systems in Python at 95% Accuracy
TL;DR
Autonomous agents often fail in production due to context pollution. Here is the 'Manager-Critic' architecture and Pydantic pattern I used to reach 95% accuracy.
The "Agent" Illusion vs Reality
When we talk about AI Agents in 2026, we often imagine autonomous digital employees. The reality is far more fragile. An agent is simply a loop with access to tools, but without rigorous state management, that loop becomes a spiral of hallucination.
What is a Multi-Agent System?
A Multi-Agent System (MAS) is an architecture where multiple specialized LLM instances collaborate to solve complex tasks. Unlike a single zero-shot prompt, a MAS distributes reasoning, planning, and execution across distinct nodes.
I recently built a multi-agent system designed to autonomously refactor legacy codebases. The goal was simple: one agent reads the code, another plans the refactor, and a third executes it.
The initial accuracy? Below 40%. Here is how I pushed it to 95%.
1. Context Pollution is the Enemy
The biggest mistake I made early on was sharing the entire conversation history between agents.
# DON'T DO THIS
messages = [
    {"role": "user", "content": "Refactor this file..."},
    {"role": "assistant", "content": "I am analyzing..."},
    {"role": "tool", "content": "File content..."},  # <--- 5000 lines of code
]
# Passing this to the next agent is suicide.
When Agent A (The Planner) hands off work to Agent B (The Coder), Agent B doesn't need the 5,000 lines of file-reading logs that Agent A produced along the way. It only needs the Plan.
The Solution: Ephemeral Context Windows
I implemented a "Manager" node that sanitizes the context before passing it down. The Manager extracts the Intent and the Artifact, discarding the Process. This reduces token noise and forces the model to focus on the immediate task.
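Here is a minimal sketch of what such a sanitizing step could look like. It is not the actual Manager code: the `Handoff` class and the `sanitize` and `to_messages` helpers are hypothetical names, and it assumes the Intent is the first user message and the Artifact is the last assistant message in the history.

```python
from dataclasses import dataclass
from typing import Dict, List

Message = Dict[str, str]


@dataclass(frozen=True)
class Handoff:
    """The only two things the downstream agent receives (hypothetical type)."""
    intent: str    # what the user originally asked for
    artifact: str  # the upstream agent's final output, e.g. the plan


def sanitize(history: List[Message]) -> Handoff:
    """Keep the Intent and the Artifact; drop the Process in between.

    Assumption: intent = first user message, artifact = last assistant message.
    """
    intent = next(m["content"] for m in history if m["role"] == "user")
    artifact = next(
        m["content"] for m in reversed(history) if m["role"] == "assistant"
    )
    return Handoff(intent=intent, artifact=artifact)


def to_messages(handoff: Handoff) -> List[Message]:
    """Build a fresh, short context for the next agent."""
    return [
        {
            "role": "user",
            "content": f"Task: {handoff.intent}\n\nPlan:\n{handoff.artifact}",
        }
    ]


history = [
    {"role": "user", "content": "Refactor this file..."},
    {"role": "assistant", "content": "I am analyzing..."},
    {"role": "tool", "content": "line of code\n" * 5000},  # the noisy part
    {"role": "assistant", "content": "1. Extract helper\n2. Rename module"},
]

handoff = sanitize(history)
coder_context = to_messages(handoff)  # one short message, no tool logs
```

The Coder starts from `coder_context` alone, so the 5,000 lines of tool output never reach it.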
2. Structured Outputs or Death
If you are still parsing an LLM's raw text response with regex in 2026, you are doing it wrong.
I switched entirely to Pydantic models for agent communication. Agent A doesn't "say" what to do; it returns a JSON object adhering to a strict schema.
from typing import List, Literal

from pydantic import BaseModel


class RefactorPlan(BaseModel):
    files_to_touch: List[str]
    risk_level: Literal["low", "medium", "high"]
    steps: List[str]
# This forces the model to think in structure, not prose.
Using Gemini's structured output mode, this eliminated 99% of "I'm sorry, I misunderstood" errors. The model simply cannot respond outside the schema.
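Even with a structured output mode, it is worth validating on the receiving side. The sketch below assumes Pydantic v2 (`model_validate_json`) and a hypothetical `parse_plan` helper. The two JSON strings are made-up stand-ins for model responses, not real outputs.

```python
from typing import List, Literal, Optional

from pydantic import BaseModel, ValidationError


class RefactorPlan(BaseModel):
    files_to_touch: List[str]
    risk_level: Literal["low", "medium", "high"]
    steps: List[str]


def parse_plan(raw_json: str) -> Optional[RefactorPlan]:
    """Return a validated plan, or None if the payload breaks the schema."""
    try:
        return RefactorPlan.model_validate_json(raw_json)
    except ValidationError:
        # Malformed JSON and wrong field types both end up here.
        return None


# Illustrative payloads only.
good = '{"files_to_touch": ["app.py"], "risk_level": "low", "steps": ["rename foo"]}'
bad = '{"files_to_touch": "app.py", "risk_level": "catastrophic", "steps": []}'

plan = parse_plan(good)      # a RefactorPlan instance
rejected = parse_plan(bad)   # None: wrong type and invalid literal
```

A `None` result is a natural trigger for a retry, or for escalating to the Manager, instead of letting a malformed plan reach the Coder.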
3. The "Critic" Loop
Optimism is a bug in AI. Models want to please you. If you ask "Did you fix the bug?", they will say "Yes!".
I introduced a distinct Critic Agent.
- Coder Persona: "I fixed the bug."
- Critic Persona: "I don't trust you. Run the tests."
The Critic has no write access to the code. It only has read access and execution access (CLI). If the output of npm test isn't green, the Critic rejects the Coder's pull request. Even better, it passes the stderr output back to the Coder.
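A rough sketch of the Critic's check, using only the standard library. `Verdict` and `critic_review` are hypothetical names, and the sketch assumes that a zero exit code means "green". In the setup described above the command would be `["npm", "test"]`; two Python one-liners stand in for it here so the example runs without a Node project.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    approved: bool
    feedback: str  # stderr handed back to the Coder on rejection


def critic_review(test_command: List[str], timeout_s: int = 300) -> Verdict:
    """Run the test command and approve only if it exits with code 0."""
    result = subprocess.run(
        test_command, capture_output=True, text=True, timeout=timeout_s
    )
    if result.returncode == 0:
        return Verdict(approved=True, feedback="")
    return Verdict(approved=False, feedback=result.stderr)


# Stand-ins for a passing and a failing test suite.
passing = critic_review([sys.executable, "-c", "pass"])
failing = critic_review(
    [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('1 test failed'); sys.exit(1)",
    ]
)
```

The Critic only executes and reads. It never edits files, so the Coder cannot talk its way past a failing suite.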
Conclusion
Building agents isn't about better prompting. It's about better Architecture. It is software engineering applied to probabilistic functions. You need types, you need state isolation, and you need adversarial testing.
The future isn't just "smarter" models; it's more disciplined systems.
FAQ
Q: What is the difference between an Agent and a Chain?
A: A Chain is a fixed sequence of steps. An Agent decides its own steps based on reasoning.
Q: Why Pydantic over standard JSON?
A: Pydantic provides runtime validation, ensuring that the LLM's output matches the exact types your code expects.
Q: How do you handle infinite loops?
A: We implement a "Maximum Turn Count" (e.g., 10 turns) after which the Manager kills the process and reports failure.
Q: Is this open source?
A: Parts of the architecture are available on my GitHub.
Q: Which model is best for the Critic?
A: We found that Claude 3.5 Sonnet excels at critique, while GPT-4o is faster for code generation.
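For the infinite-loop question above, a turn cap could be sketched as follows. `run_with_turn_cap` and the stub agents are hypothetical. Real `attempt` and `review` callables would wrap LLM calls and the test run.

```python
from typing import Callable, Tuple

MAX_TURNS = 10  # the example cap from the FAQ answer


def run_with_turn_cap(
    attempt: Callable[[str], str],
    review: Callable[[str], Tuple[bool, str]],
    max_turns: int = MAX_TURNS,
) -> Tuple[bool, int]:
    """Alternate attempt/review until approval or the cap is reached.

    Returns (succeeded, turns_used).
    """
    feedback = ""
    for turn in range(1, max_turns + 1):
        candidate = attempt(feedback)
        approved, feedback = review(candidate)
        if approved:
            return True, turn
    return False, max_turns  # cap hit: report failure


# Stub agents: this "coder" only succeeds after it has seen feedback.
def stub_coder(feedback: str) -> str:
    return "fixed" if feedback else "broken"


def stub_critic(candidate: str) -> Tuple[bool, str]:
    ok = candidate == "fixed"
    return ok, "" if ok else "tests failed"


succeeded, turns = run_with_turn_cap(stub_coder, stub_critic)
gave_up, used = run_with_turn_cap(lambda _: "broken", stub_critic, max_turns=3)
```

With these stubs the first run is approved on turn 2, and the second stops after 3 turns and reports failure.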
Written by
Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. Creator of LetX, QuantumSketch, and more.