Why Multi-Agent Systems Matter
Single-agent architectures hit a ceiling fast. The moment your AI application needs to juggle multiple concerns — real-time user interaction, background evaluation, adaptive logic, and post-processing — a monolithic agent becomes a liability. It bloats with instructions, confuses tool boundaries, and becomes nearly impossible to debug.
Multi-agent systems solve this by decomposing complex workflows into specialized agents, each with a focused mandate, its own state, and a clear contract with the rest of the system. You need them when your application has conflicting concerns (e.g., one agent must be helpful to a user while another must evaluate that user impartially), different latency requirements (real-time vs. batch), or different model needs (fast/cheap vs. slow/powerful).
In a recent engagement, we built an AI-native technical assessment platform with 8 specialized agents handling everything from real-time coding assistance to psychometric evaluation. Here is what we learned about making multi-agent systems work in production.
Architecture Overview: Designing Agent Boundaries
The first decision in any multi-agent system is how to draw the boundaries. Get this wrong and you end up with agents that are either too chatty (constant hand-offs, high latency) or too monolithic (back to the single-agent problem).
The Hierarchical Orchestrator-Worker Pattern
We used a hierarchical orchestrator-worker pattern built on LangGraph's StateGraph primitives. A supervisor agent (the Session Orchestrator) coordinates hand-offs between specialized worker agents. Each worker owns a distinct phase of the workflow:
- Coding Agent — Candidate-facing ReAct agent that helps candidates write code with configurable helpfulness levels
- Interview Agent — Background state machine tracking candidate performance and adapting question difficulty using Item Response Theory (hidden from the candidate entirely)
- Evaluation Agent — Post-session agent performing evidence-based scoring across 4 dimensions with agentic data discovery
- Fast Progression Agent — Speed-optimized (20–40s) gate check for real-time question advancement
- Comprehensive Agent — Deep evaluation (3–5 min) generating detailed hiring manager reports
- Question Generation Agent — LLM-powered question variant generation with difficulty targeting
- Question Evaluation Agent — Per-question solution assessment
- Supervisor Agent — The orchestrator coordinating hand-offs between all of the above
Why LangGraph?
We chose LangGraph 1.0 for four reasons that matter in production:
- StateGraph primitives — Type-safe, reproducible state management using TypedDict schemas
- Native checkpointing — Conversation persistence and crash recovery out of the box
- Conditional routing — Dynamic multi-agent orchestration without writing a custom router
- Streaming support — Real-time token streaming for responsive user experiences
Drawing Boundaries: Rules of Thumb
Separate agents when they have different audiences (candidate-facing vs. internal), different timing (real-time vs. async), or different trust levels (agents that must never share context). In our system, the Coding Agent and Interview Agent run simultaneously but are strictly isolated — the candidate never sees evaluation data, and the evaluator never biases the coding assistant.
Key Patterns for Production Multi-Agent Systems
1. The Supervisor Pattern
The supervisor agent acts as the entry point and router. It inspects incoming state, determines which worker should handle the current phase, and manages hand-offs. In LangGraph, this maps naturally to conditional edges on the StateGraph:
from langgraph.graph import StateGraph, END
from typing import TypedDict

class SessionState(TypedDict):
    phase: str
    messages: list
    candidate_theta: float
    current_question: dict

def route_to_agent(state: SessionState) -> str:
    # Inspect the current phase and hand off to the matching worker node
    if state["phase"] == "coding":
        return "coding_agent"
    elif state["phase"] == "evaluation":
        return "evaluation_agent"
    elif state["phase"] == "progression":
        return "fast_progression_agent"
    return END

graph = StateGraph(SessionState)
# Worker nodes ("coding_agent", etc.) are registered elsewhere via add_node
graph.add_conditional_edges("supervisor", route_to_agent)
The supervisor itself can be a simple state machine — no LLM call needed. This is a key insight: not every agent needs a language model. State machine agents cost zero tokens and execute in microseconds.
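To make the zero-token point concrete, here is a minimal sketch of a supervisor as a pure transition table. The phase names mirror the routing example above; the transition graph itself is illustrative, not the production logic.

```python
# Hypothetical sketch: a zero-token supervisor as a pure transition table.
# Phase names follow the routing example; transitions are illustrative.
PHASE_TRANSITIONS = {
    "coding": "progression",   # after a question, run the fast gate check
    "progression": "coding",   # gate passed: back to coding on the next question
    "evaluation": "done",      # comprehensive scoring is the final phase
}

def advance_phase(state: dict) -> dict:
    """Deterministically advance the session phase. No LLM call involved."""
    next_phase = PHASE_TRANSITIONS.get(state["phase"], "done")
    return {**state, "phase": next_phase}
```

Because this is a plain dictionary lookup, it is trivially testable and its behavior never drifts between runs.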
2. Agent Isolation
When agents serve different stakeholders or trust levels, isolation is non-negotiable. We implemented a 5-layer isolation strategy:
- Network layer — Separate API endpoints for candidate-facing and internal agents
- API filtering — Strip sensitive fields before data crosses trust boundaries
- Context isolation — Separate LangGraph threads (deterministic UUIDv5 thread IDs) so agents never share conversation history
- Tool access control — Non-overlapping tool sets per agent; the coding agent cannot access evaluation tools
- Audit logging — Immutable logs of every cross-agent communication for compliance
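The API-filtering layer is the simplest to illustrate: a pure function that strips sensitive fields before session state crosses the trust boundary to a candidate-facing agent. The field names below are hypothetical stand-ins.

```python
# Illustrative sketch of the API-filtering layer. Field names are hypothetical;
# the point is that filtering happens before data crosses the trust boundary.
SENSITIVE_FIELDS = {"candidate_theta", "evaluation_notes", "difficulty_plan"}

def filter_for_candidate(state: dict) -> dict:
    """Return a copy of session state that is safe to expose to the
    candidate-facing agent."""
    return {k: v for k, v in state.items() if k not in SENSITIVE_FIELDS}
```

An allowlist (enumerating what may cross the boundary) is even safer than this denylist, since new fields are hidden by default.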
3. Middleware Pipeline
Cross-cutting concerns — caching, model selection, state extraction, checkpointing — should not live inside individual agents. We designed a composable middleware stack with 15 layers that intercept requests before and after model execution:
- Before middleware: Prompt caching setup, model selection (tiered by task complexity), turn guidance injection
- After middleware: State extraction from model output, checkpointing to PostgreSQL, persistence of evaluation artifacts
This keeps agents focused on their core logic while middleware handles the plumbing.
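The shape of such a stack can be sketched in plain Python as nested before/after wrappers around a handler; the production pipeline was framework-specific, and the example layers here are illustrative.

```python
from typing import Callable

# Minimal sketch of a composable before/after middleware stack. The layer
# implementations below are illustrative placeholders, not production code.
Handler = Callable[[dict], dict]

def with_middleware(handler: Handler,
                    before: list[Handler],
                    after: list[Handler]) -> Handler:
    def wrapped(state: dict) -> dict:
        for mw in before:       # e.g. cache setup, model selection
            state = mw(state)
        state = handler(state)  # the agent's core model call
        for mw in after:        # e.g. state extraction, checkpointing
            state = mw(state)
        return state
    return wrapped

# Example layers (hypothetical)
select_model = lambda s: {**s, "model": "fast" if s.get("simple") else "deep"}
record_audit = lambda s: {**s, "audited": True}

agent = with_middleware(lambda s: {**s, "output": "done"},
                        before=[select_model], after=[record_audit])
```

The agent body stays a single function; every cross-cutting concern is a separate, individually testable layer.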
4. Adaptive State with Item Response Theory
For dynamic difficulty adjustment, we implemented a psychometric algorithm, the one-parameter (Rasch) IRT model, as a state machine agent. After each question, the system updates the candidate ability estimate (theta) using P(correct) = 1 / (1 + e^(−(θ − b))), where θ is the ability estimate and b is the question difficulty. The estimate converges to an accurate value within 5–10 questions across a difficulty scale of 1–10.
Because this is pure math — no LLM involved — it runs at zero token cost and deterministic latency.
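A minimal sketch of the math: compute the Rasch probability, then nudge theta toward the observed outcome by the prediction error. The gradient-style update and its learning rate are illustrative choices, not the production estimator.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """P(correct) = 1 / (1 + e^-(theta - b)) for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def update_theta(theta: float, b: float, correct: bool, lr: float = 0.5) -> float:
    """Illustrative online update: move theta by lr * (outcome - prediction).
    A correct answer on a hard question moves theta up a lot; a correct
    answer on an easy one barely moves it."""
    outcome = 1.0 if correct else 0.0
    return theta + lr * (outcome - rasch_probability(theta, b))
```

When θ equals the question difficulty, the predicted probability is exactly 0.5, so each observed outcome shifts the estimate by half the learning rate.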
Production Considerations
Latency: Targeting Sub-2-Second Responses
Multi-agent systems are latency traps. Every hand-off, every LLM call, every tool invocation adds up. We hit our <2s p99 latency target through:
- Parallel tool calls — LangGraph supports batching tool invocations within a single agent step
- State machine agents — The Interview Agent and Supervisor run without LLM calls, eliminating their latency contribution
- Model tiering — Claude Haiku for fast progression checks (20–40s budget), Claude Sonnet for deep evaluation (3–5 min budget)
- Streaming — Token-level streaming to the frontend so users see responses forming immediately
Cold start latency dropped from 8–12s to 2–3s (a 70% improvement) through Cloud Run configuration tuning and connection pooling.
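The relevant Cloud Run knobs can be sketched as follows; the service name and region are hypothetical, and the exact values depend on your traffic profile.

```shell
# Illustrative Cloud Run tuning for cold starts (service/region hypothetical).
# Keeping a warm minimum instance avoids the slow cold boot entirely.
gcloud run services update assessment-agents \
  --region us-central1 \
  --min-instances 1 \
  --concurrency 80
```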
Cost Management: 40% Reduction via Prompt Caching
LLM costs scale with conversation length. In long technical interviews, context windows fill up fast. We implemented a three-tier prompt caching strategy using Anthropic's cache control:
- Tier 1: System prompt (~15K tokens) — 100% cache hit rate
- Tier 2: Tool definitions (~5K tokens) — 100% cache hit rate
- Tier 3: Message context (~2K tokens) — 100% cache hit rate for recent turns
Result: token costs per session dropped from $2.50 to $1.50 — a 40% reduction. At scale, this is the difference between a viable product and a money pit.
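The three tiers map onto `cache_control` markers in the request payload. The sketch below only constructs the payload (no request is sent); the model id and prompt contents are placeholders.

```python
# Sketch of the three-tier cache layout as an Anthropic Messages API payload.
# Nothing is sent here; this only shows where cache_control markers go.
CACHE = {"type": "ephemeral"}

def build_request(system_prompt: str, tools: list, messages: list) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        # Tier 1: cache the large, static system prompt
        "system": [{"type": "text", "text": system_prompt,
                    "cache_control": CACHE}],
        # Tier 2: cache tool definitions by marking the last tool
        "tools": tools[:-1] + [{**tools[-1], "cache_control": CACHE}],
        # Tier 3: message context; recent turns can be marked the same way
        "messages": messages,
    }
```

Ordering matters: cached prefixes must be stable, so the most static content (system prompt, tools) goes first and the volatile conversation tail goes last.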
Error Handling and Observability
Multi-agent systems fail in ways single agents do not. An agent can hang, produce malformed state, or enter an infinite hand-off loop. Our approach:
- LangSmith tracing — Every agent invocation, tool call, and state transition is traced end-to-end
- Sentry integration — Exception tracking with agent context attached
- Checkpointing — LangGraph's native PostgreSQL checkpointer means sessions survive crashes and can be resumed
- Deterministic thread IDs — UUIDv5 derived from session identifiers, making debugging reproducible
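The deterministic-thread-ID trick is one line of standard library code: derive a UUIDv5 from the session and agent identifiers, so the same session always maps to the same thread and a production trace can be replayed locally. The namespace choice and key format below are illustrative.

```python
import uuid

def thread_id(session_id: str, agent_name: str) -> str:
    """Deterministic UUIDv5 thread ID: same inputs always yield the same ID.
    Namespace and key format are illustrative choices."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{session_id}:{agent_name}"))
```

Including the agent name in the key keeps each agent on its own thread, which is also what enforces the context-isolation layer described earlier.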
Results and Lessons Learned
After 12 weeks of development and deployment, the system achieved:
- 8 specialized agents running in production with clear boundaries and responsibilities
- 100+ concurrent sessions supported without degradation
- <2s p99 response latency for candidate-facing interactions
- 40% cost reduction through three-tier prompt caching
- $1.50 average cost per assessment session
- 70% cold start improvement (from 8–12s down to 2–3s)
What We Would Do Differently
- Start with fewer agents. Begin with 2–3 and split only when you have clear evidence that an agent is doing too much. Premature decomposition creates coordination overhead.
- Invest in observability early. LangSmith tracing should be wired in from day one, not retrofitted.
- Use state machines aggressively. Every agent that does not need an LLM should be a state machine. They are faster, cheaper, and more predictable.
- Design for isolation from the start. Retrofitting trust boundaries between agents is significantly harder than building them in from the beginning.
Getting Started
If you are building your first multi-agent system with LangGraph, here is a practical sequence:
1. Map your workflow — Identify the distinct phases, audiences, and trust levels in your application.
2. Define your state schema — Use TypedDict to create a shared state contract. This is the backbone of your system.
3. Build the supervisor first — Start with a simple conditional router.
4. Implement one worker agent — Get a single agent working end-to-end with checkpointing and streaming before adding more.
5. Add isolation layers — If agents serve different stakeholders, wire in context isolation and tool access control immediately.
6. Set up observability — Integrate LangSmith tracing before you add your second agent.
7. Optimize costs last — Prompt caching and model tiering are powerful but only matter once the system works correctly.
Multi-agent systems are not inherently more complex than single agents — they are differently complex. The patterns above, drawn from a real production deployment, should give you a concrete starting point for your own implementation.