Why Multi-Agent Systems Matter
Single-agent architectures hit a ceiling fast. The moment your AI application needs to juggle multiple concerns — real-time user interaction, background evaluation, adaptive logic, and post-processing — a monolithic agent becomes a liability. It bloats with instructions, confuses tool boundaries, and becomes nearly impossible to debug.
Multi-agent systems solve this by decomposing complex workflows into specialized agents, each with a focused mandate, its own state, and a clear contract with the rest of the system. You need them when your application has conflicting concerns (e.g., one agent must be helpful to a user while another must evaluate that user impartially), different latency requirements (real-time vs. batch), or different model needs (fast/cheap vs. slow/powerful).
In a recent engagement, we built an AI-native technical assessment platform with 8 specialized agents handling everything from real-time coding assistance to psychometric evaluation. Here is what we learned about making multi-agent systems work in production.
Architecture Overview: Designing Agent Boundaries
The first decision in any multi-agent system is how to draw the boundaries. Get this wrong and you end up with agents that are either too chatty (constant hand-offs, high latency) or too monolithic (back to the single-agent problem).
The Hierarchical Orchestrator-Worker Pattern
We used a hierarchical orchestrator-worker pattern built on LangGraph's StateGraph primitives. A supervisor agent (the Session Orchestrator) coordinates hand-offs between specialized worker agents. Each worker owns a distinct phase of the workflow:
- Coding Agent — Candidate-facing ReAct agent that helps candidates write code with configurable helpfulness levels
- Interview Agent — Background state machine tracking candidate performance and adapting question difficulty using Item Response Theory (hidden from the candidate entirely)
- Evaluation Agent — Post-session agent performing evidence-based scoring across 4 dimensions with agentic data discovery
- Fast Progression Agent — Speed-optimized (20–40s) gate check for real-time question advancement
- Comprehensive Agent — Deep evaluation (3–5 min) generating detailed hiring manager reports
- Question Generation Agent — LLM-powered question variant generation with difficulty targeting
- Question Evaluation Agent — Per-question solution assessment
- Supervisor Agent — The orchestrator coordinating hand-offs between all of the above
Why LangGraph?
We chose LangGraph 1.0 for four reasons that matter in production:
- StateGraph primitives — Type-safe, reproducible state management using TypedDict schemas
- Native checkpointing — Conversation persistence and crash recovery out of the box
- Conditional routing — Dynamic multi-agent orchestration without writing a custom router
- Streaming support — Real-time token streaming for responsive user experiences
Drawing Boundaries: Rules of Thumb
Separate agents when they have different audiences (candidate-facing vs. internal), different timing (real-time vs. async), or different trust levels (agents that must never share context). In our system, the Coding Agent and Interview Agent run simultaneously but are strictly isolated — the candidate never sees evaluation data, and the evaluator never biases the coding assistant.
Key Patterns for Production Multi-Agent Systems
1. The Supervisor Pattern
The supervisor agent acts as the entry point and router. It inspects incoming state, determines which worker should handle the current phase, and manages hand-offs. In LangGraph, this maps naturally to conditional edges on the StateGraph:
from langgraph.graph import StateGraph, END
from typing import TypedDict

class SessionState(TypedDict):
    phase: str
    messages: list
    candidate_theta: float
    current_question: dict

def route_to_agent(state: SessionState) -> str:
    # Inspect the current phase and hand off to the matching worker node
    if state["phase"] == "coding":
        return "coding_agent"
    elif state["phase"] == "evaluation":
        return "evaluation_agent"
    elif state["phase"] == "progression":
        return "fast_progression_agent"
    return END

graph = StateGraph(SessionState)
# Worker nodes ("coding_agent", etc.) are registered elsewhere via add_node
graph.add_conditional_edges("supervisor", route_to_agent)
The supervisor itself can be a simple state machine — no LLM call needed. This is a key insight: not every agent needs a language model. State machine agents cost zero tokens and execute in microseconds.
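To make the zero-token point concrete, here is a minimal sketch of a supervisor as a pure transition table. The phase names mirror the routing example above; the transition graph itself is illustrative, not the production logic.

```python
# Hypothetical sketch: a zero-token supervisor as a pure transition table.
# Phase names follow the routing example; transitions are illustrative.
PHASE_TRANSITIONS = {
    "coding": "progression",   # after a question, run the fast gate check
    "progression": "coding",   # gate passed: back to coding on the next question
    "evaluation": "done",      # comprehensive scoring is the final phase
}

def advance_phase(state: dict) -> dict:
    """Deterministically advance the session phase. No LLM call involved."""
    next_phase = PHASE_TRANSITIONS.get(state["phase"], "done")
    return {**state, "phase": next_phase}
```

Because this is a plain dictionary lookup, it is trivially testable and its behavior never drifts between runs.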
2. Agent Isolation
When agents serve different stakeholders or trust levels, isolation is non-negotiable. We implemented a 5-layer isolation strategy:
- Network layer — Separate API endpoints for candidate-facing and internal agents
- API filtering — Strip sensitive fields before data crosses trust boundaries
- Context isolation — Separate LangGraph threads (deterministic UUIDv5 thread IDs) so agents never share conversation history
- Tool access control — Non-overlapping tool sets per agent; the coding agent cannot access evaluation tools
- Audit logging — Immutable logs of every cross-agent communication for compliance
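The API-filtering layer is the simplest to illustrate: a pure function that strips sensitive fields before session state crosses the trust boundary to a candidate-facing agent. The field names below are hypothetical stand-ins.

```python
# Illustrative sketch of the API-filtering layer. Field names are hypothetical;
# the point is that filtering happens before data crosses the trust boundary.
SENSITIVE_FIELDS = {"candidate_theta", "evaluation_notes", "difficulty_plan"}

def filter_for_candidate(state: dict) -> dict:
    """Return a copy of session state that is safe to expose to the
    candidate-facing agent."""
    return {k: v for k, v in state.items() if k not in SENSITIVE_FIELDS}
```

An allowlist (enumerating what may cross the boundary) is even safer than this denylist, since new fields are hidden by default.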
3. Middleware Pipeline
Cross-cutting concerns — caching, model selection, state extraction, checkpointing — should not live inside individual agents. We designed a composable middleware stack with 15 layers that intercept requests before and after model execution:
- Before middleware: Prompt caching setup, model selection (tiered by task complexity), turn guidance injection
- After middleware: State extraction from model output, checkpointing to PostgreSQL, persistence of evaluation artifacts
This keeps agents focused on their core logic while middleware handles the plumbing.
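The shape of such a stack can be sketched in plain Python as nested before/after wrappers around a handler; the production pipeline was framework-specific, and the example layers here are illustrative.

```python
from typing import Callable

# Minimal sketch of a composable before/after middleware stack. The layer
# implementations below are illustrative placeholders, not production code.
Handler = Callable[[dict], dict]

def with_middleware(handler: Handler,
                    before: list[Handler],
                    after: list[Handler]) -> Handler:
    def wrapped(state: dict) -> dict:
        for mw in before:       # e.g. cache setup, model selection
            state = mw(state)
        state = handler(state)  # the agent's core model call
        for mw in after:        # e.g. state extraction, checkpointing
            state = mw(state)
        return state
    return wrapped

# Example layers (hypothetical)
select_model = lambda s: {**s, "model": "fast" if s.get("simple") else "deep"}
record_audit = lambda s: {**s, "audited": True}

agent = with_middleware(lambda s: {**s, "output": "done"},
                        before=[select_model], after=[record_audit])
```

The agent body stays a single function; every cross-cutting concern is a separate, individually testable layer.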
4. Adaptive State with Item Response Theory
For dynamic difficulty adjustment, we implemented a psychometric algorithm, the one-parameter (Rasch) IRT model, as a state machine agent. After each question, the system updates the candidate ability estimate (theta) using P(correct) = 1 / (1 + e^(−(θ − b))), where θ is the ability estimate and b is the question difficulty. The estimate converges to an accurate value within 5–10 questions across a difficulty scale of 1–10.
Because this is pure math — no LLM involved — it runs at zero token cost and deterministic latency.
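A minimal sketch of the math: compute the Rasch probability, then nudge theta toward the observed outcome by the prediction error. The gradient-style update and its learning rate are illustrative choices, not the production estimator.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """P(correct) = 1 / (1 + e^-(theta - b)) for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def update_theta(theta: float, b: float, correct: bool, lr: float = 0.5) -> float:
    """Illustrative online update: move theta by lr * (outcome - prediction).
    A correct answer on a hard question moves theta up a lot; a correct
    answer on an easy one barely moves it."""
    outcome = 1.0 if correct else 0.0
    return theta + lr * (outcome - rasch_probability(theta, b))
```

When θ equals the question difficulty, the predicted probability is exactly 0.5, so each observed outcome shifts the estimate by half the learning rate.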
Production Considerations
Latency: Targeting Sub-2-Second Responses
Multi-agent systems are latency traps. Every hand-off, every LLM call, every tool invocation adds up. We hit our <2s p99 latency target through:
- Parallel tool calls — LangGraph supports batching tool invocations within a single agent step
- State machine agents — The Interview Agent and Supervisor run without LLM calls, eliminating their latency contribution
- Model tiering — Claude Haiku for fast progression checks (20–40s budget), Claude Sonnet for deep evaluation (3–5 min budget)
- Streaming — Token-level streaming to the frontend so users see responses forming immediately
Cold start latency dropped from 8–12s to 2–3s (a 70% improvement) through Cloud Run configuration tuning and connection pooling.
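The relevant Cloud Run knobs can be sketched as follows; the service name and region are hypothetical, and the exact values depend on your traffic profile.

```shell
# Illustrative Cloud Run tuning for cold starts (service/region hypothetical).
# Keeping a warm minimum instance avoids the slow cold boot entirely.
gcloud run services update assessment-agents \
  --region us-central1 \
  --min-instances 1 \
  --concurrency 80
```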
Cost Management: 40% Reduction via Prompt Caching
LLM costs scale with conversation length. In long technical interviews, context windows fill up fast. We implemented a three-tier prompt caching strategy using Anthropic's cache control:
- Tier 1: System prompt (~15K tokens) — 100% cache hit rate
- Tier 2: Tool definitions (~5K tokens) — 100% cache hit rate
- Tier 3: Message context (~2K tokens) — 100% cache hit rate for recent turns
Result: token costs per session dropped from $2.50 to $1.50 — a 40% reduction. At scale, this is the difference between a viable product and a money pit.
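The three tiers map onto `cache_control` markers in the request payload. The sketch below only constructs the payload (no request is sent); the model id and prompt contents are placeholders.

```python
# Sketch of the three-tier cache layout as an Anthropic Messages API payload.
# Nothing is sent here; this only shows where cache_control markers go.
CACHE = {"type": "ephemeral"}

def build_request(system_prompt: str, tools: list, messages: list) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        # Tier 1: cache the large, static system prompt
        "system": [{"type": "text", "text": system_prompt,
                    "cache_control": CACHE}],
        # Tier 2: cache tool definitions by marking the last tool
        "tools": tools[:-1] + [{**tools[-1], "cache_control": CACHE}],
        # Tier 3: message context; recent turns can be marked the same way
        "messages": messages,
    }
```

Ordering matters: cached prefixes must be stable, so the most static content (system prompt, tools) goes first and the volatile conversation tail goes last.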
Error Handling and Observability
Multi-agent systems fail in ways single agents do not. An agent can hang, produce malformed state, or enter an infinite hand-off loop. Our approach:
- LangSmith tracing — Every agent invocation, tool call, and state transition is traced end-to-end
- Sentry integration — Exception tracking with agent context attached
- Checkpointing — LangGraph's native PostgreSQL checkpointer means sessions survive crashes and can be resumed
- Deterministic thread IDs — UUIDv5 derived from session identifiers, making debugging reproducible
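The deterministic-thread-ID trick is one line of standard library code: derive a UUIDv5 from the session and agent identifiers, so the same session always maps to the same thread and a production trace can be replayed locally. The namespace choice and key format below are illustrative.

```python
import uuid

def thread_id(session_id: str, agent_name: str) -> str:
    """Deterministic UUIDv5 thread ID: same inputs always yield the same ID.
    Namespace and key format are illustrative choices."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{session_id}:{agent_name}"))
```

Including the agent name in the key keeps each agent on its own thread, which is also what enforces the context-isolation layer described earlier.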
Results and Lessons Learned
After 12 weeks of development and deployment, the system achieved:
- 8 specialized agents running in production with clear boundaries and responsibilities
- 100+ concurrent sessions supported without degradation
- <2s p99 response latency for candidate-facing interactions
- 40% cost reduction through three-tier prompt caching
- $1.50 average cost per assessment session
- 70% cold start improvement (from 8–12s down to 2–3s)
What We Would Do Differently
- Start with fewer agents. Begin with 2–3 and split only when you have clear evidence that an agent is doing too much. Premature decomposition creates coordination overhead.
- Invest in observability early. LangSmith tracing should be wired in from day one, not retrofitted.
- Use state machines aggressively. Every agent that does not need an LLM should be a state machine. They are faster, cheaper, and more predictable.
- Design for isolation from the start. Retrofitting trust boundaries between agents is significantly harder than building them in from the beginning.
Getting Started
If you are building your first multi-agent system with LangGraph, here is a practical sequence:
1. Map your workflow — Identify the distinct phases, audiences, and trust levels in your application.
2. Define your state schema — Use TypedDict to create a shared state contract. This is the backbone of your system.
3. Build the supervisor first — Start with a simple conditional router.
4. Implement one worker agent — Get a single agent working end-to-end with checkpointing and streaming before adding more.
5. Add isolation layers — If agents serve different stakeholders, wire in context isolation and tool access control immediately.
6. Set up observability — Integrate LangSmith tracing before you add your second agent.
7. Optimize costs last — Prompt caching and model tiering are powerful but only matter once the system works correctly.
Multi-agent systems are not inherently more complex than single agents — they are differently complex. The patterns above, drawn from a real production deployment, should give you a concrete starting point for your own implementation.