Multi-Agent System Architecture: A Practical Guide for Production

April 1, 2026
11 min read
Multi-Agent Systems · AI Architecture · LangGraph · Production AI · Agent Design

When You Need Multi-Agent Systems

Not every AI application needs multiple agents. A single well-designed agent with good tools can handle surprisingly complex tasks. But multi-agent systems become necessary when your application has:

  • Conflicting concerns — one agent must be helpful (e.g., a coding assistant) while another must evaluate impartially (e.g., a performance assessor). These cannot share context without bias.
  • Different latency requirements — real-time user interaction (sub-second) alongside background analysis (minutes). A single agent cannot serve both.
  • Different trust levels — some agents interact with users, others access internal data. Isolation boundaries matter for security and correctness.
  • Specialized expertise — complex workflows where each step requires different tools, prompts, and model configurations.

If none of these apply, stick with a single agent. Complexity is not free.

Architecture Patterns

Pattern 1: Orchestrator-Worker

A supervisor agent routes work to specialized worker agents. The orchestrator manages state, hand-offs, and error recovery. Workers focus on their specific task.

This is our most-used pattern. It maps naturally to LangGraph's StateGraph where the orchestrator is the entry node and workers are subgraphs with conditional routing between them.

Best for: Sequential workflows with clear stages (intake → process → evaluate → output).
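The core of the orchestrator-worker loop can be sketched without any framework. The worker names and the `phase` field below are illustrative assumptions, not part of LangGraph or any real API:

```python
# Framework-agnostic sketch of orchestrator-worker routing.
# Workers are plain functions over a shared state dict; the orchestrator
# owns the loop, reads the next phase from state, and hands off accordingly.

def intake_worker(state: dict) -> dict:
    # Normalizes the raw request and advances the workflow phase.
    return {**state, "phase": "process", "request": state["raw_input"].strip()}

def process_worker(state: dict) -> dict:
    # Stand-in for real work; marks the workflow complete.
    return {**state, "phase": "done", "output": state["request"].upper()}

WORKERS = {"intake": intake_worker, "process": process_worker}

def orchestrate(state: dict) -> dict:
    # The orchestrator picks the next worker from state and stops
    # when a worker marks the workflow done.
    state = {**state, "phase": "intake"}
    while state["phase"] != "done":
        state = WORKERS[state["phase"]](state)
    return state
```

In LangGraph the same shape falls out of conditional edges on a StateGraph; the plain-Python version just makes the routing decision explicit.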

Pattern 2: Peer-to-Peer

Agents communicate directly without a central orchestrator. Each agent decides when to hand off work and to whom. This is more flexible but harder to debug and reason about.

Best for: Collaborative tasks where multiple agents contribute to a shared output (e.g., research + writing + editing).

Pattern 3: Hierarchical

Multi-level orchestration where a top-level supervisor delegates to mid-level supervisors, who delegate to workers. This manages complexity for large systems but adds latency.

Best for: Large-scale systems with 10+ agents that naturally group into subsystems.

Drawing Agent Boundaries

The most important design decision in a multi-agent system is where to draw the boundaries between agents. Get this wrong and you end up with agents that are either too chatty (constant hand-offs, high latency) or too monolithic (back to the single-agent problem).

Rules of Thumb

  • Separate by audience — agents that talk to users vs. agents that run internally should be separate
  • Separate by timing — real-time agents vs. batch processing agents should be separate
  • Separate by trust — agents that need different security contexts or data access should be separate
  • Do NOT separate by "step" — breaking a sequential workflow into one-agent-per-step creates unnecessary hand-off overhead. Only separate when there is a genuine boundary (audience, timing, or trust change).

State Management

State is the hardest part of multi-agent systems. Every agent needs access to some shared state, but unrestricted access creates coupling and bugs.

State Design Principles

  • Typed state — define state schemas using TypedDict or Pydantic models. No untyped dictionaries.
  • Scoped access — each agent reads and writes only the state fields it needs. Use reducer functions to control how state updates merge.
  • Immutable history — append to state rather than overwriting. This makes debugging and replay possible.
  • Checkpointing — persist state at every agent hand-off. If an agent fails, you can resume from the last checkpoint instead of restarting the entire workflow.

LangGraph State Example

In LangGraph, state is defined as a TypedDict with optional reducer annotations. Each node (agent) in the graph receives the full state but should only modify its designated fields:

from typing import TypedDict, Annotated
from operator import add
from langgraph.graph import StateGraph

class WorkflowState(TypedDict):
    messages: Annotated[list, add]  # append-only
    current_phase: str
    evaluation_score: float | None
    final_output: str | None

Error Handling and Recovery

In production, agents fail. LLM calls time out, tool calls return unexpected results, and agents produce invalid outputs. Design for failure from the start.

Essential Error Handling Patterns

  • Retry with backoff — transient LLM failures should be retried (2-3 attempts with exponential backoff)
  • Fallback models — if the primary model fails, fall back to an alternative (e.g., GPT-4 → Claude Sonnet)
  • Output validation — validate agent outputs against expected schemas before passing to the next agent. Invalid outputs trigger retry or error state.
  • Circuit breakers — if an agent fails repeatedly, stop retrying and escalate (log, alert, or fall back to a manual process)
  • Timeout boundaries — set timeouts on every agent execution. A stuck agent should not block the entire workflow.
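Retry with backoff is the pattern you reach for first. A minimal sketch, assuming a placeholder `TransientError` and a hypothetical `fn` wrapping the LLM call (not a real client API):

```python
# Retry a flaky call 2-3 times with exponential backoff before giving up.
import time

class TransientError(Exception):
    """Placeholder for a provider timeout or rate-limit error."""

def with_retry(fn, attempts=3, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # Out of attempts: escalate instead of looping forever.
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt))
```

In production you would pair this with output validation (re-raise on schema failure so the retry loop covers invalid outputs too) and a circuit breaker in front of the whole thing.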

Observability

Multi-agent systems are hard to debug without good observability. You need to trace the full execution path: which agents ran, what state they received, what they produced, and how long they took.

What to Track

  • Agent execution order and timing
  • State at each hand-off point
  • LLM calls: model, tokens, latency, cost
  • Tool calls: inputs, outputs, errors
  • Final vs. intermediate outputs

LangSmith is our go-to for multi-agent observability. It traces the full graph execution with nested spans for each agent, making it straightforward to identify where things went wrong.
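Even without a tracing product, the minimum viable version is a wrapper that records execution order, timing, and the state each agent received. This sketch is a stand-in for what LangSmith captures automatically; the names are illustrative:

```python
# Wrap each agent to append a trace record: which agent ran, what state
# keys it received, and how long it took.
import time

TRACE: list[dict] = []

def traced(name, agent):
    def wrapper(state: dict) -> dict:
        start = time.perf_counter()
        result = agent(state)
        TRACE.append({
            "agent": name,
            "input_keys": sorted(state),
            "duration_s": time.perf_counter() - start,
        })
        return result
    return wrapper
```

Recording only state keys (not values) keeps traces cheap; switch to full state snapshots when debugging a specific failure.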

Cost Management

Multi-agent systems multiply LLM costs because multiple agents make multiple calls per user request. Cost management is not optional.

  • Use the cheapest viable model for each agent — a classification agent does not need GPT-4
  • Cache agent outputs — if an agent produces the same output for similar inputs, cache it
  • Limit agent iterations — set maximum iteration counts to prevent runaway agents
  • Track cost per workflow execution — know what each user request costs end-to-end
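Tracking cost per agent per workflow can be as simple as an accumulator keyed by agent name. The model names and per-token prices below are made-up placeholders; substitute your provider's real rates:

```python
# Accumulate LLM cost per agent so one expensive agent in a pipeline of
# cheap ones shows up in the breakdown, not just the aggregate.
PRICE_PER_1K_TOKENS = {"small-model": 0.0005, "large-model": 0.01}  # placeholder rates

class CostTracker:
    def __init__(self):
        self.by_agent: dict[str, float] = {}

    def record(self, agent: str, model: str, tokens: int) -> None:
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
        self.by_agent[agent] = self.by_agent.get(agent, 0.0) + cost

    def total(self) -> float:
        return sum(self.by_agent.values())
```

Emit `by_agent` alongside the total at the end of every workflow execution; the per-agent breakdown is what makes the "one expensive agent" problem visible.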

Lessons From Production

After deploying 8+ production agents across multiple client engagements, here are the lessons that were not obvious upfront:

  • Start with one agent. Add agents only when you hit a genuine boundary (audience, timing, trust). Premature decomposition creates unnecessary complexity.
  • State design is architecture. If you get the state schema right, the agent boundaries usually follow naturally. If you are struggling with boundaries, revisit your state design.
  • Test agents in isolation first. Each agent should be testable independently with mocked state inputs. Integration testing comes after unit testing.
  • Human-in-the-loop is not optional. Production multi-agent systems need escape hatches where humans can intervene, approve, or override. Build these from the start.
  • Monitor cost per agent, not just total cost. One expensive agent in a pipeline of cheap ones is easy to miss in aggregate metrics.

Multi-agent systems are powerful but not simple. The architecture decisions you make early — boundaries, state, error handling — determine whether the system is maintainable at scale or becomes an untestable tangle. Invest the time in getting the foundations right.
