Migrating LangChain to Production: Lessons from the Field

February 20, 2026
9 min read
LangChain · Migration · Production AI · LLM Optimization

Why Migrate? Recognizing the Signs

Not every LLM stack needs a rewrite. But certain patterns reliably signal that your current architecture is holding you back. We encountered all of them during a recent engagement with an AI-powered coding assistant serving thousands of developers.

Single-Model Lock-in

The platform was hardwired to Claude Sonnet with no runtime flexibility. Every request, regardless of complexity, hit the same expensive model. The inability to route requests to cost-effective models for straightforward tasks meant token costs grew linearly with user growth.

Poor Cache Utilization

Cache hit rates hovered around 10%. System prompts, tool definitions, and conversation context were being re-sent and re-processed on nearly every call. At scale, this translates directly into burning money.

Build and Deployment Friction

The existing system ran on TypeScript with LangGraph v0.4.9 and a custom framework. TypeScript compilation pushed deploy times to 5-10 minutes. The tightly coupled sandbox provider meant that changing infrastructure required touching code across multiple layers.

Concurrency Bottlenecks

Serial request handling limited the platform to roughly 2 requests per second across 10 concurrent users. As the user base grew, this became the primary scaling constraint.

If any of these sound familiar, a LangChain migration is likely worth the investment.

Planning the Migration

Assessment: Know What You Have

Before writing a line of migration code, inventory your current architecture. Document every integration point, state dependency, and provider coupling. We catalogued five core pain points: single-model lock-in, poor cache utilization, build complexity, limited concurrency, and tight coupling to the sandbox provider.

Picking the Right LangChain Version

We chose LangChain v1 (Python) and LangGraph for agent orchestration. The decision to target v1 was deliberate — native async support and direct SDK integration gave us capabilities that the TypeScript ecosystem could not match. Target the latest stable release and ensure your team has access to LangSmith for tracing and debugging.

Migration Strategy: Incremental, Not Big-Bang

We ran an incremental migration rather than a full cutover. Each component was migrated, tested in isolation via the middleware pattern, and then integrated. The middleware architecture made it possible to run old and new paths in parallel during the transition, validating output equivalence before switching traffic.

Architecture Decisions That Mattered

Python Over TypeScript

This was the first and most consequential decision. LangChain's Python SDK is the primary implementation: new v1 features, native async, and direct SDK integration land there first. The TypeScript port lags behind in features and community support. If your team is split, invest in Python proficiency; the ecosystem payoff is substantial.

Middleware-First Design

Rather than embedding logic in custom LangGraph nodes, we built a composable middleware stack that intercepts requests before and after model execution:

from langchain.middleware import MiddlewareStack

stack = MiddlewareStack([
    InputProcessor(),       # Normalize and validate inputs
    TokenOptimizer(),       # Reduce token count before model call
    ModelRouter(),          # Route to appropriate model by complexity
    CacheMiddleware(),      # Check and populate cache
])

agent = create_agent(
    middleware=stack,
    models={
        "complex": "claude-sonnet-4-5",
        "simple": "gemini-3-flash",
        "fallback": "gpt-4.1-mini"
    }
)

This separation of concerns made each component independently testable. When the team needed to swap model providers mid-project, changes were confined to a single layer.

Three-Tier Caching Strategy

The caching architecture was the single biggest contributor to cost reduction. We defined three fixed cache breakpoints:

  • Tier 1 — System Prompt (~15K tokens): 100% cache hit rate. Static across all requests.
  • Tier 2 — Tool Definitions (~5K tokens): 100% cache hit rate. Tool schemas rarely change.
  • Tier 3 — First 2 Messages (~2K tokens): 100% cache hit rate. Initial conversation context is fixed per session.

The key insight: middleware appends dynamic content without cache_control markers, so the static prefix cache survives every request while per-request content still flows through. This pushed overall cache hit rates from ~10% to over 90%.
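
The breakpoint scheme can be sketched as a payload builder. This is an illustrative sketch, not the production middleware: the `cache_control: {"type": "ephemeral"}` convention follows the Anthropic Messages API, while `SYSTEM_PROMPT`, `TOOL_DEFS`, and `build_payload` are hypothetical stand-ins for the real (much larger) static content.

```python
# Sketch: fixed cache breakpoints on a static prefix, dynamic tail uncached.
SYSTEM_PROMPT = "You are a coding assistant..."  # ~15K tokens in practice
TOOL_DEFS = [
    {"name": "write_file", "description": "Write a file", "input_schema": {"type": "object"}},
    {"name": "read_file", "description": "Read a file", "input_schema": {"type": "object"}},
]

def build_payload(history: list[dict], new_turn: dict) -> dict:
    """Assemble a request whose static prefix is always cacheable."""
    # Tier 1: static system prompt, marked as a cache breakpoint.
    system = [{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }]
    # Tier 2: tool definitions, breakpoint on the last tool.
    tools = [dict(t) for t in TOOL_DEFS]
    tools[-1]["cache_control"] = {"type": "ephemeral"}
    # Tier 3: first two messages of the session, breakpoint on the second.
    messages = [dict(m) for m in history]
    if len(messages) >= 2:
        m = dict(messages[1])
        m["content"] = [{
            "type": "text",
            "text": m["content"],
            "cache_control": {"type": "ephemeral"},
        }]
        messages[1] = m
    # Dynamic tail: appended WITHOUT cache_control, so the static
    # prefix cache stays valid as the conversation grows.
    messages.append(new_turn)
    return {"system": system, "tools": tools, "messages": messages}
```

Because the breakpoints never move, every request after the first reuses the same cached prefix; only the dynamic tail is processed fresh.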

Multi-Model Routing

The new architecture supports runtime model selection across Claude Sonnet 4.5 (complex reasoning), Gemini 3 Flash (simple tasks), and GPT-4.1 mini (fallback). A routing middleware inspects task complexity and selects the appropriate model. This alone drove 30-40% of the token cost savings.
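
A routing middleware of this shape can be sketched in a few lines. The model tiers mirror the ones above, but the heuristics here (prompt length, tool use, a keyword check) are illustrative assumptions, not the production scoring logic.

```python
# Sketch: complexity-based model routing. MODELS mirrors the tiers in
# the text; the route() heuristics are simplified stand-ins.
MODELS = {
    "complex": "claude-sonnet-4-5",
    "simple": "gemini-3-flash",
    "fallback": "gpt-4.1-mini",
}

def route(request: dict) -> str:
    """Pick a model from cheap signals available before the call."""
    text = request.get("prompt", "")
    uses_tools = bool(request.get("tools"))
    # Heuristics: tool use, long prompts, or refactoring tasks
    # go to the expensive reasoning tier; everything else stays cheap.
    if uses_tools or len(text) > 2000 or "refactor" in text.lower():
        return MODELS["complex"]
    return MODELS["simple"]
```

The fallback tier would be selected by error-handling middleware when the primary call fails, which is omitted here for brevity.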

Provider Abstraction

Tools like write_file(), read_file(), and execute_cmd() are decoupled from infrastructure through a clean SandboxProvider protocol interface. The current deployment uses Modal for code execution, but Docker and E2B can be swapped in without changing any tool code. This abstraction paid dividends mid-project when provider requirements shifted.
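
The protocol boundary can be sketched with `typing.Protocol`. The method signatures below are assumptions for illustration; `InMemorySandbox` is a toy provider standing in for the real Modal, Docker, or E2B implementations.

```python
from typing import Protocol

class SandboxProvider(Protocol):
    """Interface the tools depend on; concrete providers satisfy it."""
    def write_file(self, path: str, content: str) -> None: ...
    def read_file(self, path: str) -> str: ...
    def execute_cmd(self, cmd: str) -> str: ...

class InMemorySandbox:
    """Toy provider for tests; a Modal/Docker/E2B adapter would
    implement the same three methods against real infrastructure."""
    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def write_file(self, path: str, content: str) -> None:
        self.files[path] = content

    def read_file(self, path: str) -> str:
        return self.files[path]

    def execute_cmd(self, cmd: str) -> str:
        return f"ran: {cmd}"

def make_tools(sandbox: SandboxProvider) -> dict:
    """Tools close over the protocol, never a concrete provider."""
    return {
        "write_file": sandbox.write_file,
        "read_file": sandbox.read_file,
        "execute_cmd": sandbox.execute_cmd,
    }
```

Swapping infrastructure then means writing one new adapter class; no tool code changes.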

Common Pitfalls in LangChain Migrations

State Management: The Silent Killer

The most subtle bugs came from state contamination across conversation turns. State from one user message could leak into subsequent runs, causing incorrect workflow tracking. We solved this with per-run state isolation: each user message gets its own isolated run context, while shared conversation context persists across the session.
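
The isolation model can be sketched with two state objects: one scoped to the session, one created fresh per run. The class and field names are hypothetical; the point is the structural split.

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    """Shared conversation context; persists across the session."""
    messages: list = field(default_factory=list)

@dataclass
class RunState:
    """Created fresh for every user message, discarded afterwards,
    so workflow tracking from one turn can never leak into the next."""
    steps: list = field(default_factory=list)
    scratch: dict = field(default_factory=dict)

def handle_turn(session: SessionState, user_msg: str) -> RunState:
    run = RunState()  # nothing carries over from the previous run
    run.steps.append(f"received: {user_msg}")
    session.messages.append({"role": "user", "content": user_msg})
    return run
```

The failure mode this prevents is exactly the one described above: a mutable default or module-level dict reused across turns, silently accumulating state.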

Prompt Compatibility

Prompts that worked in TypeScript did not always produce identical results in Python. Token boundaries differ between SDKs, and subtle formatting differences can change model behavior. Budget time for prompt regression testing with automated comparison tooling.
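
A minimal comparison harness might normalize formatting noise before diffing old and new outputs. This sketch uses `difflib` similarity with an assumed threshold; production tooling would also compare structured fields and tool calls.

```python
import difflib

def normalize(text: str) -> str:
    """Strip whitespace and casing noise before comparing outputs."""
    return " ".join(text.split()).lower()

def outputs_match(old: str, new: str, threshold: float = 0.9) -> bool:
    """True if the migrated pipeline's output is close enough to the
    baseline; threshold is an illustrative assumption."""
    ratio = difflib.SequenceMatcher(None, normalize(old), normalize(new)).ratio()
    return ratio >= threshold
```

Run this over a dataset of recorded (prompt, baseline output) pairs and flag every pair below the threshold for human review.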

Testing Gaps

The middleware pattern helped with testing. Because each component has a clean input-output contract, we could write deterministic unit tests for every layer except the model call itself. For the model layer, we used LangSmith tracing to build regression datasets.

Underestimating Cache Complexity

Naive caching approaches that cache entire conversations break as soon as the conversation evolves. Fixed breakpoints with explicit cache boundaries are more predictable than dynamic cache invalidation strategies.

Performance Optimization: The Numbers

The combined effect of the architectural changes:

  • Cache Hit Rate: ~10% to ~90% (9x improvement)
  • Warm Request Latency: 1-2 seconds down to under 100ms (10-20x faster)
  • Single User Throughput: 0.5 req/s to 100+ req/s (200x improvement)
  • Concurrent Users: 2 req/s across 10 users to 50+ req/s across 100 users (25x improvement)
  • Token Costs: 50-70% reduction through caching and multi-model routing
  • Deployment Time: 5-10 minutes down to under 2 minutes (5x faster)

The throughput gains came from three sources: Python's native async eliminating serial request bottlenecks, the caching strategy avoiding redundant model calls, and multi-model routing sending simple tasks to faster, cheaper models.
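
The async contribution is easy to see in miniature: with `asyncio.gather`, n I/O-bound model calls take roughly one call's latency rather than n times it. `handle_request` is a stand-in for a real agent call.

```python
import asyncio

async def handle_request(i: int) -> str:
    """Stand-in for one agent invocation; real work awaits the model."""
    await asyncio.sleep(0.01)  # simulated I/O-bound model/cache call
    return f"done-{i}"

async def serve(n: int) -> list[str]:
    # All n requests run concurrently; total wall time is ~one call's
    # latency, not n sequential calls as in the old serial handler.
    return await asyncio.gather(*(handle_request(i) for i in range(n)))

results = asyncio.run(serve(10))
```

The same structure applies at the framework level: LangChain v1's async interfaces let the event loop interleave many in-flight requests on one worker.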

Results and Timeline

Three Months, Start to Production

The entire migration was completed in three months with no downtime. The work was structured in three phases: assessment and architecture design (weeks 1-3), core migration with middleware implementation (weeks 4-9), and optimization, testing, and cutover (weeks 10-12).

Recommendations for Your Migration

  • Start with state design before features. Get your state isolation model right first. Everything else depends on it.
  • Instrument everything from day one. Track token usage, cache hit rates, and latency per-request. You cannot optimize what you cannot measure.
  • Abstract providers early, even if you only have one. The cost of the abstraction is low; the cost of refactoring a tightly coupled system is high.
  • Test middleware components independently. Deterministic unit tests on everything except the model call. Use LangSmith for the rest.

LangChain migrations are not trivial, but the payoff is real. The architectural patterns here — middleware-first design, tiered caching, and provider abstraction — provide a proven path to production-grade performance at significantly lower cost.


Want to discuss this?

We love exploring these ideas with engineering teams. Let's talk.