LESSON
Day 320: Deployment & Serving - Production LLM Infrastructure
The core idea: Shipping an LLM to production is not "put model behind API." It is the system design problem of turning a costly, stateful, latency-sensitive model into a service that can survive real traffic, real failures, real budgets, and real policy constraints without drifting away from product quality.
Today's "Aha!" Moment
The insight: A production LLM stack is where all the earlier lessons stop being separate topics and start operating as one system:
- training and post-training determine behavior
- evaluation decides what is acceptable
- quantization and inference optimization determine economics
- deployment architecture determines whether the whole thing survives traffic
Why this matters: Many teams optimize one layer in isolation and then get surprised somewhere else:
- the model is good, but too expensive
- the runtime is fast, but p95 latency is unstable
- the safety layer is strong, but user utility collapses
- autoscaling works, but long-context traffic starves everyone else
Concrete anchor: A chat product may need low time to first token (TTFT) for interactive feel, a background batch pipeline may want maximum throughput, and an agent endpoint may need strong tenant isolation and tool gating. "One serving setup" rarely fits all three.
Keep this mental hook in view: Production LLM serving is workload-aware infrastructure, not just model hosting.
Why This Matters
20/15.md showed that inference speed depends on the runtime stack: prefill, decode, batching, KV cache, kernels, and scheduler design.
This final lesson closes the month by widening the lens one last time:
- once inference is efficient, you still need to decide how to deploy it safely and sustainably
That means answering questions like:
- how do we split traffic across models or model sizes?
- how do we isolate tenants or workloads with very different shapes?
- when do we autoscale replicas versus batch more aggressively?
- how do we roll out a new model without silently breaking quality or cost?
This is where model engineering becomes platform engineering.
Learning Objectives
By the end of this session, you should be able to:
- Explain the main architectural responsibilities of a production LLM serving stack.
- Describe how routing, autoscaling, caching, isolation, and observability interact in deployment decisions.
- Evaluate deployment options against real product constraints: latency, throughput, cost, safety, and release risk.
Core Concepts Explained
Concept 1: Production LLM Serving Starts with Workload Segmentation, Not with One Universal Endpoint
For example, a company serves three workloads from the same base model family:
- interactive chat with low TTFT expectations
- background summarization jobs with large batch windows
- agent workflows with tool calls and long contexts
If all of them share exactly the same serving pool and scheduling policy, one workload will usually damage another.
At a high level, different LLM workloads stress the system differently. The right first question is:
- what kinds of traffic are we actually serving?
not:
- which serving framework should we pick first?
Mechanically: Workload segmentation often happens along lines such as:
- short interactive vs long-form generation
- online user-facing vs offline batch
- tool-using vs plain text generation
- high-priority enterprise tenants vs background low-priority traffic
- small model routing vs large model fallback
Once that segmentation is explicit (see the classification sketch after this list), you can make better choices about:
- replica pools
- schedulers
- queue separation
- admission control
- cost allocation
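To make segmentation concrete, here is a minimal sketch of how a gateway might map a request onto an explicit serving class and pool. Every name here (ServingClass, Request, the token threshold, the pool labels) is an illustrative assumption, not any particular framework's API:

```python
from dataclasses import dataclass
from enum import Enum

class ServingClass(Enum):
    INTERACTIVE = "interactive"  # low-TTFT chat pool
    BATCH = "batch"              # offline, throughput-oriented pool
    AGENT = "agent"              # tool-calling, long-context pool

@dataclass
class Request:
    prompt_tokens: int
    uses_tools: bool
    is_offline_job: bool

# Illustrative threshold; real values come from profiling your traffic.
LONG_CONTEXT_TOKENS = 8_000

def classify(req: Request) -> ServingClass:
    """Map a request onto an explicit serving class (and thus a pool and queue)."""
    if req.is_offline_job:
        return ServingClass.BATCH
    if req.uses_tools or req.prompt_tokens > LONG_CONTEXT_TOKENS:
        # Long-context and tool traffic gets its own queue so it cannot
        # starve short interactive requests.
        return ServingClass.AGENT
    return ServingClass.INTERACTIVE

# Each class maps to its own replica pool with its own scheduling policy.
POOL_FOR_CLASS = {
    ServingClass.INTERACTIVE: "pool-low-latency",
    ServingClass.BATCH: "pool-high-throughput",
    ServingClass.AGENT: "pool-isolated-long-context",
}
```

Once routes are explicit like this, per-class queues, admission control, and cost allocation all have a concrete key to attach to.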
In practice:
- one cluster can host multiple serving classes, but they usually need different policies
- long-context requests often deserve separate queues or separate pools
- model cascades become easier to manage when routes are explicit
The trade-off is clear: More workload specialization improves control and predictability, but increases operational complexity and the number of moving parts.
A useful mental model is: A production LLM platform is closer to an airport than to a single road. Different planes, routes, and priorities need different gates and traffic rules.
Use this lens when:
- Best fit: planning the first serious serving architecture for a real product portfolio.
- Misuse pattern: putting every request through one identical path because it is simpler on paper.
Concept 2: Deployment Quality Comes from Routing, Scaling, and Isolation Working Together
For example, a team serves all traffic with one quantized 70B model. Quality is high, but cost is extreme. They add a smaller model for easy requests, route hard prompts upward, isolate enterprise tenants, and autoscale decode-heavy pools differently from prefill-heavy ones. The result is not one model improvement, but a serving-architecture improvement.
At a high level, in production, "deployment" means choosing how requests meet capacity and policy.
Mechanically: Important building blocks include:
- request routing
  - send traffic to the right model, region, tier, or queue
  - may depend on tenant, prompt size, product surface, or risk class
- autoscaling
  - scale replicas or serving pods against concurrency, queue depth, TTFT, GPU saturation, or tokens/sec (a small scaling sketch follows this list)
- isolation
  - prevent one tenant or workload from consuming disproportionate memory, cache, or scheduler time
- caching
  - prefix caching, embedding caches, result caches, or tool-response caches can reduce repeated work
- fallback and graceful degradation
  - smaller model fallback
  - delayed batch path
  - reduced context windows
  - temporary tool disablement under pressure
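As a sketch of the autoscaling bullet above: a controller that scales a pool on queue depth and TTFT rather than on CPU load alone. The metric names and thresholds are assumptions for illustration; real values come from load testing and your metrics system:

```python
from dataclasses import dataclass

@dataclass
class PoolMetrics:
    queue_depth: int          # requests waiting for admission
    ttft_p95_ms: float        # 95th-percentile time to first token
    gpu_utilization: float    # 0.0 (idle) .. 1.0 (saturated)

# Hypothetical targets; tune per pool from load tests.
MAX_QUEUE_DEPTH = 32
TTFT_P95_TARGET_MS = 800.0
SCALE_DOWN_UTILIZATION = 0.35

def desired_replica_delta(m: PoolMetrics) -> int:
    """Return a +1 / 0 / -1 replica change from LLM-specific signals.

    CPU load is deliberately absent: it says little about token-level latency.
    """
    if m.queue_depth > MAX_QUEUE_DEPTH or m.ttft_p95_ms > TTFT_P95_TARGET_MS:
        return +1   # requests are waiting: add capacity
    if m.gpu_utilization < SCALE_DOWN_UTILIZATION and m.queue_depth == 0:
        return -1   # pool is idle: shed cost
    return 0
```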
These mechanisms work best together. Routing without isolation can still collapse under noisy neighbors. Autoscaling without queue control can just scale chaos faster.
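The sketch below illustrates that interplay: tiered routing plus graceful degradation driven by one pressure signal. The difficulty flag, pressure value, and model names are hypothetical inputs, standing in for whatever your scheduler and classifier actually expose:

```python
def route(prompt_tokens: int, is_hard: bool, pool_pressure: float) -> dict:
    """Pick a model tier, then degrade gracefully instead of failing.

    pool_pressure: 0.0 (idle) .. 1.0 (saturated), reported by the scheduler.
    is_hard: output of a difficulty heuristic or lightweight classifier.
    """
    # Tiered serving: cheap model first, large model only when needed.
    model = "llm-large" if is_hard else "llm-small"

    decision = {"model": model, "max_context": 32_768, "tools_enabled": True}

    # Graceful degradation: shed optional cost before shedding requests.
    if pool_pressure > 0.8:
        decision["tools_enabled"] = False   # temporary tool disablement
        decision["max_context"] = 8_192     # reduced context window
    if pool_pressure > 0.95 and decision["model"] == "llm-large":
        decision["model"] = "llm-small"     # smaller-model fallback
    return decision
```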
In practice:
- deployment policy is as important as model quality
- tiered serving often dominates "single best model for everyone"
- graceful degradation is part of reliability, not a sign of failure
The trade-off is clear: More routing and isolation logic gives better control over cost and latency, but makes debugging and release management more complex.
A useful mental model is: A good serving stack is a traffic-control system for expensive reasoning.
Use this lens when:
- Best fit: deciding how to serve multiple products, tenants, or model sizes under one platform.
- Misuse pattern: relying on autoscaling alone to solve problems that are really routing or fairness problems.
Concept 3: Release Safety in LLM Serving Requires Observability and Controlled Rollout, Not Just Healthy Pods
For example, a new model version is deployed and all pods stay green. CPU, memory, and GPU look normal. But user satisfaction drops, refusal rate spikes, and JSON tool-call validity falls. From infrastructure health alone, the deployment looked fine.
At a high level, serving health and model behavior health are different things. Production LLM infrastructure must observe both.
Mechanically: A robust deployment pipeline usually needs:
- infrastructure telemetry
  - queue depth
  - TTFT
  - p95/p99 latency
  - GPU utilization
  - cache hit behavior
  - error rates
- behavioral telemetry
  - refusal rates
  - task success
  - structured output validity
  - policy violations
  - cost per successful task
- controlled rollout
  - canaries
  - shadow traffic
  - cohort-based routing
  - rollback thresholds tied to both infra and behavior metrics
This is what makes LLM deployment different from serving a static classifier. The service can stay technically alive while the product is behaviorally regressing.
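Here is a minimal sketch of a rollout gate that joins both sides in one decision. The metric names and thresholds are illustrative assumptions; in practice they would come from your eval suite and SLOs:

```python
from dataclasses import dataclass

@dataclass
class CanaryReport:
    # Infrastructure side
    ttft_p95_ms: float
    error_rate: float
    # Behavior side (measured on canary or shadow traffic)
    task_success_rate: float
    refusal_rate: float
    json_validity_rate: float

def should_rollback(canary: CanaryReport, baseline: CanaryReport) -> bool:
    """Gate on BOTH infra and behavior: green pods alone are not enough."""
    # Hypothetical rollback thresholds, relative to the production baseline.
    infra_regressed = (
        canary.ttft_p95_ms > 1.2 * baseline.ttft_p95_ms
        or canary.error_rate > baseline.error_rate + 0.01
    )
    behavior_regressed = (
        canary.task_success_rate < baseline.task_success_rate - 0.02
        or canary.refusal_rate > baseline.refusal_rate + 0.05
        or canary.json_validity_rate < baseline.json_validity_rate - 0.02
    )
    return infra_regressed or behavior_regressed
```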
In practice:
- model rollouts should be treated like product rollouts, not just container updates
- rollback criteria should include task and policy metrics, not just uptime
- observability needs to join systems metrics and model metrics in one release view
The trade-off is clear: Better release control reduces catastrophic regressions, but slows down rollout velocity and increases measurement overhead.
A useful mental model is: A healthy LLM deployment is one where the hardware, the queue, and the user-visible behavior are all inside acceptable bounds at the same time.
Use this lens when:
- Best fit: productionizing model updates, runtime changes, or safety-policy changes.
- Misuse pattern: declaring success because the service remained available.
Troubleshooting
Issue: "Our serving cluster is healthy, but user experience is worse."
Why it happens / is confusing: Infrastructure metrics only show whether the system is alive, not whether the model outputs remain useful, safe, or well formed.
Clarification / Fix: Add behavioral release metrics to the rollout gate: task completion, refusal profile, structured output validity, and quality regressions on shadow or sampled traffic.
Issue: "Autoscaling should solve our latency spikes."
Why it happens / is confusing: Scaling feels like the general-purpose answer to demand.
Clarification / Fix: If routing, queueing, or tenant isolation are wrong, autoscaling may simply add more expensive replicas while preserving the same contention pattern.
Issue: "One large model endpoint is simpler, so it is probably best."
Why it happens / is confusing: Operational simplicity is attractive.
Clarification / Fix: A single endpoint can be viable early, but mixed workloads often justify model cascades, queue separation, or traffic classes once cost and tail latency start to dominate.
Advanced Connections
Connection 1: Deployment & Serving <-> Inference Optimization
20/15.md focused on how to make one serving path fast.
This lesson answers the next step:
- how to run many serving paths safely in production
- how to route among them
- how to scale them
- how to release them without breaking the product
Connection 2: Deployment & Serving <-> The Whole Month
This month began with data and post-training, then moved through preference optimization, safety, evaluation, quantization, and inference.
Production serving is where all of those constraints meet:
- data and training shape behavior
- alignment shapes intent following
- safety shapes policy boundaries
- eval shapes release decisions
- quantization and runtime shape cost and latency
That is why deployment is the natural capstone for the whole block.
Resources
Optional Deepening Resources
- [PAPER] Efficient Memory Management for Large Language Model Serving with PagedAttention
  - Focus: A foundational serving-runtime idea that directly affects production architecture choices.
- [DOC] vLLM Documentation
  - Focus: Practical runtime concepts for scalable LLM serving and continuous batching.
- [DOC] KServe Documentation
  - Focus: Platform patterns for model deployment, autoscaling, and inference services on Kubernetes.
- [DOC] Ray Serve Documentation
  - Focus: Request routing, replicas, and multi-model serving patterns for production AI systems.
Key Insights
- Production LLM serving begins with workload segmentation - different request classes need different policies, queues, and serving shapes.
- Routing, scaling, and isolation are the core deployment levers - not just model choice or raw hardware count.
- A production rollout is only healthy if both infrastructure and behavior stay within bounds - uptime alone is not enough.