LESSON
Day 320: Deployment & Serving - Production LLM Infrastructure
The core idea: Shipping an LLM to production is not "put model behind API." It is the system design problem of turning a costly, stateful, latency-sensitive model into a service that can survive real traffic, real failures, real budgets, and real policy constraints without drifting away from product quality.
Today's "Aha!" Moment
The insight: A production LLM stack is where all the earlier lessons stop being separate topics and start operating as one system:
- training and post-training determine behavior
- evaluation decides what is acceptable
- quantization and inference optimization determine economics
- deployment architecture determines whether the whole thing survives traffic
Why this matters: Many teams optimize one layer in isolation and then get surprised somewhere else:
- the model is good, but too expensive
- the runtime is fast, but p95 latency is unstable
- the safety layer is strong, but user utility collapses
- autoscaling works, but long-context traffic starves everyone else
Concrete anchor: A chat product may need low time to first token (TTFT) for interactive feel, a background batch pipeline may want maximum throughput, and an agent endpoint may need strong tenant isolation and tool gating. "One serving setup" rarely fits all three.
Keep this mental hook in view: Production LLM serving is workload-aware infrastructure, not just model hosting.
Why This Matters
20/15.md showed that inference speed depends on the runtime stack: prefill, decode, batching, KV cache, kernels, and scheduler design.
This final lesson closes the month by widening the lens one last time:
- once inference is efficient, you still need to decide how to deploy it safely and sustainably
That means answering questions like:
- how do we split traffic across models or model sizes?
- how do we isolate tenants or workloads with very different shapes?
- when do we autoscale replicas versus batch more aggressively?
- how do we roll out a new model without silently breaking quality or cost?
This is where model engineering becomes platform engineering.
Learning Objectives
By the end of this session, you should be able to:
- Explain the main architectural responsibilities of a production LLM serving stack.
- Describe how routing, autoscaling, caching, isolation, and observability interact in deployment decisions.
- Evaluate deployment options against real product constraints: latency, throughput, cost, safety, and release risk.
Core Concepts Explained
Concept 1: Production LLM Serving Starts with Workload Segmentation, Not with One Universal Endpoint
For example, a company serves three workloads from the same base model family:
- interactive chat with low TTFT expectations
- background summarization jobs with large batch windows
- agent workflows with tool calls and long contexts
If all of them share exactly the same serving pool and scheduling policy, one workload will usually damage another.
At a high level, different LLM workloads stress the system differently. The right first question is:
- what kinds of traffic are we actually serving?
not:
- which serving framework should we pick first?
Mechanically: Workload segmentation often happens along lines such as:
- short interactive vs long-form generation
- online user-facing vs offline batch
- tool-using vs plain text generation
- high-priority enterprise tenants vs background low-priority traffic
- small model routing vs large model fallback
Once that segmentation is explicit (see the classification sketch after this list), you can make better choices about:
- replica pools
- schedulers
- queue separation
- admission control
- cost allocation
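To make segmentation concrete, here is a minimal sketch of how a gateway might map a request onto an explicit serving class and pool. Every name here (ServingClass, Request, the token threshold, the pool labels) is an illustrative assumption, not any particular framework's API:

```python
from dataclasses import dataclass
from enum import Enum

class ServingClass(Enum):
    INTERACTIVE = "interactive"  # low-TTFT chat pool
    BATCH = "batch"              # offline, throughput-oriented pool
    AGENT = "agent"              # tool-calling, long-context pool

@dataclass
class Request:
    prompt_tokens: int
    uses_tools: bool
    is_offline_job: bool

# Illustrative threshold; real values come from profiling your traffic.
LONG_CONTEXT_TOKENS = 8_000

def classify(req: Request) -> ServingClass:
    """Map a request onto an explicit serving class (and thus a pool and queue)."""
    if req.is_offline_job:
        return ServingClass.BATCH
    if req.uses_tools or req.prompt_tokens > LONG_CONTEXT_TOKENS:
        # Long-context and tool traffic gets its own queue so it cannot
        # starve short interactive requests.
        return ServingClass.AGENT
    return ServingClass.INTERACTIVE

# Each class maps to its own replica pool with its own scheduling policy.
POOL_FOR_CLASS = {
    ServingClass.INTERACTIVE: "pool-low-latency",
    ServingClass.BATCH: "pool-high-throughput",
    ServingClass.AGENT: "pool-isolated-long-context",
}
```

Once routes are explicit like this, per-class queues, admission control, and cost allocation all have a concrete key to attach to.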
In practice:
- one cluster can host multiple serving classes, but they usually need different policies
- long-context requests often deserve separate queues or separate pools
- model cascades become easier to manage when routes are explicit
The trade-off is clear: More workload specialization improves control and predictability, but increases operational complexity and the number of moving parts.
A useful mental model is: A production LLM platform is closer to an airport than to a single road. Different planes, routes, and priorities need different gates and traffic rules.
Use this lens when:
- Best fit: planning the first serious serving architecture for a real product portfolio.
- Misuse pattern: putting every request through one identical path because it is simpler on paper.
Concept 2: Deployment Quality Comes from Routing, Scaling, and Isolation Working Together
For example, a team serves all traffic with one quantized 70B model. Quality is high, but cost is extreme. They add a smaller model for easy requests, route hard prompts upward, isolate enterprise tenants, and autoscale decode-heavy pools differently from prefill-heavy ones. The result is not one model improvement, but a serving-architecture improvement.
At a high level, in production, "deployment" means choosing how requests meet capacity and policy.
Mechanically: Important building blocks include:
- request routing
  - send traffic to the right model, region, tier, or queue
  - may depend on tenant, prompt size, product surface, or risk class
- autoscaling
  - scale replicas or serving pods against concurrency, queue depth, TTFT, GPU saturation, or tokens/sec (a small scaling sketch follows this list)
- isolation
  - prevent one tenant or workload from consuming disproportionate memory, cache, or scheduler time
- caching
  - prefix caching, embedding caches, result caches, or tool-response caches can reduce repeated work
- fallback and graceful degradation
  - smaller model fallback
  - delayed batch path
  - reduced context windows
  - temporary tool disablement under pressure
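As a sketch of the autoscaling bullet above: a controller that scales a pool on queue depth and TTFT rather than on CPU load alone. The metric names and thresholds are assumptions for illustration; real values come from load testing and your metrics system:

```python
from dataclasses import dataclass

@dataclass
class PoolMetrics:
    queue_depth: int          # requests waiting for admission
    ttft_p95_ms: float        # 95th-percentile time to first token
    gpu_utilization: float    # 0.0 (idle) .. 1.0 (saturated)

# Hypothetical targets; tune per pool from load tests.
MAX_QUEUE_DEPTH = 32
TTFT_P95_TARGET_MS = 800.0
SCALE_DOWN_UTILIZATION = 0.35

def desired_replica_delta(m: PoolMetrics) -> int:
    """Return a +1 / 0 / -1 replica change from LLM-specific signals.

    CPU load is deliberately absent: it says little about token-level latency.
    """
    if m.queue_depth > MAX_QUEUE_DEPTH or m.ttft_p95_ms > TTFT_P95_TARGET_MS:
        return +1   # requests are waiting: add capacity
    if m.gpu_utilization < SCALE_DOWN_UTILIZATION and m.queue_depth == 0:
        return -1   # pool is idle: shed cost
    return 0
```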
These mechanisms work best together. Routing without isolation can still collapse under noisy neighbors. Autoscaling without queue control can just scale chaos faster.
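The sketch below illustrates that interplay: tiered routing plus graceful degradation driven by one pressure signal. The difficulty flag, pressure value, and model names are hypothetical inputs, standing in for whatever your scheduler and classifier actually expose:

```python
def route(prompt_tokens: int, is_hard: bool, pool_pressure: float) -> dict:
    """Pick a model tier, then degrade gracefully instead of failing.

    pool_pressure: 0.0 (idle) .. 1.0 (saturated), reported by the scheduler.
    is_hard: output of a difficulty heuristic or lightweight classifier.
    """
    # Tiered serving: cheap model first, large model only when needed.
    model = "llm-large" if is_hard else "llm-small"

    decision = {"model": model, "max_context": 32_768, "tools_enabled": True}

    # Graceful degradation: shed optional cost before shedding requests.
    if pool_pressure > 0.8:
        decision["tools_enabled"] = False   # temporary tool disablement
        decision["max_context"] = 8_192     # reduced context window
    if pool_pressure > 0.95 and decision["model"] == "llm-large":
        decision["model"] = "llm-small"     # smaller-model fallback
    return decision
```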
In practice:
- deployment policy is as important as model quality
- tiered serving often dominates "single best model for everyone"
- graceful degradation is part of reliability, not a sign of failure
The trade-off is clear: More routing and isolation logic gives better control over cost and latency, but makes debugging and release management more complex.
A useful mental model is: A good serving stack is a traffic-control system for expensive reasoning.
Use this lens when:
- Best fit: deciding how to serve multiple products, tenants, or model sizes under one platform.
- Misuse pattern: relying on autoscaling alone to solve problems that are really routing or fairness problems.
Concept 3: Release Safety in LLM Serving Requires Observability and Controlled Rollout, Not Just Healthy Pods
For example, a new model version is deployed and all pods stay green. CPU, memory, and GPU look normal. But user satisfaction drops, refusal rate spikes, and JSON tool-call validity falls. From infrastructure health alone, the deployment looked fine.
At a high level, serving health and model behavior health are different things. Production LLM infrastructure must observe both.
Mechanically: A robust deployment pipeline usually needs:
- infrastructure telemetry
  - queue depth
  - TTFT
  - p95/p99 latency
  - GPU utilization
  - cache hit behavior
  - error rates
- behavioral telemetry
  - refusal rates
  - task success
  - structured output validity
  - policy violations
  - cost per successful task
- controlled rollout
  - canaries
  - shadow traffic
  - cohort-based routing
  - rollback thresholds tied to both infra and behavior metrics
This is what makes LLM deployment different from serving a static classifier. The service can stay technically alive while the product is behaviorally regressing.
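Here is a minimal sketch of a rollout gate that joins both sides in one decision. The metric names and thresholds are illustrative assumptions; in practice they would come from your eval suite and SLOs:

```python
from dataclasses import dataclass

@dataclass
class CanaryReport:
    # Infrastructure side
    ttft_p95_ms: float
    error_rate: float
    # Behavior side (measured on canary or shadow traffic)
    task_success_rate: float
    refusal_rate: float
    json_validity_rate: float

def should_rollback(canary: CanaryReport, baseline: CanaryReport) -> bool:
    """Gate on BOTH infra and behavior: green pods alone are not enough."""
    # Hypothetical rollback thresholds, relative to the production baseline.
    infra_regressed = (
        canary.ttft_p95_ms > 1.2 * baseline.ttft_p95_ms
        or canary.error_rate > baseline.error_rate + 0.01
    )
    behavior_regressed = (
        canary.task_success_rate < baseline.task_success_rate - 0.02
        or canary.refusal_rate > baseline.refusal_rate + 0.05
        or canary.json_validity_rate < baseline.json_validity_rate - 0.02
    )
    return infra_regressed or behavior_regressed
```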
In practice:
- model rollouts should be treated like product rollouts, not just container updates
- rollback criteria should include task and policy metrics, not just uptime
- observability needs to join systems metrics and model metrics in one release view
The trade-off is clear: Better release control reduces catastrophic regressions, but slows down rollout velocity and increases measurement overhead.
A useful mental model is: A healthy LLM deployment is one where the hardware, the queue, and the user-visible behavior are all inside acceptable bounds at the same time.
Use this lens when:
- Best fit: productionizing model updates, runtime changes, or safety-policy changes.
- Misuse pattern: declaring success because the service remained available.
Troubleshooting
Issue: "Our serving cluster is healthy, but user experience is worse."
Why it happens / is confusing: Infrastructure metrics only show whether the system is alive, not whether the model outputs remain useful, safe, or well formed.
Clarification / Fix: Add behavioral release metrics to the rollout gate: task completion, refusal profile, structured output validity, and quality regressions on shadow or sampled traffic.
Issue: "Autoscaling should solve our latency spikes."
Why it happens / is confusing: Scaling feels like the general-purpose answer to demand.
Clarification / Fix: If routing, queueing, or tenant isolation are wrong, autoscaling may simply add more expensive replicas while preserving the same contention pattern.
Issue: "One large model endpoint is simpler, so it is probably best."
Why it happens / is confusing: Operational simplicity is attractive.
Clarification / Fix: A single endpoint can be viable early, but mixed workloads often justify model cascades, queue separation, or traffic classes once cost and tail latency start to dominate.
Advanced Connections
Connection 1: Deployment & Serving <-> Inference Optimization
20/15.md focused on how to make one serving path fast.
This lesson answers the next step:
- how to run many serving paths safely in production
- how to route among them
- how to scale them
- how to release them without breaking the product
Connection 2: Deployment & Serving <-> The Whole Month
This month began with data and post-training, then moved through preference optimization, safety, evaluation, quantization, and inference.
Production serving is where all of those constraints meet:
- data and training shape behavior
- alignment shapes intent following
- safety shapes policy boundaries
- eval shapes release decisions
- quantization and runtime shape cost and latency
That is why deployment is the natural capstone for the whole block.
Resources
Optional Deepening Resources
- [PAPER] Efficient Memory Management for Large Language Model Serving with PagedAttention
  - Focus: A foundational serving-runtime idea that directly affects production architecture choices.
- [DOC] vLLM Documentation
  - Focus: Practical runtime concepts for scalable LLM serving and continuous batching.
- [DOC] KServe Documentation
  - Focus: Platform patterns for model deployment, autoscaling, and inference services on Kubernetes.
- [DOC] Ray Serve Documentation
  - Focus: Request routing, replicas, and multi-model serving patterns for production AI systems.
Key Insights
- Production LLM serving begins with workload segmentation - different request classes need different policies, queues, and serving shapes.
- Routing, scaling, and isolation are the core deployment levers - not just model choice or raw hardware count.
- A production rollout is only healthy if both infrastructure and behavior stay within bounds - uptime alone is not enough.