LESSON
Day 319: Inference Optimization - 10x Faster LLM Inference
The core idea: Inference optimization is not one trick that magically makes LLMs fast. It is the systems problem of turning a numerically correct model into a serving pipeline that meets latency, throughput, and cost targets under real traffic, real context lengths, and real concurrency.
Today's "Aha!" Moment
The insight: Most large serving gains come from treating LLM inference as a queueing and memory-management problem, not just as "run matrix multiplications faster."
Why this matters: Once models are large, serving speed depends on more than raw FLOPs:
- how requests are batched
- how KV cache is stored and reused
- how prefill and decode phases behave
- how the scheduler shares hardware across competing requests
Concrete anchor: A team may quantize a model and still see poor user latency because the scheduler fragments batches, KV cache spills, or long prompts block short ones. The model is compressed, but the serving system is still inefficient.
Keep this mental hook in view: Inference speed comes from the serving stack, not just from the model weights.
Why This Matters
20/14.md showed that quantization improves the economics of serving by shrinking memory footprint and bandwidth pressure.
This lesson extends that idea:
- even with a good quantized model, you can still leave most performance on the table if scheduling, batching, and cache management are poor
That is why "10x faster inference" usually does not come from a single low-level kernel. It comes from combining several layers of optimization:
- model representation
- kernel efficiency
- request scheduling
- KV cache handling
- batching strategy
This is also the last big bridge before deployment and serving architecture in 20/16.md.
Learning Objectives
By the end of this session, you should be able to:
- Explain why prefill and decode are different inference phases with different bottlenecks.
- Describe the major serving optimizations: continuous batching, KV cache management, paged attention, fused kernels, and speculative decoding.
- Evaluate inference optimizations against the real metrics that matter: tail latency, throughput, memory pressure, and cost per request.
Core Concepts Explained
Concept 1: LLM Inference Has Two Distinct Phases, and They Behave Differently
For example, a user sends a long prompt and asks for a short answer. Another user sends a short prompt and expects a long completion. Even if both requests touch the same model, they stress the system differently.
At a high level, LLM serving is usually easiest to reason about when separated into:
- prefill: process the input prompt and build the KV cache
- decode: generate tokens one step at a time using the accumulated cache
These phases have different performance characteristics.
Mechanically: Prefill tends to:
- use large matrix operations
- benefit strongly from batching
- be sensitive to prompt length
Decode tends to:
- run repeatedly token by token
- be dominated by memory access and cache reuse
- be sensitive to scheduler decisions and concurrency
This distinction matters because an optimization that helps prefill may do little for decode, and vice versa.
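To make the two phases concrete, here is a minimal single-request sketch, assuming a Hugging Face Transformers-style API, a small illustrative checkpoint, and greedy decoding (the model name and timing code are simplifications, not a production serving pattern). Prefill is one forward pass over the whole prompt that builds the KV cache; decode then feeds one token per step and reuses that cache.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; swap in whatever model you actually serve.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("A long prompt about inference...", return_tensors="pt").input_ids

with torch.no_grad():
    # --- Prefill: one large forward pass over the whole prompt builds the KV cache.
    t0 = time.perf_counter()
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    ttft = time.perf_counter() - t0  # roughly: time to first token

    # --- Decode: one token per step, each step reusing (and extending) the cache.
    generated = [next_token]
    t1 = time.perf_counter()
    for _ in range(32):
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
    decode_tps = len(generated) / (time.perf_counter() - t1)

print(f"TTFT ~ {ttft:.3f}s, decode ~ {decode_tps:.1f} tokens/s")
```

Even in this toy loop the asymmetry is visible: prefill cost grows with prompt length in a single pass, while decode cost accumulates one cache-bound step at a time.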
In practice:
- long prompts can dominate latency even if generation is short
- high token throughput does not guarantee good first-token latency
- system tuning should separate TTFT (time to first token) from decode speed rather than treating "latency" as one number
The trade-off is clear: Optimizing the system for long batched prefills can hurt interactive responsiveness, while optimizing for low-latency decode can reduce total throughput.
A useful mental model is: Prefill is the cost of loading context into the model's working memory. Decode is the repeated cost of thinking one token at a time.
Use this lens when:
- Best fit: diagnosing whether latency problems come from prompt ingestion or token generation.
- Misuse pattern: comparing serving systems on one blended throughput number only.
Concept 2: Big Serving Gains Come from Scheduler and Memory Design as Much as from Math Kernels
For example, a naive server batches requests statically and allocates KV cache contiguously. Under mixed workloads, short requests wait behind long ones and memory becomes fragmented. Another server uses continuous batching and paged cache management, so tokens from many requests can be interleaved efficiently.
At a high level, inference optimization at scale is about how well the runtime keeps the accelerator busy without wasting memory or starving interactive requests.
Mechanically: Major optimizations include:
- continuous batching
  - admit new requests into decode batches as older requests finish
  - improves utilization over fixed batch boundaries
- KV cache optimization
  - reuse and place attention state efficiently
  - often one of the dominant memory problems in long-context serving
- paged attention / paged KV management
  - reduces fragmentation and makes cache allocation more flexible
- kernel fusion and specialized runtimes
  - reduce overhead between operations
  - improve arithmetic and memory efficiency
- speculative decoding
  - draft tokens with a smaller model or auxiliary mechanism, then verify with the main model
  - can accelerate generation when acceptance rate is good
Each helps for a different reason. Together they change the utilization profile of the serving system.
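As a rough illustration of continuous batching, here is a toy scheduler loop in plain Python (no real model; `decode_step` stands in for one batched forward pass, and all names are made up for this sketch). The point is that slots freed by finished requests are refilled immediately instead of waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

def decode_step(batch):
    # Stand-in for one fused decode step over the whole batch:
    # in a real runtime this is a single batched forward pass.
    for r in batch:
        r.generated += 1

def continuous_batching(waiting: deque, max_batch: int = 8):
    running = []
    while waiting or running:
        # Admit new requests as soon as slots free up, instead of waiting
        # for the entire batch to finish (static batching).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished requests; their slots become available next iteration.
        running = [r for r in running if r.generated < r.max_new_tokens]

continuous_batching(deque(Request(i, 128, 16 + 8 * (i % 3)) for i in range(20)))
```

Real runtimes add prefill scheduling, preemption, and memory checks on top of this loop, but the utilization gain comes from the same idea.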
In practice:
- a "faster model" may still serve poorly without a strong runtime
- runtime choice can change hardware requirements dramatically
- queueing policy and fairness matter just as much as raw kernel speed for user experience
The trade-off is clear: More advanced serving runtimes can dramatically improve efficiency, but they increase operational complexity, hardware assumptions, and debugging difficulty.
A useful mental model is: The model is the engine. The scheduler, cache manager, and kernels are the transmission, cooling, and gearbox that determine how much of that engine you can actually use.
Use this lens when:
- Best fit: comparing serving frameworks or explaining why vLLM- or TensorRT-style runtimes outperform naive loops.
- Misuse pattern: assuming kernel-level optimization alone explains production latency.
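To see why paged KV management reduces fragmentation, here is a toy block-table allocator in the spirit of PagedAttention (the class and method names are invented for illustration, not any library's API). Each sequence maps logical token positions to fixed-size physical blocks, so its cache never needs to be contiguous, and finished sequences return whole blocks to the pool.

```python
class PagedKVCache:
    """Toy block allocator: the KV cache is split into fixed-size blocks,
    so a sequence's cache does not need contiguous memory."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:  # sequence crossed into a new block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; scheduler must preempt or evict")
            table.append(self.free_blocks.pop())
        return table[pos // self.block_size], pos % self.block_size

    def free(self, seq_id: int):
        # Whole blocks go back to the pool, avoiding the fragmentation caused
        # by variable-length contiguous allocations.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8, block_size=16)
for pos in range(40):                  # a 40-token sequence spans 3 blocks
    cache.append_token(seq_id=0, pos=pos)
cache.free(seq_id=0)                   # all 3 blocks return to the free pool
```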
Concept 3: "Faster Inference" Only Counts If the Right SLOs Improve
For example, a new runtime improves aggregate tokens/sec by 30%, but p95 first-token latency worsens, short requests queue behind long requests, and cost per successful conversation barely changes. The benchmark says "faster"; the product says "not better."
At a high level, inference optimization should be judged by service objectives, not by one flattering metric.
Mechanically: Useful serving metrics often include:
- TTFT (time to first token)
- tokens/sec during decode
- p95 / p99 latency
- throughput under concurrency
- GPU memory utilization
- cost per request or per generated token
- quality regressions after runtime tricks
The key is that these metrics interact:
- bigger batches improve utilization but may hurt interactivity
- more cache reuse helps throughput but raises memory footprint
- speculative decoding may help only on certain prompt and model distributions
So the goal is not maximum speed in isolation. It is a serving profile that matches the product.
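As a sketch of how such a serving profile might be computed from request logs (the `RequestTrace` fields and the percentile math are simplified assumptions, not a benchmarking standard):

```python
import statistics
from dataclasses import dataclass

@dataclass
class RequestTrace:
    arrival: float       # seconds, when the request entered the queue
    first_token: float   # when the first token was emitted
    finished: float      # when the last token was emitted
    tokens_out: int

def serving_profile(traces: list) -> dict:
    ttft = sorted(t.first_token - t.arrival for t in traces)
    decode_tps = [
        t.tokens_out / (t.finished - t.first_token)
        for t in traces if t.finished > t.first_token
    ]
    wall_clock = max(t.finished for t in traces) - min(t.arrival for t in traces)
    return {
        "ttft_p50": statistics.median(ttft),
        "ttft_p95": ttft[int(0.95 * (len(ttft) - 1))],   # crude percentile
        "decode_tokens_per_s": statistics.mean(decode_tps),
        "throughput_req_per_s": len(traces) / wall_clock,
    }
```

Reporting these together, rather than a single blended number, is what makes the batching-versus-interactivity trade-offs above visible.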
In practice:
- chat systems often prioritize TTFT and tail latency
- offline batch generation may prioritize throughput
- agent systems may care about long-context stability and tool-call overhead
The trade-off is clear: The serving profile that is best for one workload can be wrong for another, even on the same model.
A useful mental model is: Inference optimization is SLO tuning for a generative system, not a single contest for highest tokens/sec.
Use this lens when:
- Best fit: deciding which optimizations to ship for a specific product surface.
- Misuse pattern: treating leaderboard-style speed numbers as universal truth.
Troubleshooting
Issue: "We quantized the model, but latency barely improved."
Why it happens / is confusing: Smaller weights help, but serving may still be limited by batching policy, prompt length distribution, KV cache behavior, or runtime overhead.
Clarification / Fix: Measure prefill separately from decode, inspect memory utilization and queueing, and verify whether the runtime is actually using the optimized kernels for your setup.
Issue: "Our throughput looks great, but users say the system feels slower."
Why it happens / is confusing: Aggregate throughput can improve while first-token latency or tail latency gets worse.
Clarification / Fix: Track TTFT, p95, and fairness across short vs long requests. A serving system can be efficient overall and still feel bad interactively.
Issue: "Speculative decoding should always help."
Why it happens / is confusing: More tokens per verification step sounds universally better.
Clarification / Fix: It depends on acceptance rate, draft-model overhead, and workload shape. If verification rejects too many drafts, the complexity may not pay off.
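A back-of-the-envelope simulation makes the dependence on acceptance rate visible. Everything here is synthetic: `target_accepts` stands in for the target model's verification rule, and the draft model's own cost is ignored, which flatters speculative decoding.

```python
import random

def speculative_step(draft_tokens, target_accepts):
    """Toy draft-and-verify step: count draft tokens kept before the first rejection."""
    accepted = 0
    for tok in draft_tokens:
        if target_accepts(tok):
            accepted += 1
        else:
            break
    return accepted

def simulate(acceptance_prob: float, draft_len: int = 4, steps: int = 1000):
    total_tokens, target_calls = 0, 0
    for _ in range(steps):
        accepted = speculative_step(
            range(draft_len), lambda _tok: random.random() < acceptance_prob
        )
        # Each verification pass is one target-model forward and always yields
        # at least one token (the target's own correction or bonus token).
        total_tokens += accepted + 1
        target_calls += 1
    return total_tokens / target_calls  # tokens per target forward pass

for p in (0.9, 0.6, 0.3):
    print(f"acceptance {p:.1f}: ~{simulate(p):.2f} tokens per target pass")
```

With high acceptance the target model effectively emits several tokens per forward pass; with low acceptance it degrades toward one token per pass while still paying for the drafts.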
Advanced Connections
Connection 1: Inference Optimization <-> Quantization
20/14.md addressed model-side serving efficiency through lower precision.
This lesson expands to runtime-side efficiency:
- cache layout
- batching
- scheduling
- decoding strategy
Quantization helps, but it is only one part of the serving stack.
Connection 2: Inference Optimization <-> Deployment & Serving
This lesson sets up 20/16.md.
Once you optimize inference, you still need to answer deployment questions:
- how many replicas?
- how to autoscale?
- how to route long vs short requests?
- how to isolate expensive tenants?
So optimization naturally flows into production serving architecture.
Resources
Optional Deepening Resources
- [PAPER] Efficient Memory Management for Large Language Model Serving with PagedAttention
  - Focus: Why KV cache allocation and continuous batching are central to modern high-throughput LLM serving.
- [PAPER] Accelerating Large Language Model Decoding with Speculative Sampling
  - Focus: The core idea behind speculative decoding and when draft-and-verify can improve generation speed.
- [DOC] vLLM Documentation
  - Focus: Practical serving concepts around paged attention, batching, and scalable LLM inference.
- [DOC] TensorRT-LLM Documentation
  - Focus: Kernel- and runtime-level optimization patterns for production LLM serving.
Key Insights
- Inference optimization starts with understanding prefill and decode as different phases - each has different bottlenecks and different optimization levers.
- Serving speed comes from the runtime stack as much as from the model - batching, KV cache management, schedulers, and kernels are where many of the biggest gains appear.
- The right optimization is the one that improves the product SLOs that matter - tokens/sec alone is not enough.