LESSON
Day 319: Inference Optimization - 10x Faster LLM Inference
The core idea: Inference optimization is not one trick that magically makes LLMs fast. It is the systems problem of turning a numerically correct model into a serving pipeline that meets latency, throughput, and cost targets under real traffic, real context lengths, and real concurrency.
Today's "Aha!" Moment
The insight: Most large serving gains come from treating LLM inference as a queueing and memory-management problem, not just as "run matrix multiplications faster."
Why this matters: Once models are large, serving speed depends on more than raw FLOPs:
- how requests are batched
- how KV cache is stored and reused
- how prefill and decode phases behave
- how the scheduler shares hardware across competing requests
Concrete anchor: A team may quantize a model and still see poor user latency because the scheduler fragments batches, KV cache spills, or long prompts block short ones. The model is compressed, but the serving system is still inefficient.
Keep this mental hook in view: Inference speed comes from the serving stack, not just from the model weights.
Why This Matters
20/14.md showed that quantization improves the economics of serving by shrinking memory footprint and bandwidth pressure.
This lesson extends that idea:
- even with a good quantized model, you can still leave most performance on the table if scheduling, batching, and cache management are poor
That is why "10x faster inference" usually does not come from a single low-level kernel. It comes from combining several layers of optimization:
- model representation
- kernel efficiency
- request scheduling
- KV cache handling
- batching strategy
This is also the last big bridge before deployment and serving architecture in 20/16.md.
Learning Objectives
By the end of this session, you should be able to:
- Explain why prefill and decode are different inference phases with different bottlenecks.
- Describe the major serving optimizations: continuous batching, KV cache management, paged attention, fused kernels, and speculative decoding.
- Evaluate inference optimizations against the real metrics that matter: tail latency, throughput, memory pressure, and cost per request.
Core Concepts Explained
Concept 1: LLM Inference Has Two Distinct Phases, and They Behave Differently
For example, a user sends a long prompt and asks for a short answer. Another user sends a short prompt and expects a long completion. Even if both requests touch the same model, they stress the system differently.
At a high level, LLM serving is usually easiest to reason about when separated into:
- prefill: process the input prompt and build the KV cache
- decode: generate tokens one step at a time using the accumulated cache
These phases have different performance characteristics.
Mechanically: Prefill tends to:
- use large matrix operations
- benefit strongly from batching
- be sensitive to prompt length
Decode tends to:
- run repeatedly token by token
- be dominated by memory access and cache reuse
- be sensitive to scheduler decisions and concurrency
This distinction matters because an optimization that helps prefill may do little for decode, and vice versa.
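To make the two phases concrete, here is a minimal single-request sketch, assuming a Hugging Face Transformers-style API, a small illustrative checkpoint, and greedy decoding (the model name and timing code are simplifications, not a production serving pattern). Prefill is one forward pass over the whole prompt that builds the KV cache; decode then feeds one token per step and reuses that cache.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; swap in whatever model you actually serve.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("A long prompt about inference...", return_tensors="pt").input_ids

with torch.no_grad():
    # --- Prefill: one large forward pass over the whole prompt builds the KV cache.
    t0 = time.perf_counter()
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    ttft = time.perf_counter() - t0  # roughly: time to first token

    # --- Decode: one token per step, each step reusing (and extending) the cache.
    generated = [next_token]
    t1 = time.perf_counter()
    for _ in range(32):
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
    decode_tps = len(generated) / (time.perf_counter() - t1)

print(f"TTFT ~ {ttft:.3f}s, decode ~ {decode_tps:.1f} tokens/s")
```

Even in this toy loop the asymmetry is visible: prefill cost grows with prompt length in a single pass, while decode cost accumulates one cache-bound step at a time.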
In practice:
- long prompts can dominate latency even if generation is short
- high token throughput does not guarantee good first-token latency
- system tuning should separate TTFT (time to first token) from decode speed rather than treating "latency" as one number
The trade-off is clear: Optimizing the system for long batched prefills can hurt interactive responsiveness, while optimizing for low-latency decode can reduce total throughput.
A useful mental model is: Prefill is the cost of loading context into the model's working memory. Decode is the repeated cost of thinking one token at a time.
Use this lens when:
- Best fit: diagnosing whether latency problems come from prompt ingestion or token generation.
- Misuse pattern: comparing serving systems on one blended throughput number only.
Concept 2: Big Serving Gains Come from Scheduler and Memory Design as Much as from Math Kernels
For example, a naive server batches requests statically and allocates KV cache contiguously. Under mixed workloads, short requests wait behind long ones and memory becomes fragmented. Another server uses continuous batching and paged cache management, so tokens from many requests can be interleaved efficiently.
At a high level, inference optimization at scale is about how well the runtime keeps the accelerator busy without wasting memory or starving interactive requests.
Mechanically: Major optimizations include:
- continuous batching
  - admit new requests into decode batches as older requests finish
  - improves utilization over fixed batch boundaries
- KV cache optimization
  - reuse and place attention state efficiently
  - often one of the dominant memory problems in long-context serving
- paged attention / paged KV management
  - reduces fragmentation and makes cache allocation more flexible
- kernel fusion and specialized runtimes
  - reduce overhead between operations
  - improve arithmetic and memory efficiency
- speculative decoding
  - draft tokens with a smaller model or auxiliary mechanism, then verify with the main model
  - can accelerate generation when acceptance rate is good
Each helps for a different reason. Together they change the utilization profile of the serving system.
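As a rough illustration of continuous batching, here is a toy scheduler loop in plain Python (no real model; `decode_step` stands in for one batched forward pass, and all names are made up for this sketch). The point is that slots freed by finished requests are refilled immediately instead of waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

def decode_step(batch):
    # Stand-in for one fused decode step over the whole batch:
    # in a real runtime this is a single batched forward pass.
    for r in batch:
        r.generated += 1

def continuous_batching(waiting: deque, max_batch: int = 8):
    running = []
    while waiting or running:
        # Admit new requests as soon as slots free up, instead of waiting
        # for the entire batch to finish (static batching).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished requests; their slots become available next iteration.
        running = [r for r in running if r.generated < r.max_new_tokens]

continuous_batching(deque(Request(i, 128, 16 + 8 * (i % 3)) for i in range(20)))
```

Real runtimes add prefill scheduling, preemption, and memory checks on top of this loop, but the utilization gain comes from the same idea.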
In practice:
- a "faster model" may still serve poorly without a strong runtime
- runtime choice can change hardware requirements dramatically
- queueing policy and fairness matter just as much as raw kernel speed for user experience
The trade-off is clear: More advanced serving runtimes can dramatically improve efficiency, but they increase operational complexity, hardware assumptions, and debugging difficulty.
A useful mental model is: The model is the engine. The scheduler, cache manager, and kernels are the transmission, cooling, and gearbox that determine how much of that engine you can actually use.
Use this lens when:
- Best fit: comparing serving frameworks or explaining why vLLM- or TensorRT-style runtimes outperform naive loops.
- Misuse pattern: assuming kernel-level optimization alone explains production latency.
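To see why paged KV management reduces fragmentation, here is a toy block-table allocator in the spirit of PagedAttention (the class and method names are invented for illustration, not any library's API). Each sequence maps logical token positions to fixed-size physical blocks, so its cache never needs to be contiguous, and finished sequences return whole blocks to the pool.

```python
class PagedKVCache:
    """Toy block allocator: the KV cache is split into fixed-size blocks,
    so a sequence's cache does not need contiguous memory."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:  # sequence crossed into a new block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; scheduler must preempt or evict")
            table.append(self.free_blocks.pop())
        return table[pos // self.block_size], pos % self.block_size

    def free(self, seq_id: int):
        # Whole blocks go back to the pool, avoiding the fragmentation caused
        # by variable-length contiguous allocations.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8, block_size=16)
for pos in range(40):                  # a 40-token sequence spans 3 blocks
    cache.append_token(seq_id=0, pos=pos)
cache.free(seq_id=0)                   # all 3 blocks return to the free pool
```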
Concept 3: "Faster Inference" Only Counts If the Right SLOs Improve
For example, a new runtime improves aggregate tokens/sec by 30%, but p95 first-token latency worsens, short requests queue behind long requests, and cost per successful conversation barely changes. The benchmark says "faster"; the product says "not better."
At a high level, inference optimization should be judged by service objectives, not by one flattering metric.
Mechanically: Useful serving metrics often include:
- TTFT (time to first token)
- tokens/sec during decode
- p95 / p99 latency
- throughput under concurrency
- GPU memory utilization
- cost per request or per generated token
- quality regressions after runtime tricks
The key is that these metrics interact:
- bigger batches improve utilization but may hurt interactivity
- more cache reuse helps throughput but raises memory footprint
- speculative decoding may help only on certain prompt and model distributions
So the goal is not maximum speed in isolation. It is a serving profile that matches the product.
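As a sketch of how such a serving profile might be computed from request logs (the `RequestTrace` fields and the percentile math are simplified assumptions, not a benchmarking standard):

```python
import statistics
from dataclasses import dataclass

@dataclass
class RequestTrace:
    arrival: float       # seconds, when the request entered the queue
    first_token: float   # when the first token was emitted
    finished: float      # when the last token was emitted
    tokens_out: int

def serving_profile(traces: list) -> dict:
    ttft = sorted(t.first_token - t.arrival for t in traces)
    decode_tps = [
        t.tokens_out / (t.finished - t.first_token)
        for t in traces if t.finished > t.first_token
    ]
    wall_clock = max(t.finished for t in traces) - min(t.arrival for t in traces)
    return {
        "ttft_p50": statistics.median(ttft),
        "ttft_p95": ttft[int(0.95 * (len(ttft) - 1))],   # crude percentile
        "decode_tokens_per_s": statistics.mean(decode_tps),
        "throughput_req_per_s": len(traces) / wall_clock,
    }
```

Reporting these together, rather than a single blended number, is what makes the batching-versus-interactivity trade-offs above visible.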
In practice:
- chat systems often prioritize TTFT and tail latency
- offline batch generation may prioritize throughput
- agent systems may care about long-context stability and tool-call overhead
The trade-off is clear: The serving profile that is best for one workload can be wrong for another, even on the same model.
A useful mental model is: Inference optimization is SLO tuning for a generative system, not a single contest for highest tokens/sec.
Use this lens when:
- Best fit: deciding which optimizations to ship for a specific product surface.
- Misuse pattern: treating leaderboard-style speed numbers as universal truth.
Troubleshooting
Issue: "We quantized the model, but latency barely improved."
Why it happens / is confusing: Smaller weights help, but serving may still be limited by batching policy, prompt length distribution, KV cache behavior, or runtime overhead.
Clarification / Fix: Measure prefill separately from decode, inspect memory utilization and queueing, and verify whether the runtime is actually using the optimized kernels for your setup.
Issue: "Our throughput looks great, but users say the system feels slower."
Why it happens / is confusing: Aggregate throughput can improve while first-token latency or tail latency gets worse.
Clarification / Fix: Track TTFT, p95, and fairness across short vs long requests. A serving system can be efficient overall and still feel bad interactively.
Issue: "Speculative decoding should always help."
Why it happens / is confusing: More tokens per verification step sounds universally better.
Clarification / Fix: It depends on acceptance rate, draft-model overhead, and workload shape. If verification rejects too many drafts, the complexity may not pay off.
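A back-of-the-envelope simulation makes the dependence on acceptance rate visible. Everything here is synthetic: `target_accepts` stands in for the target model's verification rule, and the draft model's own cost is ignored, which flatters speculative decoding.

```python
import random

def speculative_step(draft_tokens, target_accepts):
    """Toy draft-and-verify step: count draft tokens kept before the first rejection."""
    accepted = 0
    for tok in draft_tokens:
        if target_accepts(tok):
            accepted += 1
        else:
            break
    return accepted

def simulate(acceptance_prob: float, draft_len: int = 4, steps: int = 1000):
    total_tokens, target_calls = 0, 0
    for _ in range(steps):
        accepted = speculative_step(
            range(draft_len), lambda _tok: random.random() < acceptance_prob
        )
        # Each verification pass is one target-model forward and always yields
        # at least one token (the target's own correction or bonus token).
        total_tokens += accepted + 1
        target_calls += 1
    return total_tokens / target_calls  # tokens per target forward pass

for p in (0.9, 0.6, 0.3):
    print(f"acceptance {p:.1f}: ~{simulate(p):.2f} tokens per target pass")
```

With high acceptance the target model effectively emits several tokens per forward pass; with low acceptance it degrades toward one token per pass while still paying for the drafts.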
Advanced Connections
Connection 1: Inference Optimization <-> Quantization
20/14.md addressed model-side serving efficiency through lower precision.
This lesson expands to runtime-side efficiency:
- cache layout
- batching
- scheduling
- decoding strategy
Quantization helps, but it is only one part of the serving stack.
Connection 2: Inference Optimization <-> Deployment & Serving
This lesson sets up 20/16.md.
Once you optimize inference, you still need to answer deployment questions:
- how many replicas?
- how to autoscale?
- how to route long vs short requests?
- how to isolate expensive tenants?
So optimization naturally flows into production serving architecture.
Resources
Optional Deepening Resources
- [PAPER] Efficient Memory Management for Large Language Model Serving with PagedAttention
  - Focus: Why KV cache allocation and continuous batching are central to modern high-throughput LLM serving.
- [PAPER] Accelerating Large Language Model Decoding with Speculative Sampling
  - Focus: The core idea behind speculative decoding and when draft-and-verify can improve generation speed.
- [DOC] vLLM Documentation
  - Focus: Practical serving concepts around paged attention, batching, and scalable LLM inference.
- [DOC] TensorRT-LLM Documentation
  - Focus: Kernel- and runtime-level optimization patterns for production LLM serving.
Key Insights
- Inference optimization starts with understanding prefill and decode as different phases - each has different bottlenecks and different optimization levers.
- Serving speed comes from the runtime stack as much as from the model - batching, KV cache management, schedulers, and kernels are where many of the biggest gains appear.
- The right optimization is the one that improves the product SLOs that matter - tokens/sec alone is not enough.