Inference Optimization - 10x Faster LLM Inference


LLM Training, Alignment, and Serving

Lesson 015 · 30 min · Intermediate

Day 319: Inference Optimization - 10x Faster LLM Inference

The core idea: Inference optimization is not one trick that magically makes LLMs fast. It is the systems problem of turning a numerically correct model into a serving pipeline that meets latency, throughput, and cost targets under real traffic, real context lengths, and real concurrency.


Today's "Aha!" Moment

The insight: Most large serving gains come from treating LLM inference as a queueing and memory-management problem, not just as "run matrix multiplications faster."

Why this matters: Once models are large, serving speed depends on more than raw FLOPs:

  - Memory bandwidth: each decode step streams the weights and the growing KV cache.
  - Batching and scheduling policy under concurrent traffic.
  - The distribution of prompt and output lengths across real requests.

Concrete anchor: A team may quantize a model and still see poor user latency because the scheduler fragments batches, KV cache spills, or long prompts block short ones. The model is compressed, but the serving system is still inefficient.

Keep this mental hook in view: Inference speed comes from the serving stack, not just from the model weights.


Why This Matters

20/14.md showed that quantization improves the economics of serving by shrinking memory footprint and bandwidth pressure.

This lesson extends that idea:

  - Quantization reduces what must be computed and moved per token.
  - Serving optimization decides how well the hardware stays busy while doing it.

That is why "10x faster inference" usually does not come from a single low-level kernel. It comes from combining several layers of optimization:

  - Model-level: quantization and related compression to shrink compute and memory.
  - Runtime-level: continuous batching, KV cache management, paged attention, fused kernels.
  - Algorithm-level: speculative decoding to emit more than one token per verification step.

This is also the last big bridge before deployment and serving architecture in 20/16.md.


Learning Objectives

By the end of this session, you should be able to:

  1. Explain why prefill and decode are different inference phases with different bottlenecks.
  2. Describe the major serving optimizations: continuous batching, KV cache management, paged attention, fused kernels, and speculative decoding.
  3. Evaluate inference optimizations against the real metrics that matter: tail latency, throughput, memory pressure, and cost per request.

Core Concepts Explained

Concept 1: LLM Inference Has Two Distinct Phases, and They Behave Differently

For example, a user sends a long prompt and asks for a short answer. Another user sends a short prompt and expects a long completion. Even if both requests touch the same model, they stress the system differently.

At a high level, LLM serving is usually easiest to reason about when separated into:

  - Prefill: processing the entire prompt in one pass to populate the KV cache and produce the first output token.
  - Decode: generating subsequent tokens one at a time, each step reading the KV cache.

These phases have different performance characteristics.

Mechanically: Prefill tends to:

  - Process the whole prompt in parallel, so it is largely compute-bound.
  - Scale with prompt length and dominate time-to-first-token (TTFT).

Decode tends to:

  - Generate one token per step, so it is largely memory-bandwidth-bound: each step re-reads the weights and the growing KV cache.
  - Scale with output length and dominate total generation time.

This distinction matters because an optimization that helps prefill may do little for decode, and vice versa.

In practice:

  - Long prompts inflate TTFT even when per-token decode is fast.
  - High concurrency stresses decode throughput and KV cache capacity.
  - Chunking long prefills can keep short interactive requests responsive.

The trade-off is clear: Optimizing the system for long batched prefills can hurt interactive responsiveness, while optimizing for low-latency decode can reduce total throughput.

A useful mental model is: Prefill is the cost of loading context into the model's working memory. Decode is the repeated cost of thinking one token at a time.
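This mental model can be turned into a toy cost model. The sketch below uses illustrative rates, not benchmarks of any real model: prefill processes the prompt in one parallel pass, decode pays a fixed cost per generated token.

```python
# Toy two-phase latency model for LLM inference (illustrative rates only).
# Prefill: one parallel pass over the prompt -> drives time-to-first-token.
# Decode: one sequential step per output token -> drives total latency.

def estimate_latency(prompt_tokens: int, output_tokens: int,
                     prefill_tok_per_s: float = 8000.0,
                     decode_tok_per_s: float = 60.0) -> dict:
    """Return TTFT and total latency in seconds under the toy model."""
    ttft = prompt_tokens / prefill_tok_per_s     # cost of loading context
    decode = output_tokens / decode_tok_per_s    # cost of thinking token by token
    return {"ttft_s": ttft, "total_s": ttft + decode}

# Long prompt, short answer: latency dominated by prefill (TTFT).
long_prompt = estimate_latency(prompt_tokens=16000, output_tokens=50)
# Short prompt, long answer: latency dominated by decode.
long_answer = estimate_latency(prompt_tokens=200, output_tokens=1000)
```

Even with made-up rates, the asymmetry is the point: the same model produces a 2 s TTFT in one case and a 17 s total generation in the other, and different optimizations target each.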

Use this lens when:

  - Diagnosing whether a latency complaint is about TTFT or about tokens/sec during generation.
  - Deciding which phase a proposed optimization actually targets.

Concept 2: Big Serving Gains Come from Scheduler and Memory Design as Much as from Math Kernels

For example, a naive server batches requests statically and allocates KV cache contiguously. Under mixed workloads, short requests wait behind long ones and memory becomes fragmented. Another server uses continuous batching and paged cache management, so tokens from many requests can be interleaved efficiently.

At a high level, inference optimization at scale is about how well the runtime keeps the accelerator busy without wasting memory or starving interactive requests.

Mechanically: Major optimizations include:

  - Continuous batching: new requests join the running batch at token granularity instead of waiting for the whole batch to drain.
  - KV cache management and paged attention: cache entries live in fixed-size blocks mapped through per-request tables, avoiding fragmentation.
  - Fused kernels: combining attention and elementwise operations into fewer launches to cut memory traffic and overhead.
  - Speculative decoding: a small draft model proposes several tokens that the large model verifies in one pass.

Each helps for a different reason. Together they change the utilization profile of the serving system.
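The scheduling effect is easy to see in a minimal simulation. The sketch below is not a real serving runtime; it only contrasts a static policy (refill the batch when it drains) with a continuous policy (refill free slots every decode step):

```python
# Minimal continuous-batching simulation (a sketch, not a real scheduler).
from collections import deque

def simulate(output_lengths, batch_size, continuous):
    """Return total decode steps to finish all requests.

    output_lengths: tokens each request still needs to generate.
    continuous=False: only refill the batch once it is fully drained.
    continuous=True:  refill free slots at every decode step.
    """
    queue = deque(output_lengths)
    active, steps = [], 0
    while queue or active:
        if continuous or not active:            # refill policy
            while len(active) < batch_size and queue:
                active.append(queue.popleft())
        steps += 1                              # one decode step for the batch
        active = [t - 1 for t in active if t > 1]
    return steps

# Mixed workload: short requests interleaved with long ones.
lengths = [5, 100, 5, 100, 5, 5]
static_steps = simulate(lengths, batch_size=2, continuous=False)      # 205 steps
continuous_steps = simulate(lengths, batch_size=2, continuous=True)   # 110 steps
```

With the same model and the same batch size, the static policy wastes almost half its slots waiting for the long requests, while the continuous policy reaches the lower bound of total_tokens / batch_size steps.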

In practice:

  - Continuous batching typically lifts throughput most under mixed request lengths.
  - Paged KV caches raise the number of concurrent sequences a fixed memory budget can hold.
  - Fused kernels and speculative decoding mostly improve per-token decode latency.

The trade-off is clear: More advanced serving runtimes can dramatically improve efficiency, but they increase operational complexity, hardware assumptions, and debugging difficulty.

A useful mental model is: The model is the engine. The scheduler, cache manager, and kernels are the transmission, cooling, and gearbox that determine how much of that engine you can actually use.
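The cache-manager part of that "transmission" can be sketched as a toy paged allocator. This is the idea behind paged attention, reduced to bookkeeping (block ids and tables are hypothetical; a real runtime stores actual key/value tensors in the blocks):

```python
# Toy paged KV-cache allocator: fixed-size blocks plus per-request
# block tables, so freed blocks are reused without fragmentation.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of free block ids
        self.tables = {}                      # request id -> its block ids
        self.lengths = {}                     # request id -> tokens stored

    def append_token(self, req_id: str) -> None:
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:          # current block full (or first token)
            if not self.free:
                raise MemoryError("KV cache exhausted: preempt or swap a request")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id: str) -> None:
        # Blocks return to the pool immediately, in any order.
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(40):                  # 40 tokens need ceil(40/16) = 3 blocks
    cache.append_token("req-a")
blocks_used = len(cache.tables["req-a"])
cache.release("req-a")               # all 3 blocks become reusable at once
```

Contrast this with contiguous allocation: there, a request must reserve its maximum possible context up front, and freed regions of mixed sizes fragment the memory pool.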

Use this lens when:

  - Comparing serving runtimes: ask how each schedules batches and manages KV memory, not just which kernels it ships.
  - Deciding whether a regression is a model problem or a runtime problem.

Concept 3: "Faster Inference" Only Counts If the Right SLOs Improve

For example, a new runtime improves aggregate tokens/sec by 30%, but p95 first-token latency worsens, short requests queue behind long requests, and cost per successful conversation barely changes. The benchmark says "faster"; the product says "not better."

At a high level, inference optimization should be judged by service-level objectives (SLOs), not by one flattering metric.

Mechanically: Useful serving metrics often include:

  - Time-to-first-token (TTFT) and its tail (p95/p99), which drive interactive feel.
  - Per-token decode latency and end-to-end request latency.
  - Aggregate throughput (tokens/sec) and goodput under the SLO.
  - Memory pressure (KV cache occupancy, preemptions) and cost per request.

The key is that these metrics interact:

  - Larger batches raise throughput but can lengthen TTFT and tail latency.
  - Admitting more concurrent sequences improves utilization until KV cache pressure forces preemption or swapping.
  - Packing the accelerator harder to cut cost per request usually moves latency in the wrong direction.

So the goal is not maximum speed in isolation. It is a serving profile that matches the product.

In practice:

  - Define SLOs first (e.g. p95 TTFT under a threshold at a target request rate), then tune toward them.
  - Report tail percentiles alongside averages; averages hide queueing pathologies.

The trade-off is clear: The serving profile that is best for one workload can be wrong for another, even on the same model.

A useful mental model is: Inference optimization is SLO tuning for a generative system, not a single contest for highest tokens/sec.

Use this lens when:

  - Reading vendor or benchmark claims of "Nx faster inference".
  - Choosing between configurations that trade throughput against tail latency.


Troubleshooting

Issue: "We quantized the model, but latency barely improved."

Why it happens / is confusing: Smaller weights help, but serving may still be limited by batching policy, prompt length distribution, KV cache behavior, or runtime overhead.

Clarification / Fix: Measure prefill separately from decode, inspect memory utilization and queueing, and verify whether the runtime is actually using the optimized kernels for your setup.
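A quick back-of-envelope check often explains this symptom: at long context and high concurrency, the KV cache, not the weights, can dominate memory traffic. The shapes below assume a hypothetical 7B-class model (32 layers, 32 KV heads, head dim 128, fp16 cache); substitute your model's numbers:

```python
# KV cache footprint: 2 (keys + values) x layers x heads x head_dim
# x context x batch x bytes per element. Hypothetical 7B-class shapes.

def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   context_len=8192, batch=32, bytes_per_elem=2):
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch * bytes_per_elem)

gib = kv_cache_bytes() / 2**30   # 128 GiB for this configuration
```

For this configuration the cache alone is 128 GiB, far larger than even unquantized 7B weights, so shrinking the weights with quantization cannot fix a latency problem rooted in cache traffic or cache-induced preemption.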

Issue: "Our throughput looks great, but users say the system feels slower."

Why it happens / is confusing: Aggregate throughput can improve while first-token latency or tail latency gets worse.

Clarification / Fix: Track TTFT, p95, and fairness across short vs long requests. A serving system can be efficient overall and still feel bad interactively.

Issue: "Speculative decoding should always help."

Why it happens / is confusing: More tokens per verification step sounds universally better.

Clarification / Fix: It depends on acceptance rate, draft-model overhead, and workload shape. If verification rejects too many drafts, the complexity may not pay off.
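The dependence on acceptance rate can be made concrete with the standard geometric model: with draft length k and per-token acceptance probability a, the expected number of tokens produced per verification step is (1 - a^(k+1)) / (1 - a). This ignores draft-model overhead, so treat it as an upper bound:

```python
# Expected tokens per verification step under speculative decoding,
# geometric acceptance model (ignores draft-model compute overhead).

def expected_tokens(accept_rate: float, draft_len: int) -> float:
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

high = expected_tokens(accept_rate=0.8, draft_len=4)   # ~3.36 tokens/step
low = expected_tokens(accept_rate=0.3, draft_len=4)    # ~1.43 tokens/step
```

At 80% acceptance a draft of 4 yields over 3 tokens per large-model pass; at 30% it yields barely 1.4, which may not cover the cost of running the draft model at all.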


Advanced Connections

Connection 1: Inference Optimization <-> Quantization

20/14.md addressed model-side serving efficiency through lower precision.

This lesson expands to runtime-side efficiency:

  - Quantization shrinks the weights a runtime must stream; the runtime decides how often they are streamed and for how many requests at once.
  - The two multiply: a quantized model behind an inefficient scheduler can still serve slowly.

Quantization helps, but it is only one part of the serving stack.

Connection 2: Inference Optimization <-> Deployment & Serving

This lesson sets up 20/16.md.

Once you optimize inference, you still need to answer deployment questions:

  - How to replicate, shard, or autoscale the optimized model across hardware.
  - How routing, admission control, and failover interact with batching.
  - How to monitor the SLOs chosen here once real traffic arrives.

So optimization naturally flows into production serving architecture.


Resources

Optional Deepening Resources


Key Insights

  1. Inference optimization starts with understanding prefill and decode as different phases - each has different bottlenecks and different optimization levers.
  2. Serving speed comes from the runtime stack as much as from the model - batching, KV cache management, schedulers, and kernels are where many of the biggest gains appear.
  3. The right optimization is the one that improves the product SLOs that matter - tokens/sec alone is not enough.
