Quantization - Make Your LLM 4x Smaller and Faster

LESSON

LLM Training, Alignment, and Serving

Lesson 014 · 30 min · intermediate

Day 318: Quantization - Make Your LLM 4x Smaller and Faster

The core idea: Quantization makes LLM serving cheaper and faster by storing and computing with lower-precision numbers, but it is never "free compression." It is a deliberate trade-off between memory, bandwidth, latency, hardware compatibility, and the amount of numerical distortion you are willing to tolerate.


Today's "Aha!" Moment

The insight: Quantization matters so much for LLMs because inference is often bottlenecked less by raw arithmetic than by moving huge matrices through memory and device bandwidth.

Why this matters: If you can shrink weights from fp16 to int8 or 4-bit, you may fit larger models on cheaper hardware, increase batch size, improve throughput, and reduce serving cost dramatically. But the catch is that every bit you remove makes the numeric representation rougher.

Concrete anchor: A model that does not fit on one GPU at fp16 may suddenly fit once quantized. That can change the entire deployment shape. But if quality drops on long-tail prompts, multilingual tasks, or tool use, the saved memory may not be worth the regressions.
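The arithmetic behind that anchor is easy to sketch. A minimal, illustrative calculation counting only weight storage (the KV cache, activations, and runtime overhead come on top):

```python
# Back-of-the-envelope weight memory for a 7B-parameter model.
# Illustrative only: real serving also needs KV cache, activations,
# and framework overhead beyond the raw weights.

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # 7B parameters
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_gb(n, bits):.1f} GB")
```

At fp16 the weights alone need 14 GB, which already crowds a 16 GB GPU; at int8 they fit comfortably, and at 4-bit even a much larger model becomes a candidate.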

Keep this mental hook in view: Quantization is a serving trade-off, not a free lunch.


Why This Matters

20/13.md established that evaluation exists to support real deployment decisions.

Quantization is one of the clearest places where that matters: it trades model fidelity for memory and speed, and only evaluation can tell you whether that trade is acceptable. This is why quantization belongs right after evaluation in the sequence.

The real engineering question is not "should we quantize?"

It is "which quantization trade-off fits our hardware, latency, and quality constraints?"


Learning Objectives

By the end of this session, you should be able to:

  1. Explain why quantization is so effective for LLM inference economics.
  2. Describe the main quantization choices: post-training vs quantization-aware, activation vs weight quantization, and common precision levels such as int8 and 4-bit.
  3. Evaluate quantization as a trade-off across memory, latency, throughput, hardware support, and quality regression.

Core Concepts Explained

Concept 1: Quantization Exists Because LLM Inference Is Often Memory-Bound

For example, a 7B or 13B model runs acceptably in fp16, but each request spends a large fraction of time loading weights and KV state through memory. GPU compute units are not always the only bottleneck; getting data to them is often the harder part.

At a high level, lower precision shrinks the size of model parameters and often reduces memory bandwidth pressure. That can matter as much as, or more than, the raw arithmetic savings.

Mechanically: Quantization maps higher-precision values into a lower-precision representation. In practice this often means rounding fp16 weights into int8 or 4-bit integers, stored together with scale factors that map them back to approximate real values at compute time.

The win comes from smaller weights to store, fewer bytes to move through memory, and, where kernels support it, faster low-precision arithmetic.

In practice, weight-only int8 quantization is a common and relatively safe starting point; lower bit-widths save more memory but demand more care.

The trade-off is clear: You gain efficiency, but you introduce approximation error into the model's numerical representation.

A useful mental model is: Quantization is like compressing a large map so it fits in your pocket. It becomes easier to carry and faster to use, but some detail is inevitably lost.
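The mapping itself can be sketched in a few lines. This is a minimal symmetric, per-tensor int8 scheme (a simplifying assumption; production schemes are usually per-channel or per-group and handle outliers more carefully):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.max(np.abs(w)) / 127.0      # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the integers back to approximate floats at compute time."""
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).mean()   # the rounding error you pay
print(f"{w.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB, mean abs error {err:.5f}")
```

The error never exceeds half a quantization step (scale / 2), which is why choosing the scale well, and using finer-grained scales per channel or per group, matters so much.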

Use this lens when deciding whether quantization will actually speed up a workload: ask first whether that workload is memory-bound rather than compute-bound.

Concept 2: The Real Design Space Is Not Just "Quantized or Not," but Which Parts, Which Precision, and Which Method

For example, one team uses straightforward int8 weight-only quantization for inference stability. Another uses 4-bit grouped quantization to fit a larger model on limited GPUs. A third uses quantization-aware training because post-training quality loss is too high for its domain.

At a high level, quantization is a family of choices, not a binary setting.

Mechanically: Important distinctions include:

  1. Post-training quantization (PTQ) vs quantization-aware training (QAT): PTQ converts an already-trained model, while QAT simulates low precision during training so the model can adapt to it.
  2. Weight-only vs weight-and-activation quantization: quantizing activations saves more bandwidth and compute, but is usually harder to do without quality loss.
  3. Precision level: int8 is typically the safer first step, while 4-bit and below buy more memory at greater quality risk.

These choices are shaped by hardware and kernel support, model sensitivity, and how much quality regression the product can tolerate.

In practice, teams tend to start conservative and move down the precision ladder only when evaluation shows that quality holds.

The trade-off is clear: Aggressive quantization buys larger efficiency gains, but increases the chance of subtle quality regressions or unsupported fast paths.

A useful mental model is: Quantization choices are closer to storage-engine tuning than to a simple toggle. You pick the point on the curve that fits your workload.
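To make the "4-bit grouped" option above concrete, here is a sketch assuming one scale per group of 128 consecutive weights and a symmetric signed 4-bit range; real kernels also pack two 4-bit values per byte, which this sketch skips:

```python
import numpy as np

def quantize_grouped(w: np.ndarray, group_size: int = 128, bits: int = 4):
    """Weight-only grouped quantization: one scale per group of
    `group_size` consecutive weights, symmetric signed range."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    groups = w.reshape(-1, group_size)         # assumes size divisible by group_size
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-8)          # guard all-zero groups
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

w = np.random.randn(1024, 1024).astype(np.float32)
q, scales = quantize_grouped(w)
recon = (q * scales).reshape(w.shape)          # dequantize group by group
print("max abs error:", float(np.abs(recon - w).max()))
```

Per-group scales keep the quantization step small inside each group, so an outlier in one group does not inflate the error everywhere else - the reason grouped schemes survive 4-bit precision better than per-tensor ones.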

Use this lens when comparing quantization setups: ask which parts are quantized, at what precision, with which method, and with what hardware support, rather than treating "quantized" as a single switch.

Concept 3: Quantization Is Only Good If the Quality Loss Is Measured on the Behaviors You Actually Care About

For example, a quantized model keeps benchmark accuracy on common tasks, but becomes worse at long-context reasoning, tool calling, or structured JSON generation. The average score looks fine, yet the product gets less reliable in the flows that matter most.

At a high level, quantization changes the model numerically, so the right question is not "did the average benchmark score hold?"

It is "does the model still behave acceptably on the specific workflows the product depends on?"

Mechanically: Evaluation after quantization should include product-specific regression suites, long-context tests, structured output checks (for example, JSON generation), tool-calling flows, and safety checks, compared side by side against the unquantized baseline.

This is why quantization is inseparable from evaluation. Compression that is not measured is just hidden risk.
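As a sketch, a per-category regression gate might look like the following; the categories, scores, and threshold here are hypothetical placeholders for your own eval harness:

```python
# Hypothetical per-category regression gate for a quantized model.
# The categories and scores below are placeholders: plug in your own
# measurements for the baseline and quantized runs.

def regression_report(base_scores: dict, quant_scores: dict, tol: float = 0.02) -> dict:
    """Return the categories where the quantized model drops by more than `tol`."""
    flags = {}
    for category, base in base_scores.items():
        delta = quant_scores[category] - base
        if delta < -tol:
            flags[category] = round(delta, 4)
    return flags

base  = {"general_qa": 0.82, "long_context": 0.74, "json_output": 0.91}
quant = {"general_qa": 0.81, "long_context": 0.66, "json_output": 0.84}

# The aggregate barely moves, but the tails regress - exactly the
# failure mode an average-only comparison would hide.
print(regression_report(base, quant))
```

A gate like this turns "is the trade-off acceptable?" into an automated ship/no-ship signal per behavior, instead of one blended score.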

In practice, this means treating a quantized model as a new release candidate that must pass the same evaluation gates as any other model change.

The trade-off is clear: Faster and cheaper inference is valuable, but only if the degraded behaviors are either negligible or strategically acceptable.

A useful mental model is: Quantization is a budget negotiation between systems efficiency and model fidelity.

Use this lens when signing off on a quantized deployment: the memory and latency savings only count if the measured regressions are negligible or strategically acceptable.


Troubleshooting

Issue: "The quantized model is much smaller, so it should also always be much faster."

Why it happens / is confusing: Smaller artifacts intuitively feel faster.

Clarification / Fix: Speed gains depend on hardware support, kernel quality, batch shape, sequence length, and whether the bottleneck is really memory movement. Smaller is necessary, not sufficient.

Issue: "Benchmark quality barely changed, so the deployment is safe."

Why it happens / is confusing: Aggregate evals can hide the exact failure modes introduced by lower precision.

Clarification / Fix: Re-run product-specific regressions, long-context tests, structured output tests, and safety checks. Quantization failures often live in the tails.

Issue: "We should always use the lowest bit-width possible."

Why it happens / is confusing: Lower precision sounds like a strictly better efficiency frontier.

Clarification / Fix: More aggressive quantization can create unsupported kernels, unstable quality, or workflow-specific regressions. The best bit-width is the one that wins under your real constraints.


Advanced Connections

Connection 1: Quantization <-> LLM Evaluation

20/13.md explained that a useful eval must support a real decision.

Quantization is exactly that kind of decision: ship the quantized model or keep full precision, based on measured regressions weighed against the efficiency gains.

Without evaluation, quantization is guesswork.

Connection 2: Quantization <-> Inference Optimization

This lesson prepares for 20/15.md.

Quantization is one serving optimization, but not the whole story. Real inference speed also depends on batching strategy, KV-cache management, kernel quality, and how well the hardware is actually utilized.

So quantization is powerful, but it sits inside a larger serving stack.




Key Insights

  1. Quantization works because LLM serving is heavily constrained by memory footprint and bandwidth - shrinking weights changes the deployment economics directly.
  2. There is no single best quantization setting - precision, method, hardware, and model sensitivity all shape the right choice.
  3. A quantized model is only "better" if evaluation says the trade-off is acceptable - savings in memory and latency must be judged alongside real behavior quality.
