Day 334: Caching & Performance Optimization - 10x Faster, 90% Cheaper

The core idea: The cheapest LLM token is the one you never ask the model to process again. Production caching works when you separate immutable prompt work, slow-changing retrieval work, and live user state, then reuse only the layers whose correctness boundary is explicit.


Today's "Aha!" Moment

Yesterday's observability lesson ended with Elena's stolen-laptop incident and a trace that finally explained where the assistant was spending time. The surprising part was not the model's answer length. It was everything that happened before the first generated token: the assistant kept reloading the same incident-response instructions, the same tool schema, and the same device-policy excerpts for every analyst question in the same case.

On Monday morning, three analysts ask variants of the same question: "Can the employee keep working until IT receives the laptop?", "Does policy require immediate session revocation?", and "What is the lost-device procedure for an encrypted MacBook?" The live facts are different per device, but the front half of the work is mostly identical. If the system recomputes that front half from scratch every time, latency and cost rise even though the product is not learning anything new.

The key realization is that "cache the answer" is usually the wrong starting point. Final answers often depend on fresh device state, user identity, or the latest safety policy. The real win comes from caching the work that is truly shared: long static instructions, stable policy packs, repeated retrieval candidates, and occasionally exact non-personalized answers. Performance improves because the request stops being treated as brand-new from byte zero.


Why This Matters

An internal security assistant is a good stress test for caching because it mixes two very different kinds of information. The lost-device policy, escalation playbook, and tool contract may stay stable for hours. The MDM record saying whether Elena's laptop is online, encrypted, or already wiped can change minute by minute. If you blur those layers together, you either waste money by recomputing stable context or create risk by serving stale conclusions as if they were live facts.

That distinction matters most under bursty traffic. When a phishing wave or laptop-theft incident generates many similar questions, the assistant's common path dominates the bill: retrieval, prompt assembly, and model prefill happen over and over before any personalized reasoning begins. Good observability from 21/13.md tells you where that repeated work lives. Good caching turns that visibility into a controlled speedup.

It also sets up the next lesson. Once cached content is part of the runtime, every hit has to respect the same trust boundary as a fresh response. 21/15.md will look at safety guardrails and content filtering, which still apply when an answer, prefix, or retrieved context came from cache instead of being generated anew.


Learning Objectives

By the end of this session, you should be able to:

  1. Distinguish cacheable layers inside an LLM application and explain what work each layer avoids in a real request path.
  2. Design cache keys and invalidation rules that preserve correctness across policy versions, model versions, tenant boundaries, and live user state.
  3. Combine caching with broader performance controls so the common path becomes cheap without making the miss path fragile or opaque.

Core Concepts Explained

Concept 1: Start by locating repeated work, not by picking a cache technology

In Elena's assistant, one request looks roughly like this:

analyst question
  -> normalize / classify intent
  -> retrieve relevant policy chunks
  -> call live tools (MDM, session revocation, ticket state)
  -> assemble prompt
  -> prefill model on static prefix
  -> decode answer

That flow hides several different opportunities for reuse. An exact response cache can sometimes answer a public policy question immediately. A retrieval cache can reuse the ranked policy chunks for a normalized query when the corpus snapshot has not changed. An embedding cache can avoid recomputing vectors for identical texts. A prompt-prefix or KV cache can skip recomputing attention state for the long shared prefix made of system instructions, tool definitions, and stable policy text.
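To make the layers concrete, here is a minimal sketch of the lookup order in plain Python. The in-memory dictionaries stand in for whatever store each layer actually uses, the callables embed, retrieve, and generate are placeholders for the real pipeline, and prefix/KV caching is left out because it lives inside the serving stack rather than in application code.

import hashlib

# Illustrative in-memory stores; a production system would use Redis, a
# vector database, or provider-side prefix caching instead.
response_cache = {}   # exact, non-personalized answers
retrieval_cache = {}  # ranked policy chunks per normalized query + snapshot
embedding_cache = {}  # vectors for previously embedded text

def _key(*parts):
    # A stable key built from everything that defines when reuse is valid.
    return hashlib.sha256("||".join(parts).encode()).hexdigest()

def answer(question, policy_version, needs_live_state, embed, retrieve, generate):
    normalized = " ".join(question.lower().split())

    # Layer 1: exact response cache, only for policy-only questions.
    if not needs_live_state:
        rk = _key("response", normalized, policy_version)
        if rk in response_cache:
            return response_cache[rk]

    # Layer 2: embedding cache avoids recomputing vectors for identical text.
    ek = _key("embedding", normalized)
    if ek not in embedding_cache:
        embedding_cache[ek] = embed(normalized)

    # Layer 3: retrieval cache reuses ranked chunks while the corpus snapshot
    # (approximated here by the policy version) is unchanged.
    qk = _key("retrieval", normalized, policy_version)
    if qk not in retrieval_cache:
        retrieval_cache[qk] = retrieve(embedding_cache[ek])

    # Layer 4, prefill reuse on the static prefix, happens inside the model
    # server; the application only has to keep that prefix byte-stable.
    result = generate(retrieval_cache[qk], question)

    if not needs_live_state:
        response_cache[_key("response", normalized, policy_version)] = result
    return result

Each cached lookup or early return skips a different slice of the request path, which is exactly why the layers are not interchangeable.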

Those layers are not interchangeable because they save different kinds of work. Response caching bypasses most of the request path. Retrieval caching saves upstream search and reranking effort. Prefix or KV caching saves model-side prefill on repeated prompt prefixes. If the trace from lesson 333 shows that most latency sits in retrieval and prefill, putting Redis in front of the final answer may do very little. The architecture decision has to match the measured hotspot.

The recurring security-assistant scenario makes this concrete. "What is the lost-device policy for contractor laptops?" might be safe to answer from an exact cache if the policy pack version is unchanged and the question requires no live state. "Is Elena's laptop still encrypted right now?" is different. The policy explanation can be reused, but the final answer still depends on a fresh MDM lookup. One request is mostly shared context; the other has a thin but critical live layer.
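A small sketch of that split, where cached_policy_answer and lookup_mdm_status are hypothetical stand-ins for the assistant's cache and MDM tool rather than real APIs:

def answer_device_question(device_id, cached_policy_answer, lookup_mdm_status):
    # Shared layer: the policy explanation does not depend on this device,
    # so it can come from a response or retrieval cache.
    policy_text = cached_policy_answer("lost-device procedure, encrypted laptop")

    # Live layer: the device record can change minute by minute, so it is
    # always fetched fresh and never served from the answer cache.
    device_state = lookup_mdm_status(device_id)

    return (
        policy_text
        + "\n\nCurrent device state (live as of this request): "
        + str(device_state)
    )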

Concept 2: Cache keys and invalidation rules are the real product contract

Caches do not fail because the lookup is slow. They fail because the hit was treated as valid when it should not have been. In practice, the cache key defines what the system claims is interchangeable.

For Elena's assistant, that usually means different keys for different layers:

  1. Exact response cache - the normalized question plus the policy pack version and tenant scope, used only for answers that need no live device state.
  2. Retrieval cache - the normalized query plus the corpus or index snapshot the ranked chunks came from.
  3. Embedding cache - the exact text (or its hash) plus the embedding model version.
  4. Prefix / KV cache - the byte-exact static prefix plus the serving model version.

This is why TTL-only thinking is too weak for production correctness. Suppose the security team updates the lost-device playbook at 09:17 to require legal review before notifying a contractor. A ten-minute TTL still allows the old policy to keep circulating until 09:27. Versioned keys or explicit invalidation events are safer because they make the new policy cold-start immediately and old entries naturally miss. TTL is still useful for cleanup, but it should not be the main correctness mechanism for fast-changing contracts.

Two other boundaries matter just as much. First, tenant and authorization scope belong in the key whenever cached material could reveal restricted information. Second, the model version belongs in any prefix cache because prompt tokenization, tool formatting, or serving behavior may differ across model updates. A cache hit is only correct when the serving conditions are equivalent, not merely similar.
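Put together, the key for an exact response cache might look like the following sketch; the version fields are illustrative names, not a fixed schema:

import hashlib
import json

def response_cache_key(question, *, policy_pack_version, tenant_id, model_version):
    # Bumping policy_pack_version (or deploying a new model_version) changes
    # the key, so old entries simply stop matching -- no purge required.
    # TTL can still run underneath, but only as a cleanup mechanism.
    normalized = " ".join(question.lower().split())
    payload = json.dumps(
        {
            "q": normalized,
            "policy": policy_pack_version,  # e.g. bumped at 09:17 with the new playbook
            "tenant": tenant_id,            # authorization scope travels with the key
            "model": model_version,         # tokenization / serving behavior boundary
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Same question, new policy pack: different key, so the entry cold-starts.
q = "What is the lost-device policy for contractor laptops?"
old = response_cache_key(q, policy_pack_version="pack-41", tenant_id="acme", model_version="m-3")
new = response_cache_key(q, policy_pack_version="pack-42", tenant_id="acme", model_version="m-3")
assert old != new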

Concept 3: Large wins come from shaping the common path, not from caching everything

The "10x faster, 90% cheaper" headline becomes plausible when a large share of traffic follows a narrow repeated path. In the security-assistant scenario, that common path might be dozens of analysts asking policy clarifications during the same incident burst. If the system can reuse the policy retrieval result, reuse the long static prompt prefix, and avoid the expensive model route for simple policy-only questions, the average request cost drops sharply even though the hardest requests still pay full price.

That makes caching part of a broader runtime strategy. Exact caches help only when there is true repeat traffic. Prefix caches help only when the prompt is structured so static content comes first and dynamic fields come last. Retrieval caches help only when the corpus snapshot is explicit. For the remaining requests, you still need a well-designed slow path: live tool calls, fresh reasoning, bounded retries, and observability that explains misses rather than hiding them.

The best production systems therefore optimize in layers. They make the common path cheap, keep the live path trustworthy, and measure both separately. Useful metrics are not just "cache hit rate." You also care about end-to-end latency by hit and miss, stale-hit incidents, invalidation lag after a policy update, and the fraction of traffic that still requires full live execution. Without those measurements, a cache can look healthy while silently bypassing new instructions or serving answers whose factual basis has expired.
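One way to keep those signals honest is to record them per request rather than only counting hits. The counters below are an illustrative sketch, not a particular metrics library; a real deployment would emit the same fields through its existing telemetry pipeline.

from collections import defaultdict

metrics = defaultdict(list)

def record_request(*, cache_hit, latency_ms, served_stale, full_live_path):
    bucket = "hit" if cache_hit else "miss"
    metrics["latency_ms." + bucket].append(latency_ms)
    if served_stale:
        metrics["stale_hit_incidents"].append(1)   # hit whose factual basis had expired
    if full_live_path:
        metrics["full_live_requests"].append(1)    # traffic that still paid full price
    metrics["requests"].append(1)

def record_invalidation_lag(seconds_until_workers_stopped_serving_old_policy):
    # How long the old playbook kept circulating after the 09:17 update.
    metrics["invalidation_lag_s"].append(seconds_until_workers_stopped_serving_old_policy)

def summary():
    total = len(metrics["requests"]) or 1
    hits = len(metrics["latency_ms.hit"])
    return {
        "hit_rate": hits / total,
        "stale_hits": len(metrics["stale_hit_incidents"]),
        "full_live_fraction": len(metrics["full_live_requests"]) / total,
        "worst_invalidation_lag_s": max(metrics["invalidation_lag_s"], default=None),
    }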

This is also where the next lesson naturally connects. Once a cache can replay prior outputs or prior context faster than the live path, safety rules have to travel with the cache key and invalidation plan. A harmful answer that is cheap to replay is more dangerous than one that is merely expensive.


Troubleshooting

Issue: "We should cache the final answer whenever questions look similar."

Why it happens / is confusing: Similar wording feels like repeated intent, and repeated intent sounds like an obvious cache hit.

Clarification / Fix: Similar wording is not enough when the answer depends on live tool state, tenant scope, or recent policy changes. In Elena's assistant, a semantic or exact response cache is safest for stable policy FAQs, not for questions that depend on the current device record.

Issue: "A short TTL is basically the same as proper invalidation."

Why it happens / is confusing: TTL is easy to implement and creates the impression that staleness is bounded.

Clarification / Fix: TTL only bounds how long stale data can survive; it does not guarantee freshness after an urgent policy or safety update. Prefer versioned keys or event-driven invalidation for policy packs, retrieval snapshots, and guardrail changes, then use TTL as a cleanup mechanism.

Issue: "Prefix caching should just work automatically."

Why it happens / is confusing: Teams hear that provider-side caching is transparent and assume any long prompt will benefit.

Clarification / Fix: Prefix caches usually require exact repeated structure. If timestamps, request IDs, or user-specific fields appear early in the prompt, the shared prefix disappears. Put static instructions, tool definitions, and stable context first; append live fields at the end.
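A minimal assembly sketch of that ordering; the argument names are placeholders, and the only point is that nothing request-specific appears before the static block:

def build_prompt(system_instructions, tool_definitions, policy_pack,
                 analyst_question, device_record, request_time):
    # Static block first: identical bytes across requests, so a provider- or
    # server-side prefix cache can reuse the prefill for it.
    static_prefix = "\n\n".join([system_instructions, tool_definitions, policy_pack])

    # Dynamic tail last: anything that varies per request stays below the
    # shared prefix so it does not break the match.
    dynamic_tail = "\n\n".join([
        "Current time: " + request_time,
        "Device record: " + device_record,
        "Analyst question: " + analyst_question,
    ])
    return static_prefix + "\n\n" + dynamic_tail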


Advanced Connections

Connection 1: Prompt-prefix caching and database prepared statements

Both techniques separate stable structure from changing parameters. A prepared SQL statement reuses the parsed plan while new bind values supply the live parts. A prompt-prefix or KV cache does something similar for inference: it reuses the expensive prefill on the stable prefix, then processes only the request-specific tail. In Elena's assistant, the lost-device instructions and tool schema behave like the prepared portion, while the current device ID and MDM status are the runtime parameters.
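The parallel is easier to see side by side. The snippet below uses Python's built-in sqlite3 module for the prepared-statement half and a plain string for the prompt half; the table, device ID, and prefix text are illustrative, not part of any real schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id TEXT, encrypted INTEGER)")
conn.execute("INSERT INTO devices VALUES (?, ?)", ("elena-mbp", 1))

def device_encrypted(device_id):
    # The SQL text is stable; only the bind value changes per request.
    row = conn.execute(
        "SELECT encrypted FROM devices WHERE id = ?", (device_id,)
    ).fetchone()
    return bool(row[0]) if row else None

STABLE_PREFIX = "You are the incident-response assistant.\n<tool schema>\n<lost-device playbook>"

def prompt_for(device_id):
    # The instructions and tool schema are the "prepared" portion; the device
    # id and its current status are the runtime parameters appended at the end.
    return (STABLE_PREFIX
            + "\nDevice: " + device_id
            + "\nEncrypted: " + str(device_encrypted(device_id)))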

Connection 2: Cache invalidation and configuration rollout safety

Policy caches behave a lot like distributed configuration. When the security team changes the contractor-notification rule, every region and every worker needs to stop trusting the old version quickly. The hard problem is not storing the cached value; it is making sure the new version becomes authoritative everywhere before analysts act on stale guidance. That is the same operational shape as a config rollout, a CDN purge, or a feature-flag update.




Key Insights

  1. Caching starts with workload decomposition - You speed up LLM systems by reusing the right layer of work, not by assuming the whole answer is safely repeatable.
  2. A cache key is a correctness claim - Policy version, tenant scope, model version, and live-state boundaries determine whether reuse is valid.
  3. Common-path optimization is the real win - Large savings appear when repeated requests take a cheap path while the live miss path remains explicit, observable, and safe.
