Degraded Modes, Playbooks, and Incident Evidence

LESSON 014 | 20 min | beginner

Core Insight

Imagine checkout normally promises three things: users can review a cart, pay for an order, and receive a confirmation. Now the payment provider becomes slow and its responses become ambiguous, so the system can no longer prove quickly whether some charges succeeded.

The worst response is pretending everything is normal. The second worst response is turning everything off when a smaller honest promise would still be useful. A better response might be: keep carts readable, preserve order intent, pause new payment attempts, show a clear pending state, and run repair when payment evidence returns.

That smaller promise is a degraded mode. A degraded mode is planned behavior that keeps a reduced but honest service promise when the normal promise is unsafe.

A playbook is the operational plan for entering, running, and leaving that mode. Good playbooks are driven by evidence: what triggered the mode, what user impact exists, what amplification must stop, and what recovery evidence proves it is safe to restore the full path.

A Smaller Promise For Checkout

Use checkout as the worked example.

Normal mode:

user can:
  browse cart
  place order
  authorize payment
  receive confirmation

Payment-degraded mode:

user can:
  browse cart
  preserve order intent
  see payment_pending or payment_unavailable

system will not:
  start new risky payment attempts
  claim payment succeeded without evidence
  discard the user's intent silently

The degraded mode is smaller, but it is still valuable. It avoids duplicate side effects, preserves user intent, and tells the truth about uncertainty.

The key is that the degraded promise is explicit. "Payment unavailable" means something different from "order failed." "Payment pending" means the system has a repair path and is waiting for evidence. Those states should exist before the incident.
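One way to make those states exist before the incident is to define them in code and map every attempt outcome onto one of them. The sketch below is illustrative only: the `CheckoutState` names and the outcome labels passed to `state_after_attempt` are assumptions, not part of any real provider API.

```python
from enum import Enum


class CheckoutState(Enum):
    """Order states defined ahead of any incident (names are illustrative)."""
    CONFIRMED = "confirmed"                      # payment evidence matches the order
    PAYMENT_PENDING = "payment_pending"          # outcome unknown, repair path owns it
    PAYMENT_UNAVAILABLE = "payment_unavailable"  # no attempt was started
    ORDER_FAILED = "order_failed"                # evidence shows payment did not happen


def state_after_attempt(outcome: str) -> CheckoutState:
    """Map what the system actually knows onto an honest state.

    The outcome labels are hypothetical inputs for this sketch.
    """
    if outcome == "authorized":
        return CheckoutState.CONFIRMED
    if outcome == "declined":
        return CheckoutState.ORDER_FAILED
    if outcome == "not_attempted":
        return CheckoutState.PAYMENT_UNAVAILABLE
    # Timeouts and anything unrecognized are unknown, never "failed".
    return CheckoutState.PAYMENT_PENDING
```

The design choice worth noticing is the default branch: anything the system cannot classify falls into the pending state rather than into failure.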

Evidence That Enters And Exits The Mode

A degraded mode needs triggers. For the checkout example, entry evidence might include:

payment provider latency above the request deadline;
timeout rate above a threshold;
retry volume rising;
queue age growing for payment repair jobs;
mismatch between authorized payments and confirmed orders.

The playbook should turn that evidence into an explicit entry rule and name the actions that follow:

enter payment-degraded mode when:
  provider timeout rate > threshold for 5 minutes
  and pending payment repair queue is growing

actions:
  pause new payment attempts
  preserve carts and order intents
  disable retry storms
  show a clear pending or unavailable state
  page the owner team
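The entry rule above can be sketched as a function over recent per-minute samples. This is a minimal sketch under assumptions: the `MinuteSample` fields, the 5 percent threshold, and the use of queue depth as the growth signal are placeholders a team would replace with its own metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinuteSample:
    """One per-minute observation (field names are assumptions)."""
    timeout_rate: float      # fraction of provider calls that timed out
    repair_queue_depth: int  # pending payment repair jobs


def should_enter_degraded(samples: list[MinuteSample],
                          timeout_threshold: float = 0.05,
                          window_minutes: int = 5) -> bool:
    """True when the timeout rate stayed above threshold for the whole
    window and the repair queue ended the window deeper than it began."""
    if len(samples) < window_minutes:
        return False  # not enough evidence yet
    window = samples[-window_minutes:]
    sustained = all(s.timeout_rate > timeout_threshold for s in window)
    growing = window[-1].repair_queue_depth > window[0].repair_queue_depth
    return sustained and growing
```

A single bad minute does not satisfy the rule, and neither does a sustained timeout rate with a flat repair queue; both conditions must hold together.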

Exit evidence is different from silence. Alarms getting quieter is not enough if a backlog remains, retries are still high, or repair has not caught up.

exit only after:
  provider latency is healthy for 15 minutes
  retry rate is normal
  repair queue age is below target
  sample order and payment records agree

This turns recovery from a guess into a controlled transition.
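The exit conditions listed above can be checked the same way. The sketch below returns the conditions that are still unmet rather than a bare yes or no, so responders can see what is blocking the exit. The `ExitEvidence` fields and the default targets are assumptions; only the 15-minute latency window comes from the example rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExitEvidence:
    """Snapshot of recovery evidence (field names are assumptions)."""
    healthy_latency_minutes: int     # consecutive minutes of healthy provider latency
    retry_rate: float                # current retries per request
    repair_queue_age_seconds: float  # age of the oldest pending repair job
    sampled_records: int             # order/payment pairs checked
    sampled_mismatches: int          # pairs that disagreed


def exit_blockers(e: ExitEvidence,
                  normal_retry_rate: float = 0.02,
                  queue_age_target_seconds: float = 60.0) -> list[str]:
    """Return the unmet exit conditions; an empty list means exit may begin."""
    blockers = []
    if e.healthy_latency_minutes < 15:
        blockers.append("provider latency not yet healthy for 15 minutes")
    if e.retry_rate > normal_retry_rate:
        blockers.append("retry rate above normal")
    if e.repair_queue_age_seconds >= queue_age_target_seconds:
        blockers.append("repair queue age not below target")
    if e.sampled_records == 0 or e.sampled_mismatches > 0:
        blockers.append("sampled order and payment records do not agree")
    return blockers
```

Note that an empty sample counts as a blocker: no evidence of agreement is not the same as agreement.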

Playbooks As Design Requirements

A useful playbook is not just a document for humans. It tells the software what controls and evidence must exist.

If the playbook says "pause new payment attempts," the system needs a feature flag, admission rule, or routing control that can actually do that. If it says "show pending," the product needs a pending state and support tooling that understands it. If it says "recover when records agree," the system needs a query or report that compares those records.

That means playbooks feed back into design:

playbook step:
  pause optional recommendation enrichment

software requirement:
  recommendations can be disabled independently

playbook step:
  repair pending payments

software requirement:
  payment intent, provider id, order id, and repair status are durable

The best time to discover missing controls is during a rehearsal, not during an outage. A degraded mode that has never been tested is only a hope.
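As an illustration of the "recover when records agree" requirement, the comparison might look like the sketch below. It assumes order and payment states can both be keyed by order id and uses made-up state strings; a real system would run an equivalent query against its own stores.

```python
def find_mismatches(orders: dict[str, str], payments: dict[str, str]) -> dict[str, str]:
    """Compare order states with provider payment states, keyed by order id.

    Both inputs map order id -> state; the state strings are illustrative.
    Returns order id -> reason for every pair that disagrees.
    """
    mismatches = {}
    for order_id in sorted(orders.keys() | payments.keys()):
        order_state = orders.get(order_id)
        payment_state = payments.get(order_id)
        if payment_state == "authorized" and order_state != "confirmed":
            mismatches[order_id] = "authorized payment without confirmed order"
        elif order_state == "confirmed" and payment_state != "authorized":
            mismatches[order_id] = "confirmed order without authorized payment"
    return mismatches
```

If this report cannot be produced because the ids are not joinable, that gap is exactly what a rehearsal should surface.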

Worked Trace: Payment Becomes An Uncertain Operation

At 09:00, the checkout service observes that authorization calls are timing out. A timeout does not prove payment failed; the provider may have accepted the request after the caller gave up. The first safe move is therefore not to retry every payment. It is to change the promise the system makes.

normal promise:
  a successful checkout returns a confirmed order and payment result

payment-degraded promise:
  cart and order intent are preserved
  new authorization attempts are paused or tightly limited
  the user sees pending or unavailable, never invented success

1. Enter From Evidence, Not Anxiety

One slow request is not enough to change all checkout behavior. The playbook combines evidence: provider timeout rate, p99 latency beyond the request deadline, rising retry rate, and a growing count of authorizations without matching orders. These signals say both that the dependency is unhealthy and that normal retries are likely to amplify the problem.

enter when, for a bounded interval:
  provider timeouts exceed threshold
  AND repair queue age grows
  AND retry budget is being consumed

The condition should be observable and reversible. A manual override may exist, but it needs a named owner and audit evidence; otherwise an emergency switch becomes an invisible second incident.

2. Preserve Intent Without Creating New Risk

When the mode enters, checkout records an order intent with a stable operation id. It can keep the cart visible and let the user see payment_pending or payment_unavailable. It stops issuing repeated external authorization calls unless a controlled repair process owns the attempt.

order_intent = order-42
payment_operation = pay:order-42
state = pending_external_evidence

user message:
  "We are confirming your payment. Do not submit again."

This state is useful because it is honest. "Failed" would invite a duplicate attempt even though the provider may already hold an authorization. "Confirmed" would promise more than the system can prove. A pending state gives support, repair workers, and the user the same vocabulary for the uncertainty.
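The stable operation id is what keeps a repeat submission from becoming a second payment. A minimal sketch follows, assuming an in-memory `IntentStore` as a stand-in for a durable store and reusing the `pay:order-42` naming from the example; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OrderIntent:
    order_id: str
    payment_operation: str  # stable id reused by every retry and by repair
    state: str


class IntentStore:
    """In-memory stand-in for a durable store (an assumption for this sketch)."""

    def __init__(self) -> None:
        self._intents: dict[str, OrderIntent] = {}

    def record_intent(self, order_id: str) -> tuple[OrderIntent, bool]:
        """Return (intent, created). A repeat submission gets the existing
        intent back instead of a second payment operation."""
        existing = self._intents.get(order_id)
        if existing is not None:
            return existing, False
        intent = OrderIntent(order_id, f"pay:{order_id}", "pending_external_evidence")
        self._intents[order_id] = intent
        return intent, True


store = IntentStore()
first, created = store.record_intent("order-42")
again, created_again = store.record_intent("order-42")
print(first.payment_operation, created, created_again)  # pay:order-42 True False
```

Because the operation id is derived from the order id, the user, support tooling, and the repair worker all refer to the same record.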

3. Stop Amplification And Protect Capacity

The playbook disables optional recommendation calls, applies admission control to new payment work, and enforces retry budgets. Existing repair work is bounded so it cannot flood the provider when it returns. The earlier backpressure lesson supplies the controls; the degraded mode tells responders which controls must be used for this particular promise.

protect:
  durable order intent, status lookup, reconciliation records

degrade:
  recommendations, nonessential email, immediate confirmation

block or delay:
  uncontrolled payment retries

These controls only work in combination. If the system pauses payment calls but still labels every checkout as failed, users become the retry storm. If it preserves a pending intent but has no repair worker or queryable provider reference, it merely stores uncertainty instead of managing it.
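A retry budget is one of the controls named in this step. The sketch below shows the idea in its simplest form: retries are allowed only up to a fixed fraction of first attempts. The `RetryBudget` class and its default ratio are assumptions for illustration; production implementations typically count over a sliding time window.

```python
class RetryBudget:
    """Allow retries only up to a fixed fraction of first attempts.

    The ratio and the counting scheme are assumptions; real systems
    usually count over a sliding time window rather than forever.
    """

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio
        self.first_attempts = 0
        self.retries = 0

    def record_first_attempt(self) -> None:
        self.first_attempts += 1

    def try_acquire_retry(self) -> bool:
        """Spend one retry if the budget allows it, otherwise refuse."""
        if self.retries + 1 > self.first_attempts * self.ratio:
            return False  # caller should leave the work pending, not retry
        self.retries += 1
        return True
```

When the budget refuses, the caller keeps the order in its pending state and leaves the attempt to the bounded repair process.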

4. Reconcile And Exit Gradually

The repair worker queries the provider using pay:order-42, then writes one durable outcome: confirm the order, void or refund an authorization that cannot be fulfilled, or leave it pending with a visible reason. The playbook's exit condition is not "the alert is quiet." It includes healthy provider latency, normal retry rate, a repair queue below its age target, and sampled agreement between payment and order records.

exit sequence:
  verify dependency health and repair backlog
  reopen a small percentage of normal checkouts
  observe errors, retries, and record agreement
  ramp traffic gradually
  restore optional features last

Recovery needs the same caution as entry. Reopening every payment path at once can create a surge that restarts the failure. The evidence must show that both the external dependency and the system's accumulated obligations are safe enough to resume.
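The gradual ramp can be expressed as a small step function that only advances while evidence stays healthy. The step values and the choice to fall all the way back to 0 percent on bad evidence are assumptions in this sketch; a team might prefer to step back one level instead.

```python
def next_traffic_percent(current: int, evidence_ok: bool,
                         steps: tuple[int, ...] = (0, 5, 25, 50, 100)) -> int:
    """Advance one ramp step when evidence is healthy; otherwise fall back
    to the first step (fully degraded). Step values are illustrative."""
    if not evidence_ok:
        return steps[0]
    for step in steps:
        if step > current:
            return step
    return steps[-1]  # already at the top of the ramp
```

Each call represents one observation interval, so the ramp cannot move faster than evidence arrives.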

Evidence Makes The Playbook Executable

A playbook is not only a document. Every step should correspond to a control, a state, and a query. "Pause new attempts" requires a flag or admission rule. "Preserve intent" requires a durable state model. "Reconcile records" requires joinable order, operation, and provider ids. "Exit after repair" requires a measurable backlog and a safe sample check.

This makes incident evidence part of the mechanism. Trigger evidence explains why the mode began. User-impact evidence shows which promises changed. Amplification evidence shows whether retries or queues are worsening the situation. Recovery evidence proves that the smaller promise can safely become the normal one again.

The trade-off is deliberate: users may wait longer or lose optional features, but the system avoids duplicate side effects and false confirmations. A degraded mode is successful when it makes uncertainty visible, preserves the facts needed to repair it, and prevents responders from improvising dangerous changes under pressure.

Test The Mode Before The Incident

A mode is only credible if a team can exercise it without waiting for a real outage. A rehearsal can inject a provider timeout, enter the mode through the same trigger path, create one pending order, and verify that no duplicate external authorization is issued when the user retries. It should also test the less dramatic path: a healthy provider returns, the repair worker reconciles the pending record, and the traffic ramp stays below the dependency's recovery capacity.

rehearsal evidence:
  entry trigger fired and was visible
  user received a truthful pending state
  retry reused one operation id
  repair wrote one durable decision
  exit checks passed before normal traffic returned

This rehearsal finds missing joins between product, controls, and operations. A feature flag that cannot be changed safely, a pending state that support cannot query, and a repair queue without an age metric are design gaps, not merely runbook gaps. The result should be a small, repeatable test and an owner for keeping it current as contracts and dependencies change.

Record the rehearsal result and any manual steps, so the next responder begins from tested evidence rather than institutional memory.
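Part of such a rehearsal can be written as an ordinary test against a fake provider. The sketch below covers only two of the checks listed above: a user retry reuses one operation id, and repair writes one durable decision. `FakeProvider`, `checkout`, and `repair` are hypothetical stand-ins, and the dict plays the role of the durable order store.

```python
class FakeProvider:
    """Test double for the payment provider (hypothetical interface)."""

    def __init__(self) -> None:
        self.authorize_calls: list[str] = []
        self.healthy = False

    def authorize(self, operation_id: str) -> str:
        self.authorize_calls.append(operation_id)
        if not self.healthy:
            raise TimeoutError("injected provider timeout")
        return "authorized"

    def lookup(self, operation_id: str) -> str:
        # Simulates the provider having accepted the call despite the timeout.
        return "authorized" if operation_id in self.authorize_calls else "unknown"


def checkout(order_id: str, provider: FakeProvider, orders: dict[str, str]) -> str:
    """Attempt payment once per order; repeat submissions reuse the record."""
    if order_id in orders:
        return orders[order_id]  # retry: no new external call
    try:
        provider.authorize(f"pay:{order_id}")
        orders[order_id] = "confirmed"
    except TimeoutError:
        orders[order_id] = "payment_pending"  # unknown, not failed
    return orders[order_id]


def repair(order_id: str, provider: FakeProvider, orders: dict[str, str]) -> str:
    """Resolve a pending order from provider evidence, writing one outcome."""
    if orders.get(order_id) == "payment_pending":
        if provider.lookup(f"pay:{order_id}") == "authorized":
            orders[order_id] = "confirmed"
    return orders[order_id]


provider, orders = FakeProvider(), {}
assert checkout("order-42", provider, orders) == "payment_pending"
assert checkout("order-42", provider, orders) == "payment_pending"  # user retries
assert provider.authorize_calls == ["pay:order-42"]                 # one external call
provider.healthy = True
assert repair("order-42", provider, orders) == "confirmed"
```

A fuller rehearsal would also drive the entry trigger and the exit checks through their real paths; this fragment only pins down the duplicate-attempt behavior.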

Incident Evidence

Incident evidence is the set of facts that lets responders understand, act, and later learn.

Useful evidence usually falls into four groups.

Trigger evidence explains why the mode began: latency, errors, queue age, saturation, missing events, or conflicting records.

User-impact evidence shows what customers experienced: failed checkouts, pending orders, duplicate attempts, delayed confirmations, support tickets, or affected tenants.

Amplification evidence shows whether the system is making the incident worse: retries, backlog growth, repair pressure, fanout, or optional jobs competing with core paths.

Recovery evidence proves the system is ready to leave degraded mode: healthy dependency behavior, drained queues, reconciled records, normal retry rates, and successful sample workflows.

During the incident, this evidence prevents superstition. After the incident, it makes the review concrete. The team can ask which promise failed, which control helped, which signal arrived late, and which missing evidence made responders guess.
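The four groups can also serve as a checklist during the incident. A minimal sketch, assuming responders record evidence as simple entries tagged with one of the four kinds; the `Evidence` type and `missing_kinds` helper are hypothetical.

```python
from dataclasses import dataclass

EVIDENCE_KINDS = ("trigger", "user_impact", "amplification", "recovery")


@dataclass(frozen=True)
class Evidence:
    kind: str  # one of EVIDENCE_KINDS
    at: str    # timestamp as recorded by the responder
    fact: str  # what was observed


def missing_kinds(log: list[Evidence]) -> list[str]:
    """List the evidence groups with no entries, where responders are guessing."""
    seen = {e.kind for e in log}
    return [k for k in EVIDENCE_KINDS if k not in seen]
```

An empty recovery group late in an incident is a signal that the team is not yet in a position to exit the mode.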

Failure Modes And Trade-offs

A representative failure is an improvised emergency change that creates a second incident. Someone disables the wrong path, opens a risky manual bypass, restores traffic too early, or clears a queue without preserving the evidence needed for repair.

Another failure is a degraded mode that lies. If the interface says "failed" when the system really means "unknown," users may retry and create duplicate work. If it says "saved" without enough evidence, the system has promised more than it can prove.

Degraded modes also have costs. Users may wait longer, lose optional features, or see a smaller workflow. The trade-off is accepting a smaller truthful promise instead of corrupting data, duplicating side effects, or hiding uncertainty behind false success.

Recovery has a trade-off too. Restoring quickly reduces user friction, but restoring before backlogs, retries, and records are healthy can restart the incident. A good playbook exits gradually and reversibly.

Practice Prompt

Pick one critical workflow: checkout, login, password reset, message send, file upload, seat booking, or account deletion. Fill in these lines:

normal promise:
degraded promise:
entry evidence:
user-facing message or state:
controls to stop amplification:
data preserved for repair:
exit evidence:
rehearsal or test:

If the degraded promise is "we do whatever still works," tighten it. A degraded mode should be a named contract, not improvisation.

Resources

[BOOK] Site Reliability Engineering: Managing Incidents
- Focus: Incident roles, structured response, and reducing confusion under pressure.
[BOOK] Site Reliability Engineering: Emergency Response
- Focus: Prepared response, operational discipline, and avoiding second-order failures.
[ARTICLE] Static Stability Using Availability Zones
- Focus: Designing systems that keep a smaller reliable promise during dependency or zone failures.

Key Takeaways

A degraded mode is a smaller honest promise, not accidental half-failure.
Playbooks should define entry evidence, fallback behavior, guardrails, and exit evidence.
Incident evidence is part of the mechanism because it tells responders what is true enough to act on.
When the normal path has been unsafe, recovery should be evidence-based, gradual, and reversible.
