Observability and Debugging Distributed Systems
Core Insight
A buyer reports: “Checkout spun for twenty seconds, my bank shows a charge, and the site said the order failed.” The checkout path crosses a web service, inventory, a payment provider, a queue, an order worker, and an email service. Each component can report that it is mostly healthy. None of those local statements answers the buyer's question: was my order created, charged, both, or neither?
Distributed systems divide work and therefore divide evidence. One service sees an HTTP request. Another sees a payment authorization. A queue sees a message wait. A worker sees an attempt to create an order. A database sees a durable state transition. If those facts cannot be joined, an incident becomes a collection of plausible stories rather than an investigation.
Observability is the ability to reconstruct a user-visible promise from the evidence a system retains. It is more than a dashboard and more than “having logs.” It means that an engineer can ask what happened to one order, across time and boundaries, distinguish a fact from an inference, and find the safe repair action.
The trade-off is real. Logs, traces, metrics, labels, and retained event payloads consume storage, CPU, bandwidth, attention, and privacy budget. Good observability does not collect every possible datum. It deliberately preserves the evidence required to explain important state changes, retries, queue crossings, ownership changes, and repair decisions.
Start With The Promise And Its Identities
The checkout promise is not “the request got a 200 response.” It is closer to: “the buyer receives one truthful outcome for the intended purchase, and the system can repair any partial completion without charging twice.” That promise needs several identifiers because one identifier cannot represent every part of the workflow.
order_id:              order-42      durable business object
payment_operation_id:  pay:order-42  one intended charge; used for idempotency
request_id:            req-A         one HTTP attempt
trace_id:              trace-81      one observed execution path
message_id:            msg-551       one queued message; may be delivered more than once
These ids are related, not interchangeable. The buyer may retry after a timeout. The retry gets a new request_id and perhaps a new trace, but it must retain payment_operation_id: pay:order-42 so the payment service can recognize the same intended charge. A queue may redeliver the same message, which counts as a new delivery attempt, while the business order remains order-42.
This distinction prevents two common mistakes. If every retry receives a new payment operation id, a timeout can become a double charge. If every asynchronous worker creates a new anonymous identity, investigators cannot connect a late order confirmation to the checkout that created it.
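The relationship between these ids can be sketched in a few lines. This is an illustration rather than a prescribed scheme: the `CheckoutAttempt` type, the `new_attempt` helper, and the rule that derives the operation id from the order id are assumptions chosen to match the example values above.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutAttempt:
    order_id: str              # durable business object
    payment_operation_id: str  # one intended charge; stable across retries
    request_id: str            # one HTTP attempt; new on every retry
    trace_id: str              # one observed execution path


def new_attempt(order_id: str) -> CheckoutAttempt:
    """Create ids for one HTTP attempt against an existing order.

    Assumption: the operation id is derived from the order id, so every
    retry for the same order maps to the same intended charge.
    """
    return CheckoutAttempt(
        order_id=order_id,
        payment_operation_id=f"pay:{order_id}",
        request_id=f"req-{uuid.uuid4().hex[:8]}",
        trace_id=uuid.uuid4().hex,
    )


first = new_attempt("order-42")
retry = new_attempt("order-42")
```

The two attempts differ in request and trace identity but share the payment operation id, which is the property the payment service relies on to avoid a second charge.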
The Evidence Has Different Jobs
Observability signals answer different questions. Treating them as substitutes loses useful information.
Structured logs record local facts at an event boundary: a request arrived, a dependency timed out, a message was accepted, or an idempotency key matched existing work. A useful checkout log includes the relevant ids, result, reason, and component. It should not need to include a full card number, address, or unbounded payload to be useful.
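One way to keep a log line useful without leaking sensitive data is an allow-list of fields. The sketch below assumes a hypothetical `checkout_log_line` helper and field names; it is not tied to any particular logging library.

```python
import json

# Hypothetical allow-list: only these fields may appear in a checkout log line.
ALLOWED_FIELDS = {
    "component", "event", "result", "reason",
    "order_id", "payment_operation_id", "request_id", "trace_id",
}


def checkout_log_line(**fields: str) -> str:
    """Render one structured log line, dropping anything not allow-listed."""
    safe = {k: v for k, v in fields.items() if k in ALLOWED_FIELDS}
    return json.dumps(safe, sort_keys=True)


line = checkout_log_line(
    component="web",
    event="checkout.request",
    result="timeout_waiting_for_payment",
    order_id="order-42",
    payment_operation_id="pay:order-42",
    request_id="req-A",
    trace_id="trace-81",
    card_number="4111111111111111",  # dropped: not on the allow-list
)
print(line)
```

An allow-list fails closed: a new field stays out of the logs until someone decides it is safe to record.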
Traces connect work across calls. A trace is a tree or graph of spans that shows which operation waited on which dependency and for how long. A queue producer can put trace context in message headers; the worker can continue or link the trace when it consumes the message. The trace explains the path and timing, not whether a business transition was made durable.
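Carrying trace context across a queue can be sketched with plain dictionaries. The header value below follows the W3C Trace Context `traceparent` layout (version, trace id, parent span id, flags); the `publish` and `consume` functions and the message shape are hypothetical, and a real system would normally delegate this to its tracing library's propagator.

```python
import secrets


def new_traceparent() -> str:
    """Build a W3C Trace Context `traceparent` value for a new trace."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # version 00, sampled flag set


def publish(body: dict, traceparent: str) -> dict:
    """Producer side: copy the trace context into the message headers."""
    return {"headers": {"traceparent": traceparent}, "body": body}


def consume(message: dict) -> dict:
    """Worker side: start a new span that keeps the producer's trace id."""
    header = message["headers"]["traceparent"]
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "trace_id": trace_id,              # same trace as the producer
        "parent_span_id": parent_span_id,  # the producer's span
        "span_id": secrets.token_hex(8),   # the worker's own span
        "order_id": message["body"]["order_id"],
    }


producer_context = new_traceparent()
message = publish({"order_id": "order-42", "event": "payment_authorized"}, producer_context)
worker_span = consume(message)
```

The worker's span shares the producer's trace id, so the eight-minute queue wait appears inside one trace instead of splitting the story in two.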
Metrics summarize many executions. Queue depth, age of the oldest message, payment-provider latency, timeout rate, retry rate, and order-confirmation lag can show whether one incident is part of a broader condition. Metrics are excellent for detecting a pattern and poor at explaining the fate of order-42 by themselves.
Durable business records and events answer authoritative state questions: did pay:order-42 authorize, did the order become confirmed, was a refund issued, and did a repair job record its decision? They are the source of truth for critical side effects. A trace may suggest that a payment call returned success; the durable payment record establishes whether the system has recognized an authorization it must reconcile.
The design goal is a joinable evidence set:
request/trace evidence -> timing and local path
queue evidence -> handoff and delay
business records -> durable state and repair decision
metrics -> whether the path is systematically under pressure
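"Joinable" can be made concrete with a small grouping function. The rows below are hypothetical and use the ids from this lesson; the sketch assumes every source records the `order_id`, which is exactly the property the design has to guarantee.

```python
from collections import defaultdict

# Hypothetical evidence rows, one list per source.
request_logs = [
    {"order_id": "order-42", "request_id": "req-A", "result": "timeout_waiting_for_payment"},
]
queue_records = [
    {"order_id": "order-42", "message_id": "msg-551", "state": "waiting_for_worker"},
]
payment_records = [
    {"order_id": "order-42", "operation_id": "pay:order-42", "status": "authorized"},
]


def join_by_order(**sources: list) -> dict:
    """Group rows from every source under the order id they mention."""
    joined = defaultdict(dict)
    for source_name, rows in sources.items():
        for row in rows:
            joined[row["order_id"]].setdefault(source_name, []).append(row)
    return dict(joined)


evidence = join_by_order(
    requests=request_logs, queue=queue_records, payments=payment_records
)
```

If any source omitted the order id, its rows could not be placed under `order-42` and the investigation would fall back to guessing from timestamps.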
Worked Trace: A Timed-Out Checkout With A Late Authorization
At 12:00:00, the buyer presses Place order. The web service allocates the identities above and begins trace-81.
req-A, trace-81, order-42, pay:order-42
web span: checkout.request started
inventory record: sku-9 reserved for order-42
payment span: authorize pay:order-42 started
1. The Edge Deadline Ends The First Attempt
The payment provider is slow. The web service enforces a five-second deadline so that slow payment calls cannot tie up all of its request threads.
12:00:05.000
web log:
  request_id=req-A
  order_id=order-42
  result=timeout_waiting_for_payment
response to buyer:
  "We could not confirm your order yet. Do not retry payment manually."
This is a fact: the web layer stopped waiting and returned an uncertain outcome. It is not proof that payment failed. A correct user response should preserve that uncertainty instead of saying “payment rejected” without evidence.
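The difference between "stopped waiting" and "payment failed" can be shown with a thread pool and a deadline. This is a scaled-down sketch: the five-second deadline becomes 50 milliseconds, and `slow_provider_authorize` and `checkout` are hypothetical stand-ins rather than a real provider client.

```python
import concurrent.futures
import time


def slow_provider_authorize(operation_id: str) -> str:
    """Stand-in for a slow payment provider (hypothetical)."""
    time.sleep(0.2)
    return f"authorized:{operation_id}"


def checkout(operation_id: str, deadline_seconds: float, pool) -> tuple:
    """Wait for payment up to the deadline, then report an uncertain outcome."""
    future = pool.submit(slow_provider_authorize, operation_id)
    try:
        return "confirmed", future.result(timeout=deadline_seconds), future
    except concurrent.futures.TimeoutError:
        # The caller stopped waiting; the payment call is still in flight.
        return "outcome_unknown", None, future


with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    status, result, in_flight = checkout("pay:order-42", deadline_seconds=0.05, pool=pool)
    late_result = in_flight.result()  # the provider still completes later
```

The caller sees `outcome_unknown` while the provider call goes on to succeed, which is why the timeout log line must not be read as a payment failure.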
2. Payment Finishes After The Caller Has Given Up
At 12:00:09, the provider responds. The payment service stores a durable authorization under pay:order-42 and emits a business event through its outbox or queue boundary.
12:00:09.240
payment record:
  operation_id=pay:order-42
  status=authorized
  provider_reference=provider-774
outbox event:
  payment_authorized for order-42
  trace context: trace-81
This creates an important split: the buyer saw a timeout, but the system now has an authorized payment. Any subsequent action must use the same operation id rather than creating a second authorization.
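If the payment service uses an outbox, the payment record and the event are written in one transaction so that neither can exist without the other. The sketch below uses an in-memory SQLite database from the Python standard library; the table layout and the `record_authorization` helper are assumptions made for illustration.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE payments ("
    "operation_id TEXT PRIMARY KEY, status TEXT, provider_reference TEXT)"
)
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_type TEXT, payload TEXT)"
)


def record_authorization(operation_id: str, order_id: str,
                         provider_reference: str, trace_id: str) -> None:
    """Store the payment record and its event in a single transaction."""
    with db:  # commits both inserts together, or rolls both back on error
        db.execute(
            "INSERT INTO payments VALUES (?, 'authorized', ?)",
            (operation_id, provider_reference),
        )
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("payment_authorized",
             json.dumps({"order_id": order_id, "trace_id": trace_id})),
        )


record_authorization("pay:order-42", "order-42", "provider-774", "trace-81")
events = db.execute("SELECT event_type, payload FROM outbox").fetchall()
```

Because `operation_id` is the primary key, a second insert for `pay:order-42` fails before a duplicate event can be written. A separate relay process, not shown here, would publish outbox rows to the queue.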
3. The Queue Shows A Systemic Delay
The event reaches an order-confirmation queue, but workers are slow because their database dependency is saturated. Metrics show that the delay is systemic rather than specific to this order.
12:00:10
queue metric:
  depth = 1,400
  oldest message age = 8 minutes
queue record:
  message_id=msg-551
  order_id=order-42
  state=waiting_for_worker
The metric alone cannot say what happened to order-42; it has no per-order identity. The queue record can. Together they explain both the individual delay and the operational pressure causing it.
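The two views can be derived from the same records, which makes the difference visible. The records below are hypothetical and far fewer than the 1,400 in the example; the date is arbitrary and only the time of day mirrors the trace.

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 1, 12, 0, 10)

# Hypothetical queue records.
queue_records = [
    {"message_id": "msg-100", "order_id": "order-7",
     "enqueued_at": now - timedelta(minutes=8)},
    {"message_id": "msg-551", "order_id": "order-42", "enqueued_at": now},
]

# Aggregate view: describes pressure, carries no order identity.
depth = len(queue_records)
oldest_age = max(now - r["enqueued_at"] for r in queue_records)

# Per-message view: answers the question about one order.
order_42 = [r for r in queue_records if r["order_id"] == "order-42"]
```

Once `depth` and `oldest_age` are exported as metrics, the order id is gone. Only the record lookup can say that `order-42` is behind `msg-551`.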
4. The Buyer Retries Without Creating A New Charge
At 12:00:12, the buyer refreshes checkout. The web service creates req-B and perhaps trace-82, but it reuses pay:order-42 after finding the existing order attempt.
req-B -> payment service
  payment_operation_id = pay:order-42
payment service response:
  existing authorization found
  do not contact provider again
This is where observability supports correctness. The log and trace for req-B are new because it is a new request attempt. The durable payment operation is the same because it represents the same business intent. An investigator can see that the buyer retried; the payment record can prove whether the retry was safely deduplicated.
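The deduplication step can be sketched as a lookup keyed by the operation id. This is an in-memory illustration with a hypothetical `PaymentService` class and a fake provider. A production version would need a durable store and an atomic insert (for example a unique constraint) to handle a retry that arrives while the first attempt is still in flight.

```python
class PaymentService:
    """In-memory sketch; a real service would use a durable store."""

    def __init__(self, provider):
        self._provider = provider
        self._records = {}  # operation_id -> stored authorization

    def authorize(self, operation_id: str, request_id: str) -> dict:
        existing = self._records.get(operation_id)
        if existing is not None:
            # Same business intent: return the stored result, skip the provider.
            return {**existing, "deduplicated": True, "request_id": request_id}
        record = {
            "operation_id": operation_id,
            "status": "authorized",
            "provider_reference": self._provider(operation_id),
        }
        self._records[operation_id] = record
        return {**record, "deduplicated": False, "request_id": request_id}


provider_calls = []


def fake_provider(operation_id: str) -> str:
    """Hypothetical provider that records each call it receives."""
    provider_calls.append(operation_id)
    return "provider-774"


service = PaymentService(fake_provider)
first = service.authorize("pay:order-42", request_id="req-A")
retry = service.authorize("pay:order-42", request_id="req-B")
```

The response to `req-B` carries its own request id and a `deduplicated` flag, so logs can show both that the buyer retried and that the provider was contacted only once.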
5. A Worker Reconstructs And Repairs The State
At 12:08, a worker consumes msg-551, creates the order record, and records order_confirmed. If the reservation has expired or the order can no longer be completed, a repair workflow can instead release the reservation and initiate a refund or void using provider-774.
worker trace span linked to trace-81
order record:
  order-42 = confirmed
or repair record:
  order-42 = payment_authorized_but_unfulfillable
  action = void_or_refund
The exact outcome is a design decision. What observability must guarantee is that the decision and the evidence behind it are queryable. The investigation should not end at “the page timed out.” It should answer where the workflow stopped, which side effect occurred, and which repair is safe.
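A queryable decision can be as simple as a record that names the state, the action, and the evidence used. The `decide` function, its field names, and the reservation-expiry rule below are assumptions for illustration; the actual policy is a design decision, as noted above.

```python
from datetime import datetime


def decide(order_id: str, payment: dict,
           reservation_expires_at: datetime, now: datetime) -> dict:
    """Return the durable record the worker would write for this order."""
    if payment["status"] != "authorized":
        return {"order_id": order_id, "state": "awaiting_payment", "action": "none"}
    if now <= reservation_expires_at:
        return {"order_id": order_id, "state": "confirmed", "action": "none",
                "evidence": [payment["operation_id"]]}
    return {
        "order_id": order_id,
        "state": "payment_authorized_but_unfulfillable",
        "action": "void_or_refund",
        "reason": "reservation_expired",
        "evidence": [payment["operation_id"], payment["provider_reference"]],
    }


payment = {"operation_id": "pay:order-42", "status": "authorized",
           "provider_reference": "provider-774"}
now = datetime(2024, 1, 1, 12, 8)  # arbitrary date; time mirrors the trace
confirmed = decide("order-42", payment, datetime(2024, 1, 1, 12, 15), now)
repair = decide("order-42", payment, datetime(2024, 1, 1, 12, 5), now)
```

Both outcomes name the payment operation they relied on, so a later investigator can trace the decision back to its evidence.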
Build A Timeline, Then Separate Facts From Inferences
An incident timeline organizes evidence without pretending that wall-clock order is perfect proof of causality.
12:00:00.000 fact: web received req-A for order-42
12:00:00.080 fact: inventory reserved sku-9
12:00:05.000 fact: web returned timeout to the buyer
12:00:09.240 fact: payment record shows pay:order-42 authorized
12:00:10.000 fact: msg-551 entered the confirmation queue
12:00:12.000 fact: req-B reused pay:order-42
12:08:00.000 fact: worker confirmed order-42
“The buyer probably retried because the page timed out” is an inference until req-B or client telemetry confirms it. “The provider charged the buyer” should be tied to the durable provider result, not only to a trace span. “The queue caused the delay” is a hypothesis supported by queue age, worker latency, and the message's timestamps; it can be strengthened by comparing other orders in the same period.
This discipline matters because distributed telemetry is incomplete. Clocks drift, traces are sampled, logs can be dropped, and messages can be delivered more than once. A good timeline labels what is known, what is missing, and what repair action is safe under that uncertainty.
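Labeling each entry with its kind and source keeps facts and inferences apart when the timeline is assembled. The `Entry` type below is a hypothetical structure, and the timestamps are fixed-width strings from a single day so that plain sorting is enough for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    at: str      # wall-clock timestamp as recorded by the source
    kind: str    # "fact" or "inference"
    source: str  # where the evidence came from
    text: str


entries = [
    Entry("12:00:09.240", "fact", "payment record", "pay:order-42 authorized"),
    Entry("12:00:05.000", "fact", "web log", "web returned timeout to the buyer"),
    Entry("12:00:12.000", "fact", "web log", "req-B reused pay:order-42"),
    Entry("12:00:12.000", "inference", "investigator",
          "buyer retried because the page timed out"),
]

# Sort by recorded time; ordering is a reading aid, not proof of causality.
timeline = sorted(entries, key=lambda e: e.at)
facts = [e for e in timeline if e.kind == "fact"]
open_questions = [e for e in timeline if e.kind == "inference"]
```

Anything left in `open_questions` is a prompt to look for confirming evidence, such as client telemetry for the retry.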
Instrument Boundaries, Not Just Services
A dashboard showing normal CPU for the web service does not prove checkout is healthy. The workflow crosses boundaries where information and responsibility change hands: HTTP requests, downstream calls, transaction commits, queue publish, worker consume, retry, and repair.
At each boundary, preserve enough context to answer four questions:
what logical operation is this?
which business object does it affect?
what outcome or error occurred?
what evidence proves the next handoff or state transition?
For a queue, this usually means storing the business ids in the message body or headers, propagating trace context, recording enqueue and consume time, and making delivery attempts visible. For a retry, it means a new request id plus the same idempotency or operation id. For a repair job, it means a durable record that names the reason, source evidence, and chosen action.
Absence must be observable too. If there is no order_confirmed event after a payment authorization, the system should let an engineer query that gap. Otherwise the team cannot distinguish “the event was never written,” “it is waiting in a queue,” “a consumer failed,” and “the event name changed.” Missing evidence is often the actual debugging clue.
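Making a gap queryable usually means comparing two sets of durable records. The sketch below assumes hypothetical in-memory collections and a `max_age` threshold; in practice this would be a query against the payment and order stores.

```python
from datetime import datetime, timedelta

# Hypothetical durable records keyed by order id.
authorizations = {
    "order-42": datetime(2024, 1, 1, 12, 0, 9),
    "order-43": datetime(2024, 1, 1, 12, 0, 30),
}
confirmed_orders = {"order-43"}
repair_records = set()


def unresolved_authorizations(now: datetime, max_age: timedelta) -> list:
    """Orders with an authorization but no confirmation or repair after max_age."""
    return sorted(
        order_id
        for order_id, authorized_at in authorizations.items()
        if order_id not in confirmed_orders
        and order_id not in repair_records
        and now - authorized_at > max_age
    )


stuck = unresolved_authorizations(datetime(2024, 1, 1, 12, 5), timedelta(minutes=2))
```

The result lists orders whose money has moved but whose outcome is still unrecorded, which is the gap an engineer needs to see before choosing a repair.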
Failure Modes And Trade-offs
The representative failure is every component reporting health while the user workflow is broken. A queue can be growing, a provider can be slow, retries can inflate load, and a final business event can be missing even though CPU and error-rate dashboards look ordinary.
Another failure is lost identity at an asynchronous boundary. If msg-551 omits order-42, the worker starts a separate story. If a retry changes pay:order-42, the system may create a duplicate charge. Correlation data is a correctness aid, not just debugging decoration.
More telemetry is not automatically safer. High-cardinality metric labels can exhaust a monitoring system. Logging full request bodies can expose payment or personal data. Sampling can hide the rare failing trace that matters most. Retaining every payload forever creates cost and privacy obligations. Systems often sample routine success paths but retain errors, slow operations, or traces tied to high-value workflows more aggressively.
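A retention rule of this kind can be sketched as a function applied once a trace has finished, since status and duration are only known at that point. The thresholds, the `keep_trace` name, and the 5% default rate are illustrative assumptions, not recommended values.

```python
import hashlib


def keep_trace(trace_id: str, status: str, duration_ms: int,
               slow_ms: int = 2000, success_rate: float = 0.05) -> bool:
    """Keep all errors and slow traces; keep a fixed share of routine successes."""
    if status != "ok" or duration_ms >= slow_ms:
        return True
    # Hash the trace id so every service makes the same decision for one trace.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < success_rate * 10_000


timeout_kept = keep_trace("trace-81", status="timeout", duration_ms=5000)
slow_kept = keep_trace("trace-90", status="ok", duration_ms=2500)
```

Deriving the decision from the trace id, rather than from a random draw in each service, avoids keeping half of a trace and discarding the rest.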
Useful operational signals include:
- success, timeout, and uncertain-outcome rates for the checkout promise;
- payment authorization age without a matching order or repair record;
- queue depth, oldest-message age, worker latency, and redelivery count;
- retries per payment operation and idempotency-key reuse rate;
- trace-context propagation failures at queue and worker boundaries; and
- cardinality, sampling, retention, and sensitive-field redaction health for telemetry itself.
The trade-off is evidence versus cost, but the decision should be tied to recovery. Preserve the smallest safe set of facts needed to detect a broken promise, identify side effects, and choose the repair path. Everything retained beyond that should justify its operational and privacy cost.
Design Check
Choose one workflow: checkout, password reset, file upload, message send, seat booking, or account deletion. Without looking back, write:
user-visible promise:
durable business id:
idempotency or operation id:
per-attempt request id:
trace context across synchronous and asynchronous boundaries:
durable state transitions that prove side effects:
queue or worker evidence:
metric that detects a systemic version of the failure:
missing fact that must be queryable:
safe repair action if the workflow stops midway:
Then imagine a caller retries after a timeout. If the retry cannot be connected to the original business intent, add the missing operation identity before relying on dashboards. If a repair cannot explain its evidence, add the durable transition or link it needs.
Resources
- [DOC] OpenTelemetry: Signals
  - Focus: How traces, metrics, logs, and related signals complement rather than replace each other.
- [PAPER] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
  - Focus: Why distributed tracing exists and how request paths are reconstructed across service boundaries.
- [BOOK] Site Reliability Engineering: Monitoring Distributed Systems
  - Focus: Practical monitoring signals, alerting, and the limits of aggregate summaries.
Key Takeaways
- Observability reconstructs a user-visible promise from joinable evidence across requests, queues, services, and durable records.
- Request ids, trace ids, operation ids, and business ids describe different things; preserving their relationships makes retries and repair safe to investigate.
- Logs, traces, metrics, and durable events have distinct jobs and should be designed together at system boundaries.
- The right telemetry set supports a safe repair decision without creating unnecessary cost, noise, or privacy exposure.