Distributed Logs and Ordering Guarantees
The core idea: Distributed logs turn events into durable histories, but every ordering guarantee has a scope: one log, one partition, one key, or a deliberately expensive global boundary.
Core Insight
Imagine an e-commerce system publishing OrderPlaced, InventoryReserved, PaymentCaptured, and EmailScheduled events. Several consumers depend on that event history: billing, fulfillment, analytics, and fraud detection. The log is not just moving messages from producers to workers. It is creating a shared record that consumers can replay, resume, and use as evidence of what happened.
The trap is assuming that "the log is ordered" means "the whole platform has one global order." A single replicated log may provide one ordered history. A partitioned log usually provides order only within each partition. Across partitions, there is no free answer to which event came first unless the system adds more coordination.
That scope is the central design decision. If all events for one order_id stay in the same partition, that order can have a coherent sequence while the platform still scales across many orders. If related events are split across keys or logs, the application must not invent a global order the infrastructure did not promise.
The trade-off is direct: stronger global order gives simpler reasoning, but costs throughput, locality, and availability. Partitioned order gives parallelism, but pushes causal reasoning and cross-entity coordination back into the application.
A Log Is Shared History With Positions
A queue is often described as work waiting for workers. A log is better described as a durable sequence of facts with positions.
producers ---> [ distributed log ]
                 0   1   2   3   4   5 ...
                     ^           ^
                     |           |
                consumer A  consumer B
The same records can be read by many independent consumers. Each consumer tracks where it is in the history, often through an offset or equivalent position.
That gives the log several useful properties:
- consumers can resume after crashes
- new services can replay old history
- delayed consumers can catch up at their own pace
- audit and repair jobs can reread from a known point
- multiple systems can derive different views from the same facts
This is why logs sit close to coordination. A consumer can ask, "What have I processed?" and answer with an offset. A service can rebuild state by replaying the same sequence. A replicated state machine can apply an ordered log deterministically, so every replica that applies the same entries in the same order reaches the same state.
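The position-tracking idea can be sketched with a small in-memory model. This is an illustration only, not any broker's API: the `Log` and `Consumer` classes and their method names are hypothetical, and a real log would persist records and offsets durably.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Append-only in-memory log; a record's position is its list index."""
    records: list = field(default_factory=list)

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to this record

    def read(self, offset: int, max_records: int = 10) -> list:
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Tracks its own position; reading never changes the log."""
    log: Log
    offset: int = 0  # next position to read

    def poll(self, max_records: int = 10) -> list:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for event in ["OrderPlaced", "InventoryReserved", "PaymentCaptured", "EmailScheduled"]:
    log.append(event)

billing = Consumer(log)
analytics = Consumer(log)

billing_seen = billing.poll(max_records=3)      # billing has read three records
analytics_seen = analytics.poll(max_records=1)  # analytics lags, same history

# A restarted consumer resumes from its saved offset rather than from zero.
restarted_billing = Consumer(log, offset=billing.offset)
resumed = restarted_billing.poll()

# A new service replays the full history from position 0.
replayed = Consumer(log).poll()
```

Because each consumer owns its offset, a slow reader never blocks a fast one, and replay is just a read that starts from an earlier position.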
The cost is that the log becomes an important boundary. Retention, duplicate handling, producer retry behavior, consumer commits, and partitioning all affect what the application can safely infer from the history.
Ordering Scope Is the Real Contract
Ordering guarantees are real, but they are never floating in the abstract. They are attached to a scope.
One unpartitioned log can expose a sequence like:
e1 -> e2 -> e3 -> e4
A partitioned log exposes several local sequences:
partition 0: a1 -> a2 -> a3
partition 1: b1 -> b2 -> b3
partition 2: c1 -> c2 -> c3
Inside a partition, order is meaningful. Across partitions, the log usually does not promise a single interleaving. Asking whether a2 happened before b2 may be outside the contract.
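One way to see why the cross-partition question has no single answer is to check that several different interleavings are all consistent with the same per-partition order. The helper below is a hypothetical sketch, and the record names simply mirror the sequences above.

```python
def respects_partition_order(interleaving, partitions):
    """True if each partition's records appear in their local order."""
    for records in partitions.values():
        members = set(records)
        seen = [r for r in interleaving if r in members]
        if seen != list(records):
            return False
    return True


partitions = {
    0: ["a1", "a2", "a3"],
    1: ["b1", "b2", "b3"],
}

# Two consumers could observe either of these merged views.
view_one = ["a1", "a2", "b1", "b2", "a3", "b3"]  # a2 appears before b2
view_two = ["b1", "b2", "a1", "b3", "a2", "a3"]  # b2 appears before a2

# This view breaks partition 0's local order, so it is not allowed.
broken = ["a2", "a1", "a3", "b1", "b2", "b3"]

one_ok = respects_partition_order(view_one, partitions)
two_ok = respects_partition_order(view_two, partitions)
broken_ok = respects_partition_order(broken, partitions)
```

Both valid views disagree about whether a2 or b2 came first, which is exactly the question a partitioned log leaves open.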
This is why partition keys are architecture decisions. If the application needs all events for an order to be processed sequentially, key by order_id. If it needs all events for a customer account to be sequential, key by customer_id. If it needs a global ordering across all events, it is asking for a much tighter coordination boundary.
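A minimal sketch of key-based routing, assuming a fixed partition count and a CRC32 hash chosen only for illustration (real clients ship their own partitioners, and `NUM_PARTITIONS`, `partition_for`, and `history` are hypothetical names). Python's built-in `hash()` is avoided because string hashes vary between processes.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; changing it remaps keys to new partitions


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Events from different orders arrive interleaved at the producer.
produced = [
    ("order-7", "OrderPlaced"),
    ("order-2", "OrderPlaced"),
    ("order-7", "InventoryReserved"),
    ("order-2", "InventoryReserved"),
    ("order-7", "PaymentCaptured"),
    ("order-7", "EmailScheduled"),
]

partitions = defaultdict(list)
for order_id, event_type in produced:
    partitions[partition_for(order_id)].append((order_id, event_type))


def history(order_id: str) -> list:
    """Read one order's events back from the single partition that owns its key."""
    owner = partitions[partition_for(order_id)]
    return [event for key, event in owner if key == order_id]


order_7_history = history("order-7")
order_2_history = history("order-2")
```

Each order's history comes back in production order because every event with that key landed in one partition. Nothing here orders order-7's events relative to order-2's unless the two keys happen to share a partition.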
The useful habit is to say the guarantee out loud:
- order per partition
- order per entity key
- order per stream
- global order across everything
Each phrase implies a different cost profile.
Worked Example: One Order, Many Workflows
Suppose an order emits these events:
OrderPlaced(order-7)
InventoryReserved(order-7)
PaymentCaptured(order-7)
EmailScheduled(order-7)
If every event for order-7 goes to the same partition, consumers can process that order's sequence in order:
partition 4:
OrderPlaced -> InventoryReserved -> PaymentCaptured -> EmailScheduled
The system can still process other orders in parallel on other partitions:
partition 1: order-2 events
partition 4: order-7 events
partition 9: order-11 events
That design buys per-order ordering without forcing every event in the company through one global sequence.
Now change the design. Put inventory events in one log, payment events in another, and email events in a third. Each log may preserve its own order perfectly, but the application no longer has one authoritative sequence for the whole order workflow. If it needs to reason about cross-log causality, it must add explicit correlation, idempotency, state checks, or another coordination mechanism.
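One possible shape for the "explicit correlation plus state checks" approach: track the facts seen per order_id and act only when the required set is complete, so the result does not depend on which log is read first. This is a sketch under assumptions; `OrderWorkflow`, `REQUIRED`, and `ready_after` are hypothetical names, and the rule that shipping needs both inventory and payment is invented for the example.

```python
REQUIRED = {"InventoryReserved", "PaymentCaptured"}  # assumed business rule


class OrderWorkflow:
    def __init__(self):
        self.facts = {}  # order_id -> set of event types seen so far

    def observe(self, order_id, event_type):
        """Record a fact; return True once the order has everything it needs."""
        seen = self.facts.setdefault(order_id, set())
        seen.add(event_type)  # set membership makes duplicates harmless
        return REQUIRED <= seen


# Two separate logs, each ordered on its own, correlated only by order_id.
inventory_log = [("order-7", "InventoryReserved")]
payment_log = [("order-7", "PaymentCaptured")]


def ready_after(arrivals):
    """Feed events in the given arrival order; report readiness after each."""
    workflow = OrderWorkflow()
    return [workflow.observe(order_id, event) for order_id, event in arrivals]


inventory_first = ready_after(inventory_log + payment_log)
payment_first = ready_after(payment_log + inventory_log)
```

Both arrival orders end in the same state, because the consumer checks accumulated state instead of assuming a cross-log sequence the infrastructure never promised.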
The log did not fail. The order guarantee changed because the scope changed.
Operations Can Distort What Order Feels Like
A correct log can still surprise teams during normal operations.
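Retries can produce duplicates and, when a producer keeps several sends in flight at once, can reorder records within a partition unless the client is configured to prevent it. Replay can make old events appear again to a consumer that reset its offset. Consumer group rebalances can move ownership of a partition to another consumer, which resumes from the last committed position and may reprocess records handled after that commit. Leader failover can preserve the committed log prefix while adding latency. A slow consumer can lag behind without the log being out of order.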
The practical ordering guarantee is a combination:
ordering guarantee
  = log structure
  + partition key
  + producer retry discipline
  + consumer ownership model
  + offset management
  + recovery behavior
So "the broker guarantees order" is only the beginning. The application must also handle duplicates, choose keys carefully, commit offsets at the right time, and design consumers to be idempotent when replay is possible.
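The interaction between commit timing and idempotency can be sketched as follows. The consumer commits its offset only after processing, so a crash between the two steps causes redelivery, and a set of processed event IDs absorbs both that redelivery and a producer-side duplicate. Everything here is hypothetical: the class, the `(event_id, amount)` record shape, and the assumption that the dedup set is stored durably alongside the side effects.

```python
class IdempotentConsumer:
    def __init__(self, log):
        self.log = log
        self.committed = 0          # durable position; survives a restart
        self.processed_ids = set()  # assumed durable, stored with the side effects
        self.balance = 0

    def run(self, crash_before_commit_at=None):
        offset = self.committed  # always resume from the last committed position
        while offset < len(self.log):
            event_id, amount = self.log[offset]
            if event_id not in self.processed_ids:
                self.balance += amount
                self.processed_ids.add(event_id)
            if offset == crash_before_commit_at:
                return  # simulated crash: work done, offset not committed
            offset += 1
            self.committed = offset  # commit only after processing


log = [
    ("e1", 10),
    ("e2", 20),
    ("e2", 20),  # duplicate appended by a producer retry
    ("e3", 5),
]

consumer = IdempotentConsumer(log)
consumer.run(crash_before_commit_at=1)  # processes e2, crashes before commit
committed_after_crash = consumer.committed
balance_after_crash = consumer.balance

consumer.run()  # restart: e2 is redelivered, then the duplicate, then e3
```

After the restart the balance counts e2 once even though the consumer saw it three times. Committing before processing would avoid the redelivery but could skip a record on a crash, which is the trade-off behind committing "at the right time."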
The trade-off repeats: logs make durable replay and ordered processing easier, but they make ordering scope, offset discipline, and duplicate handling part of the application contract.
Common Misreadings
"We use a distributed log, so events are globally ordered" is usually wrong. Ask whether order is per log, per partition, per key, or truly global.
"A duplicate means the log violated ordering" is also wrong. Duplicates can come from retries, replay, or consumer failure. Append order and delivery semantics are separate concerns.
"One global log would be simpler, so it is always better" ignores scale. A single sequence can be mentally clean while becoming a throughput, availability, and locality bottleneck.
Connections
The previous lesson compared replication models by where write authority lives. Distributed logs make that concrete: a leader-backed partition can offer a strong local append order, while partitioning spreads authority and parallelism across many local histories.
The next lesson on logical, vector, and hybrid logical clocks picks up where logs stop. Once work spans several partitions, streams, or regions, clocks and causality metadata help describe relationships the log no longer orders globally.
Resources
- [DOC] Apache Kafka Documentation: Design
  - Focus: Study partitions, offsets, consumer groups, and replication as the practical shape of log guarantees.
- [PAPER] Kafka: a Distributed Messaging System for Log Processing
  - Focus: Read for the original log-processing model and why durable ordered streams are useful.
- [PAPER] In Search of an Understandable Consensus Algorithm
  - Focus: Contrast with Raft's replicated log, where consensus fixes the order in which every state machine applies entries.
- [BOOK] Designing Data-Intensive Applications
  - Focus: Read the log, replication, and stream-processing chapters with ordering scope in mind.
Key Takeaways
- A distributed log is a shared durable history with positions, replay, and recovery semantics, not just a message pipe.
- Order always has a boundary: per log, partition, key, or a deliberately coordinated global scope.
- The trade-off is stronger global order versus throughput and parallelism; scaling often means fragmenting order and making causality explicit elsewhere.