ZAB and Total Order Broadcast in Practice

LESSON

008 30 min intermediate

The core idea: ZAB packages leader-based replication as total order broadcast, so ZooKeeper replicas deliver the same coordination updates in the same order, including across leader changes.

Core Insight

Imagine a ZooKeeper ensemble storing coordination state for a control plane. Clients create znodes, update configuration, publish leader-election markers, and set watches. The system does not merely need those writes to appear on several machines. It needs every correct replica to deliver the same stream of state changes in the same order.

If one replica applies disable shard X before promote shard Y, while another observes the reverse order, clients can see contradictory control-plane behavior. Even if both updates eventually arrive everywhere, the service has broken the ordered story its users depend on.

ZAB (ZooKeeper Atomic Broadcast) is easiest to understand from that pressure. It is a leader-based atomic broadcast protocol built for ZooKeeper: one leader sequences proposals, followers acknowledge them, committed updates are delivered in total order, and a leader change goes through recovery before new broadcasts continue.

The trade-off: ZAB buys a coherent ordered history and a well-defined recovery procedure, and pays for them with leader dependence, synchronization complexity, and throughput bounded by ordered delivery.

Total Order Broadcast Is the Service Shape

Paxos is often introduced as choosing one value. Raft is often introduced as an understandable replicated log under strong leadership. ZAB is more naturally framed as total order broadcast:

all correct replicas deliver the same updates
in the same relative order

For ZooKeeper-style coordination, that is the right abstraction. The service is used for metadata, locks, naming, watches, and configuration. Those are control-plane objects where ordering carries meaning.

A simplified broadcast path looks like this:

client write
  -> leader sequences proposal
  -> followers acknowledge
  -> leader commits once a quorum has acknowledged
  -> replicas deliver the update in order
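
The path above can be sketched as a toy, single-process model. Everything here is hypothetical and exists only for illustration: the `ToyLeader` and `Proposal` names, the plain integer sequence numbers, and the choice to count the leader's own vote toward the quorum. It is not ZooKeeper's implementation. The point it shows is that a proposal commits only once a majority has acknowledged it, and that commits happen strictly in sequence order.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    zxid: int
    update: str
    acks: set = field(default_factory=set)
    committed: bool = False


class ToyLeader:
    """Hypothetical in-memory model of a sequencing leader (illustration only)."""

    def __init__(self, leader_id, ensemble):
        self.leader_id = leader_id
        self.ensemble = list(ensemble)             # all server ids, leader included
        self.quorum = len(self.ensemble) // 2 + 1  # strict majority
        self.next_zxid = 1
        self.pending = {}                          # zxid -> Proposal
        self.commit_log = []                       # committed zxids, in commit order

    def propose(self, update):
        """Assign the next sequence number to a client write."""
        proposal = Proposal(self.next_zxid, update)
        self.next_zxid += 1
        proposal.acks.add(self.leader_id)          # assumption: leader counts itself
        self.pending[proposal.zxid] = proposal
        return proposal

    def on_ack(self, follower_id, zxid):
        """Record an acknowledgement, then commit whatever is now safe."""
        self.pending[zxid].acks.add(follower_id)
        # Commit strictly in zxid order: stop at the first proposal
        # that still lacks a quorum, even if later ones already have one.
        for z in sorted(self.pending):
            p = self.pending[z]
            if p.committed:
                continue
            if len(p.acks) < self.quorum:
                break
            p.committed = True
            self.commit_log.append(z)
```

In a three-server ensemble, an acknowledgement for the second proposal that arrives before any acknowledgement for the first commits nothing. Once the first proposal reaches a quorum, both commit, in order.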

The important word is "deliver." Network messages can arrive in different physical orders. Total order broadcast is not a promise about raw packet arrival. It is a protocol guarantee about logical delivery order: once updates are delivered to the service state, every correct replica observes the same sequence.
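
To make the difference between arrival order and delivery order concrete, here is a minimal sketch of a buffer that holds back early arrivals until every earlier update has been delivered. The `OrderedDelivery` class and the dense integer sequence numbers are assumptions made for this example. It illustrates the guarantee, not how ZooKeeper moves bytes between servers.

```python
class OrderedDelivery:
    """Hypothetical buffer: delivers updates in sequence order, not arrival order."""

    def __init__(self, first_seq=1):
        self.next_seq = first_seq
        self.waiting = {}      # seq -> update that arrived before its turn
        self.delivered = []    # what the service state actually observes

    def on_arrival(self, seq, update):
        self.waiting[seq] = update
        # Drain every update that is now contiguous with what was delivered.
        while self.next_seq in self.waiting:
            self.delivered.append(self.waiting.pop(self.next_seq))
            self.next_seq += 1


# Two replicas receive the same updates in different physical orders.
replica_a = OrderedDelivery()
replica_b = OrderedDelivery()
replica_a.on_arrival(1, "disable shard X")
replica_a.on_arrival(2, "promote shard Y")
replica_b.on_arrival(2, "promote shard Y")   # arrives early, is held back
replica_b.on_arrival(1, "disable shard X")
```

Both replicas end with the same `delivered` list even though the second one saw the messages in the opposite order.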

This makes ZAB feel close to a replicated log, and the resemblance is real, but the broadcast framing is still useful: it keeps attention on the client-visible stream of coordination events rather than on one isolated consensus decision.

Recovery Before Broadcast

Leader-based ordering is not enough by itself. A leader can fail after sending some proposals to only some followers, so one follower may hold more of the old leader's unfinished suffix than another. If a new leader starts broadcasting immediately, it may extend an unsafe or inconsistent history.

ZAB therefore has two broad regimes:

recovery / synchronization
normal broadcast

During recovery, the newly elected leader has to establish a history that is safe to continue. The ensemble must converge on a valid prefix before new ordered updates resume.

A useful timeline:

old leader broadcasts proposals
old leader crashes mid-stream
new leader is elected
new leader synchronizes history with followers
normal ordered broadcast resumes

This is the same family of concern we saw in Raft: do not expose or extend unstable history casually after leadership changes. ZAB packages that concern around the needs of atomic broadcast. The new leader's job is not just "be in charge"; it must restore a safe sequence and then extend it.
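
The synchronization step can be sketched under simplifying assumptions. Histories are modeled as lists of `(zxid, update)` pairs, each zxid is an `(epoch, counter)` tuple, and the function names are hypothetical. The reasoning behind picking the most advanced history within a quorum is that every committed proposal was acknowledged by some quorum, and any two quorums share at least one server, so that history contains everything that was committed. The real protocol tracks more state than this (for example, which epoch each server has accepted), so treat this as an outline of the idea rather than the algorithm itself.

```python
def last_zxid(history):
    """Highest zxid in a history, or (0, 0) if the history is empty."""
    return history[-1][0] if history else (0, 0)


def pick_most_advanced(histories):
    """Return the id of the server whose history ends at the highest zxid."""
    return max(histories, key=lambda sid: last_zxid(histories[sid]))


def sync_plan(leader_history, follower_history):
    """Return (entries the follower discards, entries the leader sends).

    The follower keeps the longest prefix it shares with the leader,
    drops anything after it, and receives the rest of the leader's history.
    """
    common = 0
    for mine, theirs in zip(follower_history, leader_history):
        if mine != theirs:
            break
        common += 1
    return follower_history[common:], leader_history[common:]


# The new leader's history after it has begun epoch 2.
leader_history = [((1, 1), "u1"), ((1, 2), "u2"), ((2, 1), "u3")]

# A follower that simply fell behind only needs the missing suffix.
lagging = [((1, 1), "u1")]

# A server that kept a proposal nobody else saw must discard it on rejoin.
stale = [((1, 1), "u1"), ((1, 2), "u2"), ((1, 3), "orphan")]

lagging_plan = sync_plan(leader_history, lagging)
stale_plan = sync_plan(leader_history, stale)
```

Here `lagging_plan` discards nothing and receives two entries, while `stale_plan` discards the orphaned `(1, 3)` entry before receiving `(2, 1)`.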

Worked Example: Two Coordination Updates

Suppose a control plane needs to perform two writes:

u1: disable shard X
u2: promote shard Y

If all replicas deliver u1 before u2, clients see one coherent transition:

disable old path
then promote replacement

If some replicas deliver u2 first, clients may route traffic to a promoted shard while another part of the system still believes shard X is active. The problem is not only durability. It is the semantic order of control-plane state.
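
A small sketch shows why eventual arrival is not enough. It assumes a deliberately simple model in which the control-plane state is just the set of active shards, "disable" removes a shard, and "promote" adds one. The `apply_all` helper is hypothetical.

```python
def apply_all(initial_active, updates):
    """Apply updates in order and return every state a client could observe."""
    active = set(initial_active)
    observed = [frozenset(active)]
    for action, shard in updates:
        if action == "disable":
            active.discard(shard)
        elif action == "promote":
            active.add(shard)
        observed.append(frozenset(active))
    return observed


u1 = ("disable", "X")
u2 = ("promote", "Y")

in_order = apply_all({"X"}, [u1, u2])
swapped = apply_all({"X"}, [u2, u1])
```

Both sequences end in the same final state, with only Y active. The intermediate states differ: the in-order replica passes through a state with no active shard, while the swapped replica passes through a state where X and Y are active at the same time. Two clients reading from those replicas at that moment would disagree about who is allowed to serve.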

ZAB's leader sequences these updates, tagging each with a transaction id (zxid), so followers can deliver one ordered history:

zxid 100: disable shard X
zxid 101: promote shard Y

The exact implementation details are deeper than this lesson needs, but the idea of a monotonically ordered proposal stream matters. The protocol gives each update a place in the sequence, gathers enough acknowledgement, and then delivers updates in order.
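
One detail is worth a short sketch. In ZooKeeper a zxid is a 64-bit number whose high 32 bits hold the leader epoch and whose low 32 bits hold a counter within that epoch, so the small values 100 and 101 above are a simplification. The helper names below are hypothetical and written only for this example. They show why plain integer comparison keeps proposals ordered both within one leader's term and across leader changes.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch, counter):
    """Pack an epoch and a per-epoch counter into one integer."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid):
    """Unpack a zxid into (epoch, counter)."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK


disable_x = make_zxid(1, 100)    # epoch 1, counter 100
promote_y = make_zxid(1, 101)    # same leader, next proposal
after_failover = make_zxid(2, 0) # first zxid of the next epoch
```

Because the epoch occupies the high bits, every zxid issued by a later leader compares greater than every zxid issued by an earlier one, regardless of how far the earlier counter had advanced.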

When leadership changes, recovery protects that sequence. A new leader must not decide that zxid 101 happened before zxid 100, or invent a fresh suffix while followers still disagree about what was safely accepted.

Why ZAB Fits ZooKeeper

ZooKeeper is not mainly a bulk storage engine. It is a coordination substrate. Its users build higher-level behavior on top of ordered metadata:

leader election recipes
service discovery state
configuration changes
locks and barriers
watches over changing znodes

Those workloads are especially sensitive to ordering mistakes. A high-throughput data store may be able to tolerate temporary divergence for some data types. A coordination service often cannot, because other systems use its state to decide who is allowed to act.

ZAB's package fits that workload:

gain:
  one ordered stream of coordination updates
  leader-based sequencing
  explicit recovery before resumed broadcast

pay:
  dependence on a healthy leader regime
  synchronization work after leader changes
  latency/throughput constraints from ordered delivery

That trade-off is why ZAB is worth seeing beside Paxos and Raft. It shows that consensus-adjacent mechanisms are shaped by the service abstraction they expose. ZooKeeper exposes ordered coordination state, so its replication protocol is naturally explained as total order broadcast with recovery.

Common Misreadings

"ZAB is just Raft with different names" is too flat. Both are leader-based and both protect an ordered history, but Raft is usually presented as an understandable replicated log, while ZAB is packaged around atomic broadcast and ZooKeeper recovery.

"Total order means packets arrive in the same order" is wrong. Raw network arrival can differ. The guarantee is about the logical order in which updates are delivered to the replicated service.

"Once a new leader exists, broadcast can resume immediately" misses the recovery requirement. Leadership can change while a suffix is only partially propagated, so the new leader must first synchronize on safe history.

Connections

The previous Raft lessons focused on leader authority, commit, and membership. ZAB keeps the same broad safety concerns but reframes them around ZooKeeper's need for totally ordered delivery.

The next lesson compares primary-backup, multi-leader, and leaderless replication. ZAB is a useful bridge because it shows one strong leader-based design where ordered coordination state is worth the cost.

Resources

[PAPER] ZooKeeper: Wait-free Coordination for Internet-scale Systems
- Focus: Understand the service model and why ordered coordination state matters.
[PAPER] Zab: High-performance Broadcast for Primary-Backup Systems
- Focus: Study the protocol's broadcast and recovery framing.
[DOC] ZooKeeper Recipes and Solutions
- Focus: Connect totally ordered coordination state to locks, leader election, queues, and barriers.

Key Takeaways

ZAB is best understood as leader-based total order broadcast with explicit recovery before normal broadcast resumes.
ZooKeeper-style coordination workloads need one coherent order of metadata and control-plane updates, not just eventual replication of bytes.
The design trade-off is ordered safety and recovery clarity in exchange for leader dependence and synchronization cost.
