Change Data Capture from WAL Streams

The core idea: WAL-based CDC works because it reads the database's own commit log instead of re-querying tables, so downstream systems inherit commit order and transaction boundaries. Those guarantees hold only if you manage offsets, retention, and decoding semantics carefully.

Today's "Aha!" Moment

In 08.md, Harbor Point fixed the dual-write problem by inserting an outbox row in the same transaction that reserves inventory for bond CA-MUNI-77. That lesson solved the first correctness gap: Inventory can now say, truthfully, "the hold exists, and there is durable evidence that an event must be published." But the team still has to answer a hard operational question: how does that outbox row leave PostgreSQL and become a broker event without turning into another fragile polling script?

The tempting answer is "query the outbox table every second." That works in demos and becomes awkward in production. The poller needs indexes that stay hot under heavy inserts, it must guess how to batch rows without reordering them, and it can only infer commit order indirectly from timestamps or surrogate IDs. Once Harbor Point starts reserving thousands of bonds per minute at market open, the gap between "row committed" and "row published" becomes both a latency problem and a correctness-observability problem.
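
To make those limits concrete, here is a minimal sketch of such a polling relay, assuming a hypothetical outbox(id, payload, published_at) table; publish stands in for the real broker client:

```python
import time

import psycopg2

conn = psycopg2.connect("dbname=inventory")


def publish(payload):
    # Hypothetical broker call; stands in for the real producer client.
    pass


def poll_once(batch_size=100):
    with conn:  # one transaction per batch; commits on success
        with conn.cursor() as cur:
            cur.execute(
                """SELECT id, payload FROM outbox
                   WHERE published_at IS NULL
                   ORDER BY id            -- assumes IDs track commit order
                   LIMIT %s
                   FOR UPDATE SKIP LOCKED""",
                (batch_size,),
            )
            for row_id, payload in cur.fetchall():
                publish(payload)
                cur.execute(
                    "UPDATE outbox SET published_at = now() WHERE id = %s",
                    (row_id,),
                )


while True:
    poll_once()
    time.sleep(1)  # the "every second" loop; the sleep is a latency floor
```

Even with FOR UPDATE SKIP LOCKED to let multiple workers share the scan, ORDER BY id treats insert-time IDs as a proxy for commit order, the exact assumption Concept 1 below takes apart.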

WAL-based CDC changes the perspective. Instead of asking tables what changed after the fact, the connector reads the database's own write-ahead log, the same durable stream used for crash recovery and replication. That stream already knows transaction boundaries, commit order, and log sequence numbers. When Inventory commits the reservation row and the outbox row for saga_id=8841, the CDC pipeline does not need to rediscover those facts from application tables. It can follow the commit stream the database itself trusts.

That is why WAL-based CDC is more than a faster poller. It is a different contract. The database becomes the source of truth not only for stored rows but also for the exact sequence in which committed changes became durable. The rest of this lesson explains what the connector is really reading, where the guarantees stop, and why the next lesson, 10.md, naturally builds incremental materialized views on top of this same stream.

Why This Matters

Harbor Point's reservation workflow now spans Inventory, Risk, and downstream settlement preparation. Inventory stores the hold and an outbox event in PostgreSQL. A polling relay is good enough while traffic is light, but the desk's 09:30 opening burst exposes its limits. Some events wait several seconds because the poller scans too conservatively. Other batches grow so large that operators cannot tell whether rising lag comes from slow publishing, a blocked database query, or a broker partition that fell behind. Worse, if engineers try to parallelize the poller aggressively, they risk publishing events out of transaction order for the same reservation.

WAL-based CDC matters because it moves the extraction point closer to the database's real commit path. The connector can observe committed inserts, updates, and deletes in log order, with transaction metadata attached, without constantly re-reading the tables that the application already uses for live traffic. For Harbor Point, that means the outbox row for inventory.reserved is emitted according to the same commit sequence that made the reservation durable in the first place.

This does not make the pipeline magically "exactly once." If the connector crashes after publishing but before checkpointing its position, it may emit the same event again. If a replication slot falls too far behind, PostgreSQL may retain WAL segments until storage pressure becomes a production issue. If downstream consumers ignore schema evolution, a clean CDC feed can still break the read model. The value of WAL-based CDC is not perfection. It is that the failure modes are tied to concrete log positions, decoder rules, and replay semantics instead of hidden inside application polling logic.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain why WAL-based CDC is structurally different from table polling - Show how reading the commit log preserves transaction order and visibility in a way polling must reconstruct imperfectly.
  2. Trace Harbor Point's outbox event from commit to broker - Follow the reservation transaction, logical decoding, connector offsets, and downstream publication path.
  3. Evaluate the operational trade-offs of WAL streaming - Reason about replication slots, lag, schema changes, large transactions, and idempotent consumers before choosing CDC in production.

Core Concepts Explained

Concept 1: The WAL is the database's authoritative story of committed change

When Inventory handles Reserve(CA-MUNI-77, desk=ALPHA, qty=2_000_000, saga_id=8841), PostgreSQL does not update the outbox table first and worry about durability later. It writes log records describing the row changes into the write-ahead log, flushes the relevant WAL records durably, and only then reports commit success. The table pages may reach disk later. That is the entire point of write-ahead logging: recovery can replay the log because the log is the durable history.
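
For reference, a minimal sketch of that transaction, assuming psycopg2 and illustrative table and column names; the values follow the lesson's running example:

```python
import json

import psycopg2

conn = psycopg2.connect("dbname=inventory")
with conn:  # one transaction: both inserts share a single commit record
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO reservation_holds (bond, trading_desk, qty, saga_id) "
            "VALUES (%s, %s, %s, %s)",
            ("CA-MUNI-77", "ALPHA", 2_000_000, 8841),
        )
        cur.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (%s, %s)",
            (
                "8841:inventory-reserved",
                json.dumps({"saga_id": 8841, "bond": "CA-MUNI-77",
                            "desk": "ALPHA", "qty": 2_000_000}),
            ),
        )
# On commit, PostgreSQL durably flushes the WAL records for both inserts plus
# the commit record; logical decoding later replays them as one transaction.
```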

For CDC, that matters because the connector is reading the same durable history that recovery and replication trust. In a logical-decoding setup, PostgreSQL translates WAL records into row-level changes that make sense outside the storage engine: "inserted an outbox row," "updated reservation status," "deleted a hold." Those decoded changes are still organized by transaction. A connector can therefore wait until it sees the commit record, then emit the full committed change set in order.

That is the structural advantage over polling. A poller repeatedly asks, "which rows look new?" Usually it answers with a query like WHERE published_at IS NULL or WHERE id > last_seen_id. That can be workable, but the query is reconstructing intent from table state. It must assume the chosen column reflects commit order, that long transactions do not surface late, and that concurrent updates will not create ambiguous visibility. WAL-based CDC starts from stronger evidence: the log sequence number, or LSN, that the database itself assigned as part of commit processing.
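
A tiny simulation, no database required, makes the weakness concrete: IDs are assigned at insert time but rows become visible at commit time, so a poller keyed on id > last_seen_id can permanently skip a slow transaction:

```python
# Rows become visible to the poller at commit time, but their IDs were
# assigned earlier, at insert time.
committed = []  # IDs visible to the poller, appended in commit order


def poll(last_seen_id):
    return [row_id for row_id in committed if row_id > last_seen_id]


# Transaction A takes id=1 but is slow to commit.
# Transaction B takes id=2 and commits first.
committed.append(2)
print(poll(last_seen_id=0))  # [2] -> the poller advances last_seen_id to 2
committed.append(1)          # A finally commits
print(poll(last_seen_id=2))  # [] -> id=1 is committed but never picked up
```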

For Harbor Point, this means the reservation hold and the matching outbox row stay tied together. If the transaction aborts, logical decoding does not emit the row as committed business intent. If the transaction commits after a long lock wait, the connector still sees the final commit order rather than the order in which the application started work. That distinction becomes crucial under concurrency. Production incidents often come from confusing "request arrival order" with "commit order."

The trade-off is tight coupling to database internals. WAL formats, decoder plugins, and retention policies are not generic application concerns. Once Harbor Point leans on WAL-based CDC, the storage engine is no longer a black box. The team gets stronger ordering semantics, but it also inherits engine-specific operational responsibilities.

Concept 2: A CDC connector turns WAL positions into replayable downstream events

The end-to-end path for Harbor Point looks like this:

Inventory service
   |
   | 1. local transaction
   |    - insert reservation hold
   |    - insert outbox row
   v
PostgreSQL commit
   |
   | 2. WAL records + commit record become durable
   v
Logical decoding / replication slot
   |
   | 3. connector reads committed changes in LSN order
   | 4. connector stores its last processed position
   v
Kafka topic or event bus
   |
   | 5. consumers update risk view, notifications, or projections
   v
Downstream read models

Two pieces of state make this pipeline work: the database's replication position and the connector's checkpoint. PostgreSQL exposes a replication slot so the connector can ask, in effect, "continue streaming from the last LSN I have safely processed." The connector reads decoded changes, groups them by transaction, publishes the resulting events, and then records the latest durable offset on its own side. On restart, it resumes from that checkpoint instead of rescanning the outbox table.
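
A minimal sketch of that resume-and-checkpoint loop, using psycopg2's logical replication support; the slot name is illustrative and assumed to already exist with a text-emitting output plugin such as wal2json:

```python
import psycopg2
import psycopg2.extras


def publish_to_broker(payload):
    # Hypothetical producer call for the Kafka topic or event bus.
    pass


conn = psycopg2.connect(
    "dbname=inventory",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Resume from the slot's last confirmed position (step 3 in the diagram).
cur.start_replication(slot_name="harbor_outbox", decode=True)


def consume(msg):
    publish_to_broker(msg.payload)  # decoded change for one WAL position
    # Acknowledge only after publishing (step 4): this feedback is the
    # connector's checkpoint and defines what gets replayed after a crash.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)


cur.consume_stream(consume)  # blocks, streaming committed changes in LSN order
```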

That design is why WAL-based CDC is usually at-least-once across the database-to-broker boundary. Suppose Harbor Point's connector publishes inventory.reserved to Kafka, then crashes before persisting the new checkpoint. After restart, it may reread the same committed transaction from the previous LSN and publish it again. The right response is not wishful thinking about exactly-once delivery. The right response is to keep the event identity stable, such as event_id=8841:inventory-reserved, and make downstream consumers idempotent.
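
A minimal sketch of that consumer-side discipline, assuming a hypothetical processed_events table whose primary key is the stable event ID:

```python
import psycopg2

conn = psycopg2.connect("dbname=risk")


def handle(event):
    with conn:
        with conn.cursor() as cur:
            # The dedup check and the "claim" are one atomic statement.
            cur.execute(
                "INSERT INTO processed_events (event_id) VALUES (%s) "
                "ON CONFLICT (event_id) DO NOTHING",
                (event["event_id"],),  # e.g. "8841:inventory-reserved"
            )
            if cur.rowcount == 0:
                return  # replayed event: side effects were already applied
            apply_side_effects(cur, event)


def apply_side_effects(cur, event):
    # Placeholder for the real read-model update; it runs in the same
    # transaction as the dedup claim, so a replay cannot half-apply it.
    pass
```

Because the dedup claim and the side effect commit together, a replayed event becomes a no-op instead of a double booking.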

Initial snapshotting is another important mechanism. If Harbor Point enables CDC on a live system, the connector may need a starting picture of relevant tables before it can stream new WAL entries safely. Some tools take a consistent snapshot, note the corresponding log position, and then continue with WAL changes after that position. That snapshot boundary has to be chosen carefully; otherwise the system can miss rows inserted during initialization or replay them twice. The important mental model is "snapshot plus ordered tail," not "stream starts from nowhere."
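
A minimal sketch of the ordering of those steps, assuming psycopg2 and the wal2json plugin; production connectors additionally coordinate with the slot's exported snapshot, which this sketch omits:

```python
import psycopg2


def bootstrap_downstream(row):
    # Hypothetical backfill publisher for baseline rows.
    pass


conn = psycopg2.connect("dbname=inventory")
with conn:
    with conn.cursor() as cur:
        # 1. Create the slot first: PostgreSQL begins retaining WAL here and
        #    returns the consistent starting LSN (column name as of PG 10+).
        cur.execute(
            "SELECT lsn FROM pg_create_logical_replication_slot(%s, %s)",
            ("harbor_outbox", "wal2json"),
        )
        start_lsn = cur.fetchone()[0]

        # 2. Read the baseline. Rows committing during this scan may also
        #    appear in the stream; idempotent consumers absorb the overlap.
        cur.execute("SELECT id, payload FROM outbox")
        for row in cur:
            bootstrap_downstream(row)

# 3. Hand start_lsn to the streaming connector and stream everything after it.
```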

This is also the point where the lesson from 08.md becomes concrete. The outbox pattern gives Harbor Point a durable row that says an event must exist. WAL-based CDC gives the team a disciplined way to externalize that row without bolting more custom logic onto the application process. Once the connector is stable, the same commit stream can feed not only the broker topic but also the projection systems discussed in 10.md.

Concept 3: The hard production problems are lag, retention, and semantic drift

WAL-based CDC sounds elegant because it follows the database's own machinery, but the sharp edges show up quickly in production. The first is lag. If Harbor Point's connector falls behind, PostgreSQL cannot discard old WAL segments that the replication slot still needs. The result is not merely "CDC is slow." The result can be disk growth on the primary until operators notice that an analytics consumer's lag is now threatening the write path of the source database.

Large transactions create another subtle problem. Imagine a bulk correction that updates fifty thousand reservation rows because a desk imported positions incorrectly. Logical decoding may hold that transaction until commit, then release a very large burst of change events at once. Downstream consumers see one giant wall of work, and any read model built on the stream experiences latency spikes even though the source database behaved correctly. Harbor Point therefore needs both application-level discipline around transaction size and operational metrics that distinguish steady lag from "one huge commit is decoding."

Schema evolution is the third recurring failure mode. If the outbox payload adds risk_bucket and one consumer still expects the old schema, the WAL stream itself is not wrong. The connector decoded exactly what committed. Semantic drift happens above the log layer. Teams often misdiagnose this as a database or CDC bug when the real issue is contract management between producer and consumer. A WAL stream preserves order, not meaning.
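
The usual remedy lives in the consumers, not the connector. A minimal tolerant-reader sketch, with illustrative field names and an assumed business default:

```python
DEFAULT_RISK_BUCKET = "UNCLASSIFIED"  # assumed default for old payloads


def parse_reserved(payload: dict) -> dict:
    return {
        "saga_id": payload["saga_id"],  # required fields fail loudly
        "bond": payload["bond"],
        "qty": payload["qty"],
        # Optional new field: tolerate its absence instead of crashing.
        "risk_bucket": payload.get("risk_bucket", DEFAULT_RISK_BUCKET),
    }
```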

These trade-offs explain where WAL-based CDC fits. It is a strong choice when Harbor Point wants ordered, low-latency extraction of committed changes without repeatedly scanning hot tables. It is a weaker fit when the team cannot tolerate engine-specific operations, lacks idempotent consumers, or needs transformations so complex that a direct log tail becomes hard to reason about. The stream is powerful precisely because it is close to storage internals; that same closeness raises the operational bar.

Troubleshooting

Issue: CDC lag grows steadily, and PostgreSQL disk usage starts climbing even though application traffic is normal.

Why it happens / is confusing: A replication slot tells PostgreSQL that old WAL segments are still needed. If the connector stops advancing its LSN, the database preserves log files for replay, so a connector problem turns into source-database storage pressure.

Clarification / Fix: Monitor slot lag, retained WAL bytes, and connector checkpoint age together. If Harbor Point uses CDC for operational workflows, treat slot lag as a production SLO, not a background admin metric.
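
As a starting point, a minimal sketch of that joint view, using the standard pg_replication_slots catalog and pg_wal_lsn_diff() (PostgreSQL 10+ names); the alert threshold is illustrative:

```python
import psycopg2

RETAINED_WAL_BUDGET = 5 * 1024**3  # illustrative 5 GiB alert threshold

conn = psycopg2.connect("dbname=inventory")
with conn, conn.cursor() as cur:
    cur.execute(
        """SELECT slot_name, active,
                  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
           FROM pg_replication_slots"""
    )
    for slot_name, active, retained_bytes in cur.fetchall():
        if retained_bytes and retained_bytes > RETAINED_WAL_BUDGET:
            print(f"ALERT {slot_name}: {retained_bytes} bytes of WAL retained"
                  f" (connector active={active})")
```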

Issue: Consumers process the same inventory.reserved event twice after a connector restart.

Why it happens / is confusing: The connector can crash after publish but before persisting its last processed LSN. On restart it legitimately rereads the already committed transaction from the earlier checkpoint.

Clarification / Fix: Keep stable event IDs in the outbox payload and make every downstream side effect idempotent on that business key. WAL-based CDC improves ordering, but it does not eliminate replay.

Issue: A new consumer misses historical rows even though the CDC stream is healthy from this morning onward.

Why it happens / is confusing: Streaming only future WAL records does not reconstruct the older table state automatically. Without a controlled snapshot, the consumer begins from "now" and has no baseline for existing reservations.

Clarification / Fix: Decide explicitly whether a new consumer needs an initial snapshot, a backfill job, or a separate bootstrap topic. "Start reading the WAL" is not a substitute for state initialization.

Issue: Downstream read models stall after a bulk correction even though the connector is still running.

Why it happens / is confusing: One very large transaction may decode and publish as a burst after commit, overwhelming consumers and making end-to-end freshness look erratic.

Clarification / Fix: Keep source transactions bounded when possible, monitor event burst size alongside lag, and design downstream projections to absorb bursts without assuming perfectly smooth input.

Advanced Connections

Connection 1: 08.md defines what must be published; this lesson defines how the platform can publish it safely

The outbox pattern makes "inventory changed" and "an event must be emitted" one local fact. WAL-based CDC is the transport refinement that follows that fact through the database's own commit stream. Together they replace a fragile application-side relay with a pipeline grounded in storage semantics.

Connection 2: 10.md uses the same CDC stream to maintain derived state incrementally

Once Harbor Point can trust the ordered stream of committed changes, it can build read models and materialized views without re-reading the entire source tables every time. The next lesson extends this idea from "publish events correctly" to "maintain projections efficiently."

Key Insights

  1. WAL-based CDC reads commit history, not table snapshots - That gives Harbor Point real transaction order and commit visibility instead of forcing a poller to reconstruct them from application rows.
  2. Offsets are part of the correctness contract - The connector's checkpoint and the database's replication position determine replay behavior, so idempotent consumers remain mandatory.
  3. The clean abstraction still has sharp operational edges - Slot lag, retained logs, large transactions, and schema drift are normal design concerns, not exceptional corner cases.