Event Logs, Projections, and Stream Tables

Lesson 064 · Day 495 · Consistency and Replication · 30 min · advanced

The core idea: An event log becomes operationally useful only when deterministic projections turn ordered records into rebuildable stream tables, yielding fast queries and decoupled read paths at the cost of checkpointing, replay discipline, and carefully scoped ordering guarantees.

Today's "Aha!" Moment

In 063.md, Harbor Point turned PostgreSQL's commit log into a durable booking-changes stream. That solved the integration problem of "how do downstream systems hear about committed changes?" It did not solve the serving problem. The embarkation dashboard still needs answers like "which guests for the July 14 sailing are cleared to board right now?" Reading the raw log for every page load would mean replaying thousands of records just to answer one query, while hitting the write database directly would couple operational dashboards back to transactional tables that were never shaped for those reads.

This is where projections enter. A projection is a deterministic program that consumes the log in order, applies each event to derived state, and persists the result as a queryable table. Harbor Point's manifest-projector reads booking confirmations, cabin changes, payment holds, and accessibility notes, then maintains one row per guest per sailing in boarding_manifest. The non-obvious point is that a projection is not "just a cache." A cache is useful if it is warm. A projection is useful if it can crash, replay, rebuild, and still converge on the same state.

That distinction matters in production because the raw log and the stream table answer different questions. The event log answers, "What happened, in what durable order?" The stream table answers, "Given everything applied up to checkpoint X, what is the current state for this key?" The trade-off is deliberate: Harbor Point gets low-latency queries and isolated read models, but it also inherits stateful consumers, recovery checkpoints, lag monitoring, and the obligation to decide exactly which ordering guarantees each projection depends on.

Why This Matters

Harbor Point's write path is optimized for correctness at commit time. It stores bookings, payments, cabin assignments, and check-in holds in normalized tables. The embarkation staff, however, does not want normalized tables. They want a boarding view keyed by sailing and guest, with fields such as current cabin, payment status, wheelchair assistance, and "boardable yes/no." Finance wants a different shape: unsettled charges by sailing. Search wants another: the latest cabin inventory by itinerary. These are read problems, not write problems.

Without projections, teams usually pick one of two bad options. They either query the OLTP database directly with increasingly complex joins and filters, which pushes read pressure onto the source system and leaks internal schema into every downstream tool, or they build ad hoc caches whose update logic is only half replayable. Both options become painful when a consumer crashes, when logic changes, or when an incident forces a backfill. A dashboard that depends on mystery state is hard to trust during boarding.

Projections make the derived state explicit. Harbor Point can say, "boarding_manifest is the result of replaying approved booking and hold events through manifest-projector up to Kafka offset 918244." That sentence is operational gold. It gives the team a way to reason about freshness, rebuild after bugs, audit why a row looks wrong, and keep the write database focused on transactional work. The cost is that every projection now needs a recovery model, a checkpoint model, and a clear statement of whether it cares about global order, per-key order, or only eventual convergence.

Core Walkthrough

Part 1: Grounded Situation

Keep one Harbor Point flow in view:

PostgreSQL -> CDC / outbox -> booking-events topic
                           -> hold-events topic
                           -> payment-events topic

booking-events + hold-events + payment-events
    -> manifest-projector
    -> boarding_manifest stream table
    -> embarkation dashboard

Suppose booking 8841 for the July 14 sailing goes through four changes over ten minutes:

offset 9101 booking-confirmed   booking_id=8841 cabin=S12
offset 9108 payment-authorized  booking_id=8841 amount=220000
offset 9116 hold-added          booking_id=8841 hold_code=passport-mismatch
offset 9124 hold-cleared        booking_id=8841 hold_code=passport-mismatch

The dashboard does not want four events. It wants one answer:

booking_id=8841
sailing_id=2026-07-14
cabin_id=S12
payment_status=authorized
active_holds=[]
boardable=true

Trying to compute that row on demand from the raw log is wasteful. Trying to compute it from live transactional tables is risky because those tables are optimized for writes and source-of-truth integrity, not for every operational read pattern. Harbor Point therefore introduces a projection whose only job is to collapse the event history into current state for that specific query shape.

This is also where "stream table" becomes a useful term. The source is a stream: an append-only history of immutable records. The result is table-like: the latest derived value for each key. Harbor Point is not abandoning history. It is materializing the current answer that history implies.
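The stream-to-table relationship can be sketched as a fold: replay the ordered records and keep only the latest derived value per key. The event shapes below are illustrative, not Harbor Point's actual schema:

```python
# A stream is an ordered, append-only list of immutable records.
# A table is the latest derived value per key. Folding the stream
# forward produces the table the history implies.
stream = [
    ("8841", {"type": "booking-confirmed", "cabin": "S12"}),
    ("8842", {"type": "booking-confirmed", "cabin": "S14"}),
    ("8841", {"type": "cabin-changed", "cabin": "S20"}),
]

table = {}
for key, event in stream:
    table[key] = event["cabin"]  # latest value for the key wins

print(table)  # {'8841': 'S20', '8842': 'S14'}
```

History is not lost by this fold; it still lives in the stream. The table is just the materialized current answer.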

Part 2: Mechanism

The mechanism is straightforward in outline and unforgiving in detail.

  1. The projector reads ordered input records from one or more topics.
  2. It selects the key for the derived table, such as (sailing_id, booking_id).
  3. It loads the current derived row and the last applied source position for that key or partition.
  4. It applies a deterministic reducer function to produce the next row state.
  5. It stores the updated row and advances its checkpoint so a restart will not reintroduce ambiguity.
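The five steps above can be sketched as a minimal projector loop. The in-memory `store` and `checkpoint` dicts stand in for durable storage, and the reducer here is a trivial placeholder, not Harbor Point's real logic:

```python
# Minimal projector loop over ordered records. In production, store
# and checkpoint would live in PostgreSQL, RocksDB, etc.

def run_projector(records, reducer, store, checkpoint):
    """Apply ordered (offset, key, event) records to derived state."""
    for offset, key, event in records:
        if offset <= checkpoint.get("last_offset", -1):
            continue                          # duplicate delivery: skip
        row = store.get(key)                  # step 3: load current row
        store[key] = reducer(row, event)      # step 4: deterministic reduce
        checkpoint["last_offset"] = offset    # step 5: advance checkpoint

def latest_cabin(row, event):
    # Placeholder reducer: keep the most recent cabin assignment.
    return {"cabin_id": event["cabin_id"]}

store, checkpoint = {}, {}
records = [
    (9101, ("2026-07-14", 8841), {"cabin_id": "S12"}),
    (9130, ("2026-07-14", 8841), {"cabin_id": "S20"}),
]
run_projector(records, latest_cabin, store, checkpoint)
print(store[("2026-07-14", 8841)], checkpoint)
# {'cabin_id': 'S20'} {'last_offset': 9130}
```

Because the loop skips offsets at or below the checkpoint, redelivering the same records is a no-op, which is the property the rest of this section builds on.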

In Harbor Point's manifest-projector, the reducer might behave like this:

def apply_manifest_event(row, event):
    # First event for this (sailing, booking) key: start from a blank row.
    row = row or {
        "booking_id": event.booking_id,
        "sailing_id": event.sailing_id,
        "cabin_id": None,
        "payment_status": "pending",
        "active_holds": [],
        "boardable": False,
    }

    if event.type == "booking-confirmed":
        row["cabin_id"] = event.cabin_id
    elif event.type == "payment-authorized":
        row["payment_status"] = "authorized"
    elif event.type == "hold-added":
        row["active_holds"].append(event.hold_code)
    elif event.type == "hold-cleared":
        row["active_holds"] = [h for h in row["active_holds"] if h != event.hold_code]

    # Recompute the derived flag on every event so it can never drift
    # from the payment and hold state it depends on.
    row["boardable"] = row["payment_status"] == "authorized" and not row["active_holds"]
    return row
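Replaying booking 8841's four events from Part 1 through this reducer reproduces the dashboard row. The run below is self-contained, so it repeats the reducer and models events as simple attribute objects:

```python
from types import SimpleNamespace as Event

def apply_manifest_event(row, event):
    # Same reducer as above, repeated so this snippet runs standalone.
    row = row or {
        "booking_id": event.booking_id, "sailing_id": event.sailing_id,
        "cabin_id": None, "payment_status": "pending",
        "active_holds": [], "boardable": False,
    }
    if event.type == "booking-confirmed":
        row["cabin_id"] = event.cabin_id
    elif event.type == "payment-authorized":
        row["payment_status"] = "authorized"
    elif event.type == "hold-added":
        row["active_holds"].append(event.hold_code)
    elif event.type == "hold-cleared":
        row["active_holds"] = [h for h in row["active_holds"] if h != event.hold_code]
    row["boardable"] = row["payment_status"] == "authorized" and not row["active_holds"]
    return row

# Offsets 9101, 9108, 9116, 9124 from Part 1, in log order.
events = [
    Event(type="booking-confirmed", booking_id=8841, sailing_id="2026-07-14", cabin_id="S12"),
    Event(type="payment-authorized", booking_id=8841, sailing_id="2026-07-14"),
    Event(type="hold-added", booking_id=8841, sailing_id="2026-07-14", hold_code="passport-mismatch"),
    Event(type="hold-cleared", booking_id=8841, sailing_id="2026-07-14", hold_code="passport-mismatch"),
]

row = None
for e in events:
    row = apply_manifest_event(row, e)

print(row["cabin_id"], row["payment_status"], row["active_holds"], row["boardable"])
# S12 authorized [] True
```

Four events in, one row out: exactly the `boardable=true` answer the dashboard wants.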

The projector then needs a durable rule for duplicates and crashes. If the process writes the updated row but crashes before it records its source position, Kafka may redeliver the event. If Harbor Point merely "tries again," a non-idempotent reducer could duplicate holds, double-count charges, or regress state. The usual fix is to persist derived state and checkpoint together, or to store enough position metadata to ignore already-applied inputs.

A simplified sink transaction looks like this:

BEGIN;

SELECT last_offset
FROM projector_checkpoint
WHERE projector_name = 'manifest-projector'
FOR UPDATE;

-- If incoming offset <= last_offset, ignore as duplicate.

-- PostgreSQL upsert: insert the derived row or update it in place.
INSERT INTO boarding_manifest (..., last_event_offset)
VALUES (..., 9124)
ON CONFLICT (sailing_id, booking_id)
DO UPDATE SET last_event_offset = EXCLUDED.last_event_offset; -- plus the other derived columns

UPDATE projector_checkpoint
SET last_offset = 9124
WHERE projector_name = 'manifest-projector';

COMMIT;

The exact storage strategy varies. Kafka Streams keeps local state stores and backs them with changelog topics. Flink checkpoints operator state. A hand-rolled consumer may use PostgreSQL tables plus explicit checkpoints. The principle is the same: the stream table must represent "the log replayed through this reducer up to this position," not "whatever happened to be in memory before the pod restarted."

Ordering is where many designs quietly become wrong. Harbor Point can rely on stable order for records that share a key and a partition. It cannot assume one magical global order across every topic and partition. If a projection needs "payment and hold status for the same booking," keying by booking_id and aligning partitions is usually enough. If it needs a cross-booking aggregate such as "remaining cabins on a sailing," it may need repartitioning by (sailing_id, cabin_id) or a separate projection whose reducer is defined around that key. The system stays sane only when the derived table's key and the input ordering model match.
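Per-key ordering follows from deterministic key-to-partition assignment: every record for one key lands on one partition, and each partition preserves order. The sketch below uses CRC32 and six partitions for illustration; it is not Kafka's exact murmur2 partitioner:

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    # Same key always hashes to the same partition, so all events for
    # one booking share a partition and keep their relative order.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Every delivery of booking 8841's events targets a single partition...
assert len({partition_for("booking:8841") for _ in range(5)}) == 1

# ...but two different bookings may land on different partitions, so a
# cross-booking aggregate keyed by sailing needs repartitioning first.
print(partition_for("booking:8841"), partition_for("booking:8842"))
```

This is why "the derived table's key must match the input ordering model": the partitioner only promises order within a key, never across keys.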

Part 3: Implications and Trade-offs

The first benefit is query isolation. Harbor Point can add a new read model for check-in supervisors without touching the booking write path or burdening PostgreSQL with more operational joins. The second benefit is replayability. If the team discovers that boardable should also require a completed safety waiver, it can deploy manifest-projector-v2, replay from the beginning into boarding_manifest_v2, compare outputs, and cut over once the new table catches up. That is much safer than patching rows by hand and hoping future events preserve the fix.
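The v2 cutover pattern can be sketched as replaying the same history through both reducer versions into separate tables and diffing before switching reads. The reducers and the waiver rule below are simplified illustrations, not Harbor Point's real logic:

```python
def boardable_v1(history):
    # v1 rule: authorized payment and no active holds.
    return history["payment"] == "authorized" and not history["holds"]

def boardable_v2(history):
    # v2 rule: additionally require a completed safety waiver.
    return boardable_v1(history) and history.get("waiver_done", False)

# Collapsed per-booking histories, as replayed from the full log.
log = {
    8841: {"payment": "authorized", "holds": [], "waiver_done": True},
    8842: {"payment": "authorized", "holds": [], "waiver_done": False},
}

# Replay into two candidate tables, then diff before cutover.
manifest_v1 = {k: boardable_v1(h) for k, h in log.items()}
manifest_v2 = {k: boardable_v2(h) for k, h in log.items()}

changed = {k for k in log if manifest_v1[k] != manifest_v2[k]}
print(changed)  # {8842}
```

Only rows affected by the new rule change, which is exactly what the comparison step is meant to confirm before reads move to the v2 table.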

The price is operational discipline. Stream tables are disposable in principle but expensive in practice. Replaying millions of events can saturate storage, consume network bandwidth, and expose bugs that never appeared in the incremental path. Lag is not a cosmetic metric; it determines how stale the read model is. A boarding agent looking at boarding_manifest that is twenty minutes behind is making a different class of decision than one looking at a table that is five seconds behind.
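Lag can be made concrete with offsets: it is the gap between the end of the log and the projector's committed position. The helper names and the staleness threshold below are illustrative:

```python
import time

def projection_lag(log_end_offset: int, committed_offset: int) -> int:
    # How many events the stream table has not yet applied.
    return log_end_offset - committed_offset

def freshness_ok(last_applied_event_ts: float, max_staleness_s: float = 30.0) -> bool:
    # Event-time staleness: how old is the newest applied event?
    return time.time() - last_applied_event_ts <= max_staleness_s

lag = projection_lag(log_end_offset=918_244, committed_offset=918_200)
print(lag)  # 44 events behind
```

Alerting on both numbers matters: offset lag says how much work is queued, while event-time staleness says how old the answer a boarding agent is looking at actually is.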

Another trade-off is semantic scope. A stream table is excellent for current keyed state, recent aggregates, and rules that can be expressed as a deterministic fold over history. It is weaker for arbitrary ad hoc queries over full history unless you also preserve the underlying log or write historical snapshots elsewhere. Harbor Point keeps the event log for audit and replay, while using stream tables for the fast operational questions. That split is healthy because it stops one storage shape from pretending to satisfy every workload.

Finally, stream tables force a clean answer to "what is the source of truth?" The source of truth is still the event log or the committed system of record that feeds it. The stream table is authoritative for serving a derived query, but not for independent manual edits. If an agent flips boardable directly in boarding_manifest without emitting a corrective event, the next replay will erase that fix. Production teams do best when they treat stream tables as generated state with strict ownership and predictable rebuild paths.

Failure Modes and Misconceptions

Connections

Connection 1: 063.md provides the durable change stream that projections consume

CDC made committed database changes available as an integration backbone. This lesson explains how Harbor Point turns that backbone into read-optimized state without pushing every dashboard back onto raw tables or raw logs.

Connection 2: 061.md and 062.md define the reliability boundaries a projection must respect

Replay-safe consumers matter here because projectors are stateful sinks. Backpressure matters because a lagging projector can make a read model stale while also creating pressure on the stream-processing path that feeds it.

Connection 3: 065.md asks you to choose these boundaries deliberately

The capstone that follows is where the pieces meet: source-of-truth writes, change capture, projection design, replay cost, and operator-facing consistency promises all have to fit into one defendable architecture.

Key Takeaways

  1. An event log is the durable history; a projection is the deterministic computation; a stream table is the queryable state produced by replaying that computation up to a known checkpoint.
  2. Projections are only trustworthy when ordering assumptions, reducer logic, and checkpoint boundaries are explicit enough to survive duplicates, crashes, and rebuilds.
  3. Stream tables decouple read workloads from write systems and make operational queries fast, but they introduce lag management, replay cost, and stricter rules about who is allowed to mutate derived state.
  4. The safest production posture is to treat stream tables as generated artifacts of the log, version them when logic changes, and keep the underlying event stream available for audit and rebuild.