Unified Batch + Stream Architecture


The core idea: A unified batch + stream architecture keeps one canonical history of facts and one semantic definition of derived state, so low-latency projections and historical recomputes converge on the same answer, trading duplicate pipelines for stricter event modeling, replay discipline, and correction workflows.

Today's "Aha!" Moment

In 028.md, PayLedger made its exposure stream crash-safe by checkpointing keyed state, timers, and source positions together. That solved one class of inconsistency: a worker could fail and recover without losing or double-counting events. The next problem is broader. Finance still needs a nightly recompute of merchant exposure for audit, and risk analysts still need to replay the last 90 days after an enrichment bug or a late settlement file arrives. If the real-time job and the historical recompute are built as separate interpretations of the same business process, the system can recover perfectly from crashes and still tell the company two different stories.

The non-obvious point is that "unified batch + stream" does not mean "every workload runs on one engine" and it does not mean "streaming replaces batch." It means the streaming path and the batch path derive results from the same canonical facts, with the same identity rules, time semantics, and correction model. A streaming projection gives PayLedger a payout decision in minutes. A batch replay gives it a repaired or backfilled answer over a larger horizon. If those answers are defined by different contracts, reconciliation becomes guesswork instead of engineering.

Why This Matters

PayLedger uses exposure totals to decide whether a merchant can receive same-day payout. The risk service needs the latest answer now, because holding a fraudulent payout for even ten minutes matters. The finance team needs the same answer later, because merchant statements, reserve calculations, and audit trails are built from it. Those two consumers naturally pull the data platform in different directions: one wants low latency and incremental updates, the other wants wide scans, correction handling, and reproducible history.

Without a unified architecture, teams usually split the problem into a speed layer and a batch layer that evolve independently. The streaming job writes a hot table keyed by merchant_id; the nightly warehouse job runs SQL over raw files and computes "authoritative" exposure totals. At first this feels pragmatic. Then a refund arrives 18 hours late, a schema change introduces a new event subtype, or a fraud feature backfill changes merchant risk labels for last week. The stream view, warehouse report, and downstream product behavior no longer match. The trade-off is explicit: unification reduces semantic drift and makes repairs tractable, but it forces the team to treat history, corrections, and derived tables as first-class design concerns instead of ad hoc cleanup work.

Core Walkthrough

Part 1: Grounded Situation

PayLedger ingests four event families into Kafka and an object-store archive: captured payments, acquirer settlements, refunds, and correction events.

The risk service needs a near-real-time projection called merchant_exposure_current. It is updated continuously and powers payout holds within two minutes of an event arriving. Finance also needs merchant_exposure_daily, a reproducible table for statements and reserve reports. That table must absorb late acquirer files, corrected FX rates, and fraud labels that are recomputed after the fact.

The naive architecture is two separate implementations:

stream path: Kafka -> Flink job -> operational table -> risk service
batch path: object storage -> nightly Spark SQL -> warehouse table -> finance reports

Both paths claim to compute "merchant exposure," but they quietly disagree on key rules. The stream job deduplicates by event_id; the nightly SQL deduplicates by (merchant_id, event_time, amount) because raw files once lacked IDs. The stream job closes a 15-minute window when the watermark passes; the batch job uses file arrival time because that was easier to express in SQL. The stream job treats a corrected settlement as a compensating event; the batch job overwrites the old row in place during normalization.
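
Those quietly different rules produce different numbers from identical input. Here is a minimal, self-contained sketch of that divergence; the event records, field names, and amounts are invented for illustration:

events = [
    {"event_id": "e1", "merchant_id": "m_482", "event_time": 1000, "amount": 50.0},
    {"event_id": "e2", "merchant_id": "m_482", "event_time": 1000, "amount": 50.0},  # distinct event, same time and amount
    {"event_id": "e1", "merchant_id": "m_482", "event_time": 1000, "amount": 50.0},  # true duplicate of e1
]

def total_exposure(events, dedup_key):
    seen, total = set(), 0.0
    for e in events:
        key = dedup_key(e)
        if key in seen:
            continue          # drop anything this rule considers a duplicate
        seen.add(key)
        total += e["amount"]
    return total

stream_total = total_exposure(events, lambda e: e["event_id"])
batch_total = total_exposure(events, lambda e: (e["merchant_id"], e["event_time"], e["amount"]))
print(stream_total, batch_total)  # 100.0 vs 50.0 for the same input

The stream rule keeps the two legitimate events and drops the replayed duplicate; the batch rule collapses all three into one row, so the two paths disagree before any engine has failed.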

Now merchant m_482 asks why yesterday's dashboard showed exposure peaking at $85,000 while finance's report shows a maximum of $67,000. Both numbers came from the same business, but not from the same model. The issue is no longer processor uptime or checkpoint frequency. It is that the platform has two truths.

Part 2: Mechanism

A unified architecture starts by declaring what the source of truth actually is. For PayLedger, the canonical truth is an append-only fact history plus versioned reference data:

payment events + correction events + dimension snapshots
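
One hedged sketch of what a canonical fact might look like; the field names are illustrative, not a prescribed PayLedger schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class CanonicalEvent:
    event_id: str      # globally unique identity that every consumer deduplicates on
    merchant_id: str
    event_type: str    # e.g. "capture", "settlement", "refund", "settlement_correction"
    event_time: int    # business time; the field all windows and joins key on
    ingest_time: int   # processing time, kept separate so lateness stays observable
    amount: float      # corrections arrive as new facts, never as overwrites of this value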

Everything else is a projection. The streaming system maintains the projection incrementally; the batch system rebuilds or repairs it by replaying history over a wider range. The crucial requirement is that both paths share semantic boundaries:

  1. the same event identity and dedup rule
  2. the same event-time fields and late-data policy
  3. the same interpretation of corrections, tombstones, and dimension changes
  4. the same projection definition for exposure
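
One way to make those boundaries explicit is to declare them once and have both paths read the same declaration. A minimal sketch, with illustrative names and values rather than a real configuration:

from dataclasses import dataclass

@dataclass(frozen=True)
class ExposureContract:
    dedup_key: str = "event_id"                            # identity rule shared by stream and batch
    event_time_field: str = "event_time"                   # business time, never file-arrival time
    allowed_lateness_minutes: int = 60                     # one late-data policy for both paths
    correction_policy: str = "append_compensating_event"   # facts are never mutated in place
    projection: str = "merchant_exposure"                  # the single definition of derived state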

In practice the flow looks like this:

producers
   |
   v
canonical event log -----> archive / table snapshots
   |                                |
   |                                v
   |                       batch replay / backfill
   v                                |
stream processor -------------------+
   |
   v
materialized projections for serving and reporting

The real implementation can use different engines. PayLedger might use Flink for the incremental path, Spark for backfills, and Iceberg tables for durable snapshots. That is still unified if both engines consume the same canonical facts and apply the same semantics. The architecture is not unified if they merely happen to read similarly named data sets.

One useful way to make the contract concrete is to define the state transition once and reuse it conceptually everywhere:

def apply_exposure_event(state, event):
    # Idempotence: a replayed or duplicated event must not change the result.
    if event.event_id in state.seen_event_ids:
        return state

    # classify() maps the event to a signed exposure delta, e.g. a capture
    # increases open exposure while a settlement or refund reduces it.
    signed_delta = classify(event)
    state.open_exposure += signed_delta
    state.seen_event_ids.add(event.event_id)
    state.last_event_time = max(state.last_event_time, event.event_time)
    return state

The stream processor applies that transition to each new event and writes an incremental view. The batch replay applies the same transition across a historical slice to regenerate the view after a bug fix, a backfill, or a new serving table bootstrap. Unified architecture is the insistence that both computations mean the same thing even if the runtime mechanics differ.
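
A batch replay, for example, can be little more than a fold of that transition over a historical slice. The sketch below assumes two hypothetical helpers: an ExposureState class holding the fields used above, and a load_historical_events() reader over the canonical archive:

from collections import defaultdict

def rebuild_exposure(start, end):
    # Replay canonical facts in event-time order across the repair window.
    states = defaultdict(ExposureState)
    for event in load_historical_events(start, end, order_by="event_time"):
        states[event.merchant_id] = apply_exposure_event(states[event.merchant_id], event)
    # The result can bootstrap a new serving table or repair an existing one.
    return states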

This also changes how corrections work. If an acquirer sends a repaired settlement amount, PayLedger appends a compensating correction event or a new versioned fact instead of mutating the original exposure table in place. That keeps replay meaningful. A backfill can reproduce the corrected result because the correction is in history, not hidden in an operational table with no lineage.
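
Concretely, the correction can be just another event type that the shared transition already understands. A sketch, with invented event-type names and fields:

def classify(event):
    if event.event_type == "capture":
        return event.amount                    # new open exposure
    if event.event_type in ("settlement", "refund"):
        return -event.amount                   # exposure resolved
    if event.event_type == "settlement_correction":
        # The repaired amount arrives as a new fact carrying both values,
        # so a later replay reproduces the corrected exposure from history alone.
        return event.original_amount - event.corrected_amount
    return 0.0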

Part 3: Implications and Trade-offs

The main benefit is not elegance. It is operational clarity. When the stream projection and the batch replay disagree, the team can ask whether the canonical facts differ, whether a projection definition changed, or whether one engine violated the shared contract. That is a tractable debugging path. In a split architecture, the same disagreement often devolves into "the dashboard is using pipeline A and finance is using pipeline B," which is not a diagnosis at all.
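
That debugging path can even be automated as a routine reconciliation job. A minimal sketch, assuming both views are available as per-merchant totals; the tolerance is an illustrative choice:

def reconcile(stream_view, replay_view, tolerance=0.01):
    mismatches = {}
    for merchant_id, replayed in replay_view.items():
        served = stream_view.get(merchant_id, 0.0)
        if abs(served - replayed) > tolerance:
            mismatches[merchant_id] = (served, replayed)
    # Non-empty output means the facts, a projection definition, or an engine drifted.
    return mismatches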

The cost is real. Unified architectures retain more raw history, preserve event lineage longer, and require stronger schema discipline. They also force awkward business conversations. Should the product show provisional exposure immediately and accept that late corrections may revise yesterday's answer? Or should it wait for a slower authoritative layer and tolerate stale decisions? That trade-off is structural, not just UI polish.

There is another trade-off between physical unification and semantic unification. A single engine can reduce code drift, but it may be poor at either large replays or low-latency serving. Separate engines can be the right choice if they share a strict semantic contract. The failure mode is not using multiple systems. The failure mode is letting each system invent its own definition of truth.

For PayLedger, the practical policy becomes:

  1. the append-only event log plus versioned reference data is the only source of truth
  2. merchant_exposure_current and merchant_exposure_daily are both projections of that history, built with the same identity, event-time, and correction rules
  3. late data and corrections are appended as new facts, never patched into serving tables
  4. any projection can be rebuilt by replay, and reconciling the incremental view against a replay is routine work, not an incident

That is what makes batch and stream feel unified in production: not a slogan, but a repairable system boundary.

