Unified Batch + Stream Architecture


The core idea: A unified batch + stream architecture keeps one canonical history of facts and one semantic definition of derived state, so low-latency projections and historical recomputes converge on the same answer, trading duplicate pipelines for stricter event modeling, replay discipline, and correction workflows.

Today's "Aha!" Moment

In 028.md, PayLedger made its exposure stream crash-safe by checkpointing keyed state, timers, and source positions together. That solved one class of inconsistency: a worker could fail and recover without losing or double-counting events. The next problem is broader. Finance still needs a nightly recompute of merchant exposure for audit, and risk analysts still need to replay the last 90 days after an enrichment bug or a late settlement file arrives. If the real-time job and the historical recompute are built as separate interpretations of the same business process, the system can recover perfectly from crashes and still tell the company two different stories.

The non-obvious point is that "unified batch + stream" does not mean "every workload runs on one engine" and it does not mean "streaming replaces batch." It means the streaming path and the batch path derive results from the same canonical facts, with the same identity rules, time semantics, and correction model. A streaming projection gives PayLedger a payout decision in minutes. A batch replay gives it a repaired or backfilled answer over a larger horizon. If those answers are defined by different contracts, reconciliation becomes guesswork instead of engineering.

Why This Matters

PayLedger uses exposure totals to decide whether a merchant can receive same-day payout. The risk service needs the latest answer now, because holding a fraudulent payout for even ten minutes matters. The finance team needs the same answer later, because merchant statements, reserve calculations, and audit trails are built from it. Those two consumers naturally pull the data platform in different directions: one wants low latency and incremental updates, the other wants wide scans, correction handling, and reproducible history.

Without a unified architecture, teams usually split the problem into a speed layer and a batch layer that evolve independently. The streaming job writes a hot table keyed by merchant_id; the nightly warehouse job runs SQL over raw files and computes "authoritative" exposure totals. At first this feels pragmatic. Then a refund arrives 18 hours late, a schema change introduces a new event subtype, or a fraud feature backfill changes merchant risk labels for last week. The stream view, warehouse report, and downstream product behavior no longer match. The trade-off is explicit: unification reduces semantic drift and makes repairs tractable, but it forces the team to treat history, corrections, and derived tables as first-class design concerns instead of ad hoc cleanup work.

Core Walkthrough

Part 1: Grounded Situation

PayLedger ingests four event families into Kafka and an object-store archive: captured payments, acquirer settlements, refunds, and correction events.

The risk service needs a near-real-time projection called merchant_exposure_current. It is updated continuously and powers payout holds within two minutes of an event arriving. Finance also needs merchant_exposure_daily, a reproducible table for statements and reserve reports. That table must absorb late acquirer files, corrected FX rates, and fraud labels that are recomputed after the fact.

The naive architecture is two separate implementations:

stream path: Kafka -> Flink job -> operational table -> risk service
batch path: object storage -> nightly Spark SQL -> warehouse table -> finance reports

Both paths claim to compute "merchant exposure," but they quietly disagree on key rules. The stream job deduplicates by event_id; the nightly SQL deduplicates by (merchant_id, event_time, amount) because raw files once lacked IDs. The stream job closes a 15-minute window when the watermark passes; the batch job uses file arrival time because that was easier to express in SQL. The stream job treats a corrected settlement as a compensating event; the batch job overwrites the old row in place during normalization.
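
Those quietly different rules produce different numbers from identical input. Here is a minimal, self-contained sketch of that divergence; the event records, field names, and amounts are invented for illustration:

events = [
    {"event_id": "e1", "merchant_id": "m_482", "event_time": 1000, "amount": 50.0},
    {"event_id": "e2", "merchant_id": "m_482", "event_time": 1000, "amount": 50.0},  # distinct event, same time and amount
    {"event_id": "e1", "merchant_id": "m_482", "event_time": 1000, "amount": 50.0},  # true duplicate of e1
]

def total_exposure(events, dedup_key):
    seen, total = set(), 0.0
    for e in events:
        key = dedup_key(e)
        if key in seen:
            continue          # drop anything this rule considers a duplicate
        seen.add(key)
        total += e["amount"]
    return total

stream_total = total_exposure(events, lambda e: e["event_id"])
batch_total = total_exposure(events, lambda e: (e["merchant_id"], e["event_time"], e["amount"]))
print(stream_total, batch_total)  # 100.0 vs 50.0 for the same input

The stream rule keeps the two legitimate events and drops the replayed duplicate; the batch rule collapses all three into one row, so the two paths disagree before any engine has failed.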

Now merchant m_482 asks why yesterday's dashboard showed exposure peaking at $85,000 while finance's report shows a maximum of $67,000. Both numbers came from the same business, but not from the same model. The issue is no longer processor uptime or checkpoint frequency. It is that the platform has two truths.

Part 2: Mechanism

A unified architecture starts by declaring what the source of truth actually is. For PayLedger, the canonical truth is an append-only fact history plus versioned reference data:

payment events + correction events + dimension snapshots
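
One hedged sketch of what a canonical fact might look like; the field names are illustrative, not a prescribed PayLedger schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class CanonicalEvent:
    event_id: str      # globally unique identity that every consumer deduplicates on
    merchant_id: str
    event_type: str    # e.g. "capture", "settlement", "refund", "settlement_correction"
    event_time: int    # business time; the field all windows and joins key on
    ingest_time: int   # processing time, kept separate so lateness stays observable
    amount: float      # corrections arrive as new facts, never as overwrites of this value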

Everything else is a projection. The streaming system maintains the projection incrementally; the batch system rebuilds or repairs it by replaying history over a wider range. The crucial requirement is that both paths share semantic boundaries:

  1. the same event identity and dedup rule
  2. the same event-time fields and late-data policy
  3. the same interpretation of corrections, tombstones, and dimension changes
  4. the same projection definition for exposure
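
One way to make those boundaries explicit is to declare them once and have both paths read the same declaration. A minimal sketch, with illustrative names and values rather than a real configuration:

from dataclasses import dataclass

@dataclass(frozen=True)
class ExposureContract:
    dedup_key: str = "event_id"                            # identity rule shared by stream and batch
    event_time_field: str = "event_time"                   # business time, never file-arrival time
    allowed_lateness_minutes: int = 60                     # one late-data policy for both paths
    correction_policy: str = "append_compensating_event"   # facts are never mutated in place
    projection: str = "merchant_exposure"                  # the single definition of derived state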

In practice the flow looks like this:

producers
   |
   v
canonical event log -----> archive / table snapshots
   |                                |
   |                                v
   |                       batch replay / backfill
   v                                |
stream processor -------------------+
   |
   v
materialized projections for serving and reporting

The real implementation can use different engines. PayLedger might use Flink for the incremental path, Spark for backfills, and Iceberg tables for durable snapshots. That is still unified if both engines consume the same canonical facts and apply the same semantics. The architecture is not unified if they merely happen to read similarly named data sets.

One useful way to make the contract concrete is to define the state transition once and reuse it conceptually everywhere:

def apply_exposure_event(state, event):
    # Idempotence: a replayed or duplicated event must not change the result.
    if event.event_id in state.seen_event_ids:
        return state

    # classify() maps the event to a signed exposure delta, e.g. a capture
    # increases open exposure while a settlement or refund reduces it.
    signed_delta = classify(event)
    state.open_exposure += signed_delta
    state.seen_event_ids.add(event.event_id)
    state.last_event_time = max(state.last_event_time, event.event_time)
    return state

The stream processor applies that transition to each new event and writes an incremental view. The batch replay applies the same transition across a historical slice to regenerate the view after a bug fix, a backfill, or a new serving table bootstrap. Unified architecture is the insistence that both computations mean the same thing even if the runtime mechanics differ.
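
A batch replay, for example, can be little more than a fold of that transition over a historical slice. The sketch below assumes two hypothetical helpers: an ExposureState class holding the fields used above, and a load_historical_events() reader over the canonical archive:

from collections import defaultdict

def rebuild_exposure(start, end):
    # Replay canonical facts in event-time order across the repair window.
    states = defaultdict(ExposureState)
    for event in load_historical_events(start, end, order_by="event_time"):
        states[event.merchant_id] = apply_exposure_event(states[event.merchant_id], event)
    # The result can bootstrap a new serving table or repair an existing one.
    return states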

This also changes how corrections work. If an acquirer sends a repaired settlement amount, PayLedger appends a compensating correction event or a new versioned fact instead of mutating the original exposure table in place. That keeps replay meaningful. A backfill can reproduce the corrected result because the correction is in history, not hidden in an operational table with no lineage.
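
Concretely, the correction can be just another event type that the shared transition already understands. A sketch, with invented event-type names and fields:

def classify(event):
    if event.event_type == "capture":
        return event.amount                    # new open exposure
    if event.event_type in ("settlement", "refund"):
        return -event.amount                   # exposure resolved
    if event.event_type == "settlement_correction":
        # The repaired amount arrives as a new fact carrying both values,
        # so a later replay reproduces the corrected exposure from history alone.
        return event.original_amount - event.corrected_amount
    return 0.0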

Part 3: Implications and Trade-offs

The main benefit is not elegance. It is operational clarity. When the stream projection and the batch replay disagree, the team can ask whether the canonical facts differ, whether a projection definition changed, or whether one engine violated the shared contract. That is a tractable debugging path. In a split architecture, the same disagreement often devolves into "the dashboard is using pipeline A and finance is using pipeline B," which is not a diagnosis at all.
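
That debugging path can even be automated as a routine reconciliation job. A minimal sketch, assuming both views are available as per-merchant totals; the tolerance is an illustrative choice:

def reconcile(stream_view, replay_view, tolerance=0.01):
    mismatches = {}
    for merchant_id, replayed in replay_view.items():
        served = stream_view.get(merchant_id, 0.0)
        if abs(served - replayed) > tolerance:
            mismatches[merchant_id] = (served, replayed)
    # Non-empty output means the facts, a projection definition, or an engine drifted.
    return mismatches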

The cost is real. Unified architectures retain more raw history, preserve event lineage longer, and require stronger schema discipline. They also force awkward business conversations. Should the product show provisional exposure immediately and accept that late corrections may revise yesterday's answer? Or should it wait for a slower authoritative layer and tolerate stale decisions? That trade-off is structural, not just UI polish.

There is another trade-off between physical unification and semantic unification. A single engine can reduce code drift, but it may be poor at either large replays or low-latency serving. Separate engines can be the right choice if they share a strict semantic contract. The failure mode is not using multiple systems. The failure mode is letting each system invent its own definition of truth.

For PayLedger, the practical policy becomes:

  1. the append-only event log plus versioned reference data is the only source of truth
  2. merchant_exposure_current and merchant_exposure_daily are both projections of that history, built with the same identity, event-time, and correction rules
  3. late data and corrections are appended as new facts, never patched into serving tables
  4. any projection can be rebuilt by replay, and reconciling the incremental view against a replay is routine work, not an incident

That is what makes batch and stream feel unified in production: not a slogan, but a repairable system boundary.

