Day 503: Materialized Views and Incremental Recompute

The core idea: A materialized view stores the answer to an expensive query, but the production problem is not the first computation. It is keeping that answer correct as new events, late corrections, and logic changes arrive, which is why incremental recompute has to be a deterministic, replayable maintenance process instead of a series of ad hoc table updates.

Today's "Aha!" Moment

In 030.md, PayLedger gave its merchant dashboard a serving table called merchant_balance_current so /balance could return in well under the product latency budget. That solved the read path, not the data problem. Every capture_settled, refund_posted, chargeback_opened, and reserve-policy change now has to move that row toward the same answer finance would get by replaying canonical history. The materialized view is therefore not just a precomputed table. It is a promise that fast reads and durable facts will keep converging.

The non-obvious point is that incremental recompute does not mean "run the whole query more often" and it does not mean "apply deltas forever." It means the system can identify exactly which keys or partitions became invalid, recompute only that scope from trustworthy inputs, and publish the result without exposing a half-updated view. Teams get into trouble when they treat the view as a cache that can be patched manually during incidents. PayLedger needs the balance view to be deterministic enough that the same business logic works both event by event and during replay after a bug fix.

Why This Matters

The pressure is easy to make concrete. The merchant UI needs a balance answer on every page load, so scanning raw settlement and refund tables on demand is too slow and too expensive. Finance still needs an auditable explanation of that same balance at the end of yesterday, after late files and corrections are applied. If PayLedger recomputes the whole balance table every time a payment event lands, compute cost explodes. If it only issues narrow UPDATE statements against the serving table, the answer eventually drifts when retries, out-of-order events, or rule changes hit production.

The trade-off is direct: materialization buys predictable read latency, but it pushes complexity into dependency tracking, invalidation, replay tooling, and reconciliation. That extra machinery is worth paying for only if the view can be repaired from canonical data. Otherwise the platform has traded a slow query for a fast-but-untrustworthy product surface.

Core Walkthrough

Part 1: Grounded Situation

PayLedger maintains a materialized view keyed by (merchant_id, currency). Each row contains at least available_balance, pending_balance, reserve_hold, last_event_time, and the source watermark that produced the row. The view depends on two upstream inputs: the canonical payment event stream (capture_settled, refund_posted, chargeback_opened, and their corrections) and versioned policy snapshots such as the reserve rules applied to merchant segments.

That dependency mix is what makes materialized views interesting. A newly settled capture for merchant m_482 usually affects one row and can be applied cheaply. A reserve-policy change for all high-risk marketplace merchants is different. It can invalidate thousands of rows, and those rows cannot be patched safely unless the recompute path can replay the affected history with the new policy version.

The naive designs fail in opposite directions. A full refresh every minute is mechanically simple, but it repeatedly scans far more history than the change warrants. Direct row updates look efficient, but they silently encode assumptions about ordering, deduplication, and policy history. The first duplicated Kafka delivery or retroactive rule change exposes that those assumptions were never modeled explicitly.

Part 2: Mechanism

For PayLedger, materialized-view maintenance starts with a dependency graph:

canonical payment events + policy snapshots
                 |
                 v
         change classifier
                 |
      +----------+-----------+
      |                      |
      v                      v
local delta path       scoped replay path
      |                      |
      +----------+-----------+
                 v
      versioned materialized view snapshot
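
Before either path runs, something has to decide which one applies. Below is a minimal sketch of that change classifier, assuming each input change carries kind and type fields; the field names are illustrative, not PayLedger's actual schema.

def classify_change(change):
    # Route narrow, well-understood changes to the local delta path and
    # anything that widens the dependency surface to scoped replay.
    LOCAL_EVENT_TYPES = {"capture_settled", "refund_posted", "chargeback_opened"}
    if change.kind == "payment_event" and change.type in LOCAL_EVENT_TYPES:
        return "local_delta"    # maps to one (merchant_id, currency) row
    # Policy changes, retroactive corrections, and logic fixes fan out
    # beyond a single key, so they go through bounded replay instead.
    return "scoped_replay"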

The local delta path handles changes whose impact is narrow and well understood. When a refund_posted event arrives, the maintainer can map that event to one (merchant_id, currency) key, apply the balance transition, and advance the view watermark. That path is fast because the dependency surface is small and the fold function is deterministic.

def apply_event(row, event):
    # Deterministic fold: the same (row, event) pair always produces the
    # same row, so the logic works event by event and during replay alike.
    if event.type == "capture_settled":
        row.pending_balance += event.amount
    elif event.type == "refund_posted":
        row.available_balance -= event.amount
    elif event.type == "chargeback_opened":
        row.available_balance -= event.amount
        row.reserve_hold += event.amount
    else:
        # Fail loudly on unknown event types instead of silently
        # advancing the watermark past an unapplied change.
        raise ValueError(f"unhandled event type: {event.type}")
    # The watermark only moves forward, even when events arrive out of order.
    row.last_event_time = max(row.last_event_time, event.event_time)
    return row

That function is only safe if the inputs are already normalized around identity, ordering, and correction semantics. In practice PayLedger stores dedup keys, source offsets, and policy versions alongside the view state so a retry or failover does not double-apply a change. This is where the earlier lessons on event-time handling and checkpoint recovery matter: the view update and the consumed source position need to move forward as one atomic unit.
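
One way to make that atomicity concrete is to commit the row update, the dedup record, and the source offset in a single database transaction. The sketch below uses sqlite3 purely for illustration; the table names (balance_view, applied_events, source_positions) and the event fields are assumptions, not PayLedger's actual schema.

import sqlite3

def commit_delta(conn: sqlite3.Connection, event, row):
    with conn:  # one transaction: all three writes commit or none do
        # Skip the event entirely if its dedup key was already applied,
        # so a retry or failover cannot double-apply a change.
        already = conn.execute(
            "SELECT 1 FROM applied_events WHERE dedup_key = ?",
            (event.dedup_key,),
        ).fetchone()
        if already:
            return
        conn.execute(
            """UPDATE balance_view
               SET available_balance = ?, pending_balance = ?,
                   reserve_hold = ?, last_event_time = ?
               WHERE merchant_id = ? AND currency = ?""",
            (row.available_balance, row.pending_balance, row.reserve_hold,
             row.last_event_time, row.merchant_id, row.currency),
        )
        conn.execute(
            "INSERT INTO applied_events (dedup_key) VALUES (?)",
            (event.dedup_key,),
        )
        # Advance the consumed source position in the same transaction.
        conn.execute(
            "UPDATE source_positions SET offset = ? WHERE source = ?",
            (event.source_offset, event.source),
        )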

The scoped replay path exists for invalidations that are too wide or too subtle for direct deltas. Suppose risk changes the reserve formula from 10% to 15% for a merchant segment effective three days ago. The existing rows do not retain enough intermediate detail to patch every balance safely with a simple arithmetic adjustment, because some funds have since been paid out, refunded, or already moved into hold states. The correct response is to mark the affected merchants and time range as dirty, reload canonical events plus policy snapshots from the effective boundary, fold them again, and write replacement rows to a new snapshot. Readers switch only after the replacement slice is complete.
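
A sketch of that replay path, under the same caveats: empty_row, load_canonical_events, and publish_snapshot are hypothetical helpers standing in for PayLedger's storage layer, and apply_event is the fold shown earlier.

def replay_slice(dirty_keys, effective_from, policy_version):
    # Recompute only the invalidated (merchant_id, currency) keys, folding
    # canonical history again with the new policy version pinned.
    replacement = {}
    for key in dirty_keys:
        row = empty_row(key, policy_version)            # hypothetical helper
        for event in load_canonical_events(key, since=effective_from):
            row = apply_event(row, event)               # same fold as above
        replacement[key] = row
    # Write replacement rows to a new snapshot; readers switch only after
    # the whole slice is complete, never mid-update.
    publish_snapshot(replacement, policy_version)       # hypothetical helper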

That is what incremental recompute means in practice: not a global rebuild, but a bounded replay over the exact part of the graph that lost validity. Good systems make the boundary explicit. PayLedger tracks which source tables and policy versions contributed to each published slice, so operators can answer a concrete question during an incident: "Which balances are stale, and what input change invalidated them?"
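
The bookkeeping that makes that question answerable can be as small as one lineage record per published slice. A sketch, with illustrative field names:

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class SliceLineage:
    # One record per published slice of the materialized view, kept so an
    # operator can trace a stale balance back to the input that invalidated it.
    slice_id: str
    key_range: tuple[str, str]       # bounds over (merchant_id, currency)
    source_offsets: dict[str, int]   # source table -> last offset folded in
    policy_version: str
    published_at: datetime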

Part 3: Implications and Trade-offs

This design changes what "cheap" means. Reads become cheap because the dashboard is no longer paying the full aggregation cost. Writes and changes become more expensive because the platform now maintains lineage between inputs and view rows. Every new upstream dependency makes the serving path richer for reads and wider for recompute.

The granularity of the view is the first major trade-off. A very narrow view keyed only by merchant and currency keeps the read path simple, but any rule that depends on settlement age, payout state, or fraud status forces the maintainer to recover that context during replay. A wider view with extra dimensions can make recompute more precise, but it also increases storage cost and the number of rows that must be kept consistent.

The second trade-off is between pure delta maintenance and replayability. Delta application is efficient for append-only facts whose effects are local. Replay is safer when logic changes, late data lands outside the incremental window, or reference data fans out broadly. Mature teams keep both paths. If PayLedger can only process new events and cannot rebuild an affected slice, the materialized view eventually turns into operational debt disguised as performance optimization.

Operationally, the view also needs its own correctness signals. Freshness is not only "how old is the last update?" It is also "what is the oldest dirty key?" and "which policy snapshot is this row based on?" That is why production maintenance typically includes periodic reconciliation jobs that compare a sample or partition of the materialized view against a trusted recomputation from canonical history. Exactly-once delivery reduces one failure mode; it does not prove the projection logic is still correct.
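
A minimal sketch of such a reconciliation job, assuming hypothetical read_view_row and recompute_from_canonical helpers:

import random

def reconcile_sample(view_keys, sample_size=100, tolerance=0):
    # Spot check: recompute a random sample of keys from canonical history
    # and compare against the served rows. This catches projection bugs
    # that exactly-once delivery alone cannot.
    mismatches = []
    for key in random.sample(list(view_keys), sample_size):
        served = read_view_row(key)               # hypothetical helper
        truth = recompute_from_canonical(key)     # hypothetical helper
        if abs(served.available_balance - truth.available_balance) > tolerance:
            mismatches.append(key)
    return mismatches

When a mismatch surfaces, the affected keys go back through the scoped replay path rather than being patched by hand.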

