Day 497: Batch Processing and Throughput-Oriented Pipelines

The core idea: A batch pipeline wins by fixing the input boundary, pushing large amounts of work through deterministic stages, and publishing one versioned result set, trading freshness for throughput, restartability, and predictable cost.

Today's "Aha!" Moment

In 024.md, PayLedger learned the hard way that repeated cross-shard reads are a bad fit for interactive paths. The merchant-operations dashboard wanted an exact summary of holds, available balance, failed payouts, and reserve flags, but serving that page by fan-out made latency track the slowest shard and made availability fall with every extra shard involved. The next move was not to keep tuning the coordinator. It was to stop asking the request path to recompute the same answer over and over.

That is where batch processing becomes concrete. At 01:00 UTC, PayLedger now launches a settlement-summary pipeline for the previous business day. The job reads a fixed snapshot of shard-local ledger entries, groups them by merchant, computes payout eligibility and reserve exposure, and publishes a versioned dataset that the dashboard and finance export both consume. The important shift is conceptual: the system is no longer trying to answer "what is the latest state right now?" on every request. It is answering "what was the reconciled state for this bounded slice of input?" once, then reusing the result many times.

The misconception to remove is that batch is just "streaming, but slower" or "a cron job that scans a table." A throughput-oriented batch pipeline has a different contract. It assumes bounded input, chooses work decomposition to maximize sequential scans and parallel aggregation, and treats retries as normal rather than exceptional. If the contract is designed well, rerunning one failed task does not change the published answer. If the contract is vague, backfills double count money, partial reruns leak mixed versions, and finance stops trusting the output.

Why This Matters

Teams reach for batch when the same expensive computation has to be repeated over a large dataset and freshness measured in minutes or hours is acceptable. PayLedger fits that profile exactly. Merchant settlement summaries, daily reserve calculations, and end-of-day reconciliation all need global views over many shard-local facts, but none of them need millisecond freshness. Recomputing them synchronously in the request path would spend CPU and network on the same join and aggregation work thousands of times per day.

Batch changes the economics by paying the coordination cost once per input slice instead of once per user request. Large sequential reads are usually cheaper than many small random reads, local aggregation before shuffle reduces network cost, and a published dataset can serve many downstream consumers. The trade-off is equally real: the answer is only as fresh as the schedule and the watermark the job used. If a support agent needs the state as of this second, a nightly dataset is the wrong tool. If finance needs a reproducible statement for yesterday's books, a versioned batch output is exactly the right one.
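The amortization argument can be made concrete with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions, not PayLedger measurements, but they show why paying the coordination cost once per input slice dominates paying it once per request:

```python
# Illustrative numbers only: assume every dashboard request fans out to all shards.
requests_per_day = 50_000
shards = 40

# Request-path recomputation: every request touches every shard.
fanout_reads = requests_per_day * shards        # 2,000,000 cross-shard reads/day

# Batch: one sequential scan per shard per run, result reused by all consumers.
batch_reads = shards * 1                        # 40 scans/day

print(fanout_reads / batch_reads)               # -> 50000.0
```

The ratio shrinks if requests are rarer or shards fewer, but the shape of the trade holds: batch converts per-request coordination into a fixed per-run cost.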

Core Walkthrough

Part 1: Grounded Situation

PayLedger stores authoritative events on the shard that owns each funding account. During the day, those events drive customer-visible balance checks and payout decisions. After the business day closes, finance needs something different: one reconciled record per merchant showing net inflow, settled payouts, chargeback reserves, and any account whose replica lag or missing ledger segments make the books incomplete.

The naive approach is to run a huge SQL query over the live system every night. That usually fails for boring reasons before it fails for interesting ones. It competes with foreground traffic, it mixes records written before and after the reporting cutoff, and it is hard to retry safely because the underlying data can change while the job is in progress. A real batch pipeline starts by freezing the input boundary.

For PayLedger, the scheduler first creates an immutable manifest:

settlement_date = 2026-04-05
input_epoch = ledger-snapshot-7811
source_partitions = shards 03, 11, 18, 27, 41
output_dataset = merchant_settlement_summary/v7811

Every later stage refers to that manifest, not to "whatever is in the database now." That one decision turns retries and backfills from guesswork into something a production team can reason about.
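A minimal sketch of such a manifest as an immutable value object (field names mirror the example above; the class and its shape are assumptions, not PayLedger's real schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class BatchManifest:
    """Immutable description of one batch run's input boundary.

    Illustrative only; frozen=True means later stages cannot widen or
    shift the boundary mid-run, they can only read it.
    """
    settlement_date: date
    input_epoch: str               # pinned snapshot id, never "latest"
    source_partitions: tuple       # shard set fixed at planning time
    output_dataset: str            # versioned output path

manifest = BatchManifest(
    settlement_date=date(2026, 4, 5),
    input_epoch="ledger-snapshot-7811",
    source_partitions=(3, 11, 18, 27, 41),
    output_dataset="merchant_settlement_summary/v7811",
)

# Any attempt to mutate the boundary raises, which is exactly the point.
assert manifest.input_epoch == "ledger-snapshot-7811"
```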

It also gives the team a clean answer when reality arrives late. If a bank partner sends a correction file at 03:15 UTC for a transfer that belonged to 2026-04-05, PayLedger does not patch yesterday's published dataset in place. It records the correction, decides whether finance needs a backfill or an adjustment entry, and produces a new versioned output with clear lineage. Without that discipline, "rerun the batch" quietly turns into "rewrite history in ways nobody can audit."

Part 2: Mechanism

Once the manifest exists, the pipeline behaves like a bounded dataflow graph. Each stage has a narrow responsibility, and each task can be retried because its input and output contracts are explicit.

ledger snapshot manifest
  -> shard-local scan and validation
  -> merchant_id shuffle
  -> settlement and reserve aggregation
  -> data-quality checks
  -> atomic publish of versioned output

The shard-local scan stage is where throughput begins. Each worker reads a contiguous slice of ledger files for one shard and one settlement date, filters out incomplete segments, and emits compact intermediate rows such as (merchant_id, gross_inflow, payout_total, reserve_delta, anomalies). Doing this reduction close to storage matters because shipping raw ledger events across the network would multiply shuffle volume for no benefit.
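A sketch of that map-side reduction, assuming each ledger event arrives as a dict (the field names `inflow`, `payout`, `reserve`, and `segment_incomplete` are placeholders for whatever the real ledger schema carries):

```python
from collections import defaultdict

def scan_shard(ledger_rows):
    """Collapse raw ledger events into one compact row per merchant.

    Illustrative sketch: reduces close to storage so only small
    intermediate rows cross the network in the shuffle stage.
    """
    acc = defaultdict(lambda: {"gross_inflow": 0, "payout_total": 0,
                               "reserve_delta": 0, "anomalies": 0})
    for row in ledger_rows:
        m = acc[row["merchant_id"]]
        if row.get("segment_incomplete"):
            m["anomalies"] += 1     # flagged, never silently summed
            continue
        m["gross_inflow"] += row.get("inflow", 0)
        m["payout_total"] += row.get("payout", 0)
        m["reserve_delta"] += row.get("reserve", 0)
    # Emit compact intermediate rows for the merchant_id shuffle.
    return [(mid, v["gross_inflow"], v["payout_total"],
             v["reserve_delta"], v["anomalies"])
            for mid, v in acc.items()]
```

The key property is that output size scales with distinct merchants on the shard, not with raw event volume.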

The shuffle stage then repartitions those intermediate rows by merchant_id so one reducer sees the full settlement picture for a merchant. This is the expensive part of many batch jobs, which is why partition choice matters so much. If one merchant is disproportionately large, that reducer becomes a straggler and the whole job's wall-clock time stretches around it. Throughput-oriented design means planning for skew up front: use pre-aggregation, separate oversized merchants into dedicated partitions, and make reducers spill safely to disk instead of failing the whole run when memory pressure spikes.
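One way to plan for skew is to reserve dedicated partitions for merchants known to be oversized, while everyone else hashes into the base range. A minimal sketch (the `hot_merchants` map and its maintenance are assumptions; real planners typically derive it from prior runs' statistics):

```python
def choose_partition(merchant_id, n_reducers, hot_merchants):
    """Route a merchant's intermediate rows to a reducer partition.

    hot_merchants maps an oversized merchant id to an offset into a
    block of dedicated partitions reserved beyond the base range, so
    a single whale cannot turn one shared reducer into a straggler.
    """
    if merchant_id in hot_merchants:
        return n_reducers + hot_merchants[merchant_id]
    return hash(merchant_id) % n_reducers

# A known-hot merchant lands outside the shared range:
dedicated = choose_partition("whale-merchant", 8, {"whale-merchant": 0})
assert dedicated == 8
# Ordinary merchants stay inside the shared base range:
assert 0 <= choose_partition("small-merchant", 8, {}) < 8
```

Note that in a persisted pipeline the hash would need to be a stable one (e.g. a keyed digest), since Python's built-in `hash` of strings varies across interpreter runs.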

Correctness comes from publication discipline, not from hoping the DAG engine is magical. PayLedger tasks write intermediate output under attempt-specific paths, and the job publishes only after validation succeeds for every required partition:

def publish_batch_run(run):
    # Refuse to publish unless every required partition validated cleanly;
    # a partially successful run must never become the canonical answer.
    if not run.all_required_partitions_succeeded():
        raise RuntimeError("cannot publish incomplete settlement dataset")

    # The version is derived from the pinned input epoch, so reruns of the
    # same manifest converge on the same published name.
    version = f"v{run.input_epoch}"
    write_success_marker(dataset="merchant_settlement_summary", version=version)
    # Flipping the manifest pointer is the single step consumers observe.
    move_manifest_pointer(dataset="merchant_settlement_summary", version=version)

Consumers read the dataset version named by the manifest pointer. They never read half-written partitions. That is the operational difference between "the batch job finished enough tasks" and "the batch output is safe to treat as canonical for this input slice."
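On a filesystem-backed store, the pointer flip can be made atomic with a write-then-rename, so a reader sees either the old version or the new one, never a torn pointer. A sketch under that assumption (object stores need an equivalent, e.g. a conditional put):

```python
import json
import os
import tempfile

def move_pointer_atomically(pointer_path, version):
    """Publish a new version by renaming over the pointer file.

    os.replace is atomic within a single filesystem, so concurrent
    readers never observe a half-written pointer.
    """
    directory = os.path.dirname(pointer_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"version": version}, f)
    os.replace(tmp_path, pointer_path)

def read_published_version(pointer_path):
    """Consumers resolve the canonical dataset through one pointer read."""
    with open(pointer_path) as f:
        return json.load(f)["version"]
```

The temp file lives in the same directory as the pointer so the rename never crosses a filesystem boundary, which would forfeit atomicity.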

Part 3: Implications and Trade-offs

The win is easy to see once the pipeline is running. The support dashboard loads yesterday's settlement summary from one published table instead of triggering a fresh cross-shard fan-out. Finance exports can be regenerated by pointing at the same versioned dataset. Backfills are possible because the system can rerun 2026-04-05 against ledger-snapshot-7811 and compare the resulting checksum to the published version instead of arguing about which mutable source tables changed after the original close.
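The checksum comparison only works if it is insensitive to the order tasks happened to finish in. One way to get that property is to canonicalize rows before hashing, a sketch of which (row shape and serialization are assumptions) looks like:

```python
import hashlib

def dataset_checksum(rows):
    """Order-independent checksum over canonicalized rows.

    Sorting a stable serialization first means a rerun of the same
    input epoch produces the same digest even if its tasks completed
    in a different order.
    """
    digest = hashlib.sha256()
    for serialized in sorted(repr(row) for row in rows):
        digest.update(serialized.encode())
    return digest.hexdigest()

original = [("m-1", 120_00), ("m-2", 75_00)]
rerun = [("m-2", 75_00), ("m-1", 120_00)]   # same rows, different task order
assert dataset_checksum(original) == dataset_checksum(rerun)
```

If the rerun's digest differs from the published one, something other than task ordering changed, and that run merits investigation rather than publication.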

The cost is that freshness becomes scheduled rather than continuous. If an account enters reserve review at 14:03 UTC, the nightly settlement dataset will not reflect that until the next run unless PayLedger also maintains a lower-latency path. Batch pipelines also concentrate failure into fewer, larger jobs. A bad schema change, a skewed partition, or a missing snapshot can delay an entire reporting cycle instead of degrading one request at a time. That trade-off is often worth it, but only when the product contract admits stale-by-design outputs and the team invests in manifests, lineage, validation, and atomic publish semantics.

This also sets up the next lesson cleanly. Once the business asks for "the same derived view, but updated every few seconds" and still expects late corrections to land in the right logical window, the assumptions that make batch attractive start to break. The input is no longer bounded, scheduling overhead matters more, and event-time questions appear. That is the handoff into 026.md.

