LESSON
Day 497: Batch Processing and Throughput-Oriented Pipelines
The core idea: A batch pipeline wins by fixing the input boundary, pushing large amounts of work through deterministic stages, and publishing one versioned result set, trading freshness for throughput, restartability, and predictable cost.
Today's "Aha!" Moment
In 024.md, PayLedger learned the hard way that repeated cross-shard reads are a bad fit for interactive paths. The merchant-operations dashboard wanted an exact summary of holds, available balance, failed payouts, and reserve flags, but serving that page by fan-out made latency track the slowest shard and made availability fall with every extra shard involved. The next move was not to keep tuning the coordinator. It was to stop asking the request path to recompute the same answer over and over.
That is where batch processing becomes concrete. At 01:00 UTC, PayLedger now launches a settlement-summary pipeline for the previous business day. The job reads a fixed snapshot of shard-local ledger entries, groups them by merchant, computes payout eligibility and reserve exposure, and publishes a versioned dataset that the dashboard and finance export both consume. The important shift is conceptual: the system is no longer trying to answer "what is the latest state right now?" on every request. It is answering "what was the reconciled state for this bounded slice of input?" once, then reusing the result many times.
The misconception to remove is that batch is just "streaming, but slower" or "a cron job that scans a table." A throughput-oriented batch pipeline has a different contract. It assumes bounded input, chooses work decomposition to maximize sequential scans and parallel aggregation, and treats retries as normal rather than exceptional. If the contract is designed well, rerunning one failed task does not change the published answer. If the contract is vague, backfills double count money, partial reruns leak mixed versions, and finance stops trusting the output.
Why This Matters
Teams reach for batch when the same expensive computation has to be repeated over a large dataset and freshness measured in minutes or hours is acceptable. PayLedger fits that profile exactly. Merchant settlement summaries, daily reserve calculations, and end-of-day reconciliation all need global views over many shard-local facts, but none of them need millisecond freshness. Recomputing them synchronously in the request path would spend CPU and network on the same join and aggregation work thousands of times per day.
Batch changes the economics by paying the coordination cost once per input slice instead of once per user request. Large sequential reads are usually cheaper than many small random reads, local aggregation before shuffle reduces network cost, and a published dataset can serve many downstream consumers. The trade-off is equally real: the answer is only as fresh as the schedule and the watermark the job used. If a support agent needs the state as of this second, a nightly dataset is the wrong tool. If finance needs a reproducible statement for yesterday's books, a versioned batch output is exactly the right one.
Core Walkthrough
Part 1: Grounded Situation
PayLedger stores authoritative events on the shard that owns each funding account. During the day, those events drive customer-visible balance checks and payout decisions. After the business day closes, finance needs something different: one reconciled record per merchant showing net inflow, settled payouts, chargeback reserves, and any account whose replica lag or missing ledger segments make the books incomplete.
The naive approach is to run a huge SQL query over the live system every night. That usually fails for boring reasons before it fails for interesting ones. It competes with foreground traffic, it mixes records written before and after the reporting cutoff, and it is hard to retry safely because the underlying data can change while the job is in progress. A real batch pipeline starts by freezing the input boundary.
For PayLedger, the scheduler first creates an immutable manifest:
settlement_date = 2026-04-05
input_epoch = ledger-snapshot-7811
source_partitions = shards 03, 11, 18, 27, 41
output_dataset = merchant_settlement_summary/v7811
Every later stage refers to that manifest, not to "whatever is in the database now." That one decision turns retries and backfills from guesswork into something a production team can reason about.
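As a minimal sketch, the manifest can be nothing more fancy than a frozen record whose fields mirror the example above; the class and field names here are illustrative, not PayLedger's actual schema:

import hashlib
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class BatchManifest:
    # Immutable description of one run: its input boundary and its output target.
    settlement_date: date
    input_epoch: str              # e.g. "ledger-snapshot-7811"
    source_partitions: tuple      # shard ids frozen at schedule time
    output_dataset: str           # e.g. "merchant_settlement_summary/v7811"

manifest = BatchManifest(
    settlement_date=date(2026, 4, 5),
    input_epoch="ledger-snapshot-7811",
    source_partitions=(3, 11, 18, 27, 41),
    output_dataset="merchant_settlement_summary/v7811",
)

Because the record is frozen at schedule time, every task, retry, and backfill can be handed the same object and nothing downstream has a reason to consult the live database to decide what "the input" means.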
It also gives the team a clean answer when reality arrives late. If a bank partner sends a correction file at 03:15 UTC for a transfer that belonged to 2026-04-05, PayLedger does not patch yesterday's published dataset in place. It records the correction, decides whether finance needs a backfill or an adjustment entry, and produces a new versioned output with clear lineage. Without that discipline, "rerun the batch" quietly turns into "rewrite history in ways nobody can audit."
Part 2: Mechanism
Once the manifest exists, the pipeline behaves like a bounded dataflow graph. Each stage has a narrow responsibility, and each task can be retried because its input and output contract are explicit.
ledger snapshot manifest
-> shard-local scan and validation
-> merchant_id shuffle
-> settlement and reserve aggregation
-> data-quality checks
-> atomic publish of versioned output
The shard-local scan stage is where throughput begins. Each worker reads a contiguous slice of ledger files for one shard and one settlement date, filters out incomplete segments, and emits compact intermediate rows such as (merchant_id, gross_inflow, payout_total, reserve_delta, anomalies). Doing this reduction close to storage matters because shipping raw ledger events across the network would multiply shuffle volume for no benefit.
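A sketch of that map-side reduction, assuming each worker can iterate complete ledger segments for its shard and date; the segment and event interfaces (is_complete, events, merchant_id, kind, amount) are assumptions for illustration:

from collections import defaultdict

def scan_shard(segments):
    # Map-side reduction: collapse raw ledger events into one compact row per
    # merchant before anything crosses the network.
    rows = defaultdict(lambda: {"gross_inflow": 0, "payout_total": 0,
                                "reserve_delta": 0, "anomalies": 0})
    for segment in segments:
        if not segment.is_complete():
            continue  # incomplete segments are excluded and flagged upstream
        for event in segment.events:
            row = rows[event.merchant_id]
            if event.kind == "charge":
                row["gross_inflow"] += event.amount
            elif event.kind == "payout":
                row["payout_total"] += event.amount
            elif event.kind == "reserve":
                row["reserve_delta"] += event.amount
            else:
                row["anomalies"] += 1
    # Emit (merchant_id, aggregates) pairs: shuffle volume is now proportional
    # to the number of merchants on the shard, not the number of raw events.
    return list(rows.items())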
The shuffle stage then repartitions those intermediate rows by merchant_id so one reducer sees the full settlement picture for a merchant. This is the expensive part of many batch jobs, which is why partition choice matters so much. If one merchant is disproportionately large, that reducer becomes a straggler and the whole job's wall-clock time stretches around it. Throughput-oriented design means planning for skew up front: use pre-aggregation, separate oversized merchants into dedicated partitions, and make reducers spill safely to disk instead of failing the whole run when memory pressure spikes.
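One way to plan for skew is to hash most merchants normally while giving known heavy merchants dedicated partitions; a hedged sketch, where the heavy-merchant list and partition counts are assumptions rather than PayLedger's real configuration:

def choose_partition(merchant_id, num_partitions, heavy_merchants):
    # Known oversized merchants get dedicated partitions so one giant key
    # cannot turn a single reducer into a straggler for the whole run.
    if merchant_id in heavy_merchants:
        return num_partitions + heavy_merchants[merchant_id]
    # A production job would use a stable hash so placement is reproducible
    # across runs; Python's built-in hash() is only for illustration.
    return hash(merchant_id) % num_partitions

# 64 regular partitions plus dedicated slots for two known-heavy merchants.
heavy = {"merchant_mega_01": 0, "merchant_mega_02": 1}
print(choose_partition("merchant_4711", 64, heavy))     # lands somewhere in 0..63
print(choose_partition("merchant_mega_01", 64, heavy))  # dedicated partition 64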
Correctness comes from publication discipline, not from hoping the DAG engine is magical. PayLedger tasks write intermediate output under attempt-specific paths, and the job publishes only after validation succeeds for every required partition:
def publish_batch_run(run):
    # Refuse to publish unless every required partition validated successfully.
    if not run.all_required_partitions_succeeded():
        raise RuntimeError("cannot publish incomplete settlement dataset")
    # The output version is derived from the immutable input epoch, so a rerun
    # over the same snapshot targets the same version name.
    version = f"v{run.input_epoch}"
    write_success_marker(dataset="merchant_settlement_summary", version=version)
    # Swapping the manifest pointer is the single step that makes the new
    # version visible to consumers.
    move_manifest_pointer(dataset="merchant_settlement_summary", version=version)
Consumers read the dataset version named by the manifest pointer. They never read half-written partitions. That is the operational difference between "the batch job finished enough tasks" and "the batch output is safe to treat as canonical for this input slice."
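On the consumer side, the dashboard resolves the pointer first and then reads only the version it names; a minimal sketch, with read_pointer, success_marker_exists, and read_partitions standing in for whatever storage client PayLedger actually uses:

def load_settlement_summary(dataset="merchant_settlement_summary"):
    # Resolve the current pointer, then read only the version it names.
    version = read_pointer(dataset)             # e.g. "v7811"
    if not success_marker_exists(dataset, version):
        raise RuntimeError(f"{dataset}/{version} has no success marker")
    return read_partitions(dataset, version)    # never touches in-flight attempts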
Part 3: Implications and Trade-offs
The win is easy to see once the pipeline is running. The support dashboard loads yesterday's settlement summary from one published table instead of triggering a fresh cross-shard fan-out. Finance exports can be regenerated by pointing at the same versioned dataset. Backfills are possible because the system can rerun 2026-04-05 against ledger-snapshot-7811 and compare the resulting checksum to the published version instead of arguing about which mutable source tables changed after the original close.
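A backfill check can then be as simple as recomputing the same input slice and comparing digests; a sketch under the assumption that each published version stores a content checksum alongside its success marker, and that result rows are plain tuples so sorting is well-defined:

import hashlib

def verify_backfill(dataset, version, recomputed_rows):
    # Deterministic digest over sorted rows so row ordering cannot change the result.
    digest = hashlib.sha256()
    for row in sorted(recomputed_rows):
        digest.update(repr(row).encode("utf-8"))
    # read_published_checksum is assumed to return the digest stored at publish time.
    return digest.hexdigest() == read_published_checksum(dataset, version)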
The cost is that freshness becomes scheduled rather than continuous. If an account enters reserve review at 14:03 UTC, the nightly settlement dataset will not reflect that until the next run unless PayLedger also maintains a lower-latency path. Batch pipelines also concentrate failure into fewer, larger jobs. A bad schema change, a skewed partition, or a missing snapshot can delay an entire reporting cycle instead of degrading one request at a time. That trade-off is often worth it, but only when the product contract admits stale-by-design outputs and the team invests in manifests, lineage, validation, and atomic publish semantics.
This also sets up the next lesson cleanly. Once the business asks for "the same derived view, but updated every few seconds" and still expects late corrections to land in the right logical window, the assumptions that make batch attractive start to break. The input is no longer bounded, scheduling overhead matters more, and event-time questions appear. That is the handoff into 026.md.
Failure Modes and Misconceptions
- "Batch jobs are just slow online queries." They should not be. A proper batch pipeline runs against a fixed input boundary, uses stage-local contracts, and publishes a named output version rather than reading and writing live mutable state opportunistically.
- "Retries guarantee correctness." Retries only help when tasks are idempotent relative to an immutable manifest. If a rerun reads a newer snapshot or appends into a shared output path, retries can create duplicates and mixed-version datasets.
- "More workers always means higher throughput." Extra workers help only until shuffle contention, skew, or storage bandwidth becomes dominant. Beyond that point, the cluster gets more expensive without finishing earlier.
- "Batch is fine for anything that is not strictly real time." The real question is whether the consumer can tolerate the freshness interval and whether the cost of waiting is lower than the cost of maintaining a continuous pipeline.
- "Publishing partitions as they finish is good enough." It is good enough only if every consumer understands partial results. For canonical reporting datasets, atomic publish is what prevents half-complete runs from becoming accidental truth.
Connections
- 024.md explains why PayLedger needed this move at all. Cross-shard fan-out made the request path too expensive for repeated merchant-wide reads, so the system shifted that work into a bounded offline pipeline.
- 026.md is the natural follow-on. Batch keeps the input bounded and the cost predictable; stream processing keeps the derived view fresher when the business cannot wait for the next scheduled run.
- Data warehouses and search indexing pipelines use the same structural pattern: freeze an input set, do as much reduction as possible before the expensive exchange step, validate aggressively, and publish a versioned output that many readers can reuse.
Resources
- [BOOK] Designing Data-Intensive Applications
- Focus: Revisit the chapters on batch processing, derived data, and stream processing together; they explain why bounded recomputation is often the simplest way to make large-scale data products reproducible.
- [PAPER] MapReduce: Simplified Data Processing on Large Clusters
- Focus: Pay attention to the original execution model: input splits, local map work, shuffle, and deterministic reduce output. It remains the clearest explanation of why throughput-oriented systems are organized around retries and large sequential work.
- [DOC] Apache Beam Programming Guide
- Focus: Read the sections on bounded collections and windowing boundaries to see how one programming model can still distinguish batch-style bounded inputs from continuously arriving data.
- [DOC] Airflow Best Practices
- Focus: The useful production detail is task design, not orchestration syntax. Look for the guidance on idempotency, partition-scoped tasks, and keeping DAG runs reproducible.
Key Takeaways
- Batch pipelines work by turning "compute over live state" into "compute over a named input slice." That boundary is what makes retries, audits, and backfills tractable.
- Throughput comes from pushing reduction close to storage and treating shuffle as a scarce step. Cluster size helps only after partitioning and skew are under control.
- The publish step is part of correctness. A versioned output plus an atomic pointer swap is what makes downstream readers see one coherent dataset instead of a half-finished run.
- The central trade-off is throughput and reproducibility versus freshness. When freshness becomes the dominant requirement, batch gives way to streaming rather than simply scaling harder.