LESSON
Day 444: Vectorized Execution and CPU Cache Efficiency
The core idea: Once a distributed plan has put the right rows on the right worker, the next bottleneck is usually CPU time spent materializing and branching over individual tuples. Vectorized execution changes the unit of work from one row at a time to cache-sized batches, so the executor spends more cycles doing useful comparisons and fewer cycles chasing pointers and missing cache lines.
Today's "Aha!" Moment
In 042.md, Harbor Point's compliance query finally got a sensible join plan: issuer_limits was broadcast because it was small, and each worker kept its local shard of reservations. The team expected that fix to solve the latency problem. It did remove the big network shuffle, but the query still ran hot on CPU. Worker profiles showed millions of tiny function calls, branch-heavy predicate checks, and repeated tuple reconstruction just to discard most rows a few instructions later.
That is where vectorized execution changes the picture. Instead of asking the executor to pull "the next row" through every operator, the engine processes a batch such as 1,024 values from status, trading_day, issuer_id, and notional_usd at once. The filter creates a selection vector saying which positions survived, the projection reads only the surviving positions, and the aggregate updates a small hash table in tight loops over contiguous memory. The important win is not "SIMD magic" by itself. The bigger win is that the CPU sees predictable loops over data that is already close together in memory.
That corrects a common misconception. Vectorized execution is not just "the engine uses AVX instructions now." SIMD helps when the data and operators cooperate, but the architectural shift is broader: fewer virtual calls, less tuple materialization, better cache locality, and the option to defer wide or variable-length columns until the query actually needs them. A good distributed plan can still waste a core if the local executor treats every row like a separate object graph.
Why This Matters
Harbor Point's morning exposure report scans tens of millions of open reservation rows, joins them to a small limit table, filters out inactive desks, and groups the survivors by issuer. By the time the plan reaches one worker, the interesting question is no longer "which node owns this row?" The question is "how many useful predicate checks and aggregate updates can this core finish before its cache and branch predictor become the bottleneck?"
In a row-at-a-time executor, each operator repeatedly asks its child for one tuple, decodes fields, evaluates a predicate, constructs another tuple, and passes it onward. That design is simple and composable, but modern CPUs dislike it. The executor touches more memory than necessary, follows more pointers, and pays interpreter overhead on every row. When concurrency rises, those inefficiencies turn into tail-latency spikes because each query ties up a worker for longer.
Vectorized execution makes the same query plan more cache-friendly. The filter can scan a compact batch of status values, produce a mask or selection vector, and avoid touching comment_text or other cold columns entirely. Aggregation can update hash table entries for a dense set of keys before the working set falls out of L1 or L2 cache. The result is not just better benchmark throughput. It is also more predictable CPU consumption, which matters directly for scheduler fairness, queueing, and the admission-control decisions in 044.md.
The trade-offs are real. Batches that are too small throw away the locality benefit. Batches that are too large can hurt latency to first row, increase temporary memory pressure, and overflow caches instead of fitting inside them. Some operators, especially complex UDFs or highly irregular string processing, do not vectorize cleanly. Production engines succeed with vectorization only when they treat CPU caches, memory layout, and operator boundaries as first-class design constraints.
Learning Objectives
By the end of this session, you will be able to:
- Explain why row-at-a-time execution leaves CPU performance on the table - Describe how tuple materialization, pointer chasing, and branch-heavy loops waste cache bandwidth after data placement is already solved.
- Trace the mechanics of a vectorized pipeline - Follow how batches, selection vectors, and late materialization move through a real query on a worker node.
- Evaluate vectorization trade-offs in production - Judge when larger batches improve throughput, when they hurt latency or memory behavior, and why this matters for cluster-level workload governance.
Core Concepts Explained
Concept 1: The CPU does not experience a query plan as SQL; it experiences it as memory access patterns
Harbor Point's exposure query looks simple at the SQL layer: scan reservations, keep only today's open rows, join to issuer_limits, and sum notional_usd by issuer. On a worker, though, the CPU never sees SQL. It sees loads from memory, branches, hash lookups, and stores back into caches. That is why a row-at-a-time executor often disappoints even when the optimizer picked a good join strategy in the previous lesson.
In the classic iterator model, every operator exposes a next() interface. The scan emits one tuple, the filter checks one tuple, the join probes one tuple, and the aggregate updates state for one tuple. That modularity is elegant, but it creates overhead at every row boundary. Each row may be wrapped in a tuple object or slot array, fields may be decoded repeatedly, and the CPU has fewer chances to predict branches because each predicate decision is isolated.
CPU cache behavior is the hidden constraint. A cache line typically brings in 64 bytes at a time. If the executor stores a row as a mixed object containing hot fields, cold fields, pointers, and null metadata, fetching status may also drag unrelated bytes into cache. Worse, the next row may live somewhere else entirely. The worker spends time waiting on memory and reloading metadata instead of comparing values.
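A back-of-envelope sketch makes that gap concrete. The numbers below are illustrative assumptions, not measurements: a 64-byte cache line, a 1-byte dictionary code for status, and a roughly 100-byte row in the row-oriented layout.

CACHE_LINE_BYTES = 64          # typical cache line size (assumed)
ROWS = 1024                    # rows the filter examines

# Columnar layout: status stored as contiguous 1-byte dictionary codes,
# so the filter streams through only the bytes it actually compares.
columnar_lines = (ROWS * 1) // CACHE_LINE_BYTES        # about 16 cache lines

# Row layout: checking status drags each ~100-byte row through the cache,
# hot and cold fields alike.
ROW_BYTES = 100
row_lines = (ROWS * ROW_BYTES) // CACHE_LINE_BYTES     # about 1,600 cache lines

print(columnar_lines, row_lines)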
Vectorized execution tries to keep the active working set inside cache by switching to small columnar batches:
Batch 12 on worker-7
issuer_id [A12, A12, B04, C99, ...]
status [open, open, cxl, open, ...]
trading_day [2026-04-02, 2026-04-02, 2026-04-02, 2026-04-01, ...]
notional_usd [5M, 2M, 1M, 8M, ...]
positions [0, 1, 2, 3, ... 1023]
Now the filter can run one tight loop over status and trading_day, mark surviving positions, and leave the rest of the columns untouched until needed. The CPU sees contiguous arrays instead of scattered row objects. Even without explicit SIMD code, that change often improves instruction cache use, data cache locality, and branch predictability enough to matter.
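One way to picture that batch in code is a struct of arrays. This is a minimal sketch, assuming NumPy arrays stand in for the engine's internal vectors; the field names mirror the columns shown above and are not a real engine's API.

import numpy as np
from dataclasses import dataclass

@dataclass
class ReservationBatch:
    # One contiguous array per column; every array shares the same positional index.
    issuer_id: np.ndarray      # fixed-width ids or dictionary codes
    status: np.ndarray         # small integer codes standing in for "open", "cxl", ...
    trading_day: np.ndarray    # trading days encoded as integers
    notional_usd: np.ndarray   # packed numeric values
    count: int                 # number of valid positions in the batch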
Concept 2: A vectorized pipeline moves batches, masks, and selection vectors instead of fully materialized rows
Return to Harbor Point's worker after the broadcast join from 042.md. The local executor receives batches from the reservations scan and a compact in-memory hash table built from issuer_limits. The first operator reads only the columns needed for the filter: status, trading_day, and perhaps desk_state. It produces a selection vector of the surviving positions:
Input positions: [0, 1, 2, 3, 4, 5, 6, 7]
status == "open": [1, 1, 0, 1, 1, 0, 1, 1]
trading_day == today: [1, 1, 1, 0, 1, 1, 1, 1]
selection vector: [0, 1, 4, 6, 7]
The next operator does not need to copy those surviving rows into a new table immediately. It can carry the selection vector forward and interpret "row 4 of this batch is live" without moving all the underlying values. That is one of the quiet advantages of vectorization: the engine can avoid materialization work until it reaches an operator that truly needs a compact output layout.
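As a sketch of that step, assuming the ReservationBatch layout above and a hypothetical dictionary code for "open", the filter is two array comparisons and one call that extracts the surviving positions:

import numpy as np

STATUS_OPEN = 1  # hypothetical dictionary code for "open"

def build_selection_vector(batch, today):
    # Evaluate both predicates over whole columns, then keep only the positions
    # where every predicate passed, e.g. [0, 1, 4, 6, 7].
    keep = (batch.status == STATUS_OPEN) & (batch.trading_day == today)
    return np.flatnonzero(keep)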
Late materialization matters when a table has wide columns. Harbor Point stores free-form compliance notes alongside numeric facts. The report does not need those notes, so a vectorized scan can leave that varlen column on the side while filtering and grouping on narrow columns first. A row executor tends to pay more decoding cost up front because each tuple boundary encourages early assembly of a full logical row.
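Late materialization then amounts to touching the wide column only for the surviving positions, and only if a later operator asks for it. A minimal sketch, assuming the notes column is already available as an ordinary Python sequence:

def late_materialize_notes(comment_text_column, selection):
    # Gather the varlen values only for positions that survived the filter;
    # the rest of the column never enters the hot path.
    return [comment_text_column[i] for i in selection]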
The batch loop also gives the engine a natural place to use SIMD-friendly and branch-light code paths. Comparing 1,024 status values to the dictionary code for "open" is something compilers and handcrafted kernels can optimize well. The same is true for null checks via bitmaps and for arithmetic over packed numeric arrays. The executor does not need every operator to become heroic assembly; it needs enough of the hot path to operate over simple vectors.
One simplified sketch looks like this:
def process_batch(batch, limits_hash_table, today):
    # Pass 1: evaluate the filter over adjacent values and record the survivors.
    # OPEN stands in for the dictionary code of the "open" status value.
    selected = []
    for i in range(batch.count):
        if batch.status[i] == OPEN and batch.trading_day[i] == today:
            selected.append(i)
    # Pass 2: probe the broadcast limit table and update the per-issuer aggregate
    # only for the surviving positions.
    for i in selected:
        limit = limits_hash_table.get(batch.issuer_id[i])
        if limit is not None:
            aggregate(batch.issuer_id[i], batch.notional_usd[i], limit)
Real engines optimize this much further, but the structure is the key point: run the same operation over many adjacent values, then carry only the surviving positions to the next step.
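As one illustration of that direction, the same structure can be written over whole arrays. This is a hedged sketch, assuming NumPy as a stand-in for the engine's kernels; the aggregation stays a plain dictionary loop for clarity.

import numpy as np

OPEN = 1  # placeholder dictionary code for "open", as in the sketch above

def process_batch_vectorized(batch, limits_by_issuer, today, sums):
    # Evaluate the predicate over contiguous columns instead of row by row.
    keep = (batch.status == OPEN) & (batch.trading_day == today)
    sel = np.flatnonzero(keep)                        # selection vector of survivors

    # Gather only the surviving positions of the columns the aggregate needs.
    for issuer, amount in zip(batch.issuer_id[sel], batch.notional_usd[sel]):
        if issuer in limits_by_issuer:                # broadcast limit table lookup
            sums[issuer] = sums.get(issuer, 0.0) + float(amount)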
Concept 3: Vectorization is a throughput strategy with clear failure modes, not a free speed button
Harbor Point's team could take the wrong lesson here and assume that "bigger batch equals faster query." In practice, vectorization is a negotiation with the memory hierarchy. A batch should be large enough to amortize function-call and interpretation overhead, but small enough that hot columns, hash tables, and selection vectors stay cache-resident. Engines therefore pick batch sizes such as a few hundred to a few thousand rows, then tune around operator behavior and data types.
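A rough way to sanity-check a batch size against the cache is to add up the hot bytes per row. The column widths and the cache budget below are assumptions for illustration, not engine defaults.

# Hot columns touched by the filter and aggregate, bytes per row (assumed widths):
HOT_BYTES_PER_ROW = 1 + 4 + 8 + 8      # status, trading_day, issuer_id, notional_usd
SELECTION_BYTES = 4                    # one 32-bit position per row, worst case
CACHE_BUDGET = 1 * 1024 * 1024         # illustrative per-core budget (~1 MiB)

def fits_in_cache(batch_rows):
    working_set = batch_rows * (HOT_BYTES_PER_ROW + SELECTION_BYTES)
    return working_set <= CACHE_BUDGET

print(fits_in_cache(1_024))        # True: about 25 KB, comfortably cache-resident
print(fits_in_cache(1_000_000))    # False: about 25 MB, spills out of cache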
Selective filters create another trade-off. If only 2 out of 1,024 rows survive, carrying a sparse selection vector through many downstream operators can become inefficient. Some engines periodically compact the survivors into a new dense vector; others keep the indirection because copying would cost more than it saves. There is no universal answer. The right choice depends on selectivity, operator mix, and whether the next operator needs random access or streaming access.
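Compaction itself is just a gather into new dense arrays. A minimal sketch, reusing the ReservationBatch layout from Concept 1:

def compact(batch, selection):
    # Copy only the survivors into dense arrays so downstream operators can
    # stream over them without indirecting through the selection vector.
    return ReservationBatch(
        issuer_id=batch.issuer_id[selection],
        status=batch.status[selection],
        trading_day=batch.trading_day[selection],
        notional_usd=batch.notional_usd[selection],
        count=len(selection),
    )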
Variable-length data is the usual spoiler. Strings, JSON blobs, and nested structures break the illusion that every column is a neat fixed-width array. Engines respond with dictionary vectors, offset buffers, and separate varlen areas so they can still keep the hot control path compact. But once a query starts doing heavy string normalization or UDF calls on those values, the benefits of vectorization shrink because the workload becomes dominated by pointer chasing or opaque function calls again.
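Dictionary encoding is one of those responses: store each distinct string once and keep fixed-width codes on the hot path. A minimal sketch, assuming the column has low cardinality:

import numpy as np

def dictionary_encode(values):
    # One copy of each distinct string, plus a small integer code per row,
    # so filters compare fixed-width codes instead of variable-length bytes.
    dictionary, codes = np.unique(np.asarray(values), return_inverse=True)
    return dictionary, codes.astype(np.int32)

dictionary, codes = dictionary_encode(["open", "open", "cxl", "open"])
# dictionary: ['cxl' 'open'], codes: [1, 1, 0, 1]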
The production consequence is broader than single-query speed. A well-vectorized engine can retire more rows per CPU second, which is excellent until too many such queries run at once and saturate shared caches or memory bandwidth. That is why the next lesson, 044.md, matters. Admission control is not only about protecting disks or locks. It is also about preventing "efficient" vectorized scans from collectively destroying CPU locality for the whole cluster.
Troubleshooting
Issue: A query plan looks healthy, but worker CPU is high and instructions per cycle stay low.
Why it happens / is confusing: The optimizer solved data movement, but the local executor is still row-oriented or repeatedly materializing wide tuples. The cluster appears CPU-bound, yet the cores are mostly stalled on memory and branch mispredictions.
Clarification / Fix: Profile at the operator level. Look for tuple decoding, virtual dispatch, and cache-miss-heavy scans. Narrow the projected columns earlier, prefer late materialization where the engine supports it, and verify that hot operators are actually using vectorized kernels.
Issue: Enabling vectorization helps scan speed but hurts latency to first row.
Why it happens / is confusing: The engine now waits to fill a batch before pushing work forward, so startup latency can rise even while total CPU time falls.
Clarification / Fix: Distinguish throughput queries from interactive queries. Smaller batches or hybrid execution modes can improve first-row latency for dashboards, while larger batches remain appropriate for long analytical scans.
Issue: Highly selective filters still run slowly after vectorization.
Why it happens / is confusing: Sparse selection vectors, repeated compaction, or downstream operators that cannot handle indirection efficiently can eat the expected gains.
Clarification / Fix: Inspect where the batch stops being dense. If a selective predicate appears early, let the engine compact once before expensive downstream work, or restructure the plan so the highly selective filter runs before wide projections and joins.
Advanced Connections
Connection 1: 042.md decides which rows arrive together; this lesson decides how efficiently a worker uses them
The previous lesson focused on exchange strategy: broadcast, repartition, or colocated join execution. Vectorized execution starts after that routing decision is made. It asks how the worker processes the rows it already owns without turning each tuple boundary into overhead.
Connection 2: 044.md turns local executor efficiency into a cluster policy problem
Once vectorized execution raises per-query throughput, the next operational question is how many such queries should run at once. Admission control and workload governance exist partly because CPU caches and memory bandwidth are shared resources, not isolated per query.
Resources
Optional Deepening Resources
- [PAPER] MonetDB/X100: Hyper-Pipelining Query Execution - Boncz, Zukowski, and Nes (CIDR 2005)
- Link: https://www.cidrdb.org/cidr2005/papers/P19.pdf
- Focus: Read Sections 3 and 4 for the original argument that cache-conscious vectors outperform tuple-at-a-time interpretation on analytical workloads.
- [DOC] DuckDB Documentation: Execution Format
- Link: https://duckdb.org/docs/stable/internals/vector.html
- Focus: See how a modern analytical engine represents flat, dictionary, and constant vectors while keeping batch processing explicit.
- [DOC] Apache Arrow Columnar Format
- Link: https://arrow.apache.org/docs/format/Columnar.html
- Focus: Connect columnar memory layout, padding, and contiguous buffers to the cache-friendly execution patterns described in this lesson.
- [DOC] Velox Documentation: Vectors
- Link: https://facebookincubator.github.io/velox/develop/vectors.html
- Focus: Study how dictionary vectors and encodings let a production execution engine avoid copying data while carrying selection-style indirection through operators.
Key Insights
- A fast distributed plan can still be CPU-inefficient locally - Broadcast or repartition decisions only decide where rows meet; vectorization decides whether the worker burns cycles on useful comparisons or on row-by-row overhead.
- Selection vectors and late materialization are as important as SIMD - The major win comes from avoiding unnecessary decoding and copying, not from treating "vectorized" as a synonym for one specific instruction set.
- Vectorization changes capacity planning - Better per-core throughput is valuable, but it also makes shared CPU caches and memory bandwidth a cluster-level resource that admission control must protect.