Database Engine Internals and Implementation · Lesson 010 · 30 min · Advanced

Day 387: Record Formats and Variable-Length Columns

The core idea: A row format turns one tuple body into a decoding contract: which columns are present, where fixed-width values live, how variable-length values are located, and when long data must move out of line.

Today's "Aha!" Moment

In 02.md, you saw how a slotted page keeps a stable record identity even while tuple bytes move around. That still leaves a harder question: once slot 7 points at a tuple body, how does the engine know what those bytes mean? Harbor Point's municipal_quotes table makes that question concrete. Every row has fixed-width fields such as quote_id, cusip, bid_cents, ask_cents, and quote_ts. Some rows also carry an optional desk_note, and a smaller subset carries a much larger vendor_payload_json blob from the market-data feed.

If the engine reserved the maximum possible space for every row, page density would collapse. If it stored raw bytes back to back with no header, it would not know whether a missing desk_note is NULL, an empty string, or a corrupted record tail. The row format exists to solve exactly that problem. It tells the executor and recovery code where the fixed fields begin, which nullable columns are actually present, how to find each variable-length value, and whether a long attribute is stored inline or behind an overflow pointer.

That means record format is not an afterthought after page layout. It is the page's inner contract. A slotted page tells the system which tuple it is looking at. The tuple format tells the system how to interpret the bytes inside that tuple without ambiguity.

The production consequence shows up quickly. When Harbor Point's feed handler starts attaching longer desk notes during a storm-driven selloff, the change is not just "strings got bigger." Row width changes page occupancy, buffer-pool pressure, WAL volume, and update behavior. That is why this lesson naturally bridges 02.md and 04.md: the shape of one row determines how efficiently whole pages move through memory.

Why This Matters

Suppose Harbor Point spends most mornings doing two things at once: traders append fresh quotes, and analysts scan the latest page range for widening spreads. On Monday the average desk_note is absent and pages stay dense. On Tuesday a volatile market causes many rows to include 150-byte notes plus vendor metadata. The logical schema did not change, but the physical workload did. The same "fetch the latest 10,000 quotes" query may now touch far more pages because each page holds fewer rows.

There is also a correctness angle that does not show up in a schema diagram. Risk logic may treat desk_note IS NULL as "no human override," while desk_note = '' means "override present but blank." Recovery code must know exactly how many bytes belong to each field after a crash. Replication and checksum validation depend on deterministic decoding. If the row format is vague, the engine can produce silent semantic errors long before anyone notices a corrupted page.

Once you understand record layout, storage behavior stops feeling mysterious. You can explain why adding one nullable text column barely changes some rows but makes others spill to overflow storage, why a "small" text update rewrites much more than expected, and why the next lesson's buffer-pool policies care deeply about tuple width rather than just page count.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain what a row format must encode - Describe how headers, null metadata, fixed-width fields, and variable-length metadata let the engine decode a tuple correctly.
  2. Trace how a variable-length row is read and updated - Follow the path from slot entry to tuple header to inline bytes or overflow pointers.
  3. Evaluate storage-engine trade-offs around wide attributes - Connect row layout choices to page density, update cost, and memory pressure in production.

Core Concepts Explained

Concept 1: A record format is the tuple-level contract inside a slotted page

Return to Harbor Point's quote page from 02.md. The slot directory can tell the engine that slot 7 currently points to byte offset 8016 with a tuple length of 96 bytes. That is enough to find the tuple body, but not enough to interpret it. The engine still needs to answer a series of questions: which columns are NULL, where is quote_ts, how long is desk_note, and does vendor_payload_json live inside this tuple or somewhere else?

That is why most row stores begin each tuple with a small header. The exact fields differ by engine, but the header usually carries some combination of tuple length, status bits, visibility metadata, and a header-size marker that tells the decoder where the fixed-width area begins. Immediately after that, many engines place a null bitmap: one bit per nullable column. A bitmap sounds minor until you consider semantics. NULL is not the same as an empty string, zero, or an absent delimiter. The storage layer must preserve that distinction because SQL semantics, indexes, and recovery rules all depend on it.

After the header and bitmap comes a fixed-width region for columns whose size is known in advance, followed by metadata or bytes for variable-length attributes. One common design stores fixed-width columns at predictable offsets and appends variable-length payloads at the end. Another stores small length prefixes beside each variable field. Both approaches are trying to preserve the same decoding invariant: given the tuple bytes and schema metadata, there must be exactly one correct interpretation.

slot 7 tuple body
+---------------------------------------------------+
| header | null bitmap | padding                    |
| quote_id | cusip | bid_cents | ask_cents | ts     |
| note offset/len | payload offset/len or pointer   |
| "storm widening after 10:32 auction"              |
| {"venue":"MUNI-X","confidence":0.71}              |
+---------------------------------------------------+

Padding belongs in that picture too. Engines often align 4-byte and 8-byte values so the CPU can load them efficiently and so the decoder can compute offsets predictably. Padding wastes a few bytes per row, but it keeps hot fields cheap to read. As soon as you care about scanning millions of records, that trade-off becomes rational rather than cosmetic.
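The layout in that picture can be made concrete with a toy encoder and a matching field reader. Everything here is invented for illustration: the field widths, the single null-bitmap byte, and the (offset, length) descriptor scheme are a sketch of the general pattern, not any real engine's format.

```python
import struct

# Hypothetical fixed-width portion of the schema: quote_id (u32),
# bid_cents (i32), ask_cents (i32), quote_ts (u64). Names and layout
# are illustrative, not taken from any specific engine.
FIXED_FMT = "<IiiQ"  # little-endian, 4 + 4 + 4 + 8 = 20 bytes

def encode_tuple(quote_id, bid, ask, ts, desk_note, payload):
    varlens = [desk_note, payload]
    # One bit per nullable column: a set bit means the value is NULL.
    null_bitmap = 0
    for i, v in enumerate(varlens):
        if v is None:
            null_bitmap |= 1 << i
    fixed = struct.pack(FIXED_FMT, quote_id, bid, ask, ts)
    # Variable-length region: (offset, length) descriptors, then the bytes.
    # Offsets are relative to the start of the tuple body.
    header_len = 1 + struct.calcsize(FIXED_FMT) + 4 * len(varlens)
    descriptors, body, cursor = b"", b"", header_len
    for v in varlens:
        data = b"" if v is None else v.encode()
        descriptors += struct.pack("<HH", cursor, len(data))
        body += data
        cursor += len(data)
    return bytes([null_bitmap]) + fixed + descriptors + body

def read_desk_note(tup):
    null_bitmap = tup[0]
    if null_bitmap & 1:               # bit 0 = desk_note
        return None                   # NULL, regardless of stored length
    base = 1 + struct.calcsize(FIXED_FMT)
    off, length = struct.unpack_from("<HH", tup, base)
    return tup[off:off + length].decode()
```

Note that the reader checks the null bitmap before ever touching a descriptor, which is exactly how NULL stays distinct from a zero-length string.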

Concept 2: Variable-length columns turn one tuple into a miniature layout problem

Variable-length columns are where row formats stop being obvious. desk_note might be 18 bytes in one row and 240 bytes in the next. vendor_payload_json may be absent most of the time and then suddenly expand to kilobytes when the feed provider includes diagnostic metadata. The engine therefore needs a mechanism that lets it find each value without pretending every row is equally wide.

In a typical row-store read path, decoding a tuple works like this:

  1. Read the tuple header to determine header length and status bits.
  2. Consult the null bitmap before touching any nullable attribute.
  3. Read fixed-width columns from stable offsets.
  4. Use a length prefix, offset table, or varlena header to find each inline variable-length value.
  5. If the field is stored out of line, follow an overflow pointer to a separate storage area.
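
The five steps above can be sketched as one decode function. The layout here (a one-byte header length, a status byte, a null-bitmap byte, fixed fields, then (offset, length) descriptors) and the high-bit overflow flag are assumptions chosen for illustration, not a real engine's format.

```python
import struct

# High bit of the length word marks an out-of-line value, in which case
# the "offset" field is reinterpreted as a key into the overflow area.
OVERFLOW_FLAG = 0x8000

def decode(tup, fixed_fmt, n_varlen, overflow_store):
    # 1. Tuple header: header length and status bits.
    header_len, status = tup[0], tup[1]
    # 2. Null bitmap, consulted before touching any nullable attribute.
    null_bitmap = tup[2]
    # 3. Fixed-width columns from stable offsets.
    fixed = struct.unpack_from(fixed_fmt, tup, header_len)
    # 4./5. Variable-length descriptors follow the fixed region.
    values = []
    base = header_len + struct.calcsize(fixed_fmt)
    for i in range(n_varlen):
        if null_bitmap & (1 << i):
            values.append(None)
            continue
        off, length = struct.unpack_from("<HH", tup, base + 4 * i)
        if length & OVERFLOW_FLAG:
            # 5. Out of line: follow the pointer into overflow storage.
            values.append(overflow_store[off])
        else:
            # 4. Inline: slice the bytes straight out of the tuple body.
            values.append(tup[off:off + length])
    return fixed, values

# Hand-built example tuple: header_len=3, status=0, bitmap=0,
# one u32 fixed field, one inline varlen, one out-of-line varlen.
demo = (bytes([3, 0, 0]) + struct.pack("<I", 42)
        + struct.pack("<HH", 15, 5)               # inline: bytes 15..20
        + struct.pack("<HH", 99, OVERFLOW_FLAG)   # out of line: key 99
        + b"hello")
```

The point of the sketch is the branch in step 4/5: the same descriptor slot either locates inline bytes or redirects to a separate storage area, and the decoder cannot tell which without the flag.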

The update path reveals why these choices matter. Suppose Harbor Point revises a quote and the new desk_note grows from 24 bytes to 180 bytes. If there is enough contiguous free space on the page, the engine may rewrite the tuple locally and update the slot entry if the payload moved. If there is not enough space, the engine may move part of the value off-page, create a redirect-like structure, or force the update into another page depending on the engine design. What looked like "just a text edit" becomes a page-management decision because the record format and page layout are coupled.

Inline storage and overflow storage are the main trade-off. Keeping variable-length values inline improves locality for point reads and range scans that actually need the value. Moving large values out of line keeps the base tuple compact, which can preserve page density and reduce movement on updates. The cost is extra pointer chasing, more I/O for rows that need the large value, and more complicated maintenance when overflow pages are vacuumed or reclaimed.

This is also where many row formats add thresholds and compression rules. PostgreSQL's TOAST mechanism and InnoDB's long-column handling are concrete examples: small values stay inline because locality matters, while large values are compressed or spilled because a giant inline tuple would damage the rest of the system.
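A TOAST-style placement decision can be sketched in a few lines. The thresholds and the use of zlib below are illustrative assumptions; PostgreSQL's actual limits, compression scheme, and four-way strategy differ.

```python
import zlib

# Invented thresholds for a toy TOAST-like policy; not PostgreSQL's constants.
INLINE_LIMIT = 128      # small values stay inline, uncompressed
COMPRESS_LIMIT = 2000   # try compression before spilling out of line

def place_value(value: bytes):
    """Decide where a variable-length value should live."""
    if len(value) <= INLINE_LIMIT:
        return ("inline", value)
    compressed = zlib.compress(value)
    if len(compressed) <= COMPRESS_LIMIT:
        return ("inline-compressed", compressed)
    # Still too big even compressed: spill to overflow storage.
    return ("overflow", compressed)
```

A short desk_note stays inline, while a repetitive multi-kilobyte JSON payload may compress back under the inline limit, and only genuinely large, incompressible data pays the overflow-pointer cost.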

Concept 3: Row width is a storage-engine performance policy, not just a schema detail

Once you look at Harbor Point's rows physically, page density becomes a direct consequence of tuple format. Imagine an 8 KB page that can hold about 70 recent quotes when desk_note is usually NULL. If market stress pushes the average inline row size high enough that the same page now holds only 20 quotes, the executor needs roughly three and a half times as many page reads to scan the same logical window. Nothing about SQL changed. The physical row format changed the cost model underneath it.
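The arithmetic behind that claim is easy to reproduce. The page overhead and per-slot size below are assumptions chosen so the densities land near the 70-row and 20-row figures above; real engines have their own page headers and slot formats.

```python
# Back-of-envelope page-density math. All constants are illustrative.
PAGE_BYTES = 8192
PAGE_OVERHEAD = 192   # assumed page header + free-space bookkeeping
SLOT_BYTES = 4        # assumed per-tuple slot directory entry

def rows_per_page(avg_tuple_bytes):
    # Each row costs its tuple bytes plus one slot entry.
    return (PAGE_BYTES - PAGE_OVERHEAD) // (avg_tuple_bytes + SLOT_BYTES)

def pages_to_scan(n_rows, avg_tuple_bytes):
    per_page = rows_per_page(avg_tuple_bytes)
    return -(-n_rows // per_page)   # ceiling division

quiet_day = pages_to_scan(10_000, 110)   # mostly-NULL desk_note
stormy_day = pages_to_scan(10_000, 380)  # wide notes plus vendor metadata
```

With these assumed sizes, the quiet-day page holds 70 rows and the stormy-day page holds 20, so scanning the latest 10,000 quotes jumps from 143 page reads to 500.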

That cost propagates into the next layer. Fewer rows per page means a buffer pool must cache more pages to cover the same hot working set. Wider rows increase WAL traffic during updates because more bytes move or more overflow metadata changes. Fragmentation becomes harder to avoid because variable-length growth leaves awkward holes inside pages, exactly the problem 02.md described. Record format is therefore one of the earliest places where "storage" and "memory behavior" stop being separable topics.

This is why production schema advice has to be mechanism-driven. If analysts constantly filter on cusip, bid_cents, ask_cents, and quote_ts but only occasionally inspect vendor_payload_json, keeping the hot fields in a compact fixed-width prefix and pushing cold, wide data out of the base tuple can materially improve scans. If the application reads the note on every request, aggressively externalizing it may save pages while making the dominant query slower. The correct layout depends on access patterns, update frequency, and how expensive an extra page fetch is in the current engine.

That trade-off sets up 04.md. Replacement policy is not just about "which page is old." It is about whether each cached page holds enough useful rows to justify the memory it consumes. Record format decides that before the buffer manager ever sees the page.

Troubleshooting

Issue: desk_note IS NULL and desk_note = '' behave differently, but an internal export or debugging tool treats them as the same thing.

Why it happens / is confusing: The tool is likely reading decoded values without preserving the null bitmap semantics. A zero-length varlen field still occupies logical space in the tuple format; NULL is represented separately.

Clarification / Fix: Inspect how the engine encodes nullability. When debugging storage, always distinguish null bitmap state from value length or delimiter presence.

Issue: A table suddenly fits far fewer rows per page after a vendor rollout, even though only one optional JSON column changed.

Why it happens / is confusing: Variable-length columns can cross an inline-storage threshold. Once enough rows start carrying larger inline payloads, tuple width expands and page density drops sharply.

Clarification / Fix: Measure average inline row width before and after the change. Check whether the engine compressed or externalized large values, and consider whether the wide attribute belongs in the base row at all.

Issue: Small text edits create larger-than-expected WAL or page churn.

Why it happens / is confusing: Growing a variable-length field often relocates tuple bytes or changes overflow metadata. The update dirties more of the page than the logical column change suggests.

Clarification / Fix: Look for tuple relocation, overflow-page writes, and page fragmentation. Fillfactor, overflow thresholds, and column placement can matter as much as the application update itself.

Advanced Connections

Connection 1: Record formats <-> WAL and crash recovery

Recovery systems replay byte-level or field-level changes against pages that must still decode into valid tuples. PostgreSQL's heap tuple header, null bitmap, and TOAST pointers are a good example: redo is only safe because the storage engine knows exactly how tuple headers and variable-length attributes are structured on disk.

Connection 2: Record formats <-> buffer pools and CPU cache behavior

Tuple width determines how many logical rows each cached page can serve. InnoDB's compact and dynamic row formats, for example, are not just disk-layout choices; they change how much useful data fits into memory and how often a query must chase overflow storage. That is why 04.md follows naturally from this lesson.


Key Insights

  1. A slot is not enough; the tuple also needs a decoding contract - Page layout finds the record, but record format explains the bytes inside it.
  2. Variable-length columns force explicit locality decisions - Inline values make common reads cheap, while overflow storage protects page density and update stability.
  3. Row width shapes the rest of the engine - Page occupancy, WAL volume, fragmentation, and buffer-pool efficiency all start with the tuple format.