Day 430: Storage Engine Observability and SLOs

Consistency and Replication | Lesson 031 | 30 min | Advanced

The core idea: A storage engine SLO is credible only when it is expressed in engine-native state such as WAL durability, replay position, checkpoint debt, and restore readiness, not just in API latency.

Today's "Aha!" Moment

In 13.md, Harbor Point designed a real point-in-time recovery path for the bond_reservations cluster: nightly physical backups, continuous WAL archiving, and a restore procedure that can stop just before a destructive commit. That design looks solid on paper. The uncomfortable question the next morning is simpler: how would the team know, at 09:37 on a busy trading day, that the recovery promise is still intact?

Suppose the API dashboard is green. Median write latency is normal, the load balancer sees no errors, and traders keep placing reservations. Meanwhile, the WAL archive command has been failing for twelve minutes because the object-store credentials expired, the replica used for compliance reads is 1.8 GB behind on replay, and the next checkpoint is forcing a burst of dirty-page writes that will soon stretch commit latency. From the application's point of view, the service is still "up." From the storage engine's point of view, Harbor Point is already outside its recovery and freshness promises.

That is the key shift for this lesson. Storage engine observability is not a larger pile of metrics. It is a map from business promises to internal state transitions. "A committed reservation is durable within one second" is not a sentiment. It means Harbor Point can observe WAL generation, fsync completion, archive success, and replay progress closely enough to prove that the promise is still true. Once the team thinks that way, SLOs stop being generic uptime slogans and become concrete claims about the engine's hidden machinery.

This also corrects a common misconception. Teams often assume the database is healthy as long as query latency stays below its threshold. That assumption can hold for the request path while being false for the storage path: a system can serve traffic smoothly while its backup window is already broken, its replay backlog is growing, or its maintenance debt is quietly setting up the next incident.

Why This Matters

Harbor Point's reservation service is the system of record for open positions during the market session. If the cluster accepts writes at 09:40 but cannot restore to 09:39 because WAL archival has been broken since 09:28, the business does not really have the durability posture it thinks it bought. The incident will not start when the first HTTP 500 appears. It will start when an operator asks for a recovery point the engine can no longer provide.

Black-box monitoring does not catch that kind of degradation early enough. CPU, query throughput, and endpoint latency tell the team whether the database is currently serving work. They do not tell the team whether the write-ahead log is safely off-host, whether replicas are converging, whether checkpoints are accumulating dangerous flush debt, or whether a restore drill still finishes inside the promised recovery window.

Once Harbor Point instruments those internal boundaries directly, production decisions get sharper. On-call can distinguish "slow because the workload is heavy" from "slow because durability work is falling behind." Capacity planning can ask whether more write volume increases archive lag or replica replay delay. Leadership can set an RPO and RTO that correspond to measured engine behavior rather than optimistic policy text.

Learning Objectives

By the end of this session, you will be able to:

  1. Explain why storage observability must follow engine state transitions - Map commits, log durability, checkpoints, archival, and replay into concrete observability points.
  2. Translate business promises into storage-engine SLIs and SLOs - Define measures for durability, freshness, and recovery readiness that a production team can actually verify.
  3. Diagnose degradation using backlog and drill signals - Separate transient load from structural risk by reading archive lag, replay debt, checkpoint pressure, and restore-test outcomes together.

Core Concepts Explained

Concept 1: Observe the write path as a sequence of state transitions

Harbor Point's traders care about a simple sentence: "when I place a reservation, it is durable." The engine cannot satisfy that sentence all at once. It satisfies it in stages. A transaction appends records to WAL, the WAL is flushed to durable local media, dirty data pages remain in memory for a while, a checkpoint later writes those pages back, WAL segments are archived for disaster recovery, and replicas replay the same history on their own schedule.

That means storage observability has to ask a specific question at each boundary: what has happened, and what has not happened yet? If Harbor Point only graphs end-to-end transaction latency, it collapses all of those stages into one blurry number. A commit can be acknowledged quickly while the archive is failing. A replica can answer reads while replay is behind the point that compliance assumes. The observability model has to preserve those distinctions.

For this cluster, the engine flow looks like this:

client commit
    |
    v
WAL appended in memory
    |
    v
WAL fsync completes  ---> commit may now be acknowledged
    |
    v
dirty pages accumulate in buffer cache
    |
    v
checkpoint/background writer flushes pages
    |
    v
WAL segment archived off-host
    |
    v
replica receives and replays WAL

Each arrow suggests an observability boundary. WAL write and sync timing say whether the commit path is healthy. Checkpoint write volume and duration reveal whether page flush work is arriving smoothly or in bursts. Archive success timestamps tell Harbor Point whether a local durable commit is also durable against host loss. Replay position tells the team whether standbys are fresh enough for reads or failover.
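
As a concrete sketch, assuming a PostgreSQL-flavored engine and the psycopg2 driver, each boundary can be reduced to one probe. The DSN and the 'compliance_replica' application name below are placeholders for illustration, not part of the lesson's design.

# boundary_probes.py - one signal per write-path boundary (illustrative sketch).
# Assumes a PostgreSQL-style engine and psycopg2; the DSN below is a placeholder.
import psycopg2

PROBES = {
    # commit path: where the primary's WAL currently ends
    "wal_insert_lsn": "SELECT pg_current_wal_lsn()",
    # off-host durability: seconds since the last WAL segment was archived
    "archive_age_seconds":
        "SELECT EXTRACT(EPOCH FROM now() - last_archived_time) FROM pg_stat_archiver",
    # replica freshness: bytes of primary WAL the standby has not replayed yet
    "replay_gap_bytes":
        "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) "
        "FROM pg_stat_replication WHERE application_name = 'compliance_replica'",
}

def sample(dsn: str) -> dict:
    """Run each probe once and return {signal_name: value}."""
    results = {}
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for name, sql in PROBES.items():
            cur.execute(sql)
            row = cur.fetchone()
            results[name] = row[0] if row else None
    return results

if __name__ == "__main__":
    print(sample("dbname=bond_reservations host=primary.example.internal"))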

The trade-off is specificity. Engine-native metrics are less portable than generic host dashboards, and they require operators to understand terms like LSN, checkpoint, or replay lag. The payoff is that those metrics answer the actual production question: which promise is degrading, and at which internal boundary is it happening?

Concept 2: A storage SLO is a bundle of narrower promises, not one number

Harbor Point's executives may ask for "99.95% database availability," but the storage team cannot operate the engine from that phrase alone. The lesson from 13.md is that recovery and durability are staged mechanisms. The observability consequence is that the SLO must also be staged. Otherwise the team reports success while missing the exact failure mode that matters.

For Harbor Point, a useful SLO bundle might look like this:

Commit durability:
  99% of writes acknowledged in < 40 ms

Recovery point objective:
  archived WAL gap < 60 s for 99.9% of minutes

Replica freshness:
  compliance replica replay lag < 5 s during market hours

Recovery time objective:
  weekly restore drill completes to serving state in < 15 min

Notice what changed. The promises are still business-facing, but each one has an engine-native measurement behind it. Commit durability might rely on WAL sync latency histograms and commit wait time. The recovery-point promise depends on successful WAL archival, not on query latency. Replica freshness should be measured with replay position or commit timestamp lag, not merely TCP connectivity to the standby. Recovery time is not inferred from backup success; it is measured through drills that execute the full restore path.
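
To make the recovery-point promise checkable, per-minute archive-gap samples can be folded into a compliance ratio and an error budget. A minimal sketch with made-up sample data; the 60-second threshold and 99.9% target come from the bundle above.

# rpo_compliance.py - evaluate "archived WAL gap < 60 s for 99.9% of minutes" (sketch).
# Input is assumed to be one archive-gap reading per minute, in seconds.

TARGET_RATIO = 0.999       # 99.9% of minutes must comply
GAP_THRESHOLD_S = 60.0     # maximum tolerated archive gap in a compliant minute

def rpo_compliance(per_minute_gaps_s: list[float]) -> dict:
    """Return the compliance ratio and the remaining error budget in minutes."""
    total = len(per_minute_gaps_s)
    good = sum(1 for gap in per_minute_gaps_s if gap < GAP_THRESHOLD_S)
    ratio = good / total if total else 1.0
    budget = total * (1.0 - TARGET_RATIO)   # minutes allowed to violate in this window
    spent = total - good                    # minutes that actually violated
    return {
        "compliance_ratio": ratio,
        "meets_slo": ratio >= TARGET_RATIO,
        "error_budget_remaining_min": budget - spent,
    }

# Example: a 30-day window (43,200 minutes) containing one 50-minute archival outage.
# The monthly budget is roughly 43 minutes, so a single 50-minute outage burns it all.
print(rpo_compliance([12.0] * 43150 + [300.0] * 50))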

Those measurements also force trade-offs into the open. If Harbor Point tightens the archive gap from sixty seconds to ten, it may need more archive bandwidth, faster object-store uploads, or a different WAL segment size. If it insists on a five-second replay lag during peak trading, it may need more network headroom or less read traffic on the replica. A better SLO is not free. It is a costed operating posture.

The important design move is to reject proxies that do not prove the claim. "CPU below 70%" is not an SLI for durability. "Database port answered health checks" is not an SLI for PITR readiness. Storage SLOs should be built from signals that correspond to the engine step the promise depends on.

Concept 3: The most valuable signals are usually backlogs, gaps, and rehearsed outcomes

Storage engines rarely fail as a clean binary switch. More often, risk accumulates as debt. WAL is generated faster than it is archived. A replica receives WAL faster than it can replay. Dirty buffers grow until a checkpoint turns into a write storm. Vacuum or compaction falls behind until read amplification rises and bloat distorts latency. If Harbor Point alerts only on instantaneous thresholds, it notices the problem too late and has trouble telling cause from effect.

That is why backlog-oriented observability is so effective. Instead of asking only "what is the lag right now?", Harbor Point asks "is the lag shrinking, stable, or growing under current load?" A replay gap of 400 MB may be harmless during a burst if the standby is closing it quickly. A replay gap of 80 MB can be dangerous if it has been growing steadily for twenty minutes while compliance is still reading from that node. The same logic applies to checkpoint pressure and archive delay.
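
One way to encode that question is to look at the slope of the backlog over a recent window rather than at the latest value alone. A minimal sketch; the window contents and the "flat" threshold are illustrative, not tuned recommendations.

# backlog_trend.py - classify a backlog as shrinking, stable, or growing (sketch).
# Each sample is (unix_seconds, backlog_bytes); the flat threshold is arbitrary.

def backlog_trend(samples: list[tuple[float, float]], flat_bps: float = 50_000) -> str:
    """Least-squares slope of the backlog over the window, in bytes per second."""
    n = len(samples)
    if n < 2:
        return "unknown"
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    cov = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var if var else 0.0
    if slope > flat_bps:
        return "growing"      # debt is accumulating under current load
    if slope < -flat_bps:
        return "shrinking"    # the standby or archiver is catching up
    return "stable"

# 80 MB behind but growing steadily is riskier than 400 MB that is draining quickly.
growing = [(t, 80e6 + 200_000 * t) for t in range(0, 1200, 60)]
draining = [(t, 400e6 - 500_000 * t) for t in range(0, 600, 60)]
print(backlog_trend(growing), backlog_trend(draining))   # -> growing shrinking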

Restore drills belong in the same observability system. A weekly drill that restores the latest base backup, replays WAL to a target point, and runs a validation query is not separate from monitoring. It is the only way Harbor Point can measure the full recovery pipeline as it actually exists. Backup completion logs are component health. Drill duration and validation success are system health.
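
A drill only counts as a recovery-time SLI if its duration and validation outcome are recorded like any other measurement. A sketch of the harness shape; restore_from_latest_backup, replay_to_target, and run_validation_query are hypothetical hooks standing in for whatever restore scripts the team already maintains.

# restore_drill.py - time a weekly restore drill against the 15-minute RTO (sketch).
# The three hooks below are hypothetical placeholders, not real commands.
import time

RTO_SECONDS = 15 * 60

def restore_from_latest_backup() -> None: ...   # placeholder: restore the base backup
def replay_to_target() -> None: ...             # placeholder: replay WAL to the target point
def run_validation_query() -> bool: ...         # placeholder: confirm a known row is present

def run_drill() -> dict:
    started = time.monotonic()
    restore_from_latest_backup()
    replay_to_target()
    validated = bool(run_validation_query())
    duration = time.monotonic() - started
    return {
        "duration_seconds": duration,
        "validated": validated,
        "meets_rto": validated and duration <= RTO_SECONDS,
    }

The point of the sketch is where the result goes: emitted as a metric or event, the weekly drill shows up on the same dashboards as replay lag and archive age instead of living only in a runbook.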

This is where the lesson points toward 15.md. Observability can show Harbor Point that replay lag spikes during failover tests or that a partition causes WAL sender timeouts, but it cannot prove on its own that the engine preserves correctness claims under adversarial failures. Metrics tell the team where the boundaries are. Failure testing verifies whether those boundaries hold when the world is hostile.

Troubleshooting

Issue: The API dashboard is green, but Harbor Point is already missing its recovery point objective.

Why it happens / is confusing: Request-path metrics can stay healthy while WAL archival is broken. The cluster is still serving from local durable storage, so nothing looks wrong until a host loss or destructive transaction demands off-host recovery.

Clarification / Fix: Track the time since the last successfully archived WAL segment and alert on sustained gaps. Treat archival health as part of the durability SLO, not as a background maintenance metric.
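
Assuming a PostgreSQL-style engine, a single query against pg_stat_archiver captures both the staleness and the failure pattern; the 60-second alert threshold mirrors the recovery-point bundle above.

# archive_gap_check.py - alert when WAL archival is stale or actively failing (sketch).
# Assumes a PostgreSQL-style engine and psycopg2; the threshold mirrors the RPO above.
import psycopg2

ARCHIVE_GAP_ALERT_S = 60

SQL = """
SELECT EXTRACT(EPOCH FROM now() - last_archived_time) AS seconds_since_archive,
       failed_count,
       last_failed_time > last_archived_time AS failing_now
FROM pg_stat_archiver
"""

def archive_alert(dsn: str) -> bool:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SQL)
        seconds_since, _failed_count, failing_now = cur.fetchone()
        # Alert if nothing was ever archived, the archive is stale, or the
        # most recent attempt ended in failure.
        return seconds_since is None or seconds_since > ARCHIVE_GAP_ALERT_S or bool(failing_now)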

Issue: Replica lag shown in seconds jumps around and does not explain whether the standby is actually catching up.

Why it happens / is confusing: Time lag alone hides the amount of outstanding WAL and whether replay throughput is improving or collapsing. A low-volume minute and a heavy burst can produce the same "seconds behind" headline with very different recovery risk.

Clarification / Fix: Pair time-based lag with an LSN or byte gap and observe the trend. Harbor Point should know both how far behind the replica is and whether the backlog is shrinking under current load.
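
Measured on the standby itself, both views come from one query, again assuming a PostgreSQL-style engine; the byte gap separates "quiet primary" from "replay falling behind" when the seconds figure is ambiguous.

# standby_lag.py - pair time lag with the byte gap, measured on the standby (sketch).
# Assumes a PostgreSQL-style engine; run against the compliance replica, not the primary.
import psycopg2

SQL = """
SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS seconds_behind,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS bytes_received_not_replayed
"""

def standby_lag(dsn: str) -> tuple:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SQL)
        return cur.fetchone()   # feed both numbers into the trend check from Concept 3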

Issue: Commit latency spikes every few minutes, and teams blame application lock contention.

Why it happens / is confusing: Checkpoints and background flush bursts can surface as commit slowdowns even when the SQL layer is unchanged. If the engine is forcing many dirty pages out at once, WAL and data-file I/O compete for the same device budget.

Clarification / Fix: Correlate commit latency with checkpoint start and end times, dirty-buffer counts, and WAL sync timing. If the spike aligns with flush debt, tune checkpoint cadence or write smoothing before rewriting application code.
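
As one concrete correlation signal, assuming a PostgreSQL-style engine before version 17 (where these counters live in pg_stat_bgwriter), the share of checkpoints being requested early rather than arriving on schedule shows whether flush work is bursty; enabling log_checkpoints then gives exact start and end times to line up against the latency spikes.

# checkpoint_pressure.py - estimate how often checkpoints are forced early (sketch).
# Assumes a PostgreSQL-style engine before v17, where these counters sit in pg_stat_bgwriter.
import psycopg2

SQL = "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint FROM pg_stat_bgwriter"

def forced_checkpoint_fraction(dsn: str) -> float:
    """Fraction of checkpoints requested early (e.g. by WAL volume) rather than scheduled."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SQL)
        timed, requested, _buffers_written = cur.fetchone()
        total = timed + requested
        return requested / total if total else 0.0

A rising forced-checkpoint fraction, paired with checkpoint timestamps that bracket the commit-latency spikes, points at flush debt rather than application lock contention.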

Advanced Connections

Connection 1: 13.md designed the PITR mechanism; this lesson turns it into a monitored promise

Base backups, WAL archives, and restore targets define how Harbor Point would recover. Observability decides whether that design is still true at 09:37 on a normal weekday, before an incident exposes silent drift in the recovery pipeline.

Connection 2: 15.md uses failure injection to test the invariants that observability exposes

Once Harbor Point knows which boundaries matter, such as replay progress, archive continuity, and commit durability, the next step is to break the system deliberately and verify that those signals still correspond to real consistency behavior under partitions, crashes, and failovers.

Key Insights

  1. Storage observability must mirror engine state transitions - If a metric does not tell you which durability or replay boundary has been crossed, it is a weak foundation for operating the database.
  2. A storage SLO is really a set of narrower promises - Commit latency, archive continuity, replica freshness, and restore time are different commitments and should not be flattened into one health number.
  3. Backlog growth and restore drills reveal risk earlier than generic uptime dashboards - The engine usually tells you it is accumulating debt long before the application tells you it is down.