LESSON
Day 418: Synchronous vs Asynchronous Replication
The core idea: The real difference between synchronous and asynchronous replication is the commit acknowledgment boundary: which replica milestone must happen before the primary is allowed to tell the client "your write is durable."
Today's "Aha!" Moment
In 01.md, Harbor Point separated three standby milestones: receiving WAL, flushing it durably, and replaying it into query-visible state. This lesson adds the question that turns those milestones into a production policy: when trader R-88421 is approved, which of those milestones must happen before the API returns 201 Created?
An asynchronous primary answers that question narrowly. It fsyncs its own WAL, acknowledges the client, and lets replicas catch up afterward. That keeps the reservation path fast, but it leaves a window in which the client has been told the approval committed even though no standby has durably stored it yet. If the primary dies inside that window and the standby is promoted, the acknowledged reservation can disappear.
A synchronous primary moves at least one remote milestone into the commit path. The client does not get success until some standby has reached the configured point, usually remote flush and sometimes remote apply. That shrinks the data-loss window across failover, but the write is now paying for network latency, remote disk latency, and standby availability. The key design question is not "Do we prefer sync or async in theory?" It is "What failure are we willing to expose to the caller, and what latency or availability cost are we willing to pay to avoid it?"
Why This Matters
Harbor Point's compliance team treats a reservation approval as a legally meaningful event. If support tells a trader that reservation R-88421 was accepted, the database cannot quietly lose it just because the primary crashed two milliseconds later. At the same time, the desk will not accept an approval path that stalls every time a remote replica has a brief I/O hiccup. The replication mode is therefore not a storage footnote. It is part of the product's contract.
Without an explicit acknowledgment rule, teams talk past each other. Application engineers say "the write committed" because the primary returned success. Database operators say "the standby was almost caught up" because replay lag looked small. Incident commanders later discover that both statements were true and the data was still lost on failover, because "committed" only meant local durability. Once the team names the commit boundary precisely, latency, RPO, failover behavior, and read freshness stop being vague aspirations and become measurable properties.
This lesson connects directly to 01.md: log shipping gave Harbor Point a way to reproduce history on a standby, but it did not decide when a history prefix becomes part of the durability promise made to clients. The next lesson, 03.md, shows what happens when that durability promise is no longer "one primary plus replicas" but "a quorum commits a log entry together."
Learning Objectives
By the end of this session, you will be able to:
- Explain the commit boundary in synchronous and asynchronous replication - Identify which local and remote milestones a database waits for before acknowledging a write.
- Analyze failover outcomes from the acknowledgment policy - Predict when an acknowledged write can still be lost after primary failure.
- Choose a replication mode for a concrete workload - Match reservation-critical writes, standby-read needs, and latency budgets to the right durability policy.
Core Concepts Explained
Concept 1: The commit path changes when replica acknowledgment enters the critical path
Harbor Point's approval service already writes to WAL on the primary before reporting success. The difference between asynchronous and synchronous replication is what happens after that local WAL flush. In asynchronous mode, the primary treats local durability as sufficient for the client contract and pushes the new WAL to standbys in the background. In synchronous mode, local durability is necessary but not sufficient: the primary must also wait for one or more standby milestones before it may return success.
That makes the commit path materially different:
Asynchronous commit
client -> primary WAL flush -> ACK to client -> ship WAL -> standby flush -> standby replay
Synchronous commit (remote flush)
client -> primary WAL flush -> ship WAL -> standby flush -> ACK to primary -> ACK to client -> standby replay
The important detail is that "synchronous" is not one universal milestone. Some systems wait until a standby confirms the WAL bytes were received. Others wait for the standby to flush them durably. A stricter variant waits until the standby has replayed the change so standby reads can observe it immediately after commit. Harbor Point usually cares most about remote flush, because the regulatory requirement is "do not lose approved reservations on failover," not "make every standby query instantly current."
This is why synchronous replication is a semantics choice before it is a performance choice. The setting answers a precise question: when the API says "committed," does that mean "durable on one machine" or "durable on more than one machine"? Only after that answer is clear does it make sense to measure the latency cost.
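The commit boundary above can be sketched as a small decision rule. This is a minimal illustration, not any particular database's API: the policy names and the `can_ack` helper are invented for this lesson, and the four milestone flags mirror the receive/flush/replay positions from 01.md.

```python
from enum import Enum, auto

class AckPolicy(Enum):
    """Which milestone must complete before the client hears 'committed'.
    Names are illustrative, not tied to any specific database's settings."""
    LOCAL_FLUSH = auto()     # asynchronous: the primary's WAL flush is enough
    REMOTE_RECEIVE = auto()  # a standby confirmed it received the WAL bytes
    REMOTE_FLUSH = auto()    # a standby durably flushed the WAL
    REMOTE_APPLY = auto()    # a standby replayed the WAL into query state

def can_ack(policy: AckPolicy, local_flushed: bool, remote_received: bool,
            remote_flushed: bool, remote_applied: bool) -> bool:
    """Local durability is always required; the policy decides how much
    remote progress must also be confirmed before acknowledging."""
    if not local_flushed:
        return False
    if policy is AckPolicy.LOCAL_FLUSH:
        return True
    if policy is AckPolicy.REMOTE_RECEIVE:
        return remote_received
    if policy is AckPolicy.REMOTE_FLUSH:
        return remote_flushed
    return remote_applied

# An async primary may ack as soon as its own WAL is durable:
print(can_ack(AckPolicy.LOCAL_FLUSH, True, False, False, False))  # True
# A remote-flush primary must keep the client waiting until a standby fsyncs:
print(can_ack(AckPolicy.REMOTE_FLUSH, True, True, False, False))  # False
```

Note that every stricter policy still includes the local flush: "synchronous" adds milestones to the boundary, it never removes the primary's own durability requirement.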
Concept 2: Failover risk is the real difference, not just average write latency
Suppose Harbor Point receives approval for reservation R-88421 at 09:30:00.120. The primary writes the commit record locally at 09:30:00.123 and returns success at 09:30:00.124. In asynchronous mode, that success may still outrun replication. If the primary host fails at 09:30:00.125 and the standby had not yet flushed the WAL for R-88421, promotion will recover to an older history prefix. Support now sees a contradiction: the client was told the reservation succeeded, but the new primary has no trace of it.
In synchronous mode with remote flush, the same crash produces a different result. The primary would not have acknowledged R-88421 until a standby confirmed that the commit record was durably stored. Promotion can then continue recovery from that durable prefix even if replay was slightly behind. The standby may need a brief moment to finish applying the WAL before queries show the reservation, but the event is not lost.
That distinction also corrects a common misconception. Synchronous replication does not automatically mean "no stale standby reads." If Harbor Point waits for remote flush but not remote apply, an acknowledged reservation is failover-safe without necessarily being visible on a hot standby query at the same instant. Read freshness and failover durability are related, but they are not the same guarantee. The difference matters when teams build trader dashboards on standbys and assume that a synchronous commit setting solved every read-after-write problem.
The same mechanism explains why asynchronous replication is still a rational choice for some data. If Harbor Point emits a derived analytics event or refreshes a low-value search projection, accepting a tiny failover loss window may be cheaper than putting another network round-trip into the user's critical path. The mode should follow the consequence of loss, not a blanket preference for "faster" or "safer."
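The R-88421 scenario reduces to a one-line invariant: after promotion, the surviving writes are the intersection of what was acknowledged with what the standby had durably flushed. This sketch uses invented transaction IDs and a hypothetical helper to make that visible; it is not a real recovery procedure.

```python
# Minimal failover sketch: the primary dies immediately after acknowledging
# R-88421. Whether the write survives promotion depends only on whether the
# ack policy forced a standby flush first. All IDs here are illustrative.

def surviving_writes_after_failover(acked: set[str],
                                    standby_flushed: set[str]) -> set[str]:
    """The new primary can only recover the standby's durable prefix,
    so acknowledged writes survive only if they are in that prefix."""
    return acked & standby_flushed

# Asynchronous: R-88421 was acknowledged before any standby flushed it.
acked = {"R-88410", "R-88421"}
standby = {"R-88410"}  # standby flush lagged behind the ack
print(surviving_writes_after_failover(acked, standby))
# -> {'R-88410'}: the acknowledged approval R-88421 is gone.

# Synchronous remote flush: the ack could not precede the standby flush,
# so every acknowledged write is inside the standby's durable prefix.
standby = {"R-88410", "R-88421"}
print(surviving_writes_after_failover(acked, standby))
# -> both acknowledged writes survive promotion.
```

The asymmetry is the whole point: async lets `acked` run ahead of `standby_flushed`, while sync with remote flush makes `acked` a subset of it by construction.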
Concept 3: Production systems usually implement a policy spectrum, not a pure binary
Real deployments rarely stay at the textbook endpoints. Harbor Point might require one nearby standby to flush every reservation approval before acknowledging the trader, while allowing additional reporting replicas to lag asynchronously. Another team might configure quorum-based synchronous replication so that either of two local standbys can satisfy the commit rule. Yet another might run "semi-sync," where the primary waits for a receipt or flush from one standby but falls back to asynchronous mode after a timeout. All of those are really policies about what durability guarantee survives degraded conditions.
Those policy choices create explicit trade-offs:
- Strict synchronous replication reduces the chance of acknowledged-write loss during failover, but it ties write availability to replica health and network quality.
- Asynchronous replication preserves write throughput and keeps the primary available when replicas are unhealthy, but it accepts a non-zero RPO window for acknowledged writes.
- Intermediate policies such as quorum sync or semi-sync reduce some risk without paying the full cost of waiting for every replica, but they also make the guarantee more conditional and therefore harder to explain operationally.
Harbor Point should therefore document the acknowledgment rule in product language, not only database language. "Reservation approvals survive loss of the primary because one standby must flush the WAL before commit returns" is an operationally useful promise. So is "search suggestions may roll back a few seconds after failover because they replicate asynchronously." If the team cannot state the rule that clearly, it probably has not chosen the mode deliberately enough.
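The spectrum above can be expressed as a single acknowledgment predicate parameterized by quorum size and an optional fallback timeout. This is a hedged sketch: `quorum_size` and `semi_sync_timeout_ms` are invented parameters for this lesson, not the configuration knobs of any specific database.

```python
from typing import Optional

def commit_satisfied(flush_confirmations: int, quorum_size: int,
                     waited_ms: float,
                     semi_sync_timeout_ms: Optional[float]) -> bool:
    """True when the primary may acknowledge the client.

    quorum_size=0                 -> pure async: local flush alone suffices.
    quorum_size=N, no timeout     -> strict sync: block until N standbys flush.
    quorum_size=N, with a timeout -> semi-sync: degrade to async on timeout.
    """
    if flush_confirmations >= quorum_size:
        return True
    if semi_sync_timeout_ms is not None and waited_ms >= semi_sync_timeout_ms:
        return True  # fell back to async; the durability guarantee just weakened
    return False

print(commit_satisfied(0, 0, 0.0, None))     # async: ack immediately -> True
print(commit_satisfied(0, 1, 5.0, None))     # strict sync: still waiting -> False
print(commit_satisfied(0, 1, 600.0, 500.0))  # semi-sync: timed out, ack anyway -> True
```

The last case is why semi-sync guarantees are "conditional and therefore harder to explain operationally": the same `True` can mean either "a standby flushed this" or "we stopped waiting," and only instrumentation around the timeout branch tells you which promise the client actually received.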
This lesson also prepares the ground for 03.md. Synchronous leader-follower replication adds a remote acknowledgment to a primary's commit path. Raft goes further by making quorum acknowledgment the definition of commit itself. The same design pressure appears in both: durability improves when multiple nodes must confirm ordered history, but latency and failure handling become part of the write path.
Troubleshooting
Issue: Enabling synchronous replication caused approval latency spikes even though the primary's own disk is healthy.
Why it happens / is confusing: The extra delay is no longer coming only from local WAL flush. The commit path now includes network transport, standby disk flush, and any queuing on the designated synchronous replica.
Clarification / Fix: Measure each hop separately. Verify whether Harbor Point is waiting for remote receive, remote flush, or remote apply, and whether the synchronous standby is in the same region and provisioned for peak WAL volume.
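"Measure each hop separately" can be as simple as decomposing one commit's latency into its stages. The hop names and millisecond figures below are made up for illustration; in practice the numbers come from database and network instrumentation, not this arithmetic.

```python
# Hypothetical breakdown of one synchronous (remote-flush) commit.
commit_hops_ms = {
    "local_wal_flush":    1.2,  # primary's own fsync
    "wal_ship_network":   3.5,  # primary -> synchronous standby transfer
    "standby_wal_flush":  1.8,  # remote fsync on the standby
    "ack_return_network": 3.4,  # standby -> primary acknowledgment
}

total = sum(commit_hops_ms.values())
for hop, ms in commit_hops_ms.items():
    print(f"{hop:>18}: {ms:4.1f} ms ({ms / total:5.1%} of commit)")
print(f"{'total':>18}: {total:4.1f} ms")
```

In this invented breakdown the primary's own flush is only about an eighth of the commit time; the rest is the synchronous machinery. That is why a healthy local disk tells you nothing about where the spike lives.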
Issue: A failover lost a reservation that the API had already confirmed, even though replication lag dashboards looked small.
Why it happens / is confusing: Small replay lag does not prove that the acknowledged transaction had already been flushed on a standby when the primary died. If the system was running asynchronously, the acknowledgment only guaranteed local durability.
Clarification / Fix: Compare the configured commit policy with the business RPO. If acknowledged approvals must survive failover, require at least one standby flush before success or move that workload onto a quorum-commit design.
Issue: The primary stopped accepting writes when a replica link flapped for a few seconds.
Why it happens / is confusing: In strict synchronous mode, the primary is not allowed to complete commits without the configured replica acknowledgment. What looks like a "replica problem" has become a write-availability problem by design.
Clarification / Fix: Decide whether the business prefers blocking writes, failing over quickly, or temporarily degrading to asynchronous behavior. Then encode that policy explicitly instead of relying on an implicit timeout or an undocumented operator decision.
Advanced Connections
Connection 1: 01.md supplies the replica milestones that synchronous commit can wait for
The previous lesson separated receive, flush, and replay positions on a standby. This lesson turns those positions into a client-facing contract. "Async" means the commit boundary stops at local durability. "Sync" means the boundary extends to one of those remote positions.
Connection 2: 03.md replaces single-primary acknowledgment with quorum commit
Primary-standby systems ask, "Should this leader wait for a follower before acknowledging?" Raft-based systems ask, "Has this log entry reached a majority, making it committed by protocol definition?" The machinery changes, but the core trade-off is the same: stronger failover guarantees come from waiting for more nodes to confirm ordered history.
Resources
Optional Deepening Resources
- [DOC] PostgreSQL Documentation: Synchronous Replication
- Focus: The exact standby milestones PostgreSQL can wait for and how those choices affect latency and failover durability.
- [DOC] MySQL Reference Manual: Semisynchronous Replication
- Focus: A production example of a middle ground between fully asynchronous replication and stricter synchronous policies.
- [BOOK] Designing Data-Intensive Applications
- Focus: Leader-based replication, failover data loss windows, and why acknowledgment policy is the real semantic boundary.
Key Insights
- Replication mode is a commit-contract choice - The important question is which remote milestone must happen before a client can be told the write is durable.
- Async is fast because it accepts a failover loss window - An acknowledged write can still disappear if the primary dies before any standby has durably stored it.
- Sync improves failover safety by adding coordination to the write path - The benefit is real, but so are the latency and availability costs of waiting for replicas.