Synchronous and Asynchronous Replication
The core idea: synchronous and asynchronous replication differ at the acknowledgment boundary, that is, in which local or remote durability milestone must be reached before the primary can tell the client that a write committed.
Core Insight
Imagine Harbor Point approving reservation R-88421 at market open. The primary database writes the commit record locally and is ready to return 201 Created. A standby is receiving the same WAL stream, but the team now has to answer a business question hidden inside a storage setting: can the API say "accepted" before any other machine has durably stored that commit?
Asynchronous replication says yes. The primary can acknowledge the client after local durability and let standbys catch up in the background. That keeps the write path short and available when replicas are slow, but it creates a window where an acknowledged reservation can disappear if the primary dies before a standby has flushed the WAL.
Synchronous replication moves a remote milestone into the commit path. The primary waits for a standby to receive, flush, or sometimes apply the relevant log record before it answers the client. That shrinks the acknowledged-write loss window, but it adds latency to every commit and makes replica health part of write availability.
The trade-off is not "fast versus safe" in the abstract. It is: what failure can this operation expose to the caller, and what latency or availability cost is the product willing to pay to avoid it?
The Acknowledgment Boundary
Lesson 004 separated three standby watermarks: received, flushed, and replayed. This lesson turns those watermarks into a policy.
For Harbor Point, the primary's local path looks like this:
client -> primary generates WAL -> primary fsyncs WAL -> primary can recover locally
After that point, replication mode decides whether the client gets success immediately or whether the primary must wait for a remote milestone.
Asynchronous replication:
    client
      -> primary fsync
      -> ACK to client
      -> ship WAL later
      -> standby flush
      -> standby replay
Synchronous replication, remote flush:
    client
      -> primary fsync
      -> ship WAL
      -> standby flush
      -> standby ACK to primary
      -> ACK to client
      -> standby replay later
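The two orderings above can be sketched as a small Python model. This is only an illustration of where the client acknowledgment sits in each sequence; the function names and the mode labels "async" and "sync_flush" are hypothetical and do not correspond to settings of any particular database.

```python
def commit_trace(mode: str) -> list[str]:
    """Return the ordered steps of one commit under a simplified model.

    The mode labels are assumptions made for this sketch only.
    """
    steps = ["primary fsync"]
    if mode == "async":
        # The client is answered before any WAL leaves the primary.
        steps += ["ACK to client", "ship WAL", "standby flush", "standby replay"]
    elif mode == "sync_flush":
        # The client is answered only after the standby confirms a durable flush.
        steps += ["ship WAL", "standby flush", "standby ACK to primary",
                  "ACK to client", "standby replay"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return steps


def remote_flush_precedes_ack(mode: str) -> bool:
    """True if the standby flush happens before the client sees success."""
    steps = commit_trace(mode)
    return steps.index("standby flush") < steps.index("ACK to client")
```

In this model, `remote_flush_precedes_ack("sync_flush")` is true and `remote_flush_precedes_ack("async")` is false, which is the whole difference between the two modes.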
The word "synchronous" is not specific enough by itself. A database may wait for remote receipt, remote flush, or remote apply. These are different promises:
Remote milestone   What it buys                        What it does not buy
-----------------  ----------------------------------  --------------------------------
receive            another node saw the bytes          durability after standby crash
flush              another node stored durable WAL     immediate standby read freshness
apply              standby queries can see the change  low write latency
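One way to read the table is as an ordered scale, where each milestone implies the ones before it. The sketch below assumes that ordering (receive, then flush, then apply), as in the flows earlier in this lesson; the `Milestone` enum and the helper names are hypothetical.

```python
from enum import IntEnum


class Milestone(IntEnum):
    """Standby progress for one log record, assumed to be reached in order."""
    RECEIVE = 1
    FLUSH = 2
    APPLY = 3


def survives_standby_crash(reached: Milestone) -> bool:
    # Only a flushed record is on durable storage at the standby.
    return reached >= Milestone.FLUSH


def visible_to_standby_reads(reached: Milestone) -> bool:
    # Only an applied (replayed) record is visible to standby queries.
    return reached >= Milestone.APPLY
```

Under this model a record at `Milestone.FLUSH` survives a standby crash but is not yet visible to standby reads, which matches the middle row of the table.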
Harbor Point usually cares about remote flush for reservation approvals. The legal promise is not "every dashboard can query this immediately." The promise is "if we tell the trader the reservation was accepted, losing the primary alone should not erase it."
Failover Outcomes
The difference becomes concrete in a crash timeline.
09:30:00.120 client submits reservation R-88421
09:30:00.123 primary fsyncs commit locally
09:30:00.124 primary returns success
09:30:00.125 primary host dies
In asynchronous mode, success at 09:30:00.124 may have outrun replication. If no standby flushed the commit before the crash, the promoted standby recovers to an older prefix of history. The client has a confirmation, but the new primary does not have the reservation.
In synchronous mode with remote flush, the primary would not have returned success until a standby had durably stored the commit record. If the primary dies after success, promoting a standby that holds that flushed WAL can recover the commit. The standby may still need to replay the record before reads show it, but the committed event is not lost.
This distinction corrects a common confusion: failover durability and standby read freshness are related but separate. Waiting for remote flush can make an acknowledged write survive primary loss without guaranteeing that a hot-standby query can see the write at the exact moment the client receives success.
That separation matters for Harbor Point's dashboards. A dashboard stale by two seconds may be acceptable. A confirmed reservation disappearing after failover may not be acceptable at all.
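The failover outcome can be checked mechanically: an acknowledged commit is lost if its log position is beyond what the promoted standby had flushed. The following is a minimal sketch; the `Commit` type, the integer log positions, and the second reservation ID in the usage note are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    name: str
    lsn: int            # position of the commit record in the WAL (illustrative integer)
    acknowledged: bool  # did the client receive success?


def lost_after_failover(commits: list[Commit], standby_flushed_lsn: int) -> list[str]:
    """Names of acknowledged commits missing from the promoted standby."""
    return [c.name for c in commits
            if c.acknowledged and c.lsn > standby_flushed_lsn]
```

For example, with `Commit("R-88420", 100, True)` and `Commit("R-88421", 101, True)`, a standby that had flushed up to position 100 yields `["R-88421"]`, while one that had flushed up to 101 yields an empty list. With remote flush in the commit path, the second case is the only one that can follow a success response.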
Policy Spectrum in Practice
Production systems rarely offer only a pure binary. They usually expose a spectrum of acknowledgment policies:
Policy                       Client success waits for...
---------------------------  ----------------------------------------------
async                        local primary durability
semi-sync / remote receive   at least one standby receives the record
sync remote flush            at least one standby durably stores the record
sync remote apply            at least one standby replays the record
quorum commit                enough replicas accept the ordered entry
Each policy changes which failures become visible.
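The first four rows of the spectrum can be expressed as one decision function over standby watermarks. This is a sketch under simplifying assumptions: log positions are plain integers, and the policy labels ("async", "remote_receive", "remote_flush", "remote_apply") are names invented here, not configuration values of a real system. Quorum commit is left out because it counts replicas rather than checking a single standby.

```python
from dataclasses import dataclass


@dataclass
class Standby:
    received_lsn: int
    flushed_lsn: int
    replayed_lsn: int


# Hypothetical policy label -> standby watermark that must cover the commit.
_WATERMARK = {
    "remote_receive": "received_lsn",
    "remote_flush": "flushed_lsn",
    "remote_apply": "replayed_lsn",
}


def may_acknowledge(policy: str, commit_lsn: int,
                    primary_flushed_lsn: int, standbys: list[Standby]) -> bool:
    """Decide whether the primary may report success for commit_lsn."""
    if primary_flushed_lsn < commit_lsn:
        return False  # local durability is required under every policy
    if policy == "async":
        return True
    attr = _WATERMARK[policy]  # raises KeyError for an unknown policy
    return any(getattr(s, attr) >= commit_lsn for s in standbys)
```

With a single standby that has received and flushed position 101 but replayed only up to 100, a commit at 101 may be acknowledged under every policy except "remote_apply".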
Asynchronous replication is good for high-throughput paths that can tolerate a small but nonzero recovery point objective, meaning a short window of recent writes may be lost on failover. Harbor Point might use it for derived search projections, analytics feeds, or dashboard caches.
Remote flush is a stronger fit for reservation approvals. It adds a network hop and a standby disk flush to the commit path, but it makes the acceptance response mean something more durable than "one machine had it for a moment."
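The added cost can be estimated with simple arithmetic. The sketch below assumes a strictly serial commit path (local fsync, then a network round trip, then the standby fsync) and uses made-up timings in microseconds; real systems may overlap some of these steps, and actual numbers have to be measured.

```python
def commit_latency_us(local_fsync_us: int,
                      network_rtt_us: int = 0,
                      standby_fsync_us: int = 0) -> int:
    """Commit latency under a serial model: each step waits for the previous one."""
    return local_fsync_us + network_rtt_us + standby_fsync_us


# Hypothetical timings, for illustration only.
ASYNC_US = commit_latency_us(local_fsync_us=1000)
SYNC_FLUSH_US = commit_latency_us(local_fsync_us=1000,
                                  network_rtt_us=500,
                                  standby_fsync_us=1000)
```

With these assumed inputs the asynchronous path takes 1000 µs and the remote-flush path takes 2500 µs. The point is the shape of the sum rather than the specific values: every remote step placed before the acknowledgment is paid on every commit.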
Remote apply is useful when the application needs immediate reads from the standby after commit, but it is expensive. It waits for the standby to perform replay work, not merely store the history safely.
Quorum commit, which the next lesson approaches through tunable consistency, changes the shape again. Instead of "one primary plus optional followers," the write is considered successful only after enough replicas participate. The same pressure remains: stronger durability or freshness comes from waiting for more of the replicated system.
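As a preview, "enough replicas" is often taken to mean a strict majority. The sketch below assumes that rule; other quorum sizes are possible, and the function name is hypothetical.

```python
def quorum_reached(acks: int, replicas: int) -> bool:
    """True if acks form a strict majority of replicas (an assumed quorum rule)."""
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    return acks >= replicas // 2 + 1
```

Under this rule, 2 acknowledgments out of 3 replicas are enough, while 2 out of 4 are not.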
Choosing the Mode for One Workload
Harbor Point can document its policy per operation:
Operation                    Acknowledgment rule                Reason
---------------------------  ---------------------------------  -------------------------
reservation approval         primary fsync + one standby flush  avoid acknowledged loss
dashboard projection update  primary fsync only                 stale or lost cache is ok
audit event write            primary fsync + one standby flush  preserve confirmed record
bulk analytics refresh       async                              throughput matters more
This table is more useful than a global slogan like "we use synchronous replication." It names which user promises deserve remote durability and which workloads can accept a recovery window.
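A per-operation policy like this can also live in code, so that a new operation cannot ship without a documented rule. The sketch below encodes the table as the number of standby flushes each operation waits for, treating "primary fsync only" and "async" as zero; the dictionary and function names are hypothetical.

```python
# Standby flushes required before the client sees success, per operation.
REQUIRED_STANDBY_FLUSHES = {
    "reservation approval": 1,
    "dashboard projection update": 0,
    "audit event write": 1,
    "bulk analytics refresh": 0,
}


def required_standby_flushes(operation: str) -> int:
    """Look up the rule; unknown operations fail loudly instead of defaulting."""
    if operation not in REQUIRED_STANDBY_FLUSHES:
        raise KeyError(f"no acknowledgment rule documented for {operation!r}")
    return REQUIRED_STANDBY_FLUSHES[operation]
```

Raising on an unknown operation is a deliberate choice in this sketch: a silent default would quietly pick a durability promise that nobody agreed to.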
The choice also needs a degraded-mode rule. If the synchronous standby is unavailable, should Harbor Point block reservation approvals, fail over, switch to another standby, or temporarily accept asynchronous risk? That decision should be explicit before the incident. Otherwise an operator under pressure becomes the real consistency policy.
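Writing the degraded-mode rule down can be as simple as a lookup that is reviewed before any incident. In the sketch below the four actions come from the question above, but the specific choice assigned to each operation is a placeholder for illustration, not a recommendation; all names are hypothetical.

```python
from enum import Enum


class DegradedAction(Enum):
    BLOCK_WRITES = "block writes"
    FAIL_OVER = "fail over"
    SWITCH_STANDBY = "switch to another standby"
    ACCEPT_ASYNC = "temporarily accept asynchronous risk"


# Placeholder choices for illustration only; the real decision is a product call.
DEGRADED_RULES = {
    "reservation approval": DegradedAction.BLOCK_WRITES,
    "audit event write": DegradedAction.SWITCH_STANDBY,
}


def degraded_action(operation: str) -> DegradedAction:
    """Return the pre-agreed action when the synchronous standby is unavailable."""
    try:
        return DEGRADED_RULES[operation]
    except KeyError:
        raise LookupError(
            f"no degraded-mode rule documented for {operation!r}") from None
```

An operation without an entry raises instead of falling back to a default, so the gap is found in review rather than during an outage.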
The strongest answer is not always the right answer. Strict synchronous replication can turn a replica I/O hiccup into a write outage. Asynchronous replication can keep the product moving but risks losing acknowledged writes on failover. The engineering work is matching the mode to the cost of being wrong.
Failure Modes
Treating small lag as proof of safety. A replay-lag graph that looked low after the fact does not prove that a particular transaction had reached a standby before the primary died. The commit policy defines the promise.
Assuming synchronous means fresh reads everywhere. If the system waits for remote flush but not remote apply, a standby can be failover-safe while still briefly stale for queries.
Letting replica health accidentally control writes. Strict synchronous settings make the selected standby part of the write availability path. That may be correct, but it should be intentional and monitored.
Hiding degraded-mode behavior. If the system silently falls back from synchronous to asynchronous after a timeout, the client contract changes. Operators and callers need to know when that happens.
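The last failure mode can be made harder to hit by putting the achieved durability into the result the caller sees. The following is a minimal sketch with hypothetical names (`CommitResult`, `finish_commit`) and string labels; it models only the outcome of the wait, not the wait itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitResult:
    committed: bool
    durability: str  # "remote_flush" or "local_only" (labels assumed for this sketch)
    degraded: bool   # True if the synchronous promise was downgraded


def finish_commit(standby_flushed_in_time: bool, allow_fallback: bool) -> CommitResult:
    """Report what durability the acknowledged commit actually has."""
    if standby_flushed_in_time:
        return CommitResult(True, "remote_flush", False)
    if allow_fallback:
        # The downgrade is explicit in the result instead of being silent.
        return CommitResult(True, "local_only", True)
    # The write may still exist on the primary; the caller only learns
    # that remote durability was not confirmed.
    raise TimeoutError("standby did not confirm flush; commit not acknowledged")
```

A caller or a metrics hook can then count results with `degraded=True`, which turns a silent fallback into something operators can alert on.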
Resources
- [DOC] PostgreSQL Documentation: Synchronous Replication
- Focus: Compare remote write, flush, and apply settings as concrete acknowledgment boundaries.
- [DOC] MySQL Reference Manual: Semisynchronous Replication
- Focus: Study a production middle ground between fully asynchronous and stricter synchronous behavior.
- [BOOK] Designing Data-Intensive Applications
- Focus: Review leader-based replication, failover windows, and why acknowledgment policy defines the durability semantics.
- [PAPER] PacificA: Replication in Log-Based Distributed Storage Systems
- Focus: Read for how log replication systems reason about primary/secondary roles, durability, and failover.
Key Takeaways
- Replication mode is an acknowledgment contract: it says what local or remote milestone must happen before success reaches the client.
- Asynchronous replication keeps writes fast and available, but accepts a window where acknowledged writes can be lost if the primary fails.
- Synchronous replication improves failover durability by waiting for replicas, but that adds latency and can make replica health affect write availability.
- Remote flush protects durability; remote apply protects standby read freshness. They are related, but they are not the same guarantee.