Synchronous and Asynchronous Replication

Lesson 005 | Consistency and Replication | 30 min | advanced

The core idea: Synchronous and asynchronous replication differ at the acknowledgment boundary, that is, in which local or remote durability milestone must be reached before the primary tells the client that a write committed.

Core Insight

Imagine Harbor Point approving reservation R-88421 at market open. The primary database writes the commit record locally and is ready to return 201 Created. A standby is receiving the same WAL stream, but the team now has to answer a business question hidden inside a storage setting: can the API say "accepted" before any other machine has durably stored that commit?

Asynchronous replication says yes. The primary can acknowledge the client after local durability and let standbys catch up in the background. That keeps the write path short and available when replicas are slow, but it creates a window where an acknowledged reservation can disappear if the primary dies before a standby has flushed the WAL.

Synchronous replication moves a remote milestone into the commit path. The primary waits for a standby to receive, flush, or sometimes apply the relevant log record before it answers the client. That shrinks the acknowledged-write loss window, but it adds latency to every commit and makes replica health part of write availability.

The trade-off is not "fast versus safe" in the abstract. It is: what failure can this operation expose to the caller, and what latency or availability cost is the product willing to pay to avoid it?

The Acknowledgment Boundary

Lesson 004 separated three standby watermarks: received, flushed, and replayed. This lesson turns those watermarks into a policy.

For Harbor Point, the primary's local path looks like this:

client -> primary generates WAL -> primary fsyncs WAL -> primary can recover locally

After that point, replication mode decides whether the client gets success immediately or whether the primary must wait for a remote milestone.

Asynchronous replication

client
  -> primary fsync
  -> ACK to client
  -> ship WAL later
  -> standby flush
  -> standby replay

Synchronous replication, remote flush

client
  -> primary fsync
  -> ship WAL
  -> standby flush
  -> standby ACK to primary
  -> ACK to client
  -> standby replay later

The word "synchronous" is not specific enough by itself. A database may wait for remote receipt, remote flush, or remote apply. These are different promises:

Remote milestone   What it buys                        What it does not buy
-----------------  ----------------------------------  --------------------------------
receive            another node saw the bytes          durability after standby crash
flush              another node stored durable WAL     immediate standby read freshness
apply              standby queries can see the change  low write latency

Harbor Point usually cares about remote flush for reservation approvals. The legal promise is not "every dashboard can query this immediately." The promise is "if we tell the trader the reservation was accepted, losing the primary alone should not erase it."
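The acknowledgment boundary can be written down as a single predicate. The sketch below is illustrative only: `AckMode`, `StandbyProgress`, and `can_ack` are hypothetical names, and log positions are modeled as plain integers rather than any real database's LSN type.

```python
from dataclasses import dataclass
from enum import Enum


class AckMode(Enum):
    ASYNC = "async"              # local fsync only
    REMOTE_RECEIVE = "receive"   # standby has the bytes in memory
    REMOTE_FLUSH = "flush"       # standby has durably stored the WAL
    REMOTE_APPLY = "apply"       # standby has replayed the record


@dataclass
class StandbyProgress:
    """Highest log position the standby has reached for each watermark."""
    received: int
    flushed: int
    replayed: int


def can_ack(mode: AckMode, commit_lsn: int, primary_flushed: int,
            standby: StandbyProgress) -> bool:
    """Return True if the primary may report success for commit_lsn."""
    if primary_flushed < commit_lsn:
        return False  # local durability is required in every mode
    if mode is AckMode.ASYNC:
        return True   # no remote milestone in the commit path
    watermark = {
        AckMode.REMOTE_RECEIVE: standby.received,
        AckMode.REMOTE_FLUSH: standby.flushed,
        AckMode.REMOTE_APPLY: standby.replayed,
    }[mode]
    return watermark >= commit_lsn
```

With a standby at received=120, flushed=110, replayed=100, a commit at position 115 may be acknowledged under async or remote receive, but not yet under remote flush or remote apply. The only thing the mode changes is which watermark the primary compares against.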

Failover Outcomes

The difference becomes concrete in a crash timeline.

09:30:00.120  client submits reservation R-88421
09:30:00.123  primary fsyncs commit locally
09:30:00.124  primary returns success
09:30:00.125  primary host dies

In asynchronous mode, success at 09:30:00.124 may have outrun replication. If no standby flushed the commit before the crash, the promoted standby recovers to an older prefix of history. The client has a confirmation, but the new primary does not have the reservation.

In synchronous mode with remote flush, the primary would not have returned success until a standby had durably stored the commit record. If the primary dies after success, promoting that standby recovers the commit from its durable WAL. The standby may still need to replay the record before reads show it, but the acknowledged commit is not lost.

This distinction corrects a common confusion: failover durability and standby read freshness are related but separate. Waiting for remote flush can make an acknowledged write survive primary loss without guaranteeing that a hot-standby query can see the write at the exact moment the client receives success.

That separation matters for Harbor Point's dashboards. A dashboard stale by two seconds may be acceptable. A confirmed reservation disappearing after failover may not be acceptable at all.
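The crash timeline can be reduced to a small model: which commits did the client hear "success" for, and which of those does the promoted standby actually hold? The function below is a minimal sketch under simplifying assumptions (one standby, integer log positions, the hypothetical name `simulate_crash`), not a description of any particular database.

```python
def simulate_crash(commit_lsns, standby_flushed_lsn, wait_for_remote_flush):
    """Model the instant the primary dies.

    commit_lsns: positions of commits the primary has fsynced locally.
    standby_flushed_lsn: highest position the standby has durably stored.
    Returns (acked, lost): commits reported as successful to clients, and
    the subset of those that the promoted standby does not have.
    """
    if wait_for_remote_flush:
        # Success is only reported once the standby has flushed the commit.
        acked = [lsn for lsn in commit_lsns if lsn <= standby_flushed_lsn]
    else:
        # Asynchronous: success is reported right after the local fsync.
        acked = list(commit_lsns)
    lost = [lsn for lsn in acked if lsn > standby_flushed_lsn]
    return acked, lost
```

For commits at positions 101, 102, and 103 with the standby flushed to 102, the asynchronous case acknowledges all three and loses 103. The remote-flush case acknowledges only 101 and 102 and loses nothing. In that case the client for 103 never received a confirmation, so it sees an unknown outcome rather than a broken promise.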

Policy Spectrum in Practice

Production systems rarely offer only a pure binary. They usually expose a spectrum of acknowledgment policies:

Policy                      Client success waits for...
--------------------------  ----------------------------------------------
async                       local primary durability
semi-sync / remote receive  at least one standby receives the record
sync remote flush           at least one standby durably stores the record
sync remote apply           at least one standby replays the record
quorum commit               enough replicas accept the ordered entry

Each policy changes which failures become visible.

Asynchronous replication is good for high-throughput paths that can tolerate a small nonzero recovery point objective, meaning the most recent acknowledged writes may be lost on failover. Harbor Point might use it for derived search projections, analytics feeds, or dashboard caches.

Remote flush is a stronger fit for reservation approvals. It adds a network hop and a standby disk flush to the commit path, but it makes the acceptance response mean something more durable than "one machine had it for a moment."

Remote apply is useful when the application needs immediate reads from the standby after commit, but it is expensive. It waits for the standby to perform replay work, not merely store the history safely.

Quorum commit, which the next lesson approaches through tunable consistency, changes the shape again. Instead of "one primary plus optional followers," the write is considered successful only after enough replicas participate. The same pressure remains: stronger durability or freshness comes from waiting for more of the replicated system.
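A rough cost model makes "waiting for more of the replicated system" concrete. The sketch below assumes the steps run strictly one after another and that every duration is known in advance, which real systems do not guarantee. The function names and policy labels are hypothetical.

```python
def commit_wait_ms(policy, local_fsync, round_trip, standby_flush, standby_replay):
    """Time the client waits for success, assuming the steps run serially."""
    steps = {
        "async": [local_fsync],
        "remote_receive": [local_fsync, round_trip],
        "remote_flush": [local_fsync, round_trip, standby_flush],
        "remote_apply": [local_fsync, round_trip, standby_flush, standby_replay],
    }
    return sum(steps[policy])


def quorum_wait_ms(replica_ack_ms, needed):
    """Waiting for `needed` replicas ends when the needed-th fastest one answers."""
    if not 1 <= needed <= len(replica_ack_ms):
        raise ValueError("needed must be between 1 and the number of replicas")
    return sorted(replica_ack_ms)[needed - 1]
```

Each stronger policy adds a term to the sum, so its latency can never be lower than the weaker policy's. For quorum commit, the wait is set by the slowest replica among the fastest `needed` responders. That is why one slow replica does not stall a write as long as enough others answer.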

Choosing the Mode for One Workload

Harbor Point can document its policy per operation:

Operation                    Acknowledgment rule                Reason
---------------------------  ---------------------------------  -------------------------
reservation approval         primary fsync + one standby flush  avoid acknowledged loss
dashboard projection update  primary fsync only (async)         stale or lost cache is ok
audit event write            primary fsync + one standby flush  preserve confirmed record
bulk analytics refresh       primary fsync only (async)         throughput matters more

This table is more useful than a global slogan like "we use synchronous replication." It names which user promises deserve remote durability and which workloads can accept a recovery window.
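One way to keep such a table honest is to store it as data that the write path consults. The following is a sketch with hypothetical identifiers (`REQUIRED_STANDBY_FLUSHES`, `may_acknowledge`). The operation names simply mirror the rows above.

```python
# Number of standby flush confirmations each operation must collect
# before the client is told the write succeeded.
REQUIRED_STANDBY_FLUSHES = {
    "reservation_approval": 1,
    "dashboard_projection_update": 0,
    "audit_event_write": 1,
    "bulk_analytics_refresh": 0,
}


def may_acknowledge(operation, primary_fsynced, standby_flush_acks):
    """Apply the documented rule for one operation.

    Raises KeyError for an operation that has no documented policy, so a
    new write path cannot silently inherit asynchronous behavior.
    """
    needed = REQUIRED_STANDBY_FLUSHES[operation]
    return primary_fsynced and standby_flush_acks >= needed
```

Failing on unknown operations is a deliberate choice in this sketch: it forces someone to decide which promise a new operation makes instead of letting a default decide.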

The choice also needs a degraded-mode rule. If the synchronous standby is unavailable, should Harbor Point block reservation approvals, fail over, switch to another standby, or temporarily accept asynchronous risk? That decision should be explicit before the incident. Otherwise an operator under pressure becomes the real consistency policy.

The strongest answer is not always the right answer. Strict synchronous replication can turn a replica I/O hiccup into a write outage. Asynchronous replication can keep the product moving, but it accepts data loss on failover. The engineering work is matching the mode to the cost of being wrong.
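A degraded-mode rule can be made explicit by encoding it as a function that is reviewed before any incident. The sketch below covers three of the options named above (switch standby, accept asynchronous risk, block) and leaves failover out for brevity. The enum, the function name, and the order of preference are all assumptions for illustration.

```python
from enum import Enum


class DegradedAction(Enum):
    SWITCH_STANDBY = "wait on another healthy standby"
    ACCEPT_ASYNC = "acknowledge on local fsync and raise an alert"
    BLOCK_WRITES = "reject writes until a standby returns"


def degraded_action(healthy_standbys, async_fallback_approved):
    """Decide what to do when the current synchronous standby is unavailable.

    healthy_standbys: names of other standbys that can take the sync role.
    async_fallback_approved: whether the product owner has accepted
        temporary asynchronous risk for this workload.
    Returns (action, chosen_standby_or_None).
    """
    if healthy_standbys:
        return DegradedAction.SWITCH_STANDBY, healthy_standbys[0]
    if async_fallback_approved:
        return DegradedAction.ACCEPT_ASYNC, None
    return DegradedAction.BLOCK_WRITES, None
```

For reservation approvals Harbor Point would plausibly pass `async_fallback_approved=False`, so losing every standby blocks approvals rather than quietly weakening the promise.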

Failure Modes

Treating small lag as proof of safety. A low replay-lag graph after the fact does not prove the lost transaction had reached a standby before the primary died. The commit policy defines the promise.

Assuming synchronous means fresh reads everywhere. If the system waits for remote flush but not remote apply, a standby can be failover-safe while still briefly stale for queries.

Letting replica health accidentally control writes. Strict synchronous settings make the selected standby part of the write availability path. That may be correct, but it should be intentional and monitored.

Hiding degraded-mode behavior. If the system silently falls back from synchronous to asynchronous after a timeout, the client contract changes. Operators and callers need to know when that happens.
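If a timeout-based fallback exists at all, the result handed back to the caller can carry the durability level that was actually reached. This is a minimal sketch with hypothetical names (`CommitResult`, `finish_commit`) and a simulated standby delay. It shows the shape of the contract, not a real driver API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitResult:
    lsn: int
    durability: str   # "remote_flush" or "local_only"
    degraded: bool    # True when the synchronous wait was abandoned


def finish_commit(lsn, standby_ack_ms, timeout_ms, fallback_log):
    """Report success together with the milestone that backs it.

    standby_ack_ms: how long the standby took to confirm its flush,
        or None if it never answered.
    fallback_log: list that records every commit acknowledged without
        remote durability, so operators can see the contract change.
    """
    if standby_ack_ms is not None and standby_ack_ms <= timeout_ms:
        return CommitResult(lsn, "remote_flush", degraded=False)
    fallback_log.append(lsn)
    return CommitResult(lsn, "local_only", degraded=True)
```

A caller that requires remote durability can then reject or retry a `degraded` result, and the fallback log gives operators a list of exactly which acknowledged writes were exposed.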

Key Takeaways

  1. Replication mode is an acknowledgment contract: it says what local or remote milestone must happen before success reaches the client.
  2. Asynchronous replication keeps writes fast and available, but accepts a window where acknowledged writes can be lost if the primary fails.
  3. Synchronous replication improves failover durability by waiting for replicas, but that adds latency and can make replica health affect write availability.
  4. Remote flush protects durability; remote apply protects standby read freshness. They are related, but they are not the same guarantee.