LESSON
Day 282: Database Replication Topologies and Failover Behavior
The core idea: replication creates more copies of the same authority domain. It improves availability, read capacity, and recovery options, but it does not remove the need to choose who is authoritative and what happens when that authority changes hands.
Today's "Aha!" Moment
The insight: Replication is not just "copy the database to another machine." It is an ongoing protocol governing where writes go, how followers catch up, what lag means, and who is allowed to become primary when failure happens.
Why this matters: Many outages happen not because replication is missing, but because teams assume that "having replicas" automatically means safe failover, fresh reads, and zero ambiguity about leadership.
Concrete anchor: A primary fails during a traffic spike. You have replicas, but one is lagging, another is healthy but isolated, and clients still have stale connection targets. The problem is no longer storage. It is authority transfer under uncertainty.
The practical sentence to remember:
Replication gives you copies. Failover is the logic that decides which copy can safely become the source of truth.
Why This Matters
The problem: A single database node is a single authority point. If it fails, you lose write availability and may lose service entirely. Replication exists to reduce that fragility, but it introduces a new question: which copy is current enough and legitimate enough to serve writes after failure?
Without this model:
- Teams conflate replication with high availability.
- Read replicas are treated as magically consistent.
- Failover plans ignore lag, fencing, and client routing.
With this model:
- You can compare topologies by where authority lives and how it moves.
- You can predict which failure modes create stale reads, write downtime, or split-brain risk.
- You can separate "extra copies for reads" from "extra copies for recovery and failover."
Operational payoff: Better failover design, better expectations around replica lag, and fewer incidents where a promoted replica turns out not to contain the state everyone assumed it had.
Learning Objectives
By the end of this lesson, you should be able to:
- Explain the main replication topologies in terms of authority, lag, and failover behavior.
- Describe what failover actually requires beyond just having a replica available.
- Reason about read freshness, write availability, and split-brain risk under different designs.
Core Concepts Explained
Concept 1: Replication Topologies Are Really Authority Topologies
Concrete example / mini-scenario: A transactional database has one primary in Madrid and two followers in Frankfurt and Paris. Reads may go to followers, but all writes still flow through the primary.
Intuition: The shape of replication matters because it determines where the authoritative write stream begins and how other nodes consume it.
Common topology families:
- Primary-replica
  - One node accepts writes
  - Replicas follow the primary's change stream
  - Operationally simple, very common
- Multi-primary / multi-leader
  - More than one node can accept writes
  - Requires conflict handling or scoped ownership
  - Useful in some geo or partitioned scenarios, but much harder
- Leaderless / quorum-style
  - Authority is spread across replicas
  - Reads and writes rely on quorum overlap
  - Often shifts more reconciliation complexity to the client or datastore logic
Why this lesson focuses on failover: In practice, many systems start with primary-replica because it gives the clearest operational story. The hardest production questions then become: how fresh are the replicas, and how do we promote one safely?
Connection to later sharding: Replication adds copies of the same data domain. Sharding later divides the authority domain itself across multiple shards.
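To make the authority framing concrete, here is a minimal Python sketch of primary-replica routing. The Node and Router names are invented for illustration, not a real driver API; the point is that writes have exactly one legitimate destination while reads may fan out.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str

@dataclass
class Router:
    """Toy connection router for a primary-replica topology."""
    primary: Node
    replicas: List[Node] = field(default_factory=list)

    def route_write(self) -> Node:
        # Write authority lives in exactly one place.
        return self.primary

    def route_read(self, allow_stale: bool = True) -> Node:
        # Stale-tolerant reads may use a follower; fresh reads stay on the primary.
        if allow_stale and self.replicas:
            return self.replicas[0]
        return self.primary

router = Router(primary=Node("madrid"),
                replicas=[Node("frankfurt"), Node("paris")])
assert router.route_write().name == "madrid"
```

Multi-primary and leaderless designs change exactly this routing function, which is why they also change the failure modes.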
Concept 2: Replica Lag Changes What Reads Actually Mean
Concrete example / mini-scenario: A user updates their profile, then immediately loads a page that reads from a follower. The old profile appears for a few seconds.
Intuition: Replication is often asynchronous or partially asynchronous, so a follower is not "another primary." It is a copy that is catching up.
What lag changes:
- Read-after-write expectations
- Whether analytics can tolerate stale results
- Whether failover loses recent acknowledged writes
Important distinctions:
- Synchronous replication
  - Stronger durability/freshness
  - Higher write latency
- Asynchronous replication
  - Lower write latency
  - Replica lag and possible data loss window on failover
- Semi-synchronous variants
  - Try to balance latency and survivability
Operational reality: Replica lag is not just a metric. It is an application behavior contract.
If reads route to lagging replicas, users may observe stale state. If failover promotes a lagging replica, recent acknowledged data may disappear unless durability rules prevented that.
Mental model:
Replication lag is the distance between "what the primary has decided" and "what this replica can currently prove."
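One common read-after-write technique under asynchronous replication is to remember the log position of a client's last write and serve that client from a follower only once it has applied at least that position. A minimal sketch, with invented names and integer positions standing in for real LSNs or offsets:

```python
import time

class Follower:
    """Toy follower applying an ordered change stream with some lag."""
    def __init__(self):
        self.applied_position = 0  # highest log position applied so far

def can_serve_read(follower: Follower, last_write_position: int,
                   timeout_s: float = 0.5) -> bool:
    """Wait briefly for the follower to reach the client's last write.

    True: this follower satisfies read-after-write for this client.
    False: the caller should fall back to the primary.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if follower.applied_position >= last_write_position:
            return True
        time.sleep(0.01)  # poll; a real system might subscribe instead
    return False
```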
Concept 3: Failover Is a Controlled Transfer of Authority
Concrete example / mini-scenario: The primary becomes unreachable. Monitoring fires. An orchestration system chooses a follower to promote and reroutes clients.
Intuition: Failover is not just a restart action. It is a sequence of safety decisions:
- Is the primary truly down or just partitioned?
- Which replica is most up to date?
- How do we prevent the old primary from continuing to accept writes if it comes back?
- How do clients discover and trust the new primary?
This is where incidents usually happen:
- False failure detection during network partitions
- Promotion of a stale replica
- Split-brain when two primaries think they are authoritative
- Applications continuing to write to the old endpoint
Critical failover ingredients:
- Reliable failure detection
- Promotion policy based on replication state
- Fencing or leader invalidation for the old primary
- Client routing update
- Recovery flow for the demoted or failed node
Trade-off:
- Faster failover improves availability
- But aggressive failover under uncertainty raises split-brain or stale-promotion risk
This is why failover timing is never "just tune it lower." It is a policy about how much uncertainty you are willing to tolerate before moving authority.
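The sketch below strings those ingredients together in order. Every name here is invented rather than taken from a real orchestration tool; the point is the sequencing: confirm failure with a majority, pick the most caught-up follower, fence, and only then repoint clients.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ReplicaState:
    name: str
    applied_position: int  # how far this follower has applied the change stream
    epoch: int = 0

class FencingService:
    """Toy epoch counter; a real system backs this with durable, consensus-protected state."""
    def __init__(self):
        self.epoch = 0

    def increment_epoch(self) -> int:
        self.epoch += 1
        return self.epoch

def fail_over(primary_reachable_votes: List[bool],
              replicas: List[ReplicaState],
              fencing: FencingService) -> Optional[ReplicaState]:
    # 1. Failure detection: require a majority of observers to agree the
    #    primary is unreachable, so one partitioned observer cannot trigger this.
    down_votes = sum(1 for ok in primary_reachable_votes if not ok)
    if down_votes <= len(primary_reachable_votes) // 2:
        return None  # not enough evidence; do not move authority yet
    # 2. Promotion policy: choose the most caught-up follower.
    candidate = max(replicas, key=lambda r: r.applied_position)
    # 3. Fencing: take a new epoch before serving writes, so the old
    #    primary's authority is revoked even if it comes back.
    candidate.epoch = fencing.increment_epoch()
    # 4. Only after fencing succeeds does client routing move.
    return candidate

winner = fail_over([False, False, True],
                   [ReplicaState("frankfurt", 120), ReplicaState("paris", 90)],
                   FencingService())
assert winner is not None and winner.name == "frankfurt"
```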
Troubleshooting
Issue: Reads from replicas sometimes show old data.
Why it happens: Replica lag is part of the design. The replica has not yet applied the latest committed changes from the primary.
Clarification / Fix: Decide which paths require fresh reads and route them accordingly, or use read-after-write techniques where necessary.
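As a sketch of the "decide which paths require fresh reads" part, a hypothetical per-path policy can pin freshness-sensitive reads to the primary while letting everything else tolerate bounded staleness:

```python
# Hypothetical route policy: paths listed here must reflect the caller's
# own recent writes, so they are never served from a lagging follower.
FRESH_PATHS = {"/profile", "/account/settings"}

def choose_read_target(path, primary, replicas):
    # Freshness-sensitive paths stay on the primary; the rest may use a follower.
    if path in FRESH_PATHS or not replicas:
        return primary
    return replicas[0]
```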
Issue: Failover succeeds technically, but recent writes disappear.
Why it happens: The promoted replica was behind the failed primary, and the system's durability policy allowed that gap to exist.
Clarification / Fix: Revisit synchronous/semi-synchronous policy, promotion eligibility, and the application's tolerance for write loss windows.
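One way to make that tolerance explicit is a promotion-eligibility check. This sketch assumes, as many semi-synchronous setups allow, that you can track the highest position the old primary acknowledged to clients; all names are illustrative:

```python
def eligible_for_promotion(replica_positions, last_acked_position,
                           max_acceptable_gap=0):
    """Return the replicas a durability policy allows to become primary.

    replica_positions: {replica_name: highest applied log position}
    last_acked_position: highest position the old primary acknowledged
    max_acceptable_gap: how much acknowledged data the business tolerates
    losing on promotion; 0 means no acknowledged write may disappear
    """
    return [name for name, pos in replica_positions.items()
            if last_acked_position - pos <= max_acceptable_gap]

# Under a strict policy, a lagging follower is simply not promotable.
assert eligible_for_promotion({"a": 100, "b": 97}, last_acked_position=100) == ["a"]
```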
Issue: Failover sometimes creates two writable primaries.
Why it happens: Failure detection and authority revocation are not strong enough under partition or delayed network conditions.
Clarification / Fix: Add fencing, stronger leadership control, or clearer orchestration rules. The goal is not just promotion, but safe promotion.
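Fencing is easiest to see at the storage layer. In this toy sketch, every elected primary carries a monotonically increasing epoch (a fencing token), and the store refuses writes tagged with an older epoch, which is what stops two writable primaries from both succeeding:

```python
class FencedStore:
    """Toy store that rejects writes from a deposed primary."""
    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, key, value, epoch):
        if epoch < self.highest_epoch:
            return False  # stale leader: its authority was revoked
        self.highest_epoch = epoch
        self.data[key] = value
        return True

store = FencedStore()
assert store.write("k", "from-new-primary", epoch=2)
assert not store.write("k", "from-old-primary", epoch=1)  # split-brain blocked
```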
Issue: Replica lag spikes during heavy load.
Why it happens: Followers may be bottlenecked by apply speed, I/O, network bandwidth, or replay serialization.
Clarification / Fix: Measure the replication pipeline end to end instead of blaming the primary only. The bottleneck may be in apply, not ship.
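As one concrete example: with PostgreSQL streaming replication, pg_stat_replication exposes the stages of the pipeline separately, so you can tell a ship bottleneck from an apply bottleneck. A sketch assuming the psycopg2 driver, run against the primary:

```python
import psycopg2  # assumes PostgreSQL streaming replication and this driver

# sent -> written -> flushed -> replayed: a small ship backlog with a large
# apply backlog means the follower, not the primary, is the bottleneck.
QUERY = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS ship_backlog_bytes,
       pg_wal_lsn_diff(sent_lsn, replay_lsn)           AS apply_backlog_bytes,
       replay_lag
FROM pg_stat_replication;
"""

with psycopg2.connect("dbname=postgres") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for name, ship, apply_backlog, lag in cur.fetchall():
        print(f"{name}: ship={ship}B apply={apply_backlog}B replay_lag={lag}")
```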
Advanced Connections
Connection 1: Replication <-> WAL / Durable Logs
The parallel: Many replication systems stream from the same durable change log that crash recovery depends on. Local recovery and remote replication often start from the same ordered source of truth.
Why this matters: This is why WAL and replication are so close conceptually: first record changes durably, then let other nodes catch up.
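A toy log makes the shared source of truth visible: local crash recovery and follower catch-up are just two readers of the same ordered records. The ChangeLog name is invented for illustration:

```python
class ChangeLog:
    """Toy ordered change log, standing in for a WAL."""
    def __init__(self):
        self.records = []

    def append(self, change):
        self.records.append(change)
        return len(self.records)  # position of this record

    def read_from(self, offset):
        # Used identically by a recovering local node and a remote follower.
        return self.records[offset:]

log = ChangeLog()
log.append({"set": ("user:1", "ada")})
recovery_view = log.read_from(0)   # crash recovery replays from a checkpoint
follower_view = log.read_from(0)   # a follower streams from its last offset
assert recovery_view == follower_view  # same ordered source of truth
```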
Connection 2: Replication <-> Sharding
The contrast: Replication copies the same authority domain. Sharding splits the authority domain across nodes. You usually replicate each shard later, but that is a different design decision.
Why this matters: Teams often reach for sharding when replication was enough, or expect replication alone to solve problems that really require partitioning data ownership.
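A hypothetical cluster map shows the two axes side by side: sharding divides ownership of the key space, while replication adds copies within each shard. All node names are invented:

```python
import zlib

CLUSTER = {
    0: {"primary": "node-a", "replicas": ["node-b", "node-c"]},
    1: {"primary": "node-d", "replicas": ["node-e", "node-f"]},
}

def shard_for(key):
    # Dividing the authority domain itself: this is sharding.
    return zlib.crc32(key.encode()) % len(CLUSTER)

def write_target(key):
    # Within a shard, writes still flow to one primary: this is replication.
    return CLUSTER[shard_for(key)]["primary"]
```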
Resources
- [DOC] PostgreSQL Replication Overview - Documentation
  Focus: practical mental models for replication, standby nodes, and failover concerns.
- [DOC] MySQL Replication - Documentation
  Focus: concrete operational details of primary-replica replication and lag.
- [BOOK] Designing Data-Intensive Applications - Book site
  Focus: strong conceptual grounding for replication, failover, and distributed data trade-offs.
Key Insights
- Replication topology is really about write authority and how other copies follow it.
- Replica lag changes the meaning of reads and failovers, not just dashboard metrics.
- Failover is a safety protocol, not merely a restart or promotion button.