Kafka Replication, ISR, and Leader Election

LESSON

Event-Driven and Streaming Systems

Lesson 018 · 30 min · Intermediate

Day 262: Kafka Replication, ISR, and Leader Election

In Kafka, having replicas is not enough. What matters is which replica leads, which followers are sufficiently caught up to count, and when the system should refuse progress rather than lie about durability.


Today's "Aha!" Moment

The insight: Kafka replication is not just "store several copies." It is a leadership and freshness problem. One replica leads writes, followers copy the log, and only the replicas that are close enough to the leader belong to the in-sync replica set, or ISR.

Why this matters: Teams often hear "replication factor 3" and assume durability is straightforward. It is not. The important questions are: which replica is currently authoritative, how many replicas are genuinely up to date, and whether the system should keep accepting writes when too few good replicas remain.

The universal pattern: partition leader accepts writes -> followers replicate the leader's log -> replicas that stay sufficiently current remain in ISR -> writes and failover semantics depend on ISR size and leader election policy.

Concrete anchor: A partition has three replicas. One is leader, two are followers. If one follower falls behind, you still have three copies on paper, but fewer replicas are truly current. The system's real durability now depends on the leader, the remaining in-sync followers, and the producer's acknowledgement settings.
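To make that anchor concrete, here is a minimal sketch using Kafka's Java AdminClient to compare configured replicas with the current ISR per partition. The topic name orders and the address localhost:9092 are placeholder assumptions, and allTopicNames() assumes a reasonably recent kafka-clients version:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // describeTopics reports configured replicas and the current ISR per partition.
            Map<String, TopicDescription> topics =
                admin.describeTopics(List.of("orders")).allTopicNames().get();

            for (TopicPartitionInfo p : topics.get("orders").partitions()) {
                System.out.printf("partition=%d leader=%s replicas=%d isr=%d%n",
                    p.partition(),
                    p.leader() == null ? "none" : p.leader().idString(),
                    p.replicas().size(),
                    p.isr().size());
            }
        }
    }
}
```

If a partition prints replicas=3 but isr=2, it is in exactly the degraded state described above: three copies on paper, two that currently count.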

How to recognize when this applies:

  1. You are choosing producer acknowledgement settings and need to know what they actually guarantee.
  2. A broker is slow or down and you must decide whether a partition is still safely writable.
  3. Someone quotes "replication factor 3" as the whole durability story without mentioning ISR health.

Common misconceptions:

  1. Replication factor equals current durability; in practice it only sets an upper bound.
  2. Any existing follower is a safe failover target, no matter how far behind it is.
  3. Replication only matters when something fails; in fact it shapes write latency on the healthy path too.

Real-world examples:

  1. Healthy partition: Leader and followers are in sync, so an acks=all write is confirmed by every in-sync replica before it is acknowledged.
  2. Degraded partition: Replicas exist but some are out of sync, so the cluster may need to choose between keeping writes available and preserving stronger guarantees.

Why This Matters

The problem: Replication sounds simple until a broker slows down or fails. Then the system must decide whether a follower is fresh enough to count, whether writes should still be accepted, and who becomes leader without corrupting the partition's continuity guarantees.

Before: "Replication factor 3" is read as a durability guarantee, failover is assumed to be harmless, and producer acknowledgement settings feel like an arbitrary knob.

After: Durability is judged by ISR health and acknowledgement settings together, and failover is judged by whether an in-sync replica is ready to take over.

Real-world impact: This understanding prevents false confidence, improves failure response, and makes producer durability settings much more meaningful.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain what Kafka replication is protecting - Understand that partitions are replicated logs with a single leader at a time.
  2. Describe how ISR shapes write guarantees - Distinguish existing replicas from replicas that are current enough to count for stronger acknowledgements.
  3. Evaluate leader election trade-offs - Reason about when failover preserves continuity and when the system may need to prefer safety over write availability.

Core Concepts Explained

Concept 1: Every Partition Has One Leader and Several Followers

Kafka replicates partitions, not whole topics as one monolithic structure.

For each partition, one replica is the leader and the remaining replicas are followers, each holding its own copy of the partition log.

The leader is the authoritative entry point for writes: producers send records to it, and it decides the order in which records are appended to the partition log.

Followers pull data from the leader and append it to their own local copy of the partition log.

This matters because Kafka does not try to accept concurrent writes on every replica. It chooses a single leader per partition and routes every write through it.

That simplifies ordering and replication logic.

The cost is that leader health matters a lot. If the leader fails, the partition cannot accept writes until a new leader is elected from the remaining replicas.

So the key design pattern is a single leader that fixes the order of the log, plus followers that copy that log and stand ready to take over.

This is why replication and leader election naturally belong together. They are not separate features. Election only matters because the data structure being replicated has an active leader.
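To make the leader/follower split tangible, here is a deliberately simplified toy model in Java. It is not Kafka's replication protocol; the ToyPartition class, its in-memory lists, and the catch-up check are illustrative assumptions that only mirror the idea of one authoritative log that followers copy:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: one leader log per partition, followers pull and append.
class ToyPartition {
    final List<String> leaderLog = new ArrayList<>();
    final List<List<String>> followerLogs = List.of(new ArrayList<>(), new ArrayList<>());

    // Every write goes through the leader, which fixes the record order once.
    void produce(String record) {
        leaderLog.add(record);
    }

    // A follower copies whatever it is missing from the leader's log, in order.
    void replicate(int follower) {
        List<String> log = followerLogs.get(follower);
        for (int i = log.size(); i < leaderLog.size(); i++) {
            log.add(leaderLog.get(i));
        }
    }

    // "In sync" here simply means caught up to the leader's log end.
    boolean inSync(int follower) {
        return followerLogs.get(follower).size() == leaderLog.size();
    }

    public static void main(String[] args) {
        ToyPartition p = new ToyPartition();
        p.produce("order-1");
        p.produce("order-2");
        p.replicate(0);                  // follower 0 catches up
        System.out.println(p.inSync(0)); // true: a current copy
        System.out.println(p.inSync(1)); // false: the copy exists but is stale
    }
}
```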

Concept 2: ISR Means "Replicas Current Enough to Count"

A replica can exist without being sufficiently up to date to deserve trust in durability decisions.

That is what the ISR captures: the set of replicas, including the leader, that are currently close enough to the leader's log end to be trusted in durability decisions.

This is the key distinction: a replica that merely exists versus a replica that is in sync right now.

That difference becomes operationally important during lag or failure.

Suppose the replication factor is 3: the partition has a leader, one follower that keeps up, and one follower that has fallen behind.

Then the real picture is not "we have 3 good copies." It is one authoritative leader, one genuinely current follower, and one stale copy: an ISR of two.

This is why producer durability settings interact with the ISR: acks=all waits for the current in-sync replicas to confirm the write, and the topic's min.insync.replicas setting defines the ISR size below which such writes are rejected rather than silently weakened.

That is the core trade-off: waiting on more in-sync replicas buys durability, but it costs write latency and can cost write availability when the ISR shrinks below the required minimum.

So Kafka's write contract is not just a producer flag. It is a negotiation between producer settings and current ISR health.
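As a minimal sketch of that negotiation from the producer side, the Java snippet below sends one record with acks=all. The topic orders, the key and value, and the broker address are placeholder assumptions; the callback illustrates that, with min.insync.replicas set on the topic, the broker can reject the write (for example with NotEnoughReplicasException) instead of pretending it is durable:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the leader acknowledges only after the current ISR has the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "created"), (metadata, exception) -> {
                if (exception != null) {
                    // With min.insync.replicas set on the topic, this can surface a
                    // NotEnoughReplicasException when too few replicas are in sync.
                    System.err.println("write not acknowledged: " + exception);
                } else {
                    System.out.println("acknowledged at offset " + metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```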

Concept 3: Leader Election Chooses Availability Boundaries Under Failure

When the leader fails, Kafka must elect a new leader from the remaining replicas.

The subtle part is not election itself. The subtle part is which replicas are eligible to take over, and what is lost if a replica that was not fully caught up becomes leader.

If leadership moves to a replica that was not sufficiently caught up, the partition may appear available but continuity and acknowledged-write guarantees can degrade.

That is why leader election policy is a correctness question, not just an uptime question.

The practical lesson is to elect new leaders from the ISR whenever possible. Promoting a replica that had fallen out of sync (an "unclean" election) keeps the partition writable but can discard writes the old leader had already acknowledged.

This is also why ISR health matters so much before failure happens. The safest leader election is the one where there are healthy in-sync followers ready to take over.

This framing produces the right mental model: replication factor is potential safety, the ISR is the safety actually available right now, and leader election policy decides what is sacrificed when the two diverge.

And it sets up the next lessons cleanly: partitioning and ordering guarantees, then delivery semantics, both of which assume a replicated, leader-based partition log underneath.
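These policies are mostly set per topic. Below is a hedged sketch that creates a topic with the two settings this lesson keeps pointing at: min.insync.replicas as the ISR floor for acks=all writes, and unclean.leader.election.enable=false to refuse promoting out-of-sync replicas. The topic name, partition count, and broker address are placeholder assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class SafeTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("orders", 3, (short) 3) // 3 partitions, replication factor 3
                .configs(Map.of(
                    // acks=all writes are rejected once fewer than 2 replicas are in sync.
                    "min.insync.replicas", "2",
                    // Never promote a replica that had fallen out of the ISR.
                    "unclean.leader.election.enable", "false"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```

The choice of 2 as the floor with replication factor 3 is the common middle ground: one replica can lag or fail without blocking writes, but a single surviving copy is never treated as "durable enough."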


Troubleshooting

Issue: "Replication factor is 3, so why are writes failing?"

Why it happens / is confusing: Teams equate configured replicas with currently healthy replicas.

Clarification / Fix: Check ISR size and min.insync.replicas. Replicas may exist but not be in sync enough to satisfy the configured write guarantee.
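One way to make this check concrete, sketched with the Java AdminClient under the same placeholder assumptions as earlier (orders topic, localhost:9092): read the topic's effective min.insync.replicas and flag any partition whose current ISR is already below it, since acks=all writes to that partition will fail right now.

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.TopicPartitionInfo;
import org.apache.kafka.common.config.ConfigResource;

public class WriteAvailabilityCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        String topic = "orders";                           // placeholder topic name

        try (AdminClient admin = AdminClient.create(props)) {
            // Effective topic config, including the min.insync.replicas floor.
            ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topic);
            Config config = admin.describeConfigs(List.of(resource)).all().get().get(resource);
            int minIsr = Integer.parseInt(config.get("min.insync.replicas").value());

            // Compare each partition's live ISR against that floor.
            for (TopicPartitionInfo p : admin.describeTopics(List.of(topic))
                    .allTopicNames().get().get(topic).partitions()) {
                if (p.isr().size() < minIsr) {
                    System.out.printf("partition %d: isr=%d < min.insync.replicas=%d -> acks=all writes will fail%n",
                        p.partition(), p.isr().size(), minIsr);
                }
            }
        }
    }
}
```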

Issue: "A follower exists, so failover should always be harmless."

Why it happens / is confusing: Existence of a copy is confused with freshness of that copy.

Clarification / Fix: A lagging replica is not equivalent to a current replica. Leader election safety depends on how caught up the candidate really is.

Issue: "Why did producer latency increase even before anything failed?"

Why it happens / is confusing: Replication is often thought of only as a failure feature.

Clarification / Fix: With stronger acknowledgement settings, producers may wait for more in-sync replicas to confirm the write path. Durability usually costs some write latency.


Advanced Connections

Connection 1: Kafka Replication, ISR, and Leader Election <-> Log-Structured Storage

The parallel: The previous lesson explained what Kafka is replicating: ordered partition logs stored in segments. This lesson explains how those logs stay durable and authoritative across brokers.

Real-world case: Segment files and offsets define the storage model; ISR and leader election define whether that storage model survives broker failure cleanly.

Connection 2: Kafka Replication, ISR, and Leader Election <-> Delivery Semantics

The parallel: Delivery semantics later in the month only make sense once replication semantics are clear. Producer acknowledgements depend on how many in-sync replicas confirm a write and whether leadership changes preserve committed progress.

Real-world case: At-least-once or stronger producer behavior is meaningful only because there is a replicated partition log and a policy about who counts as safely current.




Key Insights

  1. Replication factor is not the whole story - What matters operationally is which replicas are current enough to belong to ISR.
  2. Leader election is a correctness boundary - Promoting a replica is not just about uptime; it is about preserving a safe authoritative log.
  3. Durability and availability trade against each other - Stronger write guarantees require enough healthy in-sync replicas and can reduce availability under degradation.

PREVIOUS: Kafka Log-Structured Storage and Segment Lifecycle | NEXT: Kafka Partitioning, Keys, and Ordering Guarantees
