Exactly-Once Delivery Myths


Consistency and Replication

Lesson 061 · 30 min · Advanced

Day 492: Exactly-Once Delivery Myths

The core idea: "Exactly once" is not a free property of a network hop. A production system gets one logical effect only by naming a boundary and then making retries, replays, and crash recovery converge on the same durable outcome inside that boundary.

Today's "Aha!" Moment

In 060.md, Harbor Point made POST /bookings retry-safe with an idempotency key. That solved the edge problem: when guest #8841 loses the response after reserving cabin S12, a retry can map back to the original booking instead of charging the card twice. But the booking is only the start of the story. After that write commits, Harbor Point still has to publish booking-confirmed, project the passenger manifest, notify the embarkation team, and send the guest a confirmation email.

The myth appears when a team compresses all of that into one sentence: "our pipeline is exactly once." Suppose manifest-projector reads booking-confirmed, writes the guest into the sailing manifest table, and crashes before it commits its Kafka offset. On restart, the broker redelivers the same event. Nothing about that sequence is exotic. It is what a healthy at-least-once system does when it refuses to lose data after an ambiguous crash.
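That crash window can be made concrete with a small simulation. The broker and consumer below are illustrative stand-ins, not a real Kafka client: the consumer applies its write, crashes before committing the offset, and on restart the broker redelivers from the last committed offset.

```python
# Simulated at-least-once redelivery: a crash between the business write
# and the offset commit makes the broker hand out the same record again.

class FakeBroker:
    def __init__(self, records):
        self.records = records
        self.committed = 0  # offset of the next record to deliver

    def poll(self):
        # Deliver from the last committed offset, not the last read one.
        if self.committed < len(self.records):
            return self.committed, self.records[self.committed]
        return None

    def commit(self, offset):
        self.committed = offset + 1


def run_consumer(broker, applied, crash_before_commit=False):
    polled = broker.poll()
    if polled is None:
        return
    offset, record = polled
    applied.append(record)      # the business write succeeds
    if crash_before_commit:
        return                  # simulated crash: the offset is never committed
    broker.commit(offset)


broker = FakeBroker(["evt-8841-confirmed"])
applied = []
run_consumer(broker, applied, crash_before_commit=True)  # first attempt crashes
run_consumer(broker, applied)                            # restart: redelivery
print(applied)  # the same event is applied twice
```

Flipping the order (commit before the write) trades the duplicate for a loss: a crash after `commit` but before `applied.append` drops the record instead.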

The non-obvious point is that delivery, processing, and observable business effect are different questions. Harbor Point can deduplicate producer retries at the log, atomically checkpoint consumer offsets with internal state, and still send the same email twice if the email vendor call sits outside that atomic boundary. Once you separate those layers, "exactly once" stops being a slogan and becomes a design claim that must always finish with "where?" The trade-off is clear: tighter semantics require more state, more coordination, and narrower promises, but they also give the team a guarantee it can actually defend during incidents.

Why This Matters

Harbor Point's booking platform feeds operational systems that people act on immediately. The embarkation dashboard decides who can board. Finance decides which deposits to settle. Customer support decides whether a guest needs manual cleanup after a complaint. If duplicate events create two manifest rows, two onboard-credit adjustments, or two confirmation emails, the support team sees a "random" failure while the platform team argues about whether the broker, consumer, or downstream service is at fault.

The cost of vague language is real. If engineers believe a broker feature means end-to-end exactly-once delivery, they will skip dedupe tables, sink idempotency keys, and replay drills because the hard part looks solved already. Then the first redelivery after a crash creates visible damage and the postmortem uncovers an unspoken assumption: the system had exactly-once semantics only for records written back into the log, not for the external side effect that mattered to the business.

That is why production teams have to state the guarantee in operational terms. Are duplicates acceptable if the sink collapses them? Is message loss worse than duplicate work? Can the business effect participate in the same transaction as the offset commit? Those questions decide whether Harbor Point should build at-most-once, at-least-once, or effectively-once behavior at each boundary. "Exactly once" without that boundary is usually marketing, not engineering.

Core Walkthrough

Part 1: Grounded Situation

Keep one Harbor Point flow in view. The booking-api commits the cabin reservation and an outbox row in the same database transaction:

booking_id=8841
event_type=booking-confirmed
message_id=evt-8841-confirmed
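The point of the outbox row is that it commits in the same transaction as the booking itself. A minimal sketch, using sqlite3 so it runs standalone; the table shapes are illustrative, not Harbor Point's real schema:

```python
# Transactional outbox sketch: the booking row and the outbox row
# commit or roll back together, so no event exists without its booking.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY, cabin_id TEXT);
    CREATE TABLE outbox (message_id TEXT PRIMARY KEY,
                         event_type TEXT, booking_id INTEGER);
""")

with db:  # one transaction covers both inserts
    db.execute("INSERT INTO bookings VALUES (?, ?)", (8841, "S12"))
    db.execute("INSERT INTO outbox VALUES (?, ?, ?)",
               ("evt-8841-confirmed", "booking-confirmed", 8841))

print(db.execute("SELECT message_id FROM outbox").fetchall())
```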

An outbox-relay publishes that record to the booking-events topic. The manifest-projector consumes it and writes:

INSERT INTO passenger_manifest (sailing_id, booking_id, guest_id, cabin_id)
VALUES ('2026-07-14', 8841, 8841, 'S12');

Now the failure: the database commit succeeds, but the consumer crashes before it records "I have processed offset 912441." When the process comes back, Kafka quite reasonably hands it offset 912441 again. If Harbor Point acknowledges before the database write, it can lose the manifest update. If it acknowledges after the write, it can see the same event again. The system must choose which risk to carry and what mechanism will absorb it.

The producer side has the same ambiguity. If the relay sends evt-8841-confirmed, the broker appends it, and the acknowledgment packet is lost, the relay cannot tell whether the record is missing or merely unacknowledged. A resend is rational. Without producer deduplication, the topic may contain two copies of the same logical event before the consumer even starts.
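Producer-side deduplication is the mechanism Kafka's idempotent producer uses for this: the broker tracks a sequence number per producer identity and drops retries it has already appended. A toy single-partition version of that idea, with illustrative names:

```python
# Broker-side producer dedupe sketch: remember the highest sequence
# number appended per producer and drop any retry at or below it.

class Log:
    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest appended sequence

    def append(self, producer_id, seq, record):
        if self.last_seq.get(producer_id, -1) >= seq:
            return "duplicate-dropped"
        self.records.append(record)
        self.last_seq[producer_id] = seq
        return "appended"


log = Log()
first = log.append("relay-1", 0, "evt-8841-confirmed")   # appended
retry = log.append("relay-1", 0, "evt-8841-confirmed")   # lost ack, resend
print(first, retry, len(log.records))
```

Note what this does and does not buy: the topic holds one copy per logical send, but nothing here protects the consumer side.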

Exactly-once delivery sounds like it should remove all of this. In practice, it cannot remove uncertainty from a crash-prone network. What Harbor Point can do is move the uncertainty into a durable mechanism that makes repeated delivery converge on one committed result.

Part 2: Mechanism

The first step is to name the boundary precisely. Harbor Point has at least four different claims it could make:

  1. The broker appends each producer record once per partition.
  2. The consumer updates its internal database once per message.
  3. The consumer emits each follow-up event once.
  4. The guest receives one confirmation email.

Those claims are related, but they are not the same. A modern log like Kafka can help with the first claim by assigning a producer identity and sequence numbers so duplicate retries can be discarded. A stream processor can help with the second and third claims by committing source offsets, internal state, and produced records as one recovery unit. None of that automatically solves the fourth claim if the email API is an external side effect outside the transaction.

Harbor Point therefore needs two layers of protection:

Layer 1: Transport and log safety
producer_id + sequence -> collapse duplicate appends

Layer 2: Business-effect safety
message_id + durable dedupe record -> collapse duplicate reprocessing

For the manifest-projector, the safe pattern is not "trust the broker." It is "treat redelivery as normal and make it harmless." In pseudocode:

def handle_booking_confirmed(event):
    # One database transaction covers the dedupe check, the manifest
    # write, and the dedupe record, so they commit or roll back together.
    with db.transaction():
        # FOR UPDATE serializes concurrent redeliveries that find an
        # existing row; for a first delivery, the unique key on
        # message_id is what backstops a race between two inserts.
        existing = db.fetch_one(
            "SELECT booking_id FROM processed_messages WHERE message_id = %s FOR UPDATE",
            [event.message_id],
        )
        if existing:
            return "already-applied"

        db.execute(
            """
            INSERT INTO passenger_manifest (sailing_id, booking_id, guest_id, cabin_id)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (booking_id) DO NOTHING
            """,
            [event.sailing_id, event.booking_id, event.guest_id, event.cabin_id],
        )
        db.execute(
            "INSERT INTO processed_messages (message_id, booking_id) VALUES (%s, %s)",
            [event.message_id, event.booking_id],
        )

    # A crash here, after the commit but before the offset commit,
    # causes redelivery -- which the dedupe table now absorbs.
    commit_offset(event.offset)

This code does not prevent the handler from starting twice. It makes the second run converge on the same durable state as the first. That is the important distinction. Many production systems call this "exactly once" because the observable effect in the projector database becomes single-application even when the transport is at-least-once.
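A runnable miniature of the same pattern, with sqlite3 standing in for the production database and the offset commit omitted. Running the handler twice, as a redelivery would, leaves exactly one manifest row:

```python
# Dedupe-table handler in miniature: the second delivery of the same
# message_id converges on the state the first delivery committed.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE passenger_manifest (
        sailing_id TEXT, booking_id INTEGER PRIMARY KEY,
        guest_id INTEGER, cabin_id TEXT);
    CREATE TABLE processed_messages (
        message_id TEXT PRIMARY KEY, booking_id INTEGER);
""")

def handle_booking_confirmed(event):
    with db:  # one transaction: check, manifest write, dedupe record
        seen = db.execute(
            "SELECT 1 FROM processed_messages WHERE message_id = ?",
            (event["message_id"],)).fetchone()
        if seen:
            return "already-applied"
        db.execute(
            "INSERT OR IGNORE INTO passenger_manifest VALUES (?, ?, ?, ?)",
            (event["sailing_id"], event["booking_id"],
             event["guest_id"], event["cabin_id"]))
        db.execute("INSERT INTO processed_messages VALUES (?, ?)",
                   (event["message_id"], event["booking_id"]))
        return "applied"

event = {"message_id": "evt-8841-confirmed", "sailing_id": "2026-07-14",
         "booking_id": 8841, "guest_id": 8841, "cabin_id": "S12"}
print(handle_booking_confirmed(event))  # applied
print(handle_booking_confirmed(event))  # already-applied
```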

The boundary gets tighter when Harbor Point can atomically couple more pieces. If the projector reads from Kafka, updates local state, and emits a downstream manifest-updated event back into Kafka using transactions or checkpointed state, then the source offset, state mutation, and produced record can recover together. But the moment the flow crosses into an external email provider or payment gateway, Harbor Point is back in idempotency territory. The sink must accept a stable operation key such as evt-8841-confirmed, or Harbor Point must keep a local send ledger and tolerate retries against it. The infrastructure can narrow the duplicate window; it cannot wish the boundary away.
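The send-ledger idea can be sketched in a few lines. `send_email` below stands in for a vendor API that offers no idempotency of its own, and the ledger is in-memory here where production would make it durable; the key point is the crash window the comments mark:

```python
# Local send ledger for an external side effect that cannot join the
# database transaction. The ledger narrows the duplicate window; only a
# sink that accepts a stable operation key can close it entirely.

sent_ledger = set()   # durable table in production; in-memory here
outbound = []         # what the "vendor" actually received

def send_email(to, body):
    outbound.append((to, body))

def send_confirmation_once(message_id, to, body):
    if message_id in sent_ledger:
        return "skipped"
    send_email(to, body)         # a crash after this line, before the
    sent_ledger.add(message_id)  # ledger write, still duplicates the
    return "sent"                # email on retry -- the window shrinks
                                 # but never closes on this side alone

send_confirmation_once("evt-8841-confirmed", "guest@example.com", "Cabin S12 confirmed")
send_confirmation_once("evt-8841-confirmed", "guest@example.com", "Cabin S12 confirmed")
print(len(outbound))  # 1
```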

Part 3: Implications and Trade-offs

Once the team speaks precisely, architecture decisions get much easier. Harbor Point can promise "exactly-once projection into the manifest database" if message_id dedupe and offset discipline make replays converge there. It can promise "exactly-once append into Kafka output topics" if producer transactions cover the consume-transform-produce loop. It should not promise "exactly-once delivery to every downstream system" unless those systems share the same atomic commit boundary, which they usually do not.

The trade-offs are concrete. Durable dedupe tables consume storage and require retention policies. Transactional consume-transform-produce pipelines increase latency and reduce peak throughput because more work has to commit together. Long-lived producer identities, checkpoint metadata, and poison-message handling add operational complexity. Engineers also lose some freedom to do ad hoc side effects in the handler, because every side effect now has to fit the replay model or accept duplicates safely.

That cost is usually worth paying because the alternative is hidden semantic debt. A system that pretends duplicates cannot happen will produce brittle cleanup scripts, confusing postmortems, and vendor integrations that fail only under retries. A system that assumes duplicates will happen can turn them into a normal recovery path. That is the practical meaning behind the myth: exactly-once delivery is rarely an end-to-end network fact, but exactly-once observable effect can be engineered inside a well-defined boundary.

This sets up the next lesson naturally. Once Harbor Point has a bounded semantic story for duplicates and replays, the next production pressure is not "what if the event is repeated?" but "what if the consumer cannot keep up?" That is where 062.md picks up with backpressure and flow control.


Connections

Connection 1: 060.md handles ambiguity at the API edge

The previous lesson gave Harbor Point one durable identity for a guest action. This lesson extends the same idea downstream: once that action becomes a stream event, consumers still need a durable identity and replay-safe commit point.

Connection 2: 055.md explains why global atomicity is expensive

An end-to-end exactly-once claim would require one commit boundary across broker, database, and external sinks. The lesson on distributed transactions shows why that boundary is rare, slow, and operationally fragile in real systems.

Connection 3: 062.md turns from semantics to load regulation

After Harbor Point makes redelivery safe, consumer lag becomes the next bottleneck. Backpressure decides whether the system degrades predictably when manifest updates arrive faster than they can be applied.


Key Takeaways

  1. "Exactly once" is only meaningful when Harbor Point names the boundary where repeated delivery must collapse to one durable result.
  2. Brokers and stream processors can make some replay paths atomic, but external side effects still need their own idempotency or dedupe mechanism.
  3. A crash-safe consumer does not assume duplicate delivery away; it records enough state that redelivery becomes a normal recovery case.
  4. After duplicate effects are bounded, the next systems question is how to regulate load and lag when consumers fall behind.