Day 078: Health Checks and Circuit Breakers

These two patterns both deal with failure, but they act at different boundaries: health checks control who should receive traffic, and circuit breakers control whether a dependency call should keep happening at all.


Today's "Aha!" Moment

When systems start failing in production, teams often reach for the same vague language: "the service is unhealthy" or "we need resilience." That usually hides two different questions. First: should this instance still receive new traffic? Second: should this caller keep trying the same dependency call when recent attempts are already failing?

Keep one example throughout the lesson. The learning platform runs several API instances behind a load balancer. Those instances depend on Redis for session and rate-limit state, and they also call an external payment provider during checkout. If one API instance loses access to Redis, it may still be alive as a process but unsafe to route traffic to. If the payment provider starts timing out across the fleet, every API instance may still be healthy locally, but continuing to hammer the provider can exhaust threads, queues, and patience everywhere.

That is the aha. Health checks and circuit breakers are not interchangeable resilience gadgets. Health checks help the traffic layer make better routing decisions about a specific instance. Circuit breakers help the caller decide when repeated dependency calls have become wasteful or dangerous. One is about admitting work into an instance. The other is about refusing to extend failure into a dependency path.

Once you see that separation, the design becomes much clearer. A good readiness check does not try to solve downstream retry storms. A circuit breaker does not tell the load balancer whether a pod should be in rotation. They solve adjacent but different problems, and the system becomes more understandable when each pattern stays in its proper role.


Why This Matters

The problem: Partial failure is normal, but systems without clear control points tend to amplify it. Unready instances keep receiving traffic, callers keep waiting on already failing dependencies, and one localized problem spreads across the fleet.

Before: every instance keeps accepting traffic even when its dependencies are broken, and every caller keeps retrying a dependency that is already failing, so one local problem drags down the fleet.

After: readiness probes pull unfit instances out of rotation, and circuit breakers stop calls to a dependency that is clearly failing, so the damage stays contained.

Real-world impact: Safer rollouts, fewer cascading outages, lower tail latency during incidents, and much clearer operational behavior under partial failure.


Learning Objectives

By the end of this session, you will be able to:

  1. Distinguish health checks from circuit breakers - Explain what each pattern controls and why they are not substitutes.
  2. Reason about readiness versus dependency failure - Separate instance-level traffic fitness from caller-side protection against bad downstream calls.
  3. Explain graceful failure behavior - Connect probes, breakers, timeouts, and fallbacks into one coherent resilience story.

Core Concepts Explained

Concept 1: Health Checks Tell the Platform Whether an Instance Should Be Trusted with New Traffic

The first thing to clarify is that "healthy" is not a single yes/no concept. An instance may be alive enough that the orchestrator should not restart it, yet not ready enough to receive user traffic.

That is why modern systems often separate at least these ideas:

  1. Liveness - the process is running and should not be restarted.
  2. Readiness - the instance is currently fit to receive new user traffic.
  3. Startup (on some platforms) - the instance is still initializing and should not yet be judged by the other probes.

In the learning platform example, an API pod may still answer a simple /healthz endpoint while its Redis dependency is disconnected or its cache warmup is incomplete. Sending live checkout traffic to that pod anyway only turns a local problem into visible user failure.
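
To make the distinction concrete, here is a minimal sketch of separate liveness and readiness handlers for such a pod. The endpoint behaviors and the check_redis probe are illustrative assumptions, not a prescribed API:

```python
def check_redis() -> bool:
    """Stand-in for a real Redis PING; hardcoded healthy in this sketch."""
    return True

def liveness() -> tuple[int, str]:
    # Liveness: is the process alive enough that it should not be restarted?
    return 200, "alive"

def readiness(cache_warm: bool) -> tuple[int, str]:
    # Readiness: should the load balancer send this instance new traffic?
    if not check_redis():
        return 503, "redis unavailable"
    if not cache_warm:
        return 503, "cache warming"
    return 200, "ready"
```

A pod in the middle of cache warmup answers liveness with 200 but readiness with 503, so the orchestrator keeps it alive while the balancer keeps it out of rotation.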

load balancer / orchestrator
        |
        +--> pod A: alive + ready      -> keep in rotation
        +--> pod B: alive + not ready  -> remove from rotation
        +--> pod C: dead               -> restart or replace

This is why readiness is the operationally important probe for traffic control. It answers, "Should this instance get more work right now?" That is a very different question from "Is the process technically still running?"

The trade-off is accuracy versus simplicity. Richer readiness checks make routing safer, but they also need careful design so transient blips or over-eager dependencies do not cause needless flapping.
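
On Kubernetes, for example, that separation is usually wired up as two distinct probes. The fragment below is purely illustrative; the paths, port, and timing values are assumptions, not recommendations:

```yaml
# Illustrative pod spec fragment: liveness decides restarts,
# readiness decides rotation.
containers:
  - name: api
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
      failureThreshold: 3    # restart only after sustained failure
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }
      periodSeconds: 5
      failureThreshold: 2    # drop from rotation sooner than restart
```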

Concept 2: Circuit Breakers Protect the Caller When Repeated Dependency Calls Stop Making Sense

Now shift from the traffic edge to the call path inside the service. Suppose checkout requests begin timing out against the payment provider. Without a circuit breaker, every API instance keeps trying, each request waits, retries pile up, and the fleet spends more and more of its time doing work that has little chance of succeeding.

A circuit breaker is a caller-side policy that says, roughly:

recent calls healthy      -> keep allowing calls
recent calls failing hard -> open circuit, stop calling for a while
probe after cool-down     -> half-open, test recovery carefully
recovery observed         -> close circuit and resume normal flow

The key point is not the metaphor of the "circuit." The key point is conserving caller resources and containing blast radius. Once repeated calls are mostly producing latency and failure, a fast local refusal is often better than another doomed remote wait.

This is also why circuit breakers are not just for microservices. Any dependency with meaningful latency or failure behavior can justify them: a payment API, a search cluster, a database proxy, a cache tier, or even a remote file store.

The trade-off is false positives versus protection. Opening too aggressively may reject calls that could have succeeded. Opening too slowly means the caller keeps burning resources on a dependency that is already in trouble. Good breaker policy therefore depends on reasonable windows, thresholds, and recovery probes.
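
That state machine is small enough to sketch directly. The threshold and cool-down below are illustrative defaults, not tuned values, and the injectable clock exists only to make the sketch testable:

```python
import time

class CircuitBreaker:
    """Minimal breaker sketch: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"   # allow a trial call through
                return True
            return False                   # fail fast locally
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"              # recovery observed

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Callers check allow() before the remote call and report the outcome back; while the circuit is open, the caller gets an immediate local refusal instead of another doomed remote wait.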

Concept 3: Detection Alone Is Not Enough; the System Needs a Safe Failure Mode

A readiness probe can remove bad instances from traffic. A circuit breaker can stop unproductive calls. Neither one is useful by itself unless the system also knows what behavior should replace the failing path.

For the learning platform, the right behavior depends on the workflow:

  1. Sessions and rate limits (Redis) - if that state is unavailable, the service might fall back to conservative defaults rather than reject every request outright.
  2. Checkout (payment provider) - a payment cannot be faked, so the honest behavior is to fail fast with a clear error instead of leaving users waiting on doomed calls.

That is where resilience becomes an actual product decision. The system must define which dependencies are optional, which are mandatory, and what "degraded but acceptable" looks like for each path.

dependency unhealthy
      |
      +--> optional feature -> degrade gracefully
      +--> mandatory feature -> fail fast, clearly, and cheaply

Health checks and breakers are therefore pieces of a larger control loop that also includes timeouts, retry policy, and fallback design. A breaker without a clear fallback still protects resources, but it may leave the user experience vague. A readiness check without explicit degraded behavior may route traffic correctly while the service still responds in confusing ways.

The trade-off is implementation complexity versus predictable behavior under failure. Clear degraded modes require more thought upfront, but they turn production incidents from improvised chaos into bounded, understandable service behavior.
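
One way to tie detection to a behavior policy is a small wrapper that treats a missing fallback as "this feature is mandatory". Everything here is a hypothetical sketch, not a library API:

```python
def call_with_policy(allow, remote_call, fallback=None):
    """allow: breaker/readiness gate; remote_call: the dependency call;
    fallback=None marks the feature as mandatory -> fail fast."""
    if not allow():
        if fallback is not None:
            return fallback()      # optional feature: degrade gracefully
        # mandatory feature: fail fast, clearly, and cheaply
        raise RuntimeError("dependency unavailable, failing fast")
    try:
        return remote_call()
    except Exception:
        if fallback is not None:
            return fallback()
        raise
```

In the platform example, a recommendations call might pass fallback=lambda: [] while checkout passes no fallback at all, making the product decision explicit in code.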

Troubleshooting

Issue: Treating a liveness endpoint as enough for the load balancer.

Why it happens / is confusing: A simple 200 OK endpoint is easy to add, and it feels like "health" has been covered.

Clarification / Fix: Use readiness for routing decisions. Liveness only tells you whether the process should keep existing; readiness tells you whether it should receive more user traffic.

Issue: Expecting a circuit breaker to solve every dependency problem by itself.

Why it happens / is confusing: The breaker has a clear state machine, so it can feel like the whole resilience design.

Clarification / Fix: Pair breakers with sensible timeouts, retry policy, and explicit fallback or fail-fast behavior. Otherwise you only change the shape of failure, not its usability.

Issue: Putting downstream dependency checks blindly into readiness probes.

Why it happens / is confusing: Teams want readiness to reflect real serving ability, which is reasonable.

Clarification / Fix: Include only dependencies that truly determine whether the instance should receive traffic. If every optional dependency can evict the instance from rotation, the service may flap more than it helps.
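
A simple way to enforce that rule is to tag each probe as mandatory or optional and let only the mandatory ones gate readiness. The probes here are hypothetical stand-ins:

```python
# Each entry: dependency name -> (probe, mandatory?). Probes are stubs.
CHECKS = {
    "redis":  (lambda: True,  True),   # sessions/rate limits need it
    "search": (lambda: False, False),  # optional: degrade, don't evict
}

def ready() -> bool:
    # Only mandatory dependencies may pull the instance out of rotation;
    # optional failures belong in metrics and logs, not in readiness.
    return all(probe() for probe, mandatory in CHECKS.values() if mandatory)
```

Even with the search probe failing, ready() stays true, so the instance keeps serving and only the search-backed feature degrades.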


Advanced Connections

Connection 1: Health Checks ↔ Load Balancing

The parallel: A load balancer is only as good as the signals it uses to decide which backends deserve more traffic.

Real-world case: An API fleet with bad readiness checks can keep routing to pods that are technically alive but practically broken, destroying the benefit of the balancer.

Connection 2: Circuit Breakers ↔ Backpressure and Retry Control

The parallel: Once a dependency is failing, breaker policy, timeout policy, and retry policy all shape how much extra pressure the caller adds to the incident.

Real-world case: A payment outage is much easier to survive when the caller fails fast and reduces retry amplification instead of turning the outage into a fleet-wide resource drain.



Key Insights

  1. Health checks and circuit breakers act at different boundaries - Probes influence routing to an instance, while breakers influence whether a caller should keep attempting a dependency call.
  2. Readiness is the traffic-facing notion of health - A process can be alive and still be unsafe to expose to new requests.
  3. Failure handling needs a behavior policy, not just detection - Fast failure, graceful degradation, and resource protection matter once trouble has been detected.

Knowledge Check (Test Questions)

  1. What is the main difference between readiness checks and circuit breakers?

    • A) Readiness controls whether an instance should receive traffic, while a circuit breaker controls whether a caller should keep attempting a failing dependency call.
    • B) They are two names for the same resilience pattern.
    • C) Circuit breakers only matter for databases.
  2. Why can a liveness-only health check be misleading?

    • A) Because the process may still be running even when the instance is not actually safe to serve real traffic.
    • B) Because liveness checks are only useful for frontend code.
    • C) Because readiness always replaces the need for liveness completely.
  3. Why is a fast local failure sometimes better than another remote call attempt?

    • A) Because once a dependency is clearly failing, more calls may only consume caller resources and amplify the outage.
    • B) Because remote dependencies should never be retried under any circumstances.
    • C) Because circuit breakers guarantee immediate downstream recovery.

Answers

1. A: Readiness is a routing decision about the instance. A circuit breaker is a call-policy decision about a dependency path.

2. A: Liveness only tells you that the process exists. It does not prove the instance is ready for user-facing traffic.

3. A: Fast failure can protect threads, queues, and latency budgets when another remote attempt is likely to be expensive and unproductive.


