Health Checks, Load Balancing, and Traffic Steering

Lesson 005 · 30 min · intermediate

Core Insight

Imagine the learning platform has three progress-service replicas behind a load balancer. One replica is alive as a process, but its database connection pool is exhausted. Another replica is healthy but draining because a deploy is rolling through. A third can serve read-only progress summaries, but should not accept completion writes because it cannot reach the authoritative storage path.

From outside, all three machines may look "up." From the user's point of view, only some of them are safe destinations for some kinds of traffic. That gap is why health checks and load balancing are not just traffic distribution features. They are how the system turns partial, local health signals into routing decisions while failures are still ambiguous.

A load balancer cannot magically know whether a service is semantically correct. It can only act on signals the system exposes: readiness checks, liveness checks, error rates, latency, connection state, deployment metadata, and operator policy. If those signals are too shallow, traffic steering will confidently send requests to the wrong place.

The useful design shift is to treat health as a contract between the service and the routing layer. The trade-off is speed of reaction versus correctness of diagnosis: aggressive checks can remove capacity too quickly, while weak checks can keep broken replicas in rotation until users discover the problem for you.

Health Is More Than Process Liveness

A service can be running and still be unsafe for user traffic. A process may respond to a simple HTTP ping while its queue is full, its dependency credentials are expired, or its local data is too stale for the operation being routed to it.

That is why production systems usually separate liveness from readiness:

liveness: should the platform restart this process?
readiness: should the load balancer send user traffic here?

For the progress service, liveness might only prove that the process can answer a local endpoint. Readiness should prove more: the service can accept the class of requests being routed to it, reach required dependencies, and stay inside latency or error bounds that make it a responsible traffic target.

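def readiness(dependencies_ok: bool, queue_depth: int, draining: bool,
              max_queue_depth: int = 1000) -> bool:
    """Return True only if this replica should receive new user traffic."""
    # A draining replica must not take new work, even if otherwise healthy.
    if draining:
        return False
    # Required dependencies (database, storage path) must be reachable.
    if not dependencies_ok:
        return False
    # Leave rotation before the queue backs up; the limit is workload-specific.
    return queue_depth < max_queue_depth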

The exact thresholds depend on the workload. The important point is that readiness encodes a traffic promise. "Ready for any request" is a stronger claim than "ready for cached reads." If the service cannot express that distinction, the router has to choose between overconfidence and overreaction.
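One way to let a service express that distinction is to report the request classes it can accept rather than a single boolean. The sketch below is illustrative only: `ReplicaState`, its fields, the class names `cached_read` and `write`, and the default queue limit are all assumptions, not part of any particular platform's API.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    # Hypothetical signals a replica might track about itself.
    draining: bool
    cache_warm: bool
    storage_reachable: bool
    queue_depth: int


def ready_classes(state: ReplicaState, max_queue_depth: int = 1000) -> set[str]:
    """Return the request classes this replica is willing to accept."""
    # Draining or overloaded replicas accept nothing new.
    if state.draining or state.queue_depth >= max_queue_depth:
        return set()
    classes = set()
    if state.cache_warm:
        classes.add("cached_read")
    if state.storage_reachable:
        classes.add("write")
    return classes


# A replica that cannot reach authoritative storage still serves cached reads.
read_only = ReplicaState(draining=False, cache_warm=True,
                         storage_reachable=False, queue_depth=12)
print(ready_classes(read_only))
```

With a per-class answer, the router can keep the third replica from the opening example in rotation for summaries while steering completion writes elsewhere.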

A shallow readiness check makes the router overconfident. An overly strict readiness check can remove too many replicas during a noisy incident and make overload worse. A good check is specific enough to protect users but conservative enough that one flaky dependency probe does not empty the pool.
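A common way to keep one flaky probe from emptying the pool is to require several consecutive results before changing state. The following is a minimal sketch of that idea; the class name `ProbeTracker` and the default thresholds are assumptions chosen for illustration.

```python
class ProbeTracker:
    """Flip health state only after several consecutive disagreeing probes."""

    def __init__(self, fail_threshold: int = 3, pass_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.pass_threshold = pass_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that disagree with current state

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the current health state."""
        if probe_ok == self.healthy:
            # The probe agrees with the current state, so reset the streak.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.pass_threshold if probe_ok else self.fail_threshold
            if self._streak >= needed:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


tracker = ProbeTracker()
probes = [False, True, False, False, False, True, True]
history = [tracker.record(ok) for ok in probes]
print(history)
```

In this run the single early failure is ignored, three failures in a row remove the replica, and two successes in a row bring it back. Higher thresholds react more slowly but tolerate noisier probes.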

Load Balancing Is A Policy Choice

Load balancing sounds like spreading requests evenly, but the real policy is more specific: choose a destination using imperfect information. Round-robin, least-connections, weighted routing, locality-aware routing, and request-class routing each optimize for a different pressure.

round-robin: simple distribution
least-connections: avoid already-busy replicas
weighted routing: shift traffic during rollout or migration
locality-aware routing: prefer nearby healthy capacity
request-class routing: separate reads, writes, or expensive operations
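To make the difference between the first two policies concrete, here is a small sketch. The replica names and connection counts are made up, and real load balancers track this state internally rather than through functions like these.

```python
import itertools


def round_robin(replicas: list[str]):
    """Return an iterator that rotates through replicas, ignoring load."""
    return itertools.cycle(replicas)


def least_connections(active: dict[str, int]) -> str:
    """Pick the replica with the fewest in-flight connections."""
    return min(active, key=active.get)


rotation = round_robin(["a", "b", "c"])
first_four = [next(rotation) for _ in range(4)]

busy = {"a": 12, "b": 3, "c": 7}
choice = least_connections(busy)
print(first_four, choice)
```

Round-robin returns to replica `a` on the fourth request even though it holds the most connections, while least-connections picks `b`. Neither policy knows whether `a` is busy because it is slow or because its requests are expensive.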

Suppose the catalog service can serve cached lesson metadata from any zone, but progress writes need stricter routing to replicas near the authoritative storage path. Sending both request types through the same policy may be easy, but it hides different failure consequences. A stale catalog read is often tolerable. A duplicated or lost progress write changes the user's state.
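Request-class routing for that scenario might look roughly like the sketch below. The pool names, replica names, and the path and method rules in `classify` are assumptions for illustration, not the platform's actual routes.

```python
# Hypothetical pools: any zone for general traffic, but only replicas
# near the authoritative storage path for progress writes.
ROUTE_POOLS = {
    "default": ["zone-a-1", "zone-b-1", "zone-c-1"],
    "progress_write": ["zone-a-1"],
}

WRITE_METHODS = {"POST", "PUT", "PATCH"}


def classify(method: str, path: str) -> str:
    """Map a request to a route class using method and path."""
    if path.startswith("/progress") and method in WRITE_METHODS:
        return "progress_write"
    return "default"


def candidates(method: str, path: str, healthy: set[str]) -> list[str]:
    """Return healthy replicas that are allowed to serve this request."""
    pool = ROUTE_POOLS[classify(method, path)]
    return [replica for replica in pool if replica in healthy]


healthy_now = {"zone-b-1", "zone-c-1"}
print(candidates("GET", "/catalog/lessons", healthy_now))
print(candidates("POST", "/progress/42", healthy_now))
```

When `zone-a-1` is unhealthy, catalog reads still have two candidates while progress writes have none. Failing the write explicitly is a deliberate choice here: it is usually safer than silently sending it to a replica that cannot reach authoritative storage.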

Load balancing also interacts with retries. If a gateway retries a timed-out request, it may choose another replica. That can be useful when one instance is slow. It can be dangerous for side-effecting writes unless the operation is idempotent or deduplicated. The router sees endpoints; the application still owns operation meaning.
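Deduplication with an idempotency key can be sketched as follows. `ProgressStore` is a hypothetical in-memory stand-in; in a real deployment the key table would have to live in storage shared by every replica, otherwise a retry routed to a different replica would not be recognized.

```python
class ProgressStore:
    """Hypothetical store that deduplicates completion writes by key."""

    def __init__(self):
        self.completed: list[tuple[str, str]] = []
        self._seen: dict[str, str] = {}  # idempotency key -> stored result

    def complete_lesson(self, key: str, user: str, lesson: str) -> str:
        if key in self._seen:
            # A retry of an earlier request: replay the result, write nothing.
            return self._seen[key]
        self.completed.append((user, lesson))
        result = f"recorded:{user}:{lesson}"
        self._seen[key] = result
        return result


store = ProgressStore()
first = store.complete_lesson("req-123", "ana", "lesson-005")
retry = store.complete_lesson("req-123", "ana", "lesson-005")
print(first == retry, len(store.completed))
```

The client generates the key once per logical operation and reuses it on every retry, so the gateway can retry freely without knowing what the write means.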

The trade-off is simplicity versus precision. One generic policy is easy to operate, but it may treat unlike requests as if they had the same risk. More precise routing can improve behavior under failure, but it adds configuration, testing, and debugging surface area. The extra precision is only worth it when the request classes really do have different safety or latency needs.

Draining Protects Change

Healthy services still need to leave rotation during deploys, scale-downs, and maintenance. Connection draining is the controlled version of disappearance: stop sending new work to a replica, let in-flight work finish if possible, then remove the instance.

mark draining
  -> stop new requests
  -> finish in-flight requests
  -> close long-lived connections or let deadlines expire
  -> terminate safely

Without draining, a rolling deploy can create avoidable failures. A replica disappears while users are mid-request. A long upload loses its connection. A retry starts against a new version with slightly different behavior. Draining does not remove all risk, but it gives the system a smaller and more predictable transition.

The hard part is that not all work drains cleanly. Streaming responses, WebSockets, long-running jobs, and transactions may outlive the deploy budget. The system needs deadlines and fallback behavior, not infinite patience. Draining is a change-management policy, not a promise that every old connection can live forever.
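The drain sequence with a deadline can be sketched as below. This is a simplified, single-threaded model: `DrainingReplica` and its counters are hypothetical, and a real server would need thread-safe accounting and a way to close lingering connections once the deadline passes.

```python
import time


class DrainingReplica:
    """Toy replica that refuses new work while draining."""

    def __init__(self):
        self.draining = False
        self.in_flight = 0

    def accept(self) -> bool:
        """Admit a new request unless the replica is draining."""
        if self.draining:
            return False
        self.in_flight += 1
        return True

    def finish(self) -> None:
        """Mark one in-flight request as complete."""
        self.in_flight -= 1

    def drain(self, deadline_s: float, poll_s: float = 0.01) -> bool:
        """Stop new work, then wait for in-flight work until the deadline.

        Returns True if everything finished, False if the deadline expired.
        """
        self.draining = True
        stop_at = time.monotonic() + deadline_s
        while self.in_flight > 0 and time.monotonic() < stop_at:
            time.sleep(poll_s)
        return self.in_flight == 0


replica = DrainingReplica()
replica.accept()                       # one long-running request in flight
drained = replica.drain(deadline_s=0.05)
print(drained, replica.accept())
```

Because the in-flight request never finishes in this example, `drain` gives up at the deadline and reports that it did not drain cleanly. The caller then decides on the fallback, such as terminating anyway and relying on client retries.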

Common Design Mistakes

One mistake is using a single shallow health endpoint for every decision. "Process is up" is not enough for readiness, routing, or deploy safety. Health checks should match the traffic the replica is being asked to serve.

Another mistake is letting health checks synchronize failure. If every replica probes the same dependency at the same time, health checking can add load during an incident. Checks need sensible intervals, jitter, and failure thresholds.
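Jitter can be as simple as randomizing each probe delay around the base interval. The function name and the 20% default spread below are assumptions for illustration.

```python
import random
from typing import Optional


def next_probe_delay(base_interval_s: float,
                     jitter_fraction: float = 0.2,
                     rng: Optional[random.Random] = None) -> float:
    """Return the base interval shifted by up to +/- jitter_fraction."""
    rng = rng or random.Random()
    spread = base_interval_s * jitter_fraction
    return base_interval_s + rng.uniform(-spread, spread)


# With a 10 s base interval, each delay falls between 8 s and 12 s.
delays = [next_probe_delay(10.0) for _ in range(5)]
print(delays)
```

Because every replica draws its own delay, probes that started in lockstep drift apart over a few cycles instead of hitting the shared dependency at the same instant.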

A third mistake is treating load balancers as if they understand product semantics. They can enforce routing policy, but they cannot decide whether a duplicate write is safe unless the application exposes that meaning through methods, headers, idempotency keys, or explicit route classes.

This is also where the previous lesson on partitions matters. A replica can be healthy from one side of the network and unsafe from another. Health checks should therefore be read as observations from particular paths, not as universal truth about the whole system.

Resources

[DOC] Kubernetes Probes
- Link: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
- Focus: Compare liveness, readiness, and startup probes as separate operational signals.

[DOC] Envoy Load Balancing
- Link: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/overview
- Focus: Study load-balancing policy as a configurable decision under imperfect endpoint knowledge.

[BOOK] Site Reliability Engineering
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: Revisit overload, cascading failure, and service interaction as operational design problems.

Key Takeaways

Liveness asks whether a process should be restarted; readiness asks whether it should receive traffic.
Load balancing is a routing policy built from incomplete health, latency, locality, request-class, and rollout signals.
Connection draining makes deploys and scale-downs less abrupt, but it still needs deadlines.
Traffic steering improves resilience only when the service exposes health signals that match real request semantics.
