Day 077: Load Balancing Fundamentals

Load balancing is not mainly about splitting traffic evenly. It is about keeping one logical service responsive and available while the real fleet underneath is uneven, busy, or partially broken.


Today's "Aha!" Moment

Teams often talk about load balancers as if they were simple traffic splitters: take incoming requests and spread them across several instances. That description is not false, but it is too weak to be useful. The hard part is not dividing requests. The hard part is deciding where a request should go when requests cost different amounts, instance health changes over time, and one slow backend can poison the latency seen by users.

Keep one example throughout the lesson. The learning platform serves its API through several identical-looking application instances. Some requests are cheap reads for course pages. Others trigger expensive personalized queries or long-lived streaming responses. One instance may be warming its cache after a deploy. Another may be healthy but already busy. From the client's point of view, though, there should still be one service: api.learn.example.

That is the aha. A load balancer is a control point that turns a messy fleet into one logical service endpoint. It hides topology from clients, routes around bad instances, and applies a policy for how traffic should interact with finite backend capacity. "Equal distribution" is only one possible policy, and often not the most useful one.

Once you think about load balancing this way, several design questions become sharper. Is the workload uniform or highly variable? Can any instance serve any request, or does state force stickiness? Do you want to minimize average utilization, tail latency, or blast radius during failures? Those are the real fundamentals. The load balancer matters because it is where those trade-offs become operational reality.


Why This Matters

The problem: Horizontal scale is mostly imaginary until the system has a safe way to present many backend instances as one stable, resilient service.

Before: Clients talk to individual instances or a hardcoded list of them, so every instance failure, deploy, or capacity change is directly visible to users.

After: Clients call one stable name while the traffic layer routes around unhealthy or busy instances, with no client-side changes.

Real-world impact: Better availability, smoother deploys, more useful autoscaling, and much safer behavior when instances differ in health or active load.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain what a load balancer actually controls - Connect routing, failure containment, and latency behavior.
  2. Compare balancing policies usefully - Distinguish simple even splitting from strategies that react to active load or capacity differences.
  3. Reason about state behind a balancer - Explain why stateless backends and shared state make routing safer and more flexible.

Core Concepts Explained

Concept 1: A Load Balancer Turns Many Backends into One Logical Service

From the client's perspective, a replicated service should not feel like a list of machines. The client should call one name and let the system decide which backend instance is currently the best target.

clients
   |
   v
load balancer
   |
   +--> api-1
   +--> api-2
   +--> api-3
   +--> api-4

That indirection matters for three reasons. First, it hides topology, so instances can be added, removed, or replaced without changing clients. Second, it contains failures by letting the traffic layer stop sending requests to obviously bad instances. Third, it creates one place where routing policy can evolve as the workload evolves.

This is why a load balancer is more than a convenience alias. It is the boundary that says, "There is one service here, even though the fleet behind it is fluid." Without that boundary, clients become coupled to infrastructure details and horizontal scaling becomes much harder to operate.

The trade-off is flexibility versus a new critical dependency. The balancer simplifies the client side and improves routing control, but it also becomes infrastructure that must itself be reliable and observable.
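The indirection above can be sketched in a few lines. This is a toy in-process model, not a real proxy: the `LoadBalancer` class, the instance names, and the round-robin cursor are all illustrative. The point is that clients only ever call `handle()`, while the fleet changes underneath.

```python
import itertools

class LoadBalancer:
    """One logical endpoint in front of a changing backend fleet.

    Clients only ever call handle(); instances can be added or
    removed without any client-side changes.
    """

    def __init__(self, backends):
        self.backends = list(backends)    # fleet is mutable: add/remove freely
        self._cursor = itertools.count()  # simple round-robin pointer

    def add(self, backend):
        self.backends.append(backend)

    def remove(self, backend):
        self.backends.remove(backend)     # e.g. after a failed health check

    def handle(self, request):
        # Routing policy lives here, in one place, not in every client.
        target = self.backends[next(self._cursor) % len(self.backends)]
        return f"{target} served {request}"

lb = LoadBalancer(["api-1", "api-2", "api-3"])
print(lb.handle("GET /courses"))   # api-1 served GET /courses
lb.remove("api-2")                 # instance churn is invisible to clients
print(lb.handle("GET /courses"))   # api-3 served GET /courses
```

A real balancer adds health checks, timeouts, and retries on top, but the boundary is the same: one name in front, a fluid fleet behind.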

Concept 2: Routing Policy Is Really a Scheduling Policy Under Variability

The most misleading mental model is that all requests cost roughly the same. If that were true, round-robin would solve most problems. Real systems are messier. One request may hit a warm cache and finish in milliseconds. Another may hold a connection open, trigger an expensive query, or stream data for several seconds.

That is why balancing policy should be read as a scheduling choice:

same 4 requests:

round robin        -> each backend gets one request
least connections  -> next request avoids backend already stuck with a long one
weighted           -> larger instance gets more share over time

What the balancer is really trying to protect is not mathematical fairness. It is service quality: latency, throughput, and resilience under uneven conditions. This is why two services with the same number of instances may need different balancing policies. A uniform static-content fleet and a mixed-cost API fleet do not want the same scheduling rule.

The trade-off is simplicity versus responsiveness to real load. Simple algorithms are easier to reason about and often good enough. Smarter policies can reduce tail latency and hot spots, but they require better health signals and a clearer understanding of workload shape.
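The difference between the policies above can be sketched with an assumed per-backend count of in-flight requests (a signal real proxies maintain via their own connection tables; the names and numbers here are illustrative):

```python
def least_connections(active):
    # Route new work to the backend with the fewest in-flight requests.
    return min(active, key=active.get)

# active[b] = in-flight requests on backend b (an assumed signal).
active = {"api-1": 0, "api-2": 0, "api-3": 0}
active["api-1"] += 1          # api-1 is stuck with a long-running request

# Round-robin would still hand api-1 every third request regardless.
# Least-connections steers the next requests around the busy node:
for _ in range(2):
    target = least_connections(active)
    active[target] += 1
    print("new request ->", target)   # api-2, then api-3
```

Note what the smarter policy buys: nothing if all requests are cheap and uniform, but a real tail-latency improvement once one backend is tied up with long-lived work.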

Concept 3: Stateless Backends Give the Balancer Freedom; Sticky State Takes It Away

Load balancing works best when any healthy instance can serve any request correctly. That usually means the important state lives in shared systems such as a database, cache, or object store rather than in one server's memory.

If session or workflow state is trapped inside one application instance, the balancer loses freedom. Suddenly requests must keep returning to the same backend, or the request path breaks. This is where sticky sessions often appear: they preserve correctness for a stateful backend, but they also reduce the balancer's ability to react cleanly to failures and uneven load.

shared state design:
request -> any healthy app instance -> shared session/data store

instance-local state design:
request -> must return to same app instance -> routing flexibility shrinks

Sticky routing is not always wrong. It can be pragmatic for legacy systems or for protocols with connection affinity. But it should be understood as a constraint and a trade-off, not as the default shape of a scalable service.

The trade-off is architectural convenience versus routing freedom. Instance-local state can be simpler in the short term, but shared or externalized state makes the backend fleet much more disposable, which is exactly what load balancing and horizontal scaling want.
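A minimal sketch of the two designs above, using plain dictionaries as stand-ins for an external session store and a sticky-session table (all names here are hypothetical):

```python
# shared_store plays the role of an external session store (cache/database).
shared_store = {}

def handle_stateless(session_id, instance):
    # Any healthy instance can serve the request: state lives in the shared store.
    shared_store.setdefault(session_id, {"cart": []})
    return f"{instance} served session {session_id}"

sticky_map = {}  # sticky design: session pinned to the instance that first saw it

def handle_sticky(session_id, healthy_instances):
    pinned = sticky_map.setdefault(session_id, healthy_instances[0])
    if pinned not in healthy_instances:
        # The session's state is trapped in a dead instance's memory.
        raise RuntimeError(f"session {session_id} lost with {pinned}")
    return f"{pinned} served session {session_id}"

fleet = ["api-1", "api-2"]
handle_sticky("s1", fleet)               # pins s1 to api-1
fleet.remove("api-1")                    # api-1 fails
print(handle_stateless("s1", fleet[0]))  # api-2 served session s1
# handle_sticky("s1", fleet) would now raise: routing freedom is gone
```

The failure mode is the instructive part: the stateless path shrugs off the lost instance, while the sticky path loses the session along with the machine it was pinned to.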

Troubleshooting

Issue: Thinking load balancing means "equal traffic for everyone."

Why it happens / is confusing: Introductory examples usually show round-robin because it is easy to visualize.

Clarification / Fix: Ask what the system is optimizing for: low tail latency, graceful failure handling, or use of uneven capacity. Equal distribution is only one possible strategy.

Issue: Assuming extra instances automatically improve the service.

Why it happens / is confusing: More servers feels like more capacity by definition.

Clarification / Fix: Capacity only becomes useful if the routing layer can direct traffic toward it safely and stop sending traffic to unhealthy or overloaded nodes.

Issue: Using sticky sessions to hide avoidable state design problems.

Why it happens / is confusing: Sticky routing can make a stateful service appear to scale for a while.

Clarification / Fix: Prefer shared or externalized state when feasible. Use stickiness intentionally when protocol constraints or legacy architecture truly require it.


Advanced Connections

Connection 1: Load Balancing ↔ Health Checks and Circuit Breaking

The parallel: Routing policy only works if the traffic layer has a credible view of which backends are healthy enough to receive new work.

Real-world case: A service may have plenty of instances, yet still fail badly if the balancer keeps feeding traffic to nodes that are technically alive but practically unhealthy.
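One common way to build that credible view is consecutive-failure counting in front of the routing policy. This sketch assumes a made-up threshold and probe loop; real systems add probe intervals, timeouts, and recovery checks:

```python
FAIL_THRESHOLD = 3   # evict after 3 consecutive failed probes (assumed policy)

failures = {"api-1": 0, "api-2": 0, "api-3": 0}

def record_probe(backend, ok):
    # Consecutive-failure counting: a single success resets the counter,
    # so one flaky probe does not evict a healthy backend.
    failures[backend] = 0 if ok else failures[backend] + 1

def routable():
    # Routing policy only ever sees backends that pass the health gate.
    return [b for b, n in failures.items() if n < FAIL_THRESHOLD]

for _ in range(3):
    record_probe("api-2", ok=False)   # api-2 is alive but failing its checks

print(routable())   # ['api-1', 'api-3']
```

The "technically alive but practically unhealthy" case is exactly what the counter catches: the process answers probes badly rather than not at all, and the gate removes it before clients pay for it.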

Connection 2: Load Balancing ↔ Horizontal Scaling

The parallel: Autoscaling adds or removes capacity, but the balancer is what turns that changing fleet into useful throughput.

Real-world case: An autoscaled API without sane routing policy can still produce hot spots, poor cache behavior, and bad latency under bursty traffic.



Key Insights

  1. A load balancer creates one logical service out of a changing fleet - It hides topology from clients and makes instance churn operationally manageable.
  2. Balancing policy is a scheduling policy - The right algorithm depends on request variability, active load, and capacity differences.
  3. Stateless backends give routing more freedom - Shared state lets any healthy instance serve the request, which makes scaling and failure handling much cleaner.

Knowledge Check (Test Questions)

  1. Why is a load balancer useful even before traffic is huge?

    • A) Because it decouples clients from individual instances and lets the service route around failures or uneven backend conditions.
    • B) Because it eliminates the need for backend health signals.
    • C) Because it guarantees that latency will always be equal across instances.
  2. When is a policy like least-connections often better than simple round-robin?

    • A) When requests vary a lot in duration or some backends are already busy with long-running work.
    • B) When every request has the same cost and every backend is identical in practice.
    • C) When there is only one backend instance.
  3. Why do stateless backends usually work better behind a load balancer?

    • A) Because routing can stay flexible when any healthy instance can serve the request using shared state.
    • B) Because stateless services never need databases or caches.
    • C) Because stickiness becomes physically impossible.

Answers

1. A: Even a small replicated service benefits from a stable entry point and from being able to move traffic away from bad instances without changing clients.

2. A: Least-connections is often better when active load differs meaningfully across backends. It tries to avoid sending new work to instances that are already tied up.

3. A: Statelessness keeps the fleet interchangeable. That gives the balancer much more freedom to optimize routing and contain failures cleanly.


