Service Discovery, Naming, and Routing Control

LESSON

Networking and Failure Models

006 30 min intermediate

Core Insight

Suppose the learning platform splits progress, catalog, certificates, and recommendations into separate services. The frontend no longer talks to one backend, and the services no longer live at one fixed address. Replicas appear, disappear, move zones, roll versions, drain during deploys, and fail readiness checks.

The tempting story is that service discovery is just a lookup table: ask for progress-service, receive an address, connect. The real problem is subtler. A name is a promise about identity and intent, while an address is only a current place to send traffic. In a dynamic system, those two facts drift unless discovery, routing, and health information stay coordinated.

Discovery decides what endpoints exist. Routing decides which one gets this request. Identity decides whether the endpoint is allowed to answer as that service. Health decides whether the endpoint should still receive traffic. If any of those meanings is blurred, failures become confusing: stale DNS records, traffic to old versions, requests sent across the wrong region, or clients trusting an endpoint that should no longer represent the service.

The useful design shift is to treat naming as a control surface: a stable handle that lets operators change where traffic goes without changing the callers. The trade-off is freshness versus stability: clients need endpoint updates quickly enough to avoid broken paths, but not so eagerly that every small change becomes a fleet-wide storm.

Names Are Stable, Endpoints Are Not

A service name should express intent: "the progress API that owns lesson completion state." An endpoint expresses a current routing option: "this IP and port currently accepts progress traffic." The name should outlive any particular replica.

service name: progress-service
current endpoints:
  10.0.1.7:8080
  10.0.2.4:8080
  10.0.3.9:8080

When a replica fails readiness, the endpoint list should change. When a deployment adds a new version, the endpoint set may include both old and new replicas while traffic shifts. When a zone partitions, different clients may temporarily see different reachable endpoints. The service name stays the same while the safe destinations behind it change.

The discovery mechanism can be DNS, a service registry, an orchestrator API, a sidecar proxy, or a platform-specific control plane. The mechanism matters, but the core question is the same: how does the client learn where safe destinations are, and how stale can that knowledge become?

The trade-off is caching versus freshness. Caching discovery data reduces load and smooths transient control-plane problems. Stale discovery data can keep sending traffic to endpoints that are no longer healthy, no longer local, no longer on the intended version, or no longer authorized for a request class.
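The caching trade-off can be made concrete with a small client-side cache. The following is a minimal sketch, not a production resolver: the CachedResolver class, the in-memory registry, and the 30-second TTL are all assumptions made for illustration, and the clock is injected only so the demo can control time.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedResolver:
    """Hypothetical client-side cache: service name -> endpoint list, kept for a TTL."""

    def __init__(
        self,
        lookup: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup      # authoritative source (registry, DNS, orchestrator API)
        self._ttl = ttl_seconds    # maximum age of a cached answer
        self._clock = clock        # injectable so the demo can control time
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, name: str) -> List[str]:
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # cheap, but possibly stale
        endpoints = list(self._lookup(name))  # copy so later registry edits do not leak in
        self._cache[name] = (now, endpoints)
        return endpoints


# Demo with a fake clock and an in-memory registry (both assumptions).
registry = {"progress-service": ["10.0.1.7:8080", "10.0.2.4:8080", "10.0.3.9:8080"]}
fake_now = [0.0]
resolver = CachedResolver(lambda n: registry[n], ttl_seconds=30, clock=lambda: fake_now[0])

resolver.resolve("progress-service")                   # fills the cache at t=0
registry["progress-service"].remove("10.0.3.9:8080")   # replica fails readiness

fake_now[0] = 10.0
stale = resolver.resolve("progress-service")           # still lists the failed replica
fake_now[0] = 31.0
fresh = resolver.resolve("progress-service")           # TTL expired, list is refreshed
```

Between t=0 and t=30 the client never touches the registry, which is the load benefit. During that same window it keeps offering 10.0.3.9:8080 as a destination, which is the staleness cost.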

Discovery And Routing Are Different Decisions

Discovery answers "what could receive this service's traffic?" Routing answers "where should this specific request go now?" Mixing them makes systems harder to reason about.

discovery: progress-service has endpoints A, B, C
routing: send this write to B because it is healthy, local, and in the active rollout group

For the learning platform, a catalog read can often go to any healthy local replica. A progress write may need a region with access to authoritative storage. A certificate issuance request may need stricter routing because duplicate issuance is expensive. All three may discover service endpoints the same way while routing requests differently.
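One way to picture the split is a routing function that takes the discovered endpoint set as input and applies a different rule per request class. This is a sketch under assumptions: the Endpoint fields, the "read" and "write" class names, and the has_authoritative_storage flag are hypothetical and stand in for whatever metadata a real platform exposes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Endpoint:
    address: str
    region: str
    healthy: bool
    has_authoritative_storage: bool  # assumed flag, used only for write routing


def route(endpoints: List[Endpoint], request_class: str, client_region: str) -> Optional[Endpoint]:
    """Pick one destination from the discovered set; None means refuse to route."""
    healthy = [e for e in endpoints if e.healthy]
    if request_class == "write":
        # Writes must land where authoritative storage is reachable, local or not.
        candidates = [e for e in healthy if e.has_authoritative_storage]
    else:
        # Reads prefer a local replica and fall back to any healthy one.
        local = [e for e in healthy if e.region == client_region]
        candidates = local or healthy
    return candidates[0] if candidates else None


# Discovery result: the same three endpoints for every request.
discovered = [
    Endpoint("A", region="eu", healthy=True, has_authoritative_storage=False),
    Endpoint("B", region="us", healthy=True, has_authoritative_storage=True),
    Endpoint("C", region="us", healthy=False, has_authoritative_storage=True),
]

read_target = route(discovered, "read", client_region="eu")
write_target = route(discovered, "write", client_region="eu")
```

Both calls start from the same discovered list. The read stays on the local replica A, while the write crosses regions to B because that is the only healthy endpoint with authoritative storage.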

This separation also helps during deploys. Discovery may show both version v1 and v2 endpoints. Routing policy decides whether canary traffic gets v2, whether only internal users see it, or whether the system rolls back because error rates crossed a threshold.
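A canary policy of this kind might look like the sketch below. The function name, the percentage-based bucketing on a hashed user ID, and the 5% rollback threshold are assumptions chosen for illustration, not a description of any particular platform's rollout mechanism.

```python
import hashlib


def pick_version(
    user_id: str,
    canary_percent: int,
    canary_error_rate: float,
    rollback_threshold: float = 0.05,  # assumed threshold
) -> str:
    """Choose between v1 and v2 for one request; discovery still lists both."""
    if canary_error_rate > rollback_threshold:
        return "v1"  # roll back: stop sending anyone to the canary
    # A stable hash keeps each user on the same version across requests.
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "v2" if bucket < canary_percent else "v1"


everyone_on_canary = pick_version("user-7", canary_percent=100, canary_error_rate=0.0)
nobody_on_canary = pick_version("user-7", canary_percent=0, canary_error_rate=0.0)
rolled_back = pick_version("user-7", canary_percent=100, canary_error_rate=0.2)
```

Nothing in the endpoint list changes between these three calls. Only the policy inputs change, which is why rollback can be a routing decision rather than a discovery update.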

The trade-off is centralized policy versus application knowledge. Putting routing in a gateway or proxy can standardize policy and keep clients simple. Putting too much policy there can hide application meaning and make every route depend on a complex control plane.

Stale Discovery Is A Failure Mode

Discovery systems are themselves distributed systems. They have caches, watchers, registries, DNS TTLs, control-plane outages, delayed health updates, and clients with old endpoint lists. That means stale discovery is not an edge case. It is one of the normal ways networked systems fail.

Imagine the progress service starts draining a replica before a deploy. The orchestrator removes the endpoint from the service list, but one client cached the old address for thirty seconds. That client may still send writes to a replica that is trying to leave rotation.

t0: replica P3 starts draining
t1: registry removes P3 from ready endpoints
t2: client with stale cache sends write to P3
t3: P3 must reject, forward, or safely finish the request

The service still needs defensive behavior. Discovery should reduce bad routing, not become the only protection. A draining replica can reject new work, a proxy can refresh endpoint state, and a write operation can still require idempotency or authoritative routing before changing state.
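The timeline above can be sketched with a replica that rejects new work while draining and a client that refreshes its endpoint list when it is rejected. Everything here is hypothetical: the Replica and Draining names, the shared dictionary standing in for authoritative storage, and the idempotency key format are assumptions for illustration.

```python
from typing import Callable, Dict, List


class Draining(Exception):
    """Signals that a replica is leaving rotation and refuses new work."""


class Replica:
    def __init__(self, name: str, store: Dict[str, str]) -> None:
        self.name = name
        self.store = store          # stands in for shared authoritative storage
        self.draining = False

    def handle_write(self, idempotency_key: str, payload: str) -> str:
        if self.draining:
            raise Draining(self.name)           # reject instead of silently accepting
        if idempotency_key not in self.store:   # a retried write is applied only once
            self.store[idempotency_key] = f"{payload} (applied by {self.name})"
        return self.store[idempotency_key]


def send_write(
    cached: List[Replica],
    refresh: Callable[[], List[Replica]],
    idempotency_key: str,
    payload: str,
) -> str:
    try:
        return cached[0].handle_write(idempotency_key, payload)
    except Draining:
        fresh = refresh()                       # stale cache detected: ask discovery again
        return fresh[0].handle_write(idempotency_key, payload)


store: Dict[str, str] = {}
p1, p3 = Replica("P1", store), Replica("P3", store)
p3.draining = True                              # t0: P3 starts draining

# t2: the client still has P3 cached; a fresh lookup would return only P1.
result = send_write([p3], lambda: [p1], "user-7:lesson-42", "completed")
retry = send_write([p1], lambda: [p1], "user-7:lesson-42", "completed")
```

The stale client is corrected by the replica's rejection, not by discovery alone, and the idempotency key means the repeated write does not record the completion twice.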

The trade-off is efficiency versus safety. Longer caches reduce discovery load and make clients less dependent on the control plane, but they lengthen the window in which a client can route to an endpoint that has already left rotation. Shorter caches react faster but can create more lookup traffic and more sensitivity to control-plane instability.
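A rough back-of-the-envelope calculation shows how the two sides scale. This sketch assumes a simplified model in which every client refreshes one service name exactly once per TTL and the registry itself is always current; the client count of 3000 is an arbitrary illustrative number.

```python
def lookups_per_second(clients: int, ttl_seconds: float) -> float:
    """Steady-state refresh rate if every client refreshes once per TTL."""
    return clients / ttl_seconds


for ttl in (5, 30, 300):
    rate = lookups_per_second(3000, ttl)
    # Under this model a cached answer can be at most one TTL old.
    print(f"ttl={ttl}s  lookups/s={rate:.1f}  max cache age={ttl}s")
```

Lookup load falls in proportion to the TTL while the maximum cache age grows with it, so neither end of the range is free.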

Identity Prevents Accidental Trust

Finding an endpoint is not the same as trusting it. A client needs confidence that the endpoint answering as progress-service is actually allowed to represent that service. In modern service fleets, that usually means the endpoint proves its identity through certificates, workload identity credentials, service accounts, or another authenticated channel.

Without identity, discovery can become a dangerous source of trust. A stale registry entry, misconfigured DNS record, or compromised network path can send traffic to the wrong place. With identity, the client or proxy can reject an endpoint that does not prove it belongs to the intended service.

name says: progress-service
endpoint says: 10.0.2.4:8080
identity proves: this workload is allowed to serve progress-service traffic

The trade-off is security and clarity versus operational complexity. Strong identity helps contain misrouting and impersonation, but certificate rotation, policy configuration, and debugging failed handshakes become part of the networked system's everyday work.
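The check itself can be small once an authenticated identity is available. The sketch below assumes the identity string has already been verified by the transport (for example, taken from a validated certificate); the policy table, the identity strings, and the function names are hypothetical.

```python
from typing import Dict, FrozenSet

# Hypothetical policy: which authenticated workload identities may answer for a name.
ALLOWED_IDENTITIES: Dict[str, FrozenSet[str]] = {
    "progress-service": frozenset({"workload/progress"}),
    "catalog-service": frozenset({"workload/catalog"}),
}


def may_serve(service_name: str, verified_identity: str) -> bool:
    """True only if the already-authenticated peer is allowed to represent the name."""
    return verified_identity in ALLOWED_IDENTITIES.get(service_name, frozenset())


def connect(service_name: str, endpoint: str, verified_identity: str) -> str:
    # Discovery supplied the endpoint; identity decides whether to trust it.
    if not may_serve(service_name, verified_identity):
        raise PermissionError(f"{endpoint} may not answer as {service_name}")
    return f"connected to {service_name} at {endpoint}"


ok = connect("progress-service", "10.0.2.4:8080", "workload/progress")
```

If a stale registry entry pointed progress-service at an address now held by a catalog replica, connect would raise PermissionError instead of sending progress traffic to the wrong workload.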

Common Design Mistakes

One mistake is treating DNS or the registry as if it were always fresh truth. It is a cacheable control signal. TTLs, client caching, resolver behavior, and failure modes decide how quickly discovery reacts.

Another mistake is letting service names encode physical placement too early. A name like progress-service-zone-a-v2 may be useful for rollout plumbing, but it can leak temporary routing concerns into application code if used carelessly.

A third mistake is ignoring identity until after discovery works. The system may route successfully and still be unsafe if clients cannot verify which service actually answered.

Connections

The previous lesson treated health and load balancing as traffic decisions under uncertainty. Discovery supplies the candidate destinations those policies act on. If discovery is stale or naming is sloppy, even a good load-balancing policy starts from bad inputs.

The next lesson focuses on observability because discovery and routing bugs are hard to diagnose after the fact. A trace or log should be able to show which name was resolved, which endpoint was selected, which version answered, and whether identity verification succeeded.
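As a small illustration of those four facts in one record, the sketch below emits a structured log line per routed request. The field names and the routing_record helper are assumptions, not a format the lesson prescribes.

```python
import json


def routing_record(name: str, endpoint: str, version: str, identity_verified: bool) -> str:
    """Serialize one routing decision as a JSON log line (hypothetical field names)."""
    return json.dumps(
        {
            "resolved_name": name,
            "selected_endpoint": endpoint,
            "answering_version": version,
            "identity_verified": identity_verified,
        },
        sort_keys=True,
    )


record = routing_record("progress-service", "10.0.2.4:8080", "v2", True)
```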
