Service Discovery and Dynamic Topology

LESSON

010 30 min intermediate

The core idea: Service discovery is a runtime topology trade-off: callers depend on stable service identity while membership, health, and concrete endpoints keep changing underneath them.

Core Insight

Suppose the learning platform runs several enrollment-service instances. Billing needs to call enrollment during a purchase flow, but enrollment instances are being replaced during deploys, one replica is unhealthy, and autoscaling adds two more. Billing should depend on the logical service, not on a fragile list of IP addresses that changes underneath it.

That is the point of service discovery. It is not merely "find an address." It is a live view of which instances currently belong to a logical service, which of those instances are healthy enough to receive traffic, and how callers or infrastructure should route to them.

The misconception to correct is that topology is a static configuration problem. In an orchestrated system, topology is alive. Deployments, restarts, node failures, scale changes, and health transitions constantly change the concrete endpoint set. Hard-coded locations turn that normal churn into production breakage.

Discovery is therefore the runtime companion to communication patterns. The previous lesson described what one service wants from another. This lesson explains how a caller finds a healthy instance of the logical service while the physical system keeps moving.

Logical Identity, Moving Endpoints

The first job of discovery is to separate service identity from instance coordinates. Billing should call enrollment-service, not 10.0.14.8:3000, because the latter is only one temporary incarnation of the former.

billing
   |
   v
"enrollment-service"
   |
   +--> 10.0.14.8
   +--> 10.0.14.9
   +--> 10.0.15.2

That indirection is what lets the platform replace instances without forcing every caller to change. A rollout can remove one endpoint and add another. Autoscaling can grow or shrink the set. A failed node can make one instance disappear. Callers still express the stable dependency: "I need enrollment."
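The indirection can be sketched in a few lines. This is a minimal illustration, not any particular registry's API: `endpoints_by_service`, `resolve`, and the replacement address `10.0.16.4:3000` are hypothetical names and values introduced only for this example.

```python
# Hypothetical in-memory view: logical service name -> current endpoints.
endpoints_by_service: dict[str, list[str]] = {
    "enrollment-service": ["10.0.14.8:3000", "10.0.14.9:3000", "10.0.15.2:3000"],
}


def resolve(service_name: str) -> list[str]:
    """Return a copy of the endpoints currently behind a logical name."""
    return list(endpoints_by_service.get(service_name, []))


# Billing asks for the logical name before and after a rollout.
before = resolve("enrollment-service")

# The platform replaces one instance (the new address is made up for the example).
endpoints_by_service["enrollment-service"].remove("10.0.14.8:3000")
endpoints_by_service["enrollment-service"].append("10.0.16.4:3000")

after = resolve("enrollment-service")
```

The caller's code is identical in both lookups. Only the data behind the name changed.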

This is why discovery is not a convenience detail. It is the mechanism that lets dynamic infrastructure remain usable by application code. Without it, each caller becomes its own inventory of peers and every topology change becomes a coordination problem.

The cost of that flexibility is another runtime dependency. Discovery reduces topology coupling, but callers now depend on the freshness and reliability of the resolution path.

Membership and Health Are One Problem

A registry full of addresses is not enough. The system also needs to know which instances should receive traffic right now. An endpoint that still exists but points to a pod halfway through shutdown, a process with broken database connectivity, or a replica failing readiness checks is often worse than no endpoint at all.

Think of discovery as membership plus health:

service registry
   -> members: [a, b, c]
   -> health:  [ok, ok, bad]
   -> routing set: [a, b]

That model turns several operational details into one mechanism. Registration says an instance joined. Deregistration or expiration says it left. Health checks and readiness probes say whether it deserves traffic. Caches and TTLs decide how long callers or proxies may rely on a view before refreshing it.
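One way to picture membership plus health as a single mechanism is a toy registry. The sketch below is an assumption-heavy illustration: the `Registry` and `Member` classes, the method names, and the 10-second TTL are all hypothetical, and time is passed in explicitly as `now` so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Member:
    address: str
    healthy: bool
    last_seen: float  # time of the last registration or health report


class Registry:
    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self.members: dict[str, Member] = {}

    def register(self, instance_id: str, address: str, now: float) -> None:
        # Joining does not imply readiness: new members start as not healthy.
        self.members[instance_id] = Member(address, healthy=False, last_seen=now)

    def report_health(self, instance_id: str, healthy: bool, now: float) -> None:
        member = self.members.get(instance_id)
        if member is not None:
            member.healthy = healthy
            member.last_seen = now

    def deregister(self, instance_id: str) -> None:
        self.members.pop(instance_id, None)

    def routing_set(self, now: float) -> list[str]:
        # Eligible = reported healthy AND heard from within the TTL.
        return sorted(
            instance_id
            for instance_id, member in self.members.items()
            if member.healthy and now - member.last_seen <= self.ttl_seconds
        )


registry = Registry(ttl_seconds=10)
for instance_id, address in [("a", "10.0.14.8"), ("b", "10.0.14.9"), ("c", "10.0.15.2")]:
    registry.register(instance_id, address, now=0)
registry.report_health("a", True, now=1)
registry.report_health("b", True, now=1)
registry.report_health("c", False, now=1)

current = registry.routing_set(now=2)   # a and b are healthy and fresh
expired = registry.routing_set(now=20)  # nobody has reported within the TTL
```

Expiration is what covers the instance that vanishes without calling `deregister`: it simply stops reporting and ages out of the routing set.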

The hard part is freshness. If the system removes unhealthy endpoints too slowly, traffic goes to dead or draining instances. If it reacts too aggressively, transient blips can create flapping and churn. The trade-off is freshness versus stability and overhead.
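A common way to balance freshness against flapping is to require several consecutive check results before changing an instance's routable state. The sketch below assumes hypothetical thresholds (three failures to remove, two passes to restore) and a hypothetical `HealthTracker` class. Real systems expose similar knobs under their own names.

```python
class HealthTracker:
    """Flip the routable state only after a streak of contrary results."""

    def __init__(self, fail_threshold: int = 3, pass_threshold: int = 2) -> None:
        self.fail_threshold = fail_threshold
        self.pass_threshold = pass_threshold
        self.routable = False
        self._streak = 0  # consecutive results that disagree with the current state

    def observe(self, check_passed: bool) -> bool:
        if check_passed == self.routable:
            # The result agrees with the current state, so any streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.pass_threshold if check_passed else self.fail_threshold
            if self._streak >= needed:
                self.routable = check_passed
                self._streak = 0
        return self.routable


tracker = HealthTracker()
checks = [True, True, False, True, False, False, False]
history = [tracker.observe(result) for result in checks]
```

In this run the single failed check in the middle does not remove the instance, but three failures in a row do. Lower thresholds react faster and flap more. Higher thresholds are calmer and leave bad endpoints routable longer.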

Worked Topology Change

Trace a normal rollout of enrollment-service. At the start, billing resolves the logical service name and sees three healthy endpoints:

enrollment-service
  a: 10.0.14.8  ready
  b: 10.0.14.9  ready
  c: 10.0.15.2  ready

During the rollout, instance a starts draining while a new instance d starts on another node. Instance d is running but not ready yet. A healthy discovery view should not simply list every process that exists. It should reflect which endpoints are eligible for traffic at this moment.

transition:
  a: draining       -> remove from routing set
  b: ready          -> keep
  c: ready          -> keep
  d: running only   -> do not route yet

routing set: [b, c]

After d passes readiness, the routing set changes again:

stable after rollout:
  b: ready
  c: ready
  d: ready

routing set: [b, c, d]

This is the mechanism in miniature. Registration says an instance exists. Readiness says whether it should receive traffic. Draining says an instance may still be alive but should stop receiving new work. Expiration protects the system when an instance disappears without clean deregistration. Load balancing then chooses among the eligible set.
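The three snapshots of the rollout can be replayed as data. This is only a sketch of the eligibility rule described above: the `State` enum and `routing_set` function are hypothetical names, and the rule is simply "route only to instances that are ready."

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"    # process exists but has not passed readiness
    READY = "ready"        # eligible for new traffic
    DRAINING = "draining"  # still alive, finishing in-flight work only


def routing_set(instances: dict[str, State]) -> list[str]:
    """Only ready instances are eligible for new requests."""
    return sorted(name for name, state in instances.items() if state is State.READY)


start = {"a": State.READY, "b": State.READY, "c": State.READY}
during = {"a": State.DRAINING, "b": State.READY, "c": State.READY, "d": State.RUNNING}
after = {"b": State.READY, "c": State.READY, "d": State.READY}

snapshots = [routing_set(start), routing_set(during), routing_set(after)]
```

Note that `during` lists four instances that exist, but only two are routable.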

The caller's experience depends on where discovery knowledge lives. With client-side discovery, billing or its client library must refresh the endpoint set and stop selecting a. With server-side discovery, billing calls a stable virtual address and the proxy or platform routing layer updates the backend set. In both models, the important invariant is the same: stable service identity above, changing endpoint membership below.

Where Topology Knowledge Lives

Once discovery exists, the architecture still has to decide who uses it directly. In client-side discovery, the caller or a client library asks the registry for healthy instances and chooses one. In server-side discovery, the caller talks to a stable proxy, load balancer, or platform abstraction, and infrastructure resolves the healthy instance.

client-side:
caller -> registry -> choose instance -> call instance

server-side:
caller -> proxy/load balancer -> resolved healthy instance

The important difference is where topology awareness lives. Client-side discovery gives application code or libraries more direct control over selection, retry, and load balancing behavior. It also means every caller must carry some discovery logic consistently.

Server-side discovery centralizes that responsibility in infrastructure. Application code can stay simpler, but the proxy or load-balancing layer becomes more important to latency, availability, and debugging.

Neither model is universally better. The useful design question is operational ownership: where do you want resolution, load balancing, retries, and topology logic to live, and who will maintain that behavior safely?
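The difference in ownership shows up in what the caller has to carry. The sketch below contrasts the two models under stated assumptions: `ClientSideCaller`, `ServerSideCaller`, the round-robin choice, and the virtual address `enrollment.internal:80` are all hypothetical, and no network calls are made.

```python
import itertools
from typing import Callable

# A resolver returns the healthy endpoints for a logical service name.
Resolver = Callable[[str], list[str]]


class ClientSideCaller:
    """The caller asks the registry and chooses an instance itself."""

    def __init__(self, resolve: Resolver) -> None:
        self.resolve = resolve
        self._counter = itertools.count()

    def target_for(self, service: str) -> str:
        healthy = self.resolve(service)
        if not healthy:
            raise LookupError(f"no healthy instances for {service}")
        # Simple round-robin over whatever the registry returned this time.
        return healthy[next(self._counter) % len(healthy)]


class ServerSideCaller:
    """The caller knows one stable address; infrastructure picks the backend."""

    def __init__(self, virtual_addresses: dict[str, str]) -> None:
        self.virtual_addresses = virtual_addresses

    def target_for(self, service: str) -> str:
        return self.virtual_addresses[service]


client_side = ClientSideCaller(lambda service: ["10.0.14.9", "10.0.15.2"])
client_targets = [client_side.target_for("enrollment-service") for _ in range(3)]

server_side = ServerSideCaller({"enrollment-service": "enrollment.internal:80"})
server_target = server_side.target_for("enrollment-service")
```

The client-side caller contains selection logic that every other caller would also need. The server-side caller contains almost nothing, because that logic moved into the proxy or platform layer.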

Operational Failure Modes

Issue: Treating discovery as one-time registration.

Clarification / Fix: Treat discovery as a lifecycle problem. Registration, health updates, expiration, and deregistration all matter when topology changes continuously.

Issue: Embedding fixed instance locations in client configuration.

Clarification / Fix: Once services are replicated, replaced, or autoscaled, move callers to logical names and runtime resolution before fixed assumptions become expensive to unwind.

Issue: Letting stale endpoints remain routable too long.

Clarification / Fix: Tune health, readiness, TTLs, and expiry so the routing view stays current enough without causing unnecessary flapping.
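Staleness is easy to see in a caller-side cache. The sketch below assumes a hypothetical `CachedView` class with a 5-second TTL and an explicit `now` argument. It is not a model of any specific client library.

```python
from typing import Callable, Optional


class CachedView:
    """Caches the endpoint list and refetches once the TTL has elapsed."""

    def __init__(self, fetch: Callable[[], list[str]], ttl_seconds: float) -> None:
        self.fetch = fetch
        self.ttl_seconds = ttl_seconds
        self._endpoints: list[str] = []
        self._fetched_at: Optional[float] = None

    def endpoints(self, now: float) -> list[str]:
        if self._fetched_at is None or now - self._fetched_at >= self.ttl_seconds:
            self._endpoints = list(self.fetch())  # copy so later changes are not seen
            self._fetched_at = now
        return self._endpoints


live = ["a", "b", "c"]  # stands in for the registry's current truth
view = CachedView(fetch=lambda: live, ttl_seconds=5)

first = list(view.endpoints(now=0))
live.remove("a")                      # instance a is removed at the source
stale = list(view.endpoints(now=3))   # within the TTL: the cache still lists a
fresh = list(view.endpoints(now=5))   # TTL elapsed: the view catches up
```

Between `now=0` and `now=5` the caller can still pick `a`. A shorter TTL narrows that window at the cost of more lookups.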

Issue: Confusing discovery with load balancing.

Clarification / Fix: Discovery answers which instances exist and are healthy enough to consider. Load balancing decides which eligible instance receives a particular request. Some systems combine them, but the responsibilities are different.
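Keeping the two responsibilities in separate functions makes the distinction concrete. In this sketch, `discover` and `pick_least_loaded` are hypothetical names, and least in-flight requests is just one possible selection policy.

```python
def discover(instances: dict[str, bool]) -> list[str]:
    """Discovery: which instances are healthy enough to consider?"""
    return sorted(name for name, healthy in instances.items() if healthy)


def pick_least_loaded(eligible: list[str], in_flight: dict[str, int]) -> str:
    """Load balancing: which eligible instance gets this request?"""
    if not eligible:
        raise LookupError("no eligible instances")
    return min(eligible, key=lambda name: in_flight.get(name, 0))


health = {"a": True, "b": True, "c": False}
in_flight = {"a": 4, "b": 1, "c": 0}

eligible = discover(health)
chosen = pick_least_loaded(eligible, in_flight)
```

Instance `c` has the lowest load, but it never reaches the balancer because discovery already excluded it.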

Issue: Forgetting drain behavior during deploys.

Clarification / Fix: An instance can be alive while it should stop receiving new traffic. Discovery and routing should respect readiness, termination, and graceful shutdown signals so in-flight work can finish cleanly.
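On the instance side, draining usually means three things: report not ready, refuse new work, and wait for in-flight work to finish. The sketch below is a minimal single-process illustration with a hypothetical `DrainableInstance` class. How readiness is actually reported to the platform is outside its scope.

```python
import threading


class DrainableInstance:
    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def ready(self) -> bool:
        """What a readiness check would see."""
        with self._cond:
            return not self._draining

    def begin_request(self) -> bool:
        """Accept a request unless the instance is draining."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop taking new work; return True once in-flight work has finished."""
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)


instance = DrainableInstance()
accepted = instance.begin_request()              # one request is in flight
finished_early = instance.drain(timeout=0.01)    # times out: work still running
rejected = not instance.begin_request()          # new work is refused while draining
instance.end_request()                           # the in-flight request completes
finished = instance.drain(timeout=0.01)          # nothing left, safe to stop
```

The instance stays alive between the first `drain` call and the last, which is exactly the window in which discovery must already have stopped routing to it.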

Close the lesson and reconstruct one topology change from memory. Name the logical service, the old endpoints, the new endpoint, which endpoints are ready, which are draining, who refreshes the view, and who chooses the final instance. If those roles are unclear, the system will be hard to debug during deploys and failures.

Connections

The previous lesson described coordination semantics across service boundaries: sync, async, query, command, and event. Discovery matters most visibly for synchronous service calls, because a caller must locate a healthy destination before it can ask for an immediate answer.

The next lesson introduces API gateways. A gateway can hide internal topology from external clients, but internal services still need discovery or platform routing to handle dynamic membership behind that edge.

This lesson also connects back to Kubernetes and scheduling. Orchestration deliberately creates topology churn through self-healing, scaling, and rolling replacement. Discovery is what keeps that churn from leaking into every caller.

Resources

[DOC] Consul Service Discovery
- Focus: Review a concrete registry model that combines service identity with health-aware membership.
[DOC] Kubernetes Services
- Focus: See one common infrastructure-managed discovery model for dynamic workloads.
[DOC] Kubernetes EndpointSlices
- Focus: Study how Kubernetes represents the changing endpoint set behind a service.
[BOOK] Building Microservices, 2nd Edition
- Focus: Connect discovery choices to communication style, routing, and operational ownership.

Key Takeaways

Service discovery lets callers depend on logical service identity while concrete endpoints change underneath them.
Discovery is membership plus health; stale or unhealthy instances must stop looking routable soon enough.
Client-side and server-side discovery differ mainly in where topology knowledge, selection, and routing responsibility live.
Rollouts and drains make discovery a lifecycle mechanism, not a one-time lookup.
