Service-to-Service Network Policy
The core idea: Once a platform has several services, reliability depends less on whether calls are possible and more on whether every call carries explicit policy for timeouts, retries, identity, observability, and failure containment.
Core Insight
Imagine the learning platform has just split catalog, enrollment, billing, identity, and notifications into separate services. The previous lesson made the service boundary visible; now that boundary has traffic crossing it. The first problem is simple enough: enrollment needs to call billing, catalog, and notifications. The deeper problem appears a few weeks later. One service client retries for two minutes, another gives up after one second, a third hides failures behind vague errors, and only half the calls emit useful trace context.
That is the moment when service-to-service communication stops being just "make an HTTP request." Each remote call is now a reliability boundary. It needs an explicit timeout, retry posture, identity expectation, telemetry shape, and fallback decision. Without that policy, every team invents different defaults and the platform becomes unpredictable under stress.
This lesson deliberately stops before service mesh adoption. A mesh is one possible implementation later in the track. The mental model here is more basic: internal traffic needs a policy contract whether that policy lives in application code, shared libraries, sidecars, gateways, or platform configuration.
The trade-off is consistency versus local freedom. Teams can tune calls however they want, but inconsistent behavior makes outages harder to contain. Central policy can make the fleet more predictable, but only if it preserves enough context for the domain semantics of each call.
The Call Is a Boundary
A local function call usually fails in ways the process can see directly. A network call fails ambiguously. The request may not arrive. The response may be delayed. The server may finish the work after the client times out. A retry may repeat a side effect. A dependency may be overloaded, and your retry behavior may make it worse.
For enrollment calling billing, the platform should know five things before the first production request crosses the boundary:
- how long enrollment waits before giving up
- whether the operation is safe to retry
- which identity is allowed to call billing
- what trace and log fields connect both sides
- what enrollment does if billing is unavailable
enrollment -> billing
  timeout: 800 ms
  retry: only idempotent reads
  identity: enrollment-service
  trace: propagate request and learner context
  fallback: queue manual review, do not silently enroll
The exact values are less important than the fact that the policy exists. A service boundary without communication policy is an implicit contract waiting to surprise the system. The platform may work on a quiet Tuesday, but the undefined parts show up during overload: callers wait longer than their users can tolerate, retries add load to an already slow dependency, and incident responders cannot tell which request in enrollment corresponds to which failed payment attempt in billing.
The key distinction is that a policy is not just a library setting. It is a statement about behavior under uncertainty. "Retry three times" is not a good policy by itself. The policy needs to say which operations are retryable, how retries are budgeted, how deadlines flow through a request path, and which fallback preserves correctness when the dependency is unavailable.
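One way to make that statement concrete is to write the policy down as data and let a small function decide whether another attempt is permitted. The sketch below is only an illustration: the `CallPolicy` fields, the `Semantics` categories, and the `may_retry` helper are hypothetical names chosen for this lesson, not part of any library.

```python
from dataclasses import dataclass
from enum import Enum


class Semantics(Enum):
    IDEMPOTENT_READ = "idempotent_read"
    KEYED_COMMAND = "keyed_command"    # command that carries an idempotency key
    IRREVERSIBLE = "irreversible"      # side effect that must never be repeated


@dataclass(frozen=True)
class CallPolicy:
    # Hypothetical policy record for one caller -> callee edge.
    caller: str
    callee: str
    timeout_ms: int
    max_attempts: int   # total attempts, including the first one
    fallback: str


def may_retry(policy: CallPolicy, semantics: Semantics,
              attempts_made: int, remaining_ms: float) -> bool:
    """Allow another attempt only if semantics, budget, and deadline all agree."""
    if semantics is Semantics.IRREVERSIBLE:
        return False                      # never repeat an irreversible side effect
    if attempts_made >= policy.max_attempts:
        return False                      # retry budget is spent
    return remaining_ms > 0               # no retry once the caller's deadline is gone


billing_policy = CallPolicy(
    caller="enrollment-service",
    callee="billing",
    timeout_ms=800,
    max_attempts=2,
    fallback="queue manual review",
)
```

The point of the shape is that "retry three times" never appears on its own: the retry count is always evaluated together with the operation's semantics and the time the caller has left.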
Worked Request Path
Trace one checkout attempt. A learner clicks enroll on a paid course. Enrollment has to check course availability, ask billing to authorize payment, and then record the enrollment result. The old monolith may have handled this inside one process. The microservice version has at least one remote call where uncertainty enters.
learner request
  -> enrollment service
    -> billing service
      -> payment provider
Now add a deadline. Suppose the user-facing request has a 2-second budget. Enrollment cannot spend all 2 seconds waiting for billing, because it still needs time to write its own state and answer the user. Billing cannot spend all of enrollment's remaining time waiting for the payment provider, because billing may need to record a pending or failed attempt. A deadline-aware path might look like this:
user request budget: 2000 ms
  enrollment -> billing: 800 ms timeout
  billing -> payment provider: 500 ms timeout
  enrollment response work: 200 ms reserved
The numbers are illustrative; the shape is the lesson. Timeouts need to fit together. If enrollment gives up after 800 ms but billing keeps waiting for the payment provider for 5 seconds, the system can do useless work after the caller has already moved on. If enrollment retries an unsafe payment command without an idempotency key, a timeout can turn into a duplicate charge. If the request lacks trace context, the incident timeline becomes guesswork.
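A minimal sketch of how timeouts can be made to fit together is to carry one absolute deadline through the request and derive each downstream timeout from what is left. The `Deadline` class and its method names below are assumptions made for illustration; real systems often get this from their RPC framework instead.

```python
import time
from typing import Callable


class Deadline:
    """Absolute deadline for one request path, measured on a monotonic clock."""

    def __init__(self, budget_ms: float,
                 clock: Callable[[], float] = time.monotonic):
        self._clock = clock                       # injectable so tests can control time
        self._expires_at = clock() + budget_ms / 1000.0

    def remaining_ms(self) -> float:
        """Milliseconds left before the deadline, never negative."""
        return max(0.0, (self._expires_at - self._clock()) * 1000.0)

    def timeout_for_call(self, cap_ms: float, reserve_ms: float = 0.0) -> float:
        """Timeout for a downstream call: the per-call cap, reduced so that
        reserve_ms of the budget is still left for the caller's own work."""
        return max(0.0, min(cap_ms, self.remaining_ms() - reserve_ms))
```

With the illustrative numbers above, enrollment would create `Deadline(2000)` when the user request arrives and call `timeout_for_call(800, reserve_ms=200)` before contacting billing. Early in the request that yields the full 800 ms cap; if most of the budget is already spent, the call gets a shorter timeout, or zero, instead of outliving the user-facing request.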
A better policy makes the semantics visible:
operation: authorize payment for enrollment
identity: enrollment-service may call billing authorize endpoint
deadline: enrollment waits at most 800 ms
retry: no blind retry; retry only with idempotency key and billing-owned dedupe
trace: propagate request id, learner id class, enrollment id, billing attempt id
fallback: mark enrollment pending_payment; show user a pending state
Notice that the fallback is not a transport detail. It is a product and domain decision. The platform can standardize deadline propagation and tracing, but it cannot decide whether a learner should receive course access after uncertain payment. That belongs to the services that own enrollment and billing semantics.
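The "idempotency key and billing-owned dedupe" line can be sketched as a small lookup that billing performs before it touches the payment provider. Everything here is a simplified assumption: the `BillingDedupe` class, the in-memory dictionary, and the injected `charge` function stand in for whatever durable store and provider client billing really uses, and the sketch ignores concurrent in-flight requests.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BillingDedupe:
    """In-memory stand-in for billing-owned dedupe keyed by idempotency key."""
    charge: Callable[[str, int], str]               # hypothetical payment provider call
    _results: Dict[str, str] = field(default_factory=dict)

    def authorize(self, idempotency_key: str, learner_id: str,
                  amount_cents: int) -> str:
        # A repeated key returns the recorded outcome instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self.charge(learner_id, amount_cents)
        self._results[idempotency_key] = result
        return result
```

If enrollment times out and retries with the same key, billing answers from its record and the learner is charged once. What counts as "the same command" (here, one key per enrollment attempt) stays a billing and enrollment decision, not a transport one.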
Policy Belongs at the Right Layer
Not every concern belongs in the same place. Some behavior is transport-level, some is platform-level, and some is domain-level.
Transport policy includes timeouts, connection limits, TLS posture, retry budgets, and basic routing behavior. Platform policy includes common telemetry, service identity, certificate rotation, and fleet-wide defaults. Domain policy decides whether an enrollment can continue, whether a payment can be retried, or whether a failed notification should block a workflow.
transport/platform:
  timeout, retry budget, identity, telemetry, routing defaults
domain:
  payment authorization, enrollment eligibility, compensation
That split matters because centralizing the wrong policy creates hidden coupling. A platform can standardize a retry budget, but it cannot decide whether charging a card twice is acceptable or whether a learner should be enrolled after billing becomes uncertain. The service that owns the business rule must still own the semantic consequence.
The trade-off is leverage versus meaning. Shared network policy reduces duplicated operational work, but the application still has to express which calls are idempotent, which failures are safe, and which decisions belong to the domain.
The boundary between layers should be explicit:
| Concern | Usually belongs to | Why |
|---|---|---|
| mTLS identity | platform | every service needs consistent caller identity |
| trace propagation | platform/shared libraries | incidents need uniform request evidence |
| retry budget | platform plus service config | consistency matters, but calls differ |
| idempotency key meaning | domain service | only the owner knows what counts as the same command |
| fallback user state | domain workflow | correctness depends on product semantics |
This split prevents two opposite mistakes. One mistake is leaving everything to application teams, which creates a fleet of incompatible defaults. The other is centralizing meaning in the platform, which makes the platform accidentally responsible for business correctness. Good platform policy creates guardrails around transport behavior while leaving semantic ownership with the services.
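One possible shape for such guardrails is a resolver that starts from platform defaults, lets a service tune only the transport knobs it is allowed to tune, and caps those values at a platform ceiling. The keys, default values, and ceilings below are invented for illustration; the sketch deliberately contains no domain fields such as fallback state or idempotency meaning, because those stay with the owning service.

```python
# Hypothetical fleet-wide defaults owned by the platform.
PLATFORM_DEFAULTS = {"timeout_ms": 1000, "max_attempts": 2, "propagate_trace": True}

# Assumption: services may tune these keys; everything else is platform-owned.
SERVICE_TUNABLE = {"timeout_ms", "max_attempts"}

# Assumption: upper bounds a service override may not exceed.
PLATFORM_CEILING = {"timeout_ms": 5000, "max_attempts": 3}


def resolve_transport_policy(service_overrides: dict) -> dict:
    """Merge service overrides into platform defaults, within the guardrails."""
    not_tunable = set(service_overrides) - SERVICE_TUNABLE
    if not_tunable:
        raise ValueError(f"not service-tunable: {sorted(not_tunable)}")
    policy = dict(PLATFORM_DEFAULTS)
    for key, value in service_overrides.items():
        policy[key] = min(value, PLATFORM_CEILING[key])   # clamp to the ceiling
    return policy
```

A call such as `resolve_transport_policy({"timeout_ms": 800})` keeps the platform's retry and tracing defaults while honoring the tighter timeout, and an attempt to switch off trace propagation is rejected rather than silently accepted.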
Failure Modes and Design Checks
Issue: Copying the same retry policy onto every call.
Clarification / Fix: Classify calls by semantics. Idempotent reads, commands with idempotency keys, and irreversible side effects need different retry behavior.
Issue: Treating timeout values as local preferences.
Clarification / Fix: Coordinate timeouts across a request path. A downstream timeout that exceeds the caller's deadline creates wasted work and confusing failure reports.
Issue: Leaving observability as an application afterthought.
Clarification / Fix: Make trace propagation, request identifiers, and dependency labels part of the communication contract from the beginning.
Issue: Setting downstream timeouts longer than upstream deadlines.
Clarification / Fix: Propagate deadlines or compute budgets across the whole request path. A dependency should not keep doing work that the caller can no longer use.
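Where deadlines are configured statically rather than propagated, a simple check over the configured path can catch this before production does. The function below is a hypothetical lint, assuming each hop's timeout should be strictly shorter than its caller's.

```python
from typing import List, Tuple


def find_timeout_violations(path: List[Tuple[str, int]]) -> List[str]:
    """path lists (hop, timeout_ms) from the outermost caller inward.
    A hop is flagged when its timeout is not shorter than its caller's."""
    problems = []
    for (outer, outer_ms), (inner, inner_ms) in zip(path, path[1:]):
        if inner_ms >= outer_ms:
            problems.append(
                f"{inner} ({inner_ms} ms) outlives {outer} ({outer_ms} ms)"
            )
    return problems


checkout_path = [
    ("user request", 2000),
    ("enrollment -> billing", 800),
    ("billing -> payment provider", 500),
]
```

Run against `checkout_path`, the check reports nothing. Change the payment provider timeout to 5000 ms and it flags that hop as outliving the billing call.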
Issue: Treating authentication as enough service identity.
Clarification / Fix: Authentication proves who is calling, but policy must also say which service is allowed to call which endpoint and under what conditions.
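The difference between the two can be sketched as a default-deny lookup that runs after authentication has already established the caller's identity. The rule table and the endpoint names are assumptions for illustration; in practice this often lives in platform configuration rather than application code.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    caller: str
    callee: str
    endpoint: str


# Hypothetical allowlist: only explicitly listed routes are permitted.
ALLOWED: FrozenSet[Rule] = frozenset({
    Rule("enrollment-service", "billing", "authorize"),
})


def is_call_allowed(authenticated_caller: str, callee: str, endpoint: str) -> bool:
    """Default-deny: a verified identity is only allowed on listed routes."""
    return Rule(authenticated_caller, callee, endpoint) in ALLOWED
```

Here `enrollment-service` may call billing's authorize endpoint, but the same verified identity is refused on any other billing endpoint, and a different verified service is refused on authorize.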
Close the lesson and reconstruct the policy for one service call from memory. Name the caller, callee, operation semantics, timeout, retry posture, caller identity, trace fields, and fallback. If you cannot fill one of those boxes, the call has an implicit contract. The next outage will probably discover it for you.
Resources
- [DOC] Google SRE Book: Handling Overload
  - Focus: Connect retries, overload, and client behavior to service reliability.
- [DOC] AWS Builders Library: Timeouts, Retries, and Backoff with Jitter
  - Focus: Study why default retry behavior can amplify failure when it is not designed carefully.
- [ARTICLE] Martin Fowler: Circuit Breaker
  - Focus: Use it as a mental model for containing dependency failure across service calls.
- [DOC] OpenTelemetry Traces
  - Focus: Use it for the trace vocabulary that connects requests across service boundaries.
Key Takeaways
- Every service-to-service call needs explicit policy for timeout, retry, identity, telemetry, and fallback behavior.
- Shared network policy improves consistency, but domain semantics still belong with the service that owns the rule.
- Deadlines, retries, and fallbacks must be designed as one request path, not as isolated client defaults.
- The core trade-off is fleet predictability versus the local context each call needs to stay correct.