Cloud Platform Architecture Capstone

Lesson 016, 45 min, intermediate, capstone

The core idea: A cloud platform design is convincing only when its boundaries, traffic controls, runtime model, and operating signals fit together as one system instead of a pile of fashionable infrastructure components.

Core Insight

For this capstone, imagine the learning platform has grown from one backend into separate services for catalog, enrollment, billing, identity, recommendations, and progress. Teams want independent deployment, safer releases, consistent edge policy, and enough operational control to handle failure without turning every service into a special case.

The design challenge is not "use microservices, Kubernetes, an API gateway, a CDN, serverless, and a mesh." The challenge is to decide which problems each layer owns, which problems stay in application code, and which risks are not worth adding a platform layer for yet.

This is where the track comes together. Service boundaries define ownership. Gateways shape public ingress. Orchestration keeps workloads aligned with declared intent. Meshes and traffic policy standardize internal communication only when the fleet needs that leverage. Edge and serverless placement move selected logic closer to demand without pretending every workflow can live there.

The trade-off is coherence versus platform weight. A good capstone answer should add control surfaces only where they clarify ownership, reduce repeated operational work, or make failure easier to contain. Everything else belongs on a deliberate "not yet" list.

Capstone Scenario

Design the platform architecture for a learning product with these constraints:

Web and mobile clients need a stable public API.
Enrollment, billing, identity, catalog, recommendations, and progress change at different speeds.
The platform must support safe rollout of service changes without exposing every user at once.
Some public content is cacheable at the edge, but learner progress, payment, and enrollment state are authoritative deeper inside the system.
The team is large enough that repeated traffic, security, and telemetry policy are becoming inconsistent across services.

Your output should be a short architecture memo, not a tool shopping list.

Design Walkthrough

Start with boundaries. Identify which services own durable facts and policies. Catalog may own course metadata. Enrollment may own enrollment state. Billing may own payment and subscription rules. Identity may own authentication context. Other services can read projections or call APIs, but they should not casually mutate another service's source of truth.
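
The single-owner rule above can be made concrete with a small lookup. This is a minimal sketch, not a prescribed mechanism: the fact names, the `OWNERS` table, and `may_write` are hypothetical and only mirror the examples in the paragraph.

```python
# Hypothetical ownership map: every durable fact has exactly one owning service.
OWNERS = {
    "course_metadata": "catalog",
    "enrollment_state": "enrollment",
    "payment_and_subscription_rules": "billing",
    "authentication_context": "identity",
}


def may_write(service: str, fact: str) -> bool:
    """Return True only when `service` is the declared owner of `fact`.

    A fact with no declared owner is treated as a design gap, so nobody
    may write it until the ownership map names an owner.
    """
    return OWNERS.get(fact) == service
```

Under these assumptions, `may_write("billing", "payment_and_subscription_rules")` is true, while `may_write("gateway", "enrollment_state")` is false. Reads through projections or APIs are deliberately outside this check.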

Then design the edge. The API gateway should provide one disciplined public boundary: route requests, normalize identity context, apply coarse edge policy, and adapt API shape for clients. It should not become the hidden owner of enrollment, billing, or catalog rules.

Next, place runtime control. Kubernetes or another orchestrator can keep desired workload state aligned with reality. It can manage replicas, health checks, rollout mechanics, and recovery from node or process failure. That does not remove the need for honest readiness signals or compatible service versions.
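
A toy reconciliation pass shows why honest readiness signals matter. It is an illustration of the desired-state idea only, not the Kubernetes API; the `Pod` type and the action names are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    alive: bool  # the process is running
    ready: bool  # readiness as reported by the workload itself


def reconcile(desired_replicas: int, pods: list[Pod]) -> dict:
    """One reconciliation pass: compare declared intent with observed state."""
    alive = [p for p in pods if p.alive]
    return {
        # Dead pods are cleaned up; replacements are counted under "start".
        "clean_up": [p.name for p in pods if not p.alive],
        "start": max(0, desired_replicas - len(alive)),
        "stop": max(0, len(alive) - desired_replicas),
        # Only pods that are alive AND report ready should receive traffic.
        "route_to": [p.name for p in alive if p.ready],
    }
```

The loop trusts the `ready` flag. If a workload reports ready before it can actually serve, the pass above routes traffic to it, which is exactly the failure that orchestration cannot fix on its own.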

Finally, decide whether a mesh is justified. If internal service-to-service policy is already fragmented, a mesh can standardize mTLS, telemetry, retries, and traffic splitting. If the fleet is still small, strong client libraries and simpler platform standards may be enough.

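One acceptable high-level map might look like this. The arrows down to the services follow request flow; the orchestrator and optional mesh sit beneath the services as supporting layers rather than as later hops for a client request: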

web/mobile clients
      |
      v
CDN / edge cache for public content
      |
      v
API gateway: identity context, routing, edge limits
      |
      v
services: catalog, enrollment, billing, identity, recommendations, progress
      |
      v
orchestrator: replicas, health, rollout mechanics
      |
      v
optional mesh: internal mTLS, telemetry, routing policy

The map is not the answer by itself. The answer is the reasoning attached to each boundary: what it owns, what it must not own, what fails when it is wrong, and what evidence would show that it is operating safely.

Worked Architecture Memo

A strong memo might start with the ownership map, because that is the part the infrastructure cannot fix later. In this platform, catalog owns course metadata and publishing state. Enrollment owns whether a learner has joined a course and whether a seat is reserved. Billing owns subscription and payment facts. Identity owns authentication and account identity. Progress owns learner progress events and read models.

That ownership map immediately rules out several tempting shortcuts. The gateway may call enrollment and billing to shape a mobile dashboard, but it should not decide that a learner is entitled to a paid course. Catalog may expose course metadata to the edge cache, but the cache should not become the authority for whether a course is currently enrollable. Recommendations may read progress-derived signals, but it should not directly mutate progress state.

The public edge then becomes a boundary translator, not a domain brain:

clients
  -> CDN for public course pages and static assets
  -> API gateway for authenticated API traffic
  -> domain services for authoritative facts

The CDN is allowed to serve public course descriptions, thumbnails, and documentation because those can be stale for a short period without corrupting business state. It should not serve learner progress, payment status, active enrollment, or seat reservations as if they were static content. Those facts need authoritative reads or carefully designed projections with clear staleness rules.
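
One way to encode the caching rule is an explicit allow-list that fails closed. This is a sketch under assumptions: the content kind names and the 300-second default are hypothetical, while `public`, `max-age`, `private`, and `no-store` are standard `Cache-Control` directives.

```python
# Hypothetical content kinds, taken from the examples in the paragraph above.
SAFE_TO_LAG = {"course_description", "thumbnail", "documentation"}


def cache_control(content_kind: str, max_age_seconds: int = 300) -> str:
    """Return a Cache-Control header value for a response of this kind."""
    if content_kind in SAFE_TO_LAG:
        return f"public, max-age={max_age_seconds}"
    # Fail closed: anything not explicitly listed is treated as authoritative
    # state (progress, payment, enrollment, seats) and is never stored.
    return "private, no-store"
```

The design choice is the default branch. A new response type stays uncached until someone argues that it can lag safely and adds it to the list.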

The gateway handles identity context, request normalization, coarse rate limits, and client-shaped responses. For example, the mobile dashboard can ask the gateway for profile summary, active enrollments, subscription state, and recommendations in one response. The gateway can decide which optional panel degrades when the recommendations service is slow. It should not decide whether enrollment is active or whether billing is paid; those answers come from the owning services.
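
The dashboard composition described above might be sketched as follows. The function name, the `required` and `optional` groupings, and the 200 ms budget are assumptions for illustration; the fetchers stand in for calls to the owning services.

```python
import asyncio
from typing import Any, Awaitable, Callable

Fetcher = Callable[[str], Awaitable[Any]]


async def build_dashboard(
    user_id: str,
    required: dict[str, Fetcher],
    optional: dict[str, Fetcher],
    optional_timeout: float = 0.2,
) -> dict[str, Any]:
    """Compose one client-shaped response from several owning services."""

    async def guarded(fetch: Fetcher) -> Any:
        # Optional panels degrade to None when slow or failing.
        try:
            return await asyncio.wait_for(fetch(user_id), timeout=optional_timeout)
        except Exception:
            return None

    names = list(required) + list(optional)
    calls = [fetch(user_id) for fetch in required.values()]
    calls += [guarded(fetch) for fetch in optional.values()]
    # A failure in a required panel propagates: the gateway does not invent
    # enrollment or billing answers on behalf of the owning service.
    results = await asyncio.gather(*calls)
    return dict(zip(names, results))
```

The gateway decides only how the response degrades. Whether an enrollment is active or a subscription is paid is still whatever the required fetchers return.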

Runtime control belongs in two layers. The orchestrator keeps declared workload state true: how many replicas should run, which version is ready, which pods should stop receiving traffic, and whether a rollout has stalled. The mesh, if adopted, standardizes repeated internal traffic policy: service identity, mTLS, telemetry, retries, and internal traffic splits. If the platform team cannot operate the mesh as part of the production path, the memo should defer it and use simpler client libraries plus platform standards first.

Now attach a risky change to the architecture. Suppose enrollment v2 changes seat reservation logic. A credible release plan does not say "deploy with Kubernetes." It says:

deploy: orchestrator runs enrollment v2 beside v1
expose: gateway or mesh routes beta cohort to v2
observe: conflicts, checkout latency, reservation errors, billing mismatches
promote: 1 percent -> 5 percent -> 25 percent -> 50 percent -> 100 percent
rollback: route traffic back to v1 if signals cross thresholds
compatibility: v1 and v2 must read the same reservation records
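
The promote and rollback steps above can be written as an explicit decision function so the criteria exist before traffic moves. This is a sketch: the signal names follow the "observe" line, but every threshold value is a hypothetical placeholder, not a recommendation.

```python
STEPS = [1, 5, 25, 50, 100]  # traffic percentages from the plan above

# Hypothetical ceilings; real values depend on the service's own baseline.
THRESHOLDS = {
    "reservation_conflict_rate": 0.01,
    "checkout_p95_latency_ms": 800,
    "reservation_error_rate": 0.005,
    "billing_mismatch_count": 0,
}


def next_action(current_percent: int, signals: dict[str, float]) -> tuple[str, int]:
    """Decide the next rollout move for v2 from the observed signals."""
    for name, ceiling in THRESHOLDS.items():
        # A missing signal counts as a breach: no evidence, no promotion.
        if signals.get(name, float("inf")) > ceiling:
            return ("rollback", 0)
    if current_percent == STEPS[-1]:
        return ("hold", current_percent)
    # Assumes current_percent is one of STEPS; anything else raises ValueError.
    return ("promote", STEPS[STEPS.index(current_percent) + 1])
```

With healthy signals at 5 percent the function returns `("promote", 25)`. A single billing mismatch, or an empty signal set, returns `("rollback", 0)`.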

This is where the platform design proves whether it is coherent. If the gateway routes beta users to v2 but an internal mesh rule sends checkout calls back to v1, the release plan is incoherent. If v2 writes records v1 cannot read, traffic rollback is false comfort. If support sees learners charged without seats, infrastructure health is not enough evidence to promote.
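
The first incoherence, where the gateway and the mesh disagree about a cohort, can be checked mechanically. The routing tables below are a hypothetical simplification in which each layer maps a cohort name to a version and may carry a `"default"` entry.

```python
def incoherent_layers(
    cohort: str,
    expected_version: str,
    routes: dict[str, dict[str, str]],
) -> list[str]:
    """List the layers whose routing for `cohort` disagrees with the plan."""
    return [
        layer
        for layer, table in routes.items()
        # Cohorts with no explicit rule fall back to the layer's default route.
        if table.get(cohort, table.get("default")) != expected_version
    ]


routes = {
    "gateway": {"beta": "v2", "default": "v1"},
    "mesh": {"default": "v1"},  # no beta rule: checkout calls land on v1
}
print(incoherent_layers("beta", "v2", routes))  # ['mesh']
```

A check like this covers only routing agreement. Data compatibility and business-level evidence, the other two failures in the paragraph, need their own signals.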

The memo should also include a "not yet" list. Maybe the first version defers a full mesh until there are enough services to justify the operational cost. Maybe it avoids serverless for enrollment because seat reservation is stateful and latency-sensitive, while using edge compute only for lightweight request shaping. Maybe it avoids multi-region active-active writes until the platform has a real consistency and recovery requirement. These omissions are not a weakness. They show that the design can distinguish useful control from fashionable weight.

Readiness Signals

A strong capstone memo should make these signals visible:

Service ownership is explicit enough that enrollment, billing, catalog, identity, and progress do not casually rewrite each other's facts.
Public ingress policy is concentrated at the gateway without turning the gateway into a domain monolith.
Edge caching is limited to content that can lag safely; learner progress, enrollment, and payment state stay authoritative deeper inside the system.
Progressive delivery has promotion and rollback criteria before traffic moves.
The mesh is either justified by repeated internal traffic policy or intentionally deferred.
The design names what each platform layer must not own, not only what it is allowed to do.

These signals matter because platform architecture is not measured by how many named components appear in the diagram. It is measured by whether teams can change, operate, and debug the system without losing ownership boundaries.

Common Capstone Failures

Issue: Drawing a correct-looking component map without authority boundaries.

Clarification / Fix: Name the service that owns each durable fact and policy. A diagram with boxes is weak if payment state, enrollment state, catalog publishing, and progress can be rewritten from several places.

Issue: Treating the gateway as a convenient place for every cross-service decision.

Clarification / Fix: Let the gateway shape public API responses and enforce edge policy, but keep final resource decisions in the services that own the facts.

Issue: Adding a mesh because it appears in mature platform diagrams.

Clarification / Fix: Justify the mesh only if internal traffic policy is already a repeated fleet-wide problem and the platform team can debug the data plane and control plane under incident pressure.

Issue: Saying rollback is possible without naming the compatibility boundary.

Clarification / Fix: A rollback plan should say which data formats, APIs, and side effects remain safe when traffic returns to the old version.
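
A compatibility boundary for data formats can be tested before rollout. The sketch below assumes reservation records are stored as JSON and that the old version understands only two status values; the field names and statuses are hypothetical.

```python
import json

V1_KNOWN_STATUSES = {"reserved", "released"}  # hypothetical v1 vocabulary


def v1_can_read(raw_record: str) -> bool:
    """Return True if the v1 reader could safely interpret this stored record."""
    try:
        record = json.loads(raw_record)
        # v1 needs these fields; extra fields written by v2 are ignored.
        _ = record["learner_id"], record["course_id"]
        return record["status"] in V1_KNOWN_STATUSES
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-object payload.
        return False
```

Running a check like this against records written by the new version shows whether routing traffic back to the old version is a real rollback. An added optional field passes; a new status such as `"waitlisted"` does not.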

Close the track by sketching the architecture from memory. Put one verb next to each layer: CDN caches, gateway normalizes, services own, orchestrator reconciles, mesh standardizes, delivery policy promotes or rolls back. If two layers have the same verb, decide whether one is accidental duplication.

Review Checklist

Use these questions to test the design:

Which service owns each important business fact?
Which traffic is north-south at the public edge, and which is east-west inside the platform?
Which logic belongs at the gateway, which belongs in services, and which can safely move to the edge?
What evidence promotes a rollout from 5 percent traffic to 50 percent and then to 100 percent?
What fails if the mesh, gateway, CDN, or orchestrator is misconfigured?
Which parts of the design are intentionally deferred because the operational weight is not yet justified?

Expected Deliverable

Write a one-page architecture memo with:

A component map for clients, gateway, services, edge/cache, orchestrator, and optional mesh.
Three explicit ownership boundaries and why they belong there.
A progressive delivery plan for one risky service change.
A failure-mode section that names the most dangerous coupling or control-plane risk.
A short "not yet" list of platform features that would be premature.

Resources

[ARTICLE] Microservices
- Focus: Revisit service boundaries, team ownership, and the cost of distributed systems.
[DOC] Kubernetes Concepts
- Focus: Connect desired state, workload control, services, and runtime operations.
[DOC] Envoy Service Mesh Architecture
- Focus: Use the data-plane perspective to reason about what a mesh adds and what it complicates.
[PATTERN] API Gateway
- Focus: Compare client-facing edge composition with internal service ownership.

Key Takeaways

A good cloud platform design assigns clear ownership to services, gateway, orchestrator, mesh, and edge layers.
Runtime control is valuable only when connected to honest signals, rollback plans, and explicit policy boundaries.
A credible capstone answer shows one risky change moving through deployment, exposure, observation, promotion, and rollback.
The capstone answer should defend trade-offs, including the platform features deliberately left out.
Architecture maturity shows up as clear "not here" decisions as much as clear component choices.
