Day 095: Monitoring Distributed Systems

Monitoring is useful only when it tells the team whether the system is still keeping its promises, whether it is drifting toward failure, and what action should come next before users absorb the full impact.


Today's "Aha!" Moment

Many teams start monitoring from the tooling: "what metrics can we collect?" The more useful starting point is different: "what do we need to know before customers tell us something is wrong?"

Keep one example throughout the lesson. The learning platform handles purchases, enrollments, reminders, and background event processing. At 10:00 AM everything looks healthy. At 10:20, p95 checkout latency starts climbing. At 10:25, queue age for notification workers grows fast. At 10:27, database connection usage is near saturation. At 10:30, support begins hearing that confirmations are delayed.

That is the aha. Monitoring is not a wall of graphs. It is an early warning system for degraded promises. Good monitoring tells you not only that the system is "up," but whether user-facing flows are slowing down, whether pressure is building inside queues or dependencies, and whether the team should scale, shed load, page someone, or investigate a bottleneck.

Once you see monitoring that way, a lot of bad habits become obvious. CPU and memory are not enough. Collecting every metric is not strategy. A dashboard with no decision path is decoration. Monitoring earns its keep only when the signals map to customer impact and operational action.


Why This Matters

The problem: Distributed systems rarely fail in one dramatic instant. They often degrade gradually through latency growth, backlog accumulation, dependency saturation, and partial failure. Without the right signals, teams notice too late or respond too slowly.

Before: The team watches CPU, memory, and uptime. Checkout can degrade for twenty minutes before the first support ticket arrives, and nobody knows which action to take.

After: The team watches the promises users experience plus the pressure building behind them, and can scale, shed load, or investigate while the degradation is still small.

Real-world impact: Faster detection, better capacity planning, more meaningful alerts, and a much stronger chance of intervening before small degradation becomes a visible outage.


Learning Objectives

By the end of this session, you will be able to:

  1. Explain what good distributed monitoring measures first - Connect it to promises made to users and critical workflows.
  2. Distinguish behavior metrics from pressure metrics - Understand why both are needed to operate a distributed system well.
  3. Connect metrics to decisions - Recognize which signals support alerting, scaling, diagnosis, and planning.

Core Concepts Explained

Concept 1: Start with User-Facing Behavior, Not Just Machine Health

The first monitoring question should be: are the important workflows behaving acceptably for users right now?

For the learning platform, that might mean watching:

  • checkout success rate and p95 checkout latency
  • enrollment completion rate
  • how long notification confirmations sit undelivered
  • error rates on the purchase and enrollment APIs

Those signals matter because they express what the product is actually delivering. A service can have low CPU and still be failing users because an external dependency is timing out or because a queue is backing up.

user promise:
  "checkout should succeed quickly"

measure:
  rate
  error rate
  latency
  workflow success
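As one way to make that mapping concrete, here is a minimal sketch that turns raw request records into the behavior signals above. The record shape and the nearest-rank p95 calculation are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Request:
    ok: bool           # did the workflow succeed for the user?
    latency_ms: float  # end-to-end latency as the user experienced it

def workflow_health(requests: list, window_s: float) -> dict:
    """Summarize one workflow's behavior: rate, error rate, tail latency."""
    if not requests:
        return {"rate": 0.0, "error_rate": 0.0, "p95_ms": 0.0}
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank p95: the latency that 95% of requests stay under.
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate": len(requests) / window_s,      # throughput (requests/second)
        "error_rate": errors / len(requests),  # fraction of failed requests
        "p95_ms": p95,                         # tail latency in milliseconds
    }
```

A real pipeline would compute these over sliding windows, per workflow; the point is that every number maps back to a promise.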

This is why host metrics alone are incomplete. They tell you something about resource state, but not whether the customer-facing path is working. Monitoring should begin at the promise boundary and then drill inward.

The trade-off is more product-aware instrumentation versus the simplicity of machine-only monitoring. The extra work is worth it because users experience workflows, not servers.

Concept 2: Saturation and Backlog Signals Show Trouble Before Full Failure

Many incidents start as pressure, not as a crash.

The checkout flow may still succeed, but more slowly, because:

  • a worker queue is aging faster than it drains
  • the database connection pool is close to its limit
  • a downstream dependency is responding more slowly than usual

These are saturation signals. They tell you a constrained part of the system is approaching its limit.

requests arrive
      |
      v
service latency rises
      |
      v
queue depth grows
      |
      v
dependency saturation appears

This is why good monitoring needs both behavior and pressure. Behavior tells you the symptom. Saturation tells you what is likely to break next if load or degradation continues.

def important_metrics(rate, error_rate, p95_ms, queue_age_s, db_pool_in_use):
    """Bundle behavior signals with pressure signals in one snapshot."""
    return {
        "rate": rate,                      # behavior: requests per second
        "error_rate": error_rate,          # behavior: fraction of requests failing
        "p95_ms": p95_ms,                  # behavior: tail latency
        "queue_age_s": queue_age_s,        # pressure: age of the oldest queued work
        "db_pool_in_use": db_pool_in_use,  # pressure: connection-pool saturation
    }

The code only illustrates the mix. Good monitoring rarely relies on one class of metrics alone.

The trade-off is broader signal design versus simpler dashboards. But without saturation signals, teams often learn about impending failure only when the failure is already customer-visible.
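As a sketch of how pressure signals can drive an early warning before behavior visibly breaks, here is a minimal check. The thresholds are illustrative assumptions; real limits come from load testing and each system's tolerance for backlog:

```python
def saturation_warnings(queue_age_s, pool_in_use, pool_size,
                        max_queue_age_s=60.0, pool_utilization_limit=0.8):
    """Return early warnings for pressure that may soon degrade behavior.

    Thresholds are illustrative, not recommended defaults.
    """
    warnings = []
    if queue_age_s > max_queue_age_s:
        # Backlog is aging faster than it drains.
        warnings.append("queue backlog rising: scale workers or shed input")
    if pool_in_use / pool_size > pool_utilization_limit:
        # Connection pool is approaching its limit.
        warnings.append("db pool near saturation: tune pool or reduce concurrency")
    return warnings
```

Run against the 10:25 snapshot from the opening scenario, a check like this would have fired minutes before support heard about delayed confirmations.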

Concept 3: Metrics Matter Only If They Lead to Decisions

A metric is useful when the team knows what it means and what it should trigger.

For example:

signal -> interpretation -> likely action

high latency -> path degraded -> inspect trace / dependency
high queue age -> backlog rising -> scale workers / reduce input
high db pool usage -> saturation risk -> tune pool / reduce concurrency
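The signal-to-action table could be encoded as a small triage helper. Thresholds, metric names, and action strings below are illustrative assumptions:

```python
# Hypothetical thresholds; real values are set per workflow from SLOs and load tests.
TRIAGE = [
    ("p95_ms",       500.0, "path degraded: inspect trace / dependency"),
    ("queue_age_s",   60.0, "backlog rising: scale workers / reduce input"),
    ("db_pool_util",   0.8, "saturation risk: tune pool / reduce concurrency"),
]

def triage(metrics):
    """Map observed signals to likely operator actions, as in the table above."""
    return [action for name, limit, action in TRIAGE
            if metrics.get(name, 0.0) > limit]
```

The value is not the code itself but the discipline it forces: every signal collected has a stated interpretation and a stated action.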

This is where many dashboards fail. They show numbers but do not support a decision. Monitoring should be designed around the questions operators ask during incidents and during planning:

  • Which user promise is degraded right now, and how badly?
  • What is likely to break next if load or degradation continues?
  • Which action reduces impact fastest: scaling, shedding load, paging, or investigating?

The trade-off is less "collect everything" freedom versus more disciplined monitoring that operators can actually act on. The discipline is the point.

Troubleshooting

Issue: Collecting many metrics with no operational purpose.

Why it happens / is confusing: Monitoring systems make collection easy, so teams delay the harder design question of what they need to know.

Clarification / Fix: Start from critical user flows and likely operator decisions. Add metrics because they answer real questions.

Issue: Focusing only on CPU, memory, and instance uptime.

Why it happens / is confusing: Host metrics are familiar and easy to instrument.

Clarification / Fix: Keep infrastructure metrics, but pair them with workflow behavior and saturation signals so you can see both impact and cause.

Issue: Building dashboards that nobody can act on.

Why it happens / is confusing: Visual completeness can feel like operational maturity.

Clarification / Fix: For each major metric, ask what action it informs. If the answer is unclear, the metric or the dashboard likely needs redesign.


Advanced Connections

Connection 1: Monitoring ↔ Alerting and SLOs

The parallel: Monitoring becomes governance once the team uses those signals to define alert thresholds and service-level expectations.

Real-world case: Error rate and tail latency only become operationally powerful when they map to promises the team cares enough to defend.

Connection 2: Monitoring ↔ Capacity Planning

The parallel: Historical pressure signals reveal which bottleneck is likely to saturate first as traffic grows.

Real-world case: Queue age, connection-pool usage, and tail-latency trends often predict the next scaling problem more honestly than average CPU does.
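One crude way to use such trends for planning is straight-line extrapolation of a pressure metric toward its limit. This is an illustrative sketch, not a forecasting model:

```python
from typing import Optional

def hours_until_limit(samples, interval_h, limit) -> Optional[float]:
    """Linearly extrapolate a pressure metric (e.g. queue age) to estimate
    hours until it crosses a limit. Returns None if it is not growing."""
    if len(samples) < 2:
        return None
    # Average growth per hour across the sampled window.
    slope = (samples[-1] - samples[0]) / ((len(samples) - 1) * interval_h)
    if slope <= 0:
        return None  # flat or shrinking: no projected saturation
    return (limit - samples[-1]) / slope
```

Even this rough estimate answers a planning question average CPU cannot: which bottleneck saturates first, and roughly when.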



Key Insights

  1. Monitoring should begin with the promises users experience - Workflow behavior matters more than machine liveness alone.
  2. Saturation signals expose failure while it is still forming - Queue age, pool pressure, and backlog growth often surface before an outright outage.
  3. Metrics are valuable only when they support action - Good monitoring helps the team decide what to inspect, scale, shed, or fix next.

Knowledge Check (Test Questions)

  1. Why are rate, error rate, and latency such common starting metrics for a service?

    • A) Because together they describe how the service is behaving from the user's point of view.
    • B) Because infrastructure metrics are never useful.
    • C) Because they are the only metrics a distributed system can collect.
  2. What does a saturation metric usually tell you?

    • A) That a shared resource, queue, or dependency is approaching a limit that may soon degrade the system.
    • B) The exact line of source code that caused the issue.
    • C) Whether the service was deployed in one region or several.
  3. Why is a dashboard by itself not enough?

    • A) Because monitoring is only useful when signals support alerting, diagnosis, scaling, or planning decisions.
    • B) Because dashboards should never be used in operations.
    • C) Because latency does not matter if uptime is good.

Answers

1. A: Rate, errors, and latency are a compact way to measure whether the service is handling work, failing requests, or slowing down for users.

2. A: Saturation metrics expose pressure building in constrained parts of the system before that pressure becomes a larger incident.

3. A: A graph is only operationally meaningful if the team knows what decision it should drive.


