Day 175: Real-Time Anomaly Detection - ML-Powered Alerts
Anomaly detection is most useful when it helps the team notice strange behavior early, not when it pretends to replace explicit reliability goals and deterministic alerting.
Today's "Aha!" Moment
Traditional alerts are great when you know what “bad” looks like in advance. If the checkout error rate goes above a threshold, or p99 latency burns through the SLO error budget too quickly, the system should page. But some failures do not start as clean threshold breaches. They begin as patterns that simply look “off”:
- A traffic shape changes unexpectedly for one region.
- One dependency grows slower than usual only at certain hours.
- Retry behavior looks subtly different from the baseline.
- One customer segment suddenly behaves unlike its own history.
This is the niche anomaly detection tries to fill. Instead of asking “did metric X cross static line Y?”, it asks “is this behavior unusual relative to what the system normally does?”
That sounds magical until you see the catch: normal behavior is not fixed. It changes with time of day, day of week, seasonality, product launches, and user mix. So anomaly detection is not about finding “weird numbers” in isolation. It is about building a model of expected behavior and deciding when deviation from that model is worth human attention.
That is the aha. ML-powered alerts are strongest when they augment observability with pattern sensitivity, not when they try to replace clear SLOs, explicit thresholds, and human judgment.
Why This Matters
Suppose the warehouse company sees occasional checkout regressions after a rollout. Static threshold alerts are useful for major incidents, but smaller early signals keep slipping through:
- Payment latency is higher than normal for one provider, but not high enough to breach the global SLO yet.
- Retry volume rises only during one seasonal traffic window.
- One region’s request mix looks unlike its own history, but fleet averages stay normal.
These are exactly the kinds of problems anomaly detection can surface earlier than simple rules.
But the other half of the story matters too. If anomaly detection is poorly tuned, the team gets:
- false positives
- alerts during expected seasonal changes
- “interesting” signals with no clear action
- models that drift until unusual behavior becomes normalized
So the question is not whether anomaly detection is smart. The question is whether it helps humans and systems notice the right deviations without creating another source of alert fatigue.
Learning Objectives
By the end of this session, you will be able to:
- Explain what anomaly detection is good for - Distinguish unusual-pattern detection from deterministic SLO or threshold alerting.
- Reason about how anomaly detection works operationally - Understand baselines, seasonality, scoring, and feedback loops at a high level.
- Use anomaly alerts responsibly - Recognize when they should inform investigation versus when they should drive paging directly.
Core Concepts Explained
Concept 1: Anomaly Detection Tries to Model “Normal,” Not Just “Bad”
A static threshold says, “If this number goes above 200 ms, alert.” An anomaly detector says, “Given this time, context, and recent history, this behavior looks unusually different from expectation.”
That difference matters when “normal” itself moves.
For example, checkout traffic may legitimately spike every weekday morning in Europe, while payment retries normally rise slightly on weekend evenings in another region. A static threshold either fires too often or misses the early drift. An anomaly model can try to learn those patterns and flag deviations relative to them.
At a high level, the system usually needs to account for:
- trend
- seasonality
- recent history
- noise level
- context such as route, region, provider, or tenant segment
This is why anomaly detection is not merely a different threshold. It is a shift from absolute rules to baseline-aware deviation detection.
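To make the shift concrete, here is a minimal Python sketch contrasting the two styles of rule. The latency values, window contents, and thresholds are illustrative assumptions, not numbers from this text: the static rule checks the raw value against a fixed line, while the baseline-aware rule scores the deviation against values observed in comparable historical windows.

```python
from statistics import mean, stdev

def static_alert(value_ms, limit_ms=200):
    """Static rule: fire whenever the raw value crosses a fixed line."""
    return value_ms > limit_ms

def baseline_alert(value_ms, comparable_history, z_limit=3.0):
    """Baseline-aware rule: score deviation against comparable windows
    (e.g. the same hour-of-week slice), not against a fixed line."""
    mu = mean(comparable_history)
    sigma = stdev(comparable_history) or 1e-9  # avoid divide-by-zero on flat history
    return (value_ms - mu) / sigma > z_limit

# Hypothetical checkout latencies from the same weekday-morning
# window over recent weeks.
history = [180, 175, 190, 185, 178, 182]
print(static_alert(188))             # False: under the fixed line
print(baseline_alert(188, history))  # False: normal for this window
print(baseline_alert(230, history))  # True: unusual relative to its own baseline
```

The interesting case is the opposite one: a value like 230 ms never crosses a naive "page at 500 ms" line, yet it is far outside this slice's own history, which is exactly the early drift the text describes.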
Concept 2: Operational Anomaly Detection Is a Pipeline, Not a Single Model
In production, anomaly detection is less about one algorithm name and more about a working pipeline:
metric stream
|
v
baseline / expected pattern
|
v
deviation score
|
v
alert policy
|
+--> investigate
+--> correlate with rollout / region / provider
+--> escalate if user-facing risk is confirmed
The important operational pieces are often:
- which signals are fed into the model
- how baseline windows are chosen
- whether seasonality is represented
- how sensitive the scoring threshold is
- how the output is combined with other evidence
For the warehouse company, a useful anomaly detector might look at checkout latency for each payment provider by region and compare current behavior against the recent seasonal baseline. It may flag one provider-region pair as unusual long before global SLO burn becomes obvious.
But that pipeline only works if the system treats anomaly alerts as evidence with context, not as disembodied truth.
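The pipeline above can be sketched as a few small Python functions. All names, signals, and thresholds here are hypothetical; a real system would read from a metrics store, model seasonality properly, and carry far more context:

```python
from statistics import mean, stdev

def expected_pattern(history):
    """Baseline stage: summarize comparable historical windows
    (e.g. one provider-region slice at the same hour of week)."""
    return mean(history), max(stdev(history), 1e-9)

def deviation_score(value, baseline):
    """Scoring stage: how many noise units away from expectation?"""
    mu, sigma = baseline
    return abs(value - mu) / sigma

def alert_policy(score, slice_name, investigate_at=3.0, escalate_at=6.0):
    """Policy stage: route the score, don't treat it as truth.
    'escalate' stands in for 'correlate with rollout/region/provider
    and confirm user-facing risk' before anyone is paged."""
    if score >= escalate_at:
        return f"escalate: {slice_name}"
    if score >= investigate_at:
        return f"investigate: {slice_name}"
    return "ok"

# Hypothetical checkout latencies for one provider-region slice.
baseline = expected_pattern([120, 118, 125, 122, 119, 121])
print(alert_policy(deviation_score(124, baseline), "provider-a/eu-west"))  # ok
print(alert_policy(deviation_score(150, baseline), "provider-a/eu-west"))  # escalate path
```

Note that the policy stage is deliberately separate from the scoring stage: tuning sensitivity and deciding what a score is allowed to trigger are different operational decisions.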
Concept 3: The Biggest Trade-off Is Sensitivity versus Trust
Anomaly detection is powerful precisely because it can notice subtle deviations. That is also why it can become noisy and distrusted.
The main trade-offs are:
- sensitive model -> catches more weak signals, but risks false positives
- conservative model -> pages less, but misses weak anomalies
- more context dimensions -> better localization, but more model and observability complexity
- more automation -> faster reaction, but more risk if the model is wrong
This is why the healthiest operational pattern is usually:
- deterministic alerts for known user-facing failure conditions
- anomaly alerts for investigation, triage, and early warning
- careful promotion of only the most trusted anomaly classes into paging or automated response
For example, a clear checkout SLO burn alert should still wake the on-call team. A “latency anomaly detected in one provider-region pair” may be better as a high-priority investigate signal unless the organization has strong evidence that this anomaly class reliably predicts user harm.
The real product of anomaly detection is not “AI operations.” It is improved attention allocation: help the team notice unusual behavior sooner without teaching them to ignore alerts.
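One way to encode the "careful promotion" idea is an explicit routing table: deterministic SLO-burn alerts always page, while an anomaly class pages only after the organization has deliberately promoted it. The class names and the promoted set below are illustrative assumptions:

```python
# Anomaly classes the organization has validated as reliable
# predictors of user harm; everything else stays investigate-only.
PROMOTED_ANOMALY_CLASSES = {"checkout-error-spike"}

def route(alert):
    """Route an alert dict like {'kind': ..., 'anomaly_class': ...}."""
    if alert["kind"] == "slo_burn":
        return "page"  # deterministic, user-facing: always wake on-call
    if alert["kind"] == "anomaly":
        if alert.get("anomaly_class") in PROMOTED_ANOMALY_CLASSES:
            return "page"  # explicitly promoted anomaly class
        return "investigate"  # high-priority signal, not a page
    return "log"

print(route({"kind": "slo_burn"}))                                          # page
print(route({"kind": "anomaly", "anomaly_class": "latency-drift"}))         # investigate
print(route({"kind": "anomaly", "anomaly_class": "checkout-error-spike"}))  # page
```

Making the promoted set an explicit, reviewable artifact keeps the trust decision with humans instead of burying it in model configuration.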
Troubleshooting
Issue: The anomaly detector fires constantly during normal business cycles.
Why it happens / is confusing: The baseline does not model seasonality or expected periodic variation well enough.
Clarification / Fix: Rework the baseline to account for time-based patterns and evaluate anomalies relative to comparable historical windows.
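One common way to "evaluate anomalies relative to comparable historical windows" is to bucket history by hour-of-week before computing any baseline, so a Monday 09:00 value is only compared with past Monday 09:00 values. This is a sketch; the bucketing scheme and data shapes are illustrative assumptions:

```python
from collections import defaultdict
from datetime import datetime, timezone

def hour_of_week(ts):
    """Bucket key 0..167: weekday * 24 + hour, so each hour of the
    week is judged only against its own history."""
    return ts.weekday() * 24 + ts.hour

def build_buckets(samples):
    """samples: iterable of (datetime, value) pairs.
    Returns per-bucket value history for baseline computation."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[hour_of_week(ts)].append(value)
    return buckets

# 2024-01-01 was a Monday; 09:00 lands in bucket 9, so the normal
# weekday-morning traffic spike no longer looks anomalous.
ts = datetime(2024, 1, 1, 9, tzinfo=timezone.utc)
print(hour_of_week(ts))  # 9
```

More sophisticated seasonality models exist, but even this coarse bucketing removes the most common source of "fires every business morning" noise.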
Issue: The detector is quiet even though engineers can see strange behavior manually.
Why it happens / is confusing: Sensitivity may be too low, the wrong signals are being modeled, or the baseline has adapted too quickly and normalized the drift.
Clarification / Fix: Review feature choice, scoring thresholds, and adaptation windows. Quiet models can fail as badly as noisy ones.
Issue: Teams start treating anomaly alerts as smart enough to replace SLO alerts.
Why it happens / is confusing: ML-driven systems feel sophisticated, so they get trusted beyond what their evidence quality justifies.
Clarification / Fix: Keep anomaly detection as an augmentation layer unless the organization has strong proof that a given anomaly class is reliable enough for direct paging.
Advanced Connections
Connection 1: Anomaly Detection <-> Tail-Based Sampling
The parallel: Anomaly signals can help decide which rare or suspicious traces deserve to be retained for deeper investigation.
Real-world case: A latency anomaly on one provider-region path may justify retaining more traces from that slice during the incident window.
Connection 2: Anomaly Detection <-> Cost Optimization
The parallel: Richer models and more dimensions can improve signal quality, but they also increase data volume, compute cost, and operational complexity.
Real-world case: Modeling every tenant, route, and provider separately may improve localization while making the observability and ML pipeline much more expensive.
Resources
Optional Deepening Resources
- [DOCS] OpenSearch Anomaly Detection
- Link: https://docs.opensearch.org/latest/observing-your-data/ad/index/
- Focus: Use it as a practical reference for real-time anomaly detection in an observability stack.
- [DOCS] Elastic Anomaly Detection
- Link: https://www.elastic.co/guide/en/machine-learning/current/ml-ad-overview.html
- Focus: See how anomaly detection is framed operationally around baselines, unusual patterns, and explainable signals.
- [DOCS] Azure Monitor Dynamic Thresholds
- Link: https://learn.microsoft.com/en-us/azure/azure-monitor/alerts/alerts-dynamic-thresholds
- Focus: Use it to understand baseline-aware alerting and how seasonality or normal variation affect thresholding.
- [SITE] Google SRE Book
- Link: https://sre.google/sre-book/table-of-contents/
- Focus: Keep anomaly detection in the broader reliability context: explicit user-facing goals still matter most.
Key Insights
- Anomaly detection models unusual behavior, not just static failure - It helps surface patterns that simple thresholds may miss.
- The whole pipeline matters more than the algorithm name - Baselines, context, scoring, and alert policy determine whether the system is useful.
- Trust is the limiting factor - A noisy or opaque anomaly system quickly teaches teams to ignore it, so it should usually complement deterministic alerts rather than replace them.
Knowledge Check (Test Questions)
1. What makes anomaly detection different from a simple static threshold?
- A) It never uses historical data.
- B) It judges current behavior relative to an expected baseline instead of only against a fixed line.
- C) It guarantees zero false positives.
2. Why should many anomaly alerts start as investigate signals rather than pager alerts?
- A) Because anomalies are always unimportant.
- B) Because they often need context and validation before the organization can trust them as direct user-harm indicators.
- C) Because ML cannot run in real time.
3. What is a common cause of noisy anomaly alerts?
- A) Modeling normal seasonal variation poorly.
- B) Having too many engineers.
- C) Using any metrics at all.
Answers
1. B: The central idea is comparing current behavior with expected behavior, not merely checking a fixed threshold.
2. B: Many anomaly outputs are useful early warnings, but they are not automatically reliable enough to page without additional context.
3. A: If the baseline does not handle normal periodic patterns well, ordinary behavior starts to look anomalous.