Alert on Percentiles, Not Averages

Problem

Our average latency alerts never fired, but users were complaining about slow responses. The dashboard showed 45ms average — well within thresholds.

One 10-second request hidden in 1000 fast ones is invisible to mean-based alerts. The average was 45ms, but P99 was 3.2 seconds. 1% of users were having a terrible experience, and our monitoring couldn't see it.

Switched to P95/P99 alerting:

# Datadog monitor
query: "p99:trace.fastapi.request{service:loan-api} > 500"
threshold: 500ms

Takeaway

Averages smooth over outliers. Alert on P95 for early warnings, P99 for real user impact. The slowest 5% of requests are what users actually experience — percentiles surface that where averages don't.

Alert on Percentiles, Not Averages

Problem

Key Insight

Takeaway

Alert on Percentiles, Not Averages

Problem

Key Insight

Takeaway