Problem
Our average latency alerts never fired, but users were complaining about slow responses. The dashboard showed 45ms average — well within thresholds.
Key Insight
One 10-second request hidden in 1000 fast ones is invisible to mean-based alerts. The average was 45ms, but P99 was 3.2 seconds. 1% of users were having a terrible experience, and our monitoring couldn't see it.
Switched to P95/P99 alerting:
# Datadog monitor
query: "p99:trace.fastapi.request{service:loan-api} > 500"
threshold: 500msTakeaway
Averages lie. Alert on P95 for early warnings, P99 for real user impact. The 5% of requests that matter most are the ones averages hide.