Skip to content
All Notes

Alert on Percentiles, Not Averages

Problem

Our average latency alerts never fired, but users were complaining about slow responses. The dashboard showed 45ms average — well within thresholds.

Key Insight

One 10-second request hidden in 1000 fast ones is invisible to mean-based alerts. The average was 45ms, but P99 was 3.2 seconds. 1% of users were having a terrible experience, and our monitoring couldn't see it.

Switched to P95/P99 alerting:

# Datadog monitor
query: "p99:trace.fastapi.request{service:loan-api} > 500"
threshold: 500ms

Takeaway

Averages lie. Alert on P95 for early warnings, P99 for real user impact. The 5% of requests that matter most are the ones averages hide.