## Problem
We used the same /health endpoint for both liveness and readiness probes. Under load, the health check included a database ping that timed out. Kubernetes marked pods as unhealthy and restarted them — causing cascading restarts across the cluster.
## Key Insight
Liveness and readiness serve fundamentally different purposes:
| Probe | Controls | Failure Action |
|---|---|---|
| Liveness | Is the process alive? | Restart the pod |
| Readiness | Can it serve traffic? | Stop routing traffic to it |
Separate endpoints with different thresholds:
```yaml
livenessProbe:
  httpGet:
    path: /health/live    # Just: is the process running?
    port: 8000
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready   # Can it serve? (DB, deps OK)
    port: 8000
  failureThreshold: 1     # Stop traffic immediately
```
## Takeaway
Conflating liveness and readiness cost us a full cluster restart under peak load. A database timeout triggered the liveness probe, Kubernetes restarted every pod, and traffic dropped to zero — from a health check misconfiguration, not a real outage. Separate endpoints with separate failure thresholds prevent one slow dependency from taking down the entire service.