## Problem
We used the same /health endpoint for both liveness and readiness probes. Under load, the health check included a database ping that timed out. Kubernetes marked pods as unhealthy and restarted them — causing cascading restarts across the cluster.
## Key Insight
Liveness and readiness serve fundamentally different purposes:
| Probe | Controls | Failure Action |
|---|---|---|
| Liveness | Is the process alive? | Restart the pod |
| Readiness | Can it serve traffic? | Stop routing traffic to it |
Separate endpoints with different thresholds:
```yaml
livenessProbe:
  httpGet:
    path: /health/live    # Just: is the process running?
    port: 8000
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready   # Can it serve? (DB, deps OK)
    port: 8000
  failureThreshold: 1     # Stop traffic immediately
```
## Takeaway
Conflating liveness and readiness cost us a full cluster restart under peak load. A database timeout triggered the liveness probe, Kubernetes restarted every pod, and traffic dropped to zero — from a health check misconfiguration, not a real outage. Separate endpoints with separate failure thresholds prevent one slow dependency from taking down the entire service.