The Default That Fails
Most Kubernetes HPA configurations start with CPU:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
This works for compute-heavy batch jobs. For API services, it's almost always the wrong signal.
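For reference, the math behind this default is worth seeing concretely. A minimal sketch of the resource-metric calculation (the helper function is mine, for illustration; it is not part of any Kubernetes API): utilization is averaged across pods as a percentage of each pod's CPU request, then compared against the target.

```python
import math

def cpu_desired_replicas(pod_usage_millicores, request_millicores, target_pct, current_replicas):
    # Average utilization across pods, as a percentage of the CPU request
    avg_util = 100 * sum(pod_usage_millicores) / (request_millicores * len(pod_usage_millicores))
    # Core HPA formula: desired = ceil(current * currentMetric / targetMetric)
    return math.ceil(current_replicas * avg_util / target_pct)

# 4 pods each requesting 500m CPU, currently using [450, 400, 500, 350] millicores:
# average utilization is 85% against the 70% target, so the HPA scales to 5 pods.
print(cpu_desired_replicas([450, 400, 500, 350], 500, 70, 4))  # 5
```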
Why CPU Lies for API Services
API services spend most of their time waiting — for database responses, upstream API calls, cache lookups. An API pod at 15% CPU utilization might be processing 500 concurrent requests, all waiting on I/O. CPU is low, but the pod is saturated.
The reverse is also true: a pod doing connection setup or TLS handshakes might spike to 80% CPU briefly, triggering a scale-up when no additional capacity is needed.
```
Scenario: 500 concurrent requests, all waiting on DB
├── CPU utilization: 15%     ← HPA says "fine, no scaling needed"
├── P99 latency: 3.2 seconds ← Users say "this is broken"
└── Connection pool: 100%    ← Pod is actually saturated
```
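The gap between the two signals is easy to reproduce. The sketch below (plain asyncio standing in for a real API pod) runs 500 concurrent "requests" that each wait half a second on simulated I/O, then compares wall-clock time against actual CPU time consumed:

```python
import asyncio
import time

async def fake_db_call():
    # Stand-in for a request handler blocked on the database:
    # the coroutine waits, consuming essentially no CPU.
    await asyncio.sleep(0.5)

async def serve_concurrent(n):
    # n "concurrent requests", all waiting on I/O at once
    await asyncio.gather(*(fake_db_call() for _ in range(n)))

wall_start = time.perf_counter()
cpu_start = time.process_time()
asyncio.run(serve_concurrent(500))
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

# CPU time is a tiny fraction of wall time: the process looks idle
# to a CPU-based autoscaler while it is fully saturated.
print(f"wall: {wall:.2f}s  cpu: {cpu:.3f}s")
```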
Better Signal: Request Latency
Scale based on what users actually experience — response time:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_request_duration_p99
      target:
        type: AverageValue
        averageValue: 500m  # 500ms target
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
```
When P99 latency exceeds 500ms, HPA adds pods. When latency drops, it scales down (slowly, to avoid flapping).
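The scale-up behavior follows the standard HPA algorithm — desired replicas is ceil(current × currentMetric / targetMetric) — then the scaleUp policy above limits how fast that can happen. A quick illustration (helper names are mine, for the sketch only):

```python
import math

def desired_replicas(current, metric_value, target_value):
    # Core HPA formula: desired = ceil(current * metric / target)
    return math.ceil(current * metric_value / target_value)

def apply_scale_up_policy(current, desired, pods_per_period=2):
    # The scaleUp policy above allows at most 2 new pods per 60s period
    return min(desired, current + pods_per_period)

# P99 at 800ms against the 500ms target, with 4 pods running:
raw = desired_replicas(4, 0.8, 0.5)     # ceil(6.4) = 7
capped = apply_scale_up_policy(4, raw)  # policy caps this step at 6
print(raw, capped)
```

The cap means a large latency spike is absorbed over a few 60-second periods rather than in one burst of new pods.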
The Infrastructure Cost
Custom metrics HPA requires a metrics pipeline:
```
Application
  → Datadog Agent (DaemonSet)
  → Datadog Metrics API
  → Datadog Cluster Agent
  → Kubernetes External Metrics API
  → HPA controller
```
This is more infrastructure than CPU-based scaling. The tradeoff: accurate scaling behavior vs. simpler setup.
Practical Implementation
1. Expose Latency Metrics
```python
# FastAPI middleware that records request duration
import time

from datadog import statsd
from fastapi import FastAPI

app = FastAPI()

@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    statsd.histogram(
        "http.request.duration",
        duration,
        tags=[f"path:{request.url.path}", f"method:{request.method}"],
    )
    return response
```
2. Configure Datadog Cluster Agent
```yaml
# values.yaml for Datadog Helm chart
clusterAgent:
  metricsProvider:
    enabled: true
    useDatadogMetrics: true
```
3. Create a DatadogMetric Resource
```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: loan-api-p99-latency
spec:
  query: "p99:http.request.duration{service:loan-api}.rollup(avg, 60)"
```
4. Reference in HPA
```yaml
metrics:
- type: External
  external:
    metric:
      name: datadogmetric@default:loan-api-p99-latency
    target:
      type: AverageValue
      averageValue: 0.5  # 500ms
```
Results
After switching from CPU to latency-based HPA:
- Scale-up accuracy improved — pods added when users experience slow responses, not when CPU spikes from GC
- Scale-down is slower but safer — 5-minute stabilization window prevents flapping
- Cost neutral — we scale up sooner but also scale down correctly instead of holding unnecessary pods
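The "slower but safer" scale-down comes from how stabilization works: during the window, the HPA acts on the highest replica recommendation it has seen, so a brief latency dip can't shrink the deployment prematurely. A minimal sketch of that rule:

```python
def stabilized_recommendation(window_recommendations):
    # For scaleDown, the HPA uses the *largest* recommendation observed
    # inside stabilizationWindowSeconds (300s here), which prevents
    # flapping on short-lived dips in the metric.
    return max(window_recommendations)

# Replica recommendations sampled over the last 5 minutes:
print(stabilized_recommendation([8, 5, 4, 6]))  # holds at 8
```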
When CPU Scaling Is Fine
| Workload | Use CPU? |
|---|---|
| ML model inference | Yes — CPU-bound |
| Image processing | Yes — CPU-bound |
| Batch ETL jobs | Yes — CPU-bound |
| API with database calls | No — use latency |
| API with upstream HTTP calls | No — use latency |
Takeaway
CPU utilization measures how busy the processor is. For I/O-bound API services, what matters is how fast users get responses. Scale on the metric that reflects the user experience: request latency.