The Default That Fails
Most Kubernetes HPA configurations start with CPU:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
This works for compute-heavy batch jobs. For API services, it's almost always the wrong signal.
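For reference, the math behind this default is worth seeing concretely. A minimal sketch of the resource-metric calculation (the helper function is mine, for illustration; it is not part of any Kubernetes API): utilization is averaged across pods as a percentage of each pod's CPU request, then compared against the target.

```python
import math

def cpu_desired_replicas(pod_usage_millicores, request_millicores, target_pct, current_replicas):
    # Average utilization across pods, as a percentage of the CPU request
    avg_util = 100 * sum(pod_usage_millicores) / (request_millicores * len(pod_usage_millicores))
    # Core HPA formula: desired = ceil(current * currentMetric / targetMetric)
    return math.ceil(current_replicas * avg_util / target_pct)

# 4 pods each requesting 500m CPU, currently using [450, 400, 500, 350] millicores:
# average utilization is 85% against the 70% target, so the HPA scales to 5 pods.
print(cpu_desired_replicas([450, 400, 500, 350], 500, 70, 4))  # 5
```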
Why CPU Lies for API Services
API services spend most of their time waiting — for database responses, upstream API calls, cache lookups. An API pod at 15% CPU utilization might be processing 500 concurrent requests, all waiting on I/O. CPU is low, but the pod is saturated.
The reverse is also true: a pod doing connection setup or TLS handshakes might spike to 80% CPU briefly, triggering a scale-up when no additional capacity is needed.
```
Scenario: 500 concurrent requests, all waiting on DB
├── CPU utilization: 15%     ← HPA says "fine, no scaling needed"
├── P99 latency: 3.2 seconds ← Users say "this is broken"
└── Connection pool: 100%    ← Pod is actually saturated
```
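The gap between the two signals is easy to reproduce. The sketch below (plain asyncio standing in for a real API pod) runs 500 concurrent "requests" that each wait half a second on simulated I/O, then compares wall-clock time against actual CPU time consumed:

```python
import asyncio
import time

async def fake_db_call():
    # Stand-in for a request handler blocked on the database:
    # the coroutine waits, consuming essentially no CPU.
    await asyncio.sleep(0.5)

async def serve_concurrent(n):
    # n "concurrent requests", all waiting on I/O at once
    await asyncio.gather(*(fake_db_call() for _ in range(n)))

wall_start = time.perf_counter()
cpu_start = time.process_time()
asyncio.run(serve_concurrent(500))
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

# CPU time is a tiny fraction of wall time: the process looks idle
# to a CPU-based autoscaler while it is fully saturated.
print(f"wall: {wall:.2f}s  cpu: {cpu:.3f}s")
```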
Better Signal: Request Latency
Scale based on what users actually experience — response time:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_request_duration_p99
      target:
        type: AverageValue
        averageValue: 500m  # 500ms target
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
```
When P99 latency exceeds 500ms, HPA adds pods. When latency drops, it scales down (slowly, to avoid flapping).
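The scale-up behavior follows the standard HPA algorithm — desired replicas is ceil(current × currentMetric / targetMetric) — then the scaleUp policy above limits how fast that can happen. A quick illustration (helper names are mine, for the sketch only):

```python
import math

def desired_replicas(current, metric_value, target_value):
    # Core HPA formula: desired = ceil(current * metric / target)
    return math.ceil(current * metric_value / target_value)

def apply_scale_up_policy(current, desired, pods_per_period=2):
    # The scaleUp policy above allows at most 2 new pods per 60s period
    return min(desired, current + pods_per_period)

# P99 at 800ms against the 500ms target, with 4 pods running:
raw = desired_replicas(4, 0.8, 0.5)     # ceil(6.4) = 7
capped = apply_scale_up_policy(4, raw)  # policy caps this step at 6
print(raw, capped)
```

The cap means a large latency spike is absorbed over a few 60-second periods rather than in one burst of new pods.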
The Infrastructure Cost
Custom metrics HPA requires a metrics pipeline:
```
Application
  → Datadog Agent (DaemonSet)
  → Datadog Metrics API
  → Datadog Cluster Agent
  → Kubernetes External Metrics API
  → HPA controller
```
This is more infrastructure than CPU-based scaling. The tradeoff: accurate scaling behavior vs. simpler setup.
Practical Implementation
1. Expose Latency Metrics
```python
# FastAPI middleware that records request duration
import time

from datadog import statsd
from fastapi import FastAPI

app = FastAPI()

@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    statsd.histogram(
        "http.request.duration",
        duration,
        tags=[f"path:{request.url.path}", f"method:{request.method}"],
    )
    return response
```
2. Configure Datadog Cluster Agent
```yaml
# values.yaml for Datadog Helm chart
clusterAgent:
  metricsProvider:
    enabled: true
    useDatadogMetrics: true
```
3. Create a DatadogMetric Resource
```yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: loan-api-p99-latency
spec:
  query: "p99:http.request.duration{service:loan-api}.rollup(avg, 60)"
```
4. Reference in HPA
```yaml
metrics:
- type: External
  external:
    metric:
      name: datadogmetric@default:loan-api-p99-latency
    target:
      type: AverageValue
      averageValue: 0.5  # 500ms
```
Results
After switching from CPU to latency-based HPA:
- Scale-up accuracy improved — pods added when users experience slow responses, not when CPU spikes from GC
- Scale-down is slower but safer — 5-minute stabilization window prevents flapping
- Cost neutral — we scale up sooner but also scale down correctly instead of holding unnecessary pods
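The "slower but safer" scale-down comes from how stabilization works: during the window, the HPA acts on the highest replica recommendation it has seen, so a brief latency dip can't shrink the deployment prematurely. A minimal sketch of that rule:

```python
def stabilized_recommendation(window_recommendations):
    # For scaleDown, the HPA uses the *largest* recommendation observed
    # inside stabilizationWindowSeconds (300s here), which prevents
    # flapping on short-lived dips in the metric.
    return max(window_recommendations)

# Replica recommendations sampled over the last 5 minutes:
print(stabilized_recommendation([8, 5, 4, 6]))  # holds at 8
```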
When CPU Scaling Is Fine
| Workload | Use CPU? |
|---|---|
| ML model inference | Yes — CPU-bound |
| Image processing | Yes — CPU-bound |
| Batch ETL jobs | Yes — CPU-bound |
| API with database calls | No — use latency |
| API with upstream HTTP calls | No — use latency |
Takeaway
CPU utilization measures how busy the processor is. For I/O-bound API services, what matters is how fast users get responses. Scale on the metric that reflects the user experience: request latency.