Problem Context
A critical FastAPI service sat on the hot path for multiple downstream consumers — dashboards, risk systems, and partner integrations all depended on it. Under growing volume, response times were degrading. P95 latency was climbing past acceptable thresholds, and the system couldn't handle projected load for the next quarter.
The root causes were layered:
- Synchronous database calls blocking the event loop
- N+1 query patterns in the ORM layer
- No caching for frequently-accessed records
- Unindexed joins on hot query paths
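The first root cause is easy to reproduce in isolation. Below is a minimal, self-contained sketch (not the service's code): a blocking `time.sleep` stands in for a synchronous database call and serializes every coroutine on the event loop, while an awaited `asyncio.sleep` lets the waits overlap.

```python
import asyncio
import time

async def blocking_handler() -> None:
    # A synchronous call (time.sleep here, a sync DB driver in production)
    # holds the event loop: no other coroutine runs until it returns.
    time.sleep(0.2)

async def async_handler() -> None:
    # An awaitable yields control, so concurrent waits overlap.
    await asyncio.sleep(0.2)

async def main() -> tuple[float, float]:
    start = time.perf_counter()
    await asyncio.gather(*(blocking_handler() for _ in range(3)))
    blocking_total = time.perf_counter() - start  # roughly 0.6s: serialized

    start = time.perf_counter()
    await asyncio.gather(*(async_handler() for _ in range(3)))
    async_total = time.perf_counter() - start  # roughly 0.2s: overlapped
    return blocking_total, async_total

blocking_total, async_total = asyncio.run(main())
```

Three concurrent "requests" take three times as long when each one blocks the loop, which is exactly the degradation pattern the service showed under load.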
System Architecture
The Optimization Journey
Step 1: Profile First
Before writing any code, I profiled the existing API with Datadog APM. The results were surprising:
| Component | % of Total Latency |
|---|---|
| SQL query execution | 62% |
| ORM serialization | 18% |
| Network overhead | 12% |
| Application logic | 8% |
The database was the bottleneck, not the framework. This changed the optimization strategy entirely — async migration alone wouldn't solve the problem.
Step 2: SQL Query Optimization
The biggest wins came from rewriting queries:
N+1 elimination: The ORM was issuing a separate query for each parent's related entities. Replaced with explicit JOINs and `selectinload` eager-loading strategies:

```python
from sqlalchemy import select
from sqlalchemy.orm import selectinload

# Before: N+1 — one query per parent's related records
result = await session.execute(select(Record))
for record in result.scalars():
    details = await session.execute(
        select(Detail).where(Detail.record_id == record.id)
    )

# After: eager loading — one query for parents, plus a single
# IN-query that fetches all related details at once
result = await session.execute(
    select(Record).options(selectinload(Record.details))
)
```

Index optimization: Added composite indexes on the three most common filter combinations. One index addition reduced a 200ms query to 12ms.
Query rewriting: Replaced `SELECT *` patterns with explicit column selection on wide tables, reducing serialization overhead by 40%.
Result: SQL execution dropped from 62% to ~25% of total latency.
Step 3: Async Migration
With the database optimized, async migration amplified the gains:
- Replaced `psycopg2` with `asyncpg` — native async PostgreSQL driver
- Configured connection pool with explicit limits: `min_size=10, max_size=50`
- Async route handlers release the event loop during I/O waits
The async migration alone wouldn't have solved the problem — but combined with optimized queries, it unlocked concurrency that was previously blocked on I/O.
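Why the explicit pool limits matter: a capped pool turns bursts into an orderly queue instead of a flood of database connections. The toy model below is not asyncpg's implementation (`BoundedPool` is invented for illustration); it only shows the queueing behavior that an `asyncio.Semaphore`-style cap produces.

```python
import asyncio

class BoundedPool:
    """Toy stand-in for a connection pool with a hard max_size.

    Real drivers (e.g. asyncpg's pool) enforce this internally; the point
    here is that the (max_size + 1)-th acquirer waits until a slot frees up.
    """

    def __init__(self, max_size: int) -> None:
        self._slots = asyncio.Semaphore(max_size)
        self._in_use = 0
        self.peak_concurrency = 0

    async def run_query(self, duration: float) -> None:
        async with self._slots:  # blocks when all slots are taken
            self._in_use += 1
            self.peak_concurrency = max(self.peak_concurrency, self._in_use)
            await asyncio.sleep(duration)  # stands in for query I/O
            self._in_use -= 1

async def main() -> int:
    pool = BoundedPool(max_size=5)
    # 20 concurrent "requests" against a pool of 5: at most 5 run at once,
    # the rest queue instead of opening new database connections.
    await asyncio.gather(*(pool.run_query(0.05) for _ in range(20)))
    return pool.peak_concurrency

peak = asyncio.run(main())
print(peak)
```

Under extreme load this queueing shows up as latency rather than connection exhaustion, which is the tradeoff noted in the table further down.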
Step 4: Caching Layer
For frequently-accessed records, added a TTL-based in-memory cache:
```python
from datetime import datetime, timedelta
from typing import Any

CACHE_TTL = timedelta(minutes=5)
_cache: dict[str, tuple[datetime, Any]] = {}

async def get_record_cached(record_id: str):
    now = datetime.utcnow()
    if record_id in _cache:
        cached_at, data = _cache[record_id]
        if now - cached_at < CACHE_TTL:
            return data
    data = await fetch_from_db(record_id)
    _cache[record_id] = (now, data)
    return data
```

Why TTL over event-driven invalidation: TTL is simpler to implement and reason about. For read-heavy endpoints where 5-minute staleness is acceptable, it's the right tradeoff. For mutable data where staleness is a compliance risk, we use write-through invalidation on the write path instead.
Step 5: Monitoring the Gains
Added Datadog metrics to track the optimization impact over time:
```python
from datadog import statsd

statsd.histogram("api.query.duration", query_time_ms, tags=["endpoint:list"])
statsd.increment("api.cache.hit" if cache_hit else "api.cache.miss")
```

Cache hit rates, query durations, and P95/P99 latencies are all dashboarded — so regressions are caught before users notice.
Technical Tradeoffs
| Decision | Benefit | Cost |
|---|---|---|
| asyncpg over psycopg2 | Non-blocking I/O, higher concurrency | Harder to debug, async stack traces are less readable |
| TTL-based cache | Simple, predictable, no infrastructure dependency | Brief staleness window; not suitable for all data |
| Connection pool limits | Prevents database connection exhaustion | Requests queue when pool is saturated under extreme load |
| Composite indexes | Dramatic query speedup on common filter paths | Write overhead (index maintenance), more storage |
| Explicit column selection | Reduced serialization cost on wide tables | Schema changes require updating select lists |
Impact
- ~50% lower P95 latency — from query optimization + async + caching combined
- ~40% higher throughput — async unlocked concurrency that was previously blocked on I/O
- 3x headroom — system handles 3x previous peak load without scaling incidents
- Predictable performance — P99 latency variance reduced by 60%
Lessons Learned
- Profile before you optimize. I assumed async migration would be the biggest win. It wasn't — SQL query rewrites delivered 3x more improvement. Always measure first.
- N+1 queries are the silent killer. They don't show up in single-request profiling. You only see them under load, when the database connection pool saturates. Use `selectinload` or explicit JOINs.
- Composite indexes are high-leverage. One well-placed index can turn a 200ms query into a 12ms one. Analyze your top 5 queries with `EXPLAIN ANALYZE` and index accordingly.
- Caching needs observability. A cache without monitoring is a black box. Track hit rates, staleness, and eviction counts. A cache with a 20% hit rate is wasting memory.
- Connection pool sizing is an art. Too small → requests queue. Too large → the database gets overwhelmed. Start with `max_size = 2 * CPU_cores`, then tune with load testing.
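The observability lesson is cheap to act on. A minimal hit/miss counter like the sketch below (invented here; in production these counts feed statsd as shown in Step 5) is enough to spot a cache that is wasting memory.

```python
class CacheStats:
    """Running hit/miss counters for a cache; a sketch, not a library API."""

    def __init__(self) -> None:
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
for hit in (True, False, False, False, False):
    stats.record(hit)
print(stats.hit_rate)  # 0.2: the "wasting memory" territory described above
```

Wired into `get_record_cached`, one `record()` call per lookup gives a live hit rate with no extra infrastructure.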