Problem Context
A critical FastAPI service sat on the hot path for multiple downstream consumers — dashboards, risk systems, and partner integrations all depended on it. Under growing volume, response times were degrading. P95 latency was climbing past acceptable thresholds, and the system couldn't handle projected load for the next quarter.
The root causes were layered:
- Synchronous database calls blocking the event loop
- N+1 query patterns in the ORM layer
- No caching for frequently-accessed records
- Unindexed joins on hot query paths
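The first root cause is easy to reproduce in isolation. Below is a minimal, self-contained sketch (not the service's code): a blocking `time.sleep` stands in for a synchronous database call and serializes every coroutine on the event loop, while an awaited `asyncio.sleep` lets the waits overlap.

```python
import asyncio
import time

async def blocking_handler() -> None:
    # A synchronous call (time.sleep here, a sync DB driver in production)
    # holds the event loop: no other coroutine runs until it returns.
    time.sleep(0.2)

async def async_handler() -> None:
    # An awaitable yields control, so concurrent waits overlap.
    await asyncio.sleep(0.2)

async def main() -> tuple[float, float]:
    start = time.perf_counter()
    await asyncio.gather(*(blocking_handler() for _ in range(3)))
    blocking_total = time.perf_counter() - start  # roughly 0.6s: serialized

    start = time.perf_counter()
    await asyncio.gather(*(async_handler() for _ in range(3)))
    async_total = time.perf_counter() - start  # roughly 0.2s: overlapped
    return blocking_total, async_total

blocking_total, async_total = asyncio.run(main())
```

Three concurrent "requests" take three times as long when each one blocks the loop, which is exactly the degradation pattern the service showed under load.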
System Architecture
The Optimization Journey
Step 1: Profile First
Before writing any code, I profiled the existing API with Datadog APM. The results were surprising:
| Component | % of Total Latency |
|---|---|
| SQL query execution | 62% |
| ORM serialization | 18% |
| Network overhead | 12% |
| Application logic | 8% |
The database was the bottleneck, not the framework. This changed the optimization strategy entirely — async migration alone wouldn't solve the problem.
Step 2: SQL Query Optimization
The biggest wins came from rewriting queries:
N+1 elimination: The ORM was issuing a separate query for each parent's related entities. Replaced with explicit JOINs and `selectinload` eager-loading strategies:

```python
from sqlalchemy import select
from sqlalchemy.orm import selectinload

# Before: N+1 — one query per parent's related records
result = await session.execute(select(Record))
for record in result.scalars():
    details = await session.execute(
        select(Detail).where(Detail.record_id == record.id)
    )

# After: eager loading — one query for parents, plus a single
# IN-query that fetches all related details at once
result = await session.execute(
    select(Record).options(selectinload(Record.details))
)
```

Index optimization: Added composite indexes on the three most common filter combinations. One index addition reduced a 200ms query to 12ms.
Query rewriting: Replaced `SELECT *` patterns with explicit column selection on wide tables, reducing serialization overhead by 40%.
Result: SQL execution dropped from 62% to ~25% of total latency.
Step 3: Async Migration
With the database optimized, async migration amplified the gains:
- Replaced `psycopg2` with `asyncpg` — native async PostgreSQL driver
- Configured connection pool with explicit limits: `min_size=10, max_size=50`
- Async route handlers release the event loop during I/O waits
The async migration alone wouldn't have solved the problem — but combined with optimized queries, it unlocked concurrency that was previously blocked on I/O.
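Why the explicit pool limits matter: a capped pool turns bursts into an orderly queue instead of a flood of database connections. The toy model below is not asyncpg's implementation (`BoundedPool` is invented for illustration); it only shows the queueing behavior that an `asyncio.Semaphore`-style cap produces.

```python
import asyncio

class BoundedPool:
    """Toy stand-in for a connection pool with a hard max_size.

    Real drivers (e.g. asyncpg's pool) enforce this internally; the point
    here is that the (max_size + 1)-th acquirer waits until a slot frees up.
    """

    def __init__(self, max_size: int) -> None:
        self._slots = asyncio.Semaphore(max_size)
        self._in_use = 0
        self.peak_concurrency = 0

    async def run_query(self, duration: float) -> None:
        async with self._slots:  # blocks when all slots are taken
            self._in_use += 1
            self.peak_concurrency = max(self.peak_concurrency, self._in_use)
            await asyncio.sleep(duration)  # stands in for query I/O
            self._in_use -= 1

async def main() -> int:
    pool = BoundedPool(max_size=5)
    # 20 concurrent "requests" against a pool of 5: at most 5 run at once,
    # the rest queue instead of opening new database connections.
    await asyncio.gather(*(pool.run_query(0.05) for _ in range(20)))
    return pool.peak_concurrency

peak = asyncio.run(main())
print(peak)
```

Under extreme load this queueing shows up as latency rather than connection exhaustion, which is the tradeoff noted in the table further down.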
Step 4: Caching Layer
For frequently-accessed records, added a TTL-based in-memory cache:
```python
from datetime import datetime, timedelta
from typing import Any

CACHE_TTL = timedelta(minutes=5)
_cache: dict[str, tuple[datetime, Any]] = {}

async def get_record_cached(record_id: str):
    now = datetime.utcnow()
    if record_id in _cache:
        cached_at, data = _cache[record_id]
        if now - cached_at < CACHE_TTL:
            return data
    data = await fetch_from_db(record_id)
    _cache[record_id] = (now, data)
    return data
```

Why TTL over event-driven invalidation: TTL is simpler to implement and reason about. For read-heavy endpoints where 5-minute staleness is acceptable, it's the right tradeoff. For mutable data where staleness is a compliance risk, we use write-through invalidation on the write path instead.
Step 5: Monitoring the Gains
Added Datadog metrics to track the optimization impact over time:
```python
from datadog import statsd

statsd.histogram("api.query.duration", query_time_ms, tags=["endpoint:list"])
statsd.increment("api.cache.hit" if cache_hit else "api.cache.miss")
```

Cache hit rates, query durations, and P95/P99 latencies are all dashboarded — so regressions are caught before users notice.
Technical Tradeoffs
| Decision | Benefit | Cost |
|---|---|---|
| asyncpg over psycopg2 | Non-blocking I/O, higher concurrency | Harder to debug, async stack traces are less readable |
| TTL-based cache | Simple, predictable, no infrastructure dependency | Brief staleness window; not suitable for all data |
| Connection pool limits | Prevents database connection exhaustion | Requests queue when pool is saturated under extreme load |
| Composite indexes | Dramatic query speedup on common filter paths | Write overhead (index maintenance), more storage |
| Explicit column selection | Reduced serialization cost on wide tables | Schema changes require updating select lists |
Impact
- ~50% lower P95 latency — from query optimization + async + caching combined
- ~40% higher throughput — async unlocked concurrency that was previously blocked on I/O
- 3x headroom — system handles 3x previous peak load without scaling incidents
- Predictable performance — P99 latency variance reduced by 60%
Lessons Learned
- Profile before you optimize. I assumed async migration would be the biggest win. It wasn't — SQL query rewrites delivered 3x more improvement. Always measure first.
- N+1 queries are the silent killer. They don't show up in single-request profiling. You only see them under load, when the database connection pool saturates. Use `selectinload` or explicit JOINs.
- Composite indexes are high-leverage. One well-placed index can turn a 200ms query into a 12ms one. Analyze your top 5 queries with `EXPLAIN ANALYZE` and index accordingly.
- Caching needs observability. A cache without monitoring is a black box. Track hit rates, staleness, and eviction counts. A cache with a 20% hit rate is wasting memory.
- Connection pool sizing is an art. Too small → requests queue. Too large → the database gets overwhelmed. Start with `max_size = 2 * CPU_cores`, then tune with load testing.
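The observability lesson is cheap to act on. A minimal hit/miss counter like the sketch below (invented here; in production these counts feed statsd as shown in Step 5) is enough to spot a cache that is wasting memory.

```python
class CacheStats:
    """Running hit/miss counters for a cache; a sketch, not a library API."""

    def __init__(self) -> None:
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
for hit in (True, False, False, False, False):
    stats.record(hit)
print(stats.hit_rate)  # 0.2: the "wasting memory" territory described above
```

Wired into `get_record_cached`, one `record()` call per lookup gives a live hit rate with no extra infrastructure.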