Engineering Notes
Short, practical lessons from building production systems. Each note captures one key insight.
Problem: A batch-processing script using an on-device LLM hung after 5-6 items. No timeout, no error; it was just stuck. The script created a fresh session for every item: for item in items: session = La…
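A minimal sketch of the likely fix, assuming a hypothetical Session class standing in for the real on-device LLM API (whose name is truncated above): construct one session outside the loop and reuse it, instead of allocating per item.

```python
class Session:
    """Stand-in for the real on-device LLM session object (hypothetical)."""
    instances = 0

    def __init__(self):
        # Each construction allocates model resources; creating one per
        # item is the pattern that eventually hung the original script.
        Session.instances += 1

    def generate(self, prompt: str) -> str:
        return f"response to {prompt!r}"


def process(items):
    # Fix: create the session once, outside the loop, and reuse it.
    session = Session()
    return [session.generate(item) for item in items]


results = process(["item 1", "item 2", "item 3"])
```

The stub only counts constructions; the point is the loop shape, not the API.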
Problem: AI coding assistants consume tokens, make tool calls, and modify files, but the only visibility you get is the chat output. There's no dashboard showing context-window usage, cost, or active tool…
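A sketch of the kind of lightweight tracking the note argues for. The context-window size and per-token rate below are illustrative placeholders, not real limits or pricing.

```python
from dataclasses import dataclass, field


@dataclass
class UsageTracker:
    """Accumulates per-session assistant usage (illustrative sketch)."""
    context_limit: int = 200_000         # assumed window size, placeholder
    price_per_1k_tokens: float = 0.01    # placeholder rate, not real pricing
    tokens_used: int = 0
    tool_calls: list = field(default_factory=list)

    def record(self, tokens: int, tool: str = "") -> None:
        self.tokens_used += tokens
        if tool:
            self.tool_calls.append(tool)

    @property
    def context_pct(self) -> float:
        return 100 * self.tokens_used / self.context_limit

    @property
    def cost(self) -> float:
        return self.tokens_used / 1000 * self.price_per_1k_tokens


tracker = UsageTracker()
tracker.record(1500, tool="edit_file")
tracker.record(500)
```

Surfacing `context_pct`, `cost`, and `tool_calls` in a status line is the "dashboard" the note says is missing.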
Problem: An LLM-powered HTML generator injected all 18 design-system component patterns into every prompt, ~3,055 tokens of context. But most requests needed only 3-7 components. The unused patterns w…
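The obvious remedy is selective injection: include only the patterns a request actually needs. A sketch under that assumption; the component names and snippets are made up, and real matching would likely be smarter than substring search.

```python
# Hypothetical pattern library: component name -> prompt snippet.
PATTERNS = {
    "button": 'Use <button class="btn">…</button> for actions.',
    "card": 'Wrap grouped content in <div class="card">…</div>.',
    "modal": 'Modals use the native <dialog> element.',
    "table": 'Tabular data uses <table class="data">…</table>.',
}


def build_context(request: str) -> str:
    """Inject only the patterns the request mentions, instead of
    all 18 on every call (naive substring matching for illustration)."""
    needed = [name for name in PATTERNS if name in request.lower()]
    return "\n".join(PATTERNS[n] for n in needed)


ctx = build_context("Generate a card with a primary button")
```

Here only the card and button snippets reach the prompt; the modal and table patterns stay out of the context.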
Problem: I assumed migrating a synchronous FastAPI service to async would be the biggest performance win; I was ready to rewrite the entire database layer. Key Insight: Profiling with Datadog APM revealed…
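The transferable lesson is to measure before rewriting. A minimal stand-in for APM span timing using only the standard library (this is not the Datadog API), showing how per-section timings identify the real bottleneck:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def span(name: str):
    """Minimal stand-in for an APM span: record wall time per section."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


def handle_request():
    with span("db_query"):
        time.sleep(0.02)    # simulated slow blocking call
    with span("serialize"):
        time.sleep(0.001)   # simulated fast section


handle_request()
slowest = max(timings, key=timings.get)
```

With numbers like these in hand, you rewrite only the section that dominates, not the whole layer.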
Problem: Our average-latency alerts never fired, but users were complaining about slow responses. The dashboard showed a 45 ms average, well within thresholds. Key Insight: One 10-second request hidden in…
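Presumably the point is tail latency hidden in the mean. A sketch under that assumption, using a simple nearest-rank percentile and made-up numbers: a few 10-second outliers barely move the average but dominate p99.

```python
def percentile(values, p):
    """Nearest-rank percentile (simple sketch, no interpolation)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]


# 95 fast requests plus a handful of 10-second outliers:
# the mean stays deceptively low while p99 exposes the tail.
latencies_ms = [45] * 95 + [10_000] * 5
mean = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)
```

Alerting on p99 (or p95) instead of the mean is what would have caught these complaints.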
Problem: Distributed tracing across 15+ microservices was broken. Developers were supposed to pass correlation IDs in every request, but they kept forgetting. Some services generated their own IDs, others…
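The usual fix is to move the ID out of developers' hands and into shared middleware: honor an inbound ID, mint one at the edge if it's missing, and attach it to every outbound call automatically. A framework-agnostic sketch using contextvars; the header name is an assumption.

```python
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id")
HEADER = "X-Correlation-ID"  # assumed header name


def extract_or_create(headers: dict) -> str:
    """Middleware step: reuse the caller's ID, or mint one if absent,
    so no service ever generates a conflicting ID of its own."""
    cid = headers.get(HEADER) or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def outbound_headers() -> dict:
    # Every downstream call automatically carries the same ID.
    return {HEADER: correlation_id.get()}


extract_or_create({HEADER: "abc-123"})
```

Once this runs in shared middleware, forgetting to pass the ID is no longer possible for individual endpoints.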
Problem: We used the same /health endpoint for both liveness and readiness probes. Under load, the health check included a database ping that timed out, so Kubernetes marked pods as unhealthy and restarted…
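A sketch of the split the note implies: liveness answers "is the process alive?" and must not touch dependencies, while readiness answers "can I serve traffic right now?" and is where dependency checks belong. Handler names and the DB check are placeholders.

```python
def check_db() -> bool:
    """Stand-in for the real database ping."""
    return True


def liveness() -> tuple[int, str]:
    # Liveness probe: never check dependencies here. A slow database
    # would otherwise make Kubernetes restart perfectly healthy pods.
    return 200, "alive"


def readiness() -> tuple[int, str]:
    # Readiness probe: dependency checks belong here. Failing this
    # removes the pod from the Service endpoints without restarting it.
    return (200, "ready") if check_db() else (503, "not ready")
```

Restarting a pod because its database is slow makes the overload worse; failing readiness just sheds traffic until the dependency recovers.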
Problem: We published a "Logging Best Practices" style guide. It said: use JSON format, include correlation IDs, follow a consistent schema. Six months later, every service logged differently. Some use…
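The common remedy is to ship a shared logging library instead of a document, so the schema is enforced by code rather than by convention. A sketch with stdlib logging; the field names are assumed, not the guide's actual schema.

```python
import json
import logging
import sys

SCHEMA_FIELDS = ("ts", "level", "service", "correlation_id", "message")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with a fixed field set, so every
    service that uses the shared logger gets the same schema for free."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


def get_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Teams adopt one import line instead of re-reading a style guide, and schema drift becomes a code review problem rather than a documentation one.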