I catalogued every AI decision across three production systems and found a consistent pattern — along with five gaps I'd been working around instead of solving.
Five manual runs looked fine. Then 36 automated tests exposed non-deterministic sourcing, biased scoring, and a confidence threshold that fired randomly.
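A minimal sketch of the kind of repeated-run check that surfaces this, assuming a hypothetical `screen_candidate(profile)` entry point into the pipeline (the stub below is deterministic only so the sketch runs on its own):

```python
import json

def screen_candidate(profile: dict) -> dict:
    # Stand-in for the real pipeline call (LLM + retrieval); swap in the
    # actual entry point. Deterministic here purely so the sketch executes.
    return {"score": 0.72, "sources": sorted(profile.get("skills", []))}

def test_repeated_runs_are_identical():
    profile = {"name": "A. Candidate", "skills": ["python", "kafka"]}
    runs = [json.dumps(screen_candidate(profile), sort_keys=True) for _ in range(10)]
    # Non-deterministic sourcing or a flaky confidence threshold shows up as
    # more than one distinct serialized result for the same input.
    distinct = len(set(runs))
    assert distinct == 1, f"{distinct} distinct outputs across 10 identical runs"

if __name__ == "__main__":
    test_repeated_runs_are_identical()
    print("deterministic across 10 runs")
```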
I added vector search to the screening pipeline and watched it rank a junior frontend developer above a Principal Engineer whose systems processed 1B+ events/day. The embedding model matched vocabulary, not capability.
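A toy illustration of the failure mode, using bag-of-words cosine similarity as a stand-in for the embedding model and made-up profile text (assumption: the pipeline ranks by cosine similarity over embedded free text; real embeddings are subtler, but vocabulary overlap still dominates):

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    # Bag-of-words cosine similarity over whitespace tokens.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "senior engineer for our react typescript web product that must handle heavy traffic"
junior = "junior developer react typescript web product portfolio projects"
principal = "principal engineer designed an event pipeline handling a billion events per day"

# The junior profile shares the query's surface vocabulary, so it scores
# higher; the principal's profile expresses the scaling capability the role
# needs, but in different words, so it scores near zero.
print(f"junior: {cosine(query, junior):.2f}  principal: {cosine(query, principal):.2f}")
```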
The model autonomously combined keyword search and vector search in the optimal sequence — without being told to. Then I ran experiments to measure what vague descriptions, over-calling, and temperature actually do to tool selection.
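A sketch of how the two tools might be exposed to the model; the names, descriptions, and schemas below are assumptions for illustration, not the actual definitions from the experiments:

```python
# Tool definitions passed to the model. With descriptions this explicit, the
# model tends to narrow the pool with keyword_search first, then rank the
# survivors with vector_search; vaguer descriptions are where tool selection
# starts to degrade.
TOOLS = [
    {
        "name": "keyword_search",
        "description": "Exact-match search over candidate profiles. "
                       "Use for hard requirements such as titles or specific tech.",
        "parameters": {"query": "string", "limit": "integer"},
    },
    {
        "name": "vector_search",
        "description": "Semantic similarity search over candidate profiles. "
                       "Use to rank candidates by overall relevance to a description.",
        "parameters": {"query": "string", "limit": "integer"},
    },
]
```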
ReAct agents recover from their own mistakes — not because the model is clever, but because of how tools return errors and how the loop is structured. Here's what that looks like in practice.
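A minimal sketch of that loop structure, assuming a hypothetical `call_model(messages, tools)` client and a dict of plain Python tool functions; the key detail is that tool failures are caught and handed back to the model as observations rather than raised:

```python
def run_agent(call_model, tools: dict, task: str, max_steps: int = 8) -> str:
    """Minimal ReAct-style loop: think, act, observe, repeat."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(messages, list(tools))      # model picks a tool or answers
        if action["type"] == "final_answer":
            return action["content"]
        try:
            observation = tools[action["name"]](**action["arguments"])
        except Exception as exc:
            # The error becomes an observation, so the next step can correct
            # the tool name or arguments instead of crashing the run.
            observation = f"Tool error: {exc!r}. Adjust the call and retry."
        messages.append({"role": "assistant", "content": str(action)})
        messages.append({"role": "tool", "content": str(observation)})
    return "Stopped: step limit reached."
```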
Three projects, three evaluations, zero cases where fine-tuning was justified. Here's the decision framework, the cost math, and why simpler approaches won every time.
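A back-of-envelope version of that cost comparison; every price, token count, and volume below is a hypothetical placeholder, not a number from the evaluations:

```python
def monthly_cost(requests: int, tokens_per_request: int,
                 price_per_1k_tokens: float, fixed_monthly: float = 0.0) -> float:
    """Variable inference cost plus any fixed hosting/upkeep cost."""
    return requests * tokens_per_request / 1000 * price_per_1k_tokens + fixed_monthly

requests = 50_000                                        # hypothetical monthly volume
prompted = monthly_cost(requests, 3_000, 0.005)          # long few-shot prompt, base model
fine_tuned = monthly_cost(requests, 800, 0.012, 400.0)   # shorter prompt, pricier model + hosting

print(f"prompted: ${prompted:,.0f}/mo  fine-tuned: ${fine_tuned:,.0f}/mo")
# At this volume the prompting baseline still wins; fine-tuning only pays off
# once per-request savings outgrow training, hosting, and eval upkeep.
```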