The previous article in this series evaluated fine-tuning across three projects and concluded that in-context techniques like prompting, RAG, and tool use won every time. But keeping your entire pipeline in-context has a quiet cost: your token usage explodes, your system's complexity shifts to autonomous agent loops, and the cognitive overhead of review climbs.
We are discovering that in the era of AI tools, generating code is becoming cheap, but reviewing and reasoning about it is becoming expensive. When you move from a weekend prototype to a production system that you have to maintain, the bottleneck is no longer writing the code. It is understanding its side effects and owning its blast radius.
The next phase of AI engineering isn't about more autonomy; it's about optimizing the economics of engineering effort.
Tiered Intelligence: Stop Over-Provisioning Your Brains
Defaulting to the most powerful frontier model for every single task is the architectural equivalent of using a supercomputer to run a calculator. It is not just expensive; it increases the volume of code you have to read. Smaller models are faster and, crucially, produce more direct, predictable outputs that require less cognitive effort to audit.
I caught myself doing this recently: using a frontier reasoning model on a heavy-thinking run to write a simple shell script to parse test logs. The model took 30 seconds, cost me 40,000 tokens of context, and generated a highly verbose script with three helper functions I didn't need. A lite model—or a single grep command—could have done the job in seconds for a fraction of a cent. My habits were lagging behind my options.
Using tools like Gemini CLI makes the need for model awareness visible:
| Task Category | Model Tier | Example Tool |
|---|---|---|
| Repo Indexing / Search | Flash / Lite | gemini-cli --model flash |
| Repetitive Refactoring | Lite / Pro | cursor / claude-code |
| Architectural Design | Frontier | Deep Thinking / Claude Opus 4.7* |
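The table above can be read as a routing rule: start with the cheapest tier that plausibly fits the task and escalate only on purpose. Here is a minimal sketch of that idea. The category names, the `Tier` enum, and the `pick_tier` helper are all hypothetical, invented for illustration; they are not part of Gemini CLI, Cursor, or any other tool.

```python
from enum import Enum


class Tier(Enum):
    LITE = "lite"
    PRO = "pro"
    FRONTIER = "frontier"


# Hypothetical task categories, loosely mirroring the table above.
# Each entry lists tiers in the order they should be tried, cheapest first.
TIERS_BY_TASK = {
    "repo_search": [Tier.LITE],
    "repetitive_refactor": [Tier.LITE, Tier.PRO],
    "architecture_design": [Tier.FRONTIER],
}


def pick_tier(task_category: str, attempt: int = 0) -> Tier:
    """Return the tier to use for a task on the given attempt (0-based).

    Unknown categories fall back to LITE, so reaching for a bigger model
    is a deliberate choice rather than the default. Attempts beyond the
    configured list stay on the last (most capable) configured tier.
    """
    tiers = TIERS_BY_TASK.get(task_category, [Tier.LITE])
    return tiers[min(attempt, len(tiers) - 1)]


if __name__ == "__main__":
    print(pick_tier("repo_search"))             # first attempt, cheapest tier
    print(pick_tier("repetitive_refactor", 1))  # second attempt escalates
```

The design choice worth noting is the fallback: an unrecognized task gets the small model, which makes the expensive path opt-in.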
The Token Mirror: Making Costs Painfully Visible
In the "magic" phase of AI, we don't think about tokens. We just want the answer. But tokens are the unit economics of AI engineering. If the input context is noisy, the output code will be bloated. Bloated code takes longer to review.
To solve this, I’ve been experimenting with an RTK proxy (Reduced Token Knowledge) and Caveman. The RTK proxy acts as a CLI proxy that intercepts shell outputs and prunes them—compressing Git diffs, AWS logs, or test results before they hit the LLM.
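To make the idea of pruning shell output concrete, here is a rough sketch of what such a filter might do. This is not RTK's actual implementation and does not reflect its real rules. The `prune_output` function, its `max_lines` parameter, and the marker strings are assumptions chosen for illustration.

```python
import re

# Matches common ANSI color/formatting escape sequences.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def prune_output(text: str, max_lines: int = 200) -> str:
    """Shrink command output before it is placed in a prompt.

    Illustrative rules only: strip ANSI codes and trailing whitespace,
    drop blank lines, collapse runs of identical lines (blank lines in
    between are ignored), and keep only the head and tail of long output.
    """
    if max_lines < 2:
        raise ValueError("max_lines must be at least 2")

    pruned = []
    previous = None
    repeats = 0
    for raw in text.splitlines():
        line = ANSI_ESCAPE.sub("", raw).rstrip()
        if not line:
            continue  # blank lines carry no signal
        if line == previous:
            repeats += 1  # count the duplicate instead of keeping it
            continue
        if repeats:
            pruned.append(f"[previous line repeated {repeats} more time(s)]")
            repeats = 0
        pruned.append(line)
        previous = line
    if repeats:
        pruned.append(f"[previous line repeated {repeats} more time(s)]")

    if len(pruned) > max_lines:
        head = max_lines // 2
        tail = max_lines - head
        omitted = len(pruned) - max_lines
        pruned = pruned[:head] + [f"[{omitted} lines omitted]"] + pruned[-tail:]
    return "\n".join(pruned)


if __name__ == "__main__":
    noisy = "\x1b[31mERROR\x1b[0m  \n\nERROR\nERROR\nok\n"
    print(prune_output(noisy))
```

A real proxy would likely need command-specific rules (a Git diff and a CloudWatch log have very different noise), but even generic rules like these show where the savings come from.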
Evidence: The 60% Gain
In my recent workflows, routing commands through RTK cut token usage by roughly 60% on average across common development tasks, without degrading the model's output quality.
| Task Context | Raw Size (Tokens) | Pruned Size (Tokens) | Reduction | Output Accuracy |
|---|---|---|---|---|
| Git Diff (Large Refactor) | 18,400 | 6,200 | 66% | 100% correct edits |
| AWS CloudWatch Logs | 42,100 | 12,500 | 70% | Identified root cause |
| Test Run Failures | 8,900 | 4,100 | 54% | Successfully fixed |
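The Reduction column can be re-derived from the raw and pruned sizes. The short script below does that, and also shows two ways of averaging: a simple mean of the three percentages (about 63.5%) and a pooled figure over all tokens (about 67.1%). The numbers are taken straight from the table; only the variable and function names are mine.

```python
# (raw tokens, pruned tokens) per task, copied from the table above.
rows = {
    "Git Diff (Large Refactor)": (18_400, 6_200),
    "AWS CloudWatch Logs": (42_100, 12_500),
    "Test Run Failures": (8_900, 4_100),
}


def reduction(raw: int, pruned: int) -> float:
    """Fraction of tokens removed by pruning."""
    return 1 - pruned / raw


per_task = {name: reduction(raw, pruned) for name, (raw, pruned) in rows.items()}

# Simple mean: every task counts equally.
mean_reduction = sum(per_task.values()) / len(per_task)

# Pooled: every token counts equally, so large inputs dominate.
total_raw = sum(raw for raw, _ in rows.values())
total_pruned = sum(pruned for _, pruned in rows.values())
pooled_reduction = reduction(total_raw, total_pruned)

if __name__ == "__main__":
    for name, value in per_task.items():
        print(f"{name}: {value:.1%}")
    print(f"mean of tasks: {mean_reduction:.1%}")
    print(f"pooled over tokens: {pooled_reduction:.1%}")
```

The pooled figure is the one that tracks the bill, since cost scales with total tokens rather than with the number of tasks.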
By stripping whitespace, boilerplate, and redundant metadata, you're not just saving money—you're maximizing the signal-to-noise ratio.
When you use a tool like Caveman to make these costs painfully visible, your prompting strategy changes. You stop asking for "the whole file" and start asking for the "relevant diff." You do this because you know that every extra token of output is another line of code you have to read, own, and maintain.
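One way to picture the gap between "the whole file" and "the relevant diff" is to compare their sizes directly. The sketch below uses Python's standard `difflib` module to produce a unified diff with a small amount of context. The `relevant_diff` helper and its default path are hypothetical names for illustration, and character count is used only as a crude stand-in for tokens.

```python
import difflib


def relevant_diff(before: str, after: str, path: str = "file.py", context: int = 2) -> str:
    """Return a unified diff between two versions of a file.

    Only changed hunks plus `context` surrounding lines are included,
    so a one-line change in a large file yields a handful of lines.
    Returns an empty string when the versions are identical.
    """
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=context,
    )
    return "".join(diff)


if __name__ == "__main__":
    before = "".join(f"line {i}\n" for i in range(50))
    after = before.replace("line 25\n", "LINE 25\n")
    diff = relevant_diff(before, after)
    print(diff)
    print(f"whole file: {len(after)} chars, diff: {len(diff)} chars")
```

The same asymmetry applies on the output side: a model asked for a diff returns the few lines you need to review, not the file you already have.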
Designing for Review: Owning the Blast Radius
Autonomy is a spectrum, not a goal. Even with robust prompts and coding guidelines, AI output is never guaranteed. The goal of pragmatic AI engineering is to keep the "Cognitive Load of Review" as low as possible.
Constraint over Creativity: Use READMEs and .cursorrules to constrain behavior, not just suggest it. If you don't constrain the model's environment, it will invent abstractions to solve problems you don't have. Tell the model what it is forbidden to do before you tell it what to build.
Simplicity is Debuggability: If the AI generates a 200-line complex abstraction, but you could have written a 50-line simple one, you've inherited technical debt you didn't even write. Reject complexity early. If you can't audit it in 60 seconds, don't commit it.
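The "audit it in 60 seconds" rule can be approximated mechanically by putting a budget on how many lines a change adds. Below is a minimal sketch of such a gate. The 50-line default is an arbitrary assumption rather than a measured threshold, and `added_lines` and `within_review_budget` are hypothetical helpers, not part of any existing tool.

```python
def added_lines(unified_diff: str) -> int:
    """Count lines added by a unified diff.

    The '+++' file header is skipped. Known limitation: an added line
    whose own content starts with '++' would be skipped as well.
    """
    return sum(
        1
        for line in unified_diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )


def within_review_budget(unified_diff: str, max_added: int = 50) -> bool:
    """Return True if the change is small enough to review in one sitting.

    max_added is an assumed budget; tune it to what your team can
    genuinely audit, not to what the model can generate.
    """
    return added_lines(unified_diff) <= max_added


if __name__ == "__main__":
    sample = (
        "--- a/app.py\n"
        "+++ b/app.py\n"
        "@@ -1,2 +1,3 @@\n"
        " import os\n"
        "-print('hi')\n"
        "+print('hello')\n"
        "+print('world')\n"
    )
    print(added_lines(sample), within_review_budget(sample, max_added=1))
```

Line count is a blunt proxy for cognitive load, but it is cheap to compute and easy to wire into a pre-commit or CI check.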
The Infrastructure Tax: Code is a Liability
We often treat code as an asset, but as Greg Kogan notes, software is a liability.[1] AI tools allow us to write code 10x faster, but our engineering organizations are not growing 10x larger to maintain it. This discrepancy triggers a hidden, system-level infrastructure tax:
- Dependency Inflation: More generated code pulls in more packages, deepening the dependency graph and increasing the surface area for security vulnerabilities and breaking changes.
- The CI/CD Crunch: As the codebase swells, so does the test suite. Running a million tests on every single commit becomes a physical bottleneck and a massive cloud bill.
- Container Bloat: Larger binaries and bloated Docker images lead to slower deployments, more complex service orchestrations, and higher runtime overhead.
The bottleneck isn't just the developer's attention; it is the entire engineering pipeline's capacity to absorb changes. If you care about software quality and long-term design, your role in the AI era is to act as a gatekeeper, advocating for minimal code, tight constraints, and clean architecture over raw volume.
Takeaway: The Sane Developer's Stack
The next phase of AI engineering is defined by:
- Model Awareness: Matching task complexity to model cost to limit code verbosity.
- Context Efficiency: Using tools like RTK to prune inputs, forcing high-signal outputs.
- Infrastructure Discipline: Minimizing code footprint to prevent CI/CD capacity crunch and dependency creep.
- Human Ownership: Recognizing that someone still owns the blast radius.
AI makes code generation cheap. It doesn't make system understanding cheap.
Footnotes
1. This perspective is heavily inspired by Greg Kogan's talk, Software is a Liability, which details how rapid code growth creates an unsustainable tax on testing pipelines and infrastructure.