Reducing LLM Token Usage by 44% with Selective Context Injection

Problem

An LLM-powered HTML generator injected all 18 design system component patterns into every prompt — ~3,055 tokens of context. But most requests only needed 3-7 components. The unused patterns wasted tokens and could confuse the model with irrelevant examples.

Key Insight

Run two LLM calls instead of one. The first call is cheap and fast (temperature=0, max 500 tokens) — it classifies which components the prompt needs. The second call receives only those component patterns.

Phase 1: analyze_prompt("show me a search page")
  → ["stat_card", "data_table", "filter_chips"]   # 200-400ms

Phase 2: generate(filtered_prompt, user_prompt)
  → complete HTML with only 3 component patterns   # 3-8s
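The two-phase flow above can be sketched in Python. This is a minimal illustration, not the real implementation: the component names, pattern strings, and the keyword heuristic standing in for the Phase 1 classifier are all hypothetical — in the actual system, Phase 1 is a cheap LLM call (temperature=0, max 500 tokens) that returns the component list.

```python
# Hypothetical component pattern library; the real system has 18 entries
# totaling ~3,055 tokens.
COMPONENT_PATTERNS = {
    "stat_card":    "<!-- pattern: stat_card markup + usage notes -->",
    "data_table":   "<!-- pattern: data_table markup + usage notes -->",
    "filter_chips": "<!-- pattern: filter_chips markup + usage notes -->",
    "nav_bar":      "<!-- pattern: nav_bar markup + usage notes -->",
}

def analyze_prompt(user_prompt: str) -> list[str]:
    """Phase 1: classify which components the prompt needs.
    Stubbed here with keyword matching; the real version is a small,
    deterministic LLM call."""
    keywords = {
        "stat_card":    ["stat", "metric", "kpi"],
        "data_table":   ["table", "list", "search", "results"],
        "filter_chips": ["filter", "search", "facet"],
        "nav_bar":      ["nav", "menu"],
    }
    prompt = user_prompt.lower()
    return [name for name, kws in keywords.items()
            if any(kw in prompt for kw in kws)]

def build_context(components: list[str]) -> str:
    """Phase 2 input: inject only the selected component patterns,
    instead of all of COMPONENT_PATTERNS."""
    return "\n\n".join(COMPONENT_PATTERNS[c] for c in components)

needed = analyze_prompt("show me a search page")
context = build_context(needed)  # only the matched patterns, not all 18
```

The key property is that `build_context` never sees the full pattern library — only the classifier's output — so the main generation prompt scales with the request, not with the design system.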

For refinement (editing existing HTML), no LLM call is needed: detect the components already present in the output by scanning for their CSS marker combinations, then take the union of the existing components and any newly needed ones.
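The refinement path can be sketched as a plain substring scan. The marker class names (`ds-*`) below are invented for illustration; the source only says that components are detected via CSS marker combinations.

```python
# Hypothetical mapping from component name to its CSS marker class.
COMPONENT_MARKERS = {
    "stat_card":    "ds-stat-card",
    "data_table":   "ds-data-table",
    "filter_chips": "ds-filter-chips",
}

def detect_components(html: str) -> set[str]:
    """No LLM call: a substring scan over marker classes is enough to
    tell which components the existing HTML already uses."""
    return {name for name, marker in COMPONENT_MARKERS.items()
            if marker in html}

def components_for_refinement(html: str, newly_needed: set[str]) -> set[str]:
    """Union of what's already on the page and what the edit requires."""
    return detect_components(html) | newly_needed

existing_html = """
<div class="ds-stat-card">...</div>
<table class="ds-data-table">...</table>
"""
selected = components_for_refinement(existing_html, {"filter_chips"})
# selected now covers both the detected and the newly requested components
```

Scanning is effectively free compared to a classification call, which is why the refinement path skips Phase 1 entirely for components already on the page.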

Result: 43.7% reduction in context tokens, 23.3% reduction in total system prompt size, zero regressions across 16 E2E tests.

Takeaway

Don't inject your entire knowledge base into every LLM prompt. Use a cheap classification pass first to select relevant context, then inject only what's needed. The two-call overhead (~400ms) is far less than the token savings on the main generation.