Problem
An LLM-powered HTML generator injected all 18 design-system component patterns (~3,055 tokens of context) into every prompt, yet most requests needed only 3-7 of them. The unused patterns wasted tokens and risked confusing the model with irrelevant examples.
Key Insight
Run two LLM calls instead of one. The first call is cheap and fast (temperature=0, max 500 tokens) — it classifies which components the prompt needs. The second call receives only those component patterns.
Phase 1: analyze_prompt("show me a search page")
→ ["stat_card", "data_table", "filter_chips"] # 200-400ms
Phase 2: generate(filtered_prompt, user_prompt)
→ complete HTML with only 3 component patterns # 3-8s
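The two phases above can be sketched as follows. This is a minimal illustration, not the project's real code: the pattern snippets, component names, and the `parse_classifier_output` / `build_system_prompt` helpers are all hypothetical stand-ins.

```python
import json

# Hypothetical pattern library; the real system has 18 entries.
COMPONENT_PATTERNS = {
    "stat_card": '<div class="stat-card">...</div>',
    "data_table": '<table class="data-table">...</table>',
    "filter_chips": '<div class="filter-chips">...</div>',
}

def parse_classifier_output(raw: str) -> list[str]:
    """Phase 1 asks the model (temperature=0, max 500 tokens) for a JSON
    list of component names. Unknown names are dropped; unparseable
    output falls back to injecting all patterns."""
    try:
        names = json.loads(raw)
    except json.JSONDecodeError:
        return list(COMPONENT_PATTERNS)
    return [n for n in names if n in COMPONENT_PATTERNS]

def build_system_prompt(needed: list[str]) -> str:
    """Phase 2's system prompt contains only the selected patterns."""
    return "\n\n".join(COMPONENT_PATTERNS[name] for name in needed)
```

The fallback on a parse failure matters: degrading to the old inject-everything behavior is safe, while silently injecting nothing would break generation.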
For refinement (editing existing HTML), detect which components are already present in the output by matching their CSS class markers; no LLM call is needed. Then take the union of the existing components and any newly needed ones.
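The detection step might look like the sketch below. The marker classes are assumptions for illustration; the real system presumably maps each component to whatever distinctive classes its pattern emits.

```python
import re

# Hypothetical marker scheme: each pattern's root element carries a
# distinctive CSS class, so presence is detectable with plain string
# matching instead of an LLM call.
COMPONENT_MARKERS = {
    "stat_card": "stat-card",
    "data_table": "data-table",
    "filter_chips": "filter-chips",
}

def detect_components(html: str) -> set[str]:
    """Collect every class attribute, then map classes back to components."""
    classes: set[str] = set()
    for attr in re.findall(r'class="([^"]*)"', html):
        classes.update(attr.split())
    return {name for name, cls in COMPONENT_MARKERS.items() if cls in classes}

def components_for_refinement(html: str, newly_needed: list[str]) -> set[str]:
    """Union of what the HTML already uses and what the edit now needs."""
    return detect_components(html) | set(newly_needed)
```

Splitting each class attribute on whitespace and matching whole class names avoids false positives from prefix overlaps (e.g. a `data-table-row` class matching `data-table`).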
Result: 43.7% reduction in context tokens, 23.3% reduction in total system prompt size, zero regressions across 16 E2E tests.
Takeaway
Don't inject your entire knowledge base into every LLM prompt. Run a cheap classification pass first to select the relevant context, then inject only what's needed. The ~400ms overhead of the extra call is easily repaid by the token savings on the main generation.