Problem
An LLM-powered HTML generator injected all 18 design system component patterns into every prompt (~3,055 tokens of context). But most requests needed only 3-7 components; the unused patterns wasted tokens and could confuse the model with irrelevant examples.
Key Insight
Run two LLM calls instead of one. The first call is cheap and fast (temperature=0, max 500 tokens): it classifies which components the request needs. The second call receives only those component patterns.
```
Phase 1: analyze_prompt("show me a search page")
  → ["stat_card", "data_table", "filter_chips"]    # 200-400ms
Phase 2: generate(filtered_prompt, user_prompt)
  → complete HTML with only 3 component patterns   # 3-8s
```
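A minimal sketch of the two-phase pipeline in Python, assuming a generic `call_llm(prompt, temperature, max_tokens)` client. The pattern snippets, prompt wording, and the phase-2 sampling settings are illustrative assumptions; only the phase structure, temperature=0, and the 500-token cap come from the design above.

```python
import json

# Hypothetical pattern snippets; the real system has 18 of these.
COMPONENT_PATTERNS = {
    "stat_card": "<div class='ds-stat-card'>...</div>",
    "data_table": "<table class='ds-data-table'>...</table>",
    "filter_chips": "<div class='ds-filter-chips'>...</div>",
}

def analyze_prompt(user_prompt: str, call_llm) -> list[str]:
    """Phase 1: cheap classification call (temperature=0, max 500 tokens)."""
    instruction = (
        "Return a JSON array of the component names this request needs, "
        f"chosen from: {sorted(COMPONENT_PATTERNS)}.\n"
        f"Request: {user_prompt}"
    )
    raw = call_llm(instruction, temperature=0, max_tokens=500)
    # Keep only names we actually recognize, in case the model improvises.
    return [c for c in json.loads(raw) if c in COMPONENT_PATTERNS]

def generate(user_prompt: str, call_llm) -> str:
    """Phase 2: main generation with only the selected patterns injected."""
    needed = analyze_prompt(user_prompt, call_llm)
    filtered_context = "\n\n".join(COMPONENT_PATTERNS[c] for c in needed)
    system_prompt = "Use only these component patterns:\n" + filtered_context
    # Sampling settings here are placeholders, not from the original design.
    return call_llm(system_prompt + "\n\n" + user_prompt,
                    temperature=0.7, max_tokens=4000)
```

The allowlist filter in `analyze_prompt` guards against the classifier naming a component that doesn't exist, so phase 2's context stays strictly within the known pattern set.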
For refinement (editing existing HTML), detect which components are already present by scanning the output for CSS marker combinations; no LLM call is needed. Then take the union of the existing components and any newly required ones.
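A sketch of that refinement path, assuming each component's injected HTML carries a distinctive combination of CSS marker classes; the marker strings and the three-entry mapping here are hypothetical.

```python
# Hypothetical marker-class combinations; the real mapping belongs to the
# project's design system. A component counts as present only if all of
# its marker classes appear in the HTML.
COMPONENT_MARKERS = {
    "stat_card": ("ds-card", "ds-stat"),
    "data_table": ("ds-table", "ds-sortable"),
    "filter_chips": ("ds-chip", "ds-filter"),
}

def detect_components(html: str) -> set[str]:
    """Cheap string scan for marker combinations; no LLM call needed."""
    return {
        name
        for name, markers in COMPONENT_MARKERS.items()
        if all(marker in html for marker in markers)
    }

def components_for_refinement(existing_html: str,
                              newly_needed: set[str]) -> set[str]:
    """Union the components already in the page with those the edit adds."""
    return detect_components(existing_html) | newly_needed
```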
This delivered a 43.7% reduction in context tokens and a 23.3% reduction in total system prompt size, with zero regressions across 16 E2E tests.
Takeaway
A cheap classification pass to select relevant context, followed by injecting only what's needed, is more token-efficient than sending a full knowledge base on every prompt. The two-call overhead (~400ms) is well worth the savings on the main generation.