Problem
An LLM-powered HTML generator injected all 18 design system component patterns into every prompt (~3,055 tokens of context). But most requests needed only 3-7 components; the unused patterns wasted tokens and could confuse the model with irrelevant examples.
Key Insight
Run two LLM calls instead of one. The first call is cheap and fast (temperature=0, max 500 tokens): it classifies which components the request needs. The second call receives only those component patterns.
```
Phase 1: analyze_prompt("show me a search page")
  → ["stat_card", "data_table", "filter_chips"]    # 200-400ms
Phase 2: generate(filtered_prompt, user_prompt)
  → complete HTML with only 3 component patterns   # 3-8s
```
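A minimal sketch of the two-phase pipeline in Python, assuming a generic `call_llm(prompt, temperature, max_tokens)` client. The pattern snippets, prompt wording, and the phase-2 sampling settings are illustrative assumptions; only the phase structure, temperature=0, and the 500-token cap come from the design above.

```python
import json

# Hypothetical pattern snippets; the real system has 18 of these.
COMPONENT_PATTERNS = {
    "stat_card": "<div class='ds-stat-card'>...</div>",
    "data_table": "<table class='ds-data-table'>...</table>",
    "filter_chips": "<div class='ds-filter-chips'>...</div>",
}

def analyze_prompt(user_prompt: str, call_llm) -> list[str]:
    """Phase 1: cheap classification call (temperature=0, max 500 tokens)."""
    instruction = (
        "Return a JSON array of the component names this request needs, "
        f"chosen from: {sorted(COMPONENT_PATTERNS)}.\n"
        f"Request: {user_prompt}"
    )
    raw = call_llm(instruction, temperature=0, max_tokens=500)
    # Keep only names we actually recognize, in case the model improvises.
    return [c for c in json.loads(raw) if c in COMPONENT_PATTERNS]

def generate(user_prompt: str, call_llm) -> str:
    """Phase 2: main generation with only the selected patterns injected."""
    needed = analyze_prompt(user_prompt, call_llm)
    filtered_context = "\n\n".join(COMPONENT_PATTERNS[c] for c in needed)
    system_prompt = "Use only these component patterns:\n" + filtered_context
    # Sampling settings here are placeholders, not from the original design.
    return call_llm(system_prompt + "\n\n" + user_prompt,
                    temperature=0.7, max_tokens=4000)
```

The allowlist filter in `analyze_prompt` guards against the classifier naming a component that doesn't exist, so phase 2's context stays strictly within the known pattern set.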
For refinement (editing existing HTML), detect which components are already present by scanning the output for CSS marker combinations; no LLM call is needed. Then take the union of the existing components and any newly required ones.
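A sketch of that refinement path, assuming each component's injected HTML carries a distinctive combination of CSS marker classes; the marker strings and the three-entry mapping here are hypothetical.

```python
# Hypothetical marker-class combinations; the real mapping belongs to the
# project's design system. A component counts as present only if all of
# its marker classes appear in the HTML.
COMPONENT_MARKERS = {
    "stat_card": ("ds-card", "ds-stat"),
    "data_table": ("ds-table", "ds-sortable"),
    "filter_chips": ("ds-chip", "ds-filter"),
}

def detect_components(html: str) -> set[str]:
    """Cheap string scan for marker combinations; no LLM call needed."""
    return {
        name
        for name, markers in COMPONENT_MARKERS.items()
        if all(marker in html for marker in markers)
    }

def components_for_refinement(existing_html: str,
                              newly_needed: set[str]) -> set[str]:
    """Union the components already in the page with those the edit adds."""
    return detect_components(existing_html) | newly_needed
```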
This delivered a 43.7% reduction in context tokens and a 23.3% reduction in total system prompt size, with zero regressions across 16 E2E tests.
Takeaway
A cheap classification pass to select relevant context, followed by injecting only what's needed, is more token-efficient than sending a full knowledge base on every prompt. The two-call overhead (~400ms) is well worth the savings on the main generation.