The Problem
PII detection in text is traditionally handled by regex — pattern matching for SSNs, credit cards, emails, phone numbers. It's fast (0.05ms per item) and works for obvious patterns.
But regex fails on contextual PII: names tied to salaries, medical diagnoses, natural-language dates of birth, physical addresses. These require understanding meaning, not matching patterns.
I benchmarked regex against Apple's on-device Foundation Models (~3B parameters) across 25 diverse test cases to find where each approach wins.
Results Summary
| Metric | Regex | On-Device LLM |
|---|---|---|
| Precision | 75.0% | 100.0% |
| Recall | 40.0% | 100.0% |
| F1 Score | 52.2% | 100.0% |
| Latency | 0.05ms | 555ms |
Regex is 11,000x faster. The LLM is perfect on this test set. The question is when each matters.
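As a sanity check, the F1 numbers in the table follow directly from the precision and recall columns (F1 is their harmonic mean):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Regex: 75% precision, 40% recall
print(round(f1(0.75, 0.40) * 100, 1))  # 52.2

# LLM: perfect on this test set
print(round(f1(1.0, 1.0) * 100, 1))    # 100.0
```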
What Regex Catches vs Misses
Catches (6/15 PII cases):
- SSN with dashes (like 123-45-6789)
- Phone numbers, emails, credit cards
- API keys, passwords in assignments
Misses (9/15 PII cases):
- SSN without dashes in natural language
- Names with salary information
- Medical diagnoses without patient identifiers
- Physical addresses in free text
- Employee IDs with DOB, natural-language dates of birth
False positives (2/10 clean cases):
- Error codes that happen to match the SSN pattern
- Toll-free 1-800 numbers flagged as personal phone numbers
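The failure modes above fall straight out of how pattern matching works. A minimal sketch (these patterns are illustrative, not the exact benchmark set):

```python
import re

# Illustrative patterns, not the benchmark's actual rule set
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Catch: obvious dashed SSN
assert SSN.search("SSN: 123-45-6789")

# Miss: the same SSN without dashes in natural language
assert not SSN.search("my social is 123 45 6789")

# Miss: contextual PII with no machine-friendly shape at all
assert not SSN.search("Maria in accounting makes $95,000 a year")

# False positive: an error code that happens to match the SSN shape
assert SSN.search("build failed with error 123-45-6789")
```

The pattern has no way to know that the last string is a build log, not a Social Security number; that distinction lives entirely in the surrounding context.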
The Key Insight: Binary Classification Beats Extraction
Instead of asking the LLM to extract PII (complex generation prone to hallucination), I reframed it as binary classification: "Does this text contain PII? Yes/No."
This plays to a small model's strength. The schema is simple:
```python
@generable("Classify whether text contains sensitive information")
class PIIClassification:
    contains_pii: bool = guide(
        description="True if text contains personal data, PHI, or secrets. "
                    "False for generic statistics, code references, or documentation."
    )
    confidence: str = guide(
        description="Confidence level",
        anyOf=["high", "medium", "low"],
    )
    pii_category: str = guide(
        description="Primary category of sensitive data found",
        anyOf=["ssn", "credit_card", "email", "phone", "name",
               "address", "medical", "financial", "dob",
               "employee_id", "api_key", "password", "none"],
    )
```

Why the Prompt Engineering Matters
Version 1 (too restrictive) — 76.9% F1:
"Return true ONLY for real personal data about a specific individual."
Missed medical info and API keys.
Version 2 (comprehensive) — 100% F1:
"PII includes: names tied to personal details, physical addresses,
phone numbers, SSNs, dates of birth, salary figures tied to a person,
employee identifiers. Protected health information counts as PII
even without a patient name. Secrets include: passwords, API keys,
access tokens. Do NOT flag: generic statistics, code variable names,
toll-free or 1-800 numbers, or error codes that happen to look
like sensitive patterns."
The difference: explicit negative examples eliminate false positives. Telling the model what not to flag is as important as telling it what to flag.
Context Is Everything
Regex sees 555-867-5309 and thinks "phone number."
The LLM sees Call me at 555-867-5309 and knows it's a personal phone number.
The LLM sees Call the support line at 1-800-555-0199 and knows it's a published toll-free number.
Same pattern, different meaning. Only contextual understanding can distinguish them.
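To make the point concrete: a phone-number regex cannot separate the two sentences above, because both contain the same surface pattern (the pattern here is a stand-in, not the benchmark's actual rule):

```python
import re

# Illustrative phone pattern with an optional leading "1-"
phone = re.compile(r"\b(?:1-)?\d{3}-\d{3}-\d{4}\b")

personal = "Call me at 555-867-5309"
toll_free = "Call the support line at 1-800-555-0199"

# Regex flags both identically; it has no notion of context
assert phone.search(personal)
assert phone.search(toll_free)
```

Only a model that reads the surrounding words can keep the first and drop the second.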
The Hybrid Strategy
Neither approach alone is optimal. The best architecture combines both:
```python
# Fast pre-filter: regex catches obvious patterns
regex_findings = regex_scan(text)

# Contextual deep dive: LLM for everything regex missed
if not regex_findings and len(text) > 200:
    llm_findings = await llm_classify(text)
```

This gives you:
- Fast for clean text (regex only, under 1ms)
- Accurate for ambiguous text (LLM kicks in)
- No hallucination on short text (200-char threshold prevents the small model from inventing PII on insufficient context)
Fresh Sessions Prevent Context Bleed
Each classification uses a fresh session. If you reuse a session across documents, the model can "remember" earlier findings and make inconsistent decisions.
The tradeoff: 555ms per classification vs ~30ms with session reuse. For security scanning, accuracy matters more than speed.
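The bleed is easy to see with a toy stand-in for a chat session (Apple's actual API is Swift's `LanguageModelSession`; this stub only models the accumulating-history behavior that causes the problem):

```python
# Stand-in session: a real model conditions on self.history, so a
# reused session sees every earlier document, not just the new one.
class Session:
    def __init__(self):
        self.history = []

    def classify(self, text: str) -> dict:
        self.history.append(text)
        return {"input": text, "context_size": len(self.history)}

docs = ["SSN: 123-45-6789", "Quarterly revenue was up 4%"]

# Reused session: the second call carries the first document as context
shared = Session()
reused = [shared.classify(d) for d in docs]
assert reused[1]["context_size"] == 2  # earlier findings bleed in

# Fresh session per document: each call sees only its own input
fresh = [Session().classify(d) for d in docs]
assert all(r["context_size"] == 1 for r in fresh)
```

With the reused session, the clean revenue sentence is classified in the shadow of an SSN; with fresh sessions, every document is judged on its own.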
When to Use Each
| Scenario | Approach |
|---|---|
| CI pipelines, pre-commit hooks | Regex only |
| Scanning thousands of files | Regex only |
| Privacy audits, compliance reviews | Regex + LLM |
| Processing human-written content | Regex + LLM |
| Only need obvious patterns (SSN, CC) | Regex only |
| Need contextual PII (names, medical, salary) | LLM required |
Takeaways
- Binary classification beats extraction for small models — constrain the output space
- Explicit negative examples in the prompt eliminate false positives
- Fresh sessions per classification prevent context bleed across documents
- 200-char minimum prevents hallucination on insufficient context
- Hybrid regex + LLM gives the best balance of speed and accuracy