The Problem
PII detection in text is traditionally handled by regex — pattern matching for SSNs, credit cards, emails, phone numbers. It's fast (0.05ms per item) and works for obvious patterns.
But regex fails on contextual PII: names tied to salaries, medical diagnoses, natural-language dates of birth, physical addresses. These require understanding meaning, not matching patterns.
I benchmarked regex against Apple's on-device Foundation Models (~3B parameters) across 25 diverse test cases to find where each approach wins.
Results Summary
| Metric | Regex | On-Device LLM |
|---|---|---|
| Precision | 75.0% | 100.0% |
| Recall | 40.0% | 100.0% |
| F1 Score | 52.2% | 100.0% |
| Latency | 0.05ms | 555ms |
Regex is 11,000x faster. The LLM is perfect on this test set. The question is when each matters.
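As a sanity check, the F1 numbers in the table follow directly from the precision and recall columns (F1 is their harmonic mean):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Regex: 75% precision, 40% recall
print(round(f1(0.75, 0.40) * 100, 1))  # 52.2

# LLM: perfect on this test set
print(round(f1(1.0, 1.0) * 100, 1))    # 100.0
```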
What Regex Catches vs Misses
Catches (6/15 PII cases):
- SSN with dashes (like 123-45-6789)
- Phone numbers, emails, credit cards
- API keys, passwords in assignments
Misses (9/15 PII cases):
- SSN without dashes in natural language
- Names with salary information
- Medical diagnoses without patient identifiers
- Physical addresses in free text
- Employee IDs with DOB, natural-language dates of birth
False positives (2/10 clean cases):
- Error codes that happen to match the SSN pattern
- Toll-free 1-800 numbers flagged as personal phone numbers
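The failure modes above fall straight out of how pattern matching works. A minimal sketch (these patterns are illustrative, not the exact benchmark set):

```python
import re

# Illustrative patterns, not the benchmark's actual rule set
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Catch: obvious dashed SSN
assert SSN.search("SSN: 123-45-6789")

# Miss: the same SSN without dashes in natural language
assert not SSN.search("my social is 123 45 6789")

# Miss: contextual PII with no machine-friendly shape at all
assert not SSN.search("Maria in accounting makes $95,000 a year")

# False positive: an error code that happens to match the SSN shape
assert SSN.search("build failed with error 123-45-6789")
```

The pattern has no way to know that the last string is a build log, not a Social Security number; that distinction lives entirely in the surrounding context.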
The Key Insight: Binary Classification Beats Extraction
Instead of asking the LLM to extract PII (complex generation prone to hallucination), I reframed it as binary classification: "Does this text contain PII? Yes/No."
This plays to a small model's strength. The schema is simple:
```python
@generable("Classify whether text contains sensitive information")
class PIIClassification:
    contains_pii: bool = guide(
        description="True if text contains personal data, PHI, or secrets. "
                    "False for generic statistics, code references, or documentation."
    )
    confidence: str = guide(
        description="Confidence level",
        anyOf=["high", "medium", "low"],
    )
    pii_category: str = guide(
        description="Primary category of sensitive data found",
        anyOf=["ssn", "credit_card", "email", "phone", "name",
               "address", "medical", "financial", "dob",
               "employee_id", "api_key", "password", "none"],
    )
```

Why the Prompt Engineering Matters
Version 1 (too restrictive) — 76.9% F1:
"Return true ONLY for real personal data about a specific individual."
Missed medical info and API keys.
Version 2 (comprehensive) — 100% F1:
"PII includes: names tied to personal details, physical addresses,
phone numbers, SSNs, dates of birth, salary figures tied to a person,
employee identifiers. Protected health information counts as PII
even without a patient name. Secrets include: passwords, API keys,
access tokens. Do NOT flag: generic statistics, code variable names,
toll-free or 1-800 numbers, or error codes that happen to look
like sensitive patterns."
The difference: explicit negative examples eliminate false positives. Telling the model what not to flag is as important as telling it what to flag.
Context Is Everything
Regex sees 555-867-5309 and thinks "phone number."
The LLM sees Call me at 555-867-5309 and knows it's a personal phone number.
The LLM sees Call the support line at 1-800-555-0199 and knows it's a published toll-free number.
Same pattern, different meaning. Only contextual understanding can distinguish them.
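To make the point concrete: a phone-number regex cannot separate the two sentences above, because both contain the same surface pattern (the pattern here is a stand-in, not the benchmark's actual rule):

```python
import re

# Illustrative phone pattern with an optional leading "1-"
phone = re.compile(r"\b(?:1-)?\d{3}-\d{3}-\d{4}\b")

personal = "Call me at 555-867-5309"
toll_free = "Call the support line at 1-800-555-0199"

# Regex flags both identically; it has no notion of context
assert phone.search(personal)
assert phone.search(toll_free)
```

Only a model that reads the surrounding words can keep the first and drop the second.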
The Hybrid Strategy
Neither approach alone is optimal. The best architecture combines both:
```python
# Fast pre-filter: regex catches obvious patterns
regex_findings = regex_scan(text)

# Contextual deep dive: LLM for everything regex missed
if not regex_findings and len(text) > 200:
    llm_findings = await llm_classify(text)
```

This gives you:
- Fast for clean text (regex only, under 1ms)
- Accurate for ambiguous text (LLM kicks in)
- No hallucination on short text (200-char threshold prevents the small model from inventing PII on insufficient context)
Fresh Sessions Prevent Context Bleed
Each classification uses a fresh session. If you reuse a session across documents, the model can "remember" earlier findings and make inconsistent decisions.
The tradeoff: 555ms per classification vs ~30ms with session reuse. For security scanning, accuracy matters more than speed.
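The bleed is easy to see with a toy stand-in for a chat session (Apple's actual API is Swift's `LanguageModelSession`; this stub only models the accumulating-history behavior that causes the problem):

```python
# Stand-in session: a real model conditions on self.history, so a
# reused session sees every earlier document, not just the new one.
class Session:
    def __init__(self):
        self.history = []

    def classify(self, text: str) -> dict:
        self.history.append(text)
        return {"input": text, "context_size": len(self.history)}

docs = ["SSN: 123-45-6789", "Quarterly revenue was up 4%"]

# Reused session: the second call carries the first document as context
shared = Session()
reused = [shared.classify(d) for d in docs]
assert reused[1]["context_size"] == 2  # earlier findings bleed in

# Fresh session per document: each call sees only its own input
fresh = [Session().classify(d) for d in docs]
assert all(r["context_size"] == 1 for r in fresh)
```

With the reused session, the clean revenue sentence is classified in the shadow of an SSN; with fresh sessions, every document is judged on its own.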
When to Use Each
| Scenario | Approach |
|---|---|
| CI pipelines, pre-commit hooks | Regex only |
| Scanning thousands of files | Regex only |
| Privacy audits, compliance reviews | Regex + LLM |
| Processing human-written content | Regex + LLM |
| Only need obvious patterns (SSN, CC) | Regex only |
| Need contextual PII (names, medical, salary) | LLM required |
Takeaways
- Binary classification beats extraction for small models — constrain the output space
- Explicit negative examples in the prompt eliminate false positives
- Fresh sessions per classification prevent context bleed across documents
- 200-char minimum prevents hallucination on insufficient context
- Hybrid regex + LLM gives the best balance of speed and accuracy