## The Question
On-device LLMs promise privacy and zero cost. Cloud LLMs promise deeper reasoning and broader capabilities. But how do they actually compare on real engineering tasks?
I built a benchmark comparing Apple's on-device Foundation Models (~3B parameters) against three tiers of cloud models across three tasks: commit message generation, code review, and text classification.
## Results
| Metric | On-Device (~3B) | Cloud Small | Cloud Medium | Cloud Large |
|---|---|---|---|---|
| Avg Latency | 1647ms | 700ms | 1833ms | 3833ms |
| Privacy | Local | Cloud | Cloud | Cloud |
| Cost/call | $0 | ~$0.00025 | ~$0.003 | ~$0.015 |
| Commit msg depth | Generic | Concise | Detailed | Most specific |
| Code review issues | 3 found | 4 found | 5+ found | 6+ found |
| Classification | Good | Good | Good | Excellent |
Surprising finding: the smallest cloud model (700ms average) beat the on-device model (1647ms). Round-trip network overhead matters less than the inference-time gap between model sizes. On-device wins on privacy, not speed.
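The benchmark harness itself isn't shown in this post; a minimal sketch of how per-call latency can be averaged follows. The `fake_model` stand-in is hypothetical — a real run would invoke the on-device or cloud SDK instead:

```python
import statistics
import time

def time_call(fn, *args, runs=3):
    """Call fn `runs` times and return the mean wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples)

# Hypothetical stand-in for a model call (real code would hit the SDK).
def fake_model(prompt):
    time.sleep(0.01)  # simulate ~10ms of inference
    return "response"

latency_ms = time_call(fake_model, "Summarize this diff")
```

Averaging over multiple runs matters here because first-call latency (model load, connection setup) can be several times the steady state.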
## Task 1: Commit Message Generation
Input: A git diff adding retry logic with exponential backoff.
The on-device model produced a correct, structured commit message:
```json
{
  "type": "fix",
  "title": "Added retry logic with exponential backoff",
  "body": "Introduces a retry mechanism with exponential backoff to handle transient connection errors.",
  "breaking": false
}
```

Accurate but generic. Larger cloud models identified the specific failure mode being addressed and suggested that the commit title reference the payment-processing context.
Verdict: On-device is sufficient for commit messages. The smaller model captures the "what" correctly; it misses some "why" context that larger models infer.
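One way to make even the smallest model's output dependable is to validate it against a strict schema before using it. A minimal sketch — the field constraints here are illustrative, not the benchmark's actual schema:

```python
COMMIT_TYPES = {"feat", "fix", "refactor", "docs", "chore", "test"}

def validate_commit(msg: dict) -> list[str]:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    if msg.get("type") not in COMMIT_TYPES:
        errors.append("type must be one of " + ", ".join(sorted(COMMIT_TYPES)))
    title = msg.get("title", "")
    if not (1 <= len(title) <= 72):
        errors.append("title must be 1-72 characters")
    if not isinstance(msg.get("body"), str):
        errors.append("body must be a string")
    if not isinstance(msg.get("breaking"), bool):
        errors.append("breaking must be a boolean")
    return errors

sample = {
    "type": "fix",
    "title": "Added retry logic with exponential backoff",
    "body": "Introduces a retry mechanism with exponential backoff "
            "to handle transient connection errors.",
    "breaking": False,
}
errors = validate_commit(sample)  # empty list: the sample passes
```

A validation pass like this is what makes the on-device tier viable for automation: a generic-but-valid message can ship; a malformed one gets retried or escalated.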
## Task 2: Code Review
Input: A Python function with validation gaps and naming issues.
| Issues Found | On-Device | Small | Medium | Large |
|---|---|---|---|---|
| Missing type annotations | Yes | Yes | Yes | Yes |
| No negative price handling | Yes | Yes | Yes | Yes |
| Invalid customer type handling | Yes | Yes | Yes | Yes |
| Naming convention issues | No | No | Yes | Yes |
| Edge case: zero quantity | No | No | Yes | Yes |
| Docstring quality | No | No | No | Yes |
On-device catches the obvious functional bugs. Larger models find progressively more subtle issues.
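The exact function under review isn't reproduced in this post; a hypothetical example with the same classes of issues gives a sense of what each tier was asked to catch:

```python
# Hypothetical stand-in for the reviewed function. Deliberate issues:
# missing type annotations, no negative-price or zero-quantity handling,
# an unvalidated customerType (silently falls back to no discount),
# non-PEP8 naming, and no docstring.
def calcTotal(price, qty, customerType):
    discount = {"regular": 0.0, "premium": 0.1, "vip": 0.2}.get(customerType, 0.0)
    return price * qty * (1 - discount)
```

The first three rows of the table (type annotations, negative price, invalid customer type) are visible in a single read of this function, which is why every tier found them; the naming and zero-quantity issues require more context about conventions and call sites.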
## Task 3: Text Classification
Input: A support ticket about payment processing failure.
All models correctly classified it as billing / high priority / negative sentiment. The differences were in action item quality — larger models produced more specific, actionable recommendations.
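The classification schema isn't shown in the post; a sketch of the kind of typed result every tier filled in correctly (field names and values here are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class TicketClassification:
    category: str                    # e.g. "billing", "technical", "account"
    priority: str                    # "low" | "medium" | "high"
    sentiment: str                   # "positive" | "neutral" | "negative"
    action_items: list[str] = field(default_factory=list)

# What every tier produced for the payment-failure ticket; the action
# items are where the tiers diverged in specificity.
result = TicketClassification(
    category="billing",
    priority="high",
    sentiment="negative",
    action_items=["Check payment gateway status", "Contact the customer"],
)
```

Because the enumerated fields constrain the output space so tightly, classification is the task where model size matters least — the free-text `action_items` list is the only place larger models can differentiate themselves.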
## The Decision Framework
Don't pick one — use them together based on the task:
| Use Case | Best Choice | Why |
|---|---|---|
| Proprietary code analysis | On-device | Zero data exposure |
| Commit message generation | On-device | Good enough quality, zero cost |
| High-volume triage | Small cloud | 700ms, cheapest |
| Code review assistance | Medium cloud | Best speed/quality ratio |
| Architecture decisions | Large cloud | Deepest reasoning |
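The table above collapses naturally into a simple router. The task keys and tier names below are placeholders for whatever SDK clients you actually wire up:

```python
def pick_model(task: str) -> str:
    """Route a task to a model tier per the decision framework."""
    routes = {
        "proprietary_analysis": "on-device",    # zero data exposure
        "commit_message": "on-device",          # good enough, zero cost
        "triage": "cloud-small",                # fastest, cheapest
        "code_review": "cloud-medium",          # best speed/quality ratio
        "architecture": "cloud-large",          # deepest reasoning
    }
    # Unknown tasks default to the balanced middle tier.
    return routes.get(task, "cloud-medium")
```

In practice a router like this can also escalate: run the on-device tier first, and only fall through to a cloud tier when schema validation fails or confidence is low.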
## Key Takeaways

- Privacy is a feature, not a limitation. On-device models trade capability for guaranteed data sovereignty. For proprietary code, this is non-negotiable.
- Speed hierarchy is counterintuitive. The smallest cloud model is faster than the on-device model; network overhead matters less than model size.
- Quality scales predictably with model size. Each tier catches progressively more subtle issues. Choose the tier that matches your accuracy requirement.
- Structured output is the equalizer. With constrained schemas (`anyOf`, `range`, typed fields), even the smallest model produces valid, usable output. Schema design matters more than model size for extraction tasks.
- Cost per call varies 60x. On-device is free. Large cloud models cost ~$0.015/call. For high-volume tasks (CI pipelines, auto-triage), this difference compounds.
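A quick back-of-the-envelope calculation shows how the per-call cost compounds at volume. The call counts below are illustrative, the per-call prices are the ones from the results table:

```python
# Per-call prices from the results table (USD).
COST_PER_CALL = {
    "on-device": 0.0,
    "cloud-small": 0.00025,
    "cloud-medium": 0.003,
    "cloud-large": 0.015,
}

def monthly_cost(tier: str, calls_per_day: int, days: int = 30) -> float:
    """Projected monthly spend for a given tier and call volume."""
    return COST_PER_CALL[tier] * calls_per_day * days

# Hypothetical CI pipeline making 5,000 calls/day:
large = monthly_cost("cloud-large", 5_000)   # ~$2,250/month
small = monthly_cost("cloud-small", 5_000)   # ~$37.50/month
local = monthly_cost("on-device", 5_000)     # $0
```

At that volume the 60x per-call gap becomes a four-figure monthly difference, which is the argument for routing bulk tasks on-device or to the small tier and reserving the large tier for low-volume, high-stakes calls.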