## The Question
On-device LLMs promise privacy and zero cost. Cloud LLMs promise deeper reasoning and broader capabilities. But how do they actually compare on real engineering tasks?
I built a benchmark comparing Apple's on-device Foundation Models (~3B parameters) against three tiers of cloud models across three tasks: commit message generation, code review, and text classification.
## Results
| Metric | On-Device (~3B) | Cloud Small | Cloud Medium | Cloud Large |
|---|---|---|---|---|
| Avg Latency | 1647ms | 700ms | 1833ms | 3833ms |
| Privacy | Local | Cloud | Cloud | Cloud |
| Cost/call | $0 | ~$0.00025 | ~$0.003 | ~$0.015 |
| Commit msg depth | Generic | Concise | Detailed | Most specific |
| Code review issues | 3 found | 4 found | 5+ found | 6+ found |
| Classification | Good | Good | Good | Excellent |
Surprising finding: the smallest cloud model (700ms average) beat the on-device model (1647ms). Round-trip network overhead matters less than the inference-time gap between model sizes. On-device wins on privacy, not speed.
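The benchmark harness itself isn't shown in this post; a minimal sketch of how per-call latency can be averaged follows. The `fake_model` stand-in is hypothetical — a real run would invoke the on-device or cloud SDK instead:

```python
import statistics
import time

def time_call(fn, *args, runs=3):
    """Call fn `runs` times and return the mean wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples)

# Hypothetical stand-in for a model call (real code would hit the SDK).
def fake_model(prompt):
    time.sleep(0.01)  # simulate ~10ms of inference
    return "response"

latency_ms = time_call(fake_model, "Summarize this diff")
```

Averaging over multiple runs matters here because first-call latency (model load, connection setup) can be several times the steady state.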
## Task 1: Commit Message Generation
Input: A git diff adding retry logic with exponential backoff.
The on-device model produced a correct, structured commit message:
```json
{
  "type": "fix",
  "title": "Added retry logic with exponential backoff",
  "body": "Introduces a retry mechanism with exponential backoff to handle transient connection errors.",
  "breaking": false
}
```

Accurate but generic. Larger cloud models identified the specific failure mode being addressed and suggested that the commit title reference the payment-processing context.
Verdict: On-device is sufficient for commit messages. The smaller model captures the "what" correctly; it misses some "why" context that larger models infer.
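One way to make even the smallest model's output dependable is to validate it against a strict schema before using it. A minimal sketch — the field constraints here are illustrative, not the benchmark's actual schema:

```python
COMMIT_TYPES = {"feat", "fix", "refactor", "docs", "chore", "test"}

def validate_commit(msg: dict) -> list[str]:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    if msg.get("type") not in COMMIT_TYPES:
        errors.append("type must be one of " + ", ".join(sorted(COMMIT_TYPES)))
    title = msg.get("title", "")
    if not (1 <= len(title) <= 72):
        errors.append("title must be 1-72 characters")
    if not isinstance(msg.get("body"), str):
        errors.append("body must be a string")
    if not isinstance(msg.get("breaking"), bool):
        errors.append("breaking must be a boolean")
    return errors

sample = {
    "type": "fix",
    "title": "Added retry logic with exponential backoff",
    "body": "Introduces a retry mechanism with exponential backoff "
            "to handle transient connection errors.",
    "breaking": False,
}
errors = validate_commit(sample)  # empty list: the sample passes
```

A validation pass like this is what makes the on-device tier viable for automation: a generic-but-valid message can ship; a malformed one gets retried or escalated.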
## Task 2: Code Review
Input: A Python function with validation gaps and naming issues.
| Issues Found | On-Device | Small | Medium | Large |
|---|---|---|---|---|
| Missing type annotations | Yes | Yes | Yes | Yes |
| No negative price handling | Yes | Yes | Yes | Yes |
| Invalid customer type handling | Yes | Yes | Yes | Yes |
| Naming convention issues | No | No | Yes | Yes |
| Edge case: zero quantity | No | No | Yes | Yes |
| Docstring quality | No | No | No | Yes |
On-device catches the obvious functional bugs. Larger models find progressively more subtle issues.
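The exact function under review isn't reproduced in this post; a hypothetical example with the same classes of issues gives a sense of what each tier was asked to catch:

```python
# Hypothetical stand-in for the reviewed function. Deliberate issues:
# missing type annotations, no negative-price or zero-quantity handling,
# an unvalidated customerType (silently falls back to no discount),
# non-PEP8 naming, and no docstring.
def calcTotal(price, qty, customerType):
    discount = {"regular": 0.0, "premium": 0.1, "vip": 0.2}.get(customerType, 0.0)
    return price * qty * (1 - discount)
```

The first three rows of the table (type annotations, negative price, invalid customer type) are visible in a single read of this function, which is why every tier found them; the naming and zero-quantity issues require more context about conventions and call sites.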
## Task 3: Text Classification
Input: A support ticket about payment processing failure.
All models correctly classified it as billing / high priority / negative sentiment. The differences were in action item quality — larger models produced more specific, actionable recommendations.
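The classification schema isn't shown in the post; a sketch of the kind of typed result every tier filled in correctly (field names and values here are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class TicketClassification:
    category: str                    # e.g. "billing", "technical", "account"
    priority: str                    # "low" | "medium" | "high"
    sentiment: str                   # "positive" | "neutral" | "negative"
    action_items: list[str] = field(default_factory=list)

# What every tier produced for the payment-failure ticket; the action
# items are where the tiers diverged in specificity.
result = TicketClassification(
    category="billing",
    priority="high",
    sentiment="negative",
    action_items=["Check payment gateway status", "Contact the customer"],
)
```

Because the enumerated fields constrain the output space so tightly, classification is the task where model size matters least — the free-text `action_items` list is the only place larger models can differentiate themselves.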
## The Decision Framework
Don't pick one — use them together based on the task:
| Use Case | Best Choice | Why |
|---|---|---|
| Proprietary code analysis | On-device | Zero data exposure |
| Commit message generation | On-device | Good enough quality, zero cost |
| High-volume triage | Small cloud | 700ms, cheapest |
| Code review assistance | Medium cloud | Best speed/quality ratio |
| Architecture decisions | Large cloud | Deepest reasoning |
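The table above collapses naturally into a simple router. The task keys and tier names below are placeholders for whatever SDK clients you actually wire up:

```python
def pick_model(task: str) -> str:
    """Route a task to a model tier per the decision framework."""
    routes = {
        "proprietary_analysis": "on-device",    # zero data exposure
        "commit_message": "on-device",          # good enough, zero cost
        "triage": "cloud-small",                # fastest, cheapest
        "code_review": "cloud-medium",          # best speed/quality ratio
        "architecture": "cloud-large",          # deepest reasoning
    }
    # Unknown tasks default to the balanced middle tier.
    return routes.get(task, "cloud-medium")
```

In practice a router like this can also escalate: run the on-device tier first, and only fall through to a cloud tier when schema validation fails or confidence is low.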
## Key Takeaways

- Privacy is a feature, not a limitation. On-device models trade capability for guaranteed data sovereignty. For proprietary code, this is non-negotiable.
- Speed hierarchy is counterintuitive. The smallest cloud model is faster than the on-device model; network overhead matters less than model size.
- Quality scales predictably with model size. Each tier catches progressively more subtle issues. Choose the tier that matches your accuracy requirement.
- Structured output is the equalizer. With constrained schemas (`anyOf`, `range`, typed fields), even the smallest model produces valid, usable output. Schema design matters more than model size for extraction tasks.
- Cost per call varies 60x. On-device is free. Large cloud models cost ~$0.015/call. For high-volume tasks (CI pipelines, auto-triage), this difference compounds.
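A quick back-of-the-envelope calculation shows how the per-call cost compounds at volume. The call counts below are illustrative, the per-call prices are the ones from the results table:

```python
# Per-call prices from the results table (USD).
COST_PER_CALL = {
    "on-device": 0.0,
    "cloud-small": 0.00025,
    "cloud-medium": 0.003,
    "cloud-large": 0.015,
}

def monthly_cost(tier: str, calls_per_day: int, days: int = 30) -> float:
    """Projected monthly spend for a given tier and call volume."""
    return COST_PER_CALL[tier] * calls_per_day * days

# Hypothetical CI pipeline making 5,000 calls/day:
large = monthly_cost("cloud-large", 5_000)   # ~$2,250/month
small = monthly_cost("cloud-small", 5_000)   # ~$37.50/month
local = monthly_cost("on-device", 5_000)     # $0
```

At that volume the 60x per-call gap becomes a four-figure monthly difference, which is the argument for routing bulk tasks on-device or to the small tier and reserving the large tier for low-volume, high-stakes calls.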