Complete evaluation infrastructure on a single Mac Studio: six eval dimensions, a 5.8x throughput advantage uncovered by switching backends, and a routing system built on real data.
- 24 models
- 6 eval dimensions
- 5.8x MLX vs Ollama throughput
- 88k peak tok/s
- $0 cloud cost
Key Findings
- Size is not quality for conversation: qwen2.5:7b (5 GB) scored 100% on multi-turn; llama3.3:70b (42 GB) scored 47.8%
- MLX delivers 5.8x aggregate throughput vs Ollama at 32 concurrent users: invisible at low concurrency, massive at scale
- --decode-concurrency 8 made things worse: MLX's dynamic batcher outperforms any fixed value
- qwen2.5:7b wins on value: 80.6% quality, 100% multi-turn, 93% domain, 10k tok/s, 5 GB
- When every model scores 0%, the task is broken: we found and fixed a wrong answer key mid-run
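The last finding suggests a cheap guardrail worth baking into any eval harness: if every model scores exactly zero on a task, suspect the task (usually the answer key) before suspecting the models. A minimal sketch; the data shape and names here are illustrative, not the project's actual harness:

```python
# Flag eval tasks where every model scored 0% -- a strong signal that
# the task or its answer key is broken, not that all models failed.
from collections import defaultdict

def suspect_tasks(results):
    """results: iterable of (model, task, score) tuples, score in [0, 1].

    Returns the tasks on which every model scored exactly 0.
    """
    by_task = defaultdict(list)
    for model, task, score in results:
        by_task[task].append(score)
    return [t for t, scores in by_task.items() if all(s == 0 for s in scores)]

# Illustrative run (scores echo the findings above):
results = [
    ("qwen2.5:7b",   "multi_turn", 1.00),
    ("llama3.3:70b", "multi_turn", 0.478),
    ("qwen2.5:7b",   "tool_use",   0.0),  # everyone at 0% -> check the key
    ("llama3.3:70b", "tool_use",   0.0),
]
print(suspect_tasks(results))  # ['tool_use']
```

Running this check after each eval batch, rather than at the end, is what makes a mid-run fix like the one above possible.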