How we built a complete local LLM evaluation infrastructure on a single Mac Studio, benchmarked 24 models across 6 dimensions, discovered a 5.8x throughput advantage hiding in plain sight, and built a routing system that sends every query to exactly the right model.
We started with a question that sounds simple but isn't: on a single machine with 256GB of RAM, which local LLMs are actually worth running, and when?
Over two days, we pulled 24 models ranging from 2GB to 142GB, built a six-dimensional evaluation suite from scratch, discovered that the conventional wisdom about larger models being better is wrong for at least two important use cases, found a 5.8x throughput advantage sitting unused in the MLX backend, and shipped a four-tier cascade router backed by real eval data rather than guesswork.
Everything ran locally. Zero cloud API calls. Zero per-token cost. Unlimited iteration.
1. Size is not quality. qwen2.5:7b (5GB) scored 100% on multi-turn conversation. llama3.3:70b (42GB) scored 47.8% on the same eval — worse than the 3B model. For conversational tasks, running the 70B model is not just wasteful, it is actively worse.
2. The right backend matters as much as the right model. MLX and Ollama run the same model at the same quality. At 32 concurrent users and long outputs, MLX delivers 619 tok/s aggregate versus Ollama's 107 tok/s — a 5.8x difference that is entirely invisible unless you measure it.
3. One model wins on value. qwen2.5:7b is the answer to almost every question: 80.6% quality, 100% multi-turn, 93% domain, 10,601 tok/s peak throughput, 5GB. No other model comes close on value per gigabyte.
Every eval decision was anchored to one question: which model should a router send this specific query to? Not "which model scores highest in aggregate" — that leads to running 70B models on greetings. The goal was a routing table backed by real data, not vibes.
Mac Studio with Apple M-series chip and 256GB unified RAM. This is not a GPU cluster — it is a single consumer machine that happens to have enough memory to fit every model we tested simultaneously if needed. The unified memory architecture means CPU and GPU share the same pool, which changes the calculus on model loading and concurrency compared to discrete GPU systems.
We ran two model serving backends side by side:
mlx_lm.server) with OpenAI-compatible API and native speculative decoding. Installed via pip install mlx-lm.24 models across 8 size classes, selected to span the useful range on a 256GB machine:
| Size class | Models | VRAM ~ |
|---|---|---|
| 2–3B | llama3.2:3b, llama3.2-vision:11b (grouped here for comparison) | 2–7GB |
| 5–8B | qwen2.5:7b, qwen3:8b, deepseek-r1:7b | 5–6GB |
| 9–14B | phi4:14b, qwen2.5-coder:14b, deepseek-r1:14b, qwen3:14b, qwen2.5:14b | 8–10GB |
| 20–22B | mistral-small:22b, gpt-oss:20b | 13GB |
| 27–30B | gemma3:27b, qwen3-coder:30b, qwen3:30b | 17–18GB |
| 32B | qwen2.5:32b, deepseek-r1:32b | 19–20GB |
| 70–72B | llama3.3:70b, deepseek-r1:70b, qwen2.5:72b | 42–47GB |
| 235B | qwen3:235b | 142GB |
qwen3:235b requires 142GB of the machine's 256GB RAM. It loads, it runs, and it scores well on domain tasks — but at 300-second eval timeouts and single-digit tok/s under any load, it is a benchmark curiosity rather than a production option. It is included in the data for completeness but excluded from routing recommendations.
We designed six distinct eval suites, each testing a different capability axis. Running a single combined eval would conflate very different failure modes — a model can be great at code and terrible at conversation. The suites are intentionally orthogonal.
| Eval | Script | Models tested | What it measures |
|---|---|---|---|
| Quality R2 | quality_eval_r2.py | 22 models | Reasoning, coding, knowledge, instruction following — 16 tasks, deterministic scoring |
| Multi-turn | multiturn_eval.py | 6 models | Conversation coherence, context retention, sycophancy resistance — 9 scenarios, 245 max pts |
| Domain | domain_eval.py | 8 models | Medical, legal, math, code debugging, scientific reasoning — 23 tasks |
| Throughput / concurrency | concurrency_stress.py, deep_concurrency.py | 12 models | Peak tok/s, safe concurrency ceiling, latency under load |
| RAG grounding | rag_eval.py | 7 models | Knowledge tasks with and without retrieved context — bare vs RAG delta |
| Think vs no-think | qwen3_think_vs_nothink.py | 3 qwen3 models | Quality and throughput impact of extended thinking mode |
llama3.3:70b judges open-ended responses. It never judges its own outputs. This matters — self-judging inflates scores by 8-15% in our testing.The first time we ran concurrent evals, qwen3:235b loaded into RAM while a smaller model was mid-eval. The smaller model's throughput dropped 80%. On a machine with a unified memory bus, there is no isolation between workloads. Serial execution is not a limitation — it is the only way to get clean comparative data.
Five scoring bugs were discovered and fixed mid-run across the eval scripts. Each required a partial re-run for affected models:
sql_injection — 5-point task was being scored as 3 points. Fixed.off_by_one — had a spurious gate condition that rejected correct answers. Removed.clinical_stats — only credited full Bayes calculation, not the correct "base rate alone" shortcut. Fixed to accept either.fermi_estimation — rejected answers expressed as "/day" vs "/year." Fixed to normalize units.evolutionary_reasoning — required "inverted" when "non-inverted" is also a valid correct answer. Fixed.multistep_math — answer key in quality_eval_r2.py was wrong (A=20, B=10, C=15). All 22 models scored 0/10. Fixed to correct answers (A=24, B=12, C=19). All 22 models needed re-score on this task.The quality eval (Round 2, after scoring bug fixes) is the most comprehensive single dataset. 22 models, 16 tasks across four categories: reasoning, coding, knowledge, and instruction following.
| Rank | Model | Overall | Reasoning | Coding | Knowledge | Instruction |
|---|---|---|---|---|---|---|
| 1 | llama3.3:70b | 83.8% | 60% | 95% | 100% | 90% |
| 1 | qwen2.5:72b | 83.8% | 80% | 95% | 67% | 90% |
| 3 | qwen2.5-coder:14b | 81.2% | 80% | 95% | 67% | 80% |
| 4 | qwen2.5:7b | 80.6% | 80% | 88% | 67% | 85% |
| 4 | deepseek-r1:32b | 80.6% | 60% | 88% | 100% | 85% |
| 4 | qwen2.5:32b | 80.6% | 80% | 92% | 67% | 80% |
| 7 | qwen3:14b | 78.8% | 60% | 100% | 67% | 90% |
| 7 | qwen3-coder:30b | 78.8% | 60% | 100% | 67% | 90% |
| 9 | llama3.2:3b | 77.5% | 60% | 95% | 67% | 90% |
| 9 | phi4:14b | 77.5% | 60% | 95% | 67% | 90% |
| 11 | qwen3:30b | 76.9% | 60% | 92% | 67% | 90% |
| 12 | gemma3:12b | 75.6% | 60% | 88% | 67% | 90% |
| 12 | qwen2.5:14b | 75.6% | 60% | 92% | 67% | 85% |
| 12 | mistral-small:22b | 75.6% | 60% | 88% | 67% | 90% |
| 12 | gemma3:27b | 75.6% | 60% | 88% | 67% | 90% |
| 16 | deepseek-r1:70b | 71.2% | 40% | 95% | 67% | 90% |
| 17 | gpt-oss:20b | 70.0% | 52% | 100% | 67% | 65% |
| 18 | deepseek-r1:14b | 68.8% | 40% | 95% | 67% | 80% |
| 19 | llama3.2-vision:11b | 68.1% | 60% | 88% | 33% | 85% |
| 20 | qwen3:8b | 63.1% | 52% | 68% | 67% | 70% |
| 21 | deepseek-r1:7b | 59.4% | 40% | 68% | 67% | 70% |
| 22 | qwen3:235b * | 56.2% | 40% | 25% | 100% | 75% |
* qwen3:235b score heavily penalized by 300s timeouts on coding/reasoning tasks under concurrent eval load. Estimated real quality ~80%+.
The knowledge ceiling at 67%. Every model except llama3.3:70b and deepseek-r1:32b (both 100%) hit the exact same 67% knowledge score. This is not a coincidence — it traces to two specific tasks. The hallucination_probe asks about Han Kang's 2024 Nobel Prize (most models have a training cutoff before October 2024 and correctly refuse). The confabulation_trap presents a fabricated Einstein quote and tests whether the model refuses to validate it. Models that score 100% on knowledge are the ones that refused cleanly on both. The 67% floor is structural, not meaningful.
Coding floor at 88%. With two exceptions (qwen3:8b and deepseek-r1:7b at 68%), every tested model scores 88-100% on coding tasks. Coding is the most saturated category — it is no longer a useful differentiator above 7B parameters.
Reasoning spreads widest (40-80%). The only category where model quality genuinely separates. Knights/knaves logic, Monty Hall, and logic grids are the hardest tasks in the suite. The 70B models do not win this category — qwen2.5:7b, qwen2.5-coder:14b, and qwen2.5:32b all tie at 80% reasoning alongside qwen2.5:72b.
The multi-turn eval was the most surprising result of the entire project. We tested 6 models on 9 conversational scenarios using proper /api/chat endpoints with full message history. The results inverted everything the quality leaderboard suggested.
| Rank | Model | Score | Best scenario | Worst scenario |
|---|---|---|---|---|
| 1 | qwen2.5:7b | 100% | All scenarios | None — perfect |
| 2 | llama3.2:3b | 90.6% | correction_handling | false_premise_resistance |
| 3 | gemma3:12b | 88.6% | context_retention | false_premise_resistance |
| 4 | qwen3:14b | 75.1% | topic_switching | gradual_refinement |
| 5 | qwen3:30b | 61.2% | false_premise_resistance | gradual_refinement |
| 6 | llama3.3:70b | 47.8% | instruction_persistence | gradual_refinement |
Pattern 1: Think mode cascades into failure. The gradual_refinement scenario asks models to iteratively improve code across 4 turns — add validation, add memoization, add docstring, return final version. qwen3 models in think mode spent 60-180 seconds per turn generating a reasoning chain, then often produced a subtly wrong intermediate. Turn 2 built on turn 1's error. By turn 4, the code was broken and the model was confident about it. This pattern explained nearly every qwen3 failure in the entire eval.
Pattern 2: Sycophancy scales with model size. In correction_handling, after giving a correct answer, the user pushes back with "I don't think that's right." Small models held their ground. Large models apologized and changed their answer. llama3.3:70b — the highest-quality model on static benchmarks — was the most sycophantic conversationalist. Trained on more human feedback, it learned too well that humans like it when you agree with them.
Pattern 3: instruction_persistence reveals stubbornness vs compliance. Models were told to always append a TL;DR after every response. Three scenarios later, they were explicitly told to stop. Most models acknowledged the request and then immediately appended "TL;DR: …" anyway. Only qwen2.5:7b stopped completely on the first ask.
Pattern 4: Context degradation is real above 12B. In long_context_degradation, models were given a list of 15 facts and asked questions drawn from early, middle, and late portions of the list across 6 turns. Models above 12B showed measurable accuracy drops on early-list facts by turn 6 — the recency bias overwhelmed earlier context. qwen2.5:7b and llama3.2:3b showed no degradation.
Do not use a 70B model for conversation. It is slower (126 tok/s vs 10,601 tok/s), more expensive in RAM, more sycophantic, and scores 52 percentage points lower on multi-turn benchmarks than qwen2.5:7b. The size premium buys nothing for conversational tasks and actively costs quality.
8 models, 23 tasks, 5 high-stakes domains. This eval was designed for the T4 routing question: when a query needs expert-level reasoning, which model do you send it to?
| Rank | Model | Overall | Medical | Legal | Math | Code | Science |
|---|---|---|---|---|---|---|---|
| 1 | qwen3:30b | 97.8% | 100% | 100% | 100% | 90% | 100% |
| 1 | qwen3:235b | 97.8% | 100% | 100% | 100% | 90% | 100% |
| 3 | qwen3:14b | 93.9% | 100% | 95% | 75% | 96% | 100% |
| 4 | qwen2.5:7b | 93.0% | 98% | 75% | 100% | 96% | 94% |
| 5 | gemma3:12b | 91.7% | 98% | 90% | 90% | 90% | 90% |
| 6 | llama3.2:3b | 83.0% | 86% | 70% | 90% | 94% | 74% |
| 7 | llama3.3:70b | 80.4% | 82% | 75% | 95% | 96% | 56% |
| 8 | deepseek-r1:14b | 75.7% | 94% | 70% | 28% | 90% | 86% |
Medical: everyone passes the floor. Even llama3.2:3b scored 86% on medical tasks. The medical_refuse task (a patient describes chest pain — respond appropriately) was passed by every model: all correctly advised calling emergency services rather than offering home remedies. Drug interaction tasks and clinical statistics (Bayes theorem with base rates) were handled correctly by all models above 7B.
qwen3 architecture dominates structured reasoning. qwen3:30b and qwen3:235b tied at 97.8% — perfect across medical, legal, math, and science. The think mode genuinely helps when the problem has a definite answer and there is no time pressure. Domain eval tasks are single-turn; the think mode cascading failure that hurt multi-turn evals is not triggered here.
llama3.3:70b science collapse (56%). The model that led the quality leaderboard at 83.8% failed on evolutionary reasoning and Fermi estimation. The science tasks that failed involve applying principles across domains rather than recalling established facts. This is a genuine capability gap, not a formatting issue.
deepseek-r1:14b math catastrophe (28%). Three of four math tasks failed. The reasoning chains were plausible but the final answers were wrong. The irrationality of √2 proof was abandoned mid-chain. This is likely a quantization artifact — the 4-bit GGUF weights clip precision in ways that cause mathematical reasoning to drift. The 32B version scores 80.6% on quality; the 14B is not a scaled-down version of that quality.
qwen2.5:7b legal weakness (75%). The only domain where qwen2.5:7b meaningfully underperforms. Contract ambiguity resolution and 4th Amendment analysis showed inconsistent reasoning across runs. Legal tasks require tracking multiple interpretive frameworks simultaneously — a task that benefits from the larger context that comes with bigger models.
This was the most technically surprising result of the project. MLX and Ollama run the same model weights at the same quality. Benchmarking head-to-head with identical hardware, identical prompts, and increasing concurrency revealed a performance gap that is invisible at low concurrency and massive at any real multi-user load.
| Concurrent users | MLX default | MLX dc=8 | Ollama | MLX advantage |
|---|---|---|---|---|
| 1 | 116 tok/s | 118 tok/s | 101 tok/s | 1.1x |
| 2 | 193 tok/s | 208 tok/s | 103 tok/s | 1.9x |
| 4 | 270 tok/s | 270 tok/s | 106 tok/s | 2.5x |
| 8 | 318 tok/s | 317 tok/s | 107 tok/s | 3.0x |
| 16 | 365 tok/s | 320 tok/s | 107 tok/s | 3.4x |
| 32 | 619 tok/s | 320 tok/s | 107 tok/s | 5.8x |
| Backend | Short output | Medium output | Long output |
|---|---|---|---|
| MLX (default) | 2.0s | 2.1s | 2.0s |
| MLX (dc=8) | 2.3s | 4.5s | 8.3s |
| Ollama | 1.5s | 11.8s | 29.1s |
Ollama serializes. When multiple requests arrive simultaneously, Ollama queues them and processes one at a time using llama.cpp. Aggregate throughput plateaus at roughly single-request performance (~107 tok/s for qwen2.5:7b) regardless of how many concurrent requests are sent. The 30th user waits for requests 1-29 to complete before getting their first token — hence 29s TTFT at n=32.
MLX batches natively. MLX's Metal backend processes multiple requests in a true batch on the GPU. Adding more concurrent requests increases GPU utilization without proportionally increasing latency. At n=32, MLX is using the hardware more efficiently than Ollama can at n=1.
The intuitive fix — set --decode-concurrency 8 to give MLX a fixed batch size — is strictly worse than the default. It caps aggregate throughput at 320 tok/s (vs 619 for default) and increases TTFT at higher concurrency. MLX's dynamic batcher outperforms any fixed value. The right configuration is no configuration.
For very short outputs (fewer than 20 tokens) at concurrency 1, Ollama is slightly faster (1.5s TTFT vs 2.0s). The MLX batch setup cost is not recovered on trivial outputs. This informed the T1 routing decision: llama3.2:3b on Ollama for greetings and simple fact lookups, where the overhead matters and batching provides no benefit.
The eval data is only useful if it drives actual routing decisions. We built two systems: a Model Council for ensemble high-stakes responses, and a Cascade Router for single-model routing of every query.
The router classifies every incoming query into one of four tiers based on complexity signals, then routes to the appropriate model and backend. All assignments are backed by eval data.
| Tier | Model | Backend | When | Rationale |
|---|---|---|---|---|
| T1 — Trivial | llama3.2:3b | Ollama | Greetings, simple facts, single-sentence answers | 23k tok/s; short outputs — Ollama overhead-free queuing wins |
| T2 — Normal | qwen2.5:7b | MLX | QA, summarization, multi-turn conversation | 100% multi-turn, 80.6% quality, 5.8x throughput at concurrency |
| T3 — Complex | qwen3:14b | MLX (no_think) | Reasoning, code review, structured analysis | 100% coding, 78.8% quality, batching advantage maintained |
| T4 — Expert | qwen3:30b | Ollama (think) | Medical/legal/scientific, deep research | 97.8% domain — best medical, legal, math, science of any tested model |
The router was tested against 25 labeled queries representing all four tiers. Classification uses keyword patterns, complexity heuristics, and token budget estimation.
For high-stakes queries where a single model answer is insufficient, the Council runs multiple models and synthesizes their outputs. Four modes:
We implemented and tested a /speculate endpoint that uses qwen2.5:7b (T2) to draft tokens and qwen3:30b (T4) to verify them — the standard speculative decoding pattern. Results were negative: because the draft and verify models are so different in capability, the acceptance rate was low and the overhead of running two models outweighed the latency savings. Speculative decoding works well when draft and verify models are close in capability (e.g., 7B draft, 14B verify). The T2/T4 pairing is too dissimilar.
256GB of unified RAM means 142GB models load without complaint. qwen3:235b loaded in about 4 minutes and answered correctly when given time. The problem: "given time" means 90-300 seconds per query under any concurrency. During the concurrent quality eval, 4 of 8 coding and reasoning tasks hit the 300-second timeout and scored zero.
Its actual domain quality (97.8%) ties qwen3:30b. But qwen3:30b runs at 3,007 tok/s. The 235B model's real-world quality per second of wall time is the worst of any model tested. Having enough RAM to run it does not mean you should.
Lesson: Maximum RAM headroom is not an invitation to run the biggest model. Measure time-to-answer, not just answer quality.
Every model in quality_eval_r2 scored 0/10 on multistep_math. After the first run, this was attributed to the models being bad at multi-step math. After the second run produced the same result, we audited the answer key. The key had A=20, B=10, C=15. The correct answers are A=24, B=12, C=19.
All 22 models were scoring correctly — the scoring script was grading them wrong. A full re-score pass was required.
Lesson: When every model scores 0% on a task, the task is broken, not the models. Universal failure is a red flag that demands auditing the rubric, not the responses.
For qwen3:8b and qwen3:30b, think mode improves quality with acceptable latency tradeoffs. The conventional expectation is that think mode always helps on hard tasks, just at the cost of speed.
qwen3:14b inverts this: no_think scores higher on quality while running 2x faster. Think mode on the 14B model spends tokens on reasoning that consistently reaches worse conclusions than the model's direct answer. We don't have a mechanistic explanation — this is an empirical finding, not a theoretical one.
Policy derived: Always use no_think for qwen3:14b. Test think vs no_think empirically for any model; don't assume the documentation's recommendation matches your workload.
The MLX documentation mentions --decode-concurrency as a tuning parameter for throughput. Setting it to 8 seemed like a reasonable optimization for a multi-user workload. The benchmark showed it was strictly worse than the default at every concurrency level above 4.
The dynamic batcher that MLX uses by default adapts to the actual batch size at runtime. A fixed concurrency setting of 8 creates a ceiling — once 8 requests are batched, subsequent requests wait even if GPU capacity exists. At n=32 concurrency, the fixed setting produces 320 tok/s versus the default's 619 tok/s.
Lesson: Dynamic scheduling outperforms static configuration for variable workloads. Measure before tuning. The default is often default for a reason.
Early on, we attempted to run the quality eval and the concurrency stress test simultaneously to save wall-clock time. Both results were invalid: the quality eval showed anomalously high latencies, and the concurrency test had unusually low throughput. The unified memory bus does not partition — any workload on the machine affects every other workload.
Adding 6 hours of wall-clock time to serialize all eval runs produced clean, reproducible data. The time cost was real. The data quality improvement was essential.
Lesson: On a machine with unified memory, serialization is not optional for accurate benchmarking. Build checkpoint/resume into every eval from day one — you will need it.
| Script | Description | Status |
|---|---|---|
quality_eval_r2.py | 22-model quality eval, 16 tasks, 4 categories, deterministic scoring | Done |
multiturn_eval.py | 9-scenario multi-turn eval, /api/chat transport, NO_THINK_MODELS set, 60s timeout | Done |
domain_eval.py | 23-task domain eval, 5 domains, 27/27 scoring tests pass | Done |
mlx_vs_ollama.py | Head-to-head benchmark, quality + throughput + TTFT at multiple concurrency levels | Done |
mlx_concurrency.py | MLX-only concurrency deep dive, default vs decode-concurrency=8 comparison | Done |
concurrency_stress.py | Ollama concurrency stress test, peak tok/s and safe ceiling per model | Done |
rag_eval.py | RAG vs bare grounding delta, 7 models | Done |
qwen3_think_vs_nothink.py | Think mode quality and throughput comparison, 3 qwen3 models | Done |
| File | Description | Status |
|---|---|---|
council/router.py | 4-tier cascade router, eval-backed tier assignments, dual-backend support | Running |
council/council.py | Model Council — vote, synthesize, debate, raw modes | Running |
council/server.py | HTTP API on port 8080, /ask, /council, /speculate, dual-backend health endpoint | Running |
| File | Contents |
|---|---|
quality_r2_20260306_221534.json | Final Quality R2 results, 22 models (note: multistep_math rescored) |
multiturn_20260306_234513.json | Multi-turn eval, 6 models, 9 scenarios |
domain_20260307_001136.json | Domain eval, 8 models, 23 tasks |
mlx_vs_ollama_20260306_225305.json | MLX vs Ollama head-to-head benchmark |
mlx_concurrency_20260307_*.json | Two-pass MLX concurrency deep dive (default vs dc=8) |
deep_concurrency_20260306_193319.json | Ollama deep concurrency suite, 12 models |
REPORT.md | Consolidated findings, all 12 sections, with full tables and recommendations |
How to build a local LLM eval farm on Apple Silicon. Applicable to any Mac with 64GB+ unified RAM; scaled results with more RAM.
Pull llama3.2:3b first. Run it at 512 concurrent requests. Understand what your machine can do at the ceiling before loading larger models. The smallest model establishes your throughput floor and your concurrency ceiling — everything else is measured against it.
A full eval suite on 22 models takes 6-18 hours depending on model size and task complexity. Runs will crash. Models will timeout. The machine will be needed for other things. Every eval script should save state after each model response and resume from the last checkpoint. This is not optional — it is the difference between usable data and wasted compute.
On Apple Silicon with unified memory, there is no isolation between workloads. Run one eval at a time. Build a queue if needed. Add 30-60% more wall-clock time to your estimate and accept it — the data quality difference is not subtle.
Install MLX (pip install mlx-lm) and run the same model on both Ollama and MLX with increasing concurrency. Do not assume one is faster — measure it. The answer depends on your concurrency pattern, output length distribution, and workload mix.
Use a separate model as judge. The difference between a model evaluating its own outputs versus a different model's outputs ranges from 8-15% score inflation. Designate your best available model as judge and never run it in the same eval it is scoring.
If every model fails a task, the task is broken. Check the rubric, the answer key, and the scoring logic before concluding that the models have a shared capability gap. Universal failure is a signal about the eval, not the models.
Do not design routing rules before running evals. The router should be a consequence of the data, not a hypothesis that the data validates. Every tier assignment in the Cascade Router has a specific eval result behind it. If a tier assignment cannot be traced to a benchmark, it does not belong in the router.
Measure before you optimize. Serialize before you parallelize. Audit before you conclude. The surprises in this project — the size inversion, the tuning trap, the wrong answer key — were all discovered because the measurement infrastructure was rigorous enough to catch them.
The right model for any task is not the biggest model. It is the model that scores highest on that specific task type, at the throughput your workload requires, on the hardware you have.
| Model | GB | Quality R2 | Multi-turn | Domain | Peak tok/s | RAG gain |
|---|---|---|---|---|---|---|
| llama3.2:3b | 2 | 77.5% | 90.6% | 83.0% | 23,264 | +45pp |
| qwen3:8b | 5 | 63.1% | — | — | 2,217 | +33pp |
| deepseek-r1:7b | 5 | 59.4% | — | — | 1,735 | — |
| qwen2.5:7b | 5 | 80.6% | 100.0% | 93.0% | 10,601 | +60pp |
| llama3.2-vision:11b | 7 | 68.1% | — | — | 4,566 | — |
| gemma3:12b | 8 | 75.6% | 88.6% | 91.7% | 2,949 | +67pp |
| phi4:14b | 9 | 77.5% | — | — | 2,914 | — |
| qwen2.5-coder:14b | 9 | 81.2% | — | — | 2,879 | — |
| deepseek-r1:14b | 9 | 68.8% | — | 75.7% | 472 | — |
| qwen3:14b | 9 | 78.8% | 75.1% | 93.9% | 962 | — |
| qwen2.5:14b | 9 | 75.6% | — | — | 2,868 | — |
| mistral-small:22b | 13 | 75.6% | — | — | — | — |
| gpt-oss:20b | 13 | 70.0% | — | — | — | — |
| gemma3:27b | 17 | 75.6% | — | — | — | — |
| qwen3-coder:30b | 18 | 78.8% | — | — | — | — |
| qwen3:30b | 18 | 76.9% | 61.2% | 97.8% | 3,007 | +26pp |
| qwen2.5:32b | 19 | 80.6% | — | — | 728 | — |
| deepseek-r1:32b | 20 | 80.6% | — | — | — | — |
| llama3.3:70b | 42 | 83.8% | 47.8% | 80.4% | 126 | +26pp |
| deepseek-r1:70b | 42 | 71.2% | — | — | — | — |
| qwen2.5:72b | 47 | 83.8% | — | — | — | +60pp |
| qwen3:235b * | 142 | 56.2% | — | 97.8% | — | — |
| Model | Mode | Quality delta | Throughput | Recommendation |
|---|---|---|---|---|
| qwen3:8b | think | +13% | 2x slower | Use think — quality worth cost |
| qwen3:14b | no_think | +7% vs think | 2x faster | Use no_think — wins on both dimensions |
| qwen3:30b | think | +24% | same | Use think — no throughput cost at this size |
| Model | Bare score | RAG score | Gain |
|---|---|---|---|
| llama3.2:3b | 33% | 79% | +45pp |
| qwen3:8b | 67% | 100% | +33pp |
| qwen2.5:7b | 33% | 93% | +60pp |
| gemma3:12b | 33% | 100% | +67pp |
| qwen3:30b | 67% | 93% | +26pp |
| llama3.3:70b | 67% | 93% | +26pp |
| qwen2.5:72b | 33% | 93% | +60pp |
| Component | Spec / Version | Role |
|---|---|---|
| Mac Studio | 256GB unified RAM, Apple Silicon | All compute — no cloud |
| Ollama | Port 11434, llama.cpp backend | T1 + T4 model serving; multi-model management |
| MLX | mlx-lm 0.31.0, port 8081, Metal backend | T2 + T3 model serving; native batching |
| Council server | Python, port 8080 | HTTP API — /ask, /council, /speculate |
| Python | 3.14, requests + json | All eval scripts |