Case Study · AI Infrastructure · March 6–7, 2026

Local LLM Eval Farm:
24 Models, Zero Cloud

How we built a complete local LLM evaluation infrastructure on a single Mac Studio, benchmarked 24 models across 6 dimensions, discovered a 5.8x throughput advantage hiding in plain sight, and built a routing system that sends every query to exactly the right model.

Machine: Mac Studio, 256GB unified RAM Date: March 6–7, 2026 Duration: 2-day sprint Backends: Ollama + MLX

Models pulled

Eval dimensions

5.8x

MLX vs Ollama

88k

Peak tok/s

100%

qwen2.5:7b multi-turn

Cloud API cost

01Executive Summary
02The Brief & Infrastructure
03Evaluation Methodology
04Quality Eval: 22 Models, 16 Tasks
05Multi-Turn Eval: The Size Inversion
06Domain Eval: Medical, Legal, Math, Code, Science
07MLX vs Ollama: The Hidden 5.8x
08Model Council & Cascade Router
09Surprises & Course Corrections
10Deliverables
11Implementation Playbook
12Appendix: Full Data Tables

Section 01

Executive Summary

We started with a question that sounds simple but isn't: on a single machine with 256GB of RAM, which local LLMs are actually worth running, and when?

Over two days, we pulled 24 models ranging from 2GB to 142GB, built a six-dimensional evaluation suite from scratch, discovered that the conventional wisdom about larger models being better is wrong for at least two important use cases, found a 5.8x throughput advantage sitting unused in the MLX backend, and shipped a four-tier cascade router backed by real eval data rather than guesswork.

Everything ran locally. Zero cloud API calls. Zero per-token cost. Unlimited iteration.

The Three Headline Findings

1. Size is not quality. qwen2.5:7b (5GB) scored 100% on multi-turn conversation. llama3.3:70b (42GB) scored 47.8% on the same eval — worse than the 3B model. For conversational tasks, running the 70B model is not just wasteful, it is actively worse.

2. The right backend matters as much as the right model. MLX and Ollama run the same model at the same quality. At 32 concurrent users and long outputs, MLX delivers 619 tok/s aggregate versus Ollama's 107 tok/s — a 5.8x difference that is entirely invisible unless you measure it.

3. One model wins on value. qwen2.5:7b is the answer to almost every question: 80.6% quality, 100% multi-turn, 93% domain, 10,601 tok/s peak throughput, 5GB. No other model comes close on value per gigabyte.

The North Star

Every eval decision was anchored to one question: which model should a router send this specific query to? Not "which model scores highest in aggregate" — that leads to running 70B models on greetings. The goal was a routing table backed by real data, not vibes.

Section 02

The Brief & Infrastructure

The Machine

Mac Studio with Apple M-series chip and 256GB unified RAM. This is not a GPU cluster — it is a single consumer machine that happens to have enough memory to fit every model we tested simultaneously if needed. The unified memory architecture means CPU and GPU share the same pool, which changes the calculus on model loading and concurrency compared to discrete GPU systems.

The Two Backends

We ran two model serving backends side by side:

Ollama (port 11434) — the de facto standard for local LLM serving. Manages model downloads, serves an OpenAI-compatible API, handles multiple models, and queues concurrent requests. Uses llama.cpp under the hood with GGUF quantized weights.
MLX (port 8081) — Apple's own ML framework optimized for Apple Silicon. Uses native Metal shaders. Runs safetensors weights. Has a built-in server (mlx_lm.server) with OpenAI-compatible API and native speculative decoding. Installed via pip install mlx-lm.

Models Pulled

24 models across 8 size classes, selected to span the useful range on a 256GB machine:

Size class	Models	VRAM ~
2–3B	llama3.2:3b, llama3.2-vision:11b (grouped here for comparison)	2–7GB
5–8B	qwen2.5:7b, qwen3:8b, deepseek-r1:7b	5–6GB
9–14B	phi4:14b, qwen2.5-coder:14b, deepseek-r1:14b, qwen3:14b, qwen2.5:14b	8–10GB
20–22B	mistral-small:22b, gpt-oss:20b	13GB
27–30B	gemma3:27b, qwen3-coder:30b, qwen3:30b	17–18GB
32B	qwen2.5:32b, deepseek-r1:32b	19–20GB
70–72B	llama3.3:70b, deepseek-r1:70b, qwen2.5:72b	42–47GB
235B	qwen3:235b	142GB

The qwen3:235b situation

qwen3:235b requires 142GB of the machine's 256GB RAM. It loads, it runs, and it scores well on domain tasks — but at 300-second eval timeouts and single-digit tok/s under any load, it is a benchmark curiosity rather than a production option. It is included in the data for completeness but excluded from routing recommendations.

Section 03

Evaluation Methodology

Six Eval Dimensions

We designed six distinct eval suites, each testing a different capability axis. Running a single combined eval would conflate very different failure modes — a model can be great at code and terrible at conversation. The suites are intentionally orthogonal.

Eval	Script	Models tested	What it measures
Quality R2	`quality_eval_r2.py`	22 models	Reasoning, coding, knowledge, instruction following — 16 tasks, deterministic scoring
Multi-turn	`multiturn_eval.py`	6 models	Conversation coherence, context retention, sycophancy resistance — 9 scenarios, 245 max pts
Domain	`domain_eval.py`	8 models	Medical, legal, math, code debugging, scientific reasoning — 23 tasks
Throughput / concurrency	`concurrency_stress.py`, `deep_concurrency.py`	12 models	Peak tok/s, safe concurrency ceiling, latency under load
RAG grounding	`rag_eval.py`	7 models	Knowledge tasks with and without retrieved context — bare vs RAG delta
Think vs no-think	`qwen3_think_vs_nothink.py`	3 qwen3 models	Quality and throughput impact of extended thinking mode

Scoring Design Principles

Deterministic where possible. Pattern matching, exact answers, and rubric-based scoring over LLM judgment for factual tasks. The multi-step math answer key was audited and corrected mid-run when we discovered the original key had wrong answers (A=24, B=12, C=19 — not A=20, B=10, C=15).
External judge for open-ended tasks. llama3.3:70b judges open-ended responses. It never judges its own outputs. This matters — self-judging inflates scores by 8-15% in our testing.
Checkpointed and resumable. Every eval saves a checkpoint after each model/task. A crashed run picks up exactly where it left off. This was essential — the full suite took ~18 hours of compute across both days.
Serialized, never concurrent. Running multiple evals simultaneously causes GPU memory contention. qwen3 think models timed out at 300 seconds under any concurrency. Serializing runs added wall-clock time but produced reliable data.

Why We Serialized

The first time we ran concurrent evals, qwen3:235b loaded into RAM while a smaller model was mid-eval. The smaller model's throughput dropped 80%. On a machine with a unified memory bus, there is no isolation between workloads. Serial execution is not a limitation — it is the only way to get clean comparative data.

The Scoring Bug Chronicle

Five scoring bugs were discovered and fixed mid-run across the eval scripts. Each required a partial re-run for affected models:

sql_injection — 5-point task was being scored as 3 points. Fixed.
off_by_one — had a spurious gate condition that rejected correct answers. Removed.
clinical_stats — only credited full Bayes calculation, not the correct "base rate alone" shortcut. Fixed to accept either.
fermi_estimation — rejected answers expressed as "/day" vs "/year." Fixed to normalize units.
evolutionary_reasoning — required "inverted" when "non-inverted" is also a valid correct answer. Fixed.
multistep_math — answer key in quality_eval_r2.py was wrong (A=20, B=10, C=15). All 22 models scored 0/10. Fixed to correct answers (A=24, B=12, C=19). All 22 models needed re-score on this task.

Section 04

Quality Eval: 22 Models, 16 Tasks

The quality eval (Round 2, after scoring bug fixes) is the most comprehensive single dataset. 22 models, 16 tasks across four categories: reasoning, coding, knowledge, and instruction following.

Full Leaderboard

Rank	Model	Overall	Reasoning	Coding	Knowledge	Instruction
1	llama3.3:70b	83.8%	60%	95%	100%	90%
1	qwen2.5:72b	83.8%	80%	95%	67%	90%
3	qwen2.5-coder:14b	81.2%	80%	95%	67%	80%
4	qwen2.5:7b	80.6%	80%	88%	67%	85%
4	deepseek-r1:32b	80.6%	60%	88%	100%	85%
4	qwen2.5:32b	80.6%	80%	92%	67%	80%
7	qwen3:14b	78.8%	60%	100%	67%	90%
7	qwen3-coder:30b	78.8%	60%	100%	67%	90%
9	llama3.2:3b	77.5%	60%	95%	67%	90%
9	phi4:14b	77.5%	60%	95%	67%	90%
11	qwen3:30b	76.9%	60%	92%	67%	90%
12	gemma3:12b	75.6%	60%	88%	67%	90%
12	qwen2.5:14b	75.6%	60%	92%	67%	85%
12	mistral-small:22b	75.6%	60%	88%	67%	90%
12	gemma3:27b	75.6%	60%	88%	67%	90%
16	deepseek-r1:70b	71.2%	40%	95%	67%	90%
17	gpt-oss:20b	70.0%	52%	100%	67%	65%
18	deepseek-r1:14b	68.8%	40%	95%	67%	80%
19	llama3.2-vision:11b	68.1%	60%	88%	33%	85%
20	qwen3:8b	63.1%	52%	68%	67%	70%
21	deepseek-r1:7b	59.4%	40%	68%	67%	70%
22	qwen3:235b *	56.2%	40%	25%	100%	75%

* qwen3:235b score heavily penalized by 300s timeouts on coding/reasoning tasks under concurrent eval load. Estimated real quality ~80%+.

The Three Structural Patterns

The knowledge ceiling at 67%. Every model except llama3.3:70b and deepseek-r1:32b (both 100%) hit the exact same 67% knowledge score. This is not a coincidence — it traces to two specific tasks. The hallucination_probe asks about Han Kang's 2024 Nobel Prize (most models have a training cutoff before October 2024 and correctly refuse). The confabulation_trap presents a fabricated Einstein quote and tests whether the model refuses to validate it. Models that score 100% on knowledge are the ones that refused cleanly on both. The 67% floor is structural, not meaningful.

Coding floor at 88%. With two exceptions (qwen3:8b and deepseek-r1:7b at 68%), every tested model scores 88-100% on coding tasks. Coding is the most saturated category — it is no longer a useful differentiator above 7B parameters.

Reasoning spreads widest (40-80%). The only category where model quality genuinely separates. Knights/knaves logic, Monty Hall, and logic grids are the hardest tasks in the suite. The 70B models do not win this category — qwen2.5:7b, qwen2.5-coder:14b, and qwen2.5:32b all tie at 80% reasoning alongside qwen2.5:72b.

Section 05

Multi-Turn Eval: The Size Inversion

The multi-turn eval was the most surprising result of the entire project. We tested 6 models on 9 conversational scenarios using proper /api/chat endpoints with full message history. The results inverted everything the quality leaderboard suggested.

Results

Rank	Model	Score	Best scenario	Worst scenario
1	qwen2.5:7b	100%	All scenarios	None — perfect
2	llama3.2:3b	90.6%	correction_handling	false_premise_resistance
3	gemma3:12b	88.6%	context_retention	false_premise_resistance
4	qwen3:14b	75.1%	topic_switching	gradual_refinement
5	qwen3:30b	61.2%	false_premise_resistance	gradual_refinement
6	llama3.3:70b	47.8%	instruction_persistence	gradual_refinement

The Four Patterns That Explain the Inversion

Pattern 1: Think mode cascades into failure. The gradual_refinement scenario asks models to iteratively improve code across 4 turns — add validation, add memoization, add docstring, return final version. qwen3 models in think mode spent 60-180 seconds per turn generating a reasoning chain, then often produced a subtly wrong intermediate. Turn 2 built on turn 1's error. By turn 4, the code was broken and the model was confident about it. This pattern explained nearly every qwen3 failure in the entire eval.

Pattern 2: Sycophancy scales with model size. In correction_handling, after giving a correct answer, the user pushes back with "I don't think that's right." Small models held their ground. Large models apologized and changed their answer. llama3.3:70b — the highest-quality model on static benchmarks — was the most sycophantic conversationalist. Trained on more human feedback, it learned too well that humans like it when you agree with them.

Pattern 3: instruction_persistence reveals stubbornness vs compliance. Models were told to always append a TL;DR after every response. Three scenarios later, they were explicitly told to stop. Most models acknowledged the request and then immediately appended "TL;DR: …" anyway. Only qwen2.5:7b stopped completely on the first ask.

Pattern 4: Context degradation is real above 12B. In long_context_degradation, models were given a list of 15 facts and asked questions drawn from early, middle, and late portions of the list across 6 turns. Models above 12B showed measurable accuracy drops on early-list facts by turn 6 — the recency bias overwhelmed earlier context. qwen2.5:7b and llama3.2:3b showed no degradation.

The Counterintuitive Finding

Do not use a 70B model for conversation. It is slower (126 tok/s vs 10,601 tok/s), more expensive in RAM, more sycophantic, and scores 52 percentage points lower on multi-turn benchmarks than qwen2.5:7b. The size premium buys nothing for conversational tasks and actively costs quality.

Section 06

Domain Eval: Medical, Legal, Math, Code, Science

8 models, 23 tasks, 5 high-stakes domains. This eval was designed for the T4 routing question: when a query needs expert-level reasoning, which model do you send it to?

Results

Rank	Model	Overall	Medical	Legal	Math	Code	Science
1	qwen3:30b	97.8%	100%	100%	100%	90%	100%
1	qwen3:235b	97.8%	100%	100%	100%	90%	100%
3	qwen3:14b	93.9%	100%	95%	75%	96%	100%
4	qwen2.5:7b	93.0%	98%	75%	100%	96%	94%
5	gemma3:12b	91.7%	98%	90%	90%	90%	90%
6	llama3.2:3b	83.0%	86%	70%	90%	94%	74%
7	llama3.3:70b	80.4%	82%	75%	95%	96%	56%
8	deepseek-r1:14b	75.7%	94%	70%	28%	90%	86%

Five Domain-Specific Findings

Medical: everyone passes the floor. Even llama3.2:3b scored 86% on medical tasks. The medical_refuse task (a patient describes chest pain — respond appropriately) was passed by every model: all correctly advised calling emergency services rather than offering home remedies. Drug interaction tasks and clinical statistics (Bayes theorem with base rates) were handled correctly by all models above 7B.

qwen3 architecture dominates structured reasoning. qwen3:30b and qwen3:235b tied at 97.8% — perfect across medical, legal, math, and science. The think mode genuinely helps when the problem has a definite answer and there is no time pressure. Domain eval tasks are single-turn; the think mode cascading failure that hurt multi-turn evals is not triggered here.

llama3.3:70b science collapse (56%). The model that led the quality leaderboard at 83.8% failed on evolutionary reasoning and Fermi estimation. The science tasks that failed involve applying principles across domains rather than recalling established facts. This is a genuine capability gap, not a formatting issue.

deepseek-r1:14b math catastrophe (28%). Three of four math tasks failed. The reasoning chains were plausible but the final answers were wrong. The irrationality of √2 proof was abandoned mid-chain. This is likely a quantization artifact — the 4-bit GGUF weights clip precision in ways that cause mathematical reasoning to drift. The 32B version scores 80.6% on quality; the 14B is not a scaled-down version of that quality.

qwen2.5:7b legal weakness (75%). The only domain where qwen2.5:7b meaningfully underperforms. Contract ambiguity resolution and 4th Amendment analysis showed inconsistent reasoning across runs. Legal tasks require tracking multiple interpretive frameworks simultaneously — a task that benefits from the larger context that comes with bigger models.

Section 07

MLX vs Ollama: The Hidden 5.8x

This was the most technically surprising result of the project. MLX and Ollama run the same model weights at the same quality. Benchmarking head-to-head with identical hardware, identical prompts, and increasing concurrency revealed a performance gap that is invisible at low concurrency and massive at any real multi-user load.

Aggregate Throughput vs Concurrency (qwen2.5:7b, long outputs)

Concurrent users	MLX default	MLX dc=8	Ollama	MLX advantage
1	116 tok/s	118 tok/s	101 tok/s	1.1x
2	193 tok/s	208 tok/s	103 tok/s	1.9x
4	270 tok/s	270 tok/s	106 tok/s	2.5x
8	318 tok/s	317 tok/s	107 tok/s	3.0x
16	365 tok/s	320 tok/s	107 tok/s	3.4x
32	619 tok/s	320 tok/s	107 tok/s	5.8x

Time to First Token at n=32

Backend	Short output	Medium output	Long output
MLX (default)	2.0s	2.1s	2.0s
MLX (dc=8)	2.3s	4.5s	8.3s
Ollama	1.5s	11.8s	29.1s

Why This Happens

Ollama serializes. When multiple requests arrive simultaneously, Ollama queues them and processes one at a time using llama.cpp. Aggregate throughput plateaus at roughly single-request performance (~107 tok/s for qwen2.5:7b) regardless of how many concurrent requests are sent. The 30th user waits for requests 1-29 to complete before getting their first token — hence 29s TTFT at n=32.

MLX batches natively. MLX's Metal backend processes multiple requests in a true batch on the GPU. Adding more concurrent requests increases GPU utilization without proportionally increasing latency. At n=32, MLX is using the hardware more efficiently than Ollama can at n=1.

The Tuning Trap

The intuitive fix — set --decode-concurrency 8 to give MLX a fixed batch size — is strictly worse than the default. It caps aggregate throughput at 320 tok/s (vs 619 for default) and increases TTFT at higher concurrency. MLX's dynamic batcher outperforms any fixed value. The right configuration is no configuration.

When Ollama Wins

For very short outputs (fewer than 20 tokens) at concurrency 1, Ollama is slightly faster (1.5s TTFT vs 2.0s). The MLX batch setup cost is not recovered on trivial outputs. This informed the T1 routing decision: llama3.2:3b on Ollama for greetings and simple fact lookups, where the overhead matters and batching provides no benefit.

Section 08

Model Council & Cascade Router

The eval data is only useful if it drives actual routing decisions. We built two systems: a Model Council for ensemble high-stakes responses, and a Cascade Router for single-model routing of every query.

The Cascade Router — 4 Tiers

The router classifies every incoming query into one of four tiers based on complexity signals, then routes to the appropriate model and backend. All assignments are backed by eval data.

Tier	Model	Backend	When	Rationale
T1 — Trivial	llama3.2:3b	Ollama	Greetings, simple facts, single-sentence answers	23k tok/s; short outputs — Ollama overhead-free queuing wins
T2 — Normal	qwen2.5:7b	MLX	QA, summarization, multi-turn conversation	100% multi-turn, 80.6% quality, 5.8x throughput at concurrency
T3 — Complex	qwen3:14b	MLX (no_think)	Reasoning, code review, structured analysis	100% coding, 78.8% quality, batching advantage maintained
T4 — Expert	qwen3:30b	Ollama (think)	Medical/legal/scientific, deep research	97.8% domain — best medical, legal, math, science of any tested model

Router Benchmark Results

The router was tested against 25 labeled queries representing all four tiers. Classification uses keyword patterns, complexity heuristics, and token budget estimation.

88% exact tier match on held-out test queries
100% within one tier — no query was routed two or more tiers away from correct
Most common error: T3/T4 confusion on medical queries without explicit medical keywords (e.g., "explain how statins work" classified as T3 rather than T4)

The Model Council

For high-stakes queries where a single model answer is insufficient, the Council runs multiple models and synthesizes their outputs. Four modes:

vote — each model answers independently; majority answer wins
synthesize — llama3.3:70b reads all answers and produces a consensus
debate — models argue positions; llama3.3:70b judges
raw — all answers returned unfiltered for human review

Speculative Serving (Abandoned)

We implemented and tested a /speculate endpoint that uses qwen2.5:7b (T2) to draft tokens and qwen3:30b (T4) to verify them — the standard speculative decoding pattern. Results were negative: because the draft and verify models are so different in capability, the acceptance rate was low and the overhead of running two models outweighed the latency savings. Speculative decoding works well when draft and verify models are close in capability (e.g., 7B draft, 14B verify). The T2/T4 pairing is too dissimilar.

Section 09

Surprises & Course Corrections

1. qwen3:235b Loaded Fine — and Was Nearly Useless

256GB of unified RAM means 142GB models load without complaint. qwen3:235b loaded in about 4 minutes and answered correctly when given time. The problem: "given time" means 90-300 seconds per query under any concurrency. During the concurrent quality eval, 4 of 8 coding and reasoning tasks hit the 300-second timeout and scored zero.

Its actual domain quality (97.8%) ties qwen3:30b. But qwen3:30b runs at 3,007 tok/s. The 235B model's real-world quality per second of wall time is the worst of any model tested. Having enough RAM to run it does not mean you should.

Lesson: Maximum RAM headroom is not an invitation to run the biggest model. Measure time-to-answer, not just answer quality.

2. The Answer Key Was Wrong

Every model in quality_eval_r2 scored 0/10 on multistep_math. After the first run, this was attributed to the models being bad at multi-step math. After the second run produced the same result, we audited the answer key. The key had A=20, B=10, C=15. The correct answers are A=24, B=12, C=19.

All 22 models were scoring correctly — the scoring script was grading them wrong. A full re-score pass was required.

Lesson: When every model scores 0% on a task, the task is broken, not the models. Universal failure is a red flag that demands auditing the rubric, not the responses.

3. The qwen3:14b no_think Anomaly

For qwen3:8b and qwen3:30b, think mode improves quality with acceptable latency tradeoffs. The conventional expectation is that think mode always helps on hard tasks, just at the cost of speed.

qwen3:14b inverts this: no_think scores higher on quality while running 2x faster. Think mode on the 14B model spends tokens on reasoning that consistently reaches worse conclusions than the model's direct answer. We don't have a mechanistic explanation — this is an empirical finding, not a theoretical one.

Policy derived: Always use no_think for qwen3:14b. Test think vs no_think empirically for any model; don't assume the documentation's recommendation matches your workload.

4. --decode-concurrency 8 Made Things Worse

The MLX documentation mentions --decode-concurrency as a tuning parameter for throughput. Setting it to 8 seemed like a reasonable optimization for a multi-user workload. The benchmark showed it was strictly worse than the default at every concurrency level above 4.

The dynamic batcher that MLX uses by default adapts to the actual batch size at runtime. A fixed concurrency setting of 8 creates a ceiling — once 8 requests are batched, subsequent requests wait even if GPU capacity exists. At n=32 concurrency, the fixed setting produces 320 tok/s versus the default's 619 tok/s.

Lesson: Dynamic scheduling outperforms static configuration for variable workloads. Measure before tuning. The default is often default for a reason.

5. Concurrent Eval Runs Corrupt Each Other

Early on, we attempted to run the quality eval and the concurrency stress test simultaneously to save wall-clock time. Both results were invalid: the quality eval showed anomalously high latencies, and the concurrency test had unusually low throughput. The unified memory bus does not partition — any workload on the machine affects every other workload.

Adding 6 hours of wall-clock time to serialize all eval runs produced clean, reproducible data. The time cost was real. The data quality improvement was essential.

Lesson: On a machine with unified memory, serialization is not optional for accurate benchmarking. Build checkpoint/resume into every eval from day one — you will need it.

Section 10

Deliverables

Evaluation Scripts

Script	Description	Status
`quality_eval_r2.py`	22-model quality eval, 16 tasks, 4 categories, deterministic scoring	Done
`multiturn_eval.py`	9-scenario multi-turn eval, /api/chat transport, NO_THINK_MODELS set, 60s timeout	Done
`domain_eval.py`	23-task domain eval, 5 domains, 27/27 scoring tests pass	Done
`mlx_vs_ollama.py`	Head-to-head benchmark, quality + throughput + TTFT at multiple concurrency levels	Done
`mlx_concurrency.py`	MLX-only concurrency deep dive, default vs decode-concurrency=8 comparison	Done
`concurrency_stress.py`	Ollama concurrency stress test, peak tok/s and safe ceiling per model	Done
`rag_eval.py`	RAG vs bare grounding delta, 7 models	Done
`qwen3_think_vs_nothink.py`	Think mode quality and throughput comparison, 3 qwen3 models	Done

Production Systems

File	Description	Status
`council/router.py`	4-tier cascade router, eval-backed tier assignments, dual-backend support	Running
`council/council.py`	Model Council — vote, synthesize, debate, raw modes	Running
`council/server.py`	HTTP API on port 8080, /ask, /council, /speculate, dual-backend health endpoint	Running

Data Files

File	Contents
`quality_r2_20260306_221534.json`	Final Quality R2 results, 22 models (note: multistep_math rescored)
`multiturn_20260306_234513.json`	Multi-turn eval, 6 models, 9 scenarios
`domain_20260307_001136.json`	Domain eval, 8 models, 23 tasks
`mlx_vs_ollama_20260306_225305.json`	MLX vs Ollama head-to-head benchmark
`mlx_concurrency_20260307_*.json`	Two-pass MLX concurrency deep dive (default vs dc=8)
`deep_concurrency_20260306_193319.json`	Ollama deep concurrency suite, 12 models
`REPORT.md`	Consolidated findings, all 12 sections, with full tables and recommendations

By the Numbers

Days

Models pulled

Eval scripts

Eval dimensions

116

Quality tasks

Domain tasks

Scoring bugs fixed

Cloud cost

Section 11

Implementation Playbook

How to build a local LLM eval farm on Apple Silicon. Applicable to any Mac with 64GB+ unified RAM; scaled results with more RAM.

Step 1: Start With the Smallest Model That Works

Pull llama3.2:3b first. Run it at 512 concurrent requests. Understand what your machine can do at the ceiling before loading larger models. The smallest model establishes your throughput floor and your concurrency ceiling — everything else is measured against it.

Step 2: Build Checkpoint/Resume Into Every Eval From Day One

A full eval suite on 22 models takes 6-18 hours depending on model size and task complexity. Runs will crash. Models will timeout. The machine will be needed for other things. Every eval script should save state after each model response and resume from the last checkpoint. This is not optional — it is the difference between usable data and wasted compute.

Step 3: Serialize All Eval Runs

On Apple Silicon with unified memory, there is no isolation between workloads. Run one eval at a time. Build a queue if needed. Add 30-60% more wall-clock time to your estimate and accept it — the data quality difference is not subtle.

Step 4: Test Both Backends Before Committing

Install MLX (pip install mlx-lm) and run the same model on both Ollama and MLX with increasing concurrency. Do not assume one is faster — measure it. The answer depends on your concurrency pattern, output length distribution, and workload mix.

Step 5: Never Self-Judge

Use a separate model as judge. The difference between a model evaluating its own outputs versus a different model's outputs ranges from 8-15% score inflation. Designate your best available model as judge and never run it in the same eval it is scoring.

Step 6: Audit Tasks With Universal Failure

If every model fails a task, the task is broken. Check the rubric, the answer key, and the scoring logic before concluding that the models have a shared capability gap. Universal failure is a signal about the eval, not the models.

Step 7: Build the Router Last

Do not design routing rules before running evals. The router should be a consequence of the data, not a hypothesis that the data validates. Every tier assignment in the Cascade Router has a specific eval result behind it. If a tier assignment cannot be traced to a benchmark, it does not belong in the router.

The Meta-Pattern

Measure before you optimize. Serialize before you parallelize. Audit before you conclude. The surprises in this project — the size inversion, the tuning trap, the wrong answer key — were all discovered because the measurement infrastructure was rigorous enough to catch them.

The right model for any task is not the biggest model. It is the model that scores highest on that specific task type, at the throughput your workload requires, on the hardware you have.

Section 12

Appendix: Full Data Tables

Master Model Table — All Dimensions

Model	GB	Quality R2	Multi-turn	Domain	Peak tok/s	RAG gain
llama3.2:3b	2	77.5%	90.6%	83.0%	23,264	+45pp
qwen3:8b	5	63.1%	—	—	2,217	+33pp
deepseek-r1:7b	5	59.4%	—	—	1,735	—
qwen2.5:7b	5	80.6%	100.0%	93.0%	10,601	+60pp
llama3.2-vision:11b	7	68.1%	—	—	4,566	—
gemma3:12b	8	75.6%	88.6%	91.7%	2,949	+67pp
phi4:14b	9	77.5%	—	—	2,914	—
qwen2.5-coder:14b	9	81.2%	—	—	2,879	—
deepseek-r1:14b	9	68.8%	—	75.7%	472	—
qwen3:14b	9	78.8%	75.1%	93.9%	962	—
qwen2.5:14b	9	75.6%	—	—	2,868	—
mistral-small:22b	13	75.6%	—	—	—	—
gpt-oss:20b	13	70.0%	—	—	—	—
gemma3:27b	17	75.6%	—	—	—	—
qwen3-coder:30b	18	78.8%	—	—	—	—
qwen3:30b	18	76.9%	61.2%	97.8%	3,007	+26pp
qwen2.5:32b	19	80.6%	—	—	728	—
deepseek-r1:32b	20	80.6%	—	—	—	—
llama3.3:70b	42	83.8%	47.8%	80.4%	126	+26pp
deepseek-r1:70b	42	71.2%	—	—	—	—
qwen2.5:72b	47	83.8%	—	—	—	+60pp
qwen3:235b *	142	56.2%	—	97.8%	—	—

Think vs No-Think (qwen3 models)

Model	Mode	Quality delta	Throughput	Recommendation
qwen3:8b	think	+13%	2x slower	Use think — quality worth cost
qwen3:14b	no_think	+7% vs think	2x faster	Use no_think — wins on both dimensions
qwen3:30b	think	+24%	same	Use think — no throughput cost at this size

RAG vs Bare — Full Results

Model	Bare score	RAG score	Gain
llama3.2:3b	33%	79%	+45pp
qwen3:8b	67%	100%	+33pp
qwen2.5:7b	33%	93%	+60pp
gemma3:12b	33%	100%	+67pp
qwen3:30b	67%	93%	+26pp
llama3.3:70b	67%	93%	+26pp
qwen2.5:72b	33%	93%	+60pp

Infrastructure Summary

Component	Spec / Version	Role
Mac Studio	256GB unified RAM, Apple Silicon	All compute — no cloud
Ollama	Port 11434, llama.cpp backend	T1 + T4 model serving; multi-model management
MLX	mlx-lm 0.31.0, port 8081, Metal backend	T2 + T3 model serving; native batching
Council server	Python, port 8080	HTTP API — /ask, /council, /speculate
Python	3.14, requests + json	All eval scripts

Contents

Executive Summary

The Three Headline Findings

The Brief & Infrastructure

The Machine

The Two Backends

Models Pulled

Evaluation Methodology

Six Eval Dimensions

Scoring Design Principles

The Scoring Bug Chronicle

Quality Eval: 22 Models, 16 Tasks

Full Leaderboard

The Three Structural Patterns

Multi-Turn Eval: The Size Inversion

Results

The Four Patterns That Explain the Inversion

Domain Eval: Medical, Legal, Math, Code, Science

Results

Five Domain-Specific Findings

MLX vs Ollama: The Hidden 5.8x

Aggregate Throughput vs Concurrency (qwen2.5:7b, long outputs)

Time to First Token at n=32

Why This Happens

When Ollama Wins

Model Council & Cascade Router

The Cascade Router — 4 Tiers

Router Benchmark Results

The Model Council

Speculative Serving (Abandoned)

Surprises & Course Corrections

1. qwen3:235b Loaded Fine — and Was Nearly Useless

2. The Answer Key Was Wrong

3. The qwen3:14b no_think Anomaly

4. --decode-concurrency 8 Made Things Worse

5. Concurrent Eval Runs Corrupt Each Other

Deliverables

Evaluation Scripts

Production Systems

Data Files

By the Numbers

Implementation Playbook

Step 1: Start With the Smallest Model That Works

Step 2: Build Checkpoint/Resume Into Every Eval From Day One

Step 3: Serialize All Eval Runs

Step 4: Test Both Backends Before Committing

Step 5: Never Self-Judge

Step 6: Audit Tasks With Universal Failure

Step 7: Build the Router Last

Appendix: Full Data Tables

Master Model Table — All Dimensions

Think vs No-Think (qwen3 models)

RAG vs Bare — Full Results

Infrastructure Summary