Running LLMs on Apple Silicon — What's Real, What's Hype
I spent a week on one question:
If we’re building agentic AI systems, when does Apple Silicon make more sense than renting NVIDIA?
This isn’t a fan post. It’s a field note from the operator seat. I don’t care about benchmark chest-thumping. I care about what ships, what it costs, and what breaks at 2am.
The funniest chart in tech
a16z published a chart (FactSet, Feb 2026) of standardised quarterly capex for the big five:
- Amazon: +42% YOY
- Microsoft: +89% YOY
- Alphabet: +50% YOY
- Meta: +48% YOY
- Apple: −19% YOY
Four companies ploughing $100 billion per quarter into data centres. Apple spending less than last year.
Meanwhile: Mac Minis sold out. Mac Studios on six-week backlog. Someone ran Qwen 3.5 on an iPhone. The M5 Max just shipped with 128GB unified memory running Llama 70B from anywhere.
As @JoshKale put it: “The company spending the least on AI infrastructure accidentally became the AI infrastructure.”
That’s the tension this post unpacks.
Why we care
Our constraints are boring:
- Keep client data private
- Run models without insane infra bills
- Iterate fast
- Keep things stable enough to trust
That lens changes the conversation. Most internet debate is about peak numbers. Most real teams solve for reliability, economics, and speed of execution.
The architecture problem nobody explains well
Most people treat AI infrastructure as a pure compute race. Whoever has the most FLOPS wins. But in inference-heavy workflows — what most businesses actually do — memory is often the real bottleneck.
Traditional GPU setups: CPU and GPU have separate memory pools. Weights have to fit in VRAM, and whatever doesn’t fit has to cross the PCIe bus on every pass. Latency, wasted energy, hard performance ceiling. Architectural, not computational.
Apple’s answer: eliminate the separation entirely. Unified Memory Architecture (UMA) puts CPU, GPU, and Neural Engine on the same memory pool. No copies between pools. No PCIe hop.
Practical result: a Mac Studio runs a 70B-parameter model locally, silently, on your desk. Not as fast as an H100 cluster. But you didn’t rent anything, configure anything, or send your data anywhere.
M5 Max: the spec sheet that broke Twitter
Apple just dropped the M5 Max MacBook Pro:
- 18-core CPU with 6 “super cores” — world’s fastest CPU core
- 40-core GPU — rivals an RTX 4070, in a laptop
- 128GB unified memory — more than most servers
- 614 GB/s memory bandwidth: roughly 2.2x the DGX Spark’s 273 GB/s
- 24-hour battery life
- $3,499
Llama 70B, a model that required a $40,000 GPU cluster 18 months ago, now runs on a laptop at a coffee shop. At a reported 20–30 tok/s, fast enough to actually use.
The local AI revolution just shipped as a consumer product. With a keyboard and a battery.
The numbers that matter
Forget FLOPS. For inference, these metrics determine what you can run and what it costs:
Memory cost per GB:
- Apple M3 Ultra: $18/GB
- NVIDIA DGX Spark: $36/GB
- NVIDIA B200 (DGX): $360/GB
Apple is 20x cheaper per GB than NVIDIA’s best datacentre GPU. Half the price even against the budget-tier Spark.
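For anyone who wants to rerun the arithmetic with their own quotes, here is a minimal sketch. The per-GB figures are the ones listed above; the helper name `cost_per_gb` is mine, and the Spark sanity check uses the $4,699 / 128GB figures quoted later in this post.

```python
# Per-GB figures quoted in the list above (USD per GB of memory).
COST_PER_GB = {
    "Apple M3 Ultra": 18,
    "NVIDIA DGX Spark": 36,
    "NVIDIA B200 (DGX)": 360,
}


def cost_per_gb(price_usd: float, memory_gb: float) -> float:
    """Dollars per GB for a device, given its price and memory capacity."""
    return price_usd / memory_gb


# Express every platform as a multiple of the Apple figure.
baseline = COST_PER_GB["Apple M3 Ultra"]
ratios = {name: cost / baseline for name, cost in COST_PER_GB.items()}

for name, ratio in ratios.items():
    print(f"{name}: ${COST_PER_GB[name]}/GB ({ratio:.0f}x Apple)")

# Sanity check: a 128GB DGX Spark at $4,699.
print(f"Spark at $4,699 / 128GB: ${cost_per_gb(4699, 128):.2f}/GB")
```

Swap in real quotes for the configuration you would actually buy; per-GB cost moves a lot with memory tier.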
Energy economics:
- M4 Ultra: ~400 joules per inference task vs cloud GPU: ~10x more
- An H100 SXM is rated at up to 700W per GPU; Mac Studio M4 Ultra: a fraction of that
- A 4-node Mac Studio cluster draws under 250 watts. Whisper-quiet. Under your desk.
Stack Mac Studios and you’re building the cheapest way to run frontier AI models today. NVIDIA has, as one thread put it, “completely missed this segment.”
Doesn’t mean Apple replaces NVIDIA. But if your problem is “fit bigger models locally, keep data onshore, and not torch budget,” the economics are brutal.
Real cluster data (what convinced me)
Jeff Geerling’s 4x Mac Studio cluster testing (Apple-loaned, December 2025) shifted this from “interesting” to “usable.”
Setup: 4x M3 Ultra Mac Studios, 1.5TB total unified memory, Thunderbolt 5. Cost: ~$40,000.
Results:
- DeepSeek V3.1 (671B): 21.1 tok/s single → 27.8 two nodes → 32.5 tok/s four nodes
- Kimi K2 Thinking (1T params): 28 tok/s across the cluster
- Power: Under 250W total. Less than 10W idle per node.
- Geekbench: M3 Ultra beats DGX Spark and AMD AI Max+ 395 in single AND multi-core
- FP64: first small desktop to break 1 TFLOPS, nearly double the NVIDIA GB10
Geerling’s summary: “A single M3 Ultra Mac Studio has more horsepower than my entire Framework Desktop cluster, using half the power.”
Caveats: RDMA over TB5 is still early. Latency dropped from 300μs (TCP) to 5–9μs (RDMA), but setup requires Recovery Mode and cabling doesn’t scale past 4–7 nodes. No TB5 switches exist.
But “4 quiet boxes under your desk running trillion-parameter models at 28 tok/s for 250 watts” is a different universe from two years ago.
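It helps to turn the cluster numbers above into two ratios: how well throughput scales with node count, and how much energy each token costs. A rough sketch, assuming the quoted figures are steady-state and treating "under 250W" as an upper bound; the function names are mine.

```python
# Throughput quoted above for DeepSeek V3.1 (tokens/second by node count).
deepseek_tok_s = {1: 21.1, 2: 27.8, 4: 32.5}


def scaling(tok_s_by_nodes: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Return {nodes: (speedup vs one node, parallel efficiency)}."""
    base = tok_s_by_nodes[1]
    return {
        n: (tok_s / base, tok_s / base / n)
        for n, tok_s in tok_s_by_nodes.items()
    }


def joules_per_token(watts: float, tok_s: float) -> float:
    """Energy per generated token; 1 watt is 1 joule per second."""
    return watts / tok_s


for n, (speedup, eff) in scaling(deepseek_tok_s).items():
    print(f"{n} node(s): {speedup:.2f}x speedup, {eff:.0%} efficiency")

# Kimi K2 Thinking: 28 tok/s at "under 250W", so this is an upper bound.
print(f"Kimi K2: <= {joules_per_token(250, 28):.1f} J/token")
```

Four nodes buy roughly 1.5x the single-node speed, so the case for clustering here is mostly capacity (fitting the model at all) rather than raw speedup.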
DGX Spark: honest assessment
Not anti-NVIDIA. The Spark has a real use case — it’s a capacity play for models (120B+ in NVFP4) that would crash a 24GB consumer GPU.
But the honest picture:
Carmack’s review (Oct 2025): Power maxing at 100W (not the rated 240W). Roughly half the quoted performance. Gets “quite hot.” Spontaneous rebooting. Verdict: “My M3 Pro was generating tokens at comparable speeds” for models that fit in 36GB.
Jan 2026 update: 2.5x improvement on prefill and batch. But token generation — what you actually feel — is bandwidth-limited. Physics problem.
Feb 2026: NVIDIA raised the price from $3,999 to $4,699 (18% hike, LPDDR5X supply). The same memory Apple uses. But Apple is the world’s largest LPDDR5X buyer and can ship 512GB Mac Studios while NVIDIA can’t hold pricing on 128GB. Structural supply chain advantage.
The Spark’s 273 GB/s bandwidth looks thin against the M5 Max’s 614 GB/s. AMD’s Strix Halo benchmarks similarly at half the price.
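Why bandwidth matters so much for token generation can be shown with a back-of-envelope ceiling: each new token has to stream every active weight from memory once, so tokens per second cannot exceed bandwidth divided by the size of the active weights. A minimal sketch, assuming a dense 70B model at 4-bit weights (my illustrative choice) and ignoring KV-cache reads and compute time:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate.

    tokens/second <= memory bandwidth / bytes of active weights.
    KV-cache traffic and compute are ignored, so real numbers land lower.
    """
    # Billions of params * bytes per param = gigabytes of weights.
    weight_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


# Bandwidth figures from the text; 70B dense at 4-bit is an assumption.
for name, bandwidth in {"DGX Spark": 273, "M5 Max": 614}.items():
    ceiling = decode_ceiling_tok_s(bandwidth, 70, 4)
    print(f"{name}: <= {ceiling:.1f} tok/s")
```

Measured speeds above this ceiling imply something other than a dense 4-bit model: lower-bit quantisation, speculative decoding, or a sparse model with fewer active parameters per token.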
Where Spark shines: Brev hybrid routing, dual-Spark 256GB pools for Llama 405B, and 30+ NIM playbooks. If you need CUDA compatibility, it’s the cheapest entry.
MLX is quietly winning
Apple-side tooling is moving faster than most realise.
A forthcoming paper (vllm-mlx, EuroMLSys ’26) benchmarked native Apple Silicon inference against llama.cpp:
- 21–87% higher throughput than llama.cpp on Apple Silicon
- M4 Max: up to 525 tok/s on text models (Qwen3-0.6B)
- Continuous batching: 4.3x aggregate throughput at 16 concurrent requests
- Prefix caching: 28x speedup on repeated image queries
Why? MLX exploits UMA properly: lazy evaluation, native quantisation kernels, true zero-copy. llama.cpp started as a portable CPU-first runtime and gained Metal support later; MLX was built for UMA from day one.
Key insight: Apple Silicon’s advantage grows with concurrency. Continuous batching on UMA is fundamentally more efficient because KV cache doesn’t transfer between devices. Single-user tok/s comparisons are misleading — in multi-user serving, the UMA advantage compounds.
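Continuous batching is easy to see in a toy scheduler. The sketch below is not vllm-mlx code and models nothing about UMA; it only compares the two scheduling policies, assuming one decode step costs the same regardless of how many slots are busy. The request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths: list[int], slots: int) -> int:
    """Steps needed when each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths: list[int], slots: int) -> int:
    """Steps needed when a finished request's slot is refilled immediately."""
    queue = deque(lengths)
    active: list[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # One decode step emits one token for every in-flight request;
        # requests that just emitted their last token leave the batch.
        active = [r - 1 for r in active if r > 1]
    return steps


lengths = [200, 20, 20, 20, 200, 20, 20, 20]  # hypothetical output lengths
total = sum(lengths)
for name, fn in [("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, slots=4)
    print(f"{name}: {steps} steps, {total / steps:.2f} tokens/step")
```

With this mix, static batching needs 400 steps and continuous batching 220, because short requests no longer wait for the long one in their batch. The gain depends entirely on how uneven the request lengths are.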
Demand signals match: Mac Minis backordered, Mac Studios six-week wait. When Alibaba dropped Qwen 3.5, MLX support landed same-day — running on an iPhone within hours.
I track a simple metric: TTLD (Time To Local Deployment) — model release to running privately. For agentic workflows where privacy and iteration speed beat peak throughput, Apple + MLX consistently wins TTLD even when it loses raw numbers.
Why DeepSeek V4 changes the hardware calculus
This is the part nobody’s connecting yet.
Standard transformers waste expensive GPU compute on two fundamentally different tasks:
- Static Recall — “What’s the syntax for a Python list comprehension?” Memory lookups. No reasoning needed.
- Dynamic Reasoning — Logic, composition, novel problem-solving. Needs full compute.
Every model today uses the same expensive hardware for both. Like hiring a surgeon to do your filing.
DeepSeek V4’s Engram architecture separates them. Static knowledge offloads to an O(1) hash-based lookup table in system DRAM — 100B parameters in regular memory, not GPU memory. Throughput penalty: less than 3%.
The hardware implication: high-bandwidth system memory is now as valuable as GPU FLOPS.
On traditional x86+GPU, Engram lookups go through the PCIe bus — bottleneck. On Apple Silicon’s UMA, they’re in the same memory pool as GPU compute. Zero-cost access.
V4 activates only 32B parameters from its 1 trillion total (fewer than V3 despite being 50% larger). Less GPU compute needed, more memory needed. Apple’s $/GB advantage dominates exactly the cost structure V4 optimises for.
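To make "O(1) hash-based lookup" concrete, here is a toy version of the idea: hash the last few token IDs to a row index, then read that row from a table sitting in ordinary RAM. This is an illustration of the general technique only, not DeepSeek's implementation; `MemoryTable`, the bigram key, and the table size are all hypothetical.

```python
import random


class MemoryTable:
    """Toy static-memory table: fixed rows of floats held in ordinary RAM."""

    def __init__(self, rows: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.rows = rows
        self.data = [[rng.uniform(-1, 1) for _ in range(dim)]
                     for _ in range(rows)]

    def slot(self, token_ids: list[int], n: int = 2) -> int:
        """Hash the last n token IDs to a row index in constant time."""
        h = 0
        for t in token_ids[-n:]:
            h = (h * 1_000_003 + t) % (2**61 - 1)
        return h % self.rows

    def lookup(self, token_ids: list[int], n: int = 2) -> list[float]:
        """One hash plus one indexed read; no search over the table."""
        return self.data[self.slot(token_ids, n)]


table = MemoryTable(rows=4096, dim=8)
a = table.lookup([11, 42, 7])      # context ending in (42, 7)
b = table.lookup([99, 3, 42, 7])   # different prefix, same trailing bigram
print(a == b)                      # same trailing n-gram, same row
```

The cost of a lookup does not grow with table size, which is why the table can live in cheaper, larger memory. Where that memory sits relative to the GPU is the part the hardware argument turns on.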
The convergence thesis: As architectures evolve to separate memory from reasoning — Engram being the first — Apple Silicon’s unified memory becomes more advantageous. The trajectory:
- 2024: “Apple can’t do AI” (the training narrative)
- 2025: “Apple Silicon is interesting for inference” (the memory narrative)
- 2026: “Apple Silicon + Engram-style models = optimal local inference” (convergence)
As more models adopt conditional memory and knowledge offloading, the gap widens.
ANE: watch this space
The Neural Engine deserves a mention — not because it’s production-ready for LLMs, but because it signals untapped headroom.
M4’s Neural Engine: 38 TOPS. M5 puts neural accelerators inside each GPU core — generational architecture shift. A reverse-engineering project (maderix/ANE) has demonstrated training and backprop on the ANE with compelling microbenchmarks.
But today: private APIs, fragility risk, CPU fallbacks, potential breakage every macOS update. Research territory, not production dependency. File under “strategic headroom.”
The honest counterarguments
Throughput at scale: Need 100+ tok/s on the largest open models? Apple loses. Two Mac Studios running Kimi K2.5 at 4-bit quant: 10–12 tok/s. Two AMD TBv2 cards might be cheaper for raw throughput. Apple wins $/GB but can lose throughput-per-dollar at the high end.
RDMA is green: TB5 RDMA is a strategic signal, not production-grade clustering. CPU spikes (900%+) from TB Bridge loops. Recovery Mode access required. No TB5 switches. Full mesh means N-1 cables per node. Beyond 4–7 nodes: unsolved.
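The cabling problem is simple combinatorics. In a full mesh every node links directly to every other node, so each node needs N-1 ports and the cluster needs N(N-1)/2 cables. A small sketch; the six-ports-per-node figure is an assumption for illustration, not something measured here.

```python
def full_mesh(nodes: int) -> tuple[int, int]:
    """Return (ports needed per node, total cables) for a full mesh."""
    return nodes - 1, nodes * (nodes - 1) // 2


def max_mesh_nodes(ports_per_node: int) -> int:
    """Largest full mesh a node with this many free ports can join."""
    return ports_per_node + 1


for n in (2, 4, 7, 8):
    ports, cables = full_mesh(n)
    print(f"{n} nodes: {ports} ports per node, {cables} cables")

# PORTS is an assumed figure for illustration, not taken from the text.
PORTS = 6
print(f"{PORTS} ports per node caps a full mesh at {max_mesh_nodes(PORTS)} nodes")
```

Cable count grows quadratically, so without a switch the ceiling is set by ports per box long before it is set by budget.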
Cluster management pain: Exo clustering works and auto-discovers, but it’s not Kubernetes. Don’t expect enterprise ops maturity.
Training is still NVIDIA: Heavy training and fine-tuning remain NVIDIA territory. CUDA ecosystem, mature distributed training, raw scale. Training-heavy workload? This conversation doesn’t apply.
This isn’t a religion. Any real strategy here is mixed. The question isn’t Apple or NVIDIA — it’s knowing when each wins.
How we’re deciding
Apple-first when:
- Inference-heavy workload
- Privacy or data locality matters
- Memory-per-dollar matters
- Fast local iteration is valuable
- Data stays onshore without cloud bills
NVIDIA/cloud-first when:
- High throughput target (100+ tok/s on large models)
- Training or fine-tuning heavy
- Enterprise ops maturity required
- CUDA ecosystem compatibility needed
In practice: increasingly hybrid. Apple for local core paths and private data. Cloud for peak compute when needed.
Personal Computing v2
Karpathy framed it on Latent Space: “As we leave the cloud for Personal/Private AI, some signs of Personal Computing v2 are being born in Exolabs and Apple MLX work.”
Two eras:
- PC v1 (1980s): Computation moved from mainframes to desktops.
- PC v2 (2026+): AI inference moves from data centres to desktops.
Apple won the first transition. They’re building infrastructure for the second — not with data centres, but with architecture.
The data sovereignty angle isn’t just privacy idealism. It’s economic inevitability. As models grow and API costs compound, the break-even for local inference keeps moving earlier. A Mac Mini M4 Pro pays for itself vs cloud H100 rental ($2.39/hr) in roughly 1,000 hours — about 6 weeks of continuous inference.
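The break-even claim is one division, sketched below so you can plug in your own numbers. The $2.39/hr rate is from the text; the $2,400 hardware price is an assumed configuration price chosen to illustrate the roughly 1,000-hour figure, and `local_usd_per_hour` is a hypothetical knob for electricity.

```python
def break_even_hours(hardware_usd: float,
                     cloud_usd_per_hour: float,
                     local_usd_per_hour: float = 0.0) -> float:
    """Hours of use after which owning the box beats renting.

    local_usd_per_hour covers electricity and similar running costs.
    """
    saving_per_hour = cloud_usd_per_hour - local_usd_per_hour
    if saving_per_hour <= 0:
        raise ValueError("local running cost must be below the cloud rate")
    return hardware_usd / saving_per_hour


H100_RATE = 2.39         # USD/hour, from the text
MAC_MINI_PRICE = 2_400   # USD, assumed price for illustration

hours = break_even_hours(MAC_MINI_PRICE, H100_RATE)
weeks = hours / (24 * 7)
print(f"{hours:.0f} hours, about {weeks:.1f} weeks running continuously")
```

This compares hours, not tokens. An H100 produces far more tokens per hour than a Mac Mini, so the comparison only holds when the local box is fast enough for the workload you actually have.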
Where I’ve landed
A year ago, serious local AI on Apple felt niche. Now it feels like a legitimate operating mode.
Not because Apple “won AI.” Because architecture, economics, and tooling velocity combined into something too practical to ignore.
Look at that capex chart again. Big Tech is spending hundreds of billions on top-down AI infrastructure — massive data centres, custom chips, power plants. Apple spent less and ended up with sold-out hardware that developers are clustering to run trillion-parameter models under their desks.
Nobody planned that. It’s what happens when you build the right architecture and demand finds you.
If you’re building agentic systems and haven’t pressure-tested local inference, it’s worth doing. The numbers might surprise you.
Next up: concrete reference builds (Mac Mini / Mac Studio tiers), expected model envelopes, and where each setup starts to fall over.