NVIDIA GB10 Grace Blackwell Benchmark: 17 Local LLMs Tested in 2026 (qwen3 at 82.5 tok/s)
Real-world tests of 17 local LLMs on the NVIDIA GB10 (DGX Spark, 121 GB unified memory). qwen3:30b-a3b dominates at 82.5 tok/s. Data-driven analysis, RAM comparison, and MoE vs. dense verdict.
TL;DR: On an NVIDIA GB10 Grace Blackwell (121 GB unified memory), qwen3:30b-a3b (MoE) spits out 82.5 tok/s in chat — the best speed-to-quality ratio on the market in 2026. 70B dense models (llama3.3, nemotron, deepseek-r1) all run at 4.7 tok/s: usable only in async mode. Beyond 123B, we hit the memory ceiling. Full table below.
Why this benchmark
I had access for a few days to an NVIDIA DGX Spark (MSI EdgeXpert MS-C931), an ARM64 machine equipped with the GB10 Grace Blackwell Superchip (20 ARMv9.2 cores, 121 GiB unified LPDDR5x, integrated Blackwell GPU, CUDA 13). This is one of the first “edge” platforms capable of running 70B dense models locally without disk paging.
I tested 17 models through Ollama, falling back to llama.cpp for the architectures that Ollama 0.21.2 still refused in May 2026 (notably Qwen 3.5/3.6). All tests used the same technical prompt (200 predicted tokens) to ensure comparability.
Spoiler: The numbers confirm what theory suggested — MoE models crush comparable dense ones thanks to unified memory, and the RTX 3090 remains superior for dense 7B-32B models despite its limited VRAM.
Methodology
- Hardware: NVIDIA GB10 Grace Blackwell, 121 GiB unified LPDDR5x, integrated Blackwell GPU, CUDA 13 driver 580.142, DGX OS (Ubuntu 24.04 NVIDIA), kernel 6.17.0-1014-nvidia
- Runtime: Ollama 0.21.3-rc0 + llama.cpp main (for Qwen 3.5/3.6 architectures not supported by Ollama)
- Fixed prompt: « Explain in 5 short sentences why the NVIDIA GB10… » (see methodology note at the end of the article)
- Primary metric: eval_tps (tokens/second generated, excluding prompt eval)
- Secondary metrics: prompt_tps, initial load time (cold start), delta RAM consumed
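For readers who want to compute the same metrics themselves, here is a minimal sketch. It assumes the figures come from the timing fields of an Ollama /api/generate response (eval_count, eval_duration, prompt_eval_count, prompt_eval_duration, load_duration), which report durations in nanoseconds. The sample values are hypothetical and only illustrate the arithmetic; they are not measurements from this benchmark.

```python
# Sketch: derive eval_tps, prompt_tps and load time from one Ollama response.
# Ollama reports all durations in nanoseconds.

NS_PER_SECOND = 1_000_000_000


def rates_from_response(resp: dict) -> dict:
    """Return eval_tps, prompt_tps and load time (seconds) from a response dict."""
    eval_tps = resp["eval_count"] / (resp["eval_duration"] / NS_PER_SECOND)
    prompt_tps = resp["prompt_eval_count"] / (
        resp["prompt_eval_duration"] / NS_PER_SECOND
    )
    load_s = resp["load_duration"] / NS_PER_SECOND
    return {"eval_tps": eval_tps, "prompt_tps": prompt_tps, "load_s": load_s}


# Hypothetical values, chosen only to make the arithmetic easy to follow.
sample = {
    "eval_count": 200,                    # tokens generated
    "eval_duration": 2_500_000_000,       # 2.5 s spent generating
    "prompt_eval_count": 50,              # prompt tokens processed
    "prompt_eval_duration": 100_000_000,  # 0.1 s spent on the prompt
    "load_duration": 19_000_000_000,      # 19 s cold load
}

metrics = rates_from_response(sample)
print(metrics)
```

Keeping prompt evaluation and generation separate matters here: the two rates differ by an order of magnitude in the table, so a single blended figure would hide the generation speed that users actually feel.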
Full ranking (descending eval tok/s)
| # | Model | Type | Total / Active Params | eval tok/s | prompt tok/s | Load | Δ RAM |
|---|---|---|---|---|---|---|---|
| 🥇 1 | qwen3:30b-a3b | MoE | 30B / 3B | 82.5 | 442 | 19s | 45 GB |
| 🥈 2 | qwen2.5:7b | dense | 7B | 47.8 | 2,379 | 30s | 7 GB |
| 🥉 3 | mistral:7b | dense | 7B | 47.0 | 2,194 | 3s | 9 GB |
| 4 | gemma2:9b | dense | 9B | 40.0 | 1,709 | 14s | 9 GB |
| 5 | mixtral:8x7b | MoE | 46B / 12B | 30.8 | 518 | 23s | 30 GB |
| 6 | qwen2.5:14b | dense | 14B | 24.6 | 1,295 | 5s | 15 GB |
| 7 | phi4 | dense | 14B | 23.2 | 1,235 | 9s | 12 GB |
| 8 | gemma2:27b | dense | 27B | 14.3 | 689 | 6s | 19 GB |
| 9 | qwen2.5:32b | dense | 32B | 10.6 | 631 | 13s | 27 GB |
| 10 | nemotron:70b | dense | 70B | 4.7 | 260 | 27s | (cache) |
| 11 | llama3.3:70b | dense | 70B | 4.7 | 254 | 91s | 81 GB |
| 12 | deepseek-r1:70b | dense | 70B | 4.7 | 218 | 30s | 81 GB |
| 13 | mistral-large:123b | dense | 123B | 2.3 | 119 | 72s | 115 GB ⚠ |
Bonus: Models blocked by Ollama 0.21.2
Four models I attempted failed to load:
- qwen3:235b-a22b (142 GB) → exceeds the 121 GB unified memory limit, swap overflow
- Qwen3.5-27B (Unsloth HF) → qwen35 architecture not supported by Ollama 0.21.2
- Qwen3.6-35B-A3B (Unsloth HF) → qwen35moe architecture not supported
- Qwen3.5-122B-A10B (Unsloth HF) → same architecture issue
→ However, these last three run fine via llama.cpp main compiled from source (with CUDA 13 + qwen35moe architecture). More details in a dedicated upcoming article.
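The memory side of these failures can be checked before downloading anything. The sketch below compares a model's footprint with the 121 GB of unified memory and labels the result. The 8 GB reserve for the OS and background processes is a hypothetical threshold of mine, not a measured value, and the footprints mix the download size (qwen3:235b-a22b) with the Δ RAM figures from the table.

```python
# Sketch: classify whether a model footprint fits in unified memory.

UNIFIED_MEMORY_GB = 121.0  # unified memory figure used in this article


def classify_fit(footprint_gb: float,
                 total_gb: float = UNIFIED_MEMORY_GB,
                 reserve_gb: float = 8.0) -> str:
    """Return 'overflow', 'tight' or 'ok' for a given footprint.

    reserve_gb is a hypothetical safety margin for the OS and other processes.
    """
    headroom = total_gb - footprint_gb
    if headroom < 0:
        return "overflow"  # cannot be held in memory at all
    if headroom < reserve_gb:
        return "tight"     # loads, but little room is left for anything else
    return "ok"


footprints_gb = {
    "qwen3:235b-a22b": 142.0,     # model size quoted in the list above
    "mistral-large:123b": 115.0,  # delta RAM from the ranking table
    "llama3.3:70b": 81.0,         # delta RAM from the ranking table
}

labels = {name: classify_fit(gb) for name, gb in footprints_gb.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```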
Reading the results
1. The speed sweet spot: qwen3:30b-a3b (MoE)
At 82.5 tok/s, qwen3:30b-a3b is 1.7× faster than the best dense 7B (47.8 tok/s) and roughly 17.5× faster than the 70B models (4.7 tok/s). The MoE advantage: only 3B parameters are active and actually computed per token, while the remaining 27B stay resident in unified memory, ready for expert routing.
On an RTX 3090 24 GB, this model cannot be fully loaded into VRAM (45 GB required): part of it must be offloaded to system RAM, which divides performance by 3-5×.
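The speedups quoted in this section follow directly from the eval tok/s column of the ranking table. A short sketch of the arithmetic, using only values from that table:

```python
# Sketch: speedup of qwen3:30b-a3b over selected dense models,
# computed from the eval tok/s column of the ranking table.

eval_tps = {
    "qwen3:30b-a3b": 82.5,
    "qwen2.5:7b": 47.8,
    "qwen2.5:32b": 10.6,
    "llama3.3:70b": 4.7,
}


def speedup(fast: str, slow: str) -> float:
    """Ratio of generation speeds between two models from the table."""
    return eval_tps[fast] / eval_tps[slow]


speedups = {
    name: speedup("qwen3:30b-a3b", name)
    for name in ("qwen2.5:7b", "qwen2.5:32b", "llama3.3:70b")
}
for name, ratio in speedups.items():
    print(f"qwen3:30b-a3b vs {name}: {ratio:.2f}x")
```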
2. The 70B ceiling: all at 4.7 tok/s
llama3.3:70b, nemotron:70b, and deepseek-r1:70b all sit at the same floor of 4.7 tok/s. The limit is LPDDR5x memory bandwidth (~273 GB/s), compared with 936 GB/s on a dedicated GPU like the RTX 3090.
But: on an RTX 3090, these 70B models cannot be held in VRAM at all, because their weights exceed the 24 GB limit. The GB10 makes them accessible, even at 4.7 tok/s (roughly 280 tokens/minute), which remains usable for async tasks (long summaries, batch classification, CoT reasoning).
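To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes a purely bandwidth-bound decode, where tok/s is at most the memory bandwidth divided by the bytes streamed per token. That is a simplification that ignores compute, caching effects and runtime overhead, so the result is an upper bound, not a measurement.

```python
# Sketch: simple bandwidth-bound view of the 70B results.
# Assumption: decode speed <= memory bandwidth / bytes read per token.

GB10_BANDWIDTH_GBS = 273.0     # LPDDR5x figure quoted in the article
RTX3090_BANDWIDTH_GBS = 936.0  # RTX 3090 figure quoted in the article
MEASURED_70B_TPS = 4.7         # eval tok/s for the three 70B models

# How much faster the RTX 3090's memory is, on paper.
bandwidth_ratio = RTX3090_BANDWIDTH_GBS / GB10_BANDWIDTH_GBS

# Upper bound on the data streamed per generated token at the measured speed.
max_gb_per_token = GB10_BANDWIDTH_GBS / MEASURED_70B_TPS

# Throughput expressed per minute, as used for async workloads.
tokens_per_minute = MEASURED_70B_TPS * 60

print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"max data per token: {max_gb_per_token:.1f} GB")
print(f"tokens per minute: {tokens_per_minute:.0f}")
```

Under this model, three different 70B models landing on the same 4.7 tok/s is expected: they stream a similar amount of weight data per token through the same memory bus.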
3. The absolute ceiling: 123B dense
Mistral-Large 123B reaches 2.3 tok/s but consumes 115 GB out of 121 GB available — the margin is too tight for interactive use (the system swaps at the slightest background process). In practice, this is the wall.
→ The conclusion: a 122B/10B MoE (Qwen3.5-122B-A10B) would be a far better fit for this machine: ~50 GB in memory (less than half of Mistral-Large's 115 GB), comparable quality, and an estimated speed of 25-30 tok/s. I will follow up once Ollama supports the architecture.
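One rough way to sanity-check the 25-30 tok/s estimate is to scale the measured qwen3:30b-a3b result by the ratio of active parameters. This sketch assumes generation speed is inversely proportional to the active parameter count, with the same quantization and memory efficiency for both models. That is my simplifying assumption, not a measured relationship.

```python
# Sketch: extrapolate MoE decode speed from active parameter counts.
# Assumption: tok/s scales inversely with active parameters per token.


def scaled_tps(measured_tps: float,
               measured_active_b: float,
               target_active_b: float) -> float:
    """Scale a measured tok/s figure to a model with more active parameters."""
    return measured_tps * measured_active_b / target_active_b


# qwen3:30b-a3b: 82.5 tok/s with 3B active parameters (from the table).
# Qwen3.5-122B-A10B: 10B active parameters (untested under Ollama here).
estimate = scaled_tps(measured_tps=82.5, measured_active_b=3.0,
                      target_active_b=10.0)
print(f"estimated speed: {estimate:.2f} tok/s")
```

The result lands close to the lower end of the 25-30 tok/s range, which suggests the estimate is plausible but should be treated as a ceiling until the model is actually benchmarked.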
4. Dense 7B-32B: RTX 3090 remains queen
For equal model size and FP16, an RTX 3090 outperforms the GB10 on everything that fits in 24 GB of VRAM. The GB10 is unbeatable only when the model does not fit in dedicated VRAM. So:
- If you plan to run mainly dense ≤ 32B → a used RTX 3090 (~€700) remains faster
- If you want to run 70B+ or 30B+ MoE → GB10 / DGX Spark crushes everything
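The rule of thumb above boils down to a footprint check. The sketch below encodes it in a deliberately crude way, assuming the memory footprint alone decides the platform. The function name and the idea of using the table's Δ RAM as the footprint are my own illustration. Δ RAM includes runtime overhead, so a model near the 24 GB boundary may still fit in VRAM with a tighter quantization or a shorter context.

```python
# Sketch: pick a platform from the memory footprint alone.
# Crude by design: ignores quantization choices and context length.

RTX3090_VRAM_GB = 24.0
GB10_UNIFIED_GB = 121.0


def pick_platform(footprint_gb: float) -> str:
    """Return the platform suggested by the article's rule of thumb."""
    if footprint_gb <= RTX3090_VRAM_GB:
        return "RTX 3090"  # fits in dedicated VRAM, higher bandwidth wins
    if footprint_gb <= GB10_UNIFIED_GB:
        return "GB10"      # too large for 24 GB, fits in unified memory
    return "neither"       # exceeds both memory pools


# Footprints taken from the table (delta RAM) and the blocked-models list.
examples_gb = {
    "qwen2.5:14b": 15.0,
    "qwen3:30b-a3b": 45.0,
    "llama3.3:70b": 81.0,
    "qwen3:235b-a22b": 142.0,
}

choices = {name: pick_platform(gb) for name, gb in examples_gb.items()}
for name, choice in choices.items():
    print(f"{name}: {choice}")
```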
2026 Practical Verdict
| Use Case | Recommended Model on GB10 | tok/s |
|---|---|---|
| Ultra-fast chat, 30B quality | qwen3:30b-a3b | 82.5 |
| Compact Opus-lite quality | qwen2.5:14b or phi4 | ~24 |
| Long reasoning (CoT) | deepseek-r1:70b | 4.7 |
| Max quality “70B general” tasks | nemotron:70b | 4.7 |
| OCR / Vision | qwen3-vl-30b-a3b (not tested here, see dedicated article) | 42 |
Further reading
If you are building your own self-hosted LLM stack with a comparable machine, here are the resources that are truly worth a look:
- Hostinger VPS offers plans with NVIDIA H100 / A100 GPUs available for hourly rental. For heavy benchmarks without investing €6,000 in hardware, this is the most effective option to test in 2026 (affiliate link — disclosure below).
- Bitdefender GravityZone: if your benchmark machine is exposed (SSH tunnel, remote IPMI), a professional EDR is non-negotiable. [In-depth review coming soon on this site].
- Official Ollama documentation: github.com/ollama/ollama
- llama.cpp main documentation: github.com/ggerganov/llama.cpp
FAQ
Is the GB10 Grace Blackwell (DGX Spark) worth it for 70B+ LLMs? Yes, provided you accept ~5 tok/s in generation. For async tasks (batch analysis, long-form summaries), it’s perfect. For interactive chat, stick to 30B MoE.
Why does qwen3:30b-a3b crush qwen2.5:32b despite having fewer parameters? Because it is a 30B/3B MoE: only 3 billion parameters are activated per token (a subset of experts), whereas the dense qwen2.5:32b must read all 32B parameters for every token. On bandwidth-limited hardware like the GB10, that translates into 82.5 vs. 10.6 tok/s in the table above, roughly 8× faster.
How to reproduce this benchmark? Install Ollama via the official script, run ollama pull <model>, then ollama run <model> --verbose with the same prompt. The eval rate field gives tok/s.
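If you prefer scripting the runs over reading the CLI output, the same measurement can be taken through Ollama's local REST API. This is a sketch of an alternative to the CLI route, not the procedure used for the table. It assumes Ollama listens on its default port 11434 and that the 200-token cap maps to the num_predict option. The helper names are my own.

```python
# Sketch: measure eval tok/s through Ollama's /api/generate endpoint.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(model: str, prompt: str, num_predict: int = 200) -> dict:
    """Build a non-streaming request capped at num_predict generated tokens."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object with the timing fields
        "options": {"num_predict": num_predict},
    }


def run_once(model: str, prompt: str) -> float:
    """Send one request and return generated tokens per second."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # eval_duration is reported in nanoseconds.
    return body["eval_count"] / (body["eval_duration"] / 1e9)


# Usage (requires a running Ollama server and a pulled model):
# print(run_once("qwen3:30b-a3b", "Explain in 5 short sentences why ..."))
```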
What about ChatGPT / Claude / Gemini models? Out of scope: this benchmark concerns only local self-hosted models. Cloud APIs are a different category of problem (network latency, cost per token, vendor lock-in).
Methodology Note
All tests use the same prompt (200 predicted tokens):
“Explain in 5 short sentences why the NVIDIA GB10 Grace Blackwell (128 GB unified memory, native FP4, ARMv9.2) excels at MoE models and 70B+ LLMs, but can be outperformed by an RTX 3090 on dense 7B-32B models. Be technical and precise.”
Why this prompt: it is technical enough to engage reasoning layers, short enough not to bias prompt eval time, and language-neutral (FR + EN terms). All tests were launched cold (model unloaded between runs), with a 30-second wait between each to allow memory to return to a stable state.
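The cold-run procedure can be sketched as a small loop. The run_one callable is a placeholder for whatever performs a single measurement and unloads the model afterwards. How the models were unloaded for this benchmark is not specified here, so that step is left to the caller (setting keep_alive to 0 in an Ollama API request is one option). The sleep function is injectable so the loop can be tested without waiting.

```python
# Sketch: run models one after another with a fixed pause between runs.
import time
from typing import Callable, Dict, Iterable


def run_cold_sequence(models: Iterable[str],
                      run_one: Callable[[str], float],
                      wait_s: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep
                      ) -> Dict[str, float]:
    """Measure each model in turn, pausing wait_s seconds between runs.

    run_one(model) must return tok/s and leave the model unloaded.
    """
    results: Dict[str, float] = {}
    for index, model in enumerate(models):
        if index > 0:
            sleep(wait_s)  # let memory settle before the next cold start
        results[model] = run_one(model)
    return results


# Demonstration with stand-ins: no real model is loaded and no time is spent.
recorded_waits = []
fake_speeds = {"model-a": 10.0, "model-b": 20.0, "model-c": 30.0}
demo = run_cold_sequence(
    fake_speeds.keys(),
    run_one=lambda name: fake_speeds[name],
    sleep=recorded_waits.append,
)
print(demo, recorded_waits)
```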
Raw data (JSON) is available upon request for researchers wishing to reproduce or extend the benchmark.
Affiliate Disclosure
This article contains affiliate links. When you click on a link marked “affiliate link” and make a purchase, I may receive a commission. This does not affect the content or ranking in the table above — the numbers are the raw measured figures. Commissions are used solely to fund testing time (electricity, borrowed hardware, software subscriptions).
Article written on May 28, 2026 — data collected April 27, 2026.