NVIDIA GB10 Grace Blackwell Benchmark: 17 Local LLMs Tested in 2026 (qwen3 at 82.5 tok/s)

Real-world tests of 17 local LLMs on the NVIDIA GB10 (DGX Spark, 121 GB unified memory). qwen3:30b-a3b dominates at 82.5 tok/s. Data-driven analysis, RAM comparison, and MoE vs. dense verdict.

TL;DR: On an NVIDIA GB10 Grace Blackwell (121 GB unified memory), qwen3:30b-a3b (MoE) generates 82.5 tok/s in chat, the fastest result of the 13 models measured and, in my view, the best speed-to-quality trade-off among them. The 70B dense models (llama3.3, nemotron, deepseek-r1) all run at 4.7 tok/s: usable only in async mode. At 123B dense, we hit the memory ceiling. Full table below.

Why this benchmark

I had access for a few days to an NVIDIA DGX Spark (MSI EdgeXpert MS-C931), an ARM64 machine equipped with the GB10 Grace Blackwell Superchip (20 ARMv9.2 cores, 121 GiB unified LPDDR5x, integrated Blackwell GPU, CUDA 13). This is one of the first “edge” platforms capable of running 70B dense models locally without disk paging.

I tried 17 models via Ollama, falling back to llama.cpp for those that Ollama 0.21.2 still refused in May 2026 (notably Qwen 3.5/3.6). Thirteen produced the measurements in the table below; four failed to load under Ollama. All tests used the same technical prompt (200 predicted tokens) to ensure comparability.

Spoiler: The numbers confirm what theory suggested — MoE models crush comparable dense ones thanks to unified memory, and the RTX 3090 remains superior for dense 7B-32B models despite its limited VRAM.

Methodology

Hardware: NVIDIA GB10 Grace Blackwell, 121 GiB unified LPDDR5x, integrated Blackwell GPU, CUDA 13 driver 580.142, DGX OS (Ubuntu 24.04 NVIDIA), kernel 6.17.0-1014-nvidia
Runtime: Ollama 0.21.3-rc0 + llama.cpp main (for Qwen 3.5/3.6 architectures not supported by Ollama)
Fixed prompt: “Explain in 5 short sentences why the NVIDIA GB10…” (see the methodology note at the end of the article)
Primary metric: eval_tps (tokens/second generated, excluding prompt eval)
Secondary metrics: prompt_tps, initial load time (cold start), delta RAM consumed

Full ranking (descending eval tok/s)

#	Model	Type	Total / Active Params	eval tok/s	prompt tok/s	Load	Δ RAM
🥇 1	qwen3:30b-a3b	MoE	30B / 3B	82.5	442	19s	45 GB
🥈 2	qwen2.5:7b	dense	7B	47.8	2,379	30s	7 GB
🥉 3	mistral:7b	dense	7B	47.0	2,194	3s	9 GB
4	gemma2:9b	dense	9B	40.0	1,709	14s	9 GB
5	mixtral:8x7b	MoE	46B / 12B	30.8	518	23s	30 GB
6	qwen2.5:14b	dense	14B	24.6	1,295	5s	15 GB
7	phi4	dense	14B	23.2	1,235	9s	12 GB
8	gemma2:27b	dense	27B	14.3	689	6s	19 GB
9	qwen2.5:32b	dense	32B	10.6	631	13s	27 GB
10	nemotron:70b	dense	70B	4.7	260	27s	(cache)
11	llama3.3:70b	dense	70B	4.7	254	91s	81 GB
12	deepseek-r1:70b	dense	70B	4.7	218	30s	81 GB
13	mistral-large:123b	dense	123B	2.3	119	72s	115 GB ⚠

Bonus: Models blocked by Ollama 0.21.2

Four models I tried failed to load under Ollama:

qwen3:235b-a22b (142 GB) → exceeds the 121 GB unified memory limit, swap overflow
Qwen3.5-27B (Unsloth HF) → qwen35 architecture not supported by Ollama 0.21.2
Qwen3.6-35B-A3B (Unsloth HF) → qwen35moe architecture not supported
Qwen3.5-122B-A10B (Unsloth HF) → same architecture issue

→ However, these last three run fine via llama.cpp main compiled from source (with CUDA 13 + qwen35moe architecture). More details in a dedicated upcoming article.

Reading the results

1. The speed sweet spot: qwen3:30b-a3b (MoE)

At 82.5 tok/s, qwen3:30b-a3b is 1.7× faster than the best dense 7B (qwen2.5:7b, 47.8 tok/s) and more than 17× faster than the 70B models (4.7 tok/s). The MoE advantage: only 3B parameters are “active” and actually computed per token, while the remaining 27B stay resident in unified memory, ready for expert routing.
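The speed ratios quoted in this article follow directly from the ranking table. A minimal sketch, using only the measured eval tok/s figures (the dictionary and the `speedup` helper are illustrative names, not part of any tool):

```python
# Measured eval tok/s, copied from the ranking table.
EVAL_TPS = {
    "qwen3:30b-a3b": 82.5,
    "qwen2.5:7b": 47.8,
    "qwen2.5:32b": 10.6,
    "llama3.3:70b": 4.7,
}


def speedup(fast: str, slow: str) -> float:
    """Return how many times faster `fast` generates tokens than `slow`."""
    return EVAL_TPS[fast] / EVAL_TPS[slow]


for baseline in ("qwen2.5:7b", "qwen2.5:32b", "llama3.3:70b"):
    ratio = speedup("qwen3:30b-a3b", baseline)
    print(f"qwen3:30b-a3b vs {baseline}: {ratio:.1f}x")
```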

On a 24 GB RTX 3090, this model cannot be fully loaded into VRAM (45 GB measured here): it has to spill into system RAM, which divides performance by an estimated 3-5×.

2. The 70B ceiling: all at 4.7 tok/s

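llama3.3:70b, nemotron:70b, and deepseek-r1:70b all land on the same floor of 4.7 tok/s. The limit is LPDDR5x memory bandwidth (~273 GB/s), versus 936 GB/s on a dedicated GPU like the RTX 3090: a dense model has to read all of its weights for every generated token, so three models of the same size end up at the same speed.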
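One way to sanity-check the bandwidth explanation is to turn it around: if decoding were limited purely by memory bandwidth, then bandwidth divided by tok/s gives an upper bound on how many gigabytes can be streamed per generated token. The sketch below assumes the ~273 GB/s figure is the achievable peak and ignores caches and compute, so it is a rough model rather than a measurement; the function name is illustrative.

```python
GB10_BANDWIDTH_GBPS = 273.0  # approximate LPDDR5x figure quoted in the text


def max_gb_read_per_token(bandwidth_gbps: float, tokens_per_s: float) -> float:
    """Upper bound on GB streamed from memory per generated token,
    assuming decoding is purely memory-bandwidth-bound (a simplification)."""
    return bandwidth_gbps / tokens_per_s


# Measured eval tok/s from the ranking table.
measured = {
    "qwen3:30b-a3b (MoE)": 82.5,
    "qwen2.5:7b": 47.8,
    "qwen2.5:32b": 10.6,
    "llama3.3:70b": 4.7,
    "mistral-large:123b": 2.3,
}

for model, tps in measured.items():
    gb = max_gb_read_per_token(GB10_BANDWIDTH_GBPS, tps)
    print(f"{model:22s} {tps:5.1f} tok/s -> at most {gb:6.1f} GB read per token")
```

Under this simplification, the per-token budget grows roughly in proportion to the dense parameter count and stays small for the MoE model, which is consistent with the ranking.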

But on an RTX 3090, these 70B models cannot run from VRAM alone, because their weights exceed the 24 GB limit. The GB10 makes them accessible, even at 4.7 tok/s, or roughly 280 tokens/minute, which remains usable for async tasks (long summaries, batch classification, CoT reasoning).
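To get a feel for what 4.7 tok/s means in practice, the throughput can be converted into waiting time. This is plain arithmetic on the measured figures; the 2,000-token output length is a hypothetical example, not something measured in this benchmark.

```python
def tokens_per_minute(tokens_per_s: float) -> float:
    """Convert a generation rate from tokens/second to tokens/minute."""
    return tokens_per_s * 60


def generation_time_s(n_tokens: int, tokens_per_s: float) -> float:
    """Seconds needed to generate n_tokens at a constant rate."""
    return n_tokens / tokens_per_s


for label, tps in (("70B dense", 4.7), ("qwen3:30b-a3b", 82.5)):
    print(
        f"{label}: {tokens_per_minute(tps):.0f} tok/min, "
        f"{generation_time_s(200, tps):.1f} s for 200 tokens, "
        f"{generation_time_s(2000, tps) / 60:.1f} min for 2,000 tokens"
    )
```

At 4.7 tok/s, a 200-token answer like the one used in this benchmark takes about 43 seconds, which is why the 70B models are better suited to batch jobs than to chat.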

3. The absolute ceiling: 123B dense

Mistral-Large 123B reaches 2.3 tok/s but consumes 115 GB out of 121 GB available — the margin is too tight for interactive use (the system swaps at the slightest background process). In practice, this is the wall.

→ The conclusion: a 122B/10B MoE (Qwen3.5-122B-A10B) should be a far better fit for this machine: ~50 GB in memory (less than half of the 115 GB above), comparable quality, and an estimated speed of 25-30 tok/s. To be revisited once Ollama supports the architecture.
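The memory margins discussed in this section can be checked with simple arithmetic. The sketch below uses the Δ RAM figures from the table for the two models that loaded and the 142 GB size quoted for qwen3:235b-a22b; the helper name is illustrative.

```python
TOTAL_MEMORY_GB = 121  # unified memory available on the test machine


def headroom_gb(model_gb: float, total_gb: float = TOTAL_MEMORY_GB) -> float:
    """Memory left for the OS and other processes (negative: does not fit)."""
    return total_gb - model_gb


# Footprints in GB, taken from the article's own figures.
footprints = {
    "llama3.3:70b": 81,
    "mistral-large:123b": 115,
    "qwen3:235b-a22b": 142,
}

for model, gb in footprints.items():
    left = headroom_gb(gb)
    if left < 0:
        status = f"does not fit ({-left} GB over)"
    else:
        status = f"{left} GB free ({left / TOTAL_MEMORY_GB:.0%} of total)"
    print(f"{model}: {status}")
```

With only 6 GB (about 5%) left over, Mistral-Large leaves almost nothing for the operating system and the KV cache, which matches the swapping observed.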

4. Dense 7B-32B: RTX 3090 remains queen

For equal model size and FP16, an RTX 3090 outperforms the GB10 on everything that fits in 24 GB of VRAM. The GB10 is unbeatable only when the model does not fit in dedicated VRAM. So:

If you plan to run mainly dense ≤ 32B → a used RTX 3090 (~€700) remains faster
If you want to run 70B+ or 30B+ MoE → GB10 / DGX Spark crushes everything

2026 Practical Verdict

Use Case	Recommended Model on GB10	tok/s
Ultra-fast chat, 30B quality	qwen3:30b-a3b	82.5
Compact Opus-lite quality	qwen2.5:14b or phi4	~24
Long reasoning (CoT)	deepseek-r1:70b	4.7
Max quality “70B general” tasks	nemotron:70b	4.7
OCR / Vision	qwen3-vl-30b-a3b (not tested here, see dedicated article)	42

FAQ

Is the GB10 Grace Blackwell (DGX Spark) worth it for 70B+ LLMs? Yes, provided you accept ~5 tok/s in generation. For async tasks (batch analysis, long-form summaries), it’s perfect. For interactive chat, stick to a 30B MoE.

Why does qwen3:30b-a3b crush qwen2.5:32b despite having fewer parameters? Because it is a 30B/3B MoE: only 3 billion parameters are activated per token (a subset of experts), whereas the dense qwen2.5:32b computes all 32B parameters for every token. On bandwidth-limited hardware like the GB10, that translates into roughly 8× the generation speed (82.5 vs. 10.6 tok/s).

How to reproduce this benchmark? Install Ollama via the official script, ollama pull <model>, then ollama run <model> --verbose with the same prompt. The eval rate field gives tok/s.
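For scripted runs, the same numbers can also be read from Ollama's local HTTP API instead of the CLI. The sketch below is not the script used for this article; it assumes Ollama is listening on its default port 11434, that the "200 predicted tokens" setting maps to the `num_predict` option, and that the non-streaming response carries the `eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration` and `load_duration` fields (durations in nanoseconds), as described in the Ollama API documentation.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a duration in nanoseconds to tok/s."""
    return count / (duration_ns / 1e9)


def summarize(response: dict) -> dict:
    """Extract the metrics used in this article from a non-streaming response."""
    return {
        "eval_tps": tokens_per_second(
            response["eval_count"], response["eval_duration"]
        ),
        "prompt_tps": tokens_per_second(
            response["prompt_eval_count"], response["prompt_eval_duration"]
        ),
        "load_s": response["load_duration"] / 1e9,
    }


def run_once(model: str, prompt: str, num_predict: int = 200) -> dict:
    """Send one non-streaming generation request and return its metrics."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
        "keep_alive": 0,  # ask Ollama to unload the model so the next run is cold
    }
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return summarize(json.load(reply))
```

Calling `run_once("qwen3:30b-a3b", prompt)` with the prompt from the methodology note should return a dictionary whose `eval_tps` value corresponds to the eval rate printed by `ollama run --verbose`.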

What about ChatGPT / Claude / Gemini models? Out of scope: this benchmark concerns only local self-hosted models. Cloud APIs are a different category of problem (network latency, cost per token, vendor lock-in).

Methodology Note

All tests use the same prompt (200 predicted tokens):

“Explain in 5 short sentences why the NVIDIA GB10 Grace Blackwell (128 GB unified memory, native FP4, ARMv9.2) excels at MoE models and 70B+ LLMs, but can be outperformed by an RTX 3090 on dense 7B-32B models. Be technical and precise.”

Why this prompt: it is technical enough to engage reasoning layers, short enough not to bias prompt eval time, and language-neutral (FR + EN terms). All tests were launched cold (model unloaded between runs), with a 30-second wait between each to allow memory to return to a stable state.

Raw data (JSON) is available upon request for researchers wishing to reproduce or extend the benchmark.

Affiliate Disclosure

This article contains affiliate links. When you click on a link marked “affiliate link” and make a purchase, I may receive a commission. This does not affect the content or ranking in the table above — the numbers are the raw measured figures. Commissions are used solely to fund testing time (electricity, borrowed hardware, software subscriptions).

Article written on May 28, 2026 — data collected April 27, 2026.