
NVIDIA GB10 Grace Blackwell Benchmark: 17 Local LLMs Tested in 2026 (qwen3 at 82.5 tok/s)

Real-world tests of 17 local LLMs on the NVIDIA GB10 (DGX Spark, 121 GB unified memory). qwen3:30b-a3b dominates at 82.5 tok/s. Data-driven analysis, RAM comparison, and MoE vs. dense verdict.

By Selfhostr Team · independent tests
ⓘ This article may contain affiliate links (no extra cost to you, it supports our tests). See the disclosure.

TL;DR: On an NVIDIA GB10 Grace Blackwell (121 GB unified memory), qwen3:30b-a3b (MoE) generates 82.5 tok/s in chat, the fastest result and the best speed-to-quality trade-off in these tests. The 70B dense models (llama3.3, nemotron, deepseek-r1) all run at 4.7 tok/s: usable only in async mode. At 123B dense, we hit the memory ceiling. Full table below.

Why this benchmark

I had access for a few days to an NVIDIA DGX Spark (MSI EdgeXpert MS-C931), an ARM64 machine equipped with the GB10 Grace Blackwell Superchip (20 ARMv9.2 cores, 121 GiB unified LPDDR5x, integrated Blackwell GPU, CUDA 13). This is one of the first “edge” platforms capable of running 70B dense models locally without disk paging.

I tested 17 models via Ollama + llama.cpp (for those that Ollama 0.21.2 still refused in May 2026, notably Qwen 3.5/3.6). All tests used the same technical prompt (200 predicted tokens) to ensure comparability.

Spoiler: The numbers confirm what theory suggested — MoE models crush comparable dense ones thanks to unified memory, and the RTX 3090 remains superior for dense 7B-32B models despite its limited VRAM.

Methodology

Full ranking (descending eval tok/s)

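| # | Model | Type | Total / Active Params | eval tok/s | prompt tok/s | Load | Δ RAM |
|---|---|---|---|---|---|---|---|
| 🥇 1 | qwen3:30b-a3b | MoE | 30B / 3B | 82.5 | 442 | 19s | 45 GB |
| 🥈 2 | qwen2.5:7b | dense | 7B | 47.8 | 2,379 | 30s | 7 GB |
| 🥉 3 | mistral:7b | dense | 7B | 47.0 | 2,194 | 3s | 9 GB |
| 4 | gemma2:9b | dense | 9B | 40.0 | 1,709 | 14s | 9 GB |
| 5 | mixtral:8x7b | MoE | 46B / 12B | 30.8 | 518 | 23s | 30 GB |
| 6 | qwen2.5:14b | dense | 14B | 24.6 | 1,295 | 5s | 15 GB |
| 7 | phi4 | dense | 14B | 23.2 | 1,235 | 9s | 12 GB |
| 8 | gemma2:27b | dense | 27B | 14.3 | 689 | 6s | 19 GB |
| 9 | qwen2.5:32b | dense | 32B | 10.6 | 631 | 13s | 27 GB |
| 10 | nemotron:70b | dense | 70B | 4.7 | 260 | 27s | (cache) |
| 11 | llama3.3:70b | dense | 70B | 4.7 | 254 | 91s | 81 GB |
| 12 | deepseek-r1:70b | dense | 70B | 4.7 | 218 | 30s | 81 GB |
| 13 | mistral-large:123b | dense | 123B | 2.3 | 119 | 72s | 115 GB |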

Bonus: Models blocked by Ollama 0.21.2

Four models I attempted but which failed to load:

→ However, these last three run fine with llama.cpp built from source from the main branch (CUDA 13, with qwen35moe architecture support). More details in a dedicated upcoming article.

Reading the results

1. The speed sweet spot: qwen3:30b-a3b (MoE)

At 82.5 tok/s, qwen3:30b-a3b is 1.7× faster than the best dense 7B (47.8 tok/s) and about 17× faster than the 70B dense models (4.7 tok/s). MoE advantage: only 3B parameters are “active” and actually computed per token, while the remaining 27B stay resident in unified memory, ready for expert routing.
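These ratios follow directly from the eval tok/s column of the ranking table. A minimal sketch of the arithmetic, using only figures measured in this benchmark (the `speedup` helper is an illustrative name, not part of any tool used here):

```python
# Eval tok/s figures copied from the ranking table.
EVAL_TOK_S = {
    "qwen3:30b-a3b": 82.5,  # MoE, 30B total / 3B active
    "qwen2.5:7b": 47.8,     # fastest dense 7B
    "llama3.3:70b": 4.7,    # all three 70B dense models measured 4.7
}


def speedup(fast: str, slow: str, table: dict = EVAL_TOK_S) -> float:
    """Return how many times faster `fast` generates tokens than `slow`."""
    return table[fast] / table[slow]


if __name__ == "__main__":
    print(f"vs best dense 7B: {speedup('qwen3:30b-a3b', 'qwen2.5:7b'):.1f}x")
    print(f"vs 70B dense:     {speedup('qwen3:30b-a3b', 'llama3.3:70b'):.1f}x")
```

Run as a script, this prints 1.7x and 17.6x.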

On a 24 GB RTX 3090, this model cannot be fully loaded into VRAM (45 GB required): part of it has to be offloaded to system RAM, which cuts performance by a factor of 3 to 5.

2. The 70B ceiling: all at 4.7 tok/s

llama3.3:70b, nemotron:70b, and deepseek-r1:70b all land on the same floor of 4.7 tok/s. At this size, generation is limited by memory bandwidth: the GB10's LPDDR5x delivers ~273 GB/s, versus 936 GB/s for a dedicated GPU like the RTX 3090.
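The bandwidth argument can be turned into a rough upper bound: if every generated token requires streaming all active weights from memory once, tok/s cannot exceed bandwidth divided by the bytes read per token. Here is a minimal sketch under that simplifying assumption. The 40 GB weight size is a hypothetical figure for a 4-bit quantized 70B model, not something measured in this benchmark, and the function name is illustrative.

```python
def max_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on generation speed for a memory-bound model.

    Assumes each token streams `weights_read_gb` of weights exactly once
    and ignores KV-cache traffic, compute time, and kernel overhead.
    """
    if weights_read_gb <= 0:
        raise ValueError("weights_read_gb must be positive")
    return bandwidth_gb_s / weights_read_gb


GB10_BW = 273.0         # GB/s, LPDDR5x figure quoted in the article
RTX3090_BW = 936.0      # GB/s, figure quoted in the article
ASSUMED_70B_GB = 40.0   # ASSUMPTION: hypothetical 4-bit 70B weight size

if __name__ == "__main__":
    print(f"GB10 bound:    {max_tok_s(GB10_BW, ASSUMED_70B_GB):.1f} tok/s")
    print(f"Bandwidth gap: {RTX3090_BW / GB10_BW:.1f}x in favor of the 3090")
```

Under these assumptions the bound comes out near 6.8 tok/s, so a measured 4.7 tok/s sits plausibly below it. The real ceiling depends on the actual quantization and on how much of the model is read per token.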

But on an RTX 3090, these 70B models cannot run from VRAM at all, because their weights exceed the 24 GB limit. The GB10 makes them accessible, even at 4.7 tok/s, or about 280 tokens per minute, which remains usable for async tasks (long summaries, batch classification, CoT reasoning).
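To judge whether 4.7 tok/s is acceptable for a given async job, it helps to convert throughput into wall-clock time. A small sketch of that conversion; the 2,000-token summary length is a hypothetical example, and both helper names are illustrative:

```python
def tokens_per_minute(tok_per_s: float) -> float:
    """Convert a generation rate from tokens/second to tokens/minute."""
    return tok_per_s * 60


def generation_minutes(n_tokens: int, tok_per_s: float) -> float:
    """Minutes needed to generate `n_tokens` at a steady `tok_per_s`."""
    return n_tokens / tok_per_s / 60


if __name__ == "__main__":
    rate = 4.7  # eval tok/s measured for the 70B dense models
    print(f"{tokens_per_minute(rate):.0f} tokens/minute")
    # ASSUMPTION: a 2,000-token summary, chosen only for illustration.
    print(f"{generation_minutes(2000, rate):.1f} minutes for 2,000 tokens")
```

At 4.7 tok/s this works out to 282 tokens per minute, or roughly 7 minutes for the hypothetical 2,000-token output, excluding load and prompt processing time.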

3. The absolute ceiling: 123B dense

Mistral-Large 123B reaches 2.3 tok/s but consumes 115 GB out of 121 GB available — the margin is too tight for interactive use (the system swaps at the slightest background process). In practice, this is the wall.

→ The conclusion: a 122B/10B MoE (Qwen3.5-122B-A10B) should be a much better fit for this machine: ~50 GB in memory (less than half of Mistral-Large's 115 GB), comparable quality, and an estimated speed of 25-30 tok/s. To be revisited once Ollama supports the architecture.

4. Dense 7B-32B: RTX 3090 remains queen

For equal model size and FP16, an RTX 3090 outperforms the GB10 on everything that fits in 24 GB of VRAM. The GB10 is unbeatable only when the model does not fit in dedicated VRAM. So:

2026 Practical Verdict

| Use Case | Recommended Model on GB10 | tok/s |
|---|---|---|
| Ultra-fast chat, 30B quality | qwen3:30b-a3b | 82.5 |
| Compact Opus-lite quality | qwen2.5:14b or phi4 | ~24 |
| Long reasoning (CoT) | deepseek-r1:70b | 4.7 |
| Max quality “70B general” tasks | nemotron:70b | 4.7 |
| OCR / Vision | qwen3-vl-30b-a3b (not tested here, see dedicated article) | 42 |

Further reading

If you are building your own self-hosted LLM stack with a comparable machine, here are the resources that are truly worth a look:

FAQ

Is the GB10 Grace Blackwell (DGX Spark) worth it for 70B+ LLMs? Yes, provided you accept ~5 tok/s in generation. For async tasks (batch analysis, long-form summaries), it works well. For interactive chat, stick to a 30B MoE.

Why does qwen3:30b-a3b crush qwen2.5:32b despite having fewer parameters? Because it is a 30B/3B MoE: only 3 billion parameters are activated per token (a subset of experts), whereas the dense qwen2.5:32b computes all 32B parameters for every token. On bandwidth-limited hardware like the GB10, that translates into roughly 8× higher throughput in these tests (82.5 vs 10.6 tok/s).

How to reproduce this benchmark? Install Ollama via the official script, run ollama pull <model>, then ollama run <model> --verbose with the same prompt. The eval rate field reports generation speed in tok/s.
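For scripted runs, the same figures can also be read from Ollama's local REST API instead of the CLI output. The sketch below is one possible approach, not necessarily how the numbers in this article were collected. It assumes Ollama is listening on its default port (11434) and relies on the duration fields of a non-streaming /api/generate reply, which are expressed in nanoseconds. The helper names are illustrative, and the prompt placeholder must be replaced with the benchmark prompt.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration to tok/s."""
    return count / (duration_ns / 1e9)


def summarize(resp: dict) -> dict:
    """Extract the benchmark columns from a non-streaming reply.

    Note: prompt_eval_count may be missing if the prompt was cached,
    which is one more reason to run each model cold.
    """
    return {
        "eval_tok_s": tokens_per_second(resp["eval_count"], resp["eval_duration"]),
        "prompt_tok_s": tokens_per_second(
            resp["prompt_eval_count"], resp["prompt_eval_duration"]
        ),
        "load_s": resp["load_duration"] / 1e9,
    }


def run_once(model: str, prompt: str, num_predict: int = 200) -> dict:
    """Send one generation request and return the summarized timings."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # single JSON reply that includes the timing fields
        "options": {"num_predict": num_predict},  # cap at 200 generated tokens
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as reply:
        return summarize(json.load(reply))


if __name__ == "__main__":
    PROMPT = "<paste the benchmark prompt here>"  # placeholder, not the real prompt
    print(run_once("qwen3:30b-a3b", PROMPT))
```

Keeping `summarize` separate from the network call makes the conversion easy to check against the eval rate printed by ollama run --verbose.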

What about ChatGPT / Claude / Gemini models? Out of scope: this benchmark concerns only local self-hosted models. Cloud APIs are a different category of problem (network latency, cost per token, vendor lock-in).


Methodology Note

All tests use the same prompt (200 predicted tokens):

“Explain in 5 short sentences why the NVIDIA GB10 Grace Blackwell (128 GB unified memory, native FP4, ARMv9.2) excels at MoE models and 70B+ LLMs, but can be outperformed by an RTX 3090 on dense 7B-32B models. Be technical and precise.”

Why this prompt: it is technical enough to engage reasoning layers, short enough not to bias prompt eval time, and language-neutral (FR + EN terms). All tests were launched cold (model unloaded between runs), with a 30-second wait between each to allow memory to return to a stable state.
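The cold-run protocol described above can be sketched as a small loop. This is an illustration of the procedure, not the script actually used for this article. `run_one` is a hypothetical callable that benchmarks a single model and returns its timings. With Ollama's API, one way to force an unload between runs is to send "keep_alive": 0 in the request, which is assumed here to be handled inside `run_one`.

```python
import time
from typing import Callable, Dict, Iterable


def run_cold_suite(
    models: Iterable[str],
    run_one: Callable[[str], dict],
    wait_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, dict]:
    """Benchmark each model once, pausing `wait_s` seconds between runs.

    `run_one` is expected to load the model cold, run the shared prompt,
    unload the model, and return its measurements as a dict.
    """
    results: Dict[str, dict] = {}
    for index, model in enumerate(models):
        if index > 0:
            sleep(wait_s)  # let memory settle before the next cold load
        results[model] = run_one(model)
    return results


if __name__ == "__main__":
    # Dry run with a stand-in benchmark function and no real waiting.
    fake = lambda name: {"model": name, "eval_tok_s": None}
    print(run_cold_suite(["qwen3:30b-a3b", "phi4"], fake, sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting 30 seconds per model.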

Raw data (JSON) is available upon request for researchers wishing to reproduce or extend the benchmark.


Affiliate Disclosure

This article contains affiliate links. When you click on a link marked “affiliate link” and make a purchase, I may receive a commission. This does not affect the content or ranking in the table above — the numbers are the raw measured figures. Commissions are used solely to fund testing time (electricity, borrowed hardware, software subscriptions).

Article written on May 28, 2026 — data collected April 27, 2026.

Tags: benchmark, NVIDIA, GB10, Grace Blackwell, LLM, local AI, qwen3, DGX Spark, MoE, performance
