GPU requirements by model
Indicative H100 builds for serving 32 concurrent users at an 8k-token context, at FP16 and INT4. The Weights column shows the FP16 figure. Configure your own scenario in the tool above.
| Model | Params | Weights (FP16) | GPUs · FP16 | GPUs · INT4 | Monthly cost from (FP16) |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8.03B | 15 GiB | 1× | 1× | £1,720/mo |
| Mistral Small 3 (24B) | 23.6B | 44 GiB | 2× | 1× | £2,279/mo |
| Gemma 2 27B | 27.2B | 50.7 GiB | 4× | 2× | £3,397/mo |
| Llama 3.3 70B | 70.6B | 131.5 GiB | 4× | 2× | £3,397/mo |
| Qwen2.5 72B | 72.7B | 135.4 GiB | 4× | 2× | £3,397/mo |
| Mixtral 8x7B · MoE | 46.7B | 87 GiB | 2× | 1× | £2,279/mo |
| Llama 3.1 405B | 405.85B | 756 GiB | 16× | 8× | £11,073/mo |
| DeepSeek-R1 · MoE | 671B | 1249.8 GiB | 24× | 8× | £16,512/mo |
Indicative only. GPU counts snap to a deployable topology (1, 2, 4, 8, then multiples of 8). Large models span multiple nodes.
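The snapping rule in the note above can be sketched as follows. This is a minimal illustration, not the tool's own code: the function name `snap_gpu_count` is hypothetical, and it assumes the raw requirement is always rounded up, never down.

```python
import math


def snap_gpu_count(raw_gpus: float) -> int:
    """Round a raw GPU requirement up to a deployable count.

    Assumed rule, taken from the note above: 1, 2, 4 or 8 GPUs,
    then whole multiples of 8 for multi-node builds.
    """
    if raw_gpus <= 0:
        raise ValueError("raw_gpus must be positive")
    needed = math.ceil(raw_gpus)
    for size in (1, 2, 4, 8):
        if needed <= size:
            return size
    # Beyond 8 GPUs, round up to the next multiple of 8.
    return 8 * math.ceil(needed / 8)


# Hypothetical raw requirements, chosen only to exercise each branch.
for raw in (0.6, 1.7, 3.0, 5.2, 9.4, 17.0):
    print(f"{raw:>5} raw GPUs -> {snap_gpu_count(raw)}x")
```

Under this rule a model that needs three GPUs' worth of memory is quoted as a 4× build, which is why several rows in the table share the same count.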
How we work it out
The engine's methods are transparent and its sources are cited. Model architecture constants are read from each model’s published configuration; hardware, power and cost figures come from first-party datasheets and public indices. Everything is indicative and clearly labelled.
- Weights: params × bytes/param (Modal “How much VRAM for inference”; ~2 GB per 1B params at FP16).
- KV cache (GQA-correct): 2 · layers · num_kv_heads · head_dim · context · batch · dtype-bytes (KV-cache/GQA literature).
- Throughput: memory-bandwidth roofline for decode, anchored to NVIDIA NIM LLM benchmarking tables (indicative ranges).
- Power: max system draw (DGX H100 ≈ 10.2 kW for 8 GPUs); facility = IT × PUE (Uptime Institute PUE survey).
- Cooling: 1 kW = 3412.142 BTU/hr; 12,000 BTU/hr per ton (lib/ups/units).
- Cloud: indicative $/GPU-hr indices 2025–26 (~$3 H100 on-demand), user-overridable, converted to GBP.
- Per-model Hugging Face config.json (num_hidden_layers, hidden_size, num_attention_heads, num_key_value_heads, head_dim, intermediate_size, max_position_embeddings).
- MoE active-param counts per each model’s technical report (Mixtral 2-of-8; DeepSeek 8-of-256, 37B active / 671B total).
- DeepSeek MLA cache = kv_lora_rank(512) + qk_rope_head_dim(64) = 576 latent dims/token/layer (DeepSeek-V3 technical report).
- NVIDIA H100 datasheet (SXM 80GB HBM3 3.35TB/s 700W; NVL 94GB 3.9TB/s).
- NVIDIA H200 datasheet (141GB HBM3e 4.8TB/s 700W).
- NVIDIA A100 datasheet (SXM 80GB ~2.0TB/s 400W).
- NVIDIA L40S datasheet (48GB GDDR6 864GB/s 350W, no NVLink).
- NVIDIA RTX 6000 Ada datasheet (48GB GDDR6 960GB/s 300W, no NVLink).
- NVIDIA B200 / Blackwell datasheet (180GB HBM3e ~8TB/s ~1000W).
- NVIDIA DGX H100/H200 user guide: 8× SXM GPU, 8U, ~10.2 kW max system power (not 19.8 kW PSU nameplate).
- HGX 8-GPU baseboard platforms (Supermicro/Dell/Lenovo) for SXM; 4U–5U PCIe GPU servers for up to 8 PCIe cards.
- DGX SuperPOD electrical/thermal design: ≤4 DGX-class nodes per rack at typical facility power density.
- H100 on-demand pricing indices 2025–26 (~$1.49–6.98/GPU-hr, ~$3 average across providers); reserved ~$2.35.
- Specialised GPU clouds ~$2–3/hr; hyperscalers ~$4–8/hr on-demand.
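The power, cooling and throughput rules in the list above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the PUE of 1.5 is an assumed example value rather than a figure from the tool, and the roofline is the bare upper bound (bandwidth divided by weight bytes) without the NIM benchmark anchoring the tool applies.

```python
BTU_PER_HR_PER_KW = 3412.142   # conversion constant from the list above
BTU_PER_HR_PER_TON = 12_000    # one ton of cooling, from the list above


def facility_power_kw(it_load_kw: float, pue: float) -> float:
    """Facility power = IT load x PUE."""
    return it_load_kw * pue


def cooling_load(it_load_kw: float) -> tuple[float, float]:
    """Return (BTU/hr, tons), assuming all IT power ends up as heat."""
    btu_per_hr = it_load_kw * BTU_PER_HR_PER_KW
    return btu_per_hr, btu_per_hr / BTU_PER_HR_PER_TON


def decode_tokens_per_s_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Single-stream decode ceiling: every token re-reads all weights once.

    Ignores KV-cache reads, kernel efficiency and batching, so real
    throughput per stream sits below this bound.
    """
    return bandwidth_gb_s / weights_gb


# One DGX H100 node at its ~10.2 kW maximum, with an assumed PUE of 1.5.
it_kw = 10.2
print(f"facility power: {facility_power_kw(it_kw, pue=1.5):.1f} kW")
btu, tons = cooling_load(it_kw)
print(f"cooling load:   {btu:,.0f} BTU/hr ({tons:.2f} tons)")

# Llama 3.1 8B at FP16 (8.03B params x 2 bytes) on one H100 SXM (3.35 TB/s).
ceiling = decode_tokens_per_s_ceiling(3350, 8.03 * 2)
print(f"decode ceiling: {ceiling:.0f} tokens/s per stream")
```

The roofline is a ceiling rather than a prediction, which is why the tool reports throughput as an indicative range.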
AI/GPU sizing FAQs
How many GPUs do I need to run Llama 3 70B?
At FP16 a 70-billion-parameter model needs about 131 GiB just for weights, so it fits on two 80 GB H100s for a single stream. Serving many concurrent users adds KV-cache memory, so real deployments typically use 4–8 H100s. Quantising to INT4 cuts the weights to ~33 GiB. Enter your context length and concurrent users above for a figure specific to your workload.
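The weight figures in this answer follow from parameters × bytes per parameter. A minimal sketch, using the 70.6B parameter count from the table above; the helper name `weights_gib` is hypothetical, and treating an "80 GB" card as 80 × 10⁹ bytes is a conservative assumption about usable capacity.

```python
GIB = 2**30

# Bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weights_gib(params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GiB."""
    return params * BYTES_PER_PARAM[dtype] / GIB


params_70b = 70.6e9  # Llama 3.3 70B, from the table above

fp16 = weights_gib(params_70b, "fp16")
int4 = weights_gib(params_70b, "int4")
two_h100 = 2 * 80e9 / GIB  # assumption: 80 GB = 80e9 bytes per card

print(f"FP16 weights: {fp16:.1f} GiB")
print(f"INT4 weights: {int4:.1f} GiB")
print(f"Two H100s:    {two_h100:.1f} GiB, headroom {two_h100 - fp16:.1f} GiB")
```

The headroom left on two cards is small, which is why KV cache for more than a handful of concurrent streams pushes the build to four GPUs or more.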
How is the VRAM calculated?
VRAM = model weights + KV cache + overhead. Weights = parameters × bytes-per-parameter (FP16 = 2, FP8/INT8 = 1, INT4 = 0.5). The KV cache is computed the correct way for modern models — using the number of key/value heads (grouped-query attention), not the query-head count, which is the mistake most calculators make and which overstates memory 4–8× on Llama-class models. DeepSeek’s Multi-head Latent Attention is handled separately because its cache is far smaller.
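To make the GQA point concrete, the sketch below evaluates the KV-cache formula twice, once with the key/value head count and once with the query head count, for the table's scenario of 32 users at 8k context and FP16. The function names are hypothetical. The Llama layer and head counts are the values from the models' published config.json files; the 61-layer count used for DeepSeek is likewise taken from its public configuration and should be treated as an assumption here.

```python
GIB = 2**30


def kv_cache_bytes(layers, kv_heads, head_dim, context, batch, dtype_bytes=2):
    """2 (K and V) x layers x kv_heads x head_dim x context x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * context * batch * dtype_bytes


def mla_cache_bytes(layers, latent_dims, context, batch, dtype_bytes=2):
    """MLA stores one compressed latent vector per token per layer."""
    return layers * latent_dims * context * batch * dtype_bytes


CONTEXT, BATCH = 8192, 32

# (layers, query heads, key/value heads, head_dim)
models = {
    "Llama 3.1 8B": (32, 32, 8, 128),
    "Llama 3.3 70B": (80, 64, 8, 128),
}

for name, (layers, q_heads, kv_heads, head_dim) in models.items():
    correct = kv_cache_bytes(layers, kv_heads, head_dim, CONTEXT, BATCH)
    wrong = kv_cache_bytes(layers, q_heads, head_dim, CONTEXT, BATCH)
    print(f"{name}: {correct / GIB:.0f} GiB with KV heads, "
          f"{wrong / GIB:.0f} GiB with query heads ({wrong // correct}x too high)")

# DeepSeek MLA: 576 latent dims per token per layer, assumed 61 layers.
mla = mla_cache_bytes(61, 576, CONTEXT, BATCH)
print(f"DeepSeek MLA: {mla / GIB:.1f} GiB")
```

The overstatement factor is simply the ratio of query heads to key/value heads, which is 4 for the 8B model and 8 for the 70B model.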
Is it cheaper to own GPUs or rent the cloud?
It depends on utilisation. Renting on-demand is cheaper when GPUs sit idle much of the time; owning (financed, plus power and cooling) wins once utilisation is sustained. The tool shows the monthly figures side by side and the break-even month for a purchase versus cloud — drag the utilisation slider to see the crossover.
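One simple way to model the crossover is sketched below. It is not the tool's calculation: it assumes an upfront purchase rather than financing, a fixed monthly owning cost that does not vary with utilisation, and cloud billing only for hours actually used. The function name and every monetary figure in the example are hypothetical.

```python
def break_even_month(capex, own_monthly_opex, cloud_rate_per_gpu_hr, gpus,
                     utilisation, horizon_months=60, hours_per_month=730):
    """First month in which cumulative cloud spend reaches the cumulative
    cost of owning, or None if that never happens within the horizon."""
    cloud_monthly = cloud_rate_per_gpu_hr * gpus * hours_per_month * utilisation
    for month in range(1, horizon_months + 1):
        own_total = capex + own_monthly_opex * month
        cloud_total = cloud_monthly * month
        if cloud_total >= own_total:
            return month
    return None


# Hypothetical 8-GPU build: 200,000 capex, 1,500/month power and cooling,
# 2.40 per GPU-hour on-demand (all in the same currency).
for utilisation in (0.2, 0.5, 0.9):
    month = break_even_month(200_000, 1_500, 2.40, 8, utilisation)
    label = f"month {month}" if month is not None else "not within 60 months"
    print(f"utilisation {utilisation:.0%}: break-even {label}")
```

With these example inputs, low utilisation never reaches break-even within the horizon while sustained utilisation does, which mirrors the behaviour of the utilisation slider described above.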
Are the costs a formal quote?
No. Every figure is indicative and for planning only. Capex uses reseller guide prices, and the monthly figure is derived from a representative UK rental factor; throughput in particular varies with framework, batching and tuning. Request a quote and a specialist will turn the sizing into a firm, buildable quotation.
Which GPUs and models are supported?
GPUs: H200, H100 (SXM and NVL), B200, A100, L40S and RTX 6000 Ada. Models: the Llama 3.x family, Mistral & Mixtral, Qwen 2.5, DeepSeek-R1/V3, Gemma 2/3, Phi, Command R and Falcon — with architecture constants read from each model’s published configuration.
Can I share or embed the result?
Yes. Every configuration is encoded in the page URL, so “Save / share this build” copies a link that reproduces the exact result for a colleague. The tool can also be embedded on your own site.