FREE TOOL · AI INFRASTRUCTURE SIZING

How much GPU do you actually need?

Pick a model and your workload — get the exact GPUs, VRAM, a buildable server & rack spec, power, cooling, capex, monthly finance and a cloud-vs-own break-even. The only GPU calculator that ends in a real, orderable build.

✓ GQA-correct VRAM · ✓ Full server & rack BOM · ✓ Power, cooling & finance · ✓ Cloud vs own break-even

Model to self-host

70.6B params · 80 layers · GQA · up to 128k context

Weight precision (lower = less VRAM)

Context window: 8k tokens

Concurrent users: 32

Target GPU

Recommended build

4× H100

1 × HGX/DGX-class SXM node · 10U · 1 rack · 5.4 kW

VRAM required: 217 GiB / 288 GiB usable

■ Weights 131.5 GiB · ■ KV cache 80 GiB · ■ Overhead 5.2 GiB

Throughput

1,822 tok/s

~57/user · indicative

Max concurrent

~61

at 8k context

Cooling

18,426 BTU/hr

1.5 tons · 7.6 kW facility

From

£3,397/mo

£158,000 capex · 60mo

Get this build quoted →

Your build — bill of materials

An orderable, part-level spec. Indicative ex-VAT — a formal quote follows.

4× NVIDIA H100 (80GB SXM): £104,000

1× HGX/DGX-class SXM chassis (ex-GPU): £45,000

1× Rack, ToR switching & PDUs: £9,000

Indicative capex: £158,000

or £3,397/mo on hire purchase (60mo)

Indicative estimate — subject to change & credit approval.

Power & cooling

Max system draw at the wall, facility at PUE 1.4, and the heat to reject.

IT load

5.4 kW

GPUs + servers

Facility power

7.6 kW

incl. PUE 1.4

Cooling

18,426

BTU/hr

Cooling

1.5

tons of refrigeration

Rack density is capped for thermal realism. This build is a liquid-ready HGX deployment in 1 rack.
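The conversions behind these tiles can be reproduced in a few lines. The sketch below is illustrative only: the function name and the returned keys are made up for this example, and it assumes (as the figures above imply) that heat rejection is computed from the IT load, while PUE is applied only to facility power.

```python
# Conversion constants as quoted in the method notes on this page.
BTU_PER_HR_PER_KW = 3412.142
BTU_PER_HR_PER_TON = 12_000


def power_and_cooling(it_load_kw: float, pue: float = 1.4) -> dict:
    """Hypothetical helper: IT load in kW -> facility power and cooling load."""
    facility_kw = it_load_kw * pue                    # IT load scaled by PUE
    heat_btu_hr = it_load_kw * BTU_PER_HR_PER_KW      # heat to reject (assumed: from IT load)
    cooling_tons = heat_btu_hr / BTU_PER_HR_PER_TON   # tons of refrigeration
    return {
        "facility_kw": facility_kw,
        "heat_btu_hr": heat_btu_hr,
        "cooling_tons": cooling_tons,
    }


if __name__ == "__main__":
    result = power_and_cooling(5.4)  # the 4x H100 build above
    print(round(result["facility_kw"], 1),
          round(result["heat_btu_hr"]),
          round(result["cooling_tons"], 1))
```

For the 5.4 kW build this rounds to the 7.6 kW, 18,426 BTU/hr and 1.5 tons shown in the tiles.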

Own it vs rent the cloud

Owned = financed monthly + power & cooling. Cloud = on-demand £/GPU-hr at 60% utilisation. All editable below.

Own it (financed)

£4,784/mo

£3,397 finance + £1,387 power

£158,000 asset · you own it at the end (HP)

Rent the cloud (on-demand)

£3,889/mo

£2,334/mo on a 1-yr reserved commit

nothing owned · price rises with usage & market

At 60% utilisation on-demand cloud is cheaper monthly — worth revisiting as usage climbs. Owning wins once utilisation is sustained (drag it up below).
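As a rough illustration of how the two monthly figures relate, the sketch below recomputes them and estimates the utilisation at which they cross. Several inputs are assumptions rather than values published by the tool: 730 hours per month, and an on-demand rate of £2.22 per GPU-hour (roughly the ~$3 index converted to GBP, chosen because it reproduces the £3,889 figure). It also keeps the owned power cost flat at every utilisation, which is a simplification. All function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a calendar month


def cloud_monthly(gpus: int, rate_per_gpu_hr: float, utilisation: float) -> float:
    """On-demand cloud cost: only the utilised GPU-hours are billed."""
    return gpus * HOURS_PER_MONTH * utilisation * rate_per_gpu_hr


def owned_monthly(finance: float, power: float) -> float:
    """Owned cost: finance payment plus power and cooling."""
    return finance + power


def crossover_utilisation(gpus: int, rate_per_gpu_hr: float, owned: float) -> float:
    """Utilisation at which on-demand cloud costs the same as owning."""
    return owned / (gpus * HOURS_PER_MONTH * rate_per_gpu_hr)


if __name__ == "__main__":
    rate_gbp = 2.22  # assumption, see lead-in
    owned = owned_monthly(finance=3397, power=1387)
    cloud = cloud_monthly(gpus=4, rate_per_gpu_hr=rate_gbp, utilisation=0.60)
    print(f"owned £{owned:,.0f}/mo, cloud £{cloud:,.0f}/mo")
    print(f"crossover at {crossover_utilisation(4, rate_gbp, owned):.0%} utilisation")
```

Under these assumptions the crossover sits a little under three-quarters utilisation. The real slider may use different rates, so treat the number as indicative.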

Get this build quoted & configured

4× H100

We'll turn this sizing into a firm, part-level quotation — GPUs, servers, networking, power & cooling — with finance options. No obligation.

GPU requirements by model

Indicative build to serve 32 concurrent users at 8k context on H100, at FP16 and INT4. Weights are the FP16 figure. Configure your own scenario in the tool above.

Model	Params	Weights (FP16)	GPUs · FP16	GPUs · INT4	From (FP16)
Llama 3.1 8B	8.03B	15 GiB	1×	1×	£1,720/mo
Mistral Small 3 (24B)	23.6B	44 GiB	2×	1×	£2,279/mo
Gemma 2 27B	27.2B	50.7 GiB	4×	2×	£3,397/mo
Llama 3.3 70B	70.6B	131.5 GiB	4×	2×	£3,397/mo
Qwen2.5 72B	72.7B	135.4 GiB	4×	2×	£3,397/mo
Mixtral 8x7B · MoE	46.7B	87 GiB	2×	1×	£2,279/mo
Llama 3.1 405B	405.85B	756 GiB	16×	8×	£11,073/mo
DeepSeek-R1 (671B) · MoE	671B	1249.8 GiB	24×	8×	£16,512/mo

Indicative only. GPU counts snap to a deployable topology (1, 2, 4, 8, then multiples of 8). Large models span multiple nodes.
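The snapping rule can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name is hypothetical, and the 72 GiB usable per 80 GB H100 is inferred from the "288 usable" figure shown for the 4-GPU build.

```python
import math


def snap_gpu_count(vram_required_gib: float, usable_per_gpu_gib: float) -> int:
    """Round a raw GPU requirement up to a deployable topology:
    1, 2, 4, 8, then multiples of 8."""
    needed = max(1, math.ceil(vram_required_gib / usable_per_gpu_gib))
    for size in (1, 2, 4, 8):
        if needed <= size:
            return size
    return 8 * math.ceil(needed / 8)


if __name__ == "__main__":
    usable = 72.0  # GiB per H100, inferred from 288 GiB usable across 4 GPUs
    print(snap_gpu_count(216.7, usable))   # Llama 3.3 70B: weights + KV cache + overhead
    print(snap_gpu_count(1249.8, usable))  # DeepSeek-R1 FP16 weights alone (a lower bound)
```

The 70B case needs just over three GPUs' worth of usable memory, so it snaps to 4. The DeepSeek weights alone need 18 GPUs, which snaps to 24.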

How we work it out

The engine is transparent and cited. Model architecture constants are read from each model’s published configuration; hardware, power and cost figures come from first-party datasheets and public indices. Everything is indicative and clearly labelled.

›Weights: params × bytes/param (Modal “How much VRAM for inference”; ~2 GB per 1B params at FP16).
›KV cache (GQA-correct): 2 · layers · num_kv_heads · head_dim · context · batch · dtype-bytes (KV-cache/GQA literature).
›Throughput: memory-bandwidth roofline for decode, anchored to NVIDIA NIM LLM benchmarking tables — indicative ranges.
›Power: max system draw (DGX H100 ≈ 10.2 kW for 8 GPUs); facility = IT × PUE (Uptime Institute PUE survey).
›Cooling: 1 kW = 3412.142 BTU/hr; 12,000 BTU/hr per ton (lib/ups/units).
›Cloud: indicative $/GPU-hr indices 2025–26 (~$3 H100 on-demand), user-overridable, converted to GBP.
›Per-model Hugging Face config.json (num_hidden_layers, hidden_size, num_attention_heads, num_key_value_heads, head_dim, intermediate_size, max_position_embeddings).
›MoE active-param counts per each model’s technical report (Mixtral 2-of-8; DeepSeek 8-of-256, 37B active / 671B total).
›DeepSeek MLA cache = kv_lora_rank(512) + qk_rope_head_dim(64) = 576 latent dims/token/layer (DeepSeek-V3 technical report).
›NVIDIA H100 datasheet (SXM 80GB HBM3 3.35TB/s 700W; NVL 94GB 3.9TB/s).
›NVIDIA H200 datasheet (141GB HBM3e 4.8TB/s 700W).
›NVIDIA A100 datasheet (SXM 80GB ~2.0TB/s 400W).
›NVIDIA L40S datasheet (48GB GDDR6 864GB/s 350W, no NVLink).
›NVIDIA RTX 6000 Ada datasheet (48GB GDDR6 960GB/s 300W, no NVLink).
›NVIDIA B200 / Blackwell datasheet (180GB HBM3e ~8TB/s ~1000W).
›NVIDIA DGX H100/H200 user guide — 8× SXM GPU, 8U, ~10.2 kW max system power (not 19.8 kW PSU nameplate).
›HGX 8-GPU baseboard platforms (Supermicro/Dell/Lenovo) for SXM; 4U–5U PCIe GPU servers for up to 8 PCIe cards.
›DGX SuperPOD electrical/thermal design — ≤4 DGX-class nodes per rack at typical facility power density.
›H100 on-demand pricing indices 2025–26 (~$1.49–6.98/GPU-hr, ~$3 average across providers); reserved ~$2.35.
›Specialised GPU clouds ~$2–3/hr; hyperscalers ~$4–8/hr on-demand.
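The KV-cache formula in the notes above can be checked against the 80 GiB shown for the default scenario. The sketch below is illustrative: the function names are hypothetical, "8k" is taken to mean 8,192 tokens, and the architecture values (8 key/value heads, 64 query heads, head_dim 128 for Llama 3.3 70B; 61 layers for DeepSeek-V3) are assumptions taken from the models' published configurations rather than from this page.

```python
GIB = 1024 ** 3


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, batch: int, dtype_bytes: int = 2) -> int:
    """2 (keys and values) x layers x kv_heads x head_dim x context x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * context * batch * dtype_bytes


def mla_cache_bytes(layers: int, context: int, batch: int,
                    latent_dims: int = 576, dtype_bytes: int = 2) -> int:
    """DeepSeek-style MLA: one compressed latent vector per token per layer."""
    return layers * latent_dims * context * batch * dtype_bytes


if __name__ == "__main__":
    # Llama 3.3 70B, 32 users at 8k context, FP16 cache.
    gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context=8192, batch=32)
    # Same model sized with the query-head count instead (the mistake GQA-correct sizing avoids).
    mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, context=8192, batch=32)
    print(gqa / GIB, mha / GIB)
    # DeepSeek-V3/R1 latent cache for the same workload.
    print(mla_cache_bytes(layers=61, context=8192, batch=32) / GIB)
```

Using the key/value head count gives 80 GiB, while using the query-head count would give 640 GiB, an 8× overstatement for this model.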

Take it further

NVIDIA DGX systems →

The 8-GPU HGX/DGX platforms behind these builds.

GPU accelerators →

H100, H200, L40S and A100 accelerator specs.

Server room cooling →

Turn the kW load into cooling in BTU/hr and tons.

UPS sizing →

Size the UPS and circuits around the GPU load.

IT finance calculator →

Finance the whole cluster — compare HP, lease and subscription.

Cloud vs on-prem TCO →

The broader own-vs-cloud total cost picture.

How much VRAM does an LLM need? →

Weights + KV cache + overhead, explained.

Self-hosting vs cloud GPUs →

The honest UK break-even on owning vs renting.

AI/GPU sizing FAQs

How many GPUs do I need to run Llama 3 70B?

At FP16 a 70-billion-parameter model needs about 131 GiB just for weights, so it fits on two 80 GB H100s for a single stream. Serving many concurrent users adds KV-cache memory, so real deployments typically use 4–8 H100s. Quantising to INT4 cuts the weights to ~33 GiB. Enter your context length and concurrent users above for a figure specific to your workload.

How is the VRAM calculated?

VRAM = model weights + KV cache + overhead. Weights = parameters × bytes-per-parameter (FP16 = 2, FP8/INT8 = 1, INT4 = 0.5). The KV cache is computed the correct way for modern models — using the number of key/value heads (grouped-query attention), not the query-head count, which is the mistake most calculators make and which overstates memory 4–8× on Llama-class models. DeepSeek’s Multi-head Latent Attention is handled separately because its cache is far smaller.
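A minimal sketch of the weights term and the total, using the bytes-per-parameter values listed above. The function names are hypothetical, and the KV cache and overhead are passed in as inputs (80 GiB and 5.2 GiB are the figures shown for the default Llama 3.3 70B scenario) because this page does not state how the overhead is derived.

```python
GIB = 1024 ** 3

# Bytes per parameter at each weight precision, as listed in the FAQ above.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weights_gib(params: float, precision: str) -> float:
    """Model weights in GiB: parameters x bytes per parameter."""
    return params * BYTES_PER_PARAM[precision] / GIB


def total_vram_gib(params: float, precision: str,
                   kv_cache_gib: float, overhead_gib: float) -> float:
    """VRAM = weights + KV cache + overhead."""
    return weights_gib(params, precision) + kv_cache_gib + overhead_gib


if __name__ == "__main__":
    params = 70.6e9  # Llama 3.3 70B
    print(round(weights_gib(params, "fp16"), 1))  # FP16 weights
    print(round(weights_gib(params, "int4"), 1))  # INT4 weights
    print(round(total_vram_gib(params, "fp16", kv_cache_gib=80, overhead_gib=5.2)))
```

For the default scenario this reproduces the 131.5 GiB of weights and the 217 GiB total shown in the recommended build.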

Is it cheaper to own GPUs or rent the cloud?

It depends on utilisation. Renting on-demand is cheaper when GPUs sit idle much of the time; owning (financed, plus power and cooling) wins once utilisation is sustained. The tool shows the monthly figures side by side and the break-even month for a purchase versus cloud — drag the utilisation slider to see the crossover.
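One simple way to think about a break-even month is the first month in which cumulative cloud spend exceeds the purchase price plus cumulative running costs. The sketch below assumes an outright cash purchase and ignores finance interest, residual value and price changes. The function name is hypothetical and the tool's own calculation may differ.

```python
import math
from typing import Optional


def break_even_month(capex: float, owned_running_monthly: float,
                     cloud_monthly: float) -> Optional[int]:
    """First month where cumulative cloud cost >= capex + cumulative running cost.
    Returns None if cloud is never more expensive than running owned hardware."""
    monthly_saving = cloud_monthly - owned_running_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(capex / monthly_saving)


if __name__ == "__main__":
    # Figures from the 4x H100 example: £158,000 capex, £1,387/mo power,
    # £3,889/mo on-demand cloud at 60% utilisation.
    print(break_even_month(158_000, 1_387, 3_889))
```

With the example figures the purchase pays back only after the 60-month finance term, which is consistent with cloud being cheaper at 60% utilisation. Higher utilisation raises the cloud bill and pulls the break-even earlier.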

Are the costs a formal quote?

No — every figure is indicative and for planning only. Capex uses reseller guide prices and the monthly is derived from a representative UK rental factor; throughput especially varies with framework, batching and tuning. Request a quote and a specialist will turn the sizing into a firm, buildable quotation.
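The monthly figures on this page are consistent with multiplying capex by a single factor of about 0.0215 over 60 months. That factor is inferred from the displayed numbers (£3,397 on £158,000), not a published rate, and the helper below is a hypothetical sketch rather than the tool's finance engine.

```python
# Bill of materials from the 4x H100 example above (indicative, ex-VAT).
BOM_GBP = {
    "4x NVIDIA H100 (80GB SXM)": 104_000,
    "HGX/DGX-class SXM chassis (ex-GPU)": 45_000,
    "Rack, ToR switching & PDUs": 9_000,
}

INFERRED_RENTAL_FACTOR = 0.0215  # assumption: derived from £3,397 / £158,000


def monthly_payment(capex: float, rental_factor: float = INFERRED_RENTAL_FACTOR) -> float:
    """Indicative monthly payment as capex x rental factor."""
    return capex * rental_factor


if __name__ == "__main__":
    capex = sum(BOM_GBP.values())
    print(capex, round(monthly_payment(capex)))
```

A real quote depends on term, deposit and credit approval, so the factor should be treated as a planning placeholder.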

Which GPUs and models are supported?

GPUs: H200, H100 (SXM and NVL), B200, A100, L40S and RTX 6000 Ada. Models: the Llama 3.x family, Mistral & Mixtral, Qwen 2.5, DeepSeek-R1/V3, Gemma 2/3, Phi, Command R and Falcon — with architecture constants read from each model’s published configuration.

Can I share or embed the result?

Yes. Every configuration is encoded in the page URL, so “Save / share this build” copies a link that reproduces the exact result for a colleague. The tool can also be embedded on your own site.
