UK’s trusted IT infrastructure partner since 2003
Servnet
FREE TOOL · AI INFRASTRUCTURE SIZING

How much GPU do you actually need?

Pick a model and your workload — get the exact GPUs, VRAM, a buildable server & rack spec, power, cooling, capex, monthly finance and a cloud-vs-own break-even. The only GPU calculator that ends in a real, orderable build.

GQA-correct VRAM · Full server & rack BOM · Power, cooling & finance · Cloud vs own break-even
70.6B params · 80 layers · GQA · up to 128k context
Context: 8k tokens
Concurrent users: 32
Recommended build: 4× H100
1 × HGX/DGX-class SXM node · 10U · 1 rack · 5.4 kW
VRAM required: 217 GiB / 288 GiB usable
Weights 131.5 GiB · KV cache 80 GiB · Overhead 5.2 GiB
Throughput: 1,822 tok/s (~57/user · indicative)
Max concurrent: ~61 at 8k context
Cooling: 18,426 BTU/hr (1.5 tons · 7.6 kW facility)
From £3,397/mo (£158,000 capex · 60mo)
Your build — bill of materials
An orderable, part-level spec. Indicative ex-VAT — a formal quote follows.
4× NVIDIA H100 (80GB SXM): £104,000
1× HGX/DGX-class SXM chassis (ex-GPU): £45,000
1× Rack, ToR switching & PDUs: £9,000
Indicative capex: £158,000
or £3,397/mo on hire purchase (60mo)
Indicative estimate — subject to change & credit approval.
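As a rough illustration, the bill of materials above can be re-added in a few lines of Python. The unit prices are the indicative figures shown; the monthly factor of 0.0215 is an assumption back-computed from £3,397 / £158,000, not a published rate, and the function names are hypothetical.

```python
# Indicative ex-VAT unit prices taken from the bill of materials above.
GPU_PRICE_GBP = 26_000       # 104,000 / 4 H100 SXM
CHASSIS_PRICE_GBP = 45_000   # HGX/DGX-class SXM chassis, ex-GPU
RACK_PRICE_GBP = 9_000       # rack, ToR switching & PDUs

# Assumption: implied by 3,397 / 158,000 on this page; not a quoted finance rate.
IMPLIED_MONTHLY_FACTOR = 0.0215


def capex_gbp(gpus: int, nodes: int = 1, racks: int = 1) -> int:
    """Sum the part-level prices into an indicative capex figure."""
    return gpus * GPU_PRICE_GBP + nodes * CHASSIS_PRICE_GBP + racks * RACK_PRICE_GBP


def monthly_gbp(capex: float, factor: float = IMPLIED_MONTHLY_FACTOR) -> float:
    """Indicative monthly payment as capex times a flat monthly factor."""
    return capex * factor


capex = capex_gbp(gpus=4)
print(f"capex £{capex:,} · monthly £{monthly_gbp(capex):,.0f}")
```

For the 4× H100 build this gives the £158,000 capex and roughly £3,397/mo shown above; a real finance quote would depend on term, deposit and credit approval.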
Power & cooling
Max system draw at the wall, facility at PUE 1.4, and the heat to reject.
IT load: 5.4 kW (GPUs + servers)
Facility power: 7.6 kW (incl. PUE 1.4)
Cooling: 18,426 BTU/hr (1.5 tons of refrigeration)
Rack density is capped for thermal realism; this build is a liquid-ready HGX deployment in 1 rack.
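The conversions behind these figures are simple enough to check by hand. The sketch below is a minimal version, assuming (as the numbers above suggest) that heat to reject is computed from the IT load rather than the facility load; the function names are illustrative only.

```python
BTU_PER_HR_PER_KW = 3412.142   # 1 kW = 3412.142 BTU/hr
BTU_PER_HR_PER_TON = 12_000    # 1 ton of refrigeration = 12,000 BTU/hr


def facility_kw(it_kw: float, pue: float = 1.4) -> float:
    """Facility power is the IT load scaled by PUE."""
    return it_kw * pue


def cooling_btu_hr(it_kw: float) -> float:
    """Heat to reject, assuming all IT power ends up as heat."""
    return it_kw * BTU_PER_HR_PER_KW


def cooling_tons(it_kw: float) -> float:
    """The same heat load expressed in tons of refrigeration."""
    return cooling_btu_hr(it_kw) / BTU_PER_HR_PER_TON


it_load = 5.4  # kW, the IT load shown above
print(f"facility {facility_kw(it_load):.1f} kW")
print(f"cooling {cooling_btu_hr(it_load):,.0f} BTU/hr = {cooling_tons(it_load):.1f} tons")
```

With a 5.4 kW IT load this lands on 7.6 kW facility power, about 18,426 BTU/hr and 1.5 tons, matching the panel.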
Own it vs rent the cloud
Owned = financed monthly + power & cooling. Cloud = on-demand £/GPU-hr at 60% utilisation. All editable below.
Own it (financed): £4,784/mo
£3,397 finance + £1,387 power · £158,000 asset · you own it at the end (HP)
Rent the cloud (on-demand): £3,889/mo
£2,334/mo on a 1-yr reserved commit · nothing owned · price rises with usage & market
At 60% utilisation, on-demand cloud is cheaper per month (£3,889 vs £4,784), so the comparison is worth revisiting as usage climbs. Owning wins once utilisation is sustained (drag it up below).
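One way to estimate where the crossover sits is sketched below. It assumes on-demand cloud spend scales linearly with utilisation while the owned monthly cost stays fixed, which is a simplification (real power draw varies with load, and the tool's own model may differ). The function names are hypothetical; the default values are the monthly figures shown above.

```python
def cloud_monthly_gbp(utilisation: float,
                      cost_at_ref: float = 3889.0,
                      ref_utilisation: float = 0.60) -> float:
    """On-demand cloud cost, assumed proportional to utilisation."""
    return cost_at_ref * utilisation / ref_utilisation


def break_even_utilisation(own_monthly: float = 4784.0,
                           cost_at_ref: float = 3889.0,
                           ref_utilisation: float = 0.60) -> float:
    """Utilisation at which cloud spend equals the fixed owned monthly cost."""
    return own_monthly * ref_utilisation / cost_at_ref


u = break_even_utilisation()
print(f"break-even at about {u:.0%} utilisation")
print(f"cloud at that point: £{cloud_monthly_gbp(u):,.0f}/mo")
```

Under these assumptions owning overtakes on-demand cloud at roughly 74% sustained utilisation; a reserved commit would push that point higher.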
Get this build quoted & configured
4× H100
We'll turn this sizing into a firm, part-level quotation — GPUs, servers, networking, power & cooling — with finance options. No obligation.
All figures are indicative and for planning only; throughput especially varies with framework, batching and tuning. A formal quotation follows a technical scoping call.

GPU requirements by model

Indicative build to serve 32 concurrent users at 8k context on H100, at FP16 and INT4. Weights are the FP16 figure. Configure your own scenario in the tool above.

Model | Params | Weights (FP16) | GPUs (FP16) | GPUs (INT4) | From (FP16)
Llama 3.1 8B | 8.03B | 15 GiB | 1× | 1× | £1,720/mo
Mistral Small 3 (24B) | 23.6B | 44 GiB | 2× | 1× | £2,279/mo
Gemma 2 27B | 27.2B | 50.7 GiB | 4× | 2× | £3,397/mo
Llama 3.3 70B | 70.6B | 131.5 GiB | 4× | 2× | £3,397/mo
Qwen2.5 72B | 72.7B | 135.4 GiB | 4× | 2× | £3,397/mo
Mixtral 8x7B (MoE) | 46.7B | 87 GiB | 2× | 1× | £2,279/mo
Llama 3.1 405B | 405.85B | 756 GiB | 16× | 8× | £11,073/mo
DeepSeek-R1 (671B MoE) | 671B | 1249.8 GiB | 24× | 8× | £16,512/mo

Indicative only. GPU counts snap to a deployable topology (1, 2, 4, 8, then multiples of 8). Large models span multiple nodes.
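The snapping rule can be sketched as follows. This is an illustrative reading of the rule stated above, not the tool's actual code; the 72 GiB usable per H100 is an assumption inferred from the "288 GiB usable" shown for a 4× build.

```python
import math


def snap_gpu_count(n: int) -> int:
    """Round a raw GPU count up to 1, 2, 4, 8, then multiples of 8."""
    if n < 1:
        raise ValueError("need at least one GPU")
    for size in (1, 2, 4, 8):
        if n <= size:
            return size
    return 8 * math.ceil(n / 8)


def gpus_for_vram(required_gib: float, usable_gib_per_gpu: float = 72.0) -> int:
    """GPUs needed to hold the required VRAM, snapped to a deployable size.

    usable_gib_per_gpu is an assumed figure (288 GiB / 4 GPUs on this page).
    """
    return snap_gpu_count(math.ceil(required_gib / usable_gib_per_gpu))


print(gpus_for_vram(217))   # the 217 GiB scenario above
print([snap_gpu_count(n) for n in (1, 3, 5, 9, 17)])
```

A requirement just over three GPUs' worth of memory therefore becomes a 4× build, and anything beyond eight GPUs grows in whole 8-GPU nodes.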

How we work it out

The engine is transparent and cited. Model architecture constants are read from each model’s published configuration; hardware, power and cost figures come from first-party datasheets and public indices. Everything is indicative and clearly labelled.

  • Weights: params × bytes/param (Modal “How much VRAM for inference”; ~2 GB per 1B params at FP16).
  • KV cache (GQA-correct): 2 · layers · num_kv_heads · head_dim · context · batch · dtype-bytes (KV-cache/GQA literature).
  • Throughput: memory-bandwidth roofline for decode, anchored to NVIDIA NIM LLM benchmarking tables — indicative ranges.
  • Power: max system draw (DGX H100 ≈ 10.2 kW for 8 GPUs); facility = IT × PUE (Uptime Institute PUE survey).
  • Cooling: 1 kW = 3412.142 BTU/hr; 12,000 BTU/hr per ton (lib/ups/units).
  • Cloud: indicative $/GPU-hr indices 2025–26 (~$3 H100 on-demand), user-overridable, converted to GBP.
  • Per-model Hugging Face config.json (num_hidden_layers, hidden_size, num_attention_heads, num_key_value_heads, head_dim, intermediate_size, max_position_embeddings).
  • MoE active-param counts per each model’s technical report (Mixtral 2-of-8; DeepSeek 8-of-256, 37B active / 671B total).
  • DeepSeek MLA cache = kv_lora_rank(512) + qk_rope_head_dim(64) = 576 latent dims/token/layer (DeepSeek-V3 technical report).
  • NVIDIA H100 datasheet (SXM 80GB HBM3 3.35TB/s 700W; NVL 94GB 3.9TB/s).
  • NVIDIA H200 datasheet (141GB HBM3e 4.8TB/s 700W).
  • NVIDIA A100 datasheet (SXM 80GB ~2.0TB/s 400W).
  • NVIDIA L40S datasheet (48GB GDDR6 864GB/s 350W, no NVLink).
  • NVIDIA RTX 6000 Ada datasheet (48GB GDDR6 960GB/s 300W, no NVLink).
  • NVIDIA B200 / Blackwell datasheet (180GB HBM3e ~8TB/s ~1000W).
  • NVIDIA DGX H100/H200 user guide — 8× SXM GPU, 8U, ~10.2 kW max system power (not 19.8 kW PSU nameplate).
  • HGX 8-GPU baseboard platforms (Supermicro/Dell/Lenovo) for SXM; 4U–5U PCIe GPU servers for up to 8 PCIe cards.
  • DGX SuperPOD electrical/thermal design — ≤4 DGX-class nodes per rack at typical facility power density.
  • H100 on-demand pricing indices 2025–26 (~$1.49–6.98/GPU-hr, ~$3 average across providers); reserved ~$2.35.
  • Specialised GPU clouds ~$2–3/hr; hyperscalers ~$4–8/hr on-demand.
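The weights and KV-cache formulas in the list above can be checked against the 70.6B example shown earlier. The sketch below assumes "8k" means 8,192 tokens, takes the 5.2 GiB overhead as given (its formula is not stated here), and uses the architecture values from the model's published configuration (80 layers, 8 KV heads, head dimension 128). Function names are illustrative.

```python
GIB = 2**30


def weights_gib(params: float, bytes_per_param: float = 2.0) -> float:
    """Weights = params x bytes/param (FP16 = 2, FP8/INT8 = 1, INT4 = 0.5)."""
    return params * bytes_per_param / GIB


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, batch: int, dtype_bytes: int = 2) -> float:
    """GQA-correct KV cache: 2 (K and V) x layers x kv_heads x head_dim x tokens."""
    return 2 * layers * kv_heads * head_dim * context * batch * dtype_bytes / GIB


weights = weights_gib(70.6e9)                         # FP16
kv = kv_cache_gib(layers=80, kv_heads=8, head_dim=128,
                  context=8192, batch=32)             # 32 users at 8k
overhead = 5.2                                        # GiB, taken from the page
total = weights + kv + overhead

usable = 288.0                                        # GiB usable on 4x H100, per the page
kv_per_user = kv / 32
max_users = (usable - weights - overhead) / kv_per_user

print(f"weights {weights:.1f} GiB · KV {kv:.1f} GiB · total {total:.0f} GiB")
print(f"headroom supports about {max_users:.1f} users at 8k context")
```

This reproduces the 131.5 GiB of weights, 80 GiB of KV cache and roughly 217 GiB total, and the headroom works out to about 60.5 concurrent users, in line with the ~61 shown.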

AI/GPU sizing FAQs

How many GPUs do I need to run Llama 3 70B?

At FP16 a 70-billion-parameter model needs about 131 GiB just for weights, so it fits on two 80 GB H100s for a single stream — but serving many concurrent users adds KV-cache memory, so real deployments typically use 4–8 H100s. Quantising to INT4 cuts the weights to ~33 GiB. Enter your context length and concurrent users above for an exact figure.

How is the VRAM calculated?

VRAM = model weights + KV cache + overhead. Weights = parameters × bytes-per-parameter (FP16 = 2, FP8/INT8 = 1, INT4 = 0.5). The KV cache is computed the correct way for modern models — using the number of key/value heads (grouped-query attention), not the query-head count, which is the mistake most calculators make and which overstates memory 4–8× on Llama-class models. DeepSeek’s Multi-head Latent Attention is handled separately because its cache is far smaller.
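To see where the 4–8× figure comes from, the sketch below compares the per-token KV cache computed with key/value heads against the same calculation done with query heads. The head counts are those in the published Llama configurations; the helper name is hypothetical.

```python
def kv_bytes_per_token(layers: int, heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * layers * heads * head_dim * dtype_bytes


# Architecture constants as published in each model's config.json.
MODELS = {
    "Llama 3.1 8B": dict(layers=32, q_heads=32, kv_heads=8, head_dim=128),
    "Llama 3.3 70B": dict(layers=80, q_heads=64, kv_heads=8, head_dim=128),
}

for name, m in MODELS.items():
    correct = kv_bytes_per_token(m["layers"], m["kv_heads"], m["head_dim"])
    naive = kv_bytes_per_token(m["layers"], m["q_heads"], m["head_dim"])
    print(f"{name}: {correct:,} B/token (GQA) vs {naive:,} B/token "
          f"(query heads), {naive // correct}x overstated")
```

The overstatement is simply the ratio of query heads to key/value heads: 4× for the 8B model and 8× for the 70B model.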

Is it cheaper to own GPUs or rent the cloud?

It depends on utilisation. Renting on-demand is cheaper when GPUs sit idle much of the time; owning (financed, plus power and cooling) wins once utilisation is sustained. The tool shows the monthly figures side by side and the break-even month for a purchase versus cloud — drag the utilisation slider to see the crossover.

Are the costs a formal quote?

No — every figure is indicative and for planning only. Capex uses reseller guide prices and the monthly is derived from a representative UK rental factor; throughput especially varies with framework, batching and tuning. Request a quote and a specialist will turn the sizing into a firm, buildable quotation.

Which GPUs and models are supported?

GPUs: H200, H100 (SXM and NVL), B200, A100, L40S and RTX 6000 Ada. Models: the Llama 3.x family, Mistral & Mixtral, Qwen 2.5, DeepSeek-R1/V3, Gemma 2/3, Phi, Command R and Falcon — with architecture constants read from each model’s published configuration.

Can I share or embed the result?

Yes. Every configuration is encoded in the page URL, so “Save / share this build” copies a link that reproduces the exact result for a colleague. The tool can also be embedded on your own site.