UK’s trusted IT infrastructure partner since 2003
Servnet

How many GPUs does Llama 3 70B actually need?

Llama 3.3 and 3.1 70B are dense models, so every one of their 70.6 billion parameters is active on every token. Use the calculator below to size VRAM, GPU count, power and cost for your workload.

Llama 3 70B has become a popular open-weight choice for UK enterprises that want frontier-class quality on their own hardware. Because it is a dense transformer rather than a mixture-of-experts model, all of its weights must sit in VRAM whatever the prompt, which makes GPU sizing predictable but demanding. The calculator on this page turns your precision, context length and concurrency into indicative GPU counts and running costs.

Reference build · Llama 3.3 70B · FP16 · 32 users · 8k context
4× H100
216.7 GiB VRAM · 10U · 5.4 kW · 1,822 tok/s
£3,397/mo · £158,000 capex
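| Precision | GPUs (H100) | VRAM      | Throughput  | From      |
|-----------|-------------|-----------|-------------|-----------|
| FP16      | 4×          | 216.7 GiB | 1,822 tok/s | £3,397/mo |
| FP8       | 4×          | 149.7 GiB | 3,644 tok/s | £3,397/mo |
| INT4      | 2×          | 116.1 GiB | 3,644 tok/s | £2,279/mo |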
VRAM breakdown (217 GiB total): weights 131.5 GiB · KV cache 80.0 GiB · overhead 5.2 GiB
Llama 3.3 70B at FP16, 8k context, 32 concurrent users — indicative.
GPUs required by precision: FP16 4× H100 · FP8 4× H100 · INT4 2× H100
H100 count by weight precision. In this build FP8 stays at four cards, while INT4 halves the count to two.

All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.

Why FP16 spills across multiple 80GB GPUs

At FP16 the 70.6 billion parameters take two bytes each, about 131.5 GiB of weights, which is well beyond a single 80GB accelerator. Serving Llama 3.3 or 3.1 70B unquantised therefore means splitting the model across several H100 or H200 cards with tensor parallelism. That makes fast interconnect essential: NVLink between GPUs in one node keeps per-token latency low, whereas falling back to PCIe within a node, or to the network between nodes, adds latency and cuts throughput. The tool below shows where your chosen precision crosses each card's ceiling.
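As a rough illustration of how a VRAM total turns into a card count, the sketch below takes the totals from the reference table on this page and rounds up to a power-of-two tensor-parallel group. Both the 80 GB (decimal) usable capacity per H100 and the power-of-two rounding are assumptions made for this example, not a description of how the calculator works; they happen to reproduce the 4 / 4 / 2 counts shown above.

```python
import math

GIB = 1024 ** 3
H100_BYTES = 80e9  # assumption: 80 GB (decimal) per card, all of it usable


def gpus_needed(total_vram_gib, card_bytes=H100_BYTES):
    """Smallest power-of-two GPU count whose combined VRAM covers the total."""
    card_gib = card_bytes / GIB                      # about 74.5 GiB per card
    minimum = math.ceil(total_vram_gib / card_gib)   # cards needed by capacity alone
    return 1 << (minimum - 1).bit_length()           # round up to 1, 2, 4, 8, ...


# Total VRAM figures copied from the reference table (32 users, 8k context).
reference_totals_gib = {"FP16": 216.7, "FP8": 149.7, "INT4": 116.1}

for precision, total in reference_totals_gib.items():
    print(f"{precision}: {gpus_needed(total)}x H100")
```

Under these assumptions FP8 overshoots two cards by less than 1 GiB, which is why it lands on four; a slightly shorter context or fewer users could plausibly tip it down to two.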

How INT4 collapses the footprint

Quantising to INT4 with AWQ or GPTQ stores each weight in roughly half a byte instead of two, cutting weight memory about fourfold, from 131.5 GiB to around 33 GiB. At low concurrency that is often enough to fit Llama 3 70B on a single 80GB GPU, although the 32-user reference build above still needs two H100s once the KV cache is counted. For most retrieval, chat and summarisation workloads the quality drop is usually minor and justified by the hardware saving. The calculator contrasts FP16, FP8 and INT4 side by side so you can see the change in GPU count and cost at each precision.
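The weight arithmetic is simple enough to check by hand. This minimal sketch multiplies the 70.6 billion parameters by the bits stored per weight; the function and constant names are illustrative only.

```python
GIB = 1024 ** 3
PARAMS = 70.6e9  # parameter count quoted on this page

BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4}


def weight_gib(params, bits):
    """Memory for the raw weights alone, in GiB."""
    return params * bits / 8 / GIB


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gib(PARAMS, bits):.1f} GiB")
```

Real AWQ and GPTQ checkpoints also store per-group scales and zero points, so an INT4 file is slightly larger than the raw half-byte-per-weight figure.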

GQA, 128k context and the KV cache

Llama 3.3 and 3.1 70B run 80 layers with grouped-query attention: 64 query heads share just 8 key-value heads, an eightfold reduction in KV cache versus classic multi-head attention. That is a large part of why the model can serve 128k context and many concurrent users. Even so, KV memory grows with sequence length multiplied by active requests, so long-context RAG or high concurrency can demand more VRAM than the weights; in the reference build above the KV cache already takes 80.0 GiB at 8k context and 32 users. Model your real UK traffic below.
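The KV cache figure in the reference build can be reproduced from the layer and head counts above. The sketch assumes a head dimension of 128 (a hidden size of 8,192 split across 64 query heads, per the published model configuration), an FP16 cache at 2 bytes per value, and "8k" meaning 8,192 tokens; none of those three values is stated on this page.

```python
GIB = 1024 ** 3

LAYERS = 80
QUERY_HEADS = 64
KV_HEADS = 8
HEAD_DIM = 128   # assumption: hidden size 8192 / 64 query heads
KV_BYTES = 2     # assumption: cache kept in FP16


def kv_bytes_per_token(kv_heads, layers=LAYERS, head_dim=HEAD_DIM, dtype_bytes=KV_BYTES):
    """One key and one value vector per KV head, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_cache_gib(context_tokens, concurrent_requests, kv_heads=KV_HEADS):
    """Worst-case cache size when every request fills its whole context."""
    return kv_bytes_per_token(kv_heads) * context_tokens * concurrent_requests / GIB


gqa = kv_cache_gib(8192, 32)                        # grouped-query attention, 8 KV heads
mha = kv_cache_gib(8192, 32, kv_heads=QUERY_HEADS)  # hypothetical multi-head equivalent

print(f"GQA: {gqa:.1f} GiB, MHA equivalent: {mha:.1f} GiB")
```

With these inputs the grouped-query cache comes to 80.0 GiB, matching the breakdown above, while the hypothetical multi-head version would need eight times as much.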

FAQs

How many H100 GPUs do I need for Llama 3 70B?

It depends on precision. At FP16 the weights exceed one 80GB GPU, so you need several H100s linked with NVLink and tensor parallelism, plus headroom for the KV cache. INT4 quantisation shrinks the footprint enough to fit on fewer cards. In the reference build on this page (8k context, 32 users), FP16 and FP8 each need four H100s and INT4 needs two. Enter your own context length and concurrency in the calculator for an indicative count.

Can Llama 3 70B run on a single GPU?

Not at FP16, where the 70.6B parameters alone overflow an 80GB card. With INT4 quantisation via AWQ or GPTQ the weights can fit on one high-capacity GPU, leaving room for a modest KV cache. Long contexts or many simultaneous users still push you toward a second card. The tool shows exactly when a single GPU stops being enough.

How much VRAM does Llama 3 70B use at 128k context?

Two things consume most of the VRAM: the fixed model weights and the KV cache, which grows with sequence length and concurrent requests. Grouped-query attention, with only 8 key-value heads, keeps 128k-context caches about eight times smaller than a 70B model with standard multi-head attention would need, but heavy concurrency can still rival the weights. The calculator sums both for your workload rather than quoting a single figure.
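To put a rough number on "rival the weights", the sketch below estimates the cache for one request that fills a 128k window and counts how many such requests it takes to exceed the FP16 weights. It assumes a head dimension of 128, an FP16 cache and 128k meaning 131,072 tokens; real servers rarely fill every context, so treat it as an upper bound.

```python
import math

GIB = 1024 ** 3

# 2 (key + value) x 80 layers x 8 KV heads x 128 head dim x 2 bytes (FP16).
# Head dimension and cache precision are assumptions, not figures from this page.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2
CONTEXT_TOKENS = 128 * 1024   # assumption: "128k" means 131,072 tokens
WEIGHTS_FP16_GIB = 131.5      # FP16 weight figure from the VRAM breakdown

kv_per_request_gib = KV_BYTES_PER_TOKEN * CONTEXT_TOKENS / GIB

# Smallest number of full-length requests whose cache is larger than the weights.
requests_to_exceed_weights = math.floor(WEIGHTS_FP16_GIB / kv_per_request_gib) + 1

print(f"KV cache per full 128k request: {kv_per_request_gib:.1f} GiB")
print(f"Requests needed to exceed FP16 weights: {requests_to_exceed_weights}")
```

On these assumptions a single full 128k request holds about 40 GiB of cache, so four of them together outweigh the FP16 model itself.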
