UK’s trusted IT infrastructure partner since 2003
Servnet
AI Infrastructure · GPU Sizing

How much VRAM does a large language model need? (2026)

Servnet Editorial · AI Infrastructure Practice · 8 min read

The first question when self-hosting a large language model is deceptively simple: will it fit? The answer is a sum of three parts - the model weights, the KV cache and a little overhead - and getting the middle term right is where most calculators go wrong. This guide explains what actually consumes GPU memory, how precision and concurrency change it, and how a VRAM total becomes a GPU count. Put your own model and workload into the AI/GPU calculator to see the exact figures.

What uses GPU memory (Llama 3 70B · FP16 · 8k context · 32 users)

  • Weights: ~131 GiB - parameters × 2 bytes (fixed)
  • KV cache: ~80 GiB - grows with context × concurrent users
  • Overhead: ~5 GiB - CUDA context + working buffers

VRAM is three things, not one

Every byte of GPU memory an inference server uses falls into three buckets: the model weights, the KV cache and framework overhead. Weights are fixed - they are the model. The KV cache grows with how much text you process and how many users you serve. Overhead is a small, near-constant tax for the CUDA context and working buffers. Total VRAM is simply their sum, and a model fits when that sum sits inside the usable memory of your GPUs.

The reason sizing feels confusing is that people quote only the weights - the headline figure - and forget the KV cache, which under real concurrency can rival or exceed the weights. The calculator on this page adds all three so you see the true working set, not just the model on disk.
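A minimal sketch of that sum in Python, using the rounded figures from the worked example in this guide (131 GiB weights, 80 GiB KV cache, 5 GiB overhead). The `VramBudget` class and its method names are hypothetical, written for illustration only, and the 72 GiB usable figure per card assumes an 80 GiB card run at about 90%.

```python
from dataclasses import dataclass


@dataclass
class VramBudget:
    """The three buckets of inference memory, all in GiB."""
    weights_gib: float
    kv_cache_gib: float
    overhead_gib: float

    @property
    def total_gib(self) -> float:
        # Total VRAM is simply the sum of the three buckets.
        return self.weights_gib + self.kv_cache_gib + self.overhead_gib

    def fits(self, usable_gib: float) -> bool:
        # A model fits when the full working set sits inside usable memory.
        return self.total_gib <= usable_gib


# Rounded figures for Llama 3 70B, FP16, 8k context, 32 users.
budget = VramBudget(weights_gib=131, kv_cache_gib=80, overhead_gib=5)

usable_per_card = 72  # assumption: 80 GiB card at about 90% utilisation
two_cards = 2 * usable_per_card

print("weights only fit on two cards:", budget.weights_gib <= two_cards)
print("full working set fits on two cards:", budget.fits(two_cards))
```

Quoting only the weights suggests two cards are enough; the full working set says otherwise, which is the gap the paragraph above describes.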

Weights: parameters times precision

Weight memory is the easy part: it is the number of parameters multiplied by the bytes used to store each one. At 16-bit precision that is two bytes per parameter, so a 70-billion-parameter model like Llama 3.3 70B needs about 131 GiB just for weights - comfortably more than a single 80GB H100, which is why it spreads across at least two cards.

Precision is the biggest lever you have. Dropping to 8-bit halves the weights; 4-bit quantisation with AWQ or GPTQ quarters them, bringing that same 70B model down to roughly 33 GiB and often onto a single high-capacity GPU with little quality loss for most tasks. The trade is fidelity, so evaluation and fine-tuning usually stay at higher precision while production serving quantises.

  • FP16 / BF16: 2 bytes per parameter - full fidelity
  • FP8 / INT8: 1 byte per parameter - half the weights
  • INT4: 0.5 bytes per parameter - a quarter of the weights
  • A 70B model: about 131 GiB at FP16, about 33 GiB at INT4
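The arithmetic in the list above can be sketched in a few lines of Python. This is an illustration rather than the calculator's implementation: `weight_gib` is a hypothetical helper, and the parameter count of 70.55 billion is an assumption for a "70B" model, chosen because it is close to the published size and reproduces the rounded figures quoted here.

```python
GIB = 1024 ** 3  # bytes in one GiB

# Bytes used to store each parameter at the precisions listed above.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_gib(n_params: float, precision: str) -> float:
    """Weight memory in GiB: parameters multiplied by bytes per parameter."""
    return n_params * BYTES_PER_PARAM[precision] / GIB


N_PARAMS = 70.55e9  # assumed parameter count for a 70B-class model

for precision in ("fp16", "fp8", "int4"):
    print(f"{precision}: {weight_gib(N_PARAMS, precision):.1f} GiB")
```

Real quantised checkpoints carry some extra data such as scales and zero points, so measured sizes tend to sit slightly above these figures.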

The KV cache: the term everyone forgets

As the model generates, it caches the key and value vectors for every token it has already seen so it does not recompute them - the KV cache. Its size grows with the context length, the number of concurrent requests and the model architecture, and at scale it is often the memory that decides how many GPUs you need, not the weights.

Modern models keep it in check with grouped-query attention, where many query heads share a handful of key and value heads. Llama 3 70B uses just eight key/value heads for its 64 query heads - an eight-fold reduction versus classic multi-head attention. The single most common mistake in VRAM calculators is to size the KV cache from the query-head count, which overstates it four- to eight-fold on Llama-class models. DeepSeek goes further with Multi-head Latent Attention, caching one small compressed latent per token so even 128k context stays cheap.
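A rough sketch of the KV cache calculation for a grouped-query model follows. The function `kv_cache_gib` is hypothetical, and the layer count (80) and head dimension (128) are assumptions taken from the published Llama 3 70B configuration rather than from this article. It also assumes a 16-bit cache and that every request holds a full 8k context, which is the worst case.

```python
GIB = 1024 ** 3  # bytes in one GiB


def kv_cache_gib(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    concurrent_requests: int,
    bytes_per_value: int = 2,  # 16-bit cache
) -> float:
    """KV cache in GiB, assuming every request fills its whole context."""
    # Factor of 2: one key vector and one value vector are cached
    # per token, per layer, per key/value head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * concurrent_requests / GIB


# Assumed Llama 3 70B shape: 80 layers, head dimension 128.
# Sized correctly from the 8 key/value heads.
from_kv_heads = kv_cache_gib(80, 8, 128, context_tokens=8192, concurrent_requests=32)

# The common mistake: sized from the 64 query heads instead.
from_query_heads = kv_cache_gib(80, 64, 128, context_tokens=8192, concurrent_requests=32)

print(f"from key/value heads: {from_kv_heads:.0f} GiB")
print(f"from query heads:     {from_query_heads:.0f} GiB")
print(f"overstatement:        {from_query_heads / from_kv_heads:.0f}x")
```

Under these assumptions the key/value-head figure matches the roughly 80 GiB quoted for 32 users at 8k context, and the query-head figure is eight times larger.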

VRAM for Llama 3 70B weights by precision (GiB)

  • FP16: weights 131 GiB, KV cache 80 GiB
  • FP8: weights 66 GiB, KV cache 80 GiB
  • INT4: weights 33 GiB, KV cache 80 GiB

Overhead, and turning VRAM into GPUs

The last slice is overhead: the CUDA runtime context and framework working buffers, a near-constant cost of roughly one to two gigabytes. Serving frameworks such as vLLM also reserve headroom - commonly using about 90% of a card - so the usable figure per GPU is a little below the nameplate.

Divide the total VRAM by the usable memory per card and round up to a sensible parallel topology - 1, 2, 4 or 8 GPUs, then multiples of eight for multi-node - and you have your GPU count. A 70B model at 16-bit serving thirty-odd users at 8k context lands on four H100s; quantised to 4-bit it fits on two. See it worked through on the Llama 3 70B GPU requirements page, or size your own in the calculator.
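The division and rounding described above might look like this in Python. It is a sketch under stated assumptions, not the calculator's logic: `gpu_count` is a hypothetical helper, the card is treated as 80 GiB with 90% usable, overhead is taken as 5 GiB, and any extra memory cost of splitting a model across cards is ignored.

```python
import math

# Sensible single-node topologies; beyond these, use multiples of eight.
SINGLE_NODE_TOPOLOGIES = (1, 2, 4, 8)


def gpu_count(
    total_vram_gib: float,
    card_gib: float = 80.0,        # assumption: 80 GiB card
    usable_fraction: float = 0.90,  # assumption: framework uses about 90%
) -> int:
    """Round the raw card requirement up to a practical GPU topology."""
    usable_gib = card_gib * usable_fraction
    needed = math.ceil(total_vram_gib / usable_gib)
    for n in SINGLE_NODE_TOPOLOGIES:
        if n >= needed:
            return n
    # Multi-node: round up to whole 8-GPU nodes.
    return 8 * math.ceil(needed / 8)


# 70B model, 32 users at 8k context: weights + KV cache + overhead (GiB).
fp16_total = 131 + 80 + 5
int4_total = 33 + 80 + 5

print("FP16:", gpu_count(fp16_total), "GPUs")
print("INT4:", gpu_count(int4_total), "GPUs")
```

Note that the FP16 case needs three cards' worth of memory but is rounded to four, because three is not one of the listed topologies.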

From VRAM to a real build

VRAM tells you how many GPUs; a deployment needs the rest of the picture. Those GPUs sit in servers - an 8-GPU HGX or DGX node draws around ten kilowatts - which need power, cooling and rack space, and they cost money to buy or rent. That is why sizing memory is only the start: the same tool carries the GPU count through to a server and rack build, power and cooling, capex and finance, and a cloud-versus-own comparison.

If you are weighing whether to buy that hardware or rent it, see self-hosting LLMs vs cloud GPUs. To spec the servers themselves, our team builds them through the NVIDIA DGX and GPU accelerator ranges.

How the VRAM total becomes a GPU count

Compare total VRAM with a card’s usable memory:

  • Fits one card: 1 GPU
  • A few cards: 2–8 GPUs (NVLink)
  • Beyond one node: multi-node (multiples of 8)

Key takeaways
  • VRAM = model weights + KV cache + overhead; a model fits when that sum is inside your GPUs' usable memory.
  • Weights = parameters × bytes-per-parameter: a 70B model is about 131 GiB at FP16 and about 33 GiB at INT4.
  • The KV cache grows with context and concurrency; size it from key/value heads (GQA), not query heads, or you overstate it four- to eight-fold.
  • Frameworks use about 90% of a card, so divide total VRAM by the usable figure and round to 1, 2, 4 or 8 GPUs.
  • VRAM is only the start - the build also needs power, cooling and a budget.
Frequently asked

How much VRAM does Llama 3 70B need?

About 131 GiB for the weights at 16-bit precision, plus the KV cache and overhead - so it needs at least two 80GB GPUs, and typically four to serve many users at long context. Quantised to 4-bit the weights fall to roughly 33 GiB. Size your exact case in the AI/GPU calculator.

Does a bigger context window need more VRAM?

Yes - the KV cache grows with context length and with the number of concurrent requests. Grouped-query attention keeps that growth modest on modern models, but long context under heavy concurrency can use as much memory as the weights themselves.

How much does quantisation reduce VRAM?

Roughly in proportion to the bytes per parameter: 8-bit halves the weights versus 16-bit, and 4-bit quarters them, with small real-world overheads. It reduces only the weights, not the KV cache, though a lower-precision KV cache is also possible.

Why do other calculators give a higher figure?

Usually because they size the KV cache from the number of query heads rather than the key and value heads. On Llama-class models with grouped-query attention that overstates the cache four- to eight-fold. Our engine uses the correct key/value-head count.
