UK’s trusted IT infrastructure partner since 2003
sales@servnetuk.com
0800 987 4111
Servnet
ConfiguratorGet in Touch
Technical · AI Infrastructure · How-To

Building your first UK on-prem AI cluster: step-by-step

Servnet Editorial · AI Infrastructure Practice11 min read

Cloud AI training is convenient but at sustained scale (multi-week training runs, hundreds of GPUs), on-prem becomes meaningfully cheaper. UK enterprises building first on-prem AI clusters face specific decisions on hardware, fabric, storage, software stack, and facility prep. This is the practitioner's playbook.

First UK AI cluster — reference architecture
GPU nodes4-8× H100 / H200IB / 400G NDRCompute fabricParallel FSDDN / Pure FlashBlade100G EthernetMgmt + storagePower + cool50-80 kW racks

When on-prem AI beats cloud

Sustained utilisation > 50% — cloud premium for elasticity stops paying for itself.

Sensitive data + sovereignty — particularly for FS, healthcare, defence.

Multi-week training runs — cloud spot pricing helps but on-prem is more predictable.

Model fine-tuning + iteration cycles — owned infrastructure removes per-experiment cost friction.

Step 1 — Sizing

4-8 GPU starting cluster: 1× Supermicro SYS-821GE (8× H100/H200) or NVIDIA DGX B200.

16-64 GPU production cluster: 2-8× 8-GPU servers + 800G fabric.

128+ GPU pre-training cluster: dedicated AI facility, liquid cooling, 800G AI fabric.

See our NVIDIA GPU roadmap for choice within sizing.

Step 2 — Fabric design

RoCEv2 over Ethernet (Arista 7060X6, Cisco Nexus 9332D-H2R, Juniper QFX5240) — most-common UK choice in 2026.

InfiniBand NDR / X800 (NVIDIA Quantum-2 / Quantum-X800) — alternative for pure HPC / lowest-latency training.

800G GPU-to-GPU bandwidth. 400G for storage. Separate management network.

Step 3 — Storage

Hot tier: Pure FlashBlade or VAST Data — high parallel throughput for training data loading.

Warm tier: NetApp ONTAP or Dell PowerScale.

Archive: AWS S3 / Azure Blob / local object storage.

AI cluster stack — what you procure together
4Model + training frameworkPyTorch · NCCL · Slurm3GPU + interconnectH100/H200 + NVLink + IB2Parallel storageDDN A³I · Lustre / GPFS1Power + cooling50-80 kW rack, DLC

Step 4 — Software stack

NVIDIA AI Enterprise + CUDA toolkit + cuDNN.

Container platform: NVIDIA NGC + Kubernetes + Kubeflow / Slurm for training orchestration.

Model registry: MLflow / Weights & Biases.

Frameworks: PyTorch + Hugging Face Transformers + DeepSpeed for distributed training.

Step 5 — Facility

Power: 1-MW+ for serious AI cluster. Pre-survey colo or upgrade on-site facility.

Cooling: D2C liquid cooling for B200+ density. Air cooling for H100.

Network connectivity: dedicated egress for data movement.

What Servnet does

Servnet runs UK enterprise AI infrastructure builds end-to-end. Engagement: 1) workload sizing (model + training pattern + concurrency), 2) sized commercial bid across DGX + Supermicro options, 3) facility pre-survey (power + cooling + space), 4) deployment + commissioning, 5) optional ongoing managed AI infrastructure service.

Key takeaways
  • On-prem AI beats cloud when sustained utilisation > 50% + sensitive data + multi-week runs.
  • Start with 4-8 GPU cluster (single 8-GPU server). Scale to 16-64 then 128+.
  • RoCEv2 over 800G Ethernet is the dominant UK fabric choice in 2026.
  • Storage: hot (FlashBlade / VAST) + warm (ONTAP / PowerScale) + archive (S3 / Blob).
  • Software: NVIDIA AI Enterprise + Kubernetes + Kubeflow / Slurm + standard ML frameworks.
  • Facility pre-survey is essential — most legacy UK colos cap at 4-8 kW/rack.
Frequently asked

FAQs — Building your first UK on-prem AI cluster

Sizing

What's the minimum viable AI cluster?

Supermicro SYS-821GE (8× H100/H200) or 1× NVIDIA DGX B200 is a credible starting point for fine-tuning + inference workloads. Pre-training requires 16+ GPUs minimum to be useful.

How much does a starter cluster cost?

UK list pricing for 8-GPU H200 server: £180-280k. NVIDIA DGX B200: £350-500k. Add fabric switching, storage, racks, cabling, deployment: typically £100-200k. Total 8-GPU starter cluster: £300-700k.

Related

Got a question this article didn't answer?

One conversation with an engineer who's done this before. No sales script.

Talk to Servnet →