Cloud AI training is convenient, but at sustained scale (multi-week training runs, hundreds of GPUs) on-prem becomes meaningfully cheaper. UK enterprises building their first on-prem AI clusters face specific decisions on hardware, fabric, storage, software stack, and facility preparation. This is the practitioner's playbook.
When on-prem AI beats cloud
Sustained utilisation above 50%: the premium paid for cloud elasticity stops paying for itself.
Sensitive data + sovereignty: particularly for financial services, healthcare, and defence.
Multi-week training runs: cloud spot pricing helps, but on-prem is more predictable.
Model fine-tuning + iteration cycles: owned infrastructure removes per-experiment cost friction.
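The utilisation threshold above can be sanity-checked with a few lines of arithmetic. The sketch below is a minimal illustration, not a pricing model: the capex, opex, lifetime, and cloud rate are placeholder assumptions chosen only to show the calculation, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def on_prem_cost_per_gpu_hour(capex_gbp, annual_opex_gbp, lifetime_years, gpus):
    """Owned cost per GPU per wall-clock hour, paid whether or not the GPU is busy."""
    total_cost = capex_gbp + annual_opex_gbp * lifetime_years
    return total_cost / (gpus * lifetime_years * HOURS_PER_YEAR)


def break_even_utilisation(on_prem_per_gpu_hour, cloud_per_gpu_hour):
    """Fraction of hours the GPUs must be busy before owning beats renting."""
    return on_prem_per_gpu_hour / cloud_per_gpu_hour


# Placeholder inputs (assumptions, not quotes): one 8-GPU server over 4 years.
owned = on_prem_cost_per_gpu_hour(
    capex_gbp=300_000, annual_opex_gbp=50_000, lifetime_years=4, gpus=8
)
threshold = break_even_utilisation(owned, cloud_per_gpu_hour=3.50)
print(f"on-prem: GBP {owned:.2f} per GPU-hour, break-even utilisation: {threshold:.0%}")
```

With these placeholder figures the threshold lands at roughly half of all hours. Real quotes, discounts, and power prices will move it, so treat the result as a way to frame the decision, not as the answer.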
Step 1 — Sizing
4-8 GPU starting cluster: 1× Supermicro SYS-821GE (8× H100/H200) or NVIDIA DGX B200.
16-64 GPU production cluster: 2-8× 8-GPU servers + 800G fabric.
128+ GPU pre-training cluster: dedicated AI facility, liquid cooling, 800G AI fabric.
See our NVIDIA GPU roadmap for choosing a GPU within each size band.
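As a rough illustration of the size bands above, the following sketch rounds a GPU requirement up to whole 8-GPU servers and names the matching band. The function name is hypothetical, and treating anything above 64 GPUs as pre-training scale is an assumption, since the bands above leave the 65-127 range undefined.

```python
import math

GPUS_PER_SERVER = 8  # the 8-GPU server class named in the sizing bands


def size_cluster(gpus_needed):
    """Round a GPU requirement up to whole servers and pick a size band."""
    if gpus_needed < 1:
        raise ValueError("gpus_needed must be at least 1")
    servers = math.ceil(gpus_needed / GPUS_PER_SERVER)
    gpus = servers * GPUS_PER_SERVER
    if gpus <= 8:
        band = "starting"
    elif gpus <= 64:
        band = "production"
    else:
        # Assumption: everything above 64 GPUs is planned as pre-training scale.
        band = "pre-training"
    return {"servers": servers, "gpus": gpus, "band": band}


for need in (6, 40, 128):
    print(need, size_cluster(need))
```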
Step 2 — Fabric design
RoCEv2 over Ethernet (Arista 7060X6, Cisco Nexus 9332D-H2R, Juniper QFX5240): the most common UK choice in 2026.
InfiniBand NDR / XDR (NVIDIA Quantum-2 / Quantum-X800): the alternative for pure HPC and lowest-latency training.
800G for the GPU-to-GPU fabric, 400G for storage, and a separate management network.
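To get a first feel for how many switches the GPU fabric needs, the sketch below counts leaves and spines for a non-blocking two-tier leaf-spine design. It assumes one fabric NIC per GPU and a 64-port switch; both are illustrative assumptions to check against the chosen server and switch datasheets, and the function name is hypothetical.

```python
import math


def two_tier_fabric(gpu_nics, switch_ports=64):
    """Switch counts for a non-blocking leaf-spine fabric.

    Each leaf splits its ports evenly: half face GPUs, half face spines.
    """
    if gpu_nics < 1 or switch_ports < 2 or switch_ports % 2:
        raise ValueError("need at least one NIC and an even port count")
    if gpu_nics <= switch_ports:
        # Everything fits on a single switch, so no spine layer is needed.
        return {"leaves": 1, "spines": 0}
    down_ports = switch_ports // 2
    if gpu_nics > down_ports * switch_ports:
        raise ValueError("too many endpoints for two tiers; a third tier is needed")
    leaves = math.ceil(gpu_nics / down_ports)
    # Every downlink is matched by an uplink; spines must terminate all of them.
    spines = math.ceil(leaves * down_ports / switch_ports)
    return {"leaves": leaves, "spines": spines}


for nics in (64, 128, 512):
    print(nics, two_tier_fabric(nics))
```

The storage and management networks are sized separately; this only covers the GPU-to-GPU fabric.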
Step 3 — Storage
Hot tier: Pure FlashBlade or VAST Data — high parallel throughput for training data loading.
Warm tier: NetApp ONTAP or Dell PowerScale.
Archive: AWS S3 / Azure Blob / local object storage.
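How much parallel throughput the hot tier needs depends on how fast the GPUs consume training data. The sketch below is a back-of-envelope estimate under stated assumptions: the samples-per-second rate, sample size, and 1.5x headroom factor are placeholders, and the function names are hypothetical.

```python
def required_read_gbytes_per_s(gpus, samples_per_gpu_per_s, bytes_per_sample, headroom=1.5):
    """Aggregate read throughput (GB/s) the hot tier must sustain for data loading."""
    raw_bytes_per_s = gpus * samples_per_gpu_per_s * bytes_per_sample
    return raw_bytes_per_s * headroom / 1e9


def to_gbits_per_s(gbytes_per_s):
    """Convert GB/s to Gb/s for comparison with network link speeds."""
    return gbytes_per_s * 8


# Placeholder workload (assumption): 64 GPUs, 2,000 samples/s each, 150 KB per sample.
need = required_read_gbytes_per_s(gpus=64, samples_per_gpu_per_s=2_000, bytes_per_sample=150_000)
print(f"hot tier read throughput: {need:.1f} GB/s ({to_gbits_per_s(need):.0f} Gb/s)")
```

Comparing the Gb/s figure with the storage network's link speed shows quickly whether the bottleneck is likely to be the array or the network.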
Step 4 — Software stack
NVIDIA AI Enterprise + CUDA toolkit + cuDNN.
Container platform: NVIDIA NGC containers on Kubernetes, with Kubeflow or Slurm for training orchestration.
Model registry: MLflow / Weights & Biases.
Frameworks: PyTorch + Hugging Face Transformers + DeepSpeed for distributed training.
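For the distributed-training layer, a DeepSpeed run is driven by a JSON configuration. The sketch below builds a minimal one in Python so that the global batch size stays consistent with the cluster size. The specific values (16 GPUs, micro-batch 4, 8 accumulation steps, ZeRO stage 2, bf16) are illustrative assumptions, not recommendations, and the helper name is hypothetical.

```python
import json


def deepspeed_config(world_size, micro_batch_per_gpu, grad_accum_steps, zero_stage=2):
    """Build a minimal DeepSpeed config dict with a consistent global batch size."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2 or 3")
    return {
        # DeepSpeed expects: global batch = micro-batch x accumulation steps x GPUs.
        "train_batch_size": world_size * micro_batch_per_gpu * grad_accum_steps,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": zero_stage},
    }


# Assumed example: two 8-GPU servers.
config = deepspeed_config(world_size=16, micro_batch_per_gpu=4, grad_accum_steps=8)
print(json.dumps(config, indent=2))
```

Generating the file from the GPU count avoids a mismatch between the declared batch size and the actual cluster when servers are added.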
Step 5 — Facility
Power: 1 MW+ for a serious AI cluster. Pre-survey a colo or upgrade the on-site facility.
Cooling: direct-to-chip (D2C) liquid cooling for B200+ density; air cooling for H100.
Network connectivity: dedicated egress for data movement.
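A facility pre-survey usually starts from a power and rack estimate. The sketch below shows one way to do that arithmetic. The 10.2 kW per server, 10% overhead for switches and storage, PUE of 1.3, and 40 kW rack budget are all assumptions for illustration; substitute figures from the vendor datasheets and the facility survey. The function names are hypothetical.

```python
import math


def facility_power_kw(servers, kw_per_server, overhead_fraction=0.10, pue=1.3):
    """Total facility draw: IT load (servers plus overhead) scaled by PUE."""
    it_load_kw = servers * kw_per_server * (1 + overhead_fraction)
    return it_load_kw * pue


def racks_needed(servers, kw_per_server, rack_budget_kw):
    """Racks required when each rack has a fixed power budget."""
    servers_per_rack = int(rack_budget_kw // kw_per_server)
    if servers_per_rack < 1:
        raise ValueError("rack power budget is below one server's draw")
    return math.ceil(servers / servers_per_rack)


# Assumed example: eight 8-GPU servers at 10.2 kW each.
print(f"facility power: {facility_power_kw(8, 10.2):.1f} kW")
print(f"racks at 40 kW per rack: {racks_needed(8, 10.2, 40)}")
```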
What Servnet does
Servnet runs UK enterprise AI infrastructure builds end-to-end. The engagement covers: 1) workload sizing (model + training pattern + concurrency), 2) a sized commercial bid across DGX + Supermicro options, 3) a facility pre-survey (power + cooling + space), 4) deployment + commissioning, 5) an optional ongoing managed AI infrastructure service.