The modern data lake is built on object storage, and a lakehouse layers table formats and query engines on top so analytics can run directly against that object layer. On-prem, that means real servers: nodes that present an S3-compatible object store at the capacity and throughput your analytics demand, and a compute layer that queries it. Done well, an on-prem lakehouse can be dramatically cheaper than cloud object storage at petabyte scale; done badly, it starves the query engines and disappoints everyone. This is how to design the physical storage nodes behind a UK data lake or lakehouse, from the architecture down to the drives.
From lakehouse architecture to physical nodes
A lakehouse is a layered architecture: an object storage foundation holds the raw and curated data, open table formats add transactional tables and schema over that object layer, and query and compute engines read it for analytics. The storage foundation is almost always S3-compatible object storage, because the whole ecosystem speaks S3, which means the on-prem hardware question is really how to build a fast, large S3 object store and a compute layer that can query it efficiently.
That separation of storage and compute is the architecture's strength: you scale the object nodes for capacity and throughput, and the query nodes for processing, independently. The physical design therefore splits into storage nodes optimised to serve data cheaply and at scale, and compute nodes optimised to crunch it, connected by a fast network. Get the object layer right and the rest of the lakehouse has a solid foundation to build on; the same object-node design principles apply to any on-prem S3 store.
Designing the storage nodes
Data lake storage nodes are object-storage nodes, and they balance cheap capacity with enough throughput to keep query engines fed. Each node carries dense high-capacity drives for the bulk of the data, a flash tier for metadata and small objects so the index is fast, enough ECC memory to cache metadata and aid recovery, and a fast network to serve many readers at once. The art is balancing those four so no single one starves the others under analytical load.
Because analytics reads large volumes in parallel, throughput across the cluster matters as much as raw capacity, so the nodes are built to be added: each extra node brings more capacity and more aggregate read bandwidth together. This is the scale-out pattern. Build the capacity tier from high-capacity drives and the metadata tier from flash in our SSD and NVMe range, presented through a clean controller path using parts from our host bus adapters range.
- Object storage foundation: build a fast, large S3-compatible store as the lakehouse base
- Per node: dense capacity drives, a flash tier for metadata and small objects, ECC memory, fast network
- Scale-out: add nodes to grow capacity and aggregate read throughput together
- Separate storage and compute so each scales independently
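The scale-out rule above can be sketched as a simple sizing calculation: work out how many nodes the capacity needs, how many the read throughput needs, and take the larger. This is a minimal sketch, not a sizing tool; `NodeSpec`, `nodes_required` and every figure in the example are hypothetical placeholders to be replaced with measured values for your own hardware and workload.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical per-node figures; replace with measured values."""
    usable_tb: float       # usable capacity per node after data-protection overhead
    read_gbytes_s: float   # sustained read throughput per node, in GB/s


def nodes_required(dataset_tb: float, target_read_gbytes_s: float, node: NodeSpec) -> dict:
    """Return the node count needed for capacity, for throughput, and overall."""
    for_capacity = math.ceil(dataset_tb / node.usable_tb)
    for_throughput = math.ceil(target_read_gbytes_s / node.read_gbytes_s)
    return {
        "for_capacity": for_capacity,
        "for_throughput": for_throughput,
        # The cluster must satisfy both constraints, so the larger count wins.
        "nodes": max(for_capacity, for_throughput),
    }


if __name__ == "__main__":
    # Illustrative numbers only: a 1 PB dataset and a 16 GB/s aggregate read target.
    example_node = NodeSpec(usable_tb=200.0, read_gbytes_s=2.0)
    print(nodes_required(dataset_tb=1000.0, target_read_gbytes_s=16.0, node=example_node))
```

With these placeholder figures the throughput target, not the capacity, sets the node count, which is the situation the scale-out pattern is meant to handle.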
Throughput, network and the query engines
Analytics is throughput-hungry: query engines scan large datasets in parallel across many nodes, so the network between storage and compute, and the read bandwidth of the storage tier, decide how fast queries return. A storage cluster that can hold a petabyte but only dribble it out to the query engines will leave expensive compute idle, so size the read throughput and the network to the analytical workload, not just the capacity to the data volume.
That means fast links between storage and compute, typically 25GbE or faster for the storage nodes and higher speeds for heavily used clusters, on server-grade adapters from our network cards range, with the switch fabric matched end to end. The goal is that the query engines are limited by their own processing, not by waiting on the object store. That is the difference between an on-prem lakehouse that flies and one that frustrates.
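To see how link speed and drive throughput together bound query time, a rough estimate like the one below can help. It is a sketch under stated assumptions: the 0.9 efficiency factor is an assumed allowance for protocol overhead, reads are assumed to spread evenly across nodes, and the function names and example figures are hypothetical.

```python
def link_gbytes_per_s(link_gbit_s: float, efficiency: float = 0.9) -> float:
    """Convert a nominal link speed in Gbit/s to usable GB/s.

    efficiency is an assumed allowance for protocol overhead, not a measured value.
    """
    return link_gbit_s / 8.0 * efficiency


def estimate_scan(scan_tb: float, storage_nodes: int,
                  link_gbit_s: float, node_disk_gbytes_s: float) -> dict:
    """Estimate the time to read scan_tb from the cluster and name the limiting tier."""
    network = link_gbytes_per_s(link_gbit_s)
    # Each node can only serve data as fast as the slower of its link and its drives.
    per_node = min(network, node_disk_gbytes_s)
    aggregate = per_node * storage_nodes
    return {
        "aggregate_gbytes_s": aggregate,
        "seconds": scan_tb * 1000.0 / aggregate,  # 1 TB taken as 1000 GB
        "bottleneck": "network" if network < node_disk_gbytes_s else "drives",
    }


if __name__ == "__main__":
    # Illustrative figures only: 8 nodes on 25GbE, drives able to sustain 4 GB/s per node.
    print(estimate_scan(scan_tb=45.0, storage_nodes=8,
                        link_gbit_s=25.0, node_disk_gbytes_s=4.0))
```

If the estimate names the network as the bottleneck while the drives have headroom, faster links or more nodes will shorten scans; if it names the drives, a faster network alone will not.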
The economics against cloud object storage
The reason organisations build data lakes on-prem is cost at scale. Cloud object storage is convenient and elastic, but at sustained petabyte volumes the storage and, especially, the data-access and egress charges add up relentlessly, and analytics that repeatedly scans large datasets generates exactly the access patterns that make cloud object storage expensive. Owned hardware turns that recurring operating cost into a capital asset with predictable running costs.
The crossover depends on volume, how heavily the data is accessed and the lifespan of the platform, and it is the same on-prem-versus-cloud calculation that applies to compute, narrowed to bulk analytical storage. For steady, large, frequently scanned datasets that will live for years, the on-prem lakehouse usually wins on total cost; for small or short-lived datasets, cloud often still makes sense. Model it with your own figures rather than assuming either answer.
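A first-pass version of that model can be as simple as comparing cumulative costs month by month. The sketch below assumes a flat dataset size, a flat monthly access and egress charge, and a single up-front hardware purchase; all function names are hypothetical and every figure in the example is an illustrative placeholder, not a real price.

```python
from typing import Optional


def cumulative_cloud_cost(months: int, stored_tb: float,
                          storage_per_tb_month: float, access_per_month: float) -> float:
    """Total cloud spend after a number of months: storage plus access and egress."""
    return months * (stored_tb * storage_per_tb_month + access_per_month)


def cumulative_onprem_cost(months: int, capex: float, running_per_month: float) -> float:
    """Total on-prem spend: up-front hardware plus power, space and support."""
    return capex + months * running_per_month


def crossover_month(stored_tb: float, storage_per_tb_month: float, access_per_month: float,
                    capex: float, running_per_month: float,
                    horizon_months: int = 120) -> Optional[int]:
    """First month at which on-prem is no more expensive than cloud, or None."""
    for month in range(1, horizon_months + 1):
        cloud = cumulative_cloud_cost(month, stored_tb, storage_per_tb_month, access_per_month)
        onprem = cumulative_onprem_cost(month, capex, running_per_month)
        if onprem <= cloud:
            return month
    return None  # no crossover within the platform's assumed lifespan


if __name__ == "__main__":
    # Placeholder inputs in arbitrary currency units; substitute your own quotes and bills.
    print(crossover_month(stored_tb=1000.0, storage_per_tb_month=10.0,
                          access_per_month=5000.0, capex=300000.0,
                          running_per_month=5000.0))
```

A result of `None` means the hardware never pays back within the horizon, which is the small or short-lived case where cloud still makes sense. A fuller model would add data growth, hardware refresh and staffing.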
Putting the storage layer together
For most UK data lake and lakehouse builds the storage layer lands on a scale-out cluster of object-storage nodes, each with dense capacity drives, a flash metadata tier, generous ECC memory and a fast network, presenting an S3-compatible store that the query and compute layer reads. The capacity follows the dataset size and growth; the node count and network follow the aggregate read throughput the analytics demands.
We design these storage nodes and the cluster around them through our server configuration service, balancing capacity, metadata flash, memory and network to the analytical workload, and we build larger clusters on dense platforms from the Dell storage range. The compute layer is sized separately against it, so storage and query each scale to what the lakehouse actually needs.