UK’s trusted IT infrastructure partner since 2003
Servnet
Server Infrastructure · Storage

Data lake and lakehouse storage nodes: on-prem hardware for analytics at scale (UK 2026)

Servnet Editorial · Server Infrastructure Practice · 12 min read

The modern data lake is built on object storage, and a lakehouse layers table formats and query engines on top so analytics can run directly against that object layer. On-prem, that means real servers: nodes that present an S3-compatible object store at the capacity and throughput your analytics demand, and a compute layer that queries it. Done well, an on-prem lakehouse can be dramatically cheaper than cloud object storage at petabyte scale; done badly, it starves the query engines and disappoints everyone. This is how to design the physical storage nodes behind a UK data lake or lakehouse, from the architecture down to the drives.

[Figure: Lakehouse: object nodes to query. S3 object node 1 and S3 object node 2 (capacity + metadata) serve reads over a fast network (25GbE+ end to end) to the query / compute layer, which scans in parallel.]

From lakehouse architecture to physical nodes

A lakehouse is a layered architecture: an object storage foundation holds the raw and curated data, open table formats add transactional tables and schema over that object layer, and query and compute engines read it for analytics. The storage foundation is almost always S3-compatible object storage, because the whole ecosystem speaks S3, which means the on-prem hardware question is really how to build a fast, large S3 object store and a compute layer that can query it efficiently.

That separation of storage and compute is the architecture's strength: you scale the object nodes for capacity and throughput, and the query nodes for processing, independently. The physical design therefore splits into storage nodes optimised to serve data cheaply and at scale, and compute nodes optimised to crunch it, connected by a fast network. Get the object layer right and the rest of the lakehouse has a solid foundation to build on; the same object-node design principles apply to any on-prem S3 store.

Designing the storage nodes

Data lake storage nodes are object-storage nodes, and they balance cheap capacity with enough throughput to keep query engines fed. Each node carries dense high-capacity drives for the bulk of the data, a flash tier for metadata and small objects so the index is fast, enough ECC memory to cache metadata and aid recovery, and a fast network to serve many readers at once. The art is balancing those four so no single one starves the others under analytical load.

Because analytics reads large volumes in parallel, throughput across the cluster matters as much as raw capacity, so the nodes are built to be added: more nodes mean more capacity and more aggregate read bandwidth together, a scale-out pattern. Build the capacity tier from high-capacity drives and the metadata tier from flash in our SSD and NVMe range, presented through a clean controller path using parts from our host bus adapters range.

  • Object storage foundation: build a fast, large S3-compatible store as the lakehouse base
  • Per node: dense capacity drives, a flash tier for metadata and small objects, ECC memory, fast network
  • Scale-out: add nodes to grow capacity and aggregate read throughput together
  • Separate storage and compute so each scales independently
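The scale-out sizing described above can be sketched as a small calculation: the cluster needs enough nodes to hold the data and enough nodes to serve the read throughput, and the larger of the two decides the purchase. This is a minimal sketch, not a sizing tool. The function name, the parameters and every figure in the example (drive size, drives per node, storage efficiency after erasure coding or replication, per-node read rate) are illustrative assumptions rather than numbers from this article.

```python
import math


def nodes_needed(usable_tb, read_gbps, drives_per_node, drive_tb,
                 storage_efficiency, node_read_gbps):
    """Return (nodes_for_capacity, nodes_for_throughput, nodes_to_buy).

    storage_efficiency is the assumed usable fraction of raw capacity
    after data protection overhead; node_read_gbps is the assumed
    sustained read rate one node can serve, in Gbit/s.
    """
    usable_per_node_tb = drives_per_node * drive_tb * storage_efficiency
    for_capacity = math.ceil(usable_tb / usable_per_node_tb)
    for_throughput = math.ceil(read_gbps / node_read_gbps)
    # The cluster must satisfy both constraints, so take the larger count.
    return for_capacity, for_throughput, max(for_capacity, for_throughput)


# Hypothetical example: 1 PB usable, 100 Gbit/s aggregate read target.
cap, thr, total = nodes_needed(
    usable_tb=1000, read_gbps=100,
    drives_per_node=12, drive_tb=20,
    storage_efficiency=0.66, node_read_gbps=20,
)
print(f"capacity needs {cap} nodes, throughput needs {thr}, buy {total}")
```

With these assumed inputs, capacity is the binding constraint. A smaller but more heavily scanned dataset would flip that, and the node count would be set by throughput instead.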

Throughput, network and the query engines

Analytics is throughput-hungry: query engines scan large datasets in parallel across many nodes, so the network between storage and compute, and the read bandwidth of the storage tier, decide how fast queries return. A storage cluster that can hold a petabyte but only dribble it out to the query engines will leave expensive compute idle, so size the read throughput and the network to the analytical workload, not just the capacity to the data volume.

That means fast links between storage and compute, typically 25GbE and upward for the storage nodes and more for heavily used clusters, on server-grade adapters from our network cards range, with the switch fabric matched end to end. The goal is that the query engines are limited by their own processing, not by waiting on the object store, which is what makes the difference between an on-prem lakehouse that flies and one that frustrates.
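One way to check whether the storage tier or the network would hold the query engines back is to take the slowest stage on the read path and work out how long a full scan would take at that rate. The sketch below assumes a simplified model (each node is limited by the slower of its network link and its drives, and the cluster as a whole by the switch fabric). All names and figures are hypothetical.

```python
def effective_read_gbps(nodes, nic_gbps_per_node, drive_gbps_per_node,
                        fabric_gbps):
    """Aggregate read rate in Gbit/s, capped by the slowest stage."""
    # A node cannot serve faster than its NIC or its drives allow.
    per_node = min(nic_gbps_per_node, drive_gbps_per_node)
    # The cluster cannot serve faster than the switch fabric allows.
    return min(nodes * per_node, fabric_gbps)


def scan_seconds(dataset_tb, read_gbps):
    """Seconds to read dataset_tb (decimal terabytes) at read_gbps."""
    bits = dataset_tb * 8e12  # 1 TB = 8e12 bits
    return bits / (read_gbps * 1e9)


# Hypothetical cluster: 8 nodes, 25GbE each, drives sustaining 16 Gbit/s
# per node, and a 200 Gbit/s fabric between storage and compute.
rate = effective_read_gbps(nodes=8, nic_gbps_per_node=25,
                           drive_gbps_per_node=16, fabric_gbps=200)
print(f"effective read rate: {rate} Gbit/s")
print(f"50 TB full scan: {scan_seconds(50, rate) / 60:.1f} minutes")
```

In this assumed configuration the drives, not the 25GbE links, set the per-node rate, which is the kind of imbalance the model is meant to surface before hardware is ordered.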

[Figure: On-prem lakehouse vs cloud object at PB scale. Cost in £k by year, on-prem nodes vs cloud object: Y1 £40k vs £22k; Y3 £60k vs £70k; Y5 £78k vs £120k.]

The economics against cloud object storage

The reason organisations build data lakes on-prem is cost at scale. Cloud object storage is convenient and elastic, but at sustained petabyte volumes the storage and, especially, the data-access and egress charges add up relentlessly, and analytics that repeatedly scans large datasets generates exactly the access patterns that make cloud object storage expensive. Owned hardware turns that recurring operating cost into a capital asset with predictable running costs.

The crossover depends on volume, how heavily the data is accessed and the lifespan of the platform, and it is the same on-prem-versus-cloud calculation that applies to compute, narrowed to bulk analytical storage. For steady, large, frequently scanned datasets that will live for years, the on-prem lakehouse usually wins on total cost; for small or short-lived datasets, cloud often still makes sense. Model it honestly rather than assuming either way.
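A first-pass version of that model can be very small: cumulative on-prem cost is the up-front hardware plus yearly running costs, cumulative cloud cost is monthly storage plus monthly access charges, and the crossover is the first year in which on-prem is no more expensive. The sketch below is illustrative only. Every price and volume in it is a made-up placeholder, not real cloud pricing or a Servnet quote, and a real model would add growth, refresh cycles and staffing.

```python
def onprem_cost(years, capex, opex_per_year):
    """Cumulative on-prem cost: up-front hardware plus running costs."""
    return capex + opex_per_year * years


def cloud_cost(years, stored_tb, storage_per_tb_month,
               scanned_tb_per_month, access_per_tb):
    """Cumulative cloud cost: storage plus access/egress charges."""
    monthly = (stored_tb * storage_per_tb_month
               + scanned_tb_per_month * access_per_tb)
    return monthly * 12 * years


def crossover_year(max_years, capex, opex_per_year, **cloud):
    """First whole year in which on-prem is no dearer, or None."""
    for year in range(1, max_years + 1):
        if onprem_cost(year, capex, opex_per_year) <= cloud_cost(year, **cloud):
            return year
    return None


# Hypothetical inputs in pounds: placeholders, not real prices.
year = crossover_year(
    max_years=7, capex=300_000, opex_per_year=40_000,
    stored_tb=1000, storage_per_tb_month=15,
    scanned_tb_per_month=500, access_per_tb=10,
)
print(f"on-prem becomes cheaper in year {year}")
```

Lowering the scanned volume or shortening the platform's lifespan in these inputs pushes the crossover later or removes it entirely, which matches the point that small or short-lived datasets often still belong in the cloud.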

Putting the storage layer together

For most UK data lake and lakehouse builds the storage layer lands on a scale-out cluster of object-storage nodes, each with dense capacity drives, a flash metadata tier, generous ECC memory and a fast network, presenting an S3-compatible store that the query and compute layer reads. The capacity follows the dataset size and growth; the node count and network follow the aggregate read throughput the analytics demands.
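Because capacity follows dataset growth as well as today's size, it can help to project the node count forward year by year so that expansion is planned rather than reactive. The following is a minimal sketch assuming compound annual growth and a fixed usable capacity per node. The function name and the example figures are hypothetical.

```python
import math


def nodes_by_year(start_tb, annual_growth, usable_tb_per_node, years):
    """Return a list of (year, data_tb, nodes) from year 0 to `years`.

    annual_growth is a fraction, e.g. 0.25 for 25% growth per year.
    """
    plan = []
    for year in range(years + 1):
        data_tb = start_tb * (1 + annual_growth) ** year
        nodes = math.ceil(data_tb / usable_tb_per_node)
        plan.append((year, round(data_tb), nodes))
    return plan


# Hypothetical example: 500 TB today, 25% yearly growth,
# 150 TB usable per storage node, projected over three years.
for year, data_tb, nodes in nodes_by_year(500, 0.25, 150, 3):
    print(f"year {year}: {data_tb} TB -> {nodes} nodes")
```

The projection covers capacity only. The throughput requirement should be checked separately at each step, since the node count has to satisfy whichever constraint is larger.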

We design these storage nodes and the cluster around them, balancing capacity, metadata flash, memory and network to the analytical workload, in our server configuration service, building larger clusters out on dense platforms from the Dell storage range. The compute layer is sized separately against it, so storage and query each scale to what the lakehouse actually needs.

Key takeaways
  • A lakehouse layers table formats and query engines over an S3-compatible object store, so the hardware question is how to build that store.
  • Storage nodes balance dense capacity drives, a flash metadata tier, ECC memory and a fast network.
  • Separate storage and compute so each scales independently; the storage layer is a scale-out, add-a-node design.
  • Analytics is throughput-hungry, so size read bandwidth and network to the queries, not just capacity to the data.
  • On-prem usually beats cloud object storage on total cost for large, frequently scanned, long-lived datasets.

FAQs: Data lake and lakehouse storage nodes

Architecture

What hardware does an on-prem data lake need?

A scale-out cluster of object-storage nodes presenting an S3-compatible store, plus a separate compute layer to query it. Each storage node carries dense capacity drives, a flash metadata tier, ECC memory and a fast network. Storage and compute scale independently. We design the storage layer through our server configuration service.

Why is a lakehouse built on object storage?

The lakehouse ecosystem speaks S3, layering open table formats and query engines over an object storage foundation. So the on-prem hardware question is how to build a fast, large S3-compatible store, then size compute to query it. The object-node design uses dense platforms like the HPE Apollo range.

Economics

Is an on-prem data lake cheaper than cloud?

At sustained petabyte scale, usually yes, because cloud storage and especially data-access and egress charges add up under the repeated large scans analytics generates. Owned hardware turns that recurring cost into a capital asset with predictable running costs. The crossover depends on volume, access intensity and lifespan; we model it as part of our server configuration service.

Got a question this article didn't answer?

One conversation with an engineer who's done this before. No sales script.

Talk to Servnet →