For part three of this series, I wanted to touch on some technical considerations when building an AI platform to deliver all those game-changing golden use cases. This is not an exhaustive list but a series of important considerations for starting your AI journey. They are also not in any order of importance.
Compute
I am touching on two areas of the compute conversation: data centre power challenges and using the CPU as an inference engine to lower the cost of running AI models.
Power and the role of water in the data centre!
Gone are the days of worrying about fitting 10 kW of power into an entire rack; we are now in the world of multi-megawatt requirements and planning for 150 kW+ racks. Modern AI servers consume the same power per server that a whole rack was responsible for just a few years ago. This brings a new set of considerations; it's not about floor space anymore but how we keep these systems cool to operate at peak performance and at an affordable cost.
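To put that in perspective with some rough numbers (assuming around 10 kW for a modern 8-GPU AI server, broadly in line with published figures for systems like the DGX H100): a 150 kW rack houses roughly 150 ÷ 10 ≈ 15 such servers, each one drawing what used to be an entire rack's power budget.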
Liquid cooling is the answer in this space, with most leading server manufacturers offering these options now. At a high level, we have three implementations of liquid cooling:
- Closed-Loop Liquid Cooling: A self-contained cooling solution built into the server, generally aimed at cooling newer, high-powered components like the CPU.
- Direct Liquid Cooling: Rack-level cooling of multiple components within a server, aiming to bring down the power usage of the entire stack effectively.
- Immersion Liquid Cooling: Leverages a thermally but not electrically conductive liquid, such as de-ionised water. This technology has operational challenges around routine maintenance, finding facilities to support the technology, and cost.
Direct Liquid Cooling (DLC) looks like the prevailing technology for AI workloads at scale, as it offers significant commercial, density and performance benefits while still allowing standard rack-mount server maintenance tasks (like changing a faulty component). DLC allows more servers per rack and reduces airflow management in and around the rack. This improved control of thermal conditions enables servers to run at maximum performance for sustained periods. Unlocking all the speed in those expensive GPUs and the turbo modes in CPUs will result in more 'AI' per pound spent. This leads to greater efficiency and, therefore, lower operating costs.
Research by one of our OEM partners showed that the cooling cost via traditional air methods could be upwards of $250 per server per year. The same server footprint using DLC could be as low as $50 per server per year. If you're deploying a few thousand servers, that starts to add up quickly, and let's not forget the sustainability impact - that's a lot of saved carbon!
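Running those figures against an illustrative estate of 2,000 servers: 2,000 × ($250 − $50) = $400,000 saved per year on cooling alone, before counting the avoided carbon.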
The future direction of server components - CPU, RAM, flash, GPU, etc. - shows us that thermal challenges will soon be a limiting factor. This is even more pronounced in the world of AI. When considering the future of your AI platform, liquid cooling is going to be a requirement. This means thinking about your facilities today to make sure you are ready for the future.
CPU inference
The second part of the compute conversation is when we need an AI accelerator (a GPU today) and when we can get away with just the CPU. The fundamental question is: How quickly do you need a response?
We talked about training AI models in the first part of this series, but what do you do once you have completed that training? We use the term inferencing, a fancy word for using the AI for what it was trained to do. When we consider inference, many questions arise: Do we need a response in nanoseconds? Where do we need to deploy the AI to be accessible? What is the cost of scaling that deployment to meet demand?
All these things will impact where and how you deploy your AI for inferencing. You will often need a GPU/accelerator to meet requirements, but silicon makers like Intel have been building AI capabilities into the CPU to give more options.
Let’s return to our initial question - "How quickly do you need a response from the AI?"
The simple answer is just a little faster than the system (or human) can consume the output. The average human can read around 230 words per minute, so your AI might only need to be a little faster to be effective with human consumers.
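To put a number on "a little faster", a rough back-of-the-envelope conversion from reading speed to a token-generation target looks like this (the ~1.3 tokens-per-word ratio is a common approximation for English text, not a fixed constant):

```python
# Convert a human reading speed into an approximate LLM token-rate target.
WORDS_PER_MINUTE = 230          # average adult reading speed, as noted above
TOKENS_PER_WORD = 1.3           # rough approximation for English; varies by tokenizer

words_per_second = WORDS_PER_MINUTE / 60
target_tokens_per_second = words_per_second * TOKENS_PER_WORD

print(f"~{words_per_second:.1f} words/s -> ~{target_tokens_per_second:.1f} tokens/s")
# ~3.8 words/s -> ~5.0 tokens/s
```

In other words, a handful of tokens per second is often enough for human-facing use cases.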
With the new AI features in Intel CPUs, known as Advanced Matrix Extensions (AMX) and Advanced Vector Extensions 512 (AVX-512), many inferencing tasks (and lower-end training workloads) can be completed on the CPU alone. Returning to the last section on liquid cooling, moving the right workloads to the CPU would save massive amounts of power, cost, and complexity. Also, consider edge deployment use cases like retail, manufacturing or smart cities; building small data centres at each location would be impractical and a commercial non-starter.
Leveraging the CPU would be more practical if we picked more specialised foundation models. Models with lower parameter counts (sub-20 billion) are, in many cases, more focused and able to provide the same levels of accuracy as the larger 70-400 billion models (for specific use cases).
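To make this concrete, here is a minimal sketch of CPU-only inference with a smaller model, assuming a Hugging Face transformers environment; the model name is purely illustrative, and bfloat16 is the data type AMX is designed to accelerate:

```python
import torch
from transformers import pipeline

# A sub-20B-parameter model served entirely on the CPU.
# bfloat16 maps onto the matrix units that AMX accelerates on recent Xeons;
# on older CPUs, PyTorch simply falls back to standard kernels.
generator = pipeline(
    "text-generation",
    model="microsoft/phi-2",      # illustrative small model - swap in your own
    torch_dtype=torch.bfloat16,
    device="cpu",
)

output = generator(
    "Summarise our returns policy for a customer:",
    max_new_tokens=100,
)
print(output[0]["generated_text"])
```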
Storage
The second area I wanted to touch on is storage, covering two aspects: the throughput requirements that AI places on those flash modules, and unifying the storage architecture for AI use cases.
Throughput
Setting the context: NVIDIA recommends at least 1 Gb/s of read and write bandwidth per H100 GPU. With the new Blackwell chips coming down the road, we expect to need 2-4x this performance. If we deploy a single DGX box, it will have 8 GPUs on board, and one DGX SuperPod (Nvidia's reference architecture for large-scale AI) scale unit is 32 systems, containing 256 GPUs. This shows us why storage throughput should be a significant concern.
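Running the numbers on those recommendations: a single 8-GPU DGX needs roughly 8 Gb/s (1 GB/s) of storage bandwidth, a 256-GPU SuperPod scale unit needs roughly 256 Gb/s (32 GB/s), and a 2-4x Blackwell uplift pushes that towards 64-128 GB/s of sustained read and write for a single scale unit.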
The last thing we want is storage to be the limiting factor in the time it takes to train our model; those GPUs were not cheap, and we need to keep them running at full speed all the time. We have seen examples of more nodes being added to a multi-epoch training run with the time to train remaining unchanged. Investigation showed that the time to load data, create checkpoints, and restart was constrained by storage, limiting the potential of those new nodes.
When training larger models, regular checkpoints are needed. Doing this puts pressure on the storage and is dead time for the GPU. The faster we can enable the checkpoint to complete, the faster we get our results. Frequent checkpoints also allow model performance to be evaluated at regular stages of training. This can help identify when a model overfits or underfits the data, ensuring an optimal level of training and, therefore, resource consumption.
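For a sense of where that dead time comes from, here is a minimal sketch of periodic checkpointing in a PyTorch training loop (the interval, path and objects are placeholders); every call to torch.save is a burst of large sequential writes, during which the GPUs are effectively stalled:

```python
import torch

CHECKPOINT_EVERY = 500  # steps between checkpoints: a trade-off between restart risk and GPU idle time

def maybe_checkpoint(step, model, optimizer, path="/mnt/fast-storage/ckpt"):
    """Persist model and optimizer state every CHECKPOINT_EVERY steps.

    For large models a single checkpoint can be hundreds of GB, so the write
    bandwidth behind `path` directly determines how long the GPUs sit idle.
    """
    if step % CHECKPOINT_EVERY != 0:
        return
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        f"{path}/step_{step}.pt",
    )
```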
Checkpoints are not the only place where storage is hit during AI training, and we need to consider the end-to-end journey. Data moves through the AI workflow from ingestion and loading, to model load into the GPU, and then to checkpoints and distribution. Moving data from one platform to another, and each platform's read/write performance, can significantly impact the overall time to complete a training run. All this wasted time is wasted money on expensive GPU hardware, and we need to avoid it.
This data movement at each process stage leads to the second part of the storage conversation – Unification.
Unification
How much data is needed at that intensive training stage that captures all the attention? It is believed that the GPT-3 model used only 570 GB of data, which is not much in today’s PB world. If we place training at the centre of our AI pipeline and define this as needing a few hundred GB, what is the view before and after?
Our pipeline starts with the collection of the raw data needed to push through a data refinement process. This could easily be tens of PB, refined down to a few PB. At the other end of the pipeline, our quantised model takes up a few GB, but all the audit data, transaction logs, etc., could account for a few more PB on top.
Now we have an end-to-end AI pipeline that could quickly require tens of PB of capacity, with a need to move that data at each stage of the process - Collection > Refine > Train > Query > Retrain, and so on. Leveraging different storage technologies to achieve this could add significant delays to the process, again bringing storage to the front of the performance challenge for AI training. Leveraging storage technologies that can unify these needs with the required scale and performance could cut the training time by 30%, a massive number for such expensive systems.
Networking
How important is the network in these AI deployments? Critical is the simple answer, and there is a lot of it. Nvidia publishes two reference architectures for their DGX-based AI systems: the small BasePod and the larger SuperPod (most OEM server providers have published similar architectures).
Taking a quick look at those designs, we can see how much networking is included and its importance at both the compute and storage layers.
Figure: BasePOD Reference Architecture
Figure: SuperPod Management Rack
In the ‘compute’ network space, we have some serious bandwidth requirements with zero tolerance for latency or loss. Considering a 180-billion-parameter model using FP16 precision and factoring in overheads, we can calculate rough memory requirements for fine-tuning:
360 GB (model) + 360 GB (optimizer) + 216 GB (activations) = 936 GB.
The current Nvidia H100 SXM GPU has only 80 GB of memory, so we can see the need to network GPUs together so they can split the workload, in this case across 12 H100s. This level of connectivity has traditionally been the sphere of InfiniBand networking, which is generally accepted to offer lower and more consistent latency and, historically, higher throughput than traditional Ethernet. The challenge is that it is typically more complex to deploy and requires skills that are not native to an enterprise networking team.
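The 12 comes straight from the arithmetic above: 936 GB ÷ 80 GB per GPU ≈ 11.7, so a minimum of 12 H100s, all of which must exchange activations and gradients over the network at every step.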
Over the last few years, we have seen Ethernet throughputs catch up with and surpass those of InfiniBand, leading to a fierce debate on which technology is best suited to support a supercomputer deployment. One of the last battlegrounds lies in the transmission protocols and how they handle situations like loss and congestion to deliver consistent bandwidth.
The Nvidia BlueField-3 SuperNICs and Spectrum-X switches are examples of how technology tackles some of these native Ethernet challenges. By providing end-to-end adaptive RDMA routing with lossless network capabilities, the solution can drive Ethernet utilisation up from a traditional 60% to 95%. The technology can also deliver performance isolation across the network by managing congestion control. Given the diversity of choice, consolidation of skills, and easier ongoing management, we expect Ethernet to be a strong contender for future deployments.
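As a worked example of what that utilisation jump means (the 400 Gb/s port speed is illustrative): at 60% utilisation a 400 Gb/s link delivers around 240 Gb/s of useful throughput, while at 95% the same link delivers around 380 Gb/s, a significant difference when a fabric contains hundreds of such links.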
We also need to keep a close eye on the Ultra Ethernet Consortium and its stated mission to "Deliver an Ethernet-based, open, interoperable, high performance, full-communications stack architecture to meet the growing network demands of AI & HPC at scale." When we see the results of this technology, we will have a new wave of Ethernet capability to consider.
Data
The final part I wanted to touch on is the data strategy that needs to underpin these AI aspirations. Organisations will likely find making AI a reality challenging without a deep understanding of their data and its quality, or a means to provide curated access. This does not always mean that all your data needs to be moved to a new location or that you need to spend years overhauling your data management practices. The 80/20 rule usually applies: 80% of the value comes from 20% of the data.
The first step in any data project is understanding what you have and where it is located. Discovery and classification are foundational steps; sometimes, projects fail because the task looks too big to start. Most organisations need help to create a data management architecture that allows them to quickly locate, connect, integrate, and provide access to data from heterogeneous sources.
The number of data and application silos has exploded over the last ten years or so, but the number of skilled people in data teams has either stayed constant or dropped. As a result, the time lag from when a request for data is raised to when the request is fulfilled impacts the time to market or decision. We see organisations looking at three possible data architectures to tackle these challenges.
Data Mesh
Data Mesh is a decentralised data architecture that treats data as a product and devolves ownership to domain-aligned, cross-functional teams, with federated governance and access. It is generally adopted by large organisations with very complex, domain-specific data needs.
Data Fabric
A Data Fabric aims to provide a unified data management architecture that spans diverse data sources, leveraging metadata and automation to enhance value. It is usually deployed to enable enterprise-wide data integration, providing seamless access, sharing, and governance across the organisation.
Data Lakehouse
A Data Lakehouse is designed to combine the scalability of a data lake with the performance of a data warehouse, supporting a wide range of data types and analytical workloads. It generally provides a unified storage approach and ACID transaction guarantees (Atomicity, Consistency, Isolation, Durability). A lakehouse is normally deployed by organisations that require both large-scale data processing and real-time analytics. Rather than deploying multiple copies of the data to support BI and AI, the lakehouse allows this data to be unified.
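For a flavour of what a unified table with ACID guarantees looks like in practice, here is a minimal sketch using the open-source deltalake Python package (one of several lakehouse table formats; the path and data are illustrative). The same table can then serve BI queries and feed AI training pipelines without duplicating the data:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Illustrative transactions; in practice these would arrive from upstream pipelines.
orders = pd.DataFrame(
    {"order_id": [1001, 1002], "amount": [49.99, 120.00], "region": ["UK", "DE"]}
)

# Each write is recorded as an ACID transaction in the table's log.
write_deltalake("/data/lakehouse/orders", orders, mode="append")

# Readers always see a consistent snapshot, whether they are BI dashboards
# or feature-engineering jobs preparing AI training data.
snapshot = DeltaTable("/data/lakehouse/orders").to_pandas()
print(snapshot)
```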
The relevance of each of these data architectures will depend on the type, size, and needs of the organisation's data. Where is your data stored today? What are the existing warehouse and data lake capabilities? What budgets are available to tackle the challenge? Sometimes, a blended approach will be required to meet the specific demands and requirements. This is where our independent architects can help by defining the target operating model as a set of functional/non-functional requirements and using this to validate the technology strategy.
Summary
That’s the end of this Hybrid Platforms Trends series on AI, which has covered the top conversations we are seeing with customers, the importance of partnership and governance, and why the platform is still critical. Let's remember the three core pillars of accelerating an AI outcome and the importance of advancing all three in parallel to ensure the speed of innovation is not lost. Please speak to our teams about how we can help you advance your AI outcomes in these areas.
Contributors
-
Rob Sims
Chief Technologist - Hybrid Platforms