When you are designing applications that run at the scale of an entire datacenter – hundreds or thousands of microservices running across countless individual servers, all of which need to invoke each other within microseconds to give the impression of a monolithic application – building a fully connected, high bisection bandwidth Clos network is a must.
This is especially true because application servers, middleware servers, database servers, and storage servers can be anywhere in your datacenter. You never know which servers in your network will need to talk to which others, so you over-provision bandwidth and connectivity to keep tail latency as low as possible.
But high-bandwidth Clos networks aren't necessarily the best architecture for AI training systems, especially given how expensive the networks for AI clusters are becoming. As AI networks grow in cost and complexity, something has to give. That's why researchers at MIT's Computer Science and Artificial Intelligence Laboratory, working with network researchers at Meta Platforms, are thinking outside the box – or, more precisely, thinking about how to eliminate expensive switching layers from AI networks and cut costs dramatically without compromising AI training performance.
The rails-only network architecture devised by CSAIL and Meta Platforms, described in a recent paper and presented at the Hot Interconnects 2024 conference this week, is certainly worthy of The Next Platform's “If you think about it, it's obvious” award. We love “obvious” insights like these because they are often game-changing for technology, and we believe what the CSAIL and Meta Platforms researchers have uncovered has the potential to transform network architectures, especially for AI systems.
Before we dive into the insights behind this rails-only architecture (which, given the way it works in practice, we might also call an inverted-spine network), let's provide a bit of background.
Clos networks are a way to connect any node, or any element within a node (such as a GPU or DPU), to every other node or element across a datacenter. They are not the only way to do all-to-all links between devices on a network. Many supercomputing centers use Dragonfly topologies these days, but adding a machine to a Dragonfly requires rewiring the entire network, which Clos topologies make fairly easy to do; on the other hand, Clos networks do not provide the consistent latency across the network that Dragonfly networks do. (We discussed the issues with these topologies in April 2022, when we analyzed Google's own "Aquila" network interconnect, which is based on the Dragonfly topology.)
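To make the any-to-any idea concrete, here is a minimal sketch of a two-tier leaf/spine Clos fabric – the switch sizes and helper functions are made up for illustration, not taken from the paper or from Meta's actual configuration:

```python
# Toy two-tier (leaf/spine) Clos fabric; sizes are illustrative only.
GPUS_PER_LEAF = 16   # devices hanging off each leaf switch

def leaf_of(gpu_id: int) -> int:
    """Each GPU (via its NIC) attaches to exactly one leaf switch."""
    return gpu_id // GPUS_PER_LEAF

def clos_path(src: int, dst: int, spine: int = 0) -> list:
    """Any device reaches any other in at most a leaf -> spine -> leaf path."""
    src_leaf, dst_leaf = leaf_of(src), leaf_of(dst)
    if src_leaf == dst_leaf:
        return [f"gpu{src}", f"leaf{src_leaf}", f"gpu{dst}"]
    return [f"gpu{src}", f"leaf{src_leaf}", f"spine{spine}",
            f"leaf{dst_leaf}", f"gpu{dst}"]

print(clos_path(3, 5))     # same leaf: a single switch hop
print(clos_path(3, 100))   # different leaves: up through a spine and back down
```

The point is that every device has a bounded-hop path to every other device, and adding more devices just means adding more leaves (and spine capacity) without rewiring what is already there.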
As we know, it takes somewhere around 24,000 to 32,000 GPUs to train large models with trillions of parameters relatively quickly. As previously reported, Meta Platforms currently uses 24,576 GPUs to train its Llama 3.1 405B model, and CSAIL and Meta Platforms expect the next generation of models to span 32,768 GPUs in a single cluster. The Clos network is based on Ethernet leaf and spine switches, all supporting remote direct memory access (RDMA), which allows any GPU to share data with any other GPU in the network at the same time across its all-to-all topology.
Weiyan Wang, a PhD student at CSAIL, presented the rails-only architecture at Hot Interconnects, saying that building a high-bandwidth Clos network to interconnect more than 32,000 GPUs would cost $153 million, with the network itself consuming 4.7 megawatts of power. The paper goes into a bit more detail on network speeds as another point of comparison, stating that a full bisection bandwidth Clos fabric using 400 Gb/s links to connect 30,000 GPUs would cost $200 million. Suffice it to say, that's a lot of money – much more than any hyperscaler or cloud builder would spend to network 4,096 server nodes.
Below is a very interesting graph created by Wang that shows how network cost and network power interact as an AI cluster grows in size.
Doubling the GPU count to 65,536 would push the network cost to $300 million at 400 Gb/s port speeds, with the network consuming roughly 6 megawatts of power.
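As a rough sanity check, here is a quick back-of-the-envelope sketch – using only the figures quoted above, not data pulled from the paper – of what the network alone works out to per GPU:

```python
# Per-GPU network cost and power, computed only from the figures quoted above.
clusters = {
    32_768: {"cost_usd": 153e6, "power_mw": 4.7},
    65_536: {"cost_usd": 300e6, "power_mw": 6.0},   # "roughly 6 megawatts"
}

for gpus, c in clusters.items():
    per_gpu_cost = c["cost_usd"] / gpus
    per_gpu_watts = c["power_mw"] * 1e6 / gpus
    print(f"{gpus:,} GPUs: ~${per_gpu_cost:,.0f} and ~{per_gpu_watts:.0f} watts "
          f"of network per GPU")
```

Either way you slice it, the fabric alone runs to several thousand dollars per GPU before a single accelerator is paid for.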
Most GPU clusters that run large language models use what's called a rail-optimized network, which is a variant of the leaf-spine network that will be familiar to readers of The Next Platform. For the purposes of comparing the data above, this is what it looks like:
We need to somehow organize our compute elements and how work gets dispatched to them. What's interesting about rail-optimized networks is that we aggregate ranks of compute devices across rails: the first compute engine of each node connects to one leaf switch, the second compute engine of each node connects to another leaf, and so on.
As a more precise (and, as you will see, more relevant) example, Wang showed how in a cluster of 128 of Nvidia's eight-way DGX H100 nodes, the 1,024 GPUs are interconnected with a total of 16 leaf switches – two leaf switches per rail, with one rail for each of the eight GPU ranks across the cluster.
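Here is a minimal sketch of that rail-optimized mapping, assuming the 128-node, eight-GPU-per-node configuration above with two leaf switches per rail (the helper names are ours, for illustration only):

```python
# Rail-optimized mapping: 128 eight-way nodes (1,024 GPUs), two leaves per rail.
NODES, GPUS_PER_NODE, LEAVES_PER_RAIL = 128, 8, 2

def rail_and_leaf(node: int, local_rank: int) -> tuple:
    """A GPU's rail is its local rank; each rail's 128 GPUs split across two leaves."""
    rail = local_rank                            # GPU k of every node sits on rail k
    leaf = node // (NODES // LEAVES_PER_RAIL)    # first 64 nodes on leaf 0, the rest on leaf 1
    return rail, leaf

print(rail_and_leaf(0, 0), rail_and_leaf(127, 0))   # same rail, different leaf switches
print(rail_and_leaf(0, 0), rail_and_leaf(0, 1))     # same node, different rails
```

In other words, rails group GPUs by their rank within the node, not by the node they sit in.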
Here's the insight the CSAIL and Meta Platforms researchers had: As the LLMs were being trained, they wondered what the traffic patterns looked like across the rails and up to the spine switches, and they made a surprising and very useful discovery: Most of the traffic stays within a rail and rarely crosses over to another one.
The tests CSAIL and Meta Platforms ran were not against the social network's Llama 3 model, but against variations of models from the OpenAI GPT family with different numbers of parameters.
Below is a detailed description of the traffic patterns of the Megatron GPT-1T model.
Whether it's pipeline parallelism, tensor parallelism, or data parallelism, the traffic very rarely reaches the expensive spine switches that interconnect the leaf-based rail switches.
So what you can do is chop off the head of your network: get rid of the spine aggregation switches entirely.
But wait a minute – what if you do need to share data between rails? Each HGX system board inside a DGX server (or one of its clones) has a number of very high bandwidth, very low latency NVSwitch memory fabric chips built in. So instead of pumping data from a leaf up to a spine to cross between rails, the NVSwitch fabric can be used to pass that data over to the adjacent rail.
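Here is a minimal sketch of that forwarding idea, assuming one rail per local GPU rank; the route() helper is our own illustration, not the paper's actual routing logic:

```python
# Rails-only forwarding: same-rail traffic uses the rail's leaf switch, while
# cross-rail traffic hops over the in-node NVSwitch fabric instead of a spine.
def route(src_node: int, src_rank: int, dst_node: int, dst_rank: int) -> list:
    hops = [f"gpu(n{src_node},r{src_rank})"]
    if src_rank != dst_rank:
        # Cross rails inside the source node via NVSwitch, not via a spine switch.
        hops.append(f"nvswitch(n{src_node})")
        if src_node == dst_node:
            return hops + [f"gpu(n{dst_node},r{dst_rank})"]
        hops.append(f"gpu(n{src_node},r{dst_rank})")   # now sitting on the destination rail
    if src_node != dst_node:
        hops.append(f"rail-leaf(r{dst_rank})")         # that rail's own leaf switch
        hops.append(f"gpu(n{dst_node},r{dst_rank})")
    return hops

print(route(0, 0, 5, 0))   # same rail: a single leaf-switch hop
print(route(0, 0, 5, 3))   # different rails: NVSwitch hop first, then rail 3's leaf
```

The spine never appears in any path, which is the whole trick.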
Genius!
And thus the rails-only network unfolds before our eyes.
That's why we call it an inverted-spine network – it's not that you never need a spine, it's that the NVSwitch fabric has enough capacity to do that job on the rare occasions it is needed, with a fraction of the bandwidth and time. (This may not work with AMD's GPUs, since AMD doesn't have Infinity Fabric switches.)
Of course, this comes with a caveat: for this to work, the shards and replicas driving the tensor parallelism and data parallelism need to be on the same rails in the network.
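One placement that satisfies this caveat might look like the following sketch, which assumes a Megatron-style job on eight-GPU nodes; the grouping here is our simplification of 3D parallelism, not the paper's actual scheduler. Tensor-parallel peers share a node, so their traffic stays on NVSwitch, and data-parallel replicas of a given shard share a local rank, so their traffic stays on one rail.

```python
# Hypothetical 3D-parallel placement that keeps shards and replicas on the same rail.
GPUS_PER_NODE = 8

def placement(node: int, local_rank: int) -> dict:
    return {
        "tensor_parallel_group": node,       # the eight GPUs of one node, linked by NVSwitch
        "rail": local_rank,                  # one rail per local rank
        "data_parallel_group": local_rank,   # replicas of this shard all sit on this rail
    }

# Two replicas of the same shard on different nodes land on the same rail:
print(placement(0, 3), placement(17, 3))
```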
Here is the upshot, in terms of reduced switch and transceiver costs, of making this simple switch (so to speak) from a rail-optimized network to a rails-only network. As it turns out, transceivers and switches make up the majority of the overall cost of the network.
Nvidia may end up regretting putting such a powerful switch at the heart of its HGX system boards. But probably not. Even Nvidia knows that networking can't account for more than 20 or 25 percent of system cost if AI is to become widespread.
For the cluster of 128 DGX H100 servers used in the example above, a rail-optimized network would require 20 128-port switches across the spine and leaf layers to interconnect the 1,024 GPUs in the back-end network, plus 2,688 transceivers to link the GPUs to the leaves and the leaves to the spines. A rails-only network reduces this to eight switches – one per rail – and only 1,152 transceivers to put the GPUs on eight separate rails, using the hundreds of NVSwitch ASICs already on the HGX boards as a rarely used inverted-spine aggregation layer. That saves 41 kilowatts of power and shaves $1.3 million off the cost of the network.
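Doing the quick arithmetic on those quoted counts shows how much of the fabric simply goes away (a sketch that uses only the figures above):

```python
# Rail-optimized versus rails-only for the 128-node, 1,024-GPU example above.
rail_optimized = {"switches": 20, "transceivers": 2_688}
rails_only     = {"switches": 8,  "transceivers": 1_152}

for part in ("switches", "transceivers"):
    before, after = rail_optimized[part], rails_only[part]
    print(f"{part}: {before} -> {after} ({1 - after / before:.0%} fewer)")

# Per the example above, that is worth about 41 kilowatts and $1.3 million.
```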
In their benchmark tests, the researchers found that the rails-only approach had no performance impact on the 3D parallelism used in LLM training, with the all-to-all communication in the cluster incurring only an 11.2 percent performance overhead. And for the LLM models tested, all-to-all traffic made up only 26.5 percent of total communication, which works out to roughly a 2.86 percent hit on overall communication performance. Also keep in mind that communication time is only a small fraction of the overall wall time during an LLM training run, so the impact of occasionally using NVSwitch as a spine is negligible.
This may not be the case for other types of data analytics, AI training, or HPC simulation workloads, but it's interesting to see people trying to figure this out.