LAS VEGAS—Broadcom's VMware Explore event featured talks on why private clouds using private data to run private AI are the way of the future for enterprises. “It's clear that the future of enterprise is private,” Broadcom CEO Hock Tan wrote in a blog post. A hot subtopic in a lively session for analysts and media was how networks can best coordinate the GPUs and other data center infrastructure needed to deliver AI. Ram Velaga, SVP and GM of Broadcom's Core Switching Group, declared, “Ethernet will be the enabling technology for this.”
Let's stop and think for a moment. Velaga began his comments by asking the audience to “think about what machine learning is and how it's different from cloud computing.” Cloud computing is about increasing CPU utilization, but with ML, it's the opposite, he said. “You can't run an entire machine learning workload on one GPU. You need to connect many GPUs together. So machine learning is a distributed computing problem. It's actually the opposite of the cloud computing problem.”
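Velaga's framing, that no GPU can finish a training step until it has exchanged data with every other GPU, is easy to see in code. Here is a minimal sketch of one data-parallel training step, assuming PyTorch's torch.distributed package with the NCCL backend; the model, optimizer, and loss variables are illustrative stand-ins, not anything Broadcom showed.

```python
# Minimal data-parallel training step (illustrative; assumes the
# process group was already initialized, e.g. via torchrun and
# dist.init_process_group("nccl")).
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss):
    loss.backward()
    world_size = dist.get_world_size()
    # No GPU can apply its optimizer update until gradients have been
    # averaged across ALL GPUs -- this all-reduce is what makes training
    # a distributed computing problem and puts the network on the
    # critical path of every single step.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()
    optimizer.zero_grad()
```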
For hyperscalers such as Amazon, Microsoft, Meta and Tencent, that means connecting tens or even hundreds of thousands of GPUs to each other, in some cases across multiple facilities. In this problem domain, “the network plays a crucial role,” Velaga said. “We subscribe to the idea that the network is the computer.”
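Some rough arithmetic shows why. In the ring all-reduce pattern that collective libraries commonly use, each GPU transmits roughly twice the gradient payload on every step regardless of cluster size, so each added GPU adds full-rate network load. The 10 GB payload below is an illustrative assumption, not a figure from the session.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    # In a ring all-reduce, each GPU sends 2 * (N - 1) / N times the
    # payload, approaching 2x the payload as N grows.
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

# Averaging a (hypothetical) 10 GB of gradients per training step:
for n in (8, 1_024, 100_000):
    gb = ring_allreduce_bytes_per_gpu(10e9, n) / 1e9
    print(f"{n:>7} GPUs -> {gb:.2f} GB sent per GPU per step")
```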
What about NVIDIA's InfiniBand?
And Ethernet is the best way to connect those computers, Velaga said. The alternative here is NVIDIA's InfiniBand, a proprietary set of solutions that the GPU giant describes as ideal for “complex workloads that require high-resolution simulations, extremely large data sets, and ultra-fast processing of highly parallelized algorithms.” What's more, InfiniBand “dramatically improves performance, accelerating time to discovery while reducing cost and complexity,” it says.
Not so, Velaga said. InfiniBand is expensive, fragile and based on the false premise that the physical infrastructure is lossless, he argued. As for Ethernet, which was standardized in the 1980s and has been continually improved since then, he cited these selling points:
- Widespread deployment
- Open and standards-based
- Best Remote Direct Memory Access (RDMA) performance for AI fabrics (a configuration sketch follows this list)
- Low cost compared to proprietary technologies
- Consistent across front-end, back-end, storage and management networks
- Highly available, reliable and easy to use
- Broad range of silicon, hardware, software, automation, monitoring and debug solutions from a large ecosystem
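On the RDMA point, it helps to see how little separates the two fabrics at the software layer. A sketch, assuming the NCCL collective library that most GPU training stacks use: NCCL reaches RDMA over Converged Ethernet (RoCE) through the same verbs transport it uses for InfiniBand, so the fabric is largely a cluster-configuration choice. The environment variables are documented NCCL settings, but the device and interface names below are hypothetical.

```python
# Hedged sketch: steering NCCL's RDMA (verbs) transport at a RoCE NIC.
# The values are example placeholders; run `ibv_devices` on your own
# hosts to find real device names.
import os

os.environ["NCCL_IB_DISABLE"] = "0"        # keep the RDMA transport enabled
os.environ["NCCL_IB_HCA"] = "mlx5_0"       # RoCE-capable NIC (example name)
os.environ["NCCL_IB_GID_INDEX"] = "3"      # RoCEv2 GID index on many systems
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # control-plane interface (example)

import torch.distributed as dist
dist.init_process_group(backend="nccl")    # collectives now run over RoCE
```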
On that last point, the breadth of the ecosystem, Velaga said, “We've been steadily innovating in the Ethernet world,” and continued, “When you're so competitive, you have no choice but to innovate.” He said InfiniBand is a “dead end road.”
To support this position, he pointed to an earlier effort by Microsoft and OpenAI to build Stargate, a $100 billion data center, in effect a supercomputer, that would eventually use millions of AI chips to run OpenAI's large language models. There have been many reports about the effort, but the gist is that Microsoft's existing clusters use InfiniBand while OpenAI prefers Ethernet, which suggests Ethernet will win out.
“Today, you can deploy a million GPU clusters over Ethernet. You can't even scratch the surface with InfiniBand,” Velaga said.