While Nvidia GPUs remain unchallenged in AI training, there are early signs that competitors are catching up to the tech giant when it comes to AI inference, especially in terms of power efficiency — though none may be able to match the raw performance of Nvidia's new Blackwell chips.
This morning, MLCommons announced the results of its latest AI inference competition, MLPerf Inference v4.1. This round included first-time submissions using AMD's Instinct accelerators, Google's latest Trillium accelerators, and chips from Toronto-based startup UntetherAI, as well as the first trials of Nvidia's new Blackwell chips. Two other companies, Cerebras and FuriosaAI, announced new inference chips but did not submit to MLPerf.
Like an Olympic sport, MLPerf has many categories and subcategories. The “Datacenter Closed” category has the most submissions. In the Closed category (as opposed to the Open category), entrants must infer a given model as-is, without significant software modifications. The Datacenter category tests entrants on query batching, as opposed to the Edge category, which places an emphasis on minimizing latency.
Within each category, there are nine different benchmarks covering different kinds of AI tasks, including common use cases like image generation (e.g. Midjourney) and LLM Q&A (e.g. ChatGPT), as well as equally important but less-watched tasks like image classification, object detection, and recommendation engines.
This round of the competition included a new benchmark called Mixture of Experts, a growing trend in LLM deployments in which a language model is split into several smaller, independent language models, each fine-tuned for a specific task, such as normal conversation, solving math problems or assisting with coding. The model can then direct each query to an appropriate subset of the smaller models, or “experts.” This approach uses fewer resources per query, reducing costs and increasing throughput, said Miroslav Hodak, chair of the MLPerf Inference Workgroup and senior technical staff at AMD.
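For intuition, here is a minimal, purely illustrative sketch of the routing idea. In a real mixture-of-experts model the router is a learned layer inside the network; the hand-written rules and expert functions below are hypothetical placeholders.

```python
# Illustrative mixture-of-experts routing: each query goes to one small,
# specialized "expert" rather than one giant model. Not a real MoE router.
from typing import Callable, Dict

def math_expert(prompt: str) -> str:
    return f"[math expert] answering: {prompt}"

def code_expert(prompt: str) -> str:
    return f"[code expert] answering: {prompt}"

def chat_expert(prompt: str) -> str:
    return f"[chat expert] answering: {prompt}"

EXPERTS: Dict[str, Callable[[str], str]] = {
    "math": math_expert,
    "code": code_expert,
    "chat": chat_expert,
}

def route(prompt: str) -> str:
    """Pick an expert for the query; a real model learns this routing."""
    text = prompt.lower()
    if any(tok in text for tok in ("integral", "solve", "equation")):
        name = "math"
    elif any(tok in text for tok in ("python", "bug", "function")):
        name = "code"
    else:
        name = "chat"
    return EXPERTS[name](prompt)

print(route("Solve the equation x^2 = 4"))
print(route("Why is my Python function slow?"))
```

Because only a fraction of the model's parameters is active for any given query, each query costs less compute, which is where the cost and throughput gains come from.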
The winners across every benchmark in the popular Datacenter Closed category were still submissions based on Nvidia's H200 GPUs and GH200 superchips, which combine a GPU and a CPU in the same package. However, a closer look at the results paints a more complex picture: some submitters used many accelerator chips, while others used just one. If we normalize the queries per second each system handled by the number of accelerators it used, and keep only the best-performing submission for each accelerator type, some interesting details emerge (note that this approach ignores the role of the CPU and interconnect).
On an accelerator-by-accelerator basis, Nvidia's Blackwell outperformed all previous chip iterations by 2.5x on the only benchmark submitted, the LLM Q&A task. Untether AI's speedAI240 Preview chip performed roughly on par with the H200 on image recognition, the only task submitted. Google's Trillium performed just over half as well as the H100 and H200 on image generation, and AMD's Instinct performed roughly on par with the H100 on the LLM Q&A task.
Blackwell's Power
One of the reasons for Nvidia Blackwell's success is its ability to run LLMs using 4-bit floating-point precision. Nvidia and its rivals have been reducing the number of bits used to represent data in parts of transformer models such as ChatGPT in order to speed up computation. Nvidia introduced 8-bit math with the H100, and this submission marks the first demonstration of 4-bit math on an MLPerf benchmark.
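To see the precision trade-off concretely, here is a toy quantization sketch. It uses a simple symmetric 4-bit integer scheme purely for illustration; Blackwell's FP4 is a 4-bit floating-point format, and the article does not detail Nvidia's actual software recipe.

```python
import numpy as np

# Toy 4-bit quantization of a weight matrix (symmetric integer scheme).
# Illustration only; this is not Nvidia's FP4 implementation.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)

scale = np.abs(weights).max() / 7.0            # signed 4-bit range: -8..7
q = np.clip(np.round(weights / scale), -8, 7)  # only 16 representable levels
dequantized = q * scale

print("max absolute error after 4-bit round-trip:",
      np.abs(weights - dequantized).max())
```

With only 16 representable values, every weight gets nudged, which is why keeping accuracy at 4 bits is the hard part.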
The biggest challenge in using these low-precision numbers is maintaining accuracy, says Dave Salvator, director of product marketing at Nvidia. To maintain the high level of accuracy required for MLPerf submissions, the Nvidia team had to significantly innovate its software, he said.
Another key contributor to Blackwell's success is its memory bandwidth: 8 terabytes per second, up from the H200's 4.8 terabytes per second.
Nvidia's GB200 Grace Blackwell superchip. Image: Nvidia
While Nvidia's Blackwell submission used a single chip, Salvator said it was built with networking and scalability in mind and will perform best when paired with Nvidia's NVLink interconnect. Blackwell GPUs support up to 18 NVLink connections at 100 gigabytes per second each, for a total bandwidth of 1.8 terabytes per second, roughly double the H100's interconnect bandwidth.
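Spelled out, the arithmetic behind those bandwidth figures looks like this; the H100's 900-gigabyte-per-second aggregate NVLink bandwidth is the figure implied by "roughly double."

```python
# Back-of-envelope check of the bandwidth figures quoted above.
nvlink_links = 18
per_link_gb_s = 100                       # GB/s per NVLink connection
blackwell_nvlink_tb_s = nvlink_links * per_link_gb_s / 1000
h100_nvlink_tb_s = 0.9                    # H100's 900 GB/s aggregate NVLink

print(f"Blackwell NVLink: {blackwell_nvlink_tb_s} TB/s "
      f"({blackwell_nvlink_tb_s / h100_nvlink_tb_s:.1f}x the H100)")

blackwell_hbm_tb_s, h200_hbm_tb_s = 8.0, 4.8
print(f"HBM bandwidth: {blackwell_hbm_tb_s / h200_hbm_tb_s:.2f}x the H200")
```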
Salvator argues that as large language models grow in size, inference will require multi-GPU platforms to keep up with demand, and Blackwell is built for that eventuality. "Blackwell is a platform," Salvator said.
Nvidia submitted its Blackwell chip-based systems in the Preview subcategory, meaning they're not yet available for sale, but are expected to be available by the next MLPerf release in six months.
Untether AI shines in power use and at the edge
MLPerf also includes an energy-measurement counterpart to each benchmark, systematically testing the wall-plug power each system consumes while executing a task. The main event, the Datacenter Closed Energy category, saw only two submitters this round: Nvidia and Untether AI. Nvidia competed in every benchmark, while Untether AI submitted only for image recognition.
| Submitted by | Accelerator | Number of accelerators | Queries per second | Watts | Queries per second per watt |
| --- | --- | --- | --- | --- | --- |
| Nvidia | Nvidia H200-SXM-141GB | 8 | 480,131.00 | 5,013.79 | 95.76 |
| Untether AI | UntetherAI speedAI240 Slim | 6 | 309,752.00 | 985.52 | 314.30 |
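A quick check of the efficiency column in the table above:

```python
# Queries per second per watt, recomputed from the table's raw columns.
h200 = {"queries_per_second": 480_131.0, "watts": 5_013.79}
speedai240_slim = {"queries_per_second": 309_752.0, "watts": 985.52}

for name, row in [("H200 (8 accelerators)", h200),
                  ("speedAI240 Slim (6 accelerators)", speedai240_slim)]:
    print(f"{name}: {row['queries_per_second'] / row['watts']:.2f} queries/s per watt")

ratio = (speedai240_slim["queries_per_second"] / speedai240_slim["watts"]) / \
        (h200["queries_per_second"] / h200["watts"])
print(f"UntetherAI efficiency advantage: {ratio:.1f}x")
```

That works out to roughly a 3.3x energy-efficiency advantage for the UntetherAI system on this benchmark.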
The startup achieved this incredible efficiency by building its chips with an approach called at-memory computing: UntetherAI's chips are built as a grid of memory elements with small processors interspersed in their direct vicinity. The processors are parallelized, each operating simultaneously on data from nearby memory units, drastically reducing the time and energy it takes to move model data back and forth between memory and compute cores.
“We found that 90 percent of the energy spent running AI workloads was moving data from DRAM to cache to the processing elements,” says Robert Beachler, vice president of products at Untether AI. “So what Untether has done is invert that situation. Instead of moving data to compute, we're going to move compute to data.”
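A rough way to see why that matters: if 90 percent of the energy goes to moving data, shrinking that portion dominates the overall savings, in the style of Amdahl's law. The reduction factors below are hypothetical, chosen only to illustrate the shape of the curve.

```python
# Energy-accounting sketch based on Beachler's 90 percent figure.
# The data-movement reduction factors are hypothetical illustrations.
def relative_energy(movement_reduction: float, movement_share: float = 0.9) -> float:
    """Total energy relative to baseline if data-movement energy shrinks."""
    compute_share = 1.0 - movement_share
    return compute_share + movement_share / movement_reduction

for k in (1, 2, 5, 10):
    print(f"{k}x less data-movement energy -> "
          f"{1 / relative_energy(k):.1f}x better overall efficiency")
```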
This approach proved especially effective in another MLPerf subcategory, Edge Closed, which targets more on-the-ground use cases such as machine inspection on the factory floor, vision-guided robots, and autonomous vehicles, applications where low energy consumption and fast processing are paramount, Beachler said.
| Submitted by | Accelerator | Number of accelerators | Single-stream latency (ms) | Multi-stream latency (ms) | Samples/s |
| --- | --- | --- | --- | --- | --- |
| Lenovo | Nvidia L4 | 2 | 0.39 | 0.75 | 25,600.00 |
| Lenovo | Nvidia L40S | 2 | 0.33 | 0.53 | 86,304.60 |
| Untether AI | UntetherAI speedAI240 Preview | 2 | 0.12 | 0.21 | 140,625.00 |
In image recognition, the only task for which UntetherAI reported results, its speedAI240 Preview chip beat the Nvidia L40S on latency by 2.8x and on throughput (samples per second) by 1.6x. The startup also submitted power results in this category, but its Nvidia-accelerated competitors did not, making a direct comparison difficult. However, UntetherAI's speedAI240 Preview chip has a nominal power draw of 150 watts per chip, compared to 350 watts for Nvidia's L40S, so the startup delivers its latency advantage alongside a nominal 2.3x power reduction.
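Those ratios follow directly from the table and the nominal power figures:

```python
# Ratios behind the edge comparison above (image recognition).
l40s = {"single_stream_ms": 0.33, "samples_per_s": 86_304.6, "nominal_watts": 350}
speedai240 = {"single_stream_ms": 0.12, "samples_per_s": 140_625.0, "nominal_watts": 150}

print(f"latency advantage:    {l40s['single_stream_ms'] / speedai240['single_stream_ms']:.1f}x")
print(f"throughput advantage: {speedai240['samples_per_s'] / l40s['samples_per_s']:.1f}x")
print(f"nominal power:        {l40s['nominal_watts'] / speedai240['nominal_watts']:.1f}x lower")
```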
Cerebras, Furiosa skip MLPerf but announce new chips
Yesterday, at the IEEE Hot Chips conference at Stanford University, Cerebras unveiled its own inference service. The Sunnyvale, California-based company makes gigantic chips, as large as a silicon wafer allows, which avoids interconnects between chips and vastly increases the memory bandwidth of its devices. Those chips have primarily been used to train large neural networks; now the company has upgraded its software stack to use its latest computer, the CS-3, for inference.
Cerebras hasn't submitted to MLPerf, but the company claims its platform beats the H100 by 7x and competing AI startup Groq's chips by 2x in LLM tokens generated per second. "We're in the dial-up era of generative AI right now," says Andrew Feldman, CEO and co-founder of Cerebras. "That's because there's a memory bandwidth barrier. Whether it's Nvidia's H100, or the MI300, or the TPU, they all use the same off-chip memory and have the same limitations. We're going to break through that, and we can do that because we're wafer-scale."
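A back-of-envelope sketch makes the memory-bandwidth barrier Feldman describes concrete: at batch size 1, generating each token means streaming the model's weights from memory, so token rate is roughly bandwidth divided by model size. The model size and the on-chip multiplier below are illustrative assumptions, and the calculation ignores batching and KV-cache traffic.

```python
# Rough, assumption-laden sketch of a memory-bandwidth-bound token rate.
model_bytes = 70e9 * 2       # hypothetical: 70B parameters at 16-bit weights
bandwidths = {
    "H200-class off-chip HBM (4.8 TB/s)": 4.8e12,
    "hypothetical 100x faster on-chip memory": 4.8e14,
}

for name, bytes_per_s in bandwidths.items():
    print(f"{name}: ~{bytes_per_s / model_bytes:,.0f} tokens/s per stream")
```

The point of the sketch is only that moving the weights on-chip raises the ceiling by whatever factor the on-chip bandwidth exceeds HBM, which is the bet wafer-scale designs are making.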
Also at Hot Chips, Seoul-based Furiosa unveiled its second-generation chip, RNGD (pronounced "renegade"). What differentiates Furiosa's chip is its Tensor Contraction Processor (TCP) architecture. A fundamental operation in AI workloads is matrix multiplication, which is typically implemented as a hardware primitive. But the size and shape of matrices, more generally called tensors, can vary widely. RNGD implements the more generalized operation, tensor contraction, as a primitive. "During inference, batch sizes vary widely, so it's important to leverage the inherent parallelism and data reuse from certain tensor shapes," June Paik, founder and CEO of Furiosa, said at Hot Chips.
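In NumPy terms, tensor contraction is what einsum expresses; ordinary matrix multiplication is just the special case of contracting a single shared index. This is a conceptual illustration only, not Furiosa's TCP implementation.

```python
import numpy as np

# Matrix multiplication as a special case of tensor contraction.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 16))
B = rng.normal(size=(16, 4))
assert np.allclose(np.einsum("ik,kj->ij", A, B), A @ B)  # plain matmul

# The same primitive handles higher-rank tensors directly, e.g. contracting
# two shared axes at once without first reshaping everything into matrices.
X = rng.normal(size=(2, 8, 16, 3))    # batch, rows, contracted dim, heads
W = rng.normal(size=(16, 3, 5))       # contracted dim, heads, output dim
Y = np.einsum("brkh,kho->bro", X, W)  # contracts k and h together
print(Y.shape)                        # (2, 8, 5)
```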
Furiosa didn't submit to MLPerf, but it did compare its RNGD chip internally on MLPerf's LLM summarization benchmark and found it performed on par with Nvidia's L40S chip while consuming just 185 watts of power, compared to the L40S's 320 watts. And Paik said further software optimizations should improve performance.
IBM also announced that its new Spyre chips, designed for enterprise generative AI workloads, will be available in the first quarter of 2025.
At the very least, shoppers in the AI inference chip market won't be bored anytime soon.