As businesses race to adopt generative AI and bring new services to market, demands on data center infrastructure are at an all-time high. Training large language models is one challenge, but delivering real-time services that leverage LLMs is another.
In the latest round of MLPerf industry benchmarks, Inference v4.1, the NVIDIA platform was the top performer in all data center tests. The first submission for the upcoming NVIDIA Blackwell platform showed that, using its second-generation Transformer Engine and FP4 Tensor Cores, it delivered up to 4x the performance of NVIDIA H100 Tensor Core GPUs on MLPerf's largest LLM workload, Llama 2 70B.
NVIDIA H200 Tensor Core GPUs achieved outstanding results across all benchmarks in the Data Center category, including the latest addition to the benchmark, the Mixtral 8x7B Mixture of Experts (MoE) LLM, which has 46.7 billion parameters in total, with 12.9 billion active per token.
MoE models are gaining popularity as a way to bring more versatility to LLM deployments, as they can answer a wider variety of questions and perform a wider variety of tasks in a single deployment. They are also more efficient, as they only activate a smaller number of experts per inference, meaning they deliver results much faster than dense models of a similar size.
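To make that property concrete, here is a minimal, illustrative sketch in PyTorch of top-k expert routing: a router scores each token and only the selected experts run, so only a fraction of the layer's parameters are used per token. This is not Mixtral's actual implementation; the layer sizes, expert count and top-k value are placeholders chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative mixture-of-experts layer: a router picks the top-k
    experts for each token, so only a small subset of the total
    parameters is active per token."""

    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(4, 64)                     # 4 tokens, d_model=64
print(TinyMoELayer()(tokens).shape)             # torch.Size([4, 64])
```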
The continued growth of LLMs demands more computing power to handle inference requests. To meet the real-time latency requirements of today's LLMs, and to serve as many users as possible, multi-GPU computing is a must. Based on the NVIDIA Hopper architecture, NVIDIA NVLink and NVSwitch provide high-bandwidth communication between GPUs, delivering significant benefits for real-time, cost-effective large-model inference. The Blackwell platform further extends the capabilities of NVLink Switch with a larger, 72-GPU NVLink domain.
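As a rough illustration of what multi-GPU serving looks like in practice, the sketch below uses the open-source vLLM library to shard a large model across several GPUs with tensor parallelism, where the shards exchange activations over the GPU interconnect. The checkpoint name and GPU count are illustrative, and this is not the configuration used in NVIDIA's MLPerf submissions.

```python
from vllm import LLM, SamplingParams

# Illustrative only: shard an (assumed) 70B checkpoint across 8 GPUs.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder model name
    tensor_parallel_size=8,                  # split weights across 8 GPUs
)

outputs = llm.generate(
    ["Explain why MoE models activate only a few experts per token."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```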
In addition to NVIDIA's submissions, 10 NVIDIA partners – ASUSTek, Cisco, Dell Technologies, Fujitsu, Giga Computing, Hewlett Packard Enterprise (HPE), Juniper Networks, Lenovo, Quanta Cloud Technology and Supermicro – all had solid MLPerf inference submissions, highlighting the broad availability of the NVIDIA platform.
Continuous software innovation
The NVIDIA platform is under continuous software development, delivering monthly improvements in performance and features.
The latest inference round delivered breakthrough performance improvements on NVIDIA products, including the NVIDIA Hopper architecture, the NVIDIA Jetson platform and the NVIDIA Triton Inference Server.
The NVIDIA H200 GPU delivered up to 27 percent higher generative AI inference performance than in the previous round, highlighting the long-term added value customers can derive from their investment in the NVIDIA platform.
Part of the NVIDIA AI platform and available in NVIDIA AI Enterprise software, the Triton Inference Server is a full-featured open-source inference server that helps organizations consolidate framework-specific inference servers into a single, unified platform, lowering the total cost of ownership for serving AI models in production and reducing model deployment time from months to minutes.
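For a sense of what that unified interface looks like from the client side, the sketch below sends an inference request to a running Triton server using its Python HTTP client. The model name and tensor names here are hypothetical and would need to match a model already loaded in the server's model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server assumed to be running locally on port 8000.
client = httpclient.InferenceServerClient(url="localhost:8000")

# "INPUT0", "OUTPUT0" and "my_model" are placeholders; they must match
# the model's config.pbtxt in the server's model repository.
inp = httpclient.InferInput("INPUT0", [1, 16], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```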
This MLPerf round saw the Triton Inference Server deliver performance nearly on par with NVIDIA's bare-metal submission, demonstrating that organizations no longer have to choose between having a feature-rich, production-grade AI inference server and achieving peak throughput performance.
Heading to the Edge
Generative AI models deployed at the edge can transform sensor data, such as images and videos, into actionable insights in real time with strong contextual awareness. The NVIDIA Jetson platform for edge AI and robotics is uniquely capable of running any kind of model locally, including LLMs, vision transformers and Stable Diffusion.
In this MLPerf benchmarking round, the NVIDIA Jetson AGX Orin system-on-module achieved more than 6.2x higher throughput and 2.4x lower latency on the GPT-J LLM workload compared with the previous round. Rather than building models for specific use cases, developers can now use this general-purpose, 6-billion-parameter model to interface seamlessly with human language, bringing generative AI to the edge.
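As an illustration of how accessible that model is, the sketch below loads the public GPT-J 6B checkpoint with Hugging Face Transformers and generates text locally on a GPU. This is a generic example, not the optimized inference stack used in NVIDIA's MLPerf submission, and the prompt is invented.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the public GPT-J 6B checkpoint in half precision.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
).to("cuda")  # e.g. the integrated GPU on a Jetson AGX Orin module

prompt = "Summarize the camera feed: two forklifts are idle near dock 3."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```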
Demonstrating performance leadership in all areas
MLPerf Inference demonstrates the versatility and performance of the NVIDIA platform across all benchmark workloads, from the data center to the edge, powering the most innovative AI-powered applications and services. For more details on these results, read our tech blog.
H200 GPU-powered systems are available today from CoreWeave, the first cloud service provider to announce general availability, as well as server manufacturers ASUS, Dell Technologies, HPE, QCT and Supermicro.
Please see the software product information notice.