NVIDIA detailed its Blackwell AI platform and how it is leveraging a new high-bandwidth interface to fuse two GPU dies into a single chip.
NVIDIA Blackwell GPUs detailed at Hot Chips with NV-HBI, 5th Generation Tensor Cores for AI, 4th Gen NVLINK Switch, Spectrum-X and more
Last week, NVIDIA revealed the first images of Blackwell running in a data center and announced it would be releasing more information about its Blackwell AI platform.
Today, the company announced the latest details about its entire Blackwell platform, which is built not from a single chip but from several different products, including:
- Blackwell GPU
- Grace CPU
- NVLINK Switch Chip
- Bluefield-3
- ConnectX-7
- ConnectX-8
- Spectrum-4
- Quantum-3
The entire NVIDIA Blackwell AI platform is powered by over 400 "optimized" CUDA-X libraries that deliver top performance on Blackwell chips. These libraries target a wide variety of application domains and build on more than a decade of innovation within the CUDA-X stack. They support an ever-expanding set of algorithms to power the next generation of AI models.
Now, let's talk about Blackwell. The chip is built around the following major components:
- AI Superchip – 208 billion transistors (TSMC 4NP, >1,600 mm²)
- Transformer Engine – 5th Gen Tensor Cores (FP4, FP6, FP8 data formats)
- 5th Gen NVLink – scalable up to 576 GPUs (1.8 TB/s bandwidth)
- NV-HBI (NVIDIA High Bandwidth Interface) – die-to-die interconnect with 10 TB/s bandwidth
- RAS Engine – 100% in-system self-test
- Decompression Engine – 800 GB/s bandwidth
- Secure AI – full-performance encryption and TEE
The NVIDIA Blackwell GPU itself has the highest AI compute, memory bandwidth, and interconnect bandwidth of any single GPU. It fuses two reticle-limited GPU dies into one chip using NV-HBI, which we'll explain in a moment. The chip packs 208 billion transistors, fabricated on TSMC's 4NP process node with a total die area of over 1,600 mm². The Blackwell AI GPU offers 20 PetaFLOPS of FP4 AI compute, 8 TB/s of memory bandwidth (8 stacks of HBM3e), 1.8 TB/s of bidirectional NVLink bandwidth, and a high-speed NVLink-C2C link to the Grace CPU.
NVIDIA's multi-die architecture journey has been years in the making, and it culminates in Blackwell's two-die implementation. While this isn't a traditional MCM design, the two GPU dies are fused together using a high-bandwidth interconnect, making the chip behave no differently from a monolithic implementation. That interconnect is NV-HBI (NVIDIA High Bandwidth Interface), which provides 10 TB/s of bidirectional bandwidth across a single die edge at very low power per bit, along with a coherent link between the two GPUs, enabling both superior performance and a no-compromise solution.
The Blackwell GPU architecture is enhanced by the 5th generation Tensor Core architecture with new micro-tensor-scaled FP formats, including FP4, FP6, and FP8. With micro-tensor scaling, a scale factor is shared by each fixed-length vector of elements rather than by an entire tensor, resulting in a wider effective FP range, amplified bandwidth, lower power, and finer granularity of quantization.
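To make the idea concrete, here is a minimal sketch of block scaling in NumPy. The block size and the FP4-like (E2M1) value grid are illustrative assumptions, not NVIDIA's actual implementation; the point is that each fixed-length vector shares one scale factor, so a narrow 4-bit format can still track values across a wide dynamic range:

```python
import numpy as np

# Representable magnitudes of an FP4-like (E2M1) format -- illustrative only.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # elements sharing one scale factor (assumed block size)

def quantize_block_scaled(x: np.ndarray):
    """Quantize a 1-D tensor to an FP4-like grid with per-block scale factors."""
    blocks = x.reshape(-1, BLOCK)
    # One scale per block maps the block's max magnitude onto the grid's max.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    scaled = blocks / scales
    # Round each scaled element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

x = np.random.randn(128).astype(np.float32)
q, s = quantize_block_scaled(x)
print(f"mean abs quantization error: {np.abs(dequantize(q, s) - x).mean():.4f}")
```

Because the scale factor is chosen per small block rather than per tensor, outliers in one block don't crush the precision of every other block, which is where the "finer granularity of quantization" comes from.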
In terms of the performance impact of the 5th generation Tensor Cores, each of the existing data formats (FP16, BF16, FP8) sees a 2x speedup per clock per SM compared to Hopper, while the new FP6 format runs 2x faster and FP4 4x faster than Hopper's FP8. On top of the new formats, Blackwell AI GPUs also raise operating frequencies and SM counts relative to Hopper chips.
One of Blackwell's newest features is NVIDIA Quasar Quantization, which converts high-precision data into low-precision formats such as FP4 using optimized libraries, the hardware and software transformer engines, and low-precision numerical algorithms. Compared to BF16, quantized FP4 delivers the same MMLU scores on LLMs and the same accuracy across the Nemotron-4 15B and 340B models.
NVIDIA Blackwell brings together multiple chips, systems, and NVIDIA CUDA software to power next-generation AI across use cases, industries, and countries:

- NVIDIA GB200 NVL72 is a multi-node, liquid-cooled, rack-scale solution connecting 72 Blackwell GPUs and 36 Grace CPUs, raising the bar for AI system design.
- NVLink interconnect technology provides communication between all GPUs, enabling record-high throughput and low-latency inference for generative AI.
- The NVIDIA Quasar Quantization System pushes the boundaries of physics to accelerate AI computing.
- NVIDIA researchers are building AI models that help build processors for AI.
Another key feature of the NVIDIA Blackwell AI platform is 5th generation NVLink, which ties the entire platform together. Each GPU carries 18 NVLink ports, each running two lanes per direction at 200 Gbps (PAM4 signaling) for 100 GB/s of bidirectional bandwidth per link, or 1.8 TB/s in total per GPU.
Also included are 4th generation NVLink switch chips, configured in an NVLink switch tray, each with a die size of over 800 mm² (TSMC 4NP). These chips scale NVLink to all 72 GPUs of a GB200 NVL72 rack, providing 7.2 TB/s of all-to-all bidirectional bandwidth across 72 ports along with 3.6 TFLOPs of SHARP in-network compute. The tray contains two of these switch chips for a total bandwidth of 14.4 TB/s.
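The bandwidth figures in the last two paragraphs are internally consistent; here is a quick sanity check, a sketch assuming the quoted per-link bandwidth is bidirectional and 1 GB/s = 8 Gb/s:

```python
# Sanity check of the NVLink figures quoted above (assumes the quoted
# per-link bandwidth counts both directions and 1 GB/s = 8 Gb/s).

LANE_GBPS = 200          # Gbps per lane, PAM4 signaling
LANES_PER_DIR = 2        # lanes per direction per NVLink port (the "x2" above)
LINKS_PER_GPU = 18       # 5th-gen NVLink ports per Blackwell GPU
PORTS_PER_SWITCH = 72    # ports per 4th-gen NVLink switch chip

link_gbs = LANES_PER_DIR * LANE_GBPS * 2 / 8        # -> 100.0 GB/s per link
gpu_tbs = LINKS_PER_GPU * link_gbs / 1000           # -> 1.8 TB/s per GPU
switch_tbs = PORTS_PER_SWITCH * link_gbs / 1000     # -> 7.2 TB/s per switch chip
tray_tbs = 2 * switch_tbs                           # -> 14.4 TB/s per tray (2 chips)

print(link_gbs, gpu_tbs, switch_tbs, tray_tbs)
```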
All of this is integrated into the NVIDIA GB200 Grace Blackwell Superchip, a powerhouse of AI computing that pairs one Grace CPU with two Blackwell GPUs (four GPU dies) over the NVLink-C2C interconnect, providing 40 PetaFLOPS of FP4 compute and 20 PetaFLOPS of FP8 compute. One Grace Blackwell tray carries two of these superchips: two Grace CPUs (72 cores each) and four Blackwell GPUs (eight GPU dies).
NVLink spines are used in GB200 NVL72 and NVL36 servers, delivering up to 36 Grace CPUs and 72 Blackwell GPUs fully connected through NVLink switch racks. A full NVL72 system delivers 720 PetaFLOPS of training compute (FP8) and 1,440 PetaFLOPS of inference compute (FP4), supports model sizes of up to 27 trillion parameters, and offers multi-node bandwidth of up to 130 TB/s.
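Those rack-level numbers follow directly from the per-GPU figures quoted earlier; a quick check, assuming 20 PetaFLOPS of FP4 per GPU and FP8 at half that rate (as implied by the superchip figures above):

```python
# Rack-level throughput derived from the per-GPU figures quoted earlier
# (20 PFLOPS FP4 per Blackwell GPU; FP8 assumed to run at half the FP4 rate).

FP4_PFLOPS_PER_GPU = 20
FP8_PFLOPS_PER_GPU = FP4_PFLOPS_PER_GPU // 2

gb200_fp4 = 2 * FP4_PFLOPS_PER_GPU        # superchip, 2 GPUs -> 40 PFLOPS FP4
gb200_fp8 = 2 * FP8_PFLOPS_PER_GPU        # -> 20 PFLOPS FP8
nvl72_inference = 72 * FP4_PFLOPS_PER_GPU  # -> 1440 PFLOPS (FP4 inference)
nvl72_training = 72 * FP8_PFLOPS_PER_GPU   # -> 720 PFLOPS (FP8 training)

print(gb200_fp4, gb200_fp8, nvl72_inference, nvl72_training)
```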
Finally, there is Spectrum-X, the world's first Ethernet fabric built for AI. It consists of two chips: the Spectrum-4 switch, with 100 billion transistors, 51.2 Tbps of aggregate bandwidth, and 64x 800G or 128x 400G port configurations, and the BlueField-3 DPU, with 16 Arm A78 cores, 256 threads, and 400 Gb/s Ethernet connectivity. These two AI Ethernet chips are integrated into the Spectrum-X800 platform, an end-to-end networking solution for cloud AI workloads.
Combined, NVIDIA's Blackwell AI platform will deliver 30x improvement in real-time inference and 25x improvement in energy efficiency over Hopper. But NVIDIA is just getting started. After Blackwell, the green team plans to release Blackwell Ultra in 2025 with increased compute density and memory, followed by Rubin and Rubin Ultra with HBM4 and an entirely new architecture in 2026-2027. The entire CPU, network, and interconnect ecosystem is also scheduled to receive significant updates throughout 2025-2027.