Intel Gaudi 3 OAM working sample package 1
Intel's primary AI chip until Falcon Shores arrives is the Intel Gaudi 3, and a few new details emerged at Hot Chips 2024. We have been covering Gaudi 3 for a while now (for example, in April 2024), and it is expected to move from sampling to production during 2024.
This is being written live, so please excuse any typos. My fingers are getting sore by 5pm.
Intel Gaudi 3 for AI Training and Inference
This is the third generation of Gaudi, a line that dates back to around 2019. This generation brings further improvements in compute, memory bandwidth, and memory capacity.
Intel Gaudi 3 Hot Chips 2024_Page_02
This is the OAM module. It has two interconnected compute dies that are mirror images of each other.
Intel Gaudi 3 Hot Chips 2024_Page_03
Here is the block diagram. What is really interesting is that there are 14 decoders for HEVC, H.264, JPEG, and VP9, which matters for video inference. We also get a lot of speeds and feeds.
Intel Gaudi 3 Hot Chips 2024_Page_04
Each die has two DCOREs (Deep Learning Cores). Each DCORE has a pair of matrix multiplication engines and 16 tensor processor cores, as well as 24MB of cache.
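As a quick back-of-the-envelope check, the per-package totals can be tallied from those numbers. This is only a sketch: it assumes the engine and cache figures are per DCORE, with two DCOREs on each of the two dies, and the names in the code are illustrative.

```python
# Tally of Gaudi 3 on-package resources.
# Assumption: the MME, TPC, and cache figures are per DCORE,
# with two DCOREs per die and two dies per OAM package.
DIES_PER_PACKAGE = 2
DCORES_PER_DIE = 2
PER_DCORE = {"mme": 2, "tpc": 16, "cache_mb": 24}


def package_totals():
    """Multiply each per-DCORE figure by the number of DCOREs on the package."""
    dcores = DIES_PER_PACKAGE * DCORES_PER_DIE
    return {name: count * dcores for name, count in PER_DCORE.items()}


totals = package_totals()
print(totals)  # {'mme': 8, 'tpc': 64, 'cache_mb': 96}
```

Under those assumptions, the package works out to 8 MMEs, 64 TPCs, and 96MB of cache.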
Intel Gaudi 3 Hot Chips 2024_Page_05
The Matrix Multiplication Engine (MME) is the Gaudi 3 accelerator's engine for large matrix calculations.
Intel Gaudi 3 Hot Chips 2024_Page_06
The Tensor Processor Cores (TPCs) handle the non-matmul computations.
Intel Gaudi 3 Hot Chips 2024_Page_07
L2, L3, and HBM all sit in a unified memory space. A memory context ID allows shared cache lines to be tagged, and a near-memory compute capability takes some work off the TPCs.
Intel Gaudi 3 Hot Chips 2024
Gaudi 3 also has its own control path and runtime drivers.
Intel Gaudi 3 Hot Chips 2024_Page_09
A quick word here on the Intel Gaudi software suite: I wish Intel had gone a step further and talked about the Gaudi suite for Falcon Shores. If Falcon Shores is a 2025 product, that should be on the table.
Intel Gaudi 3 Hot Chips 2024_Page_10
The graph compiler orchestrates how work is divided across the accelerator's compute engines. The NoC bandwidth is designed to support the MMEs and TPCs working in parallel.
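One way to see why running the MMEs and TPCs in parallel matters is a toy two-stage pipeline model, where each tile of work goes through a matmul stage and then a non-matmul stage. This is only a sketch: the tile count and per-tile times are made-up numbers in arbitrary units, not Gaudi 3 measurements, and the function names are hypothetical.

```python
def serial_time(tiles, t_mme, t_tpc):
    """Total time if every tile finishes both stages before the next tile starts."""
    return tiles * (t_mme + t_tpc)


def pipelined_time(tiles, t_mme, t_tpc):
    """Total time if the TPC stage of one tile overlaps the MME stage of the next.

    The first tile pays for both stages; after that, one tile completes
    every max(t_mme, t_tpc) time units (the slower stage sets the pace).
    """
    if tiles == 0:
        return 0
    return t_mme + t_tpc + (tiles - 1) * max(t_mme, t_tpc)


# Hypothetical workload: 8 tiles, 4 time units of matmul, 3 of non-matmul work.
print(serial_time(8, 4, 3))     # 56
print(pipelined_time(8, 4, 3))  # 35
```

In this model the overlap only pays off if data can move between the two kinds of engines fast enough to keep both busy, which is the role the interconnect bandwidth plays.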
Intel Gaudi 3 Hot Chips 2024_Page_11
When I saw Habana Labs at Hot Chips 31 in 2019 (when Hot Chips was last held at Stanford Memorial Auditorium), one of the cool things they did was use RDMA over Ethernet directly from the accelerators, both to connect the accelerators to each other and to build out larger topologies.
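For a sense of what connecting each accelerator to every other one implies, the number of direct peer pairs in a full mesh of n devices is n(n-1)/2. The sketch below assumes an eight-accelerator node purely as an illustration. It says nothing about how many Ethernet ports Gaudi 3 dedicates to each peer, and the function name is hypothetical.

```python
from itertools import combinations


def full_mesh_pairs(n):
    """Return every unordered pair of devices in a full mesh of n devices."""
    return list(combinations(range(n), 2))


# Assumption for illustration: an 8-accelerator node.
pairs = full_mesh_pairs(8)
peers_of_device_0 = [b for a, b in pairs if a == 0]

print(len(pairs))              # 28
print(len(peers_of_device_0))  # 7
```

Each device in the eight-way example has seven direct peers, for 28 peer pairs in total.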
Intel Gaudi 3 Hot Chips 2024_Page_12
Here are some performance benchmarks. The scaling work has been done, but it looks like Llama3-8B is still being optimized.
Intel Gaudi 3 Hot Chips 2024_Page_13
Gaudi 3 is designed to scale out easily over standard Ethernet networking.
Intel Gaudi 3 Hot Chips 2024_Page_14
At the same time, there is the question of what “at any scale” means, and whether it has actually been tested on high-end systems with 65,000 or 100,000+ accelerators.
Final Words
This is a chip that is ramping into production, so we should start seeing more of it soon. After showing Gaudi 2 in the Intel Developer Cloud last year, we got our first look at the Gaudi 3 UBB earlier this year.
In April 2024, the Supermicro Gaudi 3 box was also unveiled.
Supermicro SYS 822GA NGR3 Intel Gaudi 3 8 Way 2
There's a lot here, and we want to see it rolled out at scale.