SAMBA NOVA SN40L HOT CHIPS 2024_Page_02
The theme of Hot Chips 2024 is obviously AI. The SambaNova SN40L RDU (Reconfigurable Dataflow Unit) is the company's first design for the trillion-parameter AI model era. I think an SN40L is on display outside, and there was a crowd around it early this morning. Hopefully I'll be able to take a snapshot later.
We are covering Hot Chips 2024 live this week, so please excuse any typos.
SambaNova SN40L RDU for Trillion Parameter AI Models
The new SambaNova SN40L uses the “Cerulean” architecture. It is a 5nm TSMC chip with three tiers of memory, which is very impressive, and a dataflow architecture designed for both training and inference.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_02
The three tiers of memory are 520MB of on-chip SRAM, 64GB of HBM, and additional DDR memory as a capacity tier. SambaNova is showing a 16-socket system here with roughly 8GB of on-chip SRAM and 1TB of HBM in total.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_03
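The 16-socket totals quoted above follow from the per-socket figures. Here is a quick sanity check in Python. The function name and the unit conventions (decimal MB to GB for SRAM, 1TB = 1024GB for HBM) are assumptions for illustration, not something taken from the slides.

```python
# Per-socket capacities as quoted in the text.
SRAM_MB_PER_SOCKET = 520
HBM_GB_PER_SOCKET = 64
SOCKETS = 16


def system_totals(sockets=SOCKETS):
    """Return (total SRAM in GB, total HBM in TB) for a multi-socket system.

    Assumes 1 GB = 1000 MB for SRAM and 1 TB = 1024 GB for HBM.
    """
    sram_gb = SRAM_MB_PER_SOCKET * sockets / 1000
    hbm_tb = HBM_GB_PER_SOCKET * sockets / 1024
    return sram_gb, hbm_tb


sram_gb, hbm_tb = system_totals()
print(f"SRAM: {sram_gb:.2f} GB, HBM: {hbm_tb:.2f} TB")
```

Under those assumptions, the SRAM total lands slightly above the rounded 8GB figure, and the HBM total is exactly 1TB.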
Shown here are 1040 compute and memory units with SambaNova tile mesh switches.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_04
This is the compute unit (PCU). Instead of a traditional fetch/decode execution pipeline, it has a series of static stages. The PCU can act as a streaming unit, with data flowing from left to right. The blue block is a cross-lane reduction tree. For matrix operations, the PCU can be used as a systolic array.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_05
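To picture what a cross-lane reduction tree does, here is a minimal software sketch that combines lane values pairwise, level by level. This is only a conceptual model and not SambaNova's hardware design. The function name `tree_reduce` and the lane count of 8 are hypothetical choices for illustration.

```python
from operator import add


def tree_reduce(lanes, op):
    """Reduce lane values pairwise, returning (result, number of tree levels)."""
    values = list(lanes)
    if not values:
        raise ValueError("need at least one lane")
    levels = 0
    while len(values) > 1:
        # Combine neighbouring lanes: (0,1), (2,3), ...
        paired = [op(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        # An odd lane out is carried to the next level unchanged.
        if len(values) % 2:
            paired.append(values[-1])
        values = paired
        levels += 1
    return values[0], levels


# Hypothetical 8-lane example: sum the lane values.
total, levels = tree_reduce(range(8), add)
print(total, levels)
```

The point of the tree shape is that the number of levels grows with the logarithm of the lane count rather than with the lane count itself.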
Here is a high-level block diagram of the memory unit. These are programmable, managed scratchpads instead of traditional caches.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_06
The chip also has a mesh network. There are three physical networks: Vector, Scalar, and Control.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_07
The AGCU is used to access off-chip memory (HBM and DDR) and the PCU is designed to access the on-chip SRAM scratchpad.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_08
This is the top-level interconnect.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_09
Below is an example of how Softmax is captured by the compiler and mapped to the hardware.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_10
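For readers who want to see why Softmax is a natural candidate for this kind of mapping, here is the standard numerically stable Softmax split into its reduction and elementwise stages. This is a generic sketch in plain Python. It does not reproduce the compiler's actual mapping from the slide, and the function name is an assumption.

```python
import math


def softmax_stages(x):
    """Numerically stable softmax written as four explicit stages."""
    m = max(x)                          # stage 1: reduction (max)
    e = [math.exp(v - m) for v in x]    # stage 2: elementwise exp
    s = sum(e)                          # stage 3: reduction (sum)
    return [v / s for v in e]           # stage 4: elementwise divide


probs = softmax_stages([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

The elementwise stages are streaming work, while the max and sum stages are reductions, which lines up with the streaming and reduction-tree modes described for the compute unit.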
Mapping this onto LLM and GenAI transformer models gives the following. Looking inside the decoder, there are a number of different operations.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_12
A close-up of a decoder is shown below. Each box is an operator. The chip typically runs multiple operators at the same time and keeps the data on-chip for reuse.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_13
Below is SambaNova's guess at how the operators are fused on a GPU, which the company points out may not be accurate:
SAMBA NOVA SN40L HOT CHIPS 2024_Page_14
In the RDU, the entire decoder is one kernel call, and it is the compiler that does this mapping.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_15
This slide is just a “cool diagram” of decoder mapping.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_16
Going back to the structure of the transformer, here are the different functions in the decoder. Each function call has a startup overhead.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_17
On the RDU, this is written as one call rather than 32 separate calls.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_18
In other words, making one call instead of many reduces call overhead, leaving more time for the chip to do useful work on the data.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_19
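The effect of call overhead can be illustrated with a simple model: total time is the per-call startup overhead multiplied by the number of calls, plus the useful compute time. The overhead and compute values below are made-up placeholders, not SambaNova or GPU measurements. Only the 32-versus-1 call counts come from the talk.

```python
def useful_fraction(num_calls, overhead_us, compute_us):
    """Fraction of total time spent on useful compute.

    Simple model: total = num_calls * overhead_us + compute_us.
    """
    total_us = num_calls * overhead_us + compute_us
    return compute_us / total_us


# Hypothetical numbers for illustration only.
OVERHEAD_US = 5.0     # assumed startup cost per call
COMPUTE_US = 1000.0   # assumed useful work, same in both cases

many = useful_fraction(32, OVERHEAD_US, COMPUTE_US)
one = useful_fraction(1, OVERHEAD_US, COMPUTE_US)
print(f"32 calls: {many:.3f} useful, 1 call: {one:.3f} useful")
```

In this model, the gap between the two cases widens as the per-call overhead grows relative to the useful work.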
Here is SambaNova's performance on Llama 3.1. There is a QR code on the slide. It comes from SambaNova, but we do not know where it leads, so I recommend not using it.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_20
Here is a comparison of scaling versus batch size, with a different QR code, so the same caveat applies. As you increase the batch size, ideally the scaling will match the batch size. SambaNova comes close.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_21
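One way to put a number on "scaling matches the batch size" is to divide the observed throughput gain by the batch-size increase, so that 1.0 means ideal scaling. The sketch below does that. The throughput values are invented placeholders and are not read from SambaNova's slide, and the function name is an assumption.

```python
def scaling_efficiency(throughput_by_batch):
    """Map each batch size to (throughput gain) / (batch size gain).

    The smallest batch size is the baseline; 1.0 means ideal scaling.
    """
    base_batch = min(throughput_by_batch)
    base_tput = throughput_by_batch[base_batch]
    return {
        batch: (tput / base_tput) / (batch / base_batch)
        for batch, tput in sorted(throughput_by_batch.items())
    }


# Invented throughput numbers (arbitrary units), for illustration only.
example = {1: 100.0, 2: 190.0, 4: 360.0}
for batch, eff in scaling_efficiency(example).items():
    print(f"batch {batch}: efficiency {eff:.2f}")
```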
As a result, SambaNova says it has a compelling inference product. The DDR is used to hold the mixture-of-experts model checkpoints. Because the DDR is on-board, SambaNova does not need to go to the host CPU to get that data. The alternative would be to use more GPUs to hold all of those expert model checkpoints. This DDR really helps with the model switching aspect.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_22
This is a slide about training.
SAMBA NOVA SN40L HOT CHIPS 2024_Page_23
It looks like I'm running a little late so I'll wrap it up shortly.
Final Words
Overall, this is cool stuff, and it is great to see the company's accelerator. I'll update this post when I get some close-up photos of the chip during the break.