Cerebras Wafer Scale AI Hot Chips 2024_Page_40
Cerebras makes wafer-sized compute chips and the infrastructure around them, building systems far bigger than NVIDIA's GPUs. At Hot Chips 2024, we are getting a closer look at the company's foray into the AI inference space. We got a sneak preview of the performance, and it feels almost ridiculous compared to H100 inference. Instead of scaling across multiple GPUs or moving to off-chip HBM memory, Cerebras just puts the entire model into the SRAM of one giant chip.
Please excuse any typos. This is being written live at Hot Chips.
Cerebras Enters AI Inference, Outperforming Smaller NVIDIA H100 GPUs
To summarize, Cerebras has a huge chip with 44GB of SRAM and a lot of cores – it's the biggest square chip you can make from a circular wafer.
Cerebras Wafer Scale AI Hot Chips 2024_Page_03
Here's another look at the scale compared to a typical GPU: whereas a GPU design cuts a large wafer into smaller chips that then have to be stitched back together at the system level, Cerebras keeps the wafer intact as one giant chip.
Cerebras Wafer Scale AI Hot Chips 2024_Page_04
The box looks like this.
Cerebras Wafer Scale AI Hot Chips 2024_Page_05
Beyond the box, there are plenty of supporting servers. Cerebras started with a cluster in Santa Clara that we have covered before.
Cerebras Wafer Scale AI Hot Chips 2024_Page_06
This is a second cluster of similar size in Stockton, California.
Cerebras Wafer Scale AI Hot Chips 2024_Page_07
The third system, in Dallas, Texas, is five times larger.
Cerebras Wafer Scale AI Hot Chips 2024_Page_08
The new cluster is in Minnesota and is eight times larger than the first cluster.
Cerebras Wafer Scale AI Hot Chips 2024_Page_09
The chip was designed for training large-scale models.
Cerebras Wafer Scale AI Hot Chips 2024_Page_10
And that training happens every day.
Cerebras Wafer Scale AI Hot Chips 2024_Page_11
Using SRAM instead of HBM lets Cerebras scale memory bandwidth well beyond what HBM can deliver.
Cerebras Wafer Scale AI Hot Chips 2024_Page_17
Cerebras claims it runs Llama3.1-8B 20 times faster than cloud services built on the NVIDIA H100, such as Microsoft Azure.
Cerebras Wafer Scale AI Hot Chips 2024_Page_19
Google's AI-generated search results are slow.
Cerebras Wafer Scale AI Hot Chips 2024_Page_21
Therefore, accelerating inference is key to delivering a good user experience.
Cerebras Wafer Scale AI Hot Chips 2024_Page_22
Here's a demo of Cerebras running Llama3.1-70B and outperforming a DGX H100 solution.
Cerebras Wafer Scale AI Hot Chips 2024_Page_23
Here is the benchmark.
Cerebras Wafer Scale AI Hot Chips 2024_Page_24
There is a difference between generating the first token and generating the subsequent tokens, and that per-token generation phase is one of the reasons HBM-based inference is slow.
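As a rough illustration of why that distinction matters (the numbers below are placeholders, not Cerebras or NVIDIA figures), the latency of a response is the time to first token plus the per-token decode time for every subsequent token:

# Illustrative only: end-to-end latency of one LLM response.
# ttft_s = time to first token (prompt processing), tps = decode speed in tokens/second.
def response_latency(ttft_s: float, tps: float, output_tokens: int) -> float:
    return ttft_s + (output_tokens - 1) / tps

# Hypothetical numbers: a 500-token answer at 20 tok/s vs. 1,000 tok/s.
print(response_latency(ttft_s=0.5, tps=20, output_tokens=500))    # ~25.5 seconds
print(response_latency(ttft_s=0.5, tps=1000, output_tokens=500))  # ~1.0 second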
Cerebras Wafer Scale AI Hot Chips 2024_Page_27
Wafer scale offers large amounts of SRAM (44GB), which means Cerebras doesn't need to move to slower HBM memory.
Cerebras Wafer Scale AI Hot Chips 2024_Page_28
Let's look at the WSE-3 core with SRAM.
Cerebras Wafer Scale AI Hot Chips 2024_Page_29
That tiny core is then replicated across the die and wafer.
Cerebras Wafer Scale AI Hot Chips 2024_Page_30
From an interconnect perspective, this all stays on-chip; there is no need to move data off the chip into a separate package.
Cerebras Wafer Scale AI Hot Chips 2024_Page_31
With Cerebras, there's no need to go through an HBM memory interface – everything is on-chip.
Cerebras Wafer Scale AI Hot Chips 2024_Page_32
It is possible to aggregate memory bandwidth between H100s.
Cerebras Wafer Scale AI Hot Chips 2024_Page_33
But even within the DGX H100 8x NVIDIA H100 GPU solution, many power-hungry serial interfaces are required.
Cerebras Wafer Scale AI Hot Chips 2024_Page_34
By not going off-die, Cerebras gets more memory bandwidth at lower power.
Cerebras Wafer Scale AI Hot Chips 2024_Page_35
Because data never leaves the die, Cerebras doesn't need to go through high-speed serial links, PCBs, switch chips, and so on. Instead, it just moves data through the silicon.
Cerebras Wafer Scale AI Hot Chips 2024_Page_36
Read this article to find out why scaling the DGX H100 is difficult.
Cerebras Wafer Scale AI Hot Chips 2024_Page_37
Cerebras charted how much memory bandwidth cloud providers are actually using with Llama3.1-70B inference, showing peak memory bandwidth utilization for the NVIDIA DGX H100.
Cerebras Wafer Scale AI Hot Chips 2024_Page_38
Here's how Cerebras achieves this with a single chip.
Cerebras Wafer Scale AI Hot Chips 2024_Page_40
An entire layer can be placed on a portion of the wafer, and placing the layers adjacently minimizes data movement.
Cerebras Wafer Scale AI Hot Chips 2024_Page_41
Thanks to that memory bandwidth, Cerebras can run at a batch size of 1 instead of needing a larger batch size.
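A rough back-of-the-envelope view of why batch-1 decode is bandwidth-bound: each generated token has to stream all of the weights through the compute once, so the ceiling on tokens per second is roughly bandwidth divided by model size. The bandwidth and model-size numbers below are illustrative assumptions, not vendor specifications.

# Illustrative estimate: at batch size 1, every generated token requires reading
# all model weights once, so decode speed <= memory_bandwidth / model_bytes.
def max_decode_tps(bandwidth_bytes_per_s: float, params: float, bytes_per_param: float = 2.0) -> float:
    return bandwidth_bytes_per_s / (params * bytes_per_param)

HBM_BW = 3.35e12    # ~3.35 TB/s, roughly one H100's HBM3 (assumed figure)
SRAM_BW = 21e15     # ~21 PB/s aggregate on-wafer SRAM bandwidth (assumed figure)

print(max_decode_tps(HBM_BW, 70e9))    # ~24 tokens/s ceiling for a 70B-parameter model in FP16
print(max_decode_tps(SRAM_BW, 70e9))   # orders of magnitude more headroom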
Cerebras Wafer Scale AI Hot Chips 2024_Page_42
The idea here is that generating a token means passing through the model's layers one after another.
Cerebras Wafer Scale AI Hot Chips 2024_Page_43
Here we go again.
Cerebras Wafer Scale AI Hot Chips 2024_Page_44
And again (apologies, this is being covered live).
Cerebras Wafer Scale AI Hot Chips 2024_Page_45
Once you're done, you can move on to the next token.
Cerebras Wafer Scale AI Hot Chips 2024_Page_46
A model the size of Llama3.1-8B runs on a single WSE-3 chip.
Cerebras Wafer Scale AI Hot Chips 2024_Page_48
For larger models like the Llama3.1-70B, you'll need to scale across four wafers.
Cerebras Wafer Scale AI Hot Chips 2024_Page_50
Since a hop only moves activations from wafer to wafer, rather than weights, it does not incur a significant performance penalty.
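A minimal sketch of that kind of layer-pipelined execution, with plain Python layer groups standing in for wafers (a hypothetical structure for illustration, not Cerebras' actual runtime):

# Hypothetical sketch: split the model's layers into contiguous groups, one group
# per wafer, and pass only the activations between groups.
from typing import Callable, List, Sequence

Layer = Callable[[list], list]

def partition(layers: Sequence[Layer], num_wafers: int) -> List[Sequence[Layer]]:
    per_wafer = (len(layers) + num_wafers - 1) // num_wafers
    return [layers[i:i + per_wafer] for i in range(0, len(layers), per_wafer)]

def forward(groups: List[Sequence[Layer]], activations: list) -> list:
    for group in groups:           # one "wafer" at a time
        for layer in group:        # weights stay local to their wafer
            activations = layer(activations)
        # only the activations would cross the wafer-to-wafer link here
    return activations

# Toy usage: 8 identity "layers" spread across 4 "wafers".
groups = partition([lambda x: x] * 8, num_wafers=4)
print(forward(groups, [1.0, 2.0, 3.0]))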
Cerebras Wafer Scale AI Hot Chips 2024_Page_51
This scale-out approach allows Cerebras to reach latency/throughput regimes that GPUs cannot address.
Cerebras Wafer Scale AI Hot Chips 2024_Page_53
This is why it works on Cerebras: a single user is only using a fraction of the chip's bandwidth.
Cerebras Wafer Scale AI Hot Chips 2024_Page_54
As a result, multiple users can run simultaneously on the same chip.
Cerebras Wafer Scale AI Hot Chips 2024_Page_55
Here is a third user.
Cerebras Wafer Scale AI Hot Chips 2024_Page_56
There are four users here.
Cerebras Wafer Scale AI Hot Chips 2024_Page_57
You can also process prompt tokens in parallel.
Cerebras Wafer Scale AI Hot Chips 2024_Page_58
Here's a slide that shows this.
Cerebras Wafer Scale AI Hot Chips 2024_Page_59
Here, one user is performing multiple operations.
Cerebras Wafer Scale AI Hot Chips 2024_Page_60
And these build slides continue.
Cerebras Wafer Scale AI Hot Chips 2024_Page_61
Here, three users are simultaneously working on multiple prompts across different layers of the same chip.
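As a toy illustration of how a layer pipeline keeps several users in flight at once (a made-up schedule, not Cerebras' scheduler): at each step, every pipeline stage can be working on a different user's tokens.

# Toy pipeline schedule: with N stages and N users in flight, every stage stays
# busy each step, and each stage works on a different user's request.
def print_schedule(num_stages: int, num_steps: int) -> None:
    for step in range(num_steps):
        busy = {stage: (step - stage) % num_stages   # which user this stage serves
                for stage in range(num_stages)
                if step - stage >= 0}
        print(f"step {step}: " + ", ".join(f"stage {s} -> user {u}" for s, u in busy.items()))

print_schedule(num_stages=3, num_steps=5)
# step 0: stage 0 -> user 0
# step 1: stage 0 -> user 1, stage 1 -> user 0
# step 2: stage 0 -> user 2, stage 1 -> user 1, stage 2 -> user 0 ...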
Cerebras Wafer Scale AI Hot Chips 2024_Page_62
Back to the chart.
Cerebras Wafer Scale AI Hot Chips 2024_Page_63
You'll need to zoom out 10x to see where Cerebras is.
Cerebras Wafer Scale AI Hot Chips 2024_Page_66
Cerebras says this is just the beginning: the company believes it can further improve speed and throughput.
Cerebras Wafer Scale AI Hot Chips 2024_Page_67
You can do this right now with the Cerebras Inference Service.
Cerebras Wafer Scale AI Hot Chips 2024_Page_69
Here's how to try it out. We don't control the QR codes, so of course you should exercise caution. It will be very interesting to see Llama-405B come online alongside the other larger models.
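For reference, here is a minimal sketch of what calling an OpenAI-style chat-completions endpoint looks like; the base URL, model name, and environment variable are placeholders for illustration, not documented Cerebras values, so check the service's own docs for the real ones.

import os
import requests

# Placeholder endpoint and model name -- substitute the values from the
# service's documentation before running this.
BASE_URL = "https://api.example-inference.invalid/v1"
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['INFERENCE_API_KEY']}"},
    json={
        "model": "llama3.1-70b",
        "messages": [{"role": "user", "content": "Why is batch-1 LLM decoding memory-bandwidth bound?"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])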
Cerebras Wafer Scale AI Hot Chips 2024_Page_70
That's really amazing.
Final Words
Before the talk, I had the chance to sit down with Andrew Feldman (CEO of Cerebras) and he showed me a live demo. It's ridiculously fast.
This is important not just for human-driven interactions. Imagine a world of agents, where computer AI agents interact with multiple other computer AI agents, each agent takes a few seconds to produce an output, and there are multiple steps in the pipeline. For an automated AI agent pipeline like that, you need fast inference to cut the time across the entire chain.