We are hearing more and more about sustainable AI computing, and FuriosaAI's RNGD is almost the opposite of many of the AI computing platforms we have covered today: instead of striving for maximum performance at maximum power, it is a low-power computing solution.
This is the final talk of the day after over 10 talks, and it's being done live so please excuse any typos.
FuriosaAI RNGD Processor for Sustainable AI Computing
Here are the specs for the card. RNGD is not designed to be the fastest AI chip on the market; the focus is efficiency.
Furiosa AI RNGD Hot Chips 2024_Page_05
Here is the card with the cooler.
FuriosaAI RNGD without cooler and with cooler
For air-cooled data centers, the target TDP is only 150W.
Furiosa AI RNGD Hot Chips 2024_Page_06
The chip is built on a TSMC 5nm process, using 12-Hi HBM3 stacks and CoWoS-S packaging.
Furiosa AI RNGD Hot Chips 2024_Page_07
Rather than focusing on the H100 or B100, FuriosaAI is targeting the NVIDIA L40S. We've written extensively about the L40S before. The goal is not only to offer similar performance, but to offer that performance at a lower power.
Furiosa AI RNGD Hot Chips 2024_Page_08
Efficiency comes from hardware, software and algorithms.
Furiosa AI RNGD Hot Chips 2024_Page_09
One of the challenges FuriosaAI has been addressing is the abstraction layer between hardware and software.
Furiosa AI RNGD Hot Chips 2024_Page_11
Tensor contraction is one of the dominant operations in FuriosaAI's target workloads; on BERT, it accounted for over 99% of the FLOPS.
Furiosa AI RNGD Hot Chips 2024_Page_12
Usually, the hardware primitive is matrix multiplication, and tensor contractions are lowered to it.
Furiosa AI RNGD Hot Chips 2024_Page_13
Instead, the abstraction happens at the tensor contraction level.
Furiosa AI RNGD Hot Chips 2024_Page_14
Furiosa adds low-level einsum to its primitives.
Furiosa AI RNGD Hot Chips 2024_Page_15
Now we multiply matrices A and B to produce C.
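To make the idea concrete, here is a small NumPy sketch (my own illustration, not Furiosa's code): the A×B multiply above is just one einsum-style tensor contraction, and the same primitive also covers contractions that a plain matmul cannot express directly.

```python
import numpy as np

# A matrix multiply C[i,j] = sum_k A[i,k] * B[k,j] is one specific tensor
# contraction: the contracted index k appears in the inputs but not the output.
A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
C = np.einsum("ik,kj->ij", A, B)
assert np.array_equal(C, A @ B)  # identical to the matmul primitive

# The same contraction primitive handles shapes a bare matmul does not,
# e.g. a batched attention-style score computation contracting over dim d:
Q = np.random.rand(2, 8, 16)  # (batch, seq, dim)
K = np.random.rand(2, 8, 16)
scores = np.einsum("bqd,bkd->bqk", Q, K)
```

Treating the contraction itself as the primitive, rather than lowering everything to matmul first, is the abstraction-level choice the slides describe.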
Furiosa AI RNGD Hot Chips 2024_Page_16
Furiosa takes this and schedules it on a real architecture with memory and compute units.
Furiosa AI RNGD Hot Chips 2024_Page_17
From here on, the entire tensor contraction becomes a primitive.
Furiosa AI RNGD Hot Chips 2024_Page_18
Considering spatial and temporal orchestration can increase efficiency and utilization.
Furiosa AI RNGD Hot Chips 2024_Page_19
According to Furiosa, it has flexible reconfiguration capabilities, which are important for keeping performance high as batch sizes change.
Furiosa AI RNGD Hot Chips 2024_Page_20
Let's look at the implementation of RNGD.
Furiosa AI RNGD Hot Chips 2024_Page_21
Here is the interconnection network for accessing the scratchpad memory:
Furiosa AI RNGD Hot Chips 2024_Page_22
Furiosa uses PCIe Gen5 x16 for chip-to-chip communication, and P2P over a PCIe switch for direct card-to-card communication, so if XConn can get this right it will be a great product.
Furiosa AI RNGD Hot Chips 2024_Page_23
Furiosa supports SR-IOV for virtualization.
Furiosa AI RNGD Hot Chips 2024_Page_24
The company has worked on signal and power integrity for reliability.
Furiosa AI RNGD Hot Chips 2024_Page_25
Here is how the Furiosa LLM works in flowchart form.
Furiosa AI RNGD Hot Chips 2024_Page_27
The compiler compiles each partition of the model, with partitions mapped across multiple devices.
Furiosa AI RNGD Hot Chips 2024_Page_28
The compiler optimizes the model for better performance and energy efficiency.
Furiosa AI RNGD Hot Chips 2024_Page_29
The serving framework performs optimizations such as continuous batching to increase utilization.
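Furiosa's framework internals aren't shown, but continuous batching in general means admitting new requests into the running batch as soon as earlier sequences finish, rather than waiting for the whole batch to drain. A minimal sketch with hypothetical names (`serve`, `decode_step` are my own, purely illustrative):

```python
from collections import deque

def serve(requests, decode_step, max_batch=4):
    """Toy continuous-batching loop.

    `decode_step` advances every active sequence by one token and returns
    the sequences that finished on this step.  New requests join the batch
    immediately as slots free up, keeping utilization high.
    """
    queue = deque(requests)
    active, done = [], []
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        finished = decode_step(active)  # one decode iteration over the batch
        done.extend(finished)
        active = [r for r in active if r not in finished]
    return done

# Example: each request needs a different number of decode steps.
reqs = [{"id": i, "remaining": n} for i, n in enumerate([1, 3, 2, 5, 1])]

def step(batch):
    for r in batch:
        r["remaining"] -= 1
    return [r for r in batch if r["remaining"] == 0]

out = serve(list(reqs), step)
```

The point of the technique: a short request (like `id` 0 above) frees its slot after one step, and a queued request takes its place on the very next iteration instead of idling until the longest sequence completes.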
Furiosa AI RNGD Hot Chips 2024_Page_30
The company has a graph-based automated tool to assist with quantization. Furiosa can support a variety of formats, including FP8 and INT4.
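The slide doesn't detail the algorithm, but the basic per-tensor affine quantization that such tools automate can be sketched as follows (symmetric INT4 for illustration; Furiosa's actual scheme is not specified here):

```python
import numpy as np

def quantize_int4_symmetric(w):
    """Symmetric per-tensor quantization to the signed INT4 range [-8, 7]."""
    scale = np.abs(w).max() / 7.0              # map the largest magnitude to 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int4_symmetric(w)
w_hat = dequantize(q, s)
# Rounding to the nearest level bounds the error by half a quantization step.
max_err = np.abs(w - w_hat).max()
```

A graph-based tool automates exactly this kind of scale selection per tensor (or per channel) across the whole model graph, inserting the quantize/dequantize pairs where the hardware formats (FP8, INT4, etc.) require them.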
Furiosa AI RNGD Hot Chips 2024_Page_31
This is the company's development methodology.
Furiosa AI RNGD Hot Chips 2024_Page_32
Final Words
There's a lot of information here, but the quick summary is that the company is leaning on its compiler and software stack to map AI inference efficiently onto a low-power SoC.