Technician holding Cerebras Systems’ Wafer-Scale Engine, a giant computer chip. (Photo: Cerebras Systems)
AI is everywhere these days, and we’ve become accustomed to chatbots answering our questions like oracles or conjuring up magical images. Those responses are called inferences in the trade, and the colossal computer programs from which they rain are housed in massive data centers referred to as ‘the cloud.’
Now, brace for a downpour.
Cerebras Systems, known for its revolutionary wafer-scale computer chip, as big as a dinner plate, is about to unleash one of the top AI models—Meta’s open-source Llama 3.1—on its chip. Not beside it or above it or below it, but on it—a configuration that could blow away traditional inference.
What’s more, Cerebras claims that running inference on its system costs one-third as much as on Microsoft’s Azure cloud computing platform while using one-sixth the power.
“With speeds that push the performance frontier and competitive pricing, Cerebras Inference is particularly compelling for developers of AI applications with real-time or high-volume requirements,” said Micah Hill-Smith, co-founder and CEO of Artificial Analysis Inc., which provides independent analysis of AI models.
Raindrops On The Water
This could create a ripple effect across the entire AI ecosystem. As inference becomes faster and more efficient, developers will be able to push the boundaries of what AI can do. Applications that were once bottlenecked by hardware limitations may now be able to flourish, leading to innovations that were previously thought impossible.
For example, in the realm of natural language processing, models could generate more accurate and coherent responses. This could revolutionize areas such as automated customer service, where understanding the full context of a conversation is crucial for providing helpful responses. Similarly, in fields like healthcare, AI models could process and analyze larger datasets more quickly, leading to faster diagnoses and more personalized treatment plans.
In the business world, the ability to run inference at unprecedented speeds opens new opportunities for real-time analytics and decision making. Companies could deploy AI systems that analyze market trends, customer behavior and operational data in real-time, allowing them to respond to changes in the market with agility and precision. This could lead to a new wave of AI-driven business strategies, where companies leverage real-time insights to gain a competitive edge.
But whether this will be a cloudburst or a deluge remains to be seen.
As AI workloads shift from training toward inference, more efficient processors become imperative. Many companies are working on this challenge.
“Wafer scale integration from Cerebras is a novel approach that eliminates some of the handicaps that generic GPUs have and shows much promise,” said Jack Gold, the founder of J. Gold Associates, a technology analyst firm. He cautions that Cerebras is still a startup in a room full of big players.
Inference As A Service
Cerebras’ AI inference service not only accelerates the pace of AI model execution but could also alter the way businesses think about deploying and interacting with AI in real-world applications.
In typical AI inference workflows, large language models such as Meta’s Llama or OpenAI’s GPT-4o are housed in data centers, where they are called upon by application programming interfaces (APIs) to generate responses to user queries. These models are enormous and require immense computational resources to operate efficiently. GPUs, the current workhorses of AI inference, are tasked with the heavy lifting, but they struggle under the weight of these models, particularly when it comes to moving data between the model’s memory and its compute cores.
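In practice, that API layer is how most applications reach these models. The sketch below shows the general shape of such a request; the endpoint URL, model name and key are placeholders rather than any particular provider’s actual interface.

```python
# A minimal sketch of how an application typically calls a hosted model's
# inference API. The endpoint, model name and API key are placeholders,
# not any provider's actual values.
import requests

API_URL = "https://api.example-inference.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "llama-3.1-8b",  # illustrative model identifier
    "messages": [
        {"role": "user", "content": "Summarize the plot of Moby-Dick in two sentences."}
    ],
    "max_tokens": 200,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
print(response.json())  # the generated text comes back in the JSON body
```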
But with Cerebras’ new inference service, all the layers of a model, currently the 8-billion- and 70-billion-parameter versions of Llama 3.1, are stored right on the chip. When a prompt is sent to the model, the data can be processed almost instantaneously because it doesn’t have to travel long distances within the hardware.
The result? While a state-of-the-art GPU might process about 260 tokens (pieces of data such as a word or part of a word) per second for an 8-billion-parameter Llama model, Cerebras claims it can handle 1,800 tokens per second. This level of performance, validated by Artificial Analysis, Inc., is unprecedented and sets a new standard for AI inference.
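To put those throughput figures in perspective, here is a rough back-of-envelope comparison using the numbers quoted above; real-world latency also depends on prompt length, batching and network overhead.

```python
# Rough wall-clock time to generate a 500-token answer at the quoted rates.
# Ignores prompt processing, batching and network overhead.
response_tokens = 500

gpu_tokens_per_sec = 260        # figure quoted for a state-of-the-art GPU
cerebras_tokens_per_sec = 1800  # figure claimed by Cerebras for Llama 3.1 8B

print(f"GPU:      {response_tokens / gpu_tokens_per_sec:.2f} s")       # ~1.92 s
print(f"Cerebras: {response_tokens / cerebras_tokens_per_sec:.2f} s")  # ~0.28 s
```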
Cerebras WSE inference comparison. (Image: Cerebras Systems)
“Cerebras is delivering speeds an order of magnitude faster than GPU-based solutions for Meta’s Llama 3.1 8B and 70B AI models,” said Hill-Smith. “We are measuring speeds above 1,800 output tokens per second on Llama 3.1 8B, and above 446 output tokens per second on Llama 3.1 70B—a new record in these benchmarks.”
Cerebras is launching its inference service through an API to its own cloud, but it is already talking to major cloud providers about deploying its model-loaded chips elsewhere. This opens a massive new market for the company, which has struggled to get users to adopt its chip, called the Wafer-Scale Engine.
Breaking The Bottle
The speed of inference today is limited by bottlenecks in the network connecting GPUs to memory and storage. The electrical pathways connecting memory to cores can only carry a finite amount of data per unit of time. While electrons move rapidly in conductors, the actual data transfer rate is constrained by the frequency at which signals can be reliably sent and received, affected by signal degradation, electromagnetic interference, material properties and the length of wires over which the data must travel.
In traditional GPU setups, the model weights are stored in memory separate from the processing units. This separation means that during inference, there’s a constant need to transfer large amounts of data between the memory and the compute cores through tiny wires. Nvidia and others have tried all sorts of configurations to minimize the distance that this data needs to travel—stacking memory vertically on top of the compute cores in a GPU package for example.
“10 times faster than anything else on the market”
Cerebras’ new approach fundamentally changes this paradigm. Rather than etching cores onto a silicon wafer and slicing it up into individual chips, Cerebras etches as many as 900,000 cores onto a single wafer, eliminating the need for external wiring between separate chips. Each core on the WSE combines both computation (processing logic) and memory (static random-access memory, or SRAM) to form a self-contained unit that can operate independently or in concert with other cores.
The model weights are distributed across these cores, with each core storing a portion of the total model. This means that no single core holds the entire model; instead, the model is split up and spread across the entire wafer.
“We actually load the model weights onto the wafer, so it’s right there, next to the core,” explains Andy Hock, Cerebras’ senior vice president of product and strategy.
This configuration allows for much faster data access and processing, as the system doesn’t need to constantly shuttle data back and forth over relatively slow interfaces.
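Conceptually, spreading a model across that many cores looks something like the toy sketch below: a layer’s weight matrix is split into slices, each slice stays next to the compute that uses it, and the partial results are combined. This is an illustration of the general idea, not Cerebras’ actual partitioning scheme.

```python
# Toy illustration of sharding a layer's weights across many cores so each
# core keeps its slice in local memory. Not Cerebras' actual scheme.
import numpy as np

num_cores = 8            # stand-in for the wafer's many cores
d_in, d_out = 512, 1024

rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out))  # full weight matrix for one layer
x = rng.standard_normal(d_in)           # one activation vector

# Each "core" holds only a column slice of W, as if stored in its local SRAM.
shards = np.array_split(W, num_cores, axis=1)

# Every core multiplies the activations by its local slice; no weights move.
partial_outputs = [x @ shard for shard in shards]

# The per-core results are concatenated to form the layer's full output.
y = np.concatenate(partial_outputs)
assert np.allclose(y, x @ W)
```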
According to Cerebras, its architecture can deliver performance “10 times faster than anything else on the market” for inference on models like Llama 3.1, although this remains to be further validated. Importantly, Hock claims that due to the memory bandwidth limitations in GPU architectures, “there’s actually no number of GPUs that you could stack up to be as fast as we are” for these inference tasks.
By optimizing for inference on large models, Cerebras is positioning itself to address a rapidly growing market need for fast, efficient AI inference capabilities.
Picking Nvidia’s Lock
One reason Nvidia has had a virtual lock on the AI market is the dominance of Compute Unified Device Architecture, its parallel computing platform and programming model. CUDA provides a software layer that gives developers direct access to the GPU’s virtual instruction set and parallel computational elements.
For years, Nvidia’s CUDA programming environment has been the de facto standard for AI development, with a vast ecosystem of tools and libraries built around it. This has created a situation where developers are often locked into the GPU ecosystem, even if alternative hardware solutions could offer better performance.
Cerebras’ WSE is a fundamentally different architecture from traditional GPUs, requiring software to be adapted or rewritten to take full advantage of its capabilities. Developers and researchers need to learn new tools and potentially new programming paradigms to work with the WSE effectively.
Cerebras has tried to address this by supporting high-level frameworks like PyTorch, making it easier for developers to use its WSE without learning a new low-level programming model. It has also developed its own software development kit to allow for lower-level programming, potentially offering an alternative to CUDA for certain applications.
But by offering an inference service that is not only faster but also easier to use—developers can interact with it via a simple API, much like they would with any other cloud-based service—Cerebras is making it possible for organizations just entering the fray to bypass the complexities of CUDA and still achieve top-tier performance.
This is in line with an industry shift to open standards, where developers are free to choose the best tool for the job, rather than being constrained by the limitations of their existing infrastructure.
It’s All About Context
The implications of Cerebras’ breakthrough, if its claims are borne out and it can ramp up production, are profound. First and foremost, consumers will benefit from significantly faster responses. Whether it’s a chatbot answering customer inquiries, a search engine retrieving information, or an AI-powered assistant generating content, the reduction in latency will lead to a smoother, more instantaneous user experience.
But the benefits could extend far beyond just faster responses. One of the biggest challenges in AI today is the so-called “context window”—the amount of text or data that a model can consider at once when generating an inference. Inference tasks that require a large context, such as summarizing lengthy documents or analyzing complex datasets, are especially demanding.
Larger context windows require more model parameters to be actively accessed, increasing memory bandwidth demands. As the model processes each token in the context, it needs to quickly retrieve and manipulate relevant parameters stored in memory.
In high-inference applications with many simultaneous users, the system needs to handle multiple inference requests concurrently. This multiplies the memory bandwidth requirements, as each user’s request needs access to the model weights and intermediate computations.
Even the most advanced GPUs, like Nvidia’s H100, can move only around 3 terabytes of data per second between high-bandwidth memory and the compute cores. That’s far below the 140 terabytes per second Cerebras says is needed to run a large language model at high throughput without encountering significant bottlenecks.
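The constraint can be sketched with a rough, roofline-style estimate: when generation is memory-bound, every new token requires streaming roughly the full set of weights from memory, so throughput is capped at about bandwidth divided by model size. The figures below are illustrative assumptions (16-bit weights, a single accelerator at the cited 3 terabytes per second), not measurements.

```python
# Back-of-envelope ceiling on generation speed when inference is memory-bound:
# each new token requires streaming roughly all model weights from memory.
bytes_per_param = 2                # assuming 16-bit weights
hbm_bandwidth = 3e12               # ~3 TB/s, the figure cited for an H100

for name, params in [("8B", 8e9), ("70B", 70e9)]:
    model_bytes = params * bytes_per_param
    ceiling = hbm_bandwidth / model_bytes
    print(f"Llama 3.1 {name}: ~{ceiling:.0f} tokens/s upper bound per device")

# Roughly ~188 tokens/s for 8B and ~21 tokens/s for 70B under these assumptions,
# which is why keeping the weights in on-chip SRAM changes the picture so sharply.
```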
Andy Hock, senior vice president of product and strategy at Cerebras Systems, holding the company’s wafer-scale computer chip. (Photo: Craig Smith)
“Our effective bandwidth between memory and compute isn’t just 140 terabytes, it’s 21 petabytes per second,” Hock claims.
Of course, it’s hard to judge a company statement without industry benchmarks, and independent testing will be key to confirming this performance.
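Taken at face value, though, the quoted figures make the gap easy to quantify. The short calculation below is simple arithmetic on the numbers reported in this article, not a measured comparison.

```python
# How the claimed on-wafer SRAM bandwidth compares with a single GPU's HBM
# bandwidth, and what it implies per core, using the figures in this article.
claimed_wafer_bandwidth = 21e15  # 21 PB/s, per Cerebras
gpu_hbm_bandwidth = 3e12         # ~3 TB/s, the H100 figure cited above
num_cores = 900_000              # cores on the wafer

print(f"Ratio vs. one GPU: ~{claimed_wafer_bandwidth / gpu_hbm_bandwidth:,.0f}x")            # ~7,000x
print(f"Implied per-core bandwidth: ~{claimed_wafer_bandwidth / num_cores / 1e9:.0f} GB/s")  # ~23 GB/s
```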
By eliminating the memory bottleneck, Cerebras’ system can handle much larger context windows and increase token throughput. If the performance claims hold true, this could be a game-changer for applications that require the analysis of extensive information, such as legal document review, medical research or large-scale data analytics. With the ability to process more data in less time, these applications can operate more effectively.
More Models To Come
Hock said that the company will soon offer the larger 405-billion-parameter Llama 3.1 model on its WSE, followed by Mistral’s models and Cohere’s Command R model. Companies with proprietary models (hello, OpenAI) can approach Cerebras to load their models onto the chips as well.
Moreover, the fact that Cerebras’ solution is delivered as an API-based service means that it can be easily integrated into existing workflows. Organizations that have already invested in AI development can simply switch to Cerebras’ service without having to overhaul their entire infrastructure. This ease of adoption, if paired with the promised performance gains, could make Cerebras a formidable competitor in the AI market.
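If the service exposes an OpenAI-compatible endpoint, as many hosted inference providers now do, that switch can be as small as pointing an existing client at a different base URL. The sketch below assumes such compatibility; the URL, key and model name are placeholders, since the article does not detail Cerebras’ actual interface.

```python
# Hypothetical drop-in switch: the same client code, pointed at a different
# OpenAI-compatible inference endpoint. URL, key and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

completion = client.chat.completions.create(
    model="llama-3.1-70b",  # illustrative model identifier
    messages=[{"role": "user", "content": "Flag any unusual clauses in this contract."}],
)
print(completion.choices[0].message.content)
```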
“But until we have more concrete real-world benchmarks and operations at scale,” cautioned analyst Gold, “it’s premature to estimate just how superior it will be.”