Ambitious artificial intelligence computing startup Cerebras Systems Inc. is upping the ante in its battle with Nvidia Corp., launching what it calls the world's fastest AI inference service, now available in the cloud.
AI inference refers to the process of running live data through a trained AI model to make a prediction or solve a task. Inference services are a mainstay of the AI industry and, according to Cerebras, are also the fastest growing segment, currently accounting for approximately 40% of all AI workloads in the cloud.
But existing AI inference services don't seem to meet all customers' needs. “There's a lot of interest in how to do inference faster and at a lower cost,” CEO Andrew Feldman said at a press conference in San Francisco on Monday.
The company intends to achieve this with its new “high-speed inference” service, which it believes marks a groundbreaking moment for the AI industry. The 1,000 tokens per second it can offer is, the company says, comparable to the introduction of broadband internet, opening up new opportunities for AI applications.
Raw Power
Cerebras is well-equipped to provide such a service: The company makes specialized, powerful computer chips for AI and high-performance computing (HPC) workloads. Over the past year, it has attracted a lot of attention by claiming that its chips are not only more powerful than Nvidia's graphics processing units, but also more cost-effective. “This is performance that's just not possible with a GPU,” asserted co-founder and chief technology officer Sean Lie.
The company's flagship product is its new WSE-3 processor (pictured), announced in March, which builds on the earlier WSE-2 chip that debuted in 2021. Built on an advanced 5-nanometer process, it packs 4 trillion transistors, 1.4 trillion more than the previous-generation chip, along with more than 900,000 computing cores and 44 gigabytes of on-chip static random-access memory. The startup said the WSE-3 has 52 times as many cores as a single Nvidia H100 graphics processing unit.
The chip is available as part of a data center appliance called the CS-3, which is roughly the size of a small refrigerator and houses the integrated cooling and power modules; the chip itself is about the size of a pizza. In terms of performance, the Cerebras WSE-3 is twice as powerful as the WSE-2, reaching a peak speed of 125 petaflops. One petaflop is the equivalent of 1,000 trillion calculations per second.
The Cerebras CS-3 system is the engine that powers the new Cerebras Inference service, and notably offers 7,000 times more memory bandwidth than the Nvidia H100 GPU, addressing one of the fundamental technical challenges of generative AI: the need for greater memory bandwidth.
Incredible Speed at a Low Cost
It meets this challenge elegantly: the Cerebras Inference service is said to be extremely fast, up to 20 times faster than comparable cloud-based inference services running on Nvidia's most powerful GPUs. Cerebras says it delivers 1,800 tokens per second on the open-source Llama 3.1 8B model and 450 tokens per second on Llama 3.1 70B.
Pricing is also competitive: the startup says the service starts at just 10 cents per million tokens, delivering a 100x price/performance improvement for AI inference workloads.
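To make those headline figures concrete, here is a minimal back-of-the-envelope sketch in Python; the reply length and monthly token volume are hypothetical assumptions for illustration, while the throughput and price figures are the ones Cerebras quotes for Llama 3.1 70B.

```python
# Back-of-the-envelope math for the figures above. Only the throughput and
# per-million-token price come from Cerebras' announcement; the reply length
# and monthly volume are illustrative assumptions.

TOKENS_PER_SECOND_70B = 450        # claimed throughput for Llama 3.1 70B
PRICE_PER_MILLION_70B = 0.60       # dollars per million tokens (Developer Tier)

response_tokens = 500              # hypothetical chat reply length
seconds_per_response = response_tokens / TOKENS_PER_SECOND_70B
print(f"~{seconds_per_response:.2f} s to generate a {response_tokens}-token reply")

monthly_tokens = 200_000_000       # hypothetical monthly volume
monthly_cost = monthly_tokens / 1_000_000 * PRICE_PER_MILLION_70B
print(f"~${monthly_cost:.2f} for {monthly_tokens:,} tokens per month")
```

At those rates, a 500-token reply arrives in just over a second, and a workload of 200 million tokens a month would cost about $120.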
The company says its Cerebras Inference service is particularly well suited to “agentic AI” workloads (AI agents that perform tasks on behalf of users), because such applications require repeatedly prompting the underlying models.
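To illustrate why repeated prompting makes per-call speed matter, here is a minimal sketch of that loop pattern in Python; the call_model placeholder, step limit and stopping convention are illustrative assumptions rather than any actual agent framework or Cerebras API.

```python
# Minimal sketch of an agentic loop: each user request can trigger many
# sequential model calls, so per-call inference time compounds quickly.
# `call_model` is a placeholder for any chat-completions-style API call.

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your inference provider")

def run_agent(task: str, max_steps: int = 8) -> str:
    context = task
    for _ in range(max_steps):            # plan -> act -> observe loop
        answer = call_model(context)      # one full inference call per step
        if answer.startswith("FINAL:"):   # toy stopping convention
            return answer.removeprefix("FINAL:").strip()
        context += "\n" + answer          # feed the result into the next prompt
    return context
```

With several sequential calls behind every user request, shaving latency on each call has an outsized effect on how responsive the whole agent feels.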
Micah Hill-Smith, co-founder and CEO of independent AI model analysis company Artificial Analysis Inc., said his team has verified that Llama 3.1 8B and 70B running on Cerebras Inference achieve quality evaluation results in line with Meta's official versions at native 16-bit precision.
“With speed that pushes the performance envelope and competitive pricing, Cerebras Inference is particularly attractive to developers of AI applications with real-time or high-volume requirements,” he said.
Tiered Access
Customers can access the Cerebras Inference service through three tiers, including a free tier that offers application programming interface-based access and generous usage limits for those wanting to try out the platform.
The Developer Tier is for flexible, serverless deployments. The company says this tier is accessed via an API endpoint and priced at a fraction of the cost of comparable offerings on the market: Llama 3.1 8B costs just 10 cents per million tokens, while Llama 3.1 70B costs 60 cents. Support for additional models is planned, the company says.
There's also an Enterprise Tier that offers fine-tuned models with dedicated support and custom service-level agreements. It's intended for sustained workloads and can be accessed via a private cloud managed by Cerebras or deployed on-premises. Cerebras didn't disclose the cost of this tier but said pricing is available upon request.
Cerebras boasts an impressive list of early-access customers, including organizations such as GlaxoSmithKline, AI search engine startup Perplexity AI and network analytics software provider Meter.
Another early adopter, Dr. Andrew Ng, founder of DeepLearning.AI Inc., explained that his company developed a multi-agent AI workflow that requires repeatedly prompting a large language model to get results. “Cerebras has built an incredibly fast inference capability that is extremely useful for workloads like this,” he said.
Cerebras' ambitions don't stop there. Feldman said the company is “in talks with several hyperscalers” about offering its capabilities on their cloud services. “We want to get them as customers,” he said, adding that the company also wants to attract specialized AI providers such as CoreWeave and Lambda.
Beyond inference services, Cerebras has announced several strategic partnerships to give customers access to the specialized tools they need to accelerate their AI development, including with LangChain, LlamaIndex, Docker Inc., Weights & Biases Inc. and AgentOps Inc.
Cerebras says its Inference API is fully compatible with OpenAI's Chat Completions API, making it possible to migrate existing applications to its platform with just a few lines of code.
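As a rough illustration of what such a migration might look like, the sketch below points the standard openai Python client at a different endpoint; note that the base URL, environment variable and model identifier here are assumptions for illustration, not confirmed details of the Cerebras API.

```python
# Sketch of reusing an OpenAI-style client against an OpenAI-compatible
# endpoint. The base_url and model name below are illustrative assumptions;
# consult Cerebras' documentation for the actual values.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed environment variable
)

response = client.chat.completions.create(
    model="llama3.1-8b",                      # assumed model identifier
    messages=[{"role": "user", "content": "Summarize wafer-scale computing in one sentence."}],
)
print(response.choices[0].message.content)
```

If the API is indeed drop-in compatible, only the client construction needs to change, which is what “a few lines of code” would amount to in practice.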
Reporting by Robert Hoff
Photo: Cerebras Systems