The Data Exchange

Beyond GPUs: Cerebras’ Wafer-Scale Engine for Lightning-Fast AI Inference

Hagay Lupesko on Wafer-Scale Architecture, High-Speed Inference, Enterprise AI, and Advanced Reasoning.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Hagay Lupesko is the SVP for AI Inference at Cerebras Systems. In this episode, we delve into Cerebras Systems’ groundbreaking wafer-scale architecture that redefines AI hardware by integrating an entire silicon wafer into a single, powerful chip. The discussion covers how this innovation drives exceptional inference performance, rapid model deployment, and enhanced reasoning capabilities, making it an ideal solution for enterprise AI applications. Additionally, the episode explores the business implications and future directions for scalable, low-latency AI inference.

Subscribe to the Gradient Flow Newsletter


Support our work by subscribing to our newsletter📩


Transcript

Below is a heavily edited excerpt, in Question & Answer format. In this Q&A, Hagay Lupesko discusses how the company’s wafer-scale engine technology is revolutionizing LLM inference with speeds up to 70x faster than GPU-based solutions.

Q: What is Cerebras Systems, and what makes your inference platform unique?

A: Cerebras is a startup founded about nine years ago by five individuals including our CEO Andrew Feldman. The founding team made two key observations: first, that deep learning and large neural networks would become the future of AI, and second, that the accelerators used at the time (GPUs) weren’t designed specifically for neural networks but for graphics processing.

Cerebras took a completely different approach by designing the Wafer Scale Engine (WSE), a processor that uses an entire silicon wafer rather than cutting it into smaller chips. Our latest WSE-3 packs 4 trillion transistors, 900,000 AI cores, 125 petaflops of AI compute, and 44 gigabytes of on-chip SRAM memory. This architecture delivers inference speeds that are 10-70x faster than GPU-based solutions.

Q: What engineering challenges did Cerebras have to overcome with the wafer-scale design?

A: The main challenge was yield. In semiconductor manufacturing, there are always defects, and when your chip is the entire wafer, you can’t just discard defective units like you can with smaller chips. The Cerebras team developed innovative solutions to address this (detailed in a blog post on our website).

After solving the yield problem, we faced additional challenges: packaging such a large chip, cooling it effectively, providing sufficient power, and ensuring stability during long computations. These hardware, physical, and mechanical challenges required significant innovation.

Q: How does Cerebras achieve such dramatic speed improvements for inference?

A: The key advantage lies in memory architecture. When performing autoregressive generation with LLMs, you need to load parameters for each layer into compute cores, perform matrix multiplication, and repeat across all transformer blocks.
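As a rough illustration of the loop described above (a toy sketch, not Cerebras code), the snippet below shows why autoregressive generation is so memory-intensive: every generated token passes through every transformer block, and each block’s weights must be resident near the compute units before its matrix multiplications can run. `TOY_LAYERS`, `toy_block`, and the sizes used are invented for illustration.

```python
# Toy sketch of autoregressive decoding (illustrative only, not Cerebras code).
# Each generated token passes through every transformer block, so every
# block's weight matrices must be available to the compute units once per token.
import numpy as np

TOY_LAYERS = 4        # a real LLM has dozens of blocks
HIDDEN = 64           # a real LLM uses thousands of hidden dimensions

# Stand-in "weights": one matrix per block.
weights = [np.random.randn(HIDDEN, HIDDEN) * 0.01 for _ in range(TOY_LAYERS)]

def toy_block(x, w):
    """One stand-in transformer block: multiply by the block's weights, apply a nonlinearity."""
    return np.tanh(x @ w)

def generate(prompt_vec, n_tokens):
    x = prompt_vec
    outputs = []
    for _ in range(n_tokens):          # autoregressive: one token at a time
        h = x
        for w in weights:              # every layer's weights are touched per token
            h = toy_block(h, w)
        outputs.append(h)
        x = h                          # feed the new "token" back in
    return outputs

tokens = generate(np.random.randn(HIDDEN), n_tokens=8)
print(f"generated {len(tokens)} toy tokens; "
      f"each one streamed weights through {TOY_LAYERS} blocks")
```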

On GPUs, most memory resides outside the silicon in High Bandwidth Memory (HBM), which despite its name becomes the bottleneck for LLMs. Most of the GPU’s compute capacity sits idle while waiting for parameters to be fetched from HBM to compute cores.

In contrast, the Cerebras WSE packs 44 gigabytes of static RAM directly on the silicon itself—almost 1,000 times more than an H100. This SRAM is co-located across the wafer close to the compute cores. During inference, we don’t need to access external memory to load parameters because they’re already there and positioned near the compute cores, resulting in dramatically faster inference.
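A back-of-the-envelope calculation makes the bottleneck concrete. The constants below (a 70B-parameter model at 16-bit precision, roughly 3.35 TB/s of HBM bandwidth and about 50 MB of on-chip SRAM for an H100-class GPU) are assumptions added for illustration, not figures from the interview.

```python
# Back-of-the-envelope: why HBM bandwidth caps single-stream decode speed.
# All constants here are assumptions for illustration, not Cerebras figures.
PARAMS = 70e9                  # e.g. a 70B-parameter model
BYTES_PER_PARAM = 2            # 16-bit weights
HBM_BANDWIDTH = 3.35e12        # ~3.35 TB/s, roughly an H100's HBM3 bandwidth

bytes_per_token = PARAMS * BYTES_PER_PARAM   # every weight read once per token
hbm_bound_tok_s = HBM_BANDWIDTH / bytes_per_token
print(f"HBM-bound ceiling: ~{hbm_bound_tok_s:.0f} tokens/s per stream")
# -> roughly 24 tokens/s: each token must stream ~140 GB of weights from HBM.

# On-chip memory comparison (GPU figure is an assumption):
WSE3_SRAM = 44e9               # 44 GB of on-wafer SRAM, from the interview
GPU_SRAM = 50e6                # ~50 MB of on-chip SRAM on a modern GPU (assumed)
print(f"On-chip memory ratio: ~{WSE3_SRAM / GPU_SRAM:.0f}x")
```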

Q: What happens when models are too large to fit on a single wafer?

A: For large models that can’t fit entirely on one wafer, we implement pipeline parallelism, splitting the layers across multiple wafers so they all fit in the static RAM. The generation flows from one wafer to the next.

This approach doesn’t reduce token generation speed because it’s pipeline parallelism—at any given time, all wafers are fully busy computing their assigned portion. When using multiple wafers, the time to first token might increase slightly, but the token generation speed remains just as fast.
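A small timing model helps show why this works. The sketch below is my own simplified model, not a description of Cerebras’ scheduler: splitting layers across wafers leaves the total per-token layer compute unchanged (inter-wafer transfers are assumed to be overlapped), while the extra wafer-to-wafer hops show up once in the time to first token. All constants are assumed values.

```python
# Toy timing model (assumptions, not Cerebras numbers) for layer-split
# pipeline parallelism: decode speed stays constant because each token still
# does the same total layer compute, while extra wafers add a small, one-time
# pipeline-fill cost to the time to first token.
N_LAYERS = 80                  # assumed transformer depth
TIME_PER_LAYER_MS = 0.005      # assumed per-layer compute time per token
PROMPT_MS = 100.0              # assumed prompt-processing time
HOP_LATENCY_MS = 0.05          # assumed wafer-to-wafer transfer cost

def decode_tokens_per_s(n_wafers: int) -> float:
    # Each token still traverses all N_LAYERS; splitting them across wafers
    # does not add per-token compute in this model (hops assumed overlapped).
    return 1000.0 / (N_LAYERS * TIME_PER_LAYER_MS)

def time_to_first_token_ms(n_wafers: int) -> float:
    # Extra wafers add a handful of pipeline-fill hops before the first token.
    return PROMPT_MS + (n_wafers - 1) * HOP_LATENCY_MS

for wafers in (1, 2, 4):
    print(f"{wafers} wafer(s): {decode_tokens_per_s(wafers):.0f} tokens/s, "
          f"TTFT {time_to_first_token_ms(wafers):.2f} ms")
```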

Q: Who is the target audience for Cerebras Inference?

A: Our target audience is companies and organizations that need extremely fast inference. We’re currently focused more on enterprises than individual developers, though about 100,000 developers have rate-limited access to the platform. A triple-digit number of customers have signed enterprise agreements with us, spanning industries from healthcare to finance to various agent-based applications.

Q: How does your performance compare to other providers?

A: According to independent benchmarks from Artificial Analysis, Cerebras delivers 2,314 tokens per second with the open-weights Llama 3.3 70B model. That’s approximately 70 times faster than Amazon Bedrock, which delivers 32 tokens per second with the same model. For latency, we achieve a time to first token of 170 milliseconds, significantly faster than other providers.
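To put those benchmark figures in user-facing terms, the quick calculation below uses only the numbers quoted above plus an assumed 1,000-token response length.

```python
# What the quoted benchmark numbers mean for a user waiting on a long answer.
cerebras_tok_s = 2314    # Artificial Analysis figure quoted above
bedrock_tok_s = 32       # Amazon Bedrock figure quoted above
answer_tokens = 1000     # assumed length of a fairly long response

print(f"speedup: ~{cerebras_tok_s / bedrock_tok_s:.0f}x")
print(f"1,000-token answer: {answer_tokens / cerebras_tok_s:.1f} s on Cerebras "
      f"vs {answer_tokens / bedrock_tok_s:.0f} s at 32 tokens/s")
```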

Q: What models currently run on Cerebras Inference?

A: We have several models available. We recently announced partnerships with Mistral, where we power their Le Chat assistant product, and with Perplexity for their Sonar model (a fine-tuned version of Llama optimized for search tasks). We also support DeepSeek models, the Llama family of models, and we’re always open to onboarding new models based on demand and business opportunity.

Q: How quickly can new models be deployed to Cerebras?

A: If it’s a standard GPT-style transformer model (with or without MoEs and other standard components), we can typically get it running within days. We have a compiler (currently not publicly available) that we use to onboard models. For proprietary customer models, whether standard architectures with custom weights or entirely new architectures, customers need to work with us on onboarding.

For fine-tuned versions of standard open-weights models, we can typically onboard them within 30 minutes.

Q: How does Cerebras address the rise of reasoning-enhanced models, which require more intensive inference?

A: Reasoning-enhanced models like OpenAI’s o1 or DeepSeek’s R1 generate extensive “thinking tokens” before producing their final answer, which significantly increases response time. This is where our fast inference provides substantial benefits—operations that might take a minute on GPU-based systems can be compressed to just a few seconds on our platform.

It’s important to understand that these reasoning models are still performing standard inference. They’ve just been trained to generate thinking tokens before producing answer tokens. The speed advantage of Cerebras becomes critical for practical applications where users or agent systems are waiting for responses.
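A quick calculation shows why generation speed dominates reasoning-model latency. The token counts and throughput figures below are assumptions for illustration, not numbers from the interview.

```python
# Why generation speed dominates reasoning-model latency.
# Token counts and throughputs are assumptions for illustration.
thinking_tokens = 3000        # hidden chain-of-thought before the answer
answer_tokens = 300
total = thinking_tokens + answer_tokens

for name, tok_s in [("GPU-class service (~50 tok/s)", 50),
                    ("Cerebras-class service (~2,000 tok/s)", 2000)]:
    print(f"{name}: {total / tok_s:.1f} s for {total} tokens")
# -> roughly a minute in the first case, a couple of seconds in the second.
```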

Q: What’s next for Cerebras Inference?

A: We’re working on several fronts:

  1. Building out infrastructure to support credit card payments and serve the broader community of independent developers
  2. Rapidly scaling our capacity with new data centers (all currently in North America)
  3. Developing more advanced machine learning capabilities to further improve performance and reduce latency
  4. Focusing on reliability and operational excellence, which are critical requirements for our enterprise customers

Currently, we offer inference APIs as an end-to-end service, but we’re exploring options to provide more flexibility for customers who want to customize their inference stack, similar to how vLLM allows users to manage their own server deployments.
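For readers who want to experiment with hosted inference APIs, here is a minimal sketch. It assumes an OpenAI-compatible chat-completions interface; the base URL, model name, and environment variable are placeholders, so consult the Cerebras documentation for the actual endpoint, model identifiers, and authentication details.

```python
# Minimal sketch of calling a hosted inference API, assuming an
# OpenAI-compatible chat-completions interface. The base URL, model name,
# and environment variable below are placeholders, not confirmed values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",      # assumed/placeholder endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],     # assumed env var name
)

response = client.chat.completions.create(
    model="llama-3.3-70b",                      # example model name
    messages=[{"role": "user",
               "content": "Summarize wafer-scale inference in two sentences."}],
)
print(response.choices[0].message.content)
```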

The demand we’re seeing is extraordinary, and we’re working to scale our infrastructure to meet it while maintaining our performance advantage and reliability.
