Sid Sheth on Memory Bottlenecks, SRAM vs HBM, Digital In-Memory Compute, and the Future of Inference Hardware.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
In this episode, Sid Sheth, founder and CEO of d-matrix, discusses the company’s approach to AI inference hardware with a focus on solving the memory bottleneck problem. Sheth explains why d-matrix chose to build SRAM-based inference accelerators using digital in-memory compute (DIMC) technology, targeting low-latency applications in data centers rather than high-throughput workloads. The conversation covers the evolution from training-focused to inference-focused hardware, the limitations of HBM for inference, and the critical pre-fill/decode phases in LLM inference.
Interview highlights – key sections from the video version:
- d-matrix origin story: betting on cloud inference and memory-centric design (pre-ChatGPT)
- SRAM vs HBM primer: why memory becomes the inference bottleneck
- Scaling beyond on-chip SRAM: external memory tiers, LPDDR, and KV-cache growth
- Prefill vs decode: compute-bound vs memory-bound phases (with a human analogy)
- What HBM is good at: HPC/training heritage and the wide-lane “highway” model
- Why HBM is a poor fit for mainstream inference: cost, energy, and bandwidth limits
- Digital In-Memory Compute (DIMC) defined: turning SRAM cells into a compute+store fabric
- How DIMC delivers efficiency: minimizing data movement for faster, lower-energy inference
- System packaging and capacity: DIMC blocks, PCIe cards, racks, and 100B-parameter targets
- Latency vs throughput: where d-matrix competes vs GPUs and why tradeoffs are real
- Software and portability: kernel libraries, customer enablement, and the CUDA advantage
- Who uses d-matrix today: hyperscalers/neo-clouds, deployment models, and why no d-matrix cloud
- What to watch next: agentic & multimodal inference, new memory tech, scale-up interconnects, and openness
Related content:
- A video version of this conversation is available on our YouTube channel.
- Trends shaping the future of AI infrastructure
- The PARK Stack Is Becoming the Standard for Production AI
- LLM Inference Hardware: Emerging from Nvidia’s Shadow
- Hagay Lupesko → Beyond GPUs: Cerebras’ Wafer-Scale Engine for Lightning-Fast AI Inference
- Jay Dawani → Bridging the Hardware-Software Divide in AI
Support our work by subscribing to our newsletter📩
Transcript
Below is a polished and edited transcript.
Ben Lorica: Today we have Sid Sheth, founder and CEO of d-matrix, which you can find at d-matrix.ai. Their tagline is “redefining performance and efficiency for AI inference at scale.” Welcome to the podcast, Sid.
Sid Sheth: Thank you, Ben. Thanks for having me.
Ben Lorica: Our topic today is hardware for AI, particularly for inference. Sid, I have a litmus test for people I talk to in the hardware space: I check their LinkedIn to see if they have actually built chips before. There was a wave of chip startups led by people who weren’t industry veterans, but hardware is not like software—you usually have to build, and perhaps fail at, a few chips before you know what you’re doing.
Sid Sheth: You’re absolutely right.
Ben Lorica: That is my way of introducing your background to the audience. Regarding inference hardware: around 2017 or 2018, there was a wave of startups focused on training. Cerebras is one of the few significant ones left from that era. Now, more people are targeting inference because of the rise of foundation models—every time you use one, that’s inference.
I noticed d-matrix focuses heavily on memory. For our audience familiar with SRAM or High Bandwidth Memory (HBM), can you explain what those are good at and where they fall short for inference?
Sid Sheth: I agree with your litmus test. To earn the right to build an AI chip company, you need to have lived through at least three chip cycles. I’ve done more than 15 in my career, taking chips beyond tape-out to production and deployment.
The team at d-matrix is very experienced. When we started in 2019, we weren’t in that first wave of AI chip companies. We were in a “second wave” before ChatGPT. At that time, many companies were focused on edge inference for computer vision or on training. We decided it didn’t make sense to build another edge vision company or to challenge Nvidia in training. We focused on cloud inference because we didn’t see a dedicated solution for it in the data center.
Ben Lorica: This was pre-ChatGPT, so that focus was somewhat counterintuitive at the time.
Sid Sheth: Very counterintuitive. I remember investor meetings where I had to explain what inference even was and why it mattered in the data center. After ChatGPT, that conversation became much easier.
Back then, we looked at the workload from first principles. We realized inference is about massive, repetitive parallel compute trying to access memory. As workloads grow, you need to access that memory constantly. We decided to build a solution around memory and compute integration—putting them as close together as possible.
Ben Lorica: Why did you choose SRAM over HBM?
Sid Sheth: Once we decided to integrate memory and compute, we had to choose the technology. HBM doesn’t lend itself to tight compute integration: it is power-hungry and costly, which makes placing compute directly alongside it impractical within realistic energy and cost budgets.
We went down the path of SRAM-based inference. We packed as much SRAM as possible close to the compute. Even back in 2020, before the current LLM explosion, we saw models like GPT-3 reaching 175 billion parameters. We knew the trajectory was heading toward even larger models.
Ben Lorica: But even with GPT-3, there weren’t many commercial users initially.
Sid Sheth: True, but the capability was becoming clear. Transformer models like BERT were already being deployed by Google for search. Hyperscalers were giving us feedback that these models were versatile, would become multimodal, and would continue to grow in size. We made a bet on transformer acceleration.
Our first product is an SRAM inference accelerator with ten times more memory capacity than other solutions. This allows us to accommodate larger models and provides the low latency the market is now demanding.
Ben Lorica: Early players like Groq and Cerebras also relied on SRAM, but they eventually had to add external memory because LLMs grew too large for a single chip.
Sid Sheth: Exactly. You can’t outrun model growth just by packing more memory into one chip. Everyone needs external memory eventually. We embraced this early by including a second tier of memory using LPDDR. We can support 256 GB on a single card or 10 TB in a rack. This handles extremely large models and the exploding KV-cache sizes required for large context windows.
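To see why KV caches strain on-chip memory as context windows grow, here is a back-of-envelope sketch. The model configuration (layer count, KV heads, head dimension) is a hypothetical 70B-class setup chosen for illustration, not a d-Matrix or customer figure:

```python
# Rough KV-cache sizing for a hypothetical 70B-class model with
# grouped-query attention (all configuration values are assumptions).
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_val = 2  # fp16/bf16 cache entries

def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    # Two tensors (K and V) are cached per layer, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_val * seq_len * batch

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

The cache grows linearly with context length (and with batch size), so a 128K-token context needs tens of gigabytes for this one request alone, which is why a second, larger memory tier like LPDDR becomes necessary.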
Ben Lorica: You mentioned the two-stage process of “pre-fill” and “decode.” Can you explain that for our audience?
Sid Sheth: When generative transformers arrived, they introduced a specific behavior. Think of it like this: if you ask me a tough question, I have to spend a few seconds thinking before I speak. That thinking process is “pre-fill.” I’m processing your question and filling my memory with context. “Decode” is when I start speaking the answer.
Pre-fill is very compute-intensive because the model is processing the prompt. Decode is very memory-intensive because, to “speak” the next word, the model must access all that stored context (the KV cache). If I pause for two seconds between every word I say, you won’t enjoy the conversation. In AI, that’s latency. Memory-centric architectures like ours are designed to make that decode phase extremely fast.
Ben Lorica: For those not in hardware, what is HBM and why isn’t it optimal for inference?
Sid Sheth: HBM is a DRAM-based technology invented in the mid-2010s. It was designed to solve a bandwidth problem. Older DRAM had a very “thin pipe” between the processor and memory. HBM created a “multi-lane highway”—moving from a one-lane road to a 16-lane highway.
It was perfect for high-performance computing and training, which is why it became the standard for Nvidia GPUs. However, inference is about efficiency in three areas: money, time, and energy. HBM is very costly and energy-hungry. Furthermore, it’s actually not fast enough anymore for the latest AI applications. It can’t keep up with the speed the industry now requires. While it has a bright future in training, it won’t be able to serve the mass-scale needs of inference.
Ben Lorica: This leads to your technology: Digital In-Memory Compute (DIMC). Is that a marketing term or a technical one?
Sid Sheth: It’s both. We call it “in-memory compute” because we created a fabric where compute and memory are the same thing.
A traditional SRAM memory cell has six transistors and just stores one bit of data. We augmented that cell with more transistors to give it the ability to perform multiplication. It’s now a 10-transistor cell that can store a bit and compute at the same time. To the outside world, it looks like traditional SRAM, but it’s much more powerful. We can access all rows of the SRAM simultaneously, which gives us much higher throughput.
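A toy software model can illustrate the effect of activating all SRAM rows at once. This is a functional sketch of the row-parallel multiply-accumulate idea, not a description of d-Matrix's circuit; array sizes and values are arbitrary:

```python
import numpy as np

# Toy functional model of a digital in-memory compute (DIMC) array:
# weights live in the cell array, an input vector is broadcast to all
# rows, and every row contributes a partial product simultaneously.
# Only the accumulated column sums leave the array.
rng = np.random.default_rng(0)
rows, cols = 64, 64                          # illustrative array size
weights = rng.integers(-8, 8, (rows, cols))  # values stored in the cells
x = rng.integers(-8, 8, rows)                # broadcast input vector

# Conventional flow: read rows out one at a time and multiply in a
# separate ALU. DIMC flow: all rows multiply in place in one step.
col_sums = x @ weights

# Same result as the sequential row-by-row read-and-multiply loop.
assert np.array_equal(col_sums, sum(x[i] * weights[i] for i in range(rows)))
```

The payoff is that the per-row weight reads never cross a memory bus, which is where a conventional design spends most of its time and energy.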
Ben Lorica: How does that benefit the user during pre-fill and decode?
Sid Sheth: Most AI math is matrix multiplication. In our chip, the model parameters stay inside the memory and the math happens right there. Because you don’t have to move data back and forth between a processor and memory, you save massive amounts of time and energy. During the decode phase, when you need to spit out tokens quickly, this “compute-in-place” approach is a huge advantage.
Ben Lorica: How does this look in a data center rack?
Sid Sheth: The DIMC is like a Lego block. A single PCIe card has 2,048 of these blocks. A full rack contains 64 of those cards. In terms of model size, we can run a 100-billion-parameter model entirely out of the SRAM tier in a single rack. That is 5 to 10 times better than other solutions on the market.
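The card and rack counts above imply a per-block SRAM budget, which a quick calculation can sanity-check. The 8-bit weight precision is an assumption made for illustration, not a stated d-Matrix figure:

```python
# Sanity-check the capacity claim: a 100B-parameter model served
# entirely from the SRAM tier of one rack (precision is assumed).
params = 100e9
bytes_per_param = 1       # assume 8-bit quantized weights
cards_per_rack = 64       # from the interview
blocks_per_card = 2048    # from the interview

total_bytes = params * bytes_per_param
per_card = total_bytes / cards_per_rack
per_block = per_card / blocks_per_card

print(f"per card:  {per_card / 1e9:.2f} GB of SRAM")
print(f"per block: {per_block / 1e6:.2f} MB of SRAM")
```

Under these assumptions each card holds about 1.6 GB of model weights, or well under a megabyte per DIMC block, which is a plausible amount of SRAM per compute tile.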
Ben Lorica: Does this support reasoning models or multimodal models?
Sid Sheth: Absolutely. Reasoning models require a lot of “internal thinking” before they generate a response. That means they generate many internal tokens. Reducing latency is critical for that user experience, and our architecture is perfectly suited for it.
Ben Lorica: Is there a tradeoff between low latency and high throughput?
Sid Sheth: There is always a tradeoff. If you want to serve thousands of users simultaneously with extremely high throughput, a GPU is a great solution. We focus on the “medium throughput, low latency” space. If someone tells you there is no tradeoff in their chip, they probably haven’t designed many chips.
Ben Lorica: Let’s talk about software. Nvidia has CUDA. If I use d-matrix hardware, how do I know my models will run?
Sid Sheth: Nvidia has spent over a decade building their programming model and kernel libraries. For researchers who just want to experiment and get a model running, a GPU is the best on-ramp.
However, once you move to production and need to optimize for cost, energy, and speed, that’s where we come in. We focus on portability. Transformer architectures share many common components. We provide a kernel library for the most popular models, and we also give sophisticated customers the ability to write their own kernels for our hardware.
Ben Lorica: Who are your typical users?
Sid Sheth: We focus on applications that require extremely low latency. Our customers include hyperscalers, “neo-clouds,” and sovereign entities. Some ask us to port specific models (like Llama, Qwen, or DeepSeek) for them. Others, like hyperscalers with proprietary models, use our tools to port the models themselves.
Ben Lorica: Why not set up your own cloud, like Cerebras did?
Sid Sheth: We don’t feel it’s necessary. Our hardware is designed to be “plug and play” within standard industry servers, like those from Supermicro. We prefer a collaborative approach. Companies have been building racks and servers for decades; we don’t need to reinvent that. Our advantage is in the silicon and the software.
Ben Lorica: What should we watch for in the future of inference?
Sid Sheth: We are entering the “Age of Inference.” We’re moving toward agentic AI—layers that sit on top of CRM or ERP tools to make decisions and improve productivity.
We are also moving toward interactive video generation. Currently, video generation is an offline, “wait-and-see” experience. When it becomes truly interactive, it will require massive memory bandwidth. That is an application where our architecture could really shine.
Ben Lorica: What about high-bandwidth flash memory? Could that solve the memory capacity problem?
Sid Sheth: It’s a promising technology for “client” applications—like laptops—where you want to run a model locally and need cost-effective capacity. However, flash isn’t very programmable, making it difficult for dynamic data center workloads.
The “memory wall”—the gap between compute speed and memory speed—has been a problem for 30 years. AI has just made it much more pronounced. d-matrix was founded specifically to break through that wall.
Ben Lorica: When will we see official customer announcements?
Sid Sheth: Stay tuned. We announced the product a year ago, and this is the year you will start hearing about deployments and customer partnerships.
Ben Lorica: Can we squeeze more performance out of interconnects?
Sid Sheth: Scaling compute is a huge topic. Inside a server, we connect cards via PCIe. Between servers in a rack, we use our own Ethernet-based solution called JetStream. The ecosystem is also moving toward standards like Broadcom’s eSUN and the Ultra Accelerator Link (UAL). My sense is that Nvidia’s NVLink and Ethernet-based eSUN will dominate the landscape.
Ben Lorica: Is d-matrix embracing open standards?
Sid Sheth: Yes. On the software side, we use PyTorch, MLIR, and OpenBMC. On the hardware side, we’ve embraced UCIe, PCIe, and Ethernet. We are focused on our accelerators, so it makes sense for us to be open and collaborative with the rest of the ecosystem.
Ben Lorica: Thank you, Sid. Again, the website is d-matrix.ai.
Sid Sheth: Thank you, Ben. Great chatting with you.

