Jay Dawani on A New Software Stack for AI Development.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Jay Dawani is CEO and founder of Lemurian Labs, a pioneering startup building a software stack for developing advanced AI systems, focusing on pushing the boundaries of computational capabilities and model performance. Their work involves exploring cutting-edge hardware solutions and innovative software frameworks to tackle the challenges of large-scale AI training and deployment. The discussion covers a range of topics including GPU clusters, heterogeneous computing, memory bandwidth bottlenecks, and specialized AI supercomputers. Additionally, the conversation delves into software stacks, domain-specific languages, and the Python ecosystem for AI development. The podcast also touches on foundation model training, scaling laws, and the intricacies of AI model deployment and inference optimization.
Interview highlights – key sections from the video version:
- Lemurian Labs – their origin story
- Scaling Laws and the Evolution of AI Models
- Challenges and Opportunities in Using Multiple AI Models
- Industry Trends and the Role of Software in AI Hardware
- Optimizing Foundation Models: A Deep Dive into PyTorch and Beyond
- Building Developer-Friendly AI Software Stacks
- The Future of AI Workloads: Heterogeneity and Kernel Optimization
- The Competitive Edge in AI: Data, Compute, and Model Innovation
- Economic Challenges in Training and Deploying AI Models
- Rapid Fire Round: AI Hardware Trends and Implications
- Networking Challenges in Large-Scale AI Training
- Specialized Supercomputers and Their Impact on AI Development
- Innovation in Computer Vision and Speech at the Edge
- Lemurian Labs’ Software Strategy and Open Source Approach
- Future Directions in AI: Preparing for the Next Generation of Models
Related content:
- A video version of this conversation is available on our YouTube channel.
- AMD’s Expanding Role in Shaping the Future of LLMs
- Specialized Hardware for AI: Rethinking Assumptions and Implications for the Future
- Beyond Nvidia: Exploring New Horizons in LLM Inference
- LLM Routers Unpacked
- Dylan Patel → The Open Source Stack Unleashing a Game-Changing AI Hardware Shift
- Nir Shavit → LLMs on CPUs, Period
- Tim Davis → Redefining AI Infrastructure
- Andrew Feldman → The Rise of Custom Foundation Models
If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter:
Transcript.
Below is a heavily edited excerpt, in Question & Answer format.
What megatrend prompted you to start Lemurian Labs?
After a decade working in AI, including on early foundation models in 2018, I identified a critical disconnect between semiconductor companies, software development, and AI companies. The challenge was clear: AI models needed to grow 100,000x bigger, which would require approximately 400,000 GPUs. This raised fundamental questions about compute infrastructure, architecture, and cost efficiency.
I realized the core issues were threefold: the architecture itself, fragmented and brittle software stacks, and hardware built under assumptions different from what modern AI software requires. These tectonic shifts were driving unsustainable costs and energy consumption. The obvious solution seemed to be reimagining computer architectures to better suit AI workloads, though we eventually discovered that software was an even bigger bottleneck.
How do you view the scaling laws debate in AI development?
Scaling laws are real – models do need to get bigger. More capacity means more degrees of freedom and a larger state space in which to search for useful neural circuit representations. There are three critical factors: inductive bias, scaling, and priors. When compute or data is limited, you focus on inductive bias and priors, essentially encoding your knowledge into the structure of the neural network. But for generality and broader intelligence, you want less structure and bias, allowing the data to shape the model.
As models scale, we see emergence – capabilities that weren’t explicitly programmed. However, scaling doesn’t uniformly improve everything. Some tasks get worse with scale, some see no impact, and others plateau regardless of how much compute or data you throw at them.
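To make the "bigger helps, but with diminishing returns" point concrete, here is a minimal sketch of a compute-optimal scaling-law fit. The functional form and constants follow the published Chinchilla fit and are illustrative assumptions, not figures from this conversation:

```python
def predicted_loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style fit: loss ~ E + A / N^alpha + B / D^beta (illustrative constants)."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Growing parameters 10x at fixed data keeps improving predicted loss,
# but each step buys less -- the plateau behavior described above.
for n in (1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n, 1e12):.2f}")
```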
Are you focused primarily on training or inference optimization?
We’re tackling both. While training remains the dominant cost today and will be for a while, inference optimization is equally important. For inference, we know how to make large models significantly cheaper through knowledge distillation, blocking, tiling, fusing, quantizing, and exploiting sparsity. Essentially, modern AI is high-performance computing inside cloud environments, requiring us to bring decades of HPC knowledge into the ML domain.
Our software stack aims to make optimizations easier or automatic for developers. We can’t solve everything through our system, but we can significantly reduce the complexity for engineers while continuing to benefit from improving hardware.
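As one concrete illustration of the inference-side levers mentioned above, the sketch below applies PyTorch's built-in dynamic quantization to a toy model, shrinking Linear weights to int8. It is a generic PyTorch example, not Lemurian Labs' stack:

```python
import torch
import torch.nn as nn

# A toy two-layer MLP standing in for a slice of a larger model.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).eval()

# Dynamic quantization stores Linear weights as int8, cutting weight memory
# roughly 4x versus FP32 (2x versus FP16) and often speeding up CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.inference_mode():
    out = quantized(torch.randn(1, 4096))
print(out.shape)  # torch.Size([1, 4096])
```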
How did your company’s focus evolve from hardware to software?
We started as a semiconductor company, developing a number format called PAL that enabled efficient ALUs with good representation and numeric properties. We designed a distributed dataflow architecture around it – one of the first of its kind and extremely powerful.
However, we realized ML developers have established workflows and are accustomed to working in PyTorch, TensorFlow, JAX, etc. A good product needs to fit their world. With over 2,000 operators in PyTorch alone, you need comprehensive coverage running on your hardware. This is extraordinarily difficult to build while simultaneously developing a chip.
Most clouds want to see 2-3 generations of production silicon with working software before adoption, so we decided to flip our approach and focus on software first. This positions us better for the trend toward heterogeneity in computing.
What do you mean by heterogeneity in computing?
Heterogeneity refers to the increasing diversity in compute architecture. In modern supercomputing, you have CPUs, GPUs, interconnects, PCI buses, different memory architectures from HBM to GDDR, and we’re moving from multi-chip modules to chiplets and eventually 3D stacking.
Each of these changes affects how you write software and move data. Programming models, execution models, and developer workflows all need to adapt. We need to enable large-scale co-design from workloads to software to hardware and systems, especially as these components evolve at faster cycles.
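For a sense of what "adapting the programming model" looks like today at the PyTorch level, here is a minimal device-agnostic sketch: the same script runs on an NVIDIA GPU, an AMD GPU (ROCm builds of PyTorch also report through torch.cuda), or a CPU, and each choice implies different memory systems and kernel dispatch underneath. This is a generic illustration, not Lemurian Labs' API:

```python
import torch

# Pick whatever accelerator is present; ROCm builds of PyTorch also expose
# AMD GPUs through the torch.cuda namespace.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(2048, 2048, device=device)  # HBM/GDDR on a GPU, DDR on a CPU
y = x @ x.T                                 # dispatched to cuBLAS/rocBLAS or a CPU BLAS
print(device, y.shape)
```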
What are you building at Lemurian Labs and how is it different?
We’re building a unified software stack called Title. Developers shouldn’t have to manage the dozens of open-source libraries that sit between PyTorch and lower-level systems like CUDA, ROCm, or MKL and that constantly break because of dependency conflicts.
Today, training clusters and inference clusters are different because they’re optimized for different characteristics – memory bandwidth versus throughput versus latency. Fewer than 2,000 people globally really understand performance engineering for these systems, and roughly 80% of them are concentrated at one provider.
Our stack allows seamless operation across different hardware platforms, amortizing software costs across providers and making it simpler for developers. We’re targeting a 3-8x improvement on training runs for clusters with over 1,000 nodes – pure software optimization without changing interconnects or hardware.
What hardware platforms do you currently support?
We’re currently up and running on CPUs (Intel and AMD) and beginning work on GPUs (NVIDIA and AMD). Early results show we’re at parity with or better than most CPUs for different workloads running straight from PyTorch. We expect to have solid GPU support by the end of the year, with approximately 80% feature parity compared to systems like Triton.
How are you avoiding the challenge of platform-specific kernels?
We’ve taken a fundamentally different approach from ROCm, which tried to build its own version of CUDA. Platform-specific kernels force you to think about different workloads and hardware-specific optimizations, and they require thousands of engineers to build and maintain.
Instead, we’re focused on non-platform-specific approaches, thinking about how to launch kernels at runtime for different hardware types, with different elaborations based on the specific needs. This allows us to achieve much more with fewer resources.
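For context on what non-platform-specific kernels specialized at launch time can look like, here is a minimal Triton kernel (Triton being the system cited earlier as a feature-parity benchmark). The same Python-level kernel is JIT-compiled for the target GPU when launched; this illustrates the general style, not Lemurian Labs' implementation:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Elementwise add for same-shape GPU tensors."""
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```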
What’s your approach to licensing and open source?
Parts of our stack will be open source, particularly the top layers where people need to contribute and build an ecosystem. We’re keeping some proprietary components closed, especially those offering unique value. We’re currently engaged with many semiconductor companies and cloud providers to build partnerships that will help drive adoption.
How do you see the future of AI hardware and model development?
I expect a world where things change faster than people anticipate. We’ll see both model companies (like OpenAI, Anthropic) and data companies (building specialized models for verticals or proprietary use cases).
While transformers are incredibly powerful, they won’t be the last architecture we create. Every significant AI group is exploring alternatives, though much of the knowledge is already open – it’s how you piece components together and scale them that becomes the secret sauce.
Regarding challenges like memory bandwidth, our PAL format helps by covering FP16 precision in just 8 bits, saving 2x on bandwidth with negligible overhead for decoding.
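Back-of-the-envelope arithmetic for that bandwidth claim, using a hypothetical 70B-parameter model purely as an example size:

```python
params = 70e9                 # hypothetical model size, for illustration only
fp16_bytes = params * 2       # 2 bytes per value at FP16
eight_bit_bytes = params * 1  # 1 byte per value in an 8-bit format such as PAL
print(fp16_bytes / 1e9, "GB vs", eight_bit_bytes / 1e9, "GB")  # 140.0 GB vs 70.0 GB -> 2x less traffic
```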
The edge will also become increasingly important, with intelligence distributed across different contexts – from self-driving cars to household robots to healthcare applications. Our stack is designed to work from edge to cloud, helping connect these worlds.
This post and episode are part of a collaboration between Gradient Flow and Lemurian Labs. See our statement of editorial independence.
