The Data Exchange

LLMs on CPUs, Period

Nir Shavit on tools for sparsifying and quantizing LLMs for efficient CPU inference.

Subscribe: Apple • Spotify • Overcast • Google • AntennaPod • Podcast Addict • Amazon • RSS.

Nir Shavit is a Professor at MIT’s Computer Science and Artificial Intelligence Laboratory and a founder of Neural Magic, a startup working to accelerate open-source large language models and simplify AI deployments.


Sparsity and quantization are methods to optimize large language models (LLMs) for efficient inference on CPUs. Sparsity reduces computations by zeroing out connections with small weights, while quantization compresses models by using lower precision numbers to represent weights and activations. Neural Magic has shown that applying both sparsity and quantization to LLMs enables high-performance inference on mainstream hardware, without needing specialized GPUs. This tandem approach makes deploying powerful LLMs much more practical on CPUs with limited resources.

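To make those two ideas concrete, here is a minimal, illustrative NumPy sketch of magnitude-based pruning and symmetric int8 quantization applied to a single weight matrix. It is a toy, not Neural Magic's tooling; the function names, the 50% sparsity target, and the per-tensor scaling scheme are assumptions chosen only for clarity.

```python
import numpy as np

def sparsify(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the fraction of weights with the smallest magnitudes."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 values plus one float scale."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy "layer": prune half the weights, then quantize what remains.
w = np.random.randn(4, 8).astype(np.float32)
w_sparse = sparsify(w, sparsity=0.5)
w_int8, scale = quantize_int8(w_sparse)
w_approx = w_int8.astype(np.float32) * scale  # dequantized approximation
print(f"zeros: {np.mean(w_sparse == 0):.0%}, "
      f"max abs error: {np.abs(w_approx - w_sparse).max():.4f}")
```

In practice, the speedups come from inference kernels that skip the zeroed weights and operate on the low-precision values directly; this sketch only shows the weight transformations themselves.
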
Interview highlights – key sections from the video version:

  1. Sparsification & quantization during fine-tuning
  2. Why target CPUs for LLM inference
  3. Accuracy after sparsification & quantization
  4. Sparsification and quantization of pre-trained LLMs (foundation models)
  5. Performance improvements
  6. Democratizing AI compute: Neural Magic, Llama.cpp, and more
  7. Developing technology that sparsifies and customizes upstream AI models for easier downstream use
  8. Speed and efficiency: minimizing compute, latency, data movement, and storage for conversational AI interactions over time
  9. CPUs can match GPUs
  10. Multimodal models

If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter:
