LLMs on CPUs, Period

Nir Shavit on tools for sparsifying and quantizing LLMs for efficient CPU inference.

Subscribe: Apple • Spotify • Overcast • Google • AntennaPod • Podcast Addict • Amazon • RSS.

Nir Shavit, Professor at MIT’s Computer Science and Artificial Intelligence Laboratory, is also a Founder of Neural Magic, a startup working to accelerate open-source large language models and simplify AI deployments.


Sparsity and quantization are two techniques for optimizing large language models (LLMs) for efficient inference on CPUs. Sparsity reduces computation by zeroing out low-magnitude weights so they can be skipped at inference time, while quantization compresses the model by representing weights and activations with lower-precision numbers. Neural Magic has shown that applying both techniques together enables high-performance LLM inference on mainstream hardware, without specialized GPUs, making it practical to deploy powerful models on commodity CPUs.
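To make the two ideas concrete, here is a minimal NumPy sketch, not Neural Magic's actual tooling, that applies magnitude pruning and symmetric 8-bit quantization to a single weight matrix and checks how much the layer's output changes; the sparsity level and matrix sizes are illustrative assumptions.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction are zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: store weights as int8 plus one float scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 weights back to float32 for comparison against the original layer."""
    return q.astype(np.float32) * scale

# Toy example: one dense layer's weight matrix (sizes chosen for illustration).
rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)

w_sparse = magnitude_prune(w, sparsity=0.7)   # 70% of weights become zero
q, scale = quantize_int8(w_sparse)            # remaining weights stored in 8 bits

x = rng.normal(size=(1, 512)).astype(np.float32)
y_ref = x @ w
y_approx = x @ dequantize(q, scale)
print("fraction of zero weights:", np.mean(q == 0))
print("relative output error:", np.linalg.norm(y_ref - y_approx) / np.linalg.norm(y_ref))
```

In practice the speedup comes from a sparsity-aware runtime such as Neural Magic's DeepSparse, which skips the zeroed weights entirely and operates on the compressed 8-bit values, rather than multiplying dense float32 matrices as this sketch does.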

Interview highlights – key sections from the video version:


Related content:


If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter: