Nir Shavit on tools for sparsifying and quantizing LLMs for efficient CPU inference.
Subscribe: Apple • Spotify • Overcast • Google • AntennaPod • Podcast Addict • Amazon • RSS.
Nir Shavit, Professor at MIT’s Computer Science and Artificial Intelligence Laboratory, is also a Founder of Neural Magic, a startup working to accelerate open-source large language models and simplify AI deployments.
Sparsity and quantization are methods for optimizing large language models (LLMs) for efficient inference on CPUs. Sparsity reduces computation by zeroing out connections with small weights, while quantization compresses models by representing weights and activations with lower-precision numbers. Neural Magic has shown that applying both sparsity and quantization to LLMs enables high-performance inference on mainstream hardware, without the need for specialized GPUs. This tandem approach makes deploying powerful LLMs far more practical on CPUs with limited resources.
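To make the two ideas concrete, here is a minimal NumPy sketch of the textbook versions of both techniques: magnitude pruning (zero out the fraction of weights with the smallest absolute value) and symmetric per-tensor int8 quantization. This is an illustrative toy, not Neural Magic's actual pipeline; the function names and the 50% sparsity target are assumptions for the example.

```python
import numpy as np

def sparsify(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Magnitude pruning: zero out the `sparsity` fraction of weights
    with the smallest absolute value (a simple stand-in for the
    sparsification discussed in the episode)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q,
    where q holds 8-bit integer codes and scale is one fp32 number."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy weight matrix standing in for one LLM layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)

w_sparse = sparsify(w, sparsity=0.5)     # roughly half the entries become zero
q, scale = quantize_int8(w_sparse)       # int8 codes plus a single fp32 scale
w_approx = q.astype(np.float32) * scale  # dequantized approximation of w_sparse
```

Combined, the two steps shrink storage (8-bit codes instead of 32-bit floats) and let a CPU kernel skip multiplications by zeroed weights, which is the core of the CPU-inference argument made in the episode.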
Interview highlights – key sections from the video version:
- Sparsification & quantization during fine tuning
- Why target CPUs for LLM inference
- Accuracy after sparsification & quantization
- Sparsification and quantization of pre-trained LLMs (foundation models)
- Performance improvements
- Democratizing AI compute: Neural Magic, Llama.cpp, and more
- Developing technology to sparsify and customize upstream AI models for easier downstream use
- Speed and efficiency: minimizing compute, latency, data movement, and storage for conversational AI interactions over time
- CPUs can match GPUs
- Multimodal models
Related content:
- A video version of this conversation is available on our YouTube channel.
- Apple’s AI Leap: Bridging the Gap in On-Device Intelligence
- Daniel Lenton: Ivy – The One-Stop Interface for AI Model Deployment and Development
- Philipp Moritz and Goku Mohandas: Navigating the Nuances of Retrieval Augmented Generation
- Waleed Kadous: Best Practices for Building LLM-Backed Applications
- Ivy: Streamlining AI Model Deployment and Development
- Best Practices in Retrieval Augmented Generation
- OpenAI Developer Conference: Customizable AI Sparks Excitement and Concern
- Expanding access to Frontier Models with software and hardware optimizations
- Open Source Principles in Foundation Models
- Michele Catasta: Software Development with AI and LLMs
If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter: