The Data Exchange

LLMs on CPUs, Period

Nir Shavit on tools for sparsifying and quantizing LLMs for efficient CPU inference.

Subscribe: Apple • Spotify • Overcast • Google • AntennaPod • Podcast Addict • Amazon • RSS.

Nir Shavit is a Professor at MIT’s Computer Science and Artificial Intelligence Laboratory and a founder of Neural Magic, a startup working to accelerate open-source large language models and simplify AI deployments.


Sparsity and quantization are methods to optimize large language models (LLMs) for efficient inference on CPUs. Sparsity reduces computations by zeroing out connections with small weights, while quantization compresses models by using lower precision numbers to represent weights and activations. Neural Magic has shown that applying both sparsity and quantization to LLMs enables high-performance inference on mainstream hardware, without needing specialized GPUs. This tandem approach makes deploying powerful LLMs much more practical on CPUs with limited resources.

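To make those two ideas concrete, here is a minimal, illustrative NumPy sketch of magnitude-based pruning and symmetric int8 quantization applied to a single weight matrix. It is a toy, not Neural Magic's tooling; the function names, the 50% sparsity target, and the per-tensor scaling scheme are assumptions chosen only for clarity.

```python
import numpy as np

def sparsify(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the fraction of weights with the smallest magnitudes."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 values plus one float scale."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy "layer": prune half the weights, then quantize what remains.
w = np.random.randn(4, 8).astype(np.float32)
w_sparse = sparsify(w, sparsity=0.5)
w_int8, scale = quantize_int8(w_sparse)
w_approx = w_int8.astype(np.float32) * scale  # dequantized approximation
print(f"zeros: {np.mean(w_sparse == 0):.0%}, "
      f"max abs error: {np.abs(w_approx - w_sparse).max():.4f}")
```

In practice, the speedups come from inference kernels that skip the zeroed weights and operate on the low-precision values directly; this sketch only shows the weight transformations themselves.
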
Interview highlights – key sections from the video version:

  1. Sparsification & quantization during fine-tuning
  2. Why target CPUs for LLM inference
  3. Accuracy after sparsification & quantization
  4. Sparsification and quantization of pre-trained LLMs (foundation models)
  5. Performance improvements
  6. Democratizing AI compute: Neural Magic, Llama.cpp, and more
  7. Developing technology that sparsifies and customizes upstream AI models for easier downstream use
  8. Speed and efficiency: minimizing compute, latency, data movement, and storage for conversational AI interactions over time
  9. CPUs can match GPUs
  10. Multimodal models

If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter:
