Pablo Villalobos on lack of data and other bottlenecks for scaling machine learning models.
Pablo Villalobos is a Staff Researcher at Epoch, and lead author of the recent paper “Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning”. We discuss the key findings in this paper, as well as a related study Pablo conducted on scaling laws. The term “scaling laws” refers to empirical relationships between a quantity of interest (typically the test loss, or a performance metric on a fine-tuning task) and properties of the architecture or optimization process, such as model size, width, or training compute. These laws can guide the design and training of deep learning models, and they also offer insight into the underlying principles.
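For readers new to the idea, here is a minimal sketch of what a scaling law can look like in practice. It assumes the simplest commonly used form, a power law in which loss falls as model size grows, and fits it by linear regression in log-log space. The function name `fit_power_law`, the constants, and the data points are all made up for illustration; none of them come from Pablo's paper or from the episode.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-alpha) by least squares in log-log space.

    Hypothetical helper for illustration only.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope and intercept of log(loss) vs. log(size).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data generated from loss = 10 * size**(-0.1); not real measurements.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * s ** -0.1 for s in sizes]

a, alpha = fit_power_law(sizes, losses)

# Extrapolate the fitted curve to a model ten times larger than any in the data.
predicted = a * 1e10 ** -alpha
print(f"a = {a:.3f}, alpha = {alpha:.3f}, predicted loss at 1e10 = {predicted:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the constants used to generate them. Real measurements are noisy, and published scaling laws often include extra terms (for example, an irreducible loss floor), so treat this as a toy illustration of the concept rather than a method for real forecasts.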
Interview highlights – key sections from the video version:
- What is the paper about and why did you write it?
- Amount of data used by common deep learning architectures
- Trends in amount of data for NLP
- Trends in amount of data for computer vision applications
- What has been the reaction to your paper from the ML research community?
- What are “scaling laws” in the context of machine learning?
- Trends pertaining to the amount of compute for AI
- The role of synthetic data
- Which of the bottlenecks concerns Pablo the most?
- How can listeners contribute to easing some of these bottlenecks?
- Privacy and the emergence of decentralized custom models
Related content:
- A video version of this conversation is available on our YouTube channel.
- Neil Thompson: The Computational Limits of Deep Learning
- Jinsung Yoon and Sercan Arik: Generating high-fidelity and privacy-preserving synthetic data
- FREE Report: 2023 Trends in Data, Machine Learning, and AI
- Yashar Behzadi: Synthetic data technologies can enable more capable and ethical AI
- Gabriela Zanfir-Fortuna and Andrew Burt: Preparing for the Implementation of the EU AI Act and Other AI Regulations
- Peter Norvig and Alfred Spector: Data Science and AI in Context
- Jian Pei: Pricing Data Products
[Image: Livestreamers, generated with Stable Diffusion.]