Robert Nishihara on Multimodal AI, Scaling Infrastructure, and Post-Training Optimization.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Robert Nishihara is co-founder of Anyscale and co-creator of Ray, the open source project that has emerged as the AI Compute Engine. This episode dives into critical aspects of AI development, emphasizing the paradigm shift toward data-centric practices, the challenges of handling multimodal and large-scale datasets, and the importance of scalable infrastructure. Key trends like video generation, synthetic data, and AI-driven data curation are explored, alongside insights into scaling laws and reasoning capabilities. The discussion provides actionable insights for practitioners building real-world AI solutions. [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]
Interview highlights – key sections from the video version:
- The Paradigm Shift: Data’s Role in AI
- Challenges of Multi-Modal Data Tooling
- AI-Centric Workloads and Evolving Data Pipelines
- Scaling and Distributed Processing for Unstructured Data
- Enterprise Adoption of AI and Infrastructure Challenges
- Experimentation in AI: From Data Collection to Evaluation
- AI Development Lifecycle: From Ideation to Production
- The Next Data Type to Dominate AI: Images and Video
- Understanding and Mining Video Data
- Scaling Laws and Their Current Limitations
- Improving Data Quality and Reasoning Capabilities
- Looking Ahead to AI Developments in 2025
- The Role of Post-Training in Enhancing Foundation Models
- Future Predictions: The Evolution of Foundation Models
Related content:
- A video version of this conversation is available on our YouTube channel.
- Paradigm Shifts in Data Processing for the Generative AI Era
- Vaibhav Gupta → Unleashing the Power of BAML in LLM Applications
- Deepti Srivastava → Beyond ETL: How Snow Leopard Connects AI, Agents, and Live Data
- Petros Zerfos and Hima Patel → Unlocking the Power of LLMs with Data Prep Kit
If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter:
Transcript.
Below is a heavily edited excerpt, in Question & Answer format.
What major shift are we seeing in how AI practitioners approach model development compared to a decade ago?
The importance of data in AI has been recognized for a while, but we’re experiencing a significant paradigm shift. A decade ago, with benchmarks like ImageNet, the focus was on model architectures and optimization algorithms. The dataset was considered static – you’d split ImageNet into train and test sets, then innovate on the model architecture.
Now we’re in a different paradigm. While there’s still some innovation in model architectures (transformers being common), most innovation is happening on the data side. This includes acquiring different data sources, evaluating which sources make sense for training, curating data, generating synthetic data, and filtering out low-quality data.
How has data processing for AI evolved beyond traditional approaches?
Previously, data preparation was limited – you might crop or scale images to augment your dataset. Now we’re using AI itself to process and curate data. For instance, if you’re working with video data and want to filter out low-quality content, that’s an AI task.
Consider autonomous vehicle companies with millions of miles of driving footage – not all data is created equal. Driving down an empty road for long stretches is less valuable than capturing rare scenarios like unusual pedestrian behavior. Extracting the most informative, interesting, highest quality data is very much an AI task.
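To make the curation point concrete, here is a minimal sketch of AI-driven filtering: sample a few frames from each driving clip, score them against natural-language descriptions of rare scenarios with an off-the-shelf CLIP model, and keep only the clips that score highly. The scenario prompts, the threshold, and the frame-sampling helper are illustrative assumptions, not details from the episode.

```python
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative prompts: the last one describes the "low-value" case.
SCENARIOS = [
    "a pedestrian crossing unexpectedly in front of a car",
    "a vehicle running a red light at an intersection",
    "an empty road with no traffic",
]

def sample_frames(path: str, n: int = 8) -> list[Image.Image]:
    """Uniformly sample up to n frames from a video file with OpenCV."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, max(total, 1), max(total // n, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

def rare_scenario_score(frames: list[Image.Image]) -> float:
    """Highest probability, across sampled frames, of any 'rare scenario' prompt."""
    if not frames:
        return 0.0
    inputs = processor(text=SCENARIOS, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return probs[:, :-1].max().item()  # ignore the "empty road" prompt

def curate(clip_paths: list[str], threshold: float = 0.6) -> list[str]:
    """Keep only clips whose sampled frames look like rare, informative events."""
    return [p for p in clip_paths if rare_scenario_score(sample_frames(p)) >= threshold]
```

In practice the scorer could just as easily be a vision-language model generating descriptions, but the shape of the workload is the same: the filter itself is a model.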
How will AI change traditional data insights and processing workflows?
Companies have accumulated vast data stores because they want to extract value from them. Today, they primarily do this by running SQL queries and simple analytics on structured data – not because they lack unstructured data, but because SQL queries can’t operate on PDFs, videos, images, or arbitrary text documents.
In the future, companies will get insights from data using AI to read the data, reason about it, and draw conclusions. This represents a shift from SQL-centric to AI-centric workflows, which means moving from CPU-intensive to GPU-intensive or mixed CPU/GPU workloads. The tooling for these multimodal, GPU-intensive, data-intensive workloads is still very immature or practically non-existent.
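As a small illustration of what an AI-centric stage looks like in code, the sketch below uses an LLM to turn unstructured support tickets into structured records that ordinary analytics can then aggregate. It assumes the OpenAI Python client with JSON-mode output; the model name and field schema are illustrative, and any LLM endpoint with structured output would work.

```python
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract a JSON object with keys 'product', 'sentiment' "
    "(positive, neutral, or negative), and 'summary' from this support ticket:\n\n{doc}"
)

def extract_record(doc: str) -> dict:
    """Use the LLM to read one unstructured document and emit a structured row."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any JSON-capable model works
        messages=[{"role": "user", "content": PROMPT.format(doc=doc)}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# Placeholder documents; in practice these would come from your ticketing system.
tickets = [
    "The new firmware bricked my router, I'm furious.",
    "Thanks for the quick refund on my headphones!",
]
records = [extract_record(t) for t in tickets]

# Once every ticket is a structured record, plain aggregation works again.
negative_by_product = {}
for r in records:
    if r["sentiment"] == "negative":
        negative_by_product[r["product"]] = negative_by_product.get(r["product"], 0) + 1
```

The expensive step is no longer the query at the end; it's the model inference that produces the rows, which is exactly the CPU-to-GPU shift described above.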
What are the resource challenges in processing multimodal data at scale?
Processing multimodal data isn’t just about running inference with an LLM – there are many other operations. With video data, you might need to decompress the video, re-encode it for downstream consumption, find scene changes, do transcription, run vision-language models to generate descriptions, and apply various classifiers to extract structured information.
Some stages will be GPU-bound, others memory-bound (videos are massive), and some CPU-bound. You need an architecture that can disaggregate these resources, scaling up GPUs where they’re the bottleneck and CPUs where they’re the constraint. You also need to stream processing stages together to keep your GPUs busy. And with large datasets – billions of images or hundreds of hours of video – you must scale out computationally because the data won’t fit on a single machine.
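Since Ray comes up throughout this conversation, here is a minimal sketch of that pattern using Ray Data: CPU-bound and GPU-bound stages are declared separately so each can scale on the resource it actually needs, and batches stream between stages to keep the GPUs busy. It assumes a recent Ray release; the bucket paths, resource counts, and stage bodies are illustrative stubs, not the speaker's actual pipeline.

```python
import numpy as np
import ray

def decode_clips(batch):
    # CPU-bound stage: decompress each video and cut it into clips.
    # Real code would call ffmpeg/decord here; stubbed out for brevity.
    return {"clip": batch["bytes"], "path": batch["path"]}

class Captioner:
    # GPU-bound stage: one actor per GPU, with the model loaded once per worker.
    def __init__(self):
        self.model = None  # stand-in for loading a vision-language model

    def __call__(self, batch):
        # Stand-in for running the VLM on each clip.
        batch["caption"] = np.array(["caption for " + p for p in batch["path"]])
        return batch

ds = (
    ray.data.read_binary_files("s3://my-bucket/raw-videos/", include_paths=True)
    .map_batches(decode_clips, batch_size=1, num_cpus=4)               # scale on CPUs
    .map_batches(Captioner, batch_size=16, num_gpus=1, concurrency=8)  # scale on GPUs
)

# Keep the structured outputs (captions, paths) and drop the raw bytes.
ds.drop_columns(["clip"]).write_parquet("s3://my-bucket/curated-clips/")
```

Because the dataset is read and processed in a streaming fashion across a cluster, none of it ever has to fit on a single machine.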
Are these data challenges primarily theoretical or are companies facing them today?
Companies are running into these challenges today. Previously, we were getting value primarily from structured data because we had the tools to process it. We weren’t thinking much about unstructured or multimodal data’s potential. Now that generative AI is unlocking value from diverse data types, we’re collecting more data.
Companies are looking at scaling laws and realizing that training bigger models with more compute and more data leads to better results. Many are saying they need to enable training on 100x more data, but their internal systems weren’t built for that scale. This puts tremendous pressure on ML infrastructure teams who are now on the critical path for delivering results.
Will these multimodal data challenges impact regular enterprises or just tech-forward companies?
While companies like ByteDance and Pinterest are at the cutting edge, every company has valuable data about its business and customers. Every company will want to get insights from this data and use it for decision-making. The value and potential are there – the challenge is having the tooling and infrastructure to make it possible.
Why is experimentation crucial throughout the AI pipeline?
Customization isn’t just about fine-tuning the model or optimizing the prompt. There are countless decisions at every pipeline stage: what data to collect, how to segment or chunk it, what embedding functions to use, how to do retrieval, what to include in the context for RAG applications, how to rank context, which model to use, how to fine-tune it, and whether to fact-check outputs.
To iterate quickly, you need to try different choices and evaluate how well they work. We recommend companies over-invest in evaluations early on, doing steps manually to evaluate not just end-to-end performance but individual pipeline stages. Without the right computational foundation, these experiments become impossible to execute.
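A minimal sketch of stage-level evaluation for a RAG pipeline is below: a small hand-labeled set is scored on retrieval recall@k separately from end-to-end answer quality, so that when quality drops you know which stage to fix. The `retrieve`, `answer`, and `judge` callables are hypothetical stand-ins for your own pipeline and grading logic.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    question: str
    relevant_doc_ids: set[str]   # documents a correct answer must draw on
    reference_answer: str

def recall_at_k(examples: list[Example],
                retrieve: Callable[[str, int], list[str]],
                k: int = 5) -> float:
    """Fraction of labeled relevant documents found among the top-k retrieved ids."""
    hits = total = 0
    for ex in examples:
        retrieved_ids = set(retrieve(ex.question, k))
        hits += len(ex.relevant_doc_ids & retrieved_ids)
        total += len(ex.relevant_doc_ids)
    return hits / max(total, 1)

def end_to_end_accuracy(examples: list[Example],
                        answer: Callable[[str], str],
                        judge: Callable[[str, str, str], bool]) -> float:
    """Fraction of questions whose generated answer the judge marks correct."""
    correct = sum(
        judge(ex.question, answer(ex.question), ex.reference_answer)
        for ex in examples
    )
    return correct / max(len(examples), 1)
```

Even a few dozen hand-labeled examples like this make it much cheaper to compare chunking strategies, embedding models, or rerankers one stage at a time.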
How do scaling needs change as companies mature in their AI implementation?
The need for scale changes depending on your AI maturity. You can prototype and experiment at a small scale using APIs and limited data to validate product-market fit. But once you’re moving to production, considerations change dramatically. Companies start caring about cost, scale, reliability, and model upgrades – presenting a very different set of challenges than during the ideation phase.
What’s the next data modality likely to gain traction beyond text?
Image data is becoming ubiquitous. Many common LLM use cases will benefit from visual input – copilots seeing screenshots, customer support viewing damaged products. Text, images, video, and audio are the big modalities, with PDFs being significant but ultimately converted to text and images under the hood.
Video will be the most challenging but potentially most rewarding modality. It combines images, audio, and often text while containing enormous amounts of information. It poses hard infrastructure challenges due to its size, but it’s a format people naturally interact with. Once we have the tooling to really leverage video data, it’ll be indispensable.
What’s the current state of video generation and what challenges remain?
Companies like Runway have impressive video generation models, but model quality remains the primary focus for improvement. Current models can generate awesome videos (typically 5-10 seconds at a time, though segments can be stitched together), but they’re still difficult to control – translating a clear idea in your head to the exact video you want remains challenging.
These tools will be widely used in various parts of the video creation process very soon, though Academy Award-winning AI films may still be some way off.
Where do we stand with scaling laws and what’s the path forward?
The concept of scaling laws – that putting more compute, data, and bigger models yields better results – is one of the major AI breakthroughs of the past decade. Previously, many researchers focused more on clever algorithms rather than scaling simple techniques.
Any given technique, model architecture, or strategy (like generating synthetic data) may have a finite scaling window with diminishing returns. Continued progress requires new ideas and strategies. When people say we’re running out of data, remember we can generate data, license data behind firewalls, and improve existing data by correcting mistakes and separating high-quality from low-quality content. On the reasoning front, there’s still significant mileage to gain by improving techniques that enable models to leverage inference-time compute.
What AI developments should we expect to see in 2025?
We’ll see significantly better reasoning capabilities – math is one benchmark where there’s huge interest and promise. Multimodality will become ubiquitous; while most people currently use text-based models, text-and-image multimodal models will be everywhere.
Companies will increasingly use AI to extract insights from previously underutilized data like internal documents, design files, and recorded meetings. The investment in data curation, processing, and preparation will explode – growing even faster than spending on model training.
Is extensive data preparation also important for fine-tuning and other post-training techniques?
Absolutely. People do continued training for a reason, which blurs the line between pre-training and fine-tuning. If you want to inject more knowledge and intelligence into model weights, you’ll benefit from having more high-quality data.
Most companies should optimize their platforms for post-training since very few will train models from scratch. Many companies will train lots of smaller models, and many will do post-training work. Much of what we appreciate about foundation models can be attributed to the post-training stage, which is becoming increasingly complex.

