From Web Video to Real-World Robots

Changan Chen on Foundation Models for Robotics, Video Prediction, and Real-World Deployment.


Subscribe: Apple • Spotify • Overcast • Pocket Casts • YouTube • AntennaPod • Podcast Addict • Amazon • RSS.

Ben Lorica speaks with Changan Chen, co-founder and Chief Research Officer at Rhoda AI, about what it takes to build foundation models for robots that can operate in the real world. They discuss why Changan’s team uses web-scale video rather than language as the core training signal, how 10 to 20 hours of teleoperation data can adapt a model to specific industrial tasks, and why video prediction may offer both better interpretability and a practical route to deployment. The conversation also covers world models, safety, reinforcement learning, and the challenge of moving robotic systems from controlled lab demos into production environments.

Subscribe to the Gradient Flow Newsletter



Transcript

Below is a polished and edited transcript.

Ben Lorica: All right, today we have Changan Chen, co-founder and Chief Research Officer at Rhoda AI, which you can find at rhoda.ai. The tagline is “redefining robotic intelligence,” with a focus on deploying robotic systems into the real world. With that, welcome to the podcast.

Changan Chen: Thank you for having me.

Ben Lorica: Let’s start with a bit of clarification. For the rest of the discussion, what do you mean by “robot”? A robot can be a humanoid, an arm, or a vacuum cleaner. In the context of what we’re about to talk about, how would you define a robot?

Changan Chen: In this conversation, the robot I’ll be talking about is mostly a mobile robot with arms and a torso, like a humanoid—not a Roomba or an entertainment robot.

Ben Lorica: There are also robots in factories that only have a specific function, like just an arm. Does what we’re going to talk about go beyond just an arm?

Changan Chen: Technically, we can use our model on those robots as well. Our focus is more on the intelligence layer, and that intelligence can be applied to any robot, even a single arm.

Ben Lorica: Okay. So it assumes that the robot has some sensors, including visual sensors. Based on what you just said, it sounds like what you’re building is what listeners will be familiar with under the term “foundation model.” Is that correct?

Changan Chen: Yes, I would say we are building a foundation model tailored toward robotics.

Ben Lorica: “Foundation model” is a fairly new term. Large Language Models (LLMs) had been around for a while, but let’s face it, no one really paid attention until GPT-3 or 3.5. Where are we with the equivalent foundation model for robots? Are we at GPT-1 or GPT-2?

Changan Chen: I think we are around GPT-3. These models are becoming more and more generalizable, but not to the extent that you can give an arbitrary instruction and the robot will carry it out autonomously in a fully general way. There is a lot of effort in the industry right now trying to scale up these foundation models.

There is some effort on Vision-Language-Action (VLA) models. That involves taking a language model and post-training it to add additional action outputs. But we are doing it in a different way. Similar to large language models, we are building large vision models. Our model is more natively vision-driven. Instead of trying to understand text and semantic knowledge, our model is pre-trained on web-scale video data. It tries to understand what’s happening in the video, the dynamics, and the interactions with the world. With that knowledge, we apply it to robotics.

Ben Lorica: There are other startups, like Physical Intelligence, doing similar things. But there’s also another category of startups that use the term “world model.” What is the overlap between what you’re doing and world models? I wrote a post about this recently—there’s no single definition of a world model. Fei-Fei Li uses it in a different way than others, for example. How would you position yourself against these so-called world models?

Changan Chen: “World model” is a very broad term. In its broadest definition, it just means anything that can model how the world behaves. If you take an action, this model can return the next state of the world. It’s like a simulation; it simulates what’s going to happen next.

First of all, our model is a native video generation model. There are different uses for video generation models. In the industry, some people use them for content creation—like Sora, Runway, or Pika. People can interact with these virtual worlds online by moving around and doing things, making gaming one of their main use cases. But the world model we are exploring is for robotics, aiming for physical interaction with the real world.

When it comes to robotics, there are two different use cases for video models. One is using the video model as a policy model. That means the model knows what to do; it has an intention, like “solve this task” or “cook this dish.” The other is using it as a world model. In that case, the model itself doesn’t carry the intention of the task; it only reflects what will happen if you take a specific action. A policy model is biased toward making sure something specific happens, while a world model is unbiased—if you take an action and fail to grasp an object, the model should reflect that failure.

The primary usage of our model is as a policy model. We want the video model to roll out policies or generate video that can complete a task. However, with a very small tweak, our model can also be used as a world model for simulation, evaluation, and understanding how well the model is performing without having to deploy it onto a real robot.
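The policy-model/world-model distinction Changan draws can be made concrete with a toy sketch: a policy model carries the task intention and chooses actions, while a world model is an unbiased transition function that reports what would happen, including failures. Everything below is a hypothetical 1-D illustration, not Rhoda AI's actual models or API.

```python
# Toy 1-D illustration: a gripper on a line, an object at position 3.

def world_model(state, action):
    """Unbiased transition: report what WOULD happen, including failure."""
    pos, holding = state
    if action == "grasp":
        # The grasp only succeeds when the gripper is at the object.
        return (pos, abs(pos - 3) < 1)
    if action == "left":
        return (pos - 1, holding)
    if action == "right":
        return (pos + 1, holding)
    return state  # "noop" leaves the world unchanged

def policy_model(state, task):
    """Goal-directed: choose the action intended to complete the task."""
    pos, holding = state
    if task == "pick" and not holding:
        if pos < 3:
            return "right"
        if pos > 3:
            return "left"
        return "grasp"
    return "noop"

# Roll the policy out inside the world model -- the "simulation and
# evaluation without a real robot" use case described above.
state = (0, False)
for _ in range(10):
    action = policy_model(state, "pick")
    state = world_model(state, action)

print(state)  # gripper at 3, holding the object
```

The same world model would faithfully report a failed grasp (e.g., grasping at position 0), which is exactly the unbiased behavior that makes it useful for evaluation.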

Ben Lorica: Let’s make this concrete, and then we can dive deeper into the “how.” To make the “what” more concrete: what are some early examples of usage for your foundation model? Give us one to three examples of how it can be used.

Changan Chen: We are doing that in-house. We have application teams that take our model and try to solve customer tasks. On our website, we show three different use cases. One is called “decanting” in the manufacturing space. A manufacturing line receives boxes containing a lot of bearings, and a person usually needs to decant the box and sort the trash. The boxes are very heavy, around 10 kilograms, and workers usually have to do this 24/7.

We showed that with our model, we can solve this task very efficiently. By training on just 10 to 20 hours of data, we can run the model continuously for hours without requiring any human intervention. One of the biggest benefits of this model is that it’s extremely data-efficient. With 10 to 20 hours of robot data, the model can achieve hours of autonomous operation that meets production standards, which is huge. That’s what the industry cares about the most—getting to 99.9% reliability.

Ben Lorica: So in this example, Changan, as I interpret what you just said: you have a foundation model, and I have a task. For this particular task, all I need is 10 to 20 hours of the equivalent of fine-tuning data. Once I fine-tune it on those 10 to 20 hours, I can deploy it to production?

Changan Chen: Yes, exactly.

Ben Lorica: And that’s the typical cycle you see? You provide a foundation model, and it can be customized using a certain amount of video data.

Changan Chen: Yes. We show a couple of tasks on the website. Another task we are working on is container breakdown. You take a box, rotate it, break down each panel, and take out the trash. Another task is return processing—receiving a package, opening it, dumping out the clothes, inspecting both sides, folding the item, putting it back in the bag, and sealing it. All these are long-horizon tasks, taking a few minutes to complete. But with just 10 to 20 hours of data, we can get them running very effectively.

Ben Lorica: So the 10 to 20 hours of fine-tuning data have to be spot-on in terms of being related to the exact task, right?

Changan Chen: Let me briefly explain how the training stages work. We have a pre-training stage that uses web-scale data. We take pretty much any video data and use video generation as the modeling objective. The model is tasked with taking a video sequence and predicting what’s going to happen in the future.

Once the model is trained on web data, we post-train it on robot data. By robot data, I mean teleoperation, where a human controls the robot to perform specific tasks. That’s the 10 to 20 hours of data I’m talking about. This data is generated by humans performing the task via teleoperation on a robot. It includes not only vision but also additional modalities like state and action data. In the post-training stage, we add these additional modalities to the model. Given the state, action, and video data, the model tries to predict what’s going to happen next, converts that prediction into actions, and executes those actions on the robot.

Ben Lorica: Regarding this teleoperator data, is it possible to reach the same state using just virtual training?

Changan Chen: I think that’s something we are going to witness in the next couple of years. In fact, we already published a blog post about this, which included a section on “demo following.” Right now, training robot policies requires a human to collect data with the robot—either directly on the robot or with some sort of proxy like an end-effector.

When humans learn a new task, we watch how other people perform it and learn by directly observing their behavior. This is actually something we can replicate. In our blog post, we demonstrated that a human can perform a pick-and-place task—such as picking up a certain object and putting it into a bin, or drawing a shape on a whiteboard. The model can follow the human video demonstration and learn to solve that task without a human directly teleoperating the robot.

Ben Lorica: Are you saying this is like one-shot learning, as long as the example is very close to the target task?

Changan Chen: Yes. In the long term—say, 5 to 10 years—robots will reach a state where they don’t require any human teleoperation. A human will just perform a task, record a video of themselves doing it, and put that video into the robot’s memory. The robot will then try to perform the task based on that demonstration. I think that’s highly doable in the next 5 to 10 years.

Ben Lorica: To clarify, you developed the foundation model, but are you building the actual robots? In other words, do you partner with people who build the hardware?

Changan Chen: We do both. We use off-the-shelf robots, but we also have internal efforts to push the boundaries of hardware. For example, one area that needs more exploration is how to transition from a simple gripper to a more dexterous manipulation end-effector, like human hands.

Ben Lorica: You should read Rodney Brooks’ blog posts; he’s very skeptical about dexterity. Let’s go back to the foundation model. In the LLM space, as most people understand it, models use a lot of web data scraped from the internet. What are the key sources of data for your foundation model?

Changan Chen: We also use web data. The whole premise of our foundation model is that robot intelligence requires understanding physics and how to interact with the world. That’s not something you have to learn solely from robot data; that part is shared with humans. You can learn from human interactions with the world, and basically, all videos contain the same physics. So during pre-training, by watching web data, the model learns basic physics and dynamics.

Ben Lorica: But Changan, what happens if the web gets flooded with deepfakes and low-quality content that violates the laws of physics?

Changan Chen: Video generation models are getting better and better. Sometimes I can’t even tell whether a video is a deepfake.

Ben Lorica: So what is the quality control process for your dataset? Obviously, if you look at text, there are a lot of Reddit threads where the signal-to-noise ratio is pretty poor. How much do you invest in making sure the pre-training video data is high quality?

Changan Chen: We put a lot of effort into filtering and processing the data to ensure high quality. For deepfakes, even if it’s hard for human eyes to tell, AI detection models are extremely effective at identifying them. We take raw videos, process them to ensure high quality, and then use them for model training.

Ben Lorica: Since the foundation model relies heavily on video, it’s predicting the next frame, not the next action. Can you explain that distinction to our listeners and why it may or may not matter?

Changan Chen: Great point. One of the biggest issues in robotics is the lack of data—specifically, robot data that includes both actions and perception states. When we started a year and a half ago, we thought this problem could be broken down into two components: one for video prediction, and another for taking the predicted video and converting it into an action.

It turns out the second part is extremely simple. With just a few hours of data, we can get the video-to-action translation working very effectively. The post-training phase trains two models. One is the video model (which is pre-trained first, then post-trained), and the other is an Inverse Dynamics model. The Inverse Dynamics model converts the predicted video into an action. During post-training, this Inverse Dynamics model picks up these signals very quickly and effectively. It has no issue extracting the correct action from the generated videos.

What we discovered is that the Inverse Dynamics action model is not a bottleneck. That part is very simple once you have a certain amount of data. The rest of the robot control is basically reduced to video prediction. That is something that scales very well with web data, even without corresponding action data. That’s how we decouple video prediction from action extraction.
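The decoupling described above can be sketched in a few lines: a video model predicts future frames, and a small inverse dynamics model recovers the action connecting each pair of consecutive frames. Both functions below are hypothetical stand-ins (linear extrapolation and a frame difference), not Rhoda AI's learned models.

```python
import numpy as np

def video_model(past_frames, horizon=3):
    """Stand-in for the pre-trained video predictor: here it just
    extrapolates the last frame's motion linearly."""
    delta = past_frames[-1] - past_frames[-2]
    frames = []
    current = past_frames[-1]
    for _ in range(horizon):
        current = current + delta
        frames.append(current)
    return frames

def inverse_dynamics(frame_a, frame_b):
    """Stand-in inverse dynamics model: recover the action as the
    frame-to-frame change. In practice this is a small learned model
    trained on a few hours of teleoperation data."""
    return frame_b - frame_a

# Roll out: predict future frames, then translate them into actions
# the robot can execute.
past = [np.array([0.0, 0.0]), np.array([0.1, 0.0])]
predicted = video_model(past)
actions = [inverse_dynamics(a, b)
           for a, b in zip([past[-1]] + predicted[:-1], predicted)]
print(actions)
```

The structure mirrors the claim in the interview: the hard, data-hungry part (predicting plausible futures) lives entirely in `video_model` and can be trained on action-free web video, while `inverse_dynamics` only has to solve the much easier frame-pair-to-action mapping.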

Ben Lorica: For a specific task, can you tell whether or not it’s doable? If I give you 10 to 20 hours of video, how do you judge whether a task is within the realm of possibility?

Changan Chen: I can first talk about where the model is limited. The primary limitation is at the hardware level. If a task is not achievable by the hardware—meaning a teleoperator cannot do it—then obviously the model can’t do it either. But if a teleoperator can achieve the task, our model is currently very capable of handling it. In the post-training step, the foundation model is essentially just imitating the teleoperator.

Ben Lorica: What if the teleoperator, in the course of recording those 10 to 20 hours, encounters certain edge cases that aren’t fully covered?

Changan Chen: That 10 to 20 hours of data includes intervention or targeted data. We collect data to train the model first, and then we evaluate it. Naturally, there will be some corner cases. We collect data on those corner cases—intervention data—and feed it back into the training dataset. So, those 10 to 20 hours include both normal teleoperation data and intervention data from humans.

Ben Lorica: Is Reinforcement Learning (RL) part of the pipeline?

Changan Chen: Not yet. That’s something we are exploring. It depends on what kind of reinforcement learning we’re talking about—RL in simulation or RL in the real world. As of now, reinforcement learning in the real world is extremely difficult. This isn’t because of the learning process itself, but because when a robot explores the real world, it moves around and might collide with a table or damage itself. To me, that physical risk is the bigger issue right now for real-world reinforcement learning.

Ben Lorica: For listeners who aren’t familiar with this area, give us a sense of the scale of your foundation models. In the LLM world, we’re talking hundreds of billions or even trillions of parameters. How should people think about foundation models in the robotics space?

Changan Chen: Our model is not that big. We are in the range of hundreds of millions to tens of billions of parameters. The scaling laws for video models and language models are quite different, and this scale is already working very well for video. Of course, the use of video models for robotics is still in its early stages. For the field to truly advance, models need the ability to explore new solutions or strategies on their own, which brings us back to the topic of reinforcement learning.

By relying solely on teaching by demonstration, a robot cannot exceed human capabilities. However, once reinforcement learning is incorporated, the robot might discover new, more efficient strategies.

Ben Lorica: So we haven’t had our “AlphaGo moment” yet, where the model surprises humans with a novel move? Maybe at some point, your model will figure out a smarter way to do a task. Has that happened yet?

Changan Chen: Not yet. The whole field is still in an early stage. For that to happen, the model needs the ability to explore new solutions by itself.

Ben Lorica: What about interpretability, safety, and red teaming—all the things you need to do before deploying to production to make sure the robot doesn’t hurt anyone?

Changan Chen: I’ll address interpretability and safety separately. On interpretability, our model has a huge advantage because we visualize what the robot is going to do through video prediction. This is not something traditional models can do. Traditional models take a past video and predict actions directly, meaning you don’t know what will happen until you run the action on the robot. Because our model directly visualizes the future through video prediction, it offers excellent interpretability. Even without running the model on the robot, you can gauge the quality of the policy just by looking at the generated video.

Regarding safety, it depends on the application. Are you trying to make a “cobot” (collaborative robot) that people can walk around and interact with? If it’s deployed in an environment with humans, multiple safety measures must be implemented. At the hardware level, if the robot detects a collision with the environment or a person, it needs to stop immediately. At the software level, the policy itself needs to have safe interaction knowledge baked in. For example, if a human is nearby, the robot might intentionally slow down its execution to avoid issues rather than just pausing completely.

Ben Lorica: So what’s next? It sounds like, for now, the foundation model is working in a very targeted way—robotic arms mimicking specific tasks, rather than full humanoids. What is on your 6- to 12-month roadmap?

Changan Chen: There are two major milestones. First, we are trying to deploy the robot into real production environments later this year. That’s going to be a critical moment. Right now, many robotics labs are still demonstrating their robots in controlled lab environments without transitioning to real production. Going into production requires two things: the policy itself must be highly reliable—we’re talking 99.9% reliability—and you need solutions to integrate this policy into the broader system of an industrial environment.

We will be providing the robot as a service. We’ll deliver the hardware alongside the software to solve specific customer tasks. For the hardware side, we will be partnering with other companies.

Changan Chen: I’d like to add a comment here. We’re talking about narrow tasks, but from what we’re seeing in the industry, many of these tasks are highly repetitive. There are currently two extremes. On one hand, you have programmed machines that follow a specific trajectory—like pick-and-place in manufacturing. But that only works if the process has high volume and zero variation. For everything else, the tasks are performed entirely by humans. Even simple pick-and-place or decanting jobs rely on human labor. Performing these narrow, constrained tasks 24/7 is actually incredibly draining for workers.

Ben Lorica: I think the first wall you’ll hit will be tasks that require a lot of dexterity. I highly suggest reading Rodney Brooks’ skeptical post on whether foundation models can achieve true dexterity. But other than that, this has been a great conversation. Thank you, Changan.

Changan Chen: No problem. Thank you so much for having me.