World Models Are Here—But It’s Still the GPT-2 Phase

Jeff Hawke on Continuous Simulation, Interactive Control, Scaling Challenges, and Future Applications.


Subscribe: Apple • Spotify • Overcast • Pocket Casts • YouTube • AntennaPod • Podcast Addict • Amazon • RSS.

In this episode, host Ben Lorica speaks with Jeff Hawke, CTO at Odyssey, about world models — a category of AI that generates continuous, interactive simulations from images or text prompts. They explore how Odyssey 2 Pro differs from video generators and spatial intelligence models, its current capabilities and limitations (it currently predicts 1–2 minutes of coherent video), and near-term applications from gaming to robotics. Framing world models as being in their “GPT-2 era,” the conversation covers training approaches using public video, computational requirements, and how developers can experiment via the Odyssey API.

Subscribe to the Gradient Flow Newsletter



Transcript

Below is a polished and edited transcript.

Ben Lorica: All right, today we have Jeff Hawke, CTO at Odyssey. The website is odyssey.ml. Their taglines are “We’re building the World Simulator” and “From any starting image, Odyssey 2 Pro, a frontier world model, generates continuous interactive simulations.” With that, Jeff, welcome to the podcast.

Jeff Hawke: Thank you. It’s a privilege to be here.

Ben Lorica: We’ll talk about world models in general, but I’d like to understand more about this new frontier model, Odyssey 2. I’m sad to say I just got my API key a few minutes ago, so I haven’t tried it yet, but I will. For our developer-focused listeners, you get an API key. Can you walk us through the basics of what a developer should expect with Odyssey 2 Pro?

Jeff Hawke: Sure. World models are a new category of AI model that sits somewhere between an LLM, as you might know them today, and generative video models, which model discrete clips. A good way to think about this is an AI model that predicts potential futures and gives you a continuous stream of intelligent pixels that you can interact with. The important things to note are, firstly, that it’s a continuous stream—meaning it’s not just a bounded clip—and secondly, that you can interact with the stream as it evolves. That property is uniquely enabled by training this as a world model. When you get your API key and test it, you’ll be able to simulate potential futures and manipulate them as they evolve. You might seed a world with an image or text, and as the world updates, you’ll see video streamed back to you. You’ll be able to adjust how the world evolves and interact with it, like manipulating the scene.
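To make the seed-then-steer pattern concrete, here is a minimal Python sketch of the loop Jeff describes. Every class and method name here is purely illustrative, not the actual Odyssey API: the point is only that the stream keeps running while you interact with it.

```python
# Hypothetical client sketch of the seed -> stream -> interact loop.
# None of these names come from the real Odyssey API; they are stand-ins.

class WorldSession:
    def __init__(self, seed: str):
        self.seed = seed            # starting image or text prompt
        self.events: list[str] = [] # interactions applied so far

    def frames(self, n: int):
        """Yield a continuous stream of frames (strings stand in for video)."""
        for i in range(n):
            yield f"{self.seed}|{'|'.join(self.events)}|frame{i}"

    def interact(self, action: str) -> None:
        """Steer the simulation while the stream is still running."""
        self.events.append(action)

session = WorldSession(seed="photo-of-a-harbor")
stream = session.frames(3)
print(next(stream))            # the stream starts from the seed alone
session.interact("zoom-in")    # adjust how the world evolves mid-stream
print(next(stream))            # later frames reflect the interaction
```

Because `frames` is a generator holding a reference to `events`, interactions injected mid-stream change subsequent frames without restarting the stream, which is the behavior that distinguishes this from generating a bounded clip.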

Ben Lorica: Your site uses the phrase “frontier model.” People are familiar with foundation models and frontier models that are mainly LLM-based, built for text. In the LLM case, our listeners are broadly aware that these are trained on internet data; there’s lots of data to train with. In the case of a model like yours, what is the training data?

Jeff Hawke: Large-scale public video is the majority of the data. If you think about the best source for learning a world model, where do you find that? The volume of video produced by the world is enormous, which is great, and that gives us a different set of constraints from language. To be more specific, using some napkin math, the volume of video produced every single day on the internet is greater than any one company could possibly train on when building a world model. This is a very unique situation compared to language, where the volume of training data is proportionally much smaller. You can think of this as learning how the world evolves from trillions of visual observations—video clips and the transitions that happen within them.

Ben Lorica: Text has a temporal nature to it. If I ask for the most recent analyst reports, or I want to learn more about Nvidia, I can tell the language model to give me its most recent data. In the case of video, I suppose there’s also a temporal nature to some extent. There are current events as well as historical events. You could ask, “Here’s an image, place it in the middle of the US in 1920.” Do you folks also keep track of that temporal nature?

Jeff Hawke: Yeah, I would say…

Ben Lorica: The timestamp of when the video was created, in other words.

Jeff Hawke: We do, but the primary focus in world model development is the sequence of how the world evolves from visual observation. We keep track of the timestamps of videos, but most R&D on this topic concentrates on the much more granular timescale of those frame-to-frame transitions.

Ben Lorica: It also depends on the use case. The historical video example I alluded to would be relevant if you’re making a documentary. You’re very early on, but as far as you can tell, what are some of the use cases that people are trying?

Jeff Hawke: The list is long. One particularly compelling thing is that if you build these models as open-ended, general-purpose models—much like an LLM—you get a very long list of potential applications. From my perspective, what’s exciting is that this will be a new category on the scale of LLMs, with a vast array of possibilities. As a category, these models originated from more narrow domains, like autonomous driving or world models built for platformer games. What we’ve seen in the past six to nine months is a broadening into general-purpose models. It is early; my mental model is that we are in the GPT-2 era of world models. This is a phase of mass exploration, not mass commercialization.

To give you some examples: gaming is a huge industry, and this technology provides a large variety of ways to create experiences that are currently very difficult to build. Think about choose-your-own-adventure games or interaction modes that are hard to simulate. Even behavioral interactions with other characters and people—you can do that with current tools, but it’s immensely difficult and expensive, so in most cases, the game experience is kept simpler.

Other examples include retail, where you might interact with a digital billboard in a shop, or live events with displays that react to crowd behavior. All of these use cases can be enabled by a world model, in addition to the industries where they originated, such as robotics.

Ben Lorica: That last example of interaction basically requires real-time inference that looks realistic with no lag. I would imagine one big area would be content generation. People need stock video just like they need stock photos, and Hollywood could potentially use it for special effects. But as you said, we’re in the early stages. At a high level, architecturally… is this similar to what people regard as the backbone of LLMs, which are transformers?

Jeff Hawke: Yes, we use transformers. Transformers underpin almost every frontier AI model that exists today. They’re a very generic, expressive form of model architecture; it’s more about how you frame the problem in your training regime. But yes, we use transformers.

Ben Lorica: At a high level, a lot of people think of LLMs as a next-token prediction machine. Would this be like next-frame prediction?

Jeff Hawke: I would think about it as predicting the next future. It’s a little bit more general than next-frame. Thinking about “what could happen next” is a good way to frame it. You know something about the state of the world, you have some local context, maybe some global context, and the model essentially says, “Given some change that is going to happen in this environment, what’s going to happen next?” You’re querying it for potential futures. You can query it at different times and test counterfactuals depending on what action conditioning you provide. It gives you quite a lot you can do as a foundational piece of technology.
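The “predict the next future, conditioned on an action” framing can be sketched in a few lines of Python. This is an illustrative toy, with made-up names and string-valued states rather than learned latents; it only shows the interface shape: state plus action in, candidate future out, with counterfactuals queried from the same state.

```python
# Toy sketch of action-conditioned next-future prediction.
# WorldState and predict_next are illustrative, not a real model interface.

from dataclasses import dataclass, field
from typing import List

@dataclass
class WorldState:
    """Stand-in for the learned latent state of the simulated world."""
    frames: List[str] = field(default_factory=list)

def predict_next(state: WorldState, action: str) -> WorldState:
    """Given the current state and an action, return one possible next state.
    A real world model would sample video tokens; we just record the step."""
    return WorldState(frames=state.frames + [f"after:{action}"])

# Query the same starting state with different actions to test counterfactuals.
start = WorldState(frames=["seed-image"])
future_a = predict_next(start, "pan-left")
future_b = predict_next(start, "pan-right")

assert future_a.frames != future_b.frames      # different actions, different futures
assert future_a.frames[0] == future_b.frames[0]  # shared starting context
```

The key design point is that `predict_next` does not mutate `start`, so the same state can be branched repeatedly, which is what makes querying multiple potential futures from one moment possible.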

Ben Lorica: For the younger listeners who weren’t around for GPT-2, can you remind them of some of its limitations? My recollection was that it was very prompt-sensitive. If you just included an extra space, the whole output could change dramatically. What were some of the limitations of GPT-2 as you recall?

Jeff Hawke: To index it in time, Jasper is a good example of a company that was working on marketing copy in that era.

Ben Lorica: Oh yeah, it was a very hot company at one point.

Jeff Hawke: Hugely hot company. It was a great example of people exploring what could be done with the models and realizing that this technology was not too far away. Regarding limitations of that era, you’re completely right to point out the sensitivity to prompts. The models were definitely more prone to hallucination. We hadn’t yet worked out how to bring recent data into the models so they could interact with information beyond their training window.

Similarly, there was another problem—which is actually quite similar to world models—where you would go into a “doom loop.” The models would cycle, you wouldn’t get a stop token, and it would infinitely spit out garbage. There was a lot of work to be done, but all of those problems were solved within a year or two of that era, certainly by the time we reached post-GPT-4 models.

Ben Lorica: And the context window was super small relative to what we have now, right?

Jeff Hawke: Very short. Correct.

Ben Lorica: In your case, what are some of the limitations? I suspect one limitation is that you can only predict the near future.

Jeff Hawke: Yes, exactly.

Ben Lorica: How do you measure the near future? If I give you an image and a really detailed spec and prompt, what can you do with that image?

Jeff Hawke: There are two main difficult development areas that have really only become possible to address in the last nine to twelve months. One is the length of video, or the length of continuous state you can generate without the models becoming unstable. This is actually quite similar to the failure mode we saw with LLMs spinning out garbage.

Ben Lorica: And unstable in this context means the behavior is just completely unpredictable?

Jeff Hawke: Correct. What is happening under the hood is you’re accumulating error in your tokens. As a result, you go out of distribution, and the model generates garbage. The way this manifests is you simulate forward, and it eventually degrades into noise or mush.

Ben Lorica: Literally visually.

Jeff Hawke: Visually, correct. You would get weird colors—imagine psychedelic rainbows.

Ben Lorica: But Jeff, can you then say, “Okay, the first three seconds were great, I’ll keep that and feed it back to the model.” Will that help?

Jeff Hawke: The big unlock has been a huge amount of research progress, both by us and by the academic research community. The ability to keep a stable video stream running for much longer periods of time is substantially better now. I won’t claim it’s completely solved; it is definitely still a research problem.

Ben Lorica: And define “much longer.”

Jeff Hawke: State-of-the-art used to be 15 to 30 seconds at most in terms of contiguous prediction. Today, I would say it’s one to two minutes. There are ways of getting further stability, but they place other constraints on your simulation.

Ben Lorica: And this generation is, I would imagine, compute-intensive on GPUs, right? You can’t offload that kind of inference to CPUs.

Jeff Hawke: No, we work our Hopper GPUs quite hard.

Ben Lorica: So to get those few minutes, you have to have a lot of hardware already.

Jeff Hawke: Yes, it is quite a different computational profile from LLMs. When you’re generating a next frame as your potential future, you’re generating a whole batch of tokens all at once. That’s a different access paradigm compared to LLMs, where you typically have a large fetch to memory and a small amount of compute. With world models, you have a small fetch to memory and a lot of compute. As a result, we tend to be more FLOPs-limited or compute-limited on the GPUs, and we run them very close to their maximum computational output.

Ben Lorica: You use the phrase “world models,” which is a bit of a confusing marketing term. It’s like “big data” back in the day. There are world models in your sense, which is a world simulator. There’s the world model in the Sora sense, which is a video generator. Then there’s the Fei-Fei Li world model, which I think is a 3D mesh concept. The robotics people may also have their own usage of the phrase. But anyway… let’s just take those two. How do the Sora and Fei-Fei Li world models differ from the world model you’re building?

Jeff Hawke: There’s really only one canonical definition of a world model. It was coined in a 2018 paper by Ha and Schmidhuber.

Ben Lorica: That’s what you think, but now marketing has taken over.

Jeff Hawke: That’s true. I liked your blog post on this. I think more work needs to be done to define this category, and you’re completely right that the term is being used quite loosely.

Ben Lorica: I think even Yann LeCun uses his own notion of it, right?

Jeff Hawke: Yeah. My mental framework is that there are four major categories of models. First, there’s the canonical definition of a world model, where the model is essentially learning how the world evolves. This category stemmed from the robotics and reinforcement learning community, starting with that 2018 paper, and was picked up by others. You can think of this category as a world or neural simulator. Specifically, it means you have some learned state, you provide the model with an action, and the model predicts a potential future. It’s a pretty simple, generic framework, though quite difficult to get working well. Yann LeCun’s JEPA paper, which you referred to, was a specific proposal in 2022 on how to build a world model. He suggested adding a few things to that framework. I think the jury is still out on whether those additions are right or wrong; my view is we’ll follow the research results we’re seeing.

The second category I would define is spatial intelligence. This is more the World Labs category. I think of these as models that learn how the world appears. That is uniquely different from how the world evolves. On one hand, you have dynamics and the prediction of future states; on the other hand, you have structure and visual appearance.

Ben Lorica: Let’s stop here for a second. For our audience, what would be the difference in use cases between those two?

Jeff Hawke: World models, as a category, lead to this: imagine going to your favorite LLM, and instead of a chat box, you’re presented with an intelligent stream of pixels. You can talk to it, it talks back to you, it morphs into what you want, and you can interact with it live. The canonical definition of world models leads to that future.

Spatial intelligence is a wholly different category of technology. Its main applications are in 3D toolchains. Anywhere you need a specific 3D environment, it’s applicable there.

Ben Lorica: Like Autodesk…

Jeff Hawke: Yes, Autodesk, current game engine environments like Unreal and Unity, etc. Any use case where you need explicit 3D structure.

Ben Lorica: So it’s very specific.

Jeff Hawke: In my opinion, yes, a lot more specific. What it particularly enables is integration with current 3D toolchains.

Ben Lorica: Okay. Great. What about Sora?

Jeff Hawke: I think of Sora as a category of proxy world models. You might sometimes hear people say that an LLM has a world model inside it. There is some evidence of this; for example, there’s a paper showing that in the internal representation of a model, there’s some notion of a grid for solving a maze-like problem. However, these models aren’t trained with detailed, high-volume visual observations of what the world actually looks like. As a result, they will always be a proxy or an abstraction of a world model. It doesn’t mean they’re not useful, but it is a quite different thing. I would personally consider generative video, like Sora, to be a wholly separate category from world models. Similarly, when the term is applied to LLMs, I would consider that a proxy world model rather than a true world model.

Ben Lorica: You alluded earlier, Jeff, that the training data is all this online video, since people generate tons of video daily. But what about other data streams? For example, in the self-driving car world, they have different sensors like LiDAR. Are those useful inputs too? If you had access to a lot of that kind of data, would it help improve the model?

Jeff Hawke: As a machine learning person, I will never say no to data, as a rule of thumb.

Ben Lorica: Are there people working on world models in the self-driving car world who have access to both video, LiDAR, and other sensor data?

Jeff Hawke: I spent the majority of my career developing algorithms in autonomous driving and was lucky enough to help build a company called Wayve, which pioneered the use of world models. In my opinion, they have done the most work in this field, specifically using world models for autonomous driving. They’ve publicly shared work on a family of models called GAIA that effectively do this. So sensor data is certainly useful, 100%. However, if you’re trying to build a general-purpose world model, which is what we’re doing at Odyssey, adding extra dependencies on other sensors essentially limits the amount of data you can use.

If you’re looking for the broadest field of data you can get, video is the most abundant. LiDAR is probably not the best choice of input in that context. That said, as machine learning people, we like data, so thematically it is always interesting and useful. But at the same time, it’s multiple orders of magnitude smaller than the scale of observation you get through cameras.

Ben Lorica: The robotics world is also somewhat data-starved. As best I can tell, they use a bunch of tricks: virtual training in digital twins, having operators teach the robot, and using synthetic data. Are these kinds of things interesting to people building world models?

Jeff Hawke: You’re completely right that robotics is a data-starved field. One unique thing about autonomous driving as a robotics industry is that it’s comparatively easy to scale up data. We have production vehicles, and people know how to drive them. Scaling up a large data source isn’t easy or cheap, but it’s a relatively tractable route for autonomous driving.

In other forms of robotics, it’s quite different. You’re dealing with many different environments. Your kitchen probably looks different from my kitchen. If you need to make a robot work well across numerous consumer or industrial environments, you face an existential problem of how to gather enough data. From my perspective as a roboticist, world models are by far the best solution to this. I think they will become the intelligence infrastructure for robotics.

By that, I mean using world models to bring in all this extra observational information about how the world works and evolves. This allows you to take a robot with a relatively small amount of training data for its specific platform and environment, and much more quickly reach a deployable level of performance. In the last three months, there’s been an increasing body of evidence showing that using world models for this purpose—instead of vision-language-action models (VLAs)—significantly improves sample efficiency. That is the volume of data you need to get your robot to a certain level of performance in an environment. That is super exciting to me.

Ben Lorica: Would you classify the foundation model for robotics being built by people like Chelsea Finn, Sergey Levine, and Physical Intelligence as a world model?

Jeff Hawke: No, I would consider that a behavior model. In machine learning, this is often called policy learning. Essentially, you’re learning what action to take given a certain state. You pipe that into your robot’s actuators, the robot performs an action, you observe the world, and you repeat. This is the same category of model that Wayve’s driver falls into. They are very useful models and hard to develop. I think world models are an accelerant for them, but they are a different category. I would consider them behavior models rather than world models.

Ben Lorica: Based on what you’ve said so far, world models are in the GPT-2 phase. We’re in an experimental phase. You have an API so developers can start playing around with the current generation of world models, but ultimately, they aren’t quite useful yet. Is that a fair assessment?

Jeff Hawke: There are definitely things you can do with them today. We hosted a hackathon at our offices in Silicon Valley, and external developers built things with our API. For example, one person built an interactive weather app that provided a richer application experience. It’s a simple application compared to robotics, but there are absolutely use cases that are possible today.

Similarly, you can build scaffolding to turn this into new versions of computer games. If we think about the birth of computer gaming as an industry, early games were quite simple. We often think games need to be 4K pixels and highly detailed, but often it boils down to simple, fun concepts—using a computer to provoke emotion, generate interest, and create interaction between people. I don’t think we’re a million miles away from these models being highly useful. There is certainly a lot you can do today, and the pace of development is fast. There’s a lot more in the pipeline, and I expect major changes over the coming 12 months.

Ben Lorica: It seems like generating assets for something like that weather simulation is computationally heavy to do in real time, so maybe it won’t scale as an application available to millions of people. But if you use it to produce content for a news report—saying, “A hurricane is coming, here’s the forecast, and this is what might happen”—having generated video for a public service announcement might be very effective. It was computationally heavy to generate, but you only have to do it once.

Jeff Hawke: Exactly. It’s also quite nice having the tailwinds from LLMs. In many respects, that industry has pushed the scale of GPU infrastructure up around the world. It means people are already solving how to make efficient inference engines. These are difficult problems, and if we were starting four years ago, we would probably have to go through the same journey as the LLM developers. The great thing is that all the work in that industry helps us. The time it takes to go through that scaling journey and maturation will be faster than it was for LLMs. There is still research complexity ahead, but we’re in a more fortunate position than the LLM folks were four years ago.

Ben Lorica: You’re still in the phase where you have to get the foundation model done, and then follow up with making inference efficient at scale. By the way, one interesting thing to see is on the audio side… audio inference is actually already efficient. You can do text-to-speech and transcription on-device. Do you think video will ever get to the point where the kinds of things you want to do can run on-device?

Jeff Hawke: In time, yes. There’s a continuum. There will always be a place for the bleeding-edge frontier. Everyone wants to use, say, Claude 4.6 Opus, and that’s probably not going to run on-device anytime soon. The same will be true for world models: you’ll have a frontier of large models that realistically need to be served from a data center. But equally, there will be a continuum with smaller models too.

You mentioned audio. Audio models tend to be quite a lot smaller than world models, video models, or LLMs. The size of the model needed to reach high performance is smaller because it’s a more structured problem, making it a lot easier to deploy. That said, the industry is chipping away at different parts of the same problem. The fact that the audio folks have worked out how to run models on-device is good for us. It means device vendors will be more mature by the time we need to talk to them. It will ultimately be a portfolio outcome.

Ben Lorica: Feel free to decline answering this, but in the frontier model space, people measure progress by scale—scale in data, compute, and model size. What kind of model sizes are we talking about for the leading-edge world models in your category? Answering this might imply the size of your own model, so feel free to say no.

Jeff Hawke: I don’t mind answering. If you look at public resources, there are clear markers in published research. World models are a lot smaller than language models right now. I think it’s partly down to algorithmic maturity; we’re just earlier in the process. As a field, we haven’t gone through the same R&D journey that LLMs have. If you look at models in published academic literature, they tend to be in the single-digit billions. Cosmos models, for example, fit in that category, scaling up to double-digit billions. To my knowledge, no one has yet published or publicly shown anything with three-digit billions—meaning 100 billion parameters or more.

Ben Lorica: Remind our listeners, what was the size of GPT-2, and where are we on model size today? I actually don’t remember how big GPT-2 was. It was probably single-digit billions, right?

Jeff Hawke: The largest GPT-2 was about 1.5 billion parameters, with smaller versions in the hundreds of millions—so low single-digit billions at most.

Ben Lorica: Yeah.

Jeff Hawke: So the GPT-2 analogy is actually quite good. In many respects, we have a lot of room to run. As I mentioned earlier, we have a unique situation: we don’t have a problem with data production; we have a problem with data selection. As a result, there is a lot of room to run in scaling.

That said, two things run in parallel. Scaling gives you progress along an algorithmic trajectory and pushes performance up. But progress is also governed by how you build the models. World models have a huge amount of low-hanging fruit in R&D regarding how you structure and train them. It won’t just be a scaling curve of how quickly we can increase FLOPs, but rather how quickly we can ramp up performance through algorithmic innovation.

Ben Lorica: One of the challenges right now, even for the LLM people, is memory, right?

Jeff Hawke: Do you mean GPU memory, or do you mean…

Ben Lorica: Just access to memory in general, because model sizes are getting bigger and bigger. You folks will have that problem too, right?

Jeff Hawke: Yes. One of the things that makes this difficult is encoding information from a series of images into tokens. A token is just a vector of information, but encoding a single image requires a few thousand tokens. Rather than predicting token by token, we’re predicting a few thousand tokens all at once, repetitively. Simple arithmetic tells you that you don’t need many frames before you hit a very long context window to manage that state.
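The “simple arithmetic” is easy to check. Using illustrative numbers only (the source says “a few thousand tokens” per image; the 2,000-token and 30 fps figures below are assumptions, not Odyssey’s actual values):

```python
# Back-of-the-envelope context-window arithmetic for a video world model.
# tokens_per_frame and frames_per_second are illustrative assumptions.

tokens_per_frame = 2_000   # "a few thousand tokens" to encode one image
frames_per_second = 30

def context_tokens(seconds: float) -> int:
    """Tokens of state accumulated after simulating for `seconds`."""
    return int(seconds * frames_per_second * tokens_per_frame)

print(context_tokens(10))   # 10 s of video -> 600,000 tokens
print(context_tokens(60))   # 1 minute -> 3,600,000 tokens
```

Even ten seconds of video under these assumptions exceeds most production LLM context windows, which is why streaming for minutes while managing that state is a hard problem.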

If you want to run that in real-time, it adds further complications. Real-time performance is something we consider very important, partly because the two biggest markets for world models demand it. In gaming, imagine having horrendous latency. Robotics is an industry that also deeply cares about runtime performance. Even thinking more broadly about general-purpose AI, imagine going to ChatGPT and waiting two and a half minutes for a response. It would be pointless, and we wouldn’t see nearly the same degree of adoption. Interactivity is tightly coupled with real-time performance and the ability to serve outputs in an appropriate timeframe, which is why we think it’s super important.

Ben Lorica: Isn’t it plausible to imagine that in the future, a world model is basically just part of a broader foundation model? For example, I go to Gemini or ChatGPT, and based on what I’m asking, the system routes my request to a world model. I wouldn’t even know it; I’m just using Gemini, but inside Gemini is a world model.

Jeff Hawke: Yes. To a large degree, you can think of this as the straightest vector toward what is often called multimodal AI. That means handling different types of data—vision, audio, and so on.

Ben Lorica: They’re already using routers right now. You can imagine a system seeing that I uploaded an image and requested something best served by a world model, so it routes the request accordingly.

Jeff Hawke: I absolutely expect that. There’s some work to be done to get to that point, but in terms of the direction of travel, I think that’s an accurate way of thinking about it. I don’t think too many people thought about world models in that framing until quite recently. Often, it’s been thought of as a singular category—”world models for X.” But if you think about it as a general-purpose world model, as we have been doing, it leads to the outcome you’re describing.

Ben Lorica: What are some of the key research problems that need to be overcome? Going back to the GPT-2 analogy, what did GPT-2 lack that we have now? I guess RLHF, among a lot of other things.

Jeff Hawke: A lot of things. The analogy holds for how you should think about world models. We are early in the journey. This is not GPT-5.2 in terms of maturity, and we’ve got a lot of room to run.

There are a lot of problems people talk about in terms of world models that we need to understand better. From my perspective, there are two core ones. Going back to the analogy I used at the beginning—describing this as a model that gives you an intelligent stream of pixels you can interact with—two things emerge.

Number one, you have to be able to generate this repetitive, coherent stream of pixels in a way that is plausible. Maybe it becomes temporally incoherent over a long period, but certainly, in a short period, you shouldn’t have weird visual discontinuities. In my opinion, that’s a foundational piece of work that just has to be solved. We’ve made a huge amount of progress on that, but there is more work to be done, and it’s the foundation on which all of this is built.

The second side is interactivity. Now that you have your intelligent stream of pixels, how do you manipulate it? At one level, you could think about this as prompt control, similar to a generative video model. The temporal nature of how these world models are structured means you have to think about it quite carefully. Those are the two main research problems that I think are most important.

A third problem you might hear about is long-term memory. My view is that it is important, but it’s not the right problem for today. If you’re still changing the foundational piece of how you deal with the streaming and prediction of pixels—and by pixels, I also mean audio and other formats—that’s going to affect how your memory is structured. While it’s great that some folks are researching long-term memory today, it’s a better problem to solve once those core foundations are locked in, and we still have more work to do there.

Ben Lorica: Are world models, in essence, just brute-force relearning the laws of physics and nature from video? What if there’s a prompt injection or a poisoning attack where you feed it a bunch of videos that clearly violate the laws of physics, and it starts producing content based on those bad videos?

Jeff Hawke: You’re right. However, there are cases where you do want to violate the laws of physics. I imagine most car chases you see in a movie don’t actually have realistic tire coefficients; the cars skid more than they really should. That’s a deliberate decision by the director for cinematic effect.

Ben Lorica: Yeah.

Jeff Hawke: This is one of the challenges of building an open-ended world model. The consistent pattern across every category of AI I’ve ever worked with is that the general approach—one that is simple and scales with data—eventually wins. It’s often harder to structure it that way for the sorts of examples you describe.

I personally believe that is the right move to make. You have to ask, is this realistic physics, or is this film physics? In film physics, you want a car to skid in a way that might not be physically coherent because it looks great. That’s highly desirable if you’re using it for games, marketing, or film. Conversely, if you’re trying to use it for an engineering simulation, violating physics makes it a poor simulator. It depends on the use case, and you have to think carefully about the data domain.

Ben Lorica: We’re several years into LLMs now, so people have learned a bunch of workarounds. There’s retrieval-augmented generation (RAG) and a bunch of things that fall under post-training—fine-tuning, reinforcement learning, distillation. Do any of these ideas map over to world models? For example, suppose I have a repository of video that can be used to enhance the world model. Will there be a RAG equivalent down the road for world models?

Jeff Hawke: 100%. It will look different from how you think about it in terms of text, but the concept is very similar. Essentially, it’s using an external database as an additional source of memory.

Ben Lorica: Right, I might be a game studio or a film studio with all these proprietary assets.

Jeff Hawke: Exactly. You can also think of it as a potential solution for long-term memory. If I’m playing a choose-your-own-adventure game using a world model and I suddenly get back to the start, maybe I don’t want to keep all that information in the active context. Instead, it gets pushed out to a database and brought back into context at an appropriate time. That philosophical framework can work in a number of cases. To my knowledge, I’m not aware of anyone who has tried that yet, but I would bet on it coming in the next year.
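In concrete terms, the external-memory idea Jeff describes might look something like the minimal sketch below: scene states that fall out of the active context are pushed to an external store, keyed by an embedding, and retrieved by similarity when the player returns to a familiar location. Every name and embedding here is a hypothetical illustration, not Odyssey’s implementation.

```python
import numpy as np

# Hypothetical sketch of RAG-style long-term memory for a world model:
# scene states evicted from the active context go into an external store,
# keyed by an embedding, and are recalled by similarity later.
class SceneMemory:
    def __init__(self):
        self.keys = []    # one embedding per stored scene state
        self.values = []  # the scene-state payloads themselves

    def store(self, embedding, state):
        self.keys.append(np.asarray(embedding, dtype=float))
        self.values.append(state)

    def recall(self, embedding, k=1):
        """Return the k stored states most similar to the query embedding."""
        q = np.asarray(embedding, dtype=float)
        sims = [
            float(q @ key / (np.linalg.norm(q) * np.linalg.norm(key)))
            for key in self.keys
        ]
        order = np.argsort(sims)[::-1][:k]  # highest cosine similarity first
        return [self.values[i] for i in order]

memory = SceneMemory()
memory.store([1.0, 0.0], "village square, morning")
memory.store([0.0, 1.0], "forest clearing, dusk")
# Returning to the start of the adventure: recall the matching scene state.
memory.recall([0.9, 0.1])  # -> ["village square, morning"]
```

The mechanics mirror text-based RAG: the database holds whatever the model cannot afford to keep in context, and retrieval brings it back at the appropriate time.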

Ben Lorica: What about the bucket of post-training? Do any of those ideas map over?

Jeff Hawke: Strongly. We think of our training pipeline as pre-training, mid-training, and post-training. I don’t think these categories are perfectly agreed upon between companies, or potentially even within teams at the same company.

Ben Lorica: Yeah.

Jeff Hawke: We view it as solving different problems in terms of adding capabilities or features to the model, and the algorithmic training methods differ at each stage. That mental view of how you build a training pipeline carries over quite nicely from LLMs. Similarly, using reinforcement learning (RL) to improve outcomes works very well.

To give you a couple of examples, you can do RLHF or GRPO-style preference learning to improve the quality of the outcome. You can also make this a verifiable RL process if you use a 3D-consistent environment. You can prompt the model to do something and generate a reward based on computationally verifiable metrics, so you don’t have to rely on a human for a response. A lot of those methods from LLMs transfer beautifully to world models.
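The verifiable-reward idea can be sketched as follows. This is an illustrative example under assumed details, not Odyssey’s actual reward: instead of asking a human which rollout looks better, score a rollout by a measurable geometric error against a 3D-consistent reference (say, the reprojection error of tracked points), and convert that error into a bounded reward.

```python
import numpy as np

# Hypothetical verifiable reward: compare a rollout's predicted 2D point
# projections against reference projections from a 3D-consistent environment.
def verifiable_reward(predicted_points, reference_points):
    # Mean Euclidean error between predicted and reference projections.
    error = np.linalg.norm(predicted_points - reference_points, axis=1).mean()
    # Bounded reward in (0, 1]: 1.0 when the rollout matches the reference.
    return float(np.exp(-error))

pred = np.array([[0.50, 0.25], [0.10, 0.80]])
ref = np.array([[0.50, 0.25], [0.10, 0.80]])
verifiable_reward(pred, ref)  # -> 1.0 for a perfect match
```

Because the reward is computed rather than judged, it can drive RL updates at scale without a human in the loop.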

Ben Lorica: In the LLM world, you have proprietary models and open-weights models, which are mainly coming from China right now. I don’t know the precise metric to use, but let’s say the open-weights models are six months behind. And according to Anthropic, they’re just doing model distillation on Claude anyway.

Jeff Hawke: Yes, I read that in the news. Anthropic does not seem happy.

Ben Lorica: Are there open-weights world models, and how far behind are they compared to the commercial ones?

Jeff Hawke: I actually think some of the open-weights models are not too far behind the commercial ones.

Ben Lorica: Who is building the open-weights models? Also Chinese labs?

Jeff Hawke: Currently, mostly Chinese labs. There’s one called Lingbot World; I might be misremembering, but I’m 99% sure they released the model parameters for it.

Ben Lorica: And these are world models according to your definition, right?

Jeff Hawke: They are structurally similar. A lot of world model developers have looked at capabilities differently. For example, we’ve prioritized photorealism and real-time performance as key, non-negotiable requirements. Another design choice we made was not being opinionated about embodiment.

If I look at world models from other developers—Genie is a good example—they are quite opinionated about the scene and the embodiment. You specify an entity within the scene that you directly control through keyboard presses. Because we view this as an open-ended problem, we haven’t prioritized that by design. Even among commercial models like Odyssey and Genie, there isn’t a common capability set shared across all of them. The open-weights models also tend to fit in that category, making them a bit difficult to directly compare.

I do think we’ll see a convergence of some of these features in the coming months. We’ll also start to see the emergence of public benchmarks, which will help align capabilities and allow for better comparisons between companies. At the moment, it’s honestly difficult to do like-for-like comparisons. We can compare our models to video models, but it’s not the same thing, and it’s not a fair fight.

Ben Lorica: Yeah, and there’s no LLM Arena for…

Jeff Hawke: Not yet.

Ben Lorica: Right, because that relies on a voting-style paradigm. What about the metaverse? Is that a place where world models can play a role in generating assets and content?

Jeff Hawke: Conceptually, yes. It’s a bit different, though. Current VR headsets always run a 3D engine internally. With world models, you are typically getting a 2D stream out—a projection. That means there are differences in how they could be used for that purpose, but thematically, I expect a lot of metaverse developers to pick this technology up in the coming months.
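The distinction Jeff draws is that a 3D engine exposes the scene itself, while a world model only ever emits its 2D projection. As a toy illustration (a standard pinhole-camera projection, not anything specific to Odyssey), the consumer of the stream sees only the image-plane coordinates, never the underlying 3D state:

```python
import numpy as np

# Toy pinhole projection: a 3D scene point maps to 2D image coordinates.
# A world model's output stream consists of such 2D projections; the 3D
# state, if any, stays internal to the model.
def project(point_3d, focal_length=1.0):
    x, y, z = point_3d
    return np.array([focal_length * x / z, focal_length * y / z])

p = project(np.array([2.0, 1.0, 4.0]))  # -> [0.5, 0.25]
```

That is why driving a VR headset, which expects a full 3D scene to re-render per eye, is a different problem from consuming a 2D stream.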

Ben Lorica: By the way, I just realized there’s going to be an explosion of video content from smart glasses.

Jeff Hawke: Yes, absolutely.

Ben Lorica: Because Apple is supposed to be coming out with a pair.

Jeff Hawke: Oh, I actually didn’t know that.

Ben Lorica: Yeah, apparently Tim Cook is wearing them all the time now, but we’ll see how it plays out. In closing, Jeff, for our listeners who go to odyssey.ml to sign up for an API key, what are the limitations? What’s the free tier, what’s the paid tier, and what are the API limits?

Jeff Hawke: Currently, anyone can go to our website. There is a sandbox environment where you can try the model out directly through a web interface without going via the API. If you scroll to the bottom of the webpage, you’ll find a button. You can also sign up for an API key directly; all you need is an email. We don’t currently offer pricing tiers. Essentially, we’re offering it for people to experiment and explore.

We’re really looking to see what people can do with this. Sometimes people think about AI purely as a productivity enabler for reducing costs. I think it would be disappointing if that’s the only thing we do with it. I view this more as: what can humanity do with this technology that it cannot do today? As a technologist, that is the most exciting thing, and it’s certainly how we think about it at Odyssey. I firmly believe world models are a step toward enabling wholly new things. We’ve thought of some, but I guarantee there are people out there with ideas we haven’t even discussed. We want to understand where this fits and discover the use cases people haven’t thought of yet—just like we saw with LLMs, where some use cases were obvious and others weren’t. It’s a very interesting time to be alive and working in this industry.

Ben Lorica: For listeners who are more interested in the technical backbone, are you using the same open-source projects that are popular in the LLM space—like PyTorch, Ray, and Kubernetes? What are some of the basic components of your infrastructure?

Jeff Hawke: We train on a lot of GPUs, primarily Nvidia. We use PyTorch, which is the dominant framework, particularly for early R&D. The only other framework I’ve seen get extensive use is JAX.

Ben Lorica: JAX, but mostly just inside DeepMind.

Jeff Hawke: A little bit outside, but mostly, yes, you’re right.

Ben Lorica: And the Chinese have their own deep learning frameworks. What else?

Jeff Hawke: That covers the training software side of things. For data processing, we run clusters using tools like Ray or Flyte. These orchestrate what are actually quite computationally intensive data processing and curation workloads. Then we use a bunch of proprietary tooling for managing the resulting datasets and linking them to our experimental processes.

Ben Lorica: Do the Ray people know about you? I’m an advisor to Anyscale; they should know.

Jeff Hawke: Oh, yeah. We’re in touch with them.

Ben Lorica: Good. And Kubernetes?

Jeff Hawke: Yes. We actually run everything, including training, on Kubernetes. To my knowledge, we might be a bit of an outlier there. I know some labs still use Slurm for training orchestration, but I have PTSD from using Slurm during my PhD, so…

Ben Lorica: That seems like a sign that it’s a research-heavy project! Are you hiring?

Jeff Hawke: We are hiring. We believe that small, highly focused teams are the most effective. So we are hiring, but…

Ben Lorica: Remote or on-site?

Jeff Hawke: We hire out of two hubs. Our headquarters is in Silicon Valley, and we also have half the team in London.

Ben Lorica: Okay. And with that, thank you, Jeff.

Jeff Hawke: You’re welcome. Thank you.