Nick Vasiloglou on Data Markets, Small Language Models, and the Rise of AI for Science.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • YouTube • AntennaPod • Podcast Addict • Amazon • RSS.
Ben Lorica talks with Nick Vasiloglou, VP of Research at Relational AI, about Nick’s deep dive into NeurIPS 2025 and what people in industry should actually pay attention to. They discuss why the gap between research and production has collapsed, then dig into under-the-radar themes including data attribution and data markets, model composition, the rise of small language models, AI for science, and the growing role of synthetic tasks and post-training in building useful AI systems.
Interview highlights – key sections from the video version:
- Introduction: Nick Vasiloglou’s NeurIPS deep dive and why the conference matters
- What NeurIPS is: origins, evolution, and its role as an AI bellwether
- Why industry should care now: the lab-to-production cycle has compressed dramatically
- How to follow major AI conferences without getting overwhelmed
- How Nick studied NeurIPS: scale, workflow, and use of coding assistants
- Turning conference content into transcripts, summaries, reports, and presentations
- Below-the-radar topic #1: data markets, attribution, and compensation for training data
- From research to practice: real-time attribution, copyright, and multimodal extensions
- Where to find the data markets material and why model markets are the next adjacent idea
- Small language models: why efficiency engineering is making them far more capable
- Where small models fit best: agents, retrieval pipelines, and cost-sensitive workflows
- AI for science surges at NeurIPS, from biology and physics to mathematics
- Structured data, time series, and the push to unify architectures for practical deployment
- The return of post-training: synthetic tasks, reasoning traces, and new service businesses
Related content:
- A video version of this conversation is available on our YouTube channel.
- Nick Vasiloglou’s analysis of NeurIPS 2025: LinkedIn post and Google Drive with the results
- Tudor Achim → Building Mathematical Superintelligence
- Jeff Hawke → World Models Are Here—But It’s Still the GPT-2 Phase
- Kay Zhu → Your First AI Employee Is Already Clocking In
- Nestor Maslej → 2025 Artificial Intelligence Index
Support our work by subscribing to our newsletter📩
Transcript
Below is a polished and edited transcript.
Ben Lorica: All right, today we have a great treat. We have Nick Vasiloglou, VP of Research at Relational AI, which you can find at relational.ai—what a great domain name. Their tagline is “High-stakes decisions deserve frontier intelligence.” Today we’re going to talk about one of Nick’s annual side projects, where he does a deep dive into the NeurIPS conference presentations.
For our listeners who aren’t familiar, NeurIPS has historically been one of the main gathering places for people interested in what used to be called machine learning, and is now called AI. It’s grown over the years. Historically, it was an academic conference; now, increasingly, a lot of the presentations are given by industry professionals, but it’s still fundamentally a research conference. Am I mischaracterizing NeurIPS, Nick?
Nick Vasiloglou: No, not at all, Ben. Thanks for hosting today. NeurIPS stands for Neural Information Processing Systems. It’s almost 40 years old, and it was originally supposed to reconcile the systems world—like machine learning—with how the brain works. In the beginning, it had a large sector dedicated to cognitive science and how neurons work, and they were trying to reconcile that with statistics and machine learning. Forty years later, here is what we have. It used to be a rather geeky conference, yes.
Ben Lorica: It has historically served as a great bellwether for what’s coming next. It has gone through waves of fashion, right? I remember NeurIPS years ago where the Bayesians dominated, and then the kernel methods—the support vector machine guys—dominated.
Nick Vasiloglou: Graphical models.
Ben Lorica: Graphical models dominated, right. And now, obviously, deep learning and its variants are the main reason people go to this conference.
Nick Vasiloglou: I think it’s worth mentioning that the Nobel Prizes and the Turing Awards we’ve seen recently are all coming from that community. So it’s a truly foundational and important event to follow.
Ben Lorica: As I mentioned at the outset, it’s increasingly becoming a conference where industry people go as well. As you know, our listeners are not necessarily academics or researchers. So why should people in industry care about NeurIPS at all? Isn’t this just for professors and researchers?
Nick Vasiloglou: That’s a very good question. As someone who understands its value, I’ve had a hard time convincing people to pay attention to it. Back in 2000, if someone published an idea—like support vector machines or kernel methods—you needed four to five years, maybe more, before those things would start hitting the industry and you’d see them experimentally in deployments.
But now, what happened at NeurIPS last December is what your company is using—or should be using—today. This past year was the year that AI truly scaled out of the conference. The time from lab to production is now measured in weeks or months. If you wait a year or so, whatever NeurIPS produced might already be obsolete. We are living in an exciting time where something DeepMind, Google, or a university researcher is thinking about right now can be a differentiator for your business immediately, not in two or three years. It’s crucial to pay attention to what’s going on there.
Ben Lorica: Also, Nick, historically the way these academic conferences worked is they had a certain cadence and calendar. NeurIPS is at the end of the year, ICML is in the middle of the year. All these academic labs and groups revolved their lives around that calendar and the submission deadlines, which tend to be months in advance. To your point, why should I care about a conference where the deadline for submitting a presentation was six months ago?
Nick Vasiloglou: It’s actually a very good point. There are three major conferences, all organized by the same people. The year starts in May with ICLR. In July, we have ICML, which is bigger. And NeurIPS is usually in December, at the end of the year.
It is true that they have a deadline for submission, and people do publish their results on arXiv before they even present them. But keep in mind that each conference is massive—NeurIPS has around 10,000 papers, and ICML has about 5,000. That raises the question: “Nick, if I need to follow all these papers, when am I going to work?” You simply don’t have the time.
What I realized is that you don’t really have to go to all of these conferences. Yes, things are moving fast, but ideas are also recycled. If you attend one of those at the end of the year, you’ve got it. The conference has two main sectors: the original published work, and the workshops, tutorials, or expos. The latter tend to be more general and inclusive. For example, at NeurIPS in December, there was work that had been published at ICML, as well as work people were submitting to ICLR for May. So you don’t have to overwhelm yourself. Doing a checkpoint once a year at one of these conferences is sufficient to get a solid understanding. Things are moving fast, but not so incredibly fast that you can’t follow them.
Ben Lorica: Before we drill down into some of the key observations you made, give us a sense of the nature of your study. Did you look at every single presentation? Did you look at every single submission, including the ones that did not make the conference?
Nick Vasiloglou: This was a very different year. First, the volume has increased significantly. Second, this year we had coding assistants and very powerful language models to help us. I spent about 200 to 300 hours going through the content. We’re talking about roughly 8,000 to 9,000 papers, including the workshops and the main track.
Ben Lorica: Wait. 8,000 to 9,000 papers that were submitted or accepted?
Nick Vasiloglou: Accepted. Total. The acceptance in the main track is about 5,000, and there are roughly another 4,000 in the workshops.
Ben Lorica: And give us a sense of the selectivity. What percentage gets accepted?
Nick Vasiloglou: It’s about 26% to 27%. By the way, if you go to my LinkedIn—and we can post the link for the listeners—I have divided the overview of the conference into 20 topics. I’ve scraped all the talks and transcribed them, so people can download them and start querying them if they want.
Let me give more details about what people can find. In the past, I used to go through every workshop, look at the titles, and use my judgment to say, “Okay, I think this is useful, I need to pay more attention to this one.” For the papers, as you can imagine, even just reading the abstracts for 5,000 papers is about a week of work. When it was around 2,000 papers, I used to read all the abstracts, select some, and reiterate.
This year, I tried to pick up on what the speakers were referring to. Whenever you have keynotes or workshops, people tend to reference specific papers. Obviously, you have the best paper awards. I used web searches to see what other people were talking about, but I also relied on my own judgment. There are some topics that haven’t attracted a lot of attention yet but are very important. One of them is data markets. I don’t think people fully grasp the amazing breakthrough we had this year: when you are predicting a token, you can now figure out which training data was responsible for that generation. To me, this is a great achievement.
So I go through the content, scrape it, and extract it—which I wouldn’t have been able to do without the help of Claude Code and Codex. Then I start asking questions and doing deep research on it.
Ben Lorica: So do you actually transcribe the video as well?
Nick Vasiloglou: Yes. I transcribe the videos and the presentations. I run the presentations through a language model to get a summary of what each slide is about. All of that is available for people to use.
This actually raises a philosophical question about whether the traditional way we present papers at a conference is still the right approach, which we can discuss if you want.
Then I create 20 topics. I scan all the material and ask the model, “Okay, you have this five-minute video. Is there anything interesting here, and which topic would it be relevant to?” The model reads everything so that nothing is missed. After that, with the help of Gemini and other models, I structure it into a report. From that report, I create a presentation. Then I generate images—which was one of the most challenging parts—create an audio podcast on top of that, and finalize the presentations.
Ben Lorica: This is interesting. For our listeners, I will link in the episode notes to Nick’s LinkedIn post and the Google Drive where all of his results can be found. Actually, one thing you should do—I don’t know if you’re familiar with the annual AI Index report from Stanford?
Nick Vasiloglou: I am, yes.
Ben Lorica: You should take everything you’ve done and create one comprehensive report. Right now, when I look at what you’ve done, I have to go into individual folders for each topic. There’s no single executive summary or report.
Nick Vasiloglou: There is an executive summary, though maybe it’s a little buried in the details. But thanks for the feedback, I’ll look into that. I’ll talk to the Stanford guys and see if we can collaborate.
Ben Lorica: Yeah, or just adopt the format of their final report. You already have all the ingredients; you just have to compile it into a single document.
Speaking of data markets, obviously our podcast is called The Data Exchange, which is inspired by data markets. Let’s dive in. Let’s try to highlight the topics you think are currently under the radar but will explode in importance over the next 6 to 12 months. The first one you mentioned was data markets, right?
Nick Vasiloglou: Yes. Data markets are not a new concept. We know how to sell our cookies and personal data really well. People find value in it because if I want to sell something to you, Ben, and I need to know about you, I can create amazing models for targeted advertising—that’s how the internet works.
But it has always been a problem. There are ethical considerations about privacy, of course. But the other issue is building predictive models. Let’s say you have some data, and I want to use it to create a model. There are several problems. First, how do I securely transport the data from your servers to mine? This is something that clean rooms and companies like Snowflake seem to be solving well. Second, how much money do I pay you? I don’t really know the value upfront. Maybe you give me your data, I put it in my model, and I don’t see any lift because I already had that information.
Ben Lorica: And several years ago, there was a wave of startups trying to address these problems using the blockchain, right?
Nick Vasiloglou: Yes, that’s true. I went to a bunch of meetups to see how to get started, and the number of hoops you had to jump through was incredible. You had to open a wallet and do all these other things, and I just gave up. Pricing was definitely one of the main problems. By the way, I hosted a workshop at ICML back in 2020 on the pricing of data and how to respect privacy, called Ecopaddle: Economics of Data Privacy.
Beyond predictive models, the rise of language models has introduced huge problems regarding creators. There are pending lawsuits with Anthropic, OpenAI, and the New York Times. We want to incentivize people to write books and create content so we can train language models, but these models have to pay dividends back. Eventually, the lawyers figured out some pricing, but I don’t know if content creators got anything significant out of it.
Over the past three or four years, there’s been enormous work on attribution. To do pricing properly, you have to be able to do attribution. In machine learning, we used to call this explainable AI. But now we have “influence functions,” and there are many papers on this, including variations of LoRA (Low-Rank Adaptation), which is a way of fine-tuning models.
Right now, as you ask a question to a language model, it can figure out exactly which pre-training and post-training data was responsible for the answer.
Ben Lorica: And when you say it can do this, can it do it in a practical way? Is the latency low enough for real-time use?
Nick Vasiloglou: Last year they were able to do it, but it was very slow and required a lot of resources. This year, they managed to crack the computations and make it fast enough to be done in real time.
There was a phenomenal example where someone was threatening a language model to shut it down, and the model reacted defensively. They realized it reacted that way because it had been trained on a sci-fi novel in a completely different language! That’s the level of sophistication we have now. We can trace that a Lithuanian novel is responsible for the specific answer being generated.
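To make the mechanism concrete, here is a minimal sketch of gradient-similarity (“influence function”-style) attribution, assuming a Hugging Face-style causal LM and a hypothetical precomputed index of per-example training gradients; production systems compress and approximate these gradients heavily, which is what made the real-time versions Nick describes feasible.

```python
# Hedged sketch: gradient-similarity attribution for one generated token.
# Assumes `model` is a Hugging Face-style causal LM (output has .logits) and
# `train_grad_index` is a hypothetical {example_id: flattened_gradient} store.
import torch

def token_loss(model, input_ids, target_id):
    """Cross-entropy of the single generated token we want to attribute."""
    logits = model(input_ids).logits[0, -1]            # next-token distribution
    return torch.nn.functional.cross_entropy(
        logits.unsqueeze(0), torch.tensor([target_id])
    )

def attribute_token(model, input_ids, target_id, train_grad_index, top_k=5):
    """Rank training examples by how well their stored gradient aligns with
    the gradient of the token being generated (an influence-style score)."""
    loss = token_loss(model, input_ids, target_id)
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    query_vec = torch.cat([g.flatten() for g in grads])
    scores = [
        (example_id, torch.dot(query_vec, train_vec).item())
        for example_id, train_vec in train_grad_index.items()
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]
```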
Ben Lorica: Is what you’re describing a research prototype, or is there an open-source GitHub repo that someone can use today? What’s the status?
Nick Vasiloglou: I believe there is a repo for it. If I were a publisher like Random House or McGraw Hill going to court, I would argue, “The technology exists. As a language model provider, you need to implement these practices to protect my copyright.” This is something that can be leveraged now. There is still some development to be done, but we’re not far away.
Ben Lorica: So the idea is that in the future, when you use a language model, you can ask it to make sure it attributes its answers?
Nick Vasiloglou: Yes, exactly. There are also other technologies doing this. I think Google’s DINO model does the same thing for images.
Ben Lorica: Oh yeah, I was going to ask if this applies to images as well.
Nick Vasiloglou: For language models, we see it for text, but there’s a big shift happening in image generation from diffusion models to transformer architectures. I don’t think anyone has done attribution for image generation on transformer-based models yet, but I believe it’s the next step. For static images, DINO had a great paper and presentation last year, with a follow-up this year. It essentially uses a nearest-neighbor approach. They decompose the image at the semantic level so well that you can trace exactly what a prediction was based on.
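As a rough illustration of that nearest-neighbor idea, here is a minimal sketch, assuming a hypothetical `embed()` function that returns unit-normalized semantic features (the role a DINO-style encoder plays) and a precomputed index of training-image embeddings.

```python
# Hedged sketch: nearest-neighbor attribution for images in an embedding space.
# `embed` is an assumed feature extractor returning unit-normalized vectors;
# `training_embeddings` is a hypothetical {image_id: vector} index.
import numpy as np

def attribute_image(generated_image, embed, training_embeddings, top_k=5):
    query = embed(generated_image)
    scored = [
        (image_id, float(np.dot(query, vec)))   # cosine similarity on unit vectors
        for image_id, vec in training_embeddings.items()
    ]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]
```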
Ben Lorica: What’s your sense of the reliability of this attribution? Will there be false attributions?
Nick Vasiloglou: There might be. But from what I’ve seen, it’s much better than what we currently have. As a content provider, I would accept a small error rate if it guaranteed me an income. You could structure it so that a certain percentage of revenue is distributed equally among creators, and another percentage is based on attribution, to hedge the risk of errors.
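To make that hedging scheme concrete, here is a toy payout calculation with invented numbers: a fixed slice of revenue is split equally among creators, and the remainder is split by (possibly imperfect) attribution scores.

```python
# Toy payout model for the hedged revenue split described above (numbers are illustrative).
def payouts(revenue, attribution_shares, equal_fraction=0.3):
    """attribution_shares: {creator: share}, shares assumed to sum to 1.0."""
    n = len(attribution_shares)
    equal_pot = revenue * equal_fraction            # split evenly, hedges attribution errors
    attributed_pot = revenue * (1 - equal_fraction)
    return {
        creator: equal_pot / n + attributed_pot * share
        for creator, share in attribution_shares.items()
    }

print(payouts(1_000_000, {"author_a": 0.6, "author_b": 0.3, "author_c": 0.1}))
# {'author_a': 520000.0, 'author_b': 310000.0, 'author_c': 170000.0}
```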
Ben Lorica: Obviously, step one is attribution. Step two is figuring out the economics and how much you actually pay the content creator. Let’s face it, Spotify has 100% attribution, but I don’t think artists are super happy with the economics of that system.
For our listeners, what part of your report covers this topic?
Nick Vasiloglou: There are two folders: Data Attribution and Data Markets. There’s overlap because attribution can be used for other purposes. In the Data Markets folder, you’ll also find papers written by economists about how to make a market and create incentives. Pricing is fundamental for trading, but there are other factors involved in creating functional markets.
Now, there is another type of market: the market of models. There’s a very interesting line of research that I saw coming…
Ben Lorica: Just to clarify, this is a completely separate topic. We’re now moving on to the next thing you think will become important in six months.
Nick Vasiloglou: Yes, and I think it’s already becoming important. It is connected, and let me explain how. It stems from an observation made three or four years ago: let’s say I have a small language model that translates from English to Russian, and another that translates from Russian to Greek. If I take these models and simply add their weights, the resulting model can translate from English to Greek.
Of course, this was a naive approach and wasn’t perfect, but it sparked a whole line of research into how we can compose smaller language models to complete complex tasks. Last year they had a competition, and this year there was a tutorial on it.
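A minimal sketch of that composition idea in its “task arithmetic” form, assuming two fine-tuned checkpoints that share the same base architecture: instead of adding raw weights, you add each model’s delta relative to the common base.

```python
# Hedged sketch: merging two fine-tuned models via weight deltas ("task arithmetic").
# Assumes all three state dicts come from the same architecture (identical shapes).
import torch

def merge_task_vectors(base_state, model_a_state, model_b_state, alpha=1.0, beta=1.0):
    merged = {}
    for name, base_w in base_state.items():
        delta_a = model_a_state[name] - base_w      # what model A learned (e.g. English -> Russian)
        delta_b = model_b_state[name] - base_w      # what model B learned (e.g. Russian -> Greek)
        merged[name] = base_w + alpha * delta_a + beta * delta_b
    return merged

# Usage with hypothetical checkpoints:
# merged = merge_task_vectors(base.state_dict(), en_ru.state_dict(), ru_el.state_dict())
# model.load_state_dict(merged)
```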
It’s fascinating because the companies training these massive language models don’t talk much about how they do it. But just like a large software project is broken down into libraries, teams, functions, and APIs, these models are likely built through composition.
Ben Lorica: And from an end-user or application builder perspective, there’s the notion of routing. People have come to accept that when they hit an LLM API, they aren’t necessarily hitting the largest model possible.
Nick Vasiloglou: Exactly. This is the mixture of experts approach. The way it’s presented to the public is that you just take a trillion tokens, run gradient descent, and pre-train a massive model all at once. But that’s not how it happens. During a tutorial, the presenter admitted that while companies don’t release their exact methods, we should accept that trillion-parameter models are built by training smaller models and aggregating the weights, or by building a router on top of a mixture of experts.
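Here is a deliberately simplified sketch of that routing idea, assuming a set of already-trained expert modules and a small learned gate; real mixture-of-experts layers route per token inside the network, but the top-level logic is similar.

```python
# Hedged sketch: a tiny top-1 router over pre-trained expert modules.
import torch
import torch.nn as nn

class SimpleRouter(nn.Module):
    def __init__(self, experts, hidden_dim):
        super().__init__()
        self.experts = nn.ModuleList(experts)             # assumed pre-trained sub-models
        self.gate = nn.Linear(hidden_dim, len(experts))   # learned routing scores

    def forward(self, x):                                 # x: (batch, seq, hidden)
        scores = self.gate(x.mean(dim=1))                 # pool the sequence -> routing logits
        choice = scores.argmax(dim=-1)                    # top-1 expert per example
        outputs = [self.experts[idx](x[i : i + 1]) for i, idx in enumerate(choice)]
        return torch.cat(outputs, dim=0)
```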
There is a proper software engineering process for building large language models that follows traditional engineering principles: splitting the data, dividing the work, combining components, and using trial and error.
This creates a market for models. Cohere recently announced that they trained a model in a distributed fashion—one team worked here, another there, and then they combined the models. This gives us hope that we won’t have to rely solely on the massive frontier labs. People can start building specialist communities that collaborate and share resources.
There’s also research addressing how to combine models across heterogeneous hardware. Colin Raffel is a key researcher behind this work. You can watch his tutorial or my summary of it. It’s definitely a line of research that is picking up momentum.
Ben Lorica: Of the two you’ve laid out so far—data markets and model markets—I think data markets are more promising. I’m a bit skeptical about the model market, but who knows?
Nick Vasiloglou: Let me make you less skeptical by segueing to the next topic.
Ben Lorica: Okay, what’s the next one?
Nick Vasiloglou: The next one is the technology of small language models. It’s truly amazing what’s happening there.
Ben Lorica: I’m a big fan of small language models.
Nick Vasiloglou: I think everybody is a fan, because we all want them to work. It would be sad if only trillion-parameter models were effective, because that leaves most of us helpless without the necessary hardware.
Ben Lorica: There are pragmatic reasons too. I want something that can run locally on my phone or laptop.
Nick Vasiloglou: The good news is that they are working. What I found truly fascinating this year is how different architectures are being combined. We knew about LSTMs and state-space models like Mamba. We obviously know about transformer models and attention mechanisms. Quadratic attention scales poorly, which is a problem, so we saw technologies like flash attention and KV caching emerge.
Last year, Sepp Hochreiter, the father of the LSTM, presented an 8-billion parameter LSTM that worked great. I wondered why we hadn’t seen more of it. What happened—and this largely goes to Alibaba’s Qwen family, as well as Gemma and Microsoft’s models—is that researchers took everything we knew about language models and combined these technologies to increase both efficiency and accuracy.
Think of it like computer memory hierarchy: you have L1 cache, L2 cache, RAM, a hard drive, and the cloud. You have to manage which one you use because you can’t fit everything into the fastest layer. The same thing is happening with small language models. Crafting one is now a delicate exercise in resource management. You can’t use quadratic attention everywhere because it’s too costly, even though it yields better results. State-space models are faster and more efficient, but struggle with precise location pointing. Linear attention scales better but misses long-range connections.
Modern small language models are starting to look like operating systems. There is essentially a controller that decides on the fly: “Do I need to increase the context of my quadratic attention? Do I route this to a more or less complex part of the model?” Because researchers have done such a great job managing this complexity, we now have incredibly capable 5-billion and 8-billion parameter models.
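A rough sketch of what that controller-style resource management looks like as a layer plan, with invented block names: most layers use a cheap local or linear mixer, and only every few layers pay the quadratic cost of full attention for precise long-range lookups.

```python
# Hedged sketch: a hybrid layer plan for a small LM. Block names are illustrative,
# not the recipe of any specific model family.
def build_layer_plan(num_layers=32, full_attention_every=4, local_window=512):
    plan = []
    for i in range(num_layers):
        if (i + 1) % full_attention_every == 0:
            plan.append({"layer": i, "mixer": "full_attention", "window": None})          # global, quadratic
        else:
            plan.append({"layer": i, "mixer": "linear_or_ssm", "window": local_window})   # cheap, local
    return plan

for layer in build_layer_plan(num_layers=8):
    print(layer)
```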
The open-source community, particularly DataComp, has also done an amazing job curating high-quality datasets so we can train these models more efficiently with less data. I truly believe this is the year of the small language model. Qwen is currently leading the race, and while the community is trying to reverse-engineer why they are so good, there’s still a lot of work being done to fully understand them.
Ben Lorica: Obviously, Alibaba itself seems to have changed directions, much like Meta with LLaMA. It seems like they’re stepping back and potentially even walking away from Qwen to focus on something else, though I’m not exactly sure what.
Regarding small language models, the way I think of it is that I’d like to use them more, but if I don’t mind paying for a more capable model and don’t care about latency, I’d rather use the larger model. I’ll probably stick with that until it gets to the point where they are basically interchangeable.
In terms of my own personal usage patterns, because I mainly use opencode with OpenRouter, I default to the larger models for the most part, unless I think a specific route is going to cost a lot of money. There’s still something psychological about preferring the larger models, I think.
Now, one area where smaller language models become really favorable in my mind is with agents. Increasingly, working with agents isn’t really about the model anymore; it’s about the harness, the tools you use, and the memory. From that perspective, small language models optimized for this sort of setting make a lot of sense. In that environment, you’re not relying entirely on the model to do everything. Instead, the model acts more as an orchestrator and a reasoner.
Nick Vasiloglou: Yeah, and again, you can think of something like OpenAI as a big control plane. As I mentioned earlier, you have a language model with different controllers in its guts trying to manage complexity—adding or removing layers. Now think of an agent that wakes up every 30 minutes and does some really simple work, like reading files, and then decides to do something more complicated. You can imagine using a local model just to read the logs, and then calling a larger model if you need to fix a bug.
In the grand spectrum of complexity, they have a specific position. Another area of great personal interest to me is retrieval models. You start with traditional RAG, move to RAPTOR, or even what I did with NeurIPS. Using embeddings is great, but you’re basically summarizing and disconnecting things. What I really want is to read everything. Doing multiple forward passes with Anthropic or OpenAI to build knowledge graphs—extracting entities, relations, and reconnecting them—is very expensive. A small language model can do that work really well. If you need to do complex reasoning afterward, you can hand it off to the massive GPT models.
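A minimal sketch of that division of labor, assuming a generic `chat(model, prompt)` helper for whatever provider or local runtime you use (the model names below are placeholders): the small model does the many cheap extraction passes, and the large model is called once for the expensive reasoning step.

```python
# Hedged sketch: small model for bulk extraction, large model for final reasoning.
# `chat(model, prompt) -> str` is an assumed helper; model names are hypothetical.
SMALL_MODEL = "local-8b-instruct"
LARGE_MODEL = "frontier-reasoning"

def extract_triples(chat, document):
    prompt = ("Extract (entity, relation, entity) triples from the text below, "
              "one per line:\n\n" + document)
    return chat(SMALL_MODEL, prompt).splitlines()        # cheap pass, run per document

def answer(chat, question, documents):
    graph = [t for doc in documents for t in extract_triples(chat, doc)]
    prompt = ("Using only these facts:\n" + "\n".join(graph) +
              f"\n\nAnswer with step-by-step reasoning: {question}")
    return chat(LARGE_MODEL, prompt)                      # expensive model called once
```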
Ben Lorica: What other things from NeurIPS should people in industry care about?
Nick Vasiloglou: This is the big one. I haven’t touched on it yet, but this was truly the year of AI for Science. About 20% of the papers were focused on this. A year or two ago, you might have seen 50 papers on physics, chemistry, or biology. This year, there was an absolute explosion.
Ben Lorica: Are these papers being written by machine learning people, or is it ML people collaborating with scientists?
Nick Vasiloglou: A lot of scientists. I remember the days when it was very easy for an AI or machine learning person to go into other domains and publish because we knew the algorithms. Now, I see the intrusion coming from the other side. There was a massive influx of researchers from the physical sciences.
Investors often ask me about the economics of AI, the cost of startups, and the idea of one-person unicorns. People are still figuring that out. But I think the biggest power of generative AI right now is in material and drug discovery. The Biohub team, led by Eric Xing, presented the first cell simulator, which is tremendous. You can now simulate how different medicines will affect cells.
I have a particular interest in AI for Physics. Looking at it right now, it feels like NLP did 10 years ago—there are a hundred different architectures being tested.
Ben Lorica: My personal interest in this area, coming from more of a math background, is following what’s happening in mathematics. There is a growing recognition among research mathematicians that these tools are here to stay. It’s similar to what happened with programmers two or three years ago—initially, there was a lot of resistance. But now, mathematicians are increasingly realizing that these tools aren’t going anywhere, and they are figuring out how to use them.
This means that in many of these scientific disciplines, it’s like what happened in programming: if you don’t embrace the tools, you get left behind. This also opens up the possibility that the future Nobel Prize winner or Fields Medalist might not just be the person with the highest raw intellectual horsepower, but the one who knows how to use these AI tools effectively. You have to develop this new skill to remain at the forefront of your scientific discipline.
Nick Vasiloglou: I agree with you. Donald Knuth admitted that Claude managed to solve one of the conjectures in his book, which made him revise his opinion. Terence Tao also found a new proof with the help of ChatGPT.
I’ve talked to a bunch of startups in this field, so I have a sense of what’s real and what’s not. AI isn’t going to do the math from end to end on its own. But there is a recognition that these tools are incredibly useful, and you’re better off accepting them so you can get a lot more done.
Ben Lorica: There’s also a very practical use of math for language models. The community has accepted that if you want to train or fine-tune a language model to do biology or finance, including math theorems and proofs in your training data—even if they are completely irrelevant to the topic—makes the model reason better.
This is why you see companies putting so much attention into math benchmarks and math training data. Some do it for marketing and glory, but many do it for practical reasons. Just like telling a kid who wants to be a lawyer to study math because it sharpens the brain, the same thing applies to language models. Adding a small amount of mathematical training data acts like a good alloy; it makes the model reason much better.
The math community has also built supporting tools, like Lean and other formal theorem provers, that lend themselves well to interacting with these models.
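For readers who haven’t seen a formal prover, here is a trivially small Lean 4 example of the kind of machine-checkable statement these tools handle; the appeal for language models is that a candidate proof either compiles or it doesn’t, which gives an unambiguous verification signal.

```lean
-- Minimal Lean 4 example (illustrative, not from any paper discussed here).
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```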
Regarding these AI for Science workshops, were they attended mostly by science people, or were AI people going to them too?
Nick Vasiloglou: You have to look at it pragmatically. Obviously, there were researchers, people from national labs, and investors. But put yourself in the shoes of a graduate student. If you’re a PhD student without 10,000 GPUs in your lab, you need to find a new domain for your research. AI for Science naturally attracts a lot of graduate students looking for a pragmatic path to a PhD.
Right now, doing core research on transformers or language models is very difficult because the field is so settled. The transformer architecture has been around for years and is very secure in its position. But if you look at AI for Physics, it’s like deep learning back in 2015—there are a hundred different architectures for different tasks. It’s very fragmented, and it will remain so until we converge on a standard model like the transformer.
There are two different schools of thought here. One believes we need to find entirely new architectures for scientific problems. The other says, “We’ve invested so much money in traditional transformers; let’s just tokenize the scientific problems.” Whether it’s physics or simulation, they verbalize it—saying “Molecule A moved to that position at that time”—and use token prediction to capitalize on the knowledge we already have from OpenAI, Anthropic, and Qwen.
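A toy sketch of that “verbalize and tokenize” approach, using a made-up trajectory format: the simulation state is rendered as plain text so an off-the-shelf language model can be trained or prompted with ordinary next-token prediction.

```python
# Hedged sketch: verbalizing a simulation trajectory into text for a standard LM.
# The record format is invented purely for illustration.
def verbalize_step(step):
    return (f"t={step['time']:.2f}: molecule {step['id']} moved to "
            f"({step['x']:.3f}, {step['y']:.3f}, {step['z']:.3f}) "
            f"with energy {step['energy']:.2f}")

trajectory = [
    {"time": 0.0, "id": "A", "x": 0.0, "y": 0.0, "z": 0.0, "energy": 1.20},
    {"time": 0.1, "id": "A", "x": 0.1, "y": 0.0, "z": 0.2, "energy": 1.05},
]
training_text = "\n".join(verbalize_step(s) for s in trajectory)
print(training_text)   # feed this to a tokenizer / LM as next-token-prediction data
```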
This provides a good segue to another domain that has found its path: tabular models. For years, XGBoost was the undisputed king of predictive models for tabular data, and no deep learning method could beat it. We also had Graph Neural Networks (GNNs), which are still used mostly in the physical sciences.
Ben Lorica: Our mutual friend Jure Leskovec would disagree, right? Kumo.ai is apparently thriving, based on a recent episode of this podcast.
Nick Vasiloglou: Yes, though I believe they are shifting from GNNs to foundational relational models for structured data. He doesn’t explicitly say what that entails, but the hint is that the earlier work on GNNs plays a role.
Ben Lorica: It does, and Relational AI competes in that same space with predictive models.
Nick Vasiloglou: Exactly. There was a workshop called Table Representation Learning (TRL) that I’ve been following for the past three years. We’ve seen more and more models produced that are basically transformer-based, or very close to it. My personal belief is that tabular data will eventually be completely absorbed by the transformer architecture. But right now, these new models are starting to perform much better than XGBoost.
Ben Lorica: In past years, and I think this past year as well, there have been workshops on foundation models for time series, right?
Nick Vasiloglou: Yes, absolutely. Time series is another important area. We should also mention TabPFN, which is a very popular tabular model that had a Nature publication.
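For readers who work with structured data, here is a small sketch of what using a model like TabPFN looks like, assuming the open-source `tabpfn` package and a modest-sized table (these models target small datasets rather than web-scale ones); the scikit-learn-style interface is the point: no architecture search, just fit and predict.

```python
# Hedged sketch: a transformer-based tabular classifier with a scikit-learn-style API.
# Assumes the open-source `tabpfn` package; intended for small/medium tables.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()        # pre-trained transformer prior; no per-dataset tuning
clf.fit(X_train, y_train)       # "fitting" mostly stores the training set as context
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```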
Ben Lorica: I call out time series separately because there seem to be enough industry applications that it generates a lot of interest—more work than any other form of structured data.
Nick Vasiloglou: There is a lot of progress, but it’s still a very niche area, which is why you don’t see it discussed as broadly.
I want to make a specific point here. I love research, but I work in the startup world, and things have to be pragmatic. Customers want one thing: they don’t want to deal with ten different technologies. Every time someone creates a new architecture, it creates technical debt. That’s why I particularly like work that takes existing models like Claude, ChatGPT, or Qwen and adapts them—even through verbalization. It allows the architect or user to work within one unified platform and framework. This is crucial for maintaining a single security perimeter.
This brings us to post-training, which is becoming the Holy Grail right now.
Ben Lorica: Have you seen services where the equivalent of fine-tuning is now, “I’m going to build an agent by giving it a Docker container, some tools, and some tasks, and then I’m going to go away for lunch and come back”?
Nick Vasiloglou: I’ve seen companies doing this. I don’t know if we’re allowed to advertise, but I have a friend from Berkeley who has a company called Bespoke providing exactly these services. At this point, you still need humans to create the definitions for these tasks.
Ben Lorica: Well, you needed humans to create the fine-tuning datasets in the previous era, right?
Nick Vasiloglou: Yes. The nice thing is that this new type of fine-tuning is more about reasoning traces. It’s not just “Here’s the question, here’s the answer.” It’s “Here’s the question, and the model has to generate long reasoning traces.” There are ways to do this with reinforcement learning, or with meta-models like Gemma or Superalignment that try to optimize the prompt or the skill.
The big debate right now is between two directions: Do I create task definitions and use reinforcement learning to modify the weights of my model? Or do I try to create dynamic skills that adapt to the context to solve the problem in a more evolutionary way? There are papers exploring both approaches.
Ben Lorica: I think the opening here is for someone to create a service that doesn’t require a lot of technical knowledge. You just need domain expertise to know what you want the agent to do, democratizing the process.
Nick Vasiloglou: You don’t even have to be a deep expert. There is a rise of a new profession that I call “fake experts.” For example, let’s say you work in infrastructure. You know the tools, and you want your coding assistant to figure out what’s going wrong when a service goes down. Previously, you needed experts to read the logs. Now, people are intentionally breaking environments—knowing exactly what they broke—and creating synthetic tasks. They tell the model, “Here are the symptoms, go fix it.”
If you know how to work well with a language model, you can become dangerous enough to start generating these tasks. I’m seeing a big rise in companies and professionals whose job is basically to create hard tasks—whether in optimization, business logic, or elsewhere—that require extensive reasoning to train these language models. Essentially, we’re living in the age of transferring our knowledge to language models through realistic tasks.
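A bare-bones sketch of that “break it on purpose” pattern, with an invented task schema: you apply a known fault to a working environment, record the observable symptoms, and keep the ground-truth fix so the model’s repair attempt can be scored automatically.

```python
# Hedged sketch: generating a synthetic infrastructure-debugging task.
# The schema and fault list are invented for illustration.
import random

FAULTS = [
    {"break": "set db connection pool size to 0",
     "symptom": "API returns 503 under light load"},
    {"break": "point service config at a non-existent hostname",
     "symptom": "DNS resolution errors in the logs"},
    {"break": "revoke read permission on the TLS key file",
     "symptom": "service crash-loops at startup"},
]

def make_task(environment_snapshot):
    fault = random.choice(FAULTS)
    return {
        "environment": environment_snapshot,        # e.g. a Docker image or compose file
        "injected_fault": fault["break"],           # known only to the task author
        "prompt": ("The service is unhealthy. Observed symptom: "
                   f"{fault['symptom']}. Investigate and fix the root cause."),
        "ground_truth_fix": fault["break"],         # used to score the agent's repair
    }

print(make_task("docker image my-service:1.4.2"))
```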
Ben Lorica: So what you’re describing is basically the next generation of Scale AI-type companies, right?
Nick Vasiloglou: Exactly. My friend Alex Dimakis from Berkeley already has a company called Bespoke doing just that.
Ben Lorica: And with that, I will obviously place all the links in the episode notes. Thank you, Nick.
Nick Vasiloglou: It was a pleasure, Ben.
