Why Your AI Agent Isn’t Ready to Ship (And How to Know When It Is)

Zhou Yu on Simulation, Evaluation, Memory, and the Future of AI Agent Testing.

In this episode, host Ben Lorica sits down with Zhou Yu, associate professor of computer science at Columbia University and co-founder of Arklex AI. They discuss the critical challenges of evaluating and testing multi-turn AI agents, explaining why manual testing fails to scale and how simulation-based evaluation provides a more rigorous alternative. The conversation also explores the complexities of multi-agent systems, the evolving role of agent memory, and the shifting landscape of computer science research in both academia and industry.

Related content:

A video version of this conversation is available on our YouTube channel.
Stop upgrading your LLM. Start fixing your data.
Why Your AI Agents Fail in Production (And How to Actually Test Them)
The Missing Layer: Why Your AI Agent Fails — and What Actually Fixes It
Richard Garris and Barry Dauber → The Gap Between AI Hype and Enterprise Reality
Arun Kumar (of UCSD and RapidFire AI) → Are Multi-Agent Systems More Complex Than They Need to Be?

Transcript

Below is a polished and edited transcript.

Ben Lorica: All right, so today we have Zhou Yu, who is an associate professor of computer science at Columbia University. But more importantly for today’s podcast, she is the co-founder of Arklex AI, which you can find at arklex.ai. Full disclosure, I am a small angel investor in Arklex. Their tagline is “Simulation-based agent evaluation: generate realistic multi-turn conversations with your AI agents, evaluate every turn, ship with evidence, not hope.” And with that long-winded introduction, Zhou, welcome to the podcast.

Zhou Yu: Thanks, Ben. It’s great to catch up.

Ben Lorica: All right, so when we first spoke—about a year ago, as we were discussing before we started recording—you folks were focused on helping people build agents. Then you learned that there are actually a lot of tools for helping people build agents, but the more important bottleneck is knowing when your agents are ready to go live. Before we go deeper into that topic, for the purposes of this discussion, what sort of agents are we talking about in general? Are they chatbots? Obviously, the most hyped agents right now are the coding agents, but that’s a separate category. So what sort of agents are we talking about?

Zhou Yu: It’s actually very hard to cut agents into different categories. Some people cut it by use cases, some cut it by who the end users are. But what we specifically are focusing on are multi-turn agents. This means that the user is not just giving one instruction and it’s done; it could be multiple instructions in order to complete a goal. We do support vibe coding instances—like using natural language back and forth to chat with an agent to build some piece of code or build an application, for example, building a virtual wallet. All these things are considered within our use case.

The real thing that we stress is that we work on simulations, and the most important part is user simulation. We are targeting agent cases where users are very diverse and require multi-turn interactions in order to complete the task. That is pretty broad from a technology perspective, but if you map it to real use cases, the obvious ones are probably customer service agents, sales agents, HR agents, or more consumer-facing agents. One agent is serving so many different people, and different people behave differently and have different needs. Those are agents that require more rigorous testing and more complete coverage testing.

Ben Lorica: So generally, what we’re talking about here, Zhou, are agents that are not just conversing with you; they’re actually either using tools, doing deep research (the so-called deep research agents), or there’s a bit of back and forth. It’s almost like the agent is some sort of coworker in this scenario.

Zhou Yu: Yeah, in some sense. They are helping you to complete a task for a specific goal. They can call tools, they can do other things. Deep research is just one example. We mostly wanted to support people in having a better understanding—before you deploy—of how a user would interact with your agent. And after you deploy, how do you iteratively accumulate better testing cases so that you always know how to improve your agents in the right direction.

Ben Lorica: One of the buzzwords these days is this thing called “harness engineering.”

Zhou Yu: Yes.

Ben Lorica: Which basically, at a high level, expands the notion of AI to say that an AI application is more than a model. It’s a bunch of things that surround the model—it might be context, rules, tools, or constraints. Do you think what you do falls under this umbrella of harness engineering? In the sense that you’re also outside of the model, but you are helping me understand the agent much more.

Zhou Yu: We’re helping people understand how well their agents are working under these harnesses. And if you change different harness components, how the outcome will look. For example, you change your system prompt, you add a different tool, and then you want them to do the same thing. How do they compare to the previous version?

Ben Lorica: You know, in the early days of this resurgence of AI, the “Hello World” example was RAG. And in RAG, it turned out there are a lot of knobs to turn: information extraction (which PDF extraction library should I use?), how to chunk the data, which search algorithm to use, how to rerank, and so on. It sounds like what you’re describing in the world of agents is similar. When you build an agent, there’s a bunch of levers that you can test and optimize in your harness.

Zhou Yu: Exactly.

Ben Lorica: And so it’s almost like the old notion of hyperparameter tuning applied to this scenario. Is that kind of what we’re talking about?

Zhou Yu: Definitely. If you think about it, an agent is not traditional software that you build, deploy, and don’t change until six months later. You are actually constantly changing your agent because there are so many new things you can add to it. You can change your prompt, add new tools, change your underlying model. Your development and updates of your agents are quite frequent. But the key question is still: how do you know this version is better than that version?
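
One way to make the "is this version better than that version?" question concrete is to run both versions against the same set of test scenarios and compare results scenario by scenario. The sketch below is illustrative only and is not Arklex's implementation; the scenario names and the pass/fail dictionaries are hypothetical, and in practice each result would come from running the agent and an evaluator.

```python
def compare_versions(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-scenario pass/fail results from two agent versions.

    Both inputs map a scenario name to whether that version passed it.
    Only scenarios present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no scenarios")

    def pass_rate(results: dict[str, bool]) -> float:
        return sum(results[s] for s in shared) / len(shared)

    return {
        "baseline_pass_rate": pass_rate(baseline),
        "candidate_pass_rate": pass_rate(candidate),
        # Scenarios that used to pass and now fail.
        "regressions": [s for s in shared if baseline[s] and not candidate[s]],
        # Scenarios that used to fail and now pass.
        "fixes": [s for s in shared if not baseline[s] and candidate[s]],
    }


# Hypothetical results: version 2 changed the system prompt.
baseline_v1 = {"return_item": True, "track_order": True,
               "cancel_order": False, "update_address": True}
candidate_v2 = {"return_item": True, "track_order": False,
                "cancel_order": True, "update_address": True}

report = compare_versions(baseline_v1, candidate_v2)
print(report)
```

In this made-up example both versions pass three of four scenarios, so the aggregate pass rate alone would hide the fact that the prompt change fixed one scenario and broke another.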

Ben Lorica: And by the way, there are certain things outside of your control, or semi-outside your control. Like if your favorite model provider updates their model, you want to rush to embrace the new model, but obviously you can’t do that right away. You have to do some testing.

Zhou Yu: Exactly. Evaluation is so important these days.

Ben Lorica: So Zhou, what is the typical scenario for you folks? You come into a company—do they already have an agent, or is it still a demo, or do they want to build a more complicated agent? What’s the typical scenario?

Zhou Yu: People are in different stages, but people feel the pain the most if they have gone through the entire process from demo to production. If you think about the traditional process of how you test agents: you build an agent (let’s say a sales agent), you push it to development, and then you invite your engineers, your product managers, and your human sales agents to test the agent.

Ben Lorica: Wait—you push to development, and then you test after?

Zhou Yu: Yes, you push to development. You invite the rest of the company and say, “Hey, we’re thinking of deploying this, let’s all hammer this agent.”

Ben Lorica: Exactly—and see how you can break it, or if it’s doing the things you wish it was doing. Like penetration testing within the company.

Zhou Yu: Not only penetration testing, but also quality testing. Is the agent actually doing the things that you think it’s doing the right way? In some sense, this process is very manual.

Ben Lorica: So you’re saying that at this stage in the development of AI, most companies already have an agent and are at this stage?

Zhou Yu: Yes, they have been experiencing the pain of quantifying the agent.

Ben Lorica: And then?

Zhou Yu: Once people report these errors, the hard part is that during testing, it’s difficult to replay these errors or collect the information in a systematic way. Everyone is using spreadsheets or something similar. Once you get this information, you try to fix your agent. Once you fix it, you want to test it again, but you can’t replay the original scenarios. So in practice, what happens is you push it to dev again and say, “Hey, can you actually test it again?” It’s back and forth, and sometimes you miscalibrate and miss some edge cases along the way. So it’s very non-standard.

Ben Lorica: And there are a lot of configuration details too, right? Like, “I used this agent, it called this tool—what version of the tool?”

Zhou Yu: Yeah. When you are doing the testing, there might be different things going on on the agent side that the testers don’t know about, which causes all these mismatches. But the most important part is that the completeness and coverage of the testing cannot be guaranteed unless you have a good product manager who handles all the operations and separates different use cases and scenarios. It’s actually a fairly tedious process with a lot of overhead to communicate between the engineering teams and the QA testing team. And these QA testing teams are usually non-technical people who are experts in customer service, sales, or operations. It’s hard for them to align and agree on, “These are the test cases that we should all test for, and these are going to be a golden set that we always test the new version of agents on.” That’s the hard part.

Ben Lorica: I’m curious—in your conversations with companies, obviously this notion of reliability is top of mind. That’s why they bring you in. But do you come across any teams that care about regulations and compliance? Because that’s also part of what you can help them with.

Zhou Yu: Yeah. When we say qualities, regulations and constraints are part of those qualities. You have different verifiers you care about.

Ben Lorica: That means the legal team, the Chief Legal Counsel, might also be interested in saying, “Before we sign off and let you deploy this thing, we want to make sure that you go through a bunch of compliance or regulation-related testing as well.”

Zhou Yu: Oh yes, for sure. When we actually worked with one of our customers, Pearson, their legal team was also looking into disclaimers and corner cases to make sure that the agent was saying things within bounds.

Ben Lorica: You alluded to the fact that when you go into a company, they already have an agent they want to deploy, and they’ve actually tried to do this reliability and eval testing on their own. As you described, they do it manually, which quickly becomes painful and impossible. Do you come across teams that say, “Hey, we’ll just build it ourselves”?

Zhou Yu: Yes, we have seen teams that are trying to build it themselves.

Ben Lorica: What’s the challenge there?

Zhou Yu: The challenge is always: do I just build it for my one use case, for this one agent? Or should I build it to be more generalizable for all the potential agents my company is going to use in the future? They don’t have a good idea of what kind of other use cases they’ll build. So many of them are building it in smaller pieces and then trying to coordinate them together, but it’s not very centralized.

Ben Lorica: And I guess the other question is, who’s building the agents inside companies? Is it individual departments or is it centralized?

Zhou Yu: That’s a really good question. Some of the more tech-forward teams may have an agent infrastructure team that connects developer tools together to support different agent product teams. Some are more segregated because their business is more segregated, so they let the product team use whatever they want and go from there.

Ben Lorica: But even these teams that have agent platform teams—do they try to build this reliability and eval on their own as well?

Zhou Yu: At this point, everyone has some testing pieces already, but most of them are building it in a very scrappy way. It’s a minimal thing they build that is good enough for them to move forward. Some are still staying on single-turn—like question-and-answer type evaluation. Some may be doing prompt-based LLM user simulations. But most of them are just building these very scrappy things, and there’s no real guarantee about how well the user simulation works.

Ben Lorica: So you help people do the reliability, evaluation, and testing of their agents. If you were to explain this to a CTO, what are some of the key components of a solution like this? I’m assuming one of them is the ability to do some simulation—simulate edge cases, basically the equivalent of test coverage in QA.

Zhou Yu: Exactly. We’re a platform. First off, it’s a platform that helps you manage your test cases.

Ben Lorica: So everyone can have the same dashboard.

Zhou Yu: Exactly. The product manager, the domain expert, the engineers can all have a good view of what the golden test cases look like. They can manage and iterate together on the same platform. You can build different evaluation sets for each agent and save them. You can also swap in a different model for LLM-as-a-judge evaluation.

Now we’re talking about two pieces. One is the simulation piece: given a scenario, we usually break it down into user profile, user intent, and user context. Based on this information, we build a user agent. Once you specify this information, we automatically build this user agent for you. Then the user agent will interact with your product agent to generate synthetic traces. That is the simulation.

The second part is the evaluation. The good thing about simulation-driven evaluation is that because I know your premise—who you are and what you want to do—I can base the evaluation on your scenario. Your LLM-as-a-judge has the context of these things. For example, if the intent of this user is returning a product, my verifier is: did the user successfully return the product, and did this process adhere to the rules of the company? These are the verifiers for each individual scenario.

Of course, we also let people design their own metrics so that non-technical product managers or evaluators can do this themselves. They can use natural language to describe what they want to test, get a simulation, design their metrics, and run the metrics to see which ones passed and which ones didn’t. They can do manual reviews, add annotations saying “this is wrong,” and kick it back to the engineering team. They can say, “I actually didn’t like this simulation, let’s change it to this one,” and the tool will take their natural language description and regenerate the simulation until they’re happy with it. It makes the entire development and testing process more transparent to domain experts and non-technical users. It’s a collaboration process for the engineering and product teams.
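
The two pieces described above (a scenario made of user profile, intent, and context driving a simulated user, then scenario-specific verifiers judging the resulting trace) can be sketched in a few lines. This is a toy illustration under loud assumptions, not Arklex's implementation: the user agent and the product agent are scripted stand-ins for what would be LLM-driven components, the verifiers are plain functions rather than an LLM-as-a-judge, and the `ORDERS` table, the 30-day return window, and all function names are hypothetical.

```python
import re
from dataclasses import dataclass, field

RETURN_WINDOW_DAYS = 30  # hypothetical company policy
ORDERS = {"A100": {"days_since_delivery": 12},
          "B200": {"days_since_delivery": 45}}  # hypothetical order database


@dataclass
class Scenario:
    profile: dict  # who the user is
    intent: str    # what the user wants to do
    context: dict  # situation-specific facts


@dataclass
class Trace:
    turns: list = field(default_factory=list)       # (speaker, text) pairs
    tool_calls: list = field(default_factory=list)  # (tool_name, argument) pairs


def user_agent(scenario: Scenario, trace: Trace):
    """Scripted stand-in for a user simulator; returns None when the user is done."""
    user_turns = sum(1 for speaker, _ in trace.turns if speaker == "user")
    if user_turns == 0:
        return f"Hi, I'd like to {scenario.intent} for order {scenario.context['order_id']}."
    if user_turns == 1:
        return f"The reason is: {scenario.context['reason']}."
    return None


def product_agent(trace: Trace) -> str:
    """Toy agent under test: asks for a reason, then applies the return policy."""
    user_msgs = [text for speaker, text in trace.turns if speaker == "user"]
    if len(user_msgs) == 1:
        return "Sure, can you tell me why you're returning it?"
    order_id = re.search(r"order (\w+)", user_msgs[0]).group(1)
    if ORDERS[order_id]["days_since_delivery"] <= RETURN_WINDOW_DAYS:
        trace.tool_calls.append(("create_return", order_id))
        return "Your return has been created."
    return "Sorry, this order is outside the return window."


def simulate(scenario: Scenario, max_turns: int = 6) -> Trace:
    """Let the user agent and product agent talk, producing a synthetic trace."""
    trace = Trace()
    for _ in range(max_turns):
        message = user_agent(scenario, trace)
        if message is None:
            break
        trace.turns.append(("user", message))
        trace.turns.append(("agent", product_agent(trace)))
    return trace


def goal_achieved(scenario: Scenario, trace: Trace) -> bool:
    """Verifier 1: did the user actually get a return created?"""
    return ("create_return", scenario.context["order_id"]) in trace.tool_calls


def policy_followed(scenario: Scenario, trace: Trace) -> bool:
    """Verifier 2: a return may only be created inside the return window."""
    order = ORDERS[scenario.context["order_id"]]
    eligible = order["days_since_delivery"] <= RETURN_WINDOW_DAYS
    return eligible or not goal_achieved(scenario, trace)


on_time = Scenario({"segment": "first-time buyer"}, "return a product",
                   {"order_id": "A100", "reason": "wrong size"})
too_late = Scenario({"segment": "repeat buyer"}, "return a product",
                    {"order_id": "B200", "reason": "changed my mind"})

for scenario in (on_time, too_late):
    trace = simulate(scenario)
    print(scenario.context["order_id"],
          goal_achieved(scenario, trace), policy_followed(scenario, trace))
```

Because the evaluator knows the scenario's premise, it can ask the two questions separately: the late-return scenario fails "goal achieved" but still passes "policy followed", which is the correct behaviour for the agent.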

Ben Lorica: For the non-technical people, I’m assuming the type of metrics they design are outcome-based, right? Coincidentally, there are some people who build agents who use outcome-based pricing.

Zhou Yu: Yes, yes, yes.

Ben Lorica: So you’re helping me test the agent. Is the outcome of this just, “Hey, your agent is not ready to be deployed”? Or are you also going to make suggestions? Like, “Hey, the problem with your agent right now is the tool you’re using is wrong,” or “Your prompt can be optimized.” What remedies do you provide?

Zhou Yu: First, we detect errors and give suggestions. Is it a tool call failure? Is it a prompt that doesn’t generalize to this case, so they can go back and fix it? Or is it missing information within the knowledge base, so they can go back and correct that information, and then run it again until it passes?

Ben Lorica: One of the key challenges with agents in the enterprise is pretty basic: a lot of enterprises don’t have their systems ready. In other words, a lot of the things that an agent may need might still be in legacy systems or systems that are hard to connect to. I don’t suppose you can help me understand that my problem is, “Hey, you’re trying to resolve this issue, but what you need to do is hit the CRM system over here.”

Zhou Yu: Given the way you’ve currently designed the agent, we can provide suggestions based on that configuration. We’re not going to test on use cases your agent isn’t supposed to handle, because that was defined in the product descriptions. We only test you on the cases that you think the agent should do. After you have all this information over time, you can also train smaller models to reduce your costs. But the most important part is having these standard testing cases to make sure your quality stays consistent. Because every day things might change, and you want to run regression tests to make sure your agent’s performance doesn’t degrade.

Ben Lorica: You mentioned earlier that the obvious challenge here is trying to do something very manual, which doesn’t scale to the number of agents or the complexity of the agent. Beyond the fact that manual testing doesn’t scale, what else do people underestimate about this area?

Zhou Yu: I think people also underestimate how difficult it is to build a good user simulation. Many of them think, “Oh, I can just write a prompt and have large language models generate scenarios.”

Ben Lorica: Yeah, I can use a prompt to generate scenarios.

Zhou Yu: You can generate scenarios, and you can prompt the large language model to behave like a user, but it’s not easy. The user also needs access to information, tools, and profiles. These need diversity. For example, in a shopping context, if a person hasn’t purchased anything, they can’t do a return. You have to test all these different things—like if they have two orders and they’re only checking on one of them. It really is about coverage. That’s the hardest part. Many people are only testing on the happy path, not testing for full coverage. That’s why when they deploy, they run into all these problems, get discouraged, and say, “Oh, it’s impossible.”
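
The coverage point can be illustrated by enumerating user states against intents and checking how much of that grid a test suite touches. The sketch below is a deliberately small, hypothetical example: the order counts, intent names, and expected-outcome labels are invented for illustration, and a real scenario space would have many more dimensions.

```python
from itertools import product

ORDER_COUNTS = [0, 1, 2]  # how many past orders the simulated user has
INTENTS = ["return_item", "check_order_status", "cancel_order"]


def expected_outcome(order_count: int, intent: str) -> str:
    """What a correct agent should do for this combination (hypothetical labels)."""
    if order_count == 0:
        # Nothing was purchased, so there is nothing to return, check, or cancel.
        return "refuse_gracefully"
    if order_count > 1:
        # The agent must first work out which order the user means.
        return "disambiguate_then_fulfill"
    return "fulfill"


def scenario_grid() -> dict:
    """Every (order_count, intent) combination mapped to its expected outcome."""
    return {(n, intent): expected_outcome(n, intent)
            for n, intent in product(ORDER_COUNTS, INTENTS)}


def coverage(tested: set) -> tuple:
    """Return the fraction of the grid covered and the combinations still missing."""
    grid = scenario_grid()
    missing = sorted(set(grid) - tested)
    return (len(grid) - len(missing)) / len(grid), missing


# A "happy path" suite: every user has exactly one order.
happy_path = {(1, intent) for intent in INTENTS}
ratio, missing = coverage(happy_path)
print(f"coverage: {ratio:.0%}, missing: {missing}")
```

Even in this tiny grid, a happy-path suite covers only a third of the combinations and misses both the user with no orders and the user with two orders.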

Ben Lorica: I think a lot of people get silently fooled by the fact that if you’re not testing systematically in a principled way, it looks like, “Hey, this thing is passing all our tests, we can deploy to production,” and then the agent just fails catastrophically.

Zhou Yu: Yeah, totally. That happens all the time. A small prompt change could change a lot of things in your agents. It’s almost like whack-a-mole. You push it down, something else pops up, and you don’t know what’s going to pop up unless you actually test all of those potential things you’re going to encounter.

Ben Lorica: Is it still common, Zhou, to encounter teams where they go, “I can actually just cleverly prompt my way out of this”? The models are getting better and better, so I just use the best model and cleverly prompt my way out.

Zhou Yu: I guess the key is there are so many different problems and difficulties with this agent-building process. First of all, what kind of tools do you need? From the very design principles level, what are my agents going to do? How would my end user interact?

Ben Lorica: To this end, Zhou, is the typical thing that people do, “Okay, we want an agent to automate this workflow. How would a human do this workflow? Then we’ll just build an agent that mimics a human.” Is that typically what these teams do?

Zhou Yu: Different teams approach it differently. Some of them look into existing workflows and try to automate small components of it. That’s a more conservative approach. And some people are really reimagining the workflow.

Ben Lorica: Yeah, maybe the human wasn’t doing it the most efficient way.

Zhou Yu: Not the most optimal way. And there are two types of agents that people talk about. There are more constrained workflow agents—like, “You’ve got to do this first, and then that.” But more autonomous agents are given a goal and a set of tools and left to figure it out.

Ben Lorica: Outside of coding—because everyone has read about coding agents—what are the most interesting agents that you’ve seen in the enterprise? Ones that are kind of like coding agents in the sense that they’re doing a lot of the work of a certain role?

Zhou Yu: That’s a good question. I think the most important thing is really about outcomes—what are some AI applications that can be validated through outcomes? A lot of paperwork-based agents could be really, really useful. For example, underwriting insurance, underwriting loans, finding fraud, fighting financial crimes. These all involve paperwork, information aggregation, and identifying information inconsistencies. If agents can serve as another pair of eyes and find more of these discrepancies, that’s always valuable. And you can have a human verify at the end of the day. You can automate the processes of requesting additional information because it’s also multi-step. You’re doing deep research to understand a particular case, and then you can verify information through multiple different channels. Agents might be able to do that better than human beings.

Ben Lorica: I’m actually surprised we haven’t seen more in areas like accounting, though maybe people are going to announce these in the future. You have the Generally Accepted Accounting Principles, which is a book, and then you have the tax code, which is also a book. At some point, you might be able to have tax agents. It seems like that’s an area that would lend itself to something like this. Obviously, you might still have a human check, just like coders don’t check every line of code, but they have certain tests they can run or they can review some of the code. Same thing with some of these rules that are very quantitative in some ways.

Zhou Yu: I totally agree with you. There are just so many different things agents can do these days that could be potentially very helpful and complementary to how humans work.

Ben Lorica: We discussed how a platform like Arklex can help while you’re determining whether your agent is reliable and safe enough to deploy. But suppose I use Arklex and say, “Okay, this agent is good to go, we’ll deploy.” Will Arklex help me after I deploy?

Zhou Yu: After deployment, you have to update your CI/CD testing cases as well. You want to do regression testing to make sure your agents are still performing well. And when you update to new versions, you also need to run these tests. So it’s an iterative deployment process. We help not only pre-deployment, but also continuously through each iteration and with monitoring once the agent is deployed.

Ben Lorica: Let’s say you give me the green light, I deploy, and things are working for a few weeks. But at some point, drift or some unpredictable event happens: a war, or a once-in-a-hundred-years flood that I could not have tested for. The agent starts degrading, and there might be an incident. In cybersecurity, they have this thing called “incident response.” The point of incident response is not just to identify the incident and isolate it; the holy grail is to lower the mean time to recovery, to fix and then redeploy. Can Arklex help me with incident response?

Zhou Yu: That’s a really good question. Right now, we monitor your production traces and find the failed ones, and then give you suggestions on what went wrong. Currently, we don’t do auto-fix.

Ben Lorica: But you can at least say, “Okay, this might be an area you should look at.”

Zhou Yu: Yes, exactly. And then we’ll add a regression test for that particular use case to our CI/CD pipeline as well. So the next time you’re deploying a new version, that case is tested for as well.
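
The loop described here (find failed production traces, then make each failure a permanent regression test) can be sketched as follows. This is an assumption-laden illustration rather than Arklex's pipeline: the trace fields (`passed`, `user_profile`, `user_intent`, `expected_outcome`) and the `GoldenCase` type are hypothetical, and real traces would need an evaluator to decide which ones failed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    profile: str
    intent: str
    expected: str
    source: str = "manual"  # where the case came from


def add_regression_cases(golden_set: list, production_traces: list) -> tuple:
    """Turn failed production traces into new golden cases, skipping duplicates.

    Returns the extended golden set and the list of newly added cases.
    """
    seen = {(case.profile, case.intent) for case in golden_set}
    added = []
    for trace in production_traces:
        if trace["passed"]:
            continue  # only failures become regression tests
        key = (trace["user_profile"], trace["user_intent"])
        if key in seen:
            continue  # this scenario is already covered
        seen.add(key)
        added.append(GoldenCase(profile=trace["user_profile"],
                                intent=trace["user_intent"],
                                expected=trace["expected_outcome"],
                                source="production_failure"))
    return golden_set + added, added


golden = [GoldenCase("new customer", "return_item", "fulfill")]
traces = [
    {"passed": True, "user_profile": "new customer",
     "user_intent": "return_item", "expected_outcome": "fulfill"},
    {"passed": False, "user_profile": "customer with two orders",
     "user_intent": "check_order_status", "expected_outcome": "disambiguate_then_fulfill"},
    {"passed": False, "user_profile": "customer with two orders",
     "user_intent": "check_order_status", "expected_outcome": "disambiguate_then_fulfill"},
]

golden, newly_added = add_regression_cases(golden, traces)
print(len(golden), [case.intent for case in newly_added])
```

The extended golden set is what a CI/CD job would run against every new agent version, so a failure seen once in production is checked on every later release.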

Ben Lorica: When you talk to people in enterprises, how do they use public benchmarks? Whether it’s for coding or LMArena, how do people actually use those?

Zhou Yu: Enterprises aren’t really building foundation models; they’re building for their specific use cases, so these public benchmarks really don’t work for them. They have to build their own benchmarks, and that’s where we come in. If you think about Chatbot Arena, it’s also a simulation process or benchmark; we just turn that into a more pipeline-based process for them.

Ben Lorica: You know, in the coding area, people late last year and moving forward are rallying around something like Terminal Bench, where it’s much more of a realistic scenario around software engineering—here’s a test where the agent has to install a few packages and do more complicated things than just fix one isolated bug. It seems like this is the kind of thing that an enterprise will need to build, and maybe you’re helping them build the equivalent of Terminal Bench for their agent, right?

Zhou Yu: Yeah, in some sense, yes. We’re building these sandbox simulations for these agents, and these environments also update because your user base may drift, or you may have more features you want to test for. So in some sense, we give you a testing tool that you can always rely on.

Ben Lorica: What percentage of the people you talk to have gone beyond a single agent to applications that use multiple agents?

Zhou Yu: I think most of the people have multiple agents these days.

Ben Lorica: Are they multiple agents talking to each other? Or is it more like, here’s an agent that does this, here’s another agent that does that?

Zhou Yu: I think people are trying to isolate agent building by function—one for sales purposes, one for customer service purposes, one for more specialized users…

Ben Lorica: Or if it’s an internal agent, it’s just an agent that moves data from here to there.

Zhou Yu: Yeah. Most of them are trying to connect these agents together and are exploring multi-agent systems. But multi-agent systems are actually harder to evaluate and test.

Ben Lorica: Can you give the audience a high-level explanation for why that is? Why do multiple agents pose a particularly difficult challenge?

Zhou Yu: For example, there’s information leakage. What information can be seen by which agent? What memory can be shared among these agents? This is something we must be careful about. And then, because it’s multi-agent, it’s intrinsically going to generate longer traces and long-horizon reasoning. So it’s harder to debug and attribute which errors belong to which agents. Finding the cause or provenance of these errors is hard, because when the system makes an error, it’s not necessarily the most recent agent that caused it; it might be a different agent further upstream. It’s similar to how you write code—it’s harder to debug when you have multiple functions chained together.
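
The attribution problem can be shown with a toy trace in which the agent that surfaces the error is not the agent that caused it. The sketch below assumes something real systems rarely have for free: a checkable invariant on every step's output. The agent names, outputs, and checks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    agent: str                      # which agent produced this step
    output: dict                    # what it handed to the next agent
    check: Callable[[dict], bool]   # invariant this step's output should satisfy


def first_faulty_step(trace: list) -> Optional[tuple]:
    """Walk the trace in order and return (index, agent) of the earliest failed check."""
    for index, step in enumerate(trace):
        if not step.check(step.output):
            return index, step.agent
    return None


trace = [
    Step("router", {"intent": "refund"},
         lambda out: out["intent"] in {"refund", "status"}),
    # The lookup agent silently failed to find the order...
    Step("order_lookup", {"order_id": None},
         lambda out: out["order_id"] is not None),
    # ...so the billing agent, last in the chain, is where the error surfaces.
    Step("billing", {"refund_issued": False},
         lambda out: out["refund_issued"]),
]

surfaced_at = trace[-1].agent
root_cause = first_faulty_step(trace)
print(f"error surfaced at: {surfaced_at}, earliest failed check: {root_cause}")
```

Looking only at the last step would blame the billing agent; walking the trace from the start points at the upstream lookup step instead, which is the kind of provenance question that gets harder as traces grow longer.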

Ben Lorica: The domain where people are using multiple agents is coding, right? Because individual programmers are burning through thousands of dollars of tokens because they have multiple Claude instances going. But they’re not necessarily talking to each other—they’re doing different things. I’m interested in seeing how many people actually have agents that are functionally teams.

Zhou Yu: Agent swarms, right?

Ben Lorica: Yeah, have you come across any?

Zhou Yu: I think it’s still in an initial phase that people are exploring. I haven’t seen a very compelling success case that people talk about consistently. People are still experimenting.

Ben Lorica: Since you’re at Columbia and based in New York, do you talk to a lot of people in finance? Do you have a sense of what their adoption of agents looks like?

Zhou Yu: I think finance is traditionally a little more conservative compared to tech-forward companies because of reliability and infrastructure constraints and regulations. But I do see pressure from financial institutions like insurance companies and banks to adopt AI from the top down. People are actively exploring innovations and trying to figure out their strategies. Do they engage with a SaaS agent company, or do they hire AI engineers to build it themselves, or mobilize their previous data scientists to become AI engineers? How do you come up with different use cases to start with? I’ve seen many companies that have already been deploying things. For example, I’ve seen a company deploying virtual humans to discuss their sales reports and market analysis for clients. I’ve seen sales use cases, internal operations use cases, and things like that.

Ben Lorica: So is everything we talked about, Zhou, functionally QA for agents? Is that what it is?

Zhou Yu: Yes, it’s definitely quality assurance, for a lot of reasons. But it’s not the traditional quality assurance that people just do once every six months. It’s more about continuous quality assurance because the entire process of updating agents requires this.

Ben Lorica: And the advantage is that if you treat testing as a first-class citizen and you get really good at it, your agents become more capable, and you might be more confident adopting new models or new tools. But actually, at the end of the day, you might also be able to save a lot of money, right? Because if you are very good at testing, you might go from using the expensive Claude models to a cheaper open-weights model.

Zhou Yu: Yes, yes. That’s intrinsically what we wanted to promote as well. Even for companies that have a large consumption of these agents, they can train their own smaller models for these use cases. Because for them, it’s just limited domain knowledge and common sense, plus some regulations, and that’s good enough. A small model sometimes doesn’t have sufficient safety properties, but a large model has more intelligence than you really need. You want something in between: small enough, but with all the safety guardrails in place.

Ben Lorica: Let me close by talking about one of my favorite topics, which I’m going to spring on you: the future of computer science in academia. Obviously, you came of age before Transformers, is that fair?

Zhou Yu: Yes, yeah.

Ben Lorica: So then Transformers disrupted your entire area of NLP and language. Let’s just stick with NLP. What do the NLP people do now?

Zhou Yu: That’s a really good question. I have been teaching Intro to Natural Language Processing for nine years, and I’ve obviously been doing research for a similar number of years. It’s really exciting to see the changes that have been happening. But to me, NLP is an application area. We are using machine learning algorithms to solve language problems. Traditionally, people were really just looking at a very limited number of problems that people could get data for or that people cared about—like machine translation, speech recognition, document summarization, entity extraction. But now, because large language models have opened up so many more applications, you can do agent workflows, multi-turn interaction chatbots, customer service, emotional support. It opens up so many different applications for NLP people to work on.

And of course, there are the fundamental problems like: how can we make large language models more efficient? How can we make machine learning models more sample-efficient? Even for these large language models—for example, in my research area—a lot of it is about how to improve data efficiency and sample efficiency for RL algorithms. How can we use a limited amount of data and compute to reach similar outcomes as larger models? These are all still very useful problems for us to solve.

Of course, there was a disruption where people thought, “Wow, we’ve been doing all this work for nothing.” If you think about an application like parsing, nobody’s talking about parsing anymore, because parsing was an artificial intermediate task designed for machine translation back in the day. Nowadays, people are looking at more real-world, problem-driven tasks.

Ben Lorica: So I’m a second-year grad student. Should I continue to a PhD or should I just accept that job at OpenAI, Anthropic, or DeepMind?

Zhou Yu: That’s a really good question.

Ben Lorica: I think that’s a dilemma I’m hearing from my computer science professor friends—a lot of the best students have this choice.

Zhou Yu: Yeah, I’ve definitely experienced this myself with my own students. We’re also seeing all these frontier labs approaching junior faculty as well.

Ben Lorica: Let me give you the answer of one friend who I will not name, but from a top computer science department. His advice to his students is: “If you can finish your PhD in three or four years, go for it. Otherwise, you might want to go into one of those labs to pursue the more interesting frontier research.”

Zhou Yu: I guess the intrinsic goal of a PhD is really an education that teaches you how to pick a real, impactful problem, how to solve it, and then how to get people to adopt it and show the impact. I think it’s a very valuable learning experience. But of course, if you think that you can do similar things in frontier labs, that’s not a problem either. It’s really about: if you join a frontier lab, are the things that you do meaningful, and do you think you would keep doing them in the long term? I think joining these frontier labs to really understand how the sausage is made is very appealing.

Ben Lorica: And also having access to compute.

Zhou Yu: Yes, access to the compute, the data, the infrastructure to build things. It’s very exciting. They definitely provide a lot of resources that academia cannot provide. But what academia can provide is really time and autonomy. You can work on a two- or three-year project, aiming not for a small, quick win but for a planned, long-term one. And you have more support from people across disciplines. For example, many of my students are working at the intersection of systems and ML, so you actually have experts from both sides to guide you on longer-term questions. How do you do better branching for these agents? How do we build a data infrastructure that is designed for agents, not for humans? SQL is designed for humans, not for agents. How do you build a better memory architecture for these agents, especially for multi-agent communication? These are some of the more long-term questions.

Ben Lorica: Have you noticed a change in the profile of the students who are interested in pursuing a PhD these days?

Zhou Yu: I think there is still a lot of diversity there as well. We have seen people who have been working in frontier labs but wanted to do a PhD because they wanted to go more in depth. We have seen people who have been…

Ben Lorica: Plus they already got the stock options, so…

Zhou Yu: Yeah, yeah. I think overall, the pursuit of a PhD is really about the ability to do top-tier research. And some people are more intellectually driven.

Ben Lorica: Yeah. By the way, I forgot to ask you this earlier, but I’ll use it as a closing question: agent memory. Through all your work with customers and all the testing that you’ve done, what are some key takeaways about memory?

Zhou Yu: I think memory is going to be more and more important over time. It’s not only about your current agent’s memory, but also about how you integrate previous data about your user with your current agent, so that it’s incorporated as context. Your user would expect, “Oh, I’m using your new product agent, it definitely knows my previous interactions and preferences,” and so on. The hard part is really that making the agent intelligent requires a lot of context. The immediate memory is just one part, but the long-term memory—I like the idea that everything is user-driven. You have to have personalized memory for each individual user, and how you can structure that in a more efficient way is an open challenge.
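To make the distinction between immediate and long-term, per-user memory concrete, here is a minimal Python sketch. It is only an illustration of the idea described above, not how Arklex or any particular product implements memory. The names `UserMemory`, `MemoryStore`, `remember_fact`, `add_turn`, and `build_context`, and the three-turn window, are all hypothetical choices made for this example.

```python
from __future__ import annotations

from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class UserMemory:
    """Hypothetical per-user memory: durable facts plus a short window of recent turns."""

    # Long-term memory: preferences and facts carried over from earlier interactions.
    long_term: dict[str, str] = field(default_factory=dict)
    # Immediate memory: only the last few turns of the current conversation (assumed: 3).
    recent_turns: deque[str] = field(default_factory=lambda: deque(maxlen=3))


class MemoryStore:
    """Keeps one UserMemory per user ID so context is personalized per individual."""

    def __init__(self) -> None:
        self._users: defaultdict[str, UserMemory] = defaultdict(UserMemory)

    def remember_fact(self, user_id: str, key: str, value: str) -> None:
        # Store or overwrite a durable fact about this user.
        self._users[user_id].long_term[key] = value

    def add_turn(self, user_id: str, text: str) -> None:
        # Append a turn; the deque silently drops the oldest turn once it is full.
        self._users[user_id].recent_turns.append(text)

    def build_context(self, user_id: str) -> str:
        # Combine long-term facts (sorted for stable output) with recent turns
        # into a text block that could be prepended to the agent's prompt.
        memory = self._users[user_id]
        lines = [f"{key}: {value}" for key, value in sorted(memory.long_term.items())]
        lines.extend(memory.recent_turns)
        return "\n".join(lines)


store = MemoryStore()
store.remember_fact("user-1", "preferred_language", "English")
store.add_turn("user-1", "User: Where is my order?")
context = store.build_context("user-1")
```

Keeping the two kinds of memory in separate fields is one simple way to bound the immediate window while letting long-term facts persist. How to structure the long-term part efficiently is exactly the open challenge raised above, and this sketch does not attempt to solve it.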

Ben Lorica: And to expand on that—there’s also the notion of institutional memory and team memory. You can imagine being in a team, just like in a software engineering team: if I’m going to do something, I ask some other people on the team because some of them might have done something similar before, and they may even have code for me to start with. So it seems like anything that can reduce the amount of tokens we burn…

Zhou Yu: Yes, and also increase the context awareness.

Ben Lorica: Yeah. And with that, thank you, Zhou.

Zhou Yu: Nice to chat with you, Ben. Always. Bye.
