Coding Agents, Observability for Agents, Type Safety, and the Reality of MCP.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
In this panel discussion from the PyTorch conference, Ben Lorica speaks with Samuel Colvin (Pydantic), Aparna Dhinakaran (Arize AI), Adam Jones (Anthropic), and Jerry Liu (LlamaIndex) about the current state of Agentic AI. The group debates the necessity of complex multi-agent frameworks versus simple composable code, explores the critical role of type safety and observability, and explains why “Computer Use” is the next frontier. They also share controversial takes on why building dedicated AI teams might be a mistake and how enterprises should actually approach agent deployment.
Related content:
- A video version of this conversation is available on the PyTorch YouTube channel.
- Agentic AI Applications: A Field Guide
- Beyond Black Boxes: A Guide to Observability for Agentic AI
- Why Your Multi-Agent AI Keeps Failing
- Heiko Hotz and Sokratis Kartakis → When AI Agents Need to Talk: Inside the A2A Protocol
- Jakub Zavrel → How to Build and Optimize AI Research Agents
- Andrew Rabinovich → Why Digital Work is the Perfect Training Ground for AI Agents
Support our work by subscribing to our newsletter📩
Transcript
Below is a polished and edited transcript.
Ben Lorica: All right, so we’re here with many of the people who built some of our favorite tools, and they are all on the front lines of building agents. I’m going to start on a positive note and have them comment on what they’re seeing that works. Starting with Samuel at the far end here: what are some architectural patterns or agents that you’ve seen in the wild that really impress you, and what are some of the key lessons from them?
Samuel Colvin: I get to go first and I get to use the really easy one, which is obviously coding agents: they are working much better than I think anyone would have predicted at the beginning of this year. I said I would get this in as a hobby horse, and I’m getting it in right at the beginning: one of the things coding agents absolutely love is type safety. No one starting a JavaScript project today would choose plain JavaScript; you would always start with TypeScript. I think type safety is equally important, maybe even more important, in Python, and it’s particularly important for coding agents. The problem is it’s agents all the way down, right? You start with using a coding agent to build something with an agent framework. If that agent framework is type-safe, you’re going to have a way more successful time than if you’re trying to use a library without type safety.
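To make the type-safety point concrete, here is a minimal sketch (not from the panel) of validating an agent’s structured output with Pydantic; the `FlightBooking` schema and the JSON string are invented for illustration.

```python
# Validate an agent's structured output against a typed schema so mistakes
# fail loudly instead of propagating downstream.
from pydantic import BaseModel, ValidationError


class FlightBooking(BaseModel):
    passenger: str
    flight_number: str
    seats: int


# Hypothetical raw model output: "seats" is not an integer.
raw_llm_output = '{"passenger": "Ada", "flight_number": "BA123", "seats": "two"}'

try:
    booking = FlightBooking.model_validate_json(raw_llm_output)
except ValidationError as err:
    # The calling agent (or a coding agent) gets concrete, typed feedback here.
    print(err)
```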
Aparna Dhinakaran: Awesome. Hey everyone. I’m probably biased here, but one of the architectural patterns I’ve been seeing work really well is that teams who use evals are the only teams I see with agents that actually work. We’ll get to the coding-agent hot takes, but I don’t see a single team that doesn’t use evals if they’re serious about building an agent today. So evals are absolutely critical to getting your agent to work.
Adam Jones: Things that I’ve seen work really well are where you’re not an AI team trying to shove AI somewhere, but you are a product team trying to add value to customers. Where you are looking at your users, asking them what problems they have, and solving that with AI. Not saying, “Oh, AI is the solution, what problems can we bash? What nails do we have that we can hit with this hammer?” It’s actually: what problems do we need to solve? And then working backwards.
Jerry Liu: I totally agree with Adam’s point. I think the best teams can translate a business process into some sort of agentic workflow in a carefully crafted manner to make it actually work. Besides that, I think the method of providing context to any sort of agent has solidified a lot in the past year or so. There’s MCP, right? There’s also Claude’s skills.md. There are a few different ways to pass context, whether through an API call or through the flexibility of the CLI, where you can dynamically load unstructured text and reason over it. So I think that’s super exciting. I know we’re going to talk about what’s not working in the next question, so I’ll save my thoughts. But in terms of context, there’s a ton of stuff there. Every SaaS vendor out there has an MCP server these days, and there are a lot of new techniques around retrieval and how you process and structure unstructured data. So I do think we’ve made some real strides there.
Ben Lorica: I’ll just open it up to anyone here. Multi-agent systems: when should people go multi-agent and when should they avoid it?
Samuel Colvin: I’ll go, cause I have a strong opinion on this. First of all, I would say there are three definitions of an agent that are prevalent. Amongst AI people, it is an LLM calling tools in a loop. Amongst engineers, it is something you can put in a microservice. And amongst business people, it’s something that you can replace a human with. If you ignore the business one and talk about the other two, at the beginning of this year, we thought our microservice would have one agent inside it. I think now, although we’re still in the year of the agent, we’re now in the year of multiple different agents. So if you go and build a deep research agent and you do it sensibly, you end up with five or six, maybe more individual agents that make up that thing that might also be called an agent. I don’t think we need some special framework in general to connect them together. We can call agents within tool calls. We can get a result back from an agent which is structured and then call another agent with it. We don’t need to reinvent the wheel. It’s still just composability, which we’ve had in software forever.
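A minimal sketch of the composability Samuel describes: one agent’s structured result becomes another agent’s input, with no special framework. The `run_agent` helper and both “agents” are hypothetical stand-ins for whatever model calls you actually make.

```python
# "Multi-agent" as plain composition: call an agent inside ordinary code,
# get a structured result back, and pass it to another agent.
from dataclasses import dataclass


@dataclass
class ResearchBrief:
    topic: str
    key_findings: list[str]


def run_agent(prompt: str) -> str:
    """Placeholder for a real LLM/agent call via whatever SDK you use."""
    return f"stubbed response to: {prompt}"


def research_agent(topic: str) -> ResearchBrief:
    # Agent one: gather findings and return a structured result.
    raw = run_agent(f"List three key findings about {topic}")
    return ResearchBrief(topic=topic, key_findings=raw.splitlines() or [raw])


def writing_agent(brief: ResearchBrief) -> str:
    # Agent two: consume the structured output of the first agent.
    bullet_points = "; ".join(brief.key_findings)
    return run_agent(f"Write a short report on {brief.topic} using: {bullet_points}")


if __name__ == "__main__":
    # One agent's structured output is simply the next agent's input.
    print(writing_agent(research_agent("type safety in agent frameworks")))
```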
Aparna Dhinakaran: I mean, I think the part that’s hard—and this is from our own experience building an agent—is that planning across these agents is not a solved problem today. How do you pass context from one agent to another? And how do you do the handoff properly? That’s still very much a hard problem to solve. There are all sorts of different techniques: some teams pass the most recent N conversation turns or some slice of that information, and other teams take more of a summarization approach. I don’t think standard practices for context handoff between agents have really generalized yet.
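A rough sketch of the two handoff styles Aparna mentions—a recent-turns window versus a running summary. The `summarize` function is a placeholder for an LLM call, and neither approach is a settled standard.

```python
# Two common ways to hand context from one agent to the next.
from collections import deque


def summarize(turns: list[str]) -> str:
    """Placeholder for an LLM summarization call."""
    return f"summary of {len(turns)} earlier turns"


def handoff_recent_turns(history: list[str], n: int = 5) -> list[str]:
    # Style 1: pass a window of the last n turns, verbatim.
    return list(deque(history, maxlen=n))


def handoff_summary(history: list[str], keep_last: int = 2) -> list[str]:
    # Style 2: compress everything except the most recent turns into a summary.
    older, recent = history[:-keep_last], history[-keep_last:]
    return ([summarize(older)] if older else []) + recent


if __name__ == "__main__":
    history = [f"turn {i}" for i in range(1, 11)]
    print(handoff_recent_turns(history))
    print(handoff_summary(history))
```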
Ben Lorica: Yeah, inter-agent communication seems very difficult.
Adam Jones: I think a lot of people overcomplicate agent communication. Actually, I think you should focus on making sure your agent can do all the possible things, and then, with models getting smarter and smarter, they should be able to figure out how they want to communicate or how they want to sort out a lot of their structuring. I maybe have a spicy take that you don’t need a load of agents with different roles and different personas. Having one agent that can duplicate itself or call itself in parallel is sufficient for a lot of tasks, provided it’s flexible enough. We also hear from a lot of customers who set up very complex message buses or ways to pass context between agents. But we have found in our own deployments that Slack is actually a great platform for having agents communicate with other agents and with humans—and you get a bunch of observability for free. Maybe things evolve in the future, but I’d say if you’re starting with something, start simple, start with what you know, and then build more only as you actually need it.
Jerry Liu: I think there’s room for multi-agents. I do think for the bulk of your use cases to start with, they’re kind of overrated. If you just start off prompting Claude Code or your favorite agent like ChatGPT, and you have a nice prompt with a set of tools, you’re going to get decently far. The next step is actually building some sort of workflow that encodes the process that you want to tackle. I wouldn’t really model every step there as an agent. As Sam said, it’s basically just a program, right? You just write step one, write step two, and then you just do the thing. You don’t need to use some super fancy multi-agent framework. There is value in context engineering, not necessarily the technical details of how you break down context, but more about injecting your prior knowledge so that the agent can actually solve the task at hand. I think that’s super important in the translation layer of business process to some agentic architecture that realizes value. But yeah, I think some super complex multi-agent swarm thing is overrated. I mean, I’m going to be wrong in a year from now, but for 90% of your use cases, you definitely don’t need it.
Ben Lorica: Memory and state management. What have you seen that works? It seems to me that there’s no consensus. People are doing all sorts of different things. What’s working in this area? You can start, Sam.
Samuel Colvin: I’ve gone first every time, but I’ll keep going first and take the easy answer. I come back to my previous point that it’s still engineering; I don’t think for the most part we need to reinvent the wheel. Most people I hear from who have gone out and done embeddings and vector search have either stepped back from that entirely or have done some kind of hybrid—vector search plus some other kind of filtering. We’re not in the place we were a year ago, when people thought that to build an agent you had to have RAG, and RAG meant vector search. People get a long way without it. Sure, there are places where embeddings are an incredibly powerful tool—if it wasn’t for the rest of GenAI, we would all be eulogizing embeddings; they’re just not as powerful as chat-based LLMs. So there are places where they work, but again, it comes back to: don’t start with the hammer of embeddings and look for a nail. Start with the problem you have and work out what’s going to work. If it turns out that Postgres or Elastic or whatever your normal full-text search is will solve your problem, then you don’t need embeddings. In terms of state management, I’ll admit my bias, which is that we don’t have all of that built into Pydantic AI. That is mostly because, when applications get to the point of going into production, our impression is that most of the solutions are custom enough that giving too much in the way of abstraction actually slows things down long term rather than accelerating you. But you may disagree—LlamaIndex has a somewhat different take on that point.
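A small illustration of the “your existing full-text search may be enough” point. SQLite’s FTS5 is used here only to keep the example self-contained; the same idea applies to Postgres or Elasticsearch, and the documents are invented.

```python
# Plain full-text search with BM25-style ranking, no vector database involved.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Retry policy", "Agents should retry failed tool calls with backoff."),
        ("Memory notes", "Summarize old turns; keep recent turns verbatim."),
        ("Search notes", "Full-text search often beats embeddings for exact terms."),
    ],
)

# A keyword query, ranked by relevance.
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT 5",
    ("embeddings OR search",),
).fetchall()
print(rows)
```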
Jerry Liu: Oh, no, I agree. I don’t think there’s a real abstraction for memory. Right now it’s either just using a storage layer—a SQL database or an object store. There’s probably some future bet on memory compaction and related techniques. Claude Code does the thing where it summarizes context and progressively iterates on that, which is a nice thing to have. My hot take is that later on this gets baked into the model layer somehow: you’ll just have some personal fine-tuned thing with a semantic representation of your past experiences. I don’t know how many of you saw the DeepSeek OCR paper, but one of the insights there was that you could store lossy image representations of the text as different resolutions of your memory. So there are interesting questions about whether memory gets stored as text or as images. But I don’t think there’s a real abstraction: it’s either going to live in some storage layer or it’s going to get baked into the model layer.
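A rough sketch of the progressive compaction idea Jerry describes, assuming a stubbed summarizer and a crude token estimate; this is illustrative, not how Claude Code actually implements it.

```python
# Keep a running summary and fold older turns into it whenever the
# estimated context size exceeds a budget.


def summarize(summary: str, turns: list[str]) -> str:
    """Placeholder: fold new turns into the existing summary via an LLM."""
    return f"<summary covering {summary or 'nothing'} plus {len(turns)} turns>"


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)


class CompactingMemory:
    def __init__(self, budget_tokens: int = 200):
        self.budget = budget_tokens
        self.summary = ""
        self.recent: list[str] = []

    def add(self, turn: str) -> None:
        self.recent.append(turn)
        context = "\n".join(filter(None, [self.summary, *self.recent]))
        if estimate_tokens(context) > self.budget:
            # Compact: old summary and older turns collapse into a new summary.
            self.summary = summarize(self.summary, self.recent[:-1])
            self.recent = self.recent[-1:]

    def context(self) -> str:
        return "\n".join(filter(None, [self.summary, *self.recent]))


if __name__ == "__main__":
    mem = CompactingMemory(budget_tokens=50)
    for i in range(10):
        mem.add(f"turn {i}: the agent did something and observed a result")
    print(mem.context())
```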
Ben Lorica: Anyone else?
Adam Jones: I don’t think I have much else to add, other than: don’t build an overcomplicated solution, because the models will change and your whole solution won’t work in six months if you’ve spent ages overthinking it.
Ben Lorica: Aparna brought up evals, and a lot of people are talking about observability just for agents. I know there’s disagreement in this panel about the need for that. So, make your case.
Aparna Dhinakaran: My case is that observability and evals are important. I don’t see a team building an actual production agent without some sort of observability around what the agent is actually doing. What ends up happening is that people very often confuse offline evals with online evals when they’re talking. Just to clarify some language: offline evals are typically when you’re testing an experiment or an iteration—on some prompt, swapping out models, whatever it is—and you’re running an eval on that experiment. Online evals are actually evaluating the spans, traces, whatever, that the agent actually produced in your production application. There’s so much hoopla made about offline evals: “Oh, test before you deploy and don’t ship until you have X number of evals.” To be quite honest with you, I see most people shipping into production on vibes, and then adding tracing and evals on top of it afterward. And that feels like the right paradigm, because you don’t care about it until you actually have data in production; you don’t need to evaluate until it’s actually touching your customers. And to be quite honest with you, the old world of scalar-based rewards and feedback is just not good enough to categorize the broad, general patterns of mistakes that agents can make.
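A minimal sketch of the offline/online distinction Aparna draws: the same judge runs over a curated test set before a change ships (offline) or over production traces (online). The judge here is a trivial placeholder for an LLM-as-a-judge call, and the data structures are invented.

```python
# The same scoring function applied in two settings: a fixed test set
# (offline) versus traces of what the agent actually did in production (online).


def judge(question: str, answer: str) -> bool:
    """Placeholder LLM-as-a-judge: returns True if the answer looks acceptable."""
    return len(answer) > 0


def offline_eval(test_cases: list[dict], agent) -> float:
    # Offline: run the agent on a curated dataset before shipping a change.
    results = [judge(c["question"], agent(c["question"])) for c in test_cases]
    return sum(results) / len(results)


def online_eval(production_traces: list[dict]) -> float:
    # Online: score what the agent actually did for real users.
    results = [judge(t["input"], t["output"]) for t in production_traces]
    return sum(results) / len(results)


if __name__ == "__main__":
    agent = lambda q: f"answer to {q}"
    print(offline_eval([{"question": "What is MCP?"}], agent))
    print(online_eval([{"input": "hi", "output": "hello!"}]))
```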
Samuel Colvin: You told us we should be controversial, so I’ll disagree a little bit. I agree with you entirely that there is a place for offline evals, but lots of applications don’t need them. Tracing, on the other hand—observability into what the hell this agent is doing—people generally do want from the first day of development. I was going to be more controversial and say maybe it’s that your platform is harder to use than ours, which is why they add it later, but I won’t say that. But yes, I think everyone agrees observability is absolutely necessary. There’s some debate about whether AI observability will ultimately be a separate domain, a separate product from general observability. My belief is they’ll end up as one platform; AI observability will go away the same way cloud observability and web observability have gone away. But on evals: when people are negative about evals—when Swyx has some hot take that evals aren’t necessary—maybe he’s just trying to be controversial, but what he’s talking about is online evals. Or maybe he’s just using different words, which is fine. Offline evals—maybe. Online evals—obviously some way of understanding what’s happening to your model over time, and having a basic rubric for trying to improve it, is really important. Now, the point that coding agents haven’t got evals is an interesting one. Maybe part of it is that they get such quick feedback from things like type safety or running the tests that it’s a slightly different domain from many agents, which can’t immediately know whether what they’ve done is right. I don’t know. Maybe you have a take on why things like Claude Code supposedly don’t have evals.
Adam Jones: You do have evals. We were just talking about it.
Samuel Colvin: We do have some evals. I would say we probably have fewer evals than people expect. We’re trying to expand our set of internal evals on this. But actually what matters more—and we’ve seen success looking at—is just look at what your users are saying. Look at your product analytics. To some extent online evals are very similar and related to product analytics, but really you shouldn’t be caring like, “Oh, what’s my P95 latency?” It’s “What are my users angry about? And is that latency?” That’s the thing you need to dig into and improve. I suppose that some of the most sensitive uses of agents for the most part don’t have detailed product analytics about how I’m using Claude Code day-to-day. So you’re slightly relying on what people are ranting about on GitHub or Twitter rather than being able to go and look in PostHog at whatever metric about user behavior.
Adam Jones: Yeah, we have some anonymous metrics, and we also do a lot of user research sessions. I think actually just watching people use the product and then realizing, “Oh man, we built something really confusing,” is a way better way to look at things a lot of the time than just staring at graphs. There’s places for that, but nothing really beats looking at product analytics, talking with users, et cetera.
Aparna Dhinakaran: I genuinely think that evals are just a way of getting additional data to do that type of human review—review of users using your product. You could have someone manually go through and look at all of those sessions, but I just don’t think teams, especially once you actually have a successful agent, can do that for every single instance. So you need evals. And some of you might say, “Well, the evals themselves aren’t good; they’re not going to capture every case.” Well, evals are just like other prompts or even models: you can continuously fine-tune them and improve them with examples. There are examples of people iterating on their eval templates using meta-prompt learning so that you can consistently make those eval templates better. But it’s an aggregate way of understanding, of providing English-based feedback, especially if you use evals with explanations. The most useful evals I see aren’t just binary “yes/good” evals; they actually include an explanation of why the LLM-as-a-judge marked something as a failure—it was because of X, Y, Z reasons. That explanation and feedback is much richer data to then pass into some sort of prompt optimization library. It’s not just us with prompt learning—DSPy released GEPA, one of their latest approaches, which uses meta-prompt learning: essentially English-based feedback that is passed into a meta-prompt along with your original prompt to produce a more optimized prompt for your use case. That’s where explanations get really, really powerful.
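A sketch of evals with explanations feeding a meta-prompt, roughly in the spirit Aparna describes; `call_llm` is a placeholder and the prompts are illustrative, not the templates of any particular library (Arize, DSPy, or otherwise).

```python
# A judge that returns a verdict plus an explanation, and a meta-prompt step
# that uses the explanations to propose a revised prompt.
import json


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    return json.dumps({"verdict": "fail", "explanation": "cited the wrong invoice field"})


def judge_with_explanation(task_input: str, agent_output: str) -> dict:
    # Ask the judge for a verdict *and* a reason, not just pass/fail.
    prompt = (
        "Judge the agent output. Respond as JSON with 'verdict' (pass/fail) "
        f"and 'explanation'.\nInput: {task_input}\nOutput: {agent_output}"
    )
    return json.loads(call_llm(prompt))


def improve_prompt(original_prompt: str, failures: list[dict]) -> str:
    # Feed the English-language explanations into a meta-prompt that rewrites
    # the original prompt to address them.
    feedback = "\n".join(f"- {f['explanation']}" for f in failures)
    meta_prompt = (
        "Here is a prompt and explanations of failures it produced.\n"
        f"Prompt:\n{original_prompt}\nFailures:\n{feedback}\n"
        "Rewrite the prompt to address these failures."
    )
    return call_llm(meta_prompt)


if __name__ == "__main__":
    verdict = judge_with_explanation("extract the total", "Total: $0.00")
    print(improve_prompt("Extract invoice fields as JSON.", [verdict]))
```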
Jerry Liu: Yeah, it’s interesting, right? Because this is a PyTorch conference, where you’re probably used to doing gradient descent or Adam or whatever on your parameters. This is basically just doing it in English—in the space of discrete English tokens.
Aparna Dhinakaran: Right. And I don’t know how many of you watched the Karpathy talk recently, but he had this whole take on how RL is not the right approach for where models are today. He had this one line: “It’s like sucking supervision through a straw.” Because ultimately what RL does is find the path that led to the right answer and then upweight that entire path, even if it took some wrong steps along the way. What you actually want is something more like process-based supervision, where incrementally, as you’re taking steps, you get feedback on whether each next step was right or not. And I think Karpathy says there are places where LLM-as-a-judge does that today and places where it could be better, especially when you’re thinking about things like trace evals or trajectory evals. And I don’t think Pydantic has that level of complexity today, which is where I think the richer and deeper types of evals are going to come more and more this next year.
Ben Lorica: I guess very quickly, let’s shift over to what’s not working. So this can be anti-patterns, things that people should avoid, or you can also make observations about where the tools for building agents are still painfully immature.
Samuel Colvin: I think the most obvious one is this idea that we had, like I said earlier at the beginning of this year or last year, where we could just have one agent, we gave it all of the tools, and it could do everything and we could just set it off and it will solve our problem. That works in some domains like coding agents where they have lots and lots of feedback and human supervision. It often does not work in other applications. And again, the other thing I said, going and trying to build with a coding agent where you have no type safety and expecting to be able to write unit tests as the only way of giving your agent feedback is destined to fail.
Aparna Dhinakaran: I’ll do my hot take on this one. We were just talking about it, but there was a lot of buzz that everyone was going to stand up their own MCP server and every external MCP was going to talk to every other external MCP. I am not seeing that today. I feel like MCP is used a lot more internally. I don’t know if it’s because of security reasons or something else, but the providers of these different servers don’t feel like they’re getting the traction I thought they would at the beginning of the year.
Ben Lorica: By the way, for our audience, Adam works on the MCP team at Anthropic.
Adam Jones: Yeah. We’ve definitely seen a lot more internal uptake of MCP than perhaps we were expecting. I think we’re still seeing strong external uptake, and there are going to be a few exciting external MCPs coming up. But I do think trust and safety is going to be a big blocker on whether you actually want to use these services. One thing we’ve mentioned is that MCP is a great way to bring context into agents, but there are many other ways to do it. We’re starting to see agents with a wider and wider array of tools, and this is eating huge amounts of context. A lot of agent harnesses were designed for the past world where you might only have a handful of tools; they work very well for that, but fail quite hard when you introduce hundreds or thousands of tools. I actually think agents with thousands of tools is the right direction to be going, but our harnesses have not caught up. So we’re looking at things like progressive disclosure of new tools and new skills as the agent continues down the path. Skills are another way of packaging that up neatly. I think programmatic tool calling is also a really exciting area: rather than making tool calls individually and getting all the results back in context, you’re able to write, say, type-safe programs that call out to tools, compose them together, and let the agent control its own control flow by writing code—because models are getting really good at this, and we should leverage that strength. That’s where I think it’s going, but it’s not so great today.
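A toy illustration of the programmatic tool calling Adam describes: the model emits a small program that composes tools in code, so only the final answer has to flow back through its context. The tools here are hypothetical stand-ins.

```python
# Instead of issuing fifty tool calls one at a time and reading every result
# back into context, the agent writes and runs a program that composes them.


def list_invoices(month: str) -> list[dict]:
    # Hypothetical tool: return invoices for a month.
    return [{"id": i, "amount": 100 * i, "month": month} for i in range(1, 51)]


def get_customer(invoice_id: int) -> dict:
    # Hypothetical tool: look up the customer on an invoice.
    return {"invoice_id": invoice_id, "name": f"customer-{invoice_id}"}


def agent_generated_program() -> list[str]:
    # What the agent might write and execute in a sandbox: the intermediate
    # results stay in code, and only this small list goes back into context.
    big = [inv for inv in list_invoices("2025-01") if inv["amount"] > 4000]
    return [get_customer(inv["id"])["name"] for inv in big]


if __name__ == "__main__":
    print(agent_generated_program())
```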
Jerry Liu: For those of you who don’t know, a lot of our stuff is basically pro-code tools to help you build agents over your documents—things like invoice processing, deep research over your financial reports, anything that’s a PDF, a PowerPoint, that type of thing. So we spend a lot of time working with developers on building these agents through our code-based framework. But recently I’ve been very interested in how you can bridge the gap between the developer and the non-technical user—enabling a non-technical user to go in and build an agent by themselves, without having to write a line of code. There are some interesting case studies of non-technical people within these companies, like people in the finance department, using Claude Code and other coding agents to just go and build stuff, and that’s really cool. One test I have right now—because I’m kind of bored by the no-code/low-code builders; that’s the old-school drag-and-drop UI stuff—is how much I can just prompt Claude Code to one-shot generate an agent workflow for me using our own framework. I successfully got it to output a document extraction workflow, which is an extremely linear sequence with about three steps in the middle. I haven’t quite gotten it to one-shot a pretty complicated business process yet. I think a good benchmark for any of these agents is how much they can just one-shot, because then the user doesn’t have to go and peek in, break glass, and understand what’s going on. And I think that’s going to be a good test for enabling non-technical adoption of these AI tools.
Adam Jones: We’re working on it. But I think another area that is often a mistake is companies having AI teams—and I suppose there are many people here who are on AI teams. Not because that is fundamentally the wrong thing to have, but because you end up focusing on building AI solutions, when actually what you should be doing is building platforms to enable everyone else in your company to build AI solutions—like this one-shot, low-code style, but for AI agents. When we’ve done that at Anthropic, we’ve had a lot of success: our sales teams are now actually building agentic workflows, writing the code with Claude Code and getting reviews from other people. That’s who you want to be empowering, because they understand their problem domain, they understand their real problems, and they’re going to iterate and solve those problems much, much better than you ever could—because I don’t understand what all these salespeople do a lot of the time. They do, and they can build better agents.
Ben Lorica: So we’re winding down, so let’s look toward the future—the next, I don’t know, 6 to 12 months. What agent capabilities are you expecting, and how should our audience prepare for these new capabilities? Are there going to be new tools just for agents? Because recently people have been talking about revisiting or rebuilding parts of the software stack for agents—for example, databases. I was talking to Adam in the back, although he disagrees with the need to rebuild tools just for agents. But anyway, look to the future and tell us what you’re thinking.
Samuel Colvin: I don’t have a clue about the future, but I will say that one thing I think will happen more, because it works in the present, is using interfaces like SQL to allow AIs to go and search things. If you can expose a SQL interface for people’s AIs to come along and introspect your data, it works extremely well. Obviously I’m biased because that’s what we have in Logfire, and it works really well. It means we have not had to go and build an AI SRE; we just have Claude Code connect, query SQL, and solve the bug. So I think that is something that will expand.
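A minimal sketch of the “give the agent a SQL interface” pattern; SQLite and the `spans` schema here are invented stand-ins for whatever store you actually expose (Logfire’s own interface may differ).

```python
# Expose a read-only SQL query as the tool an agent uses to introspect data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (trace_id TEXT, name TEXT, duration_ms REAL, error INTEGER)")
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?)",
    [
        ("t1", "call_llm", 1200.0, 0),
        ("t1", "fetch_invoice", 90.0, 1),
        ("t2", "call_llm", 800.0, 0),
    ],
)


def run_sql(query: str) -> list[tuple]:
    """The tool handed to the agent: execute a read-only query."""
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("read-only: SELECT queries only")
    return conn.execute(query).fetchall()


# A query an agent might write while investigating a bug.
print(run_sql("SELECT name, COUNT(*) FROM spans WHERE error = 1 GROUP BY name"))
```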
Aparna Dhinakaran: I think it’s going to be the year of planning—coining it now. And I think things like the skills.md that Claude Code released are really cool, because getting a bunch of tools to come together and work as some sort of skill starts making planning for our agent slightly easier—we’ve noticed it ourselves—and actually more successful at calling the right sequence of things. So, yeah, planning is what I’m excited about.
Adam Jones: I’m very excited for end-to-end agents that really can do almost everything your employees can do. MCP has been great for enabling access to tools where there are nice APIs or something nice you can touch behind the scenes. But computer use, I think, is the last mile: you have all these long-tail services that don’t have nice APIs or that you can’t be bothered to wrap in an MCP server. Having an agent that’s able to use a computer effectively is the missing piece—we’re starting to see models get just about competent enough to do this, and I think over the next year we’ll see them really shine at it. That’s what I’m very excited for, as well as hybrid agents that use a combination of skills, tools, writing code, and computer use, mushing it all together to achieve things most effectively.
Ben Lorica: Come on Adam, surely there’s things around foundation models that you can share.
Adam Jones: Foundation models? I mean, I think computer use is a big thing that we’re putting into foundation models—they’re going to get a lot better at this. Faster models as well: I think you might have a reasonably competent Sonnet model, or something that’s able to plan and think through things, that delegates to a very rapid model that can go off and actually make the tool calls or spin up a computer and click through interfaces very quickly. I think delegating between different models within the same agent might be exciting.
Jerry Liu: Yep. Coding agents are super general, and I think computer-use agents are even more general, so that’s super exciting. One thing that’s a little more relevant to our business and our history: I’m kind of excited about what RAG 2.0 looks like. RAG itself—semantic search over a vector database—is kind of dumb at this point; everyone knows how to do it, and you could do it in our toolkit in about five lines of code. What’s very interesting is how you have an agent use the CLI, or use various tools to scroll up and down a page or scan across a file directory. It probably integrates with computer use in terms of how you actually traverse different types of content and search for things of different sizes. So this toolbox of search for agents is something I’m personally super interested in.
Ben Lorica: And by the way, enterprise search is still not solved.
Jerry Liu: Yeah, indeed. Yeah. Yep.
Ben Lorica: So, anyway, we’re out of time. I think you folks are going to stay around, right? So people in the audience can look for you in the expo hall. Let’s thank our panel.
