As Code Generation Speeds Up, Who Tests the Output?

Evan Marshall on AI Code Generation, The Verification Bottleneck, and the Future of QA.



In this episode, Ben Lorica talks with Evan Marshall, CTO of Ito AI, about why software testing and QA are becoming the critical bottleneck in the age of coding agents. They discuss automated QA on every pull request, the limits of static analysis, the growing need for runtime and behavioral testing, and why verification matters more as teams generate and ship code faster. The conversation also explores the future of the QA profession, the overlap between testing and security, and the infrastructure challenges involved in running long-lived, model-driven testing agents at scale.


Transcript

Below is a polished and edited transcript.

Ben Lorica: All right. Today we have Evan Marshall, CTO of a new startup called Ito AI, which you can find at ito.ai. The tagline is “Automated QA on every PR,” so automated QA on every pull request: code reviews that execute your app in a real browser, testing every affected flow with video and failure details. And with that, Evan, welcome to the podcast.

Evan Marshall: Thank you so much for having me.

Ben Lorica: All right, so I guess let’s start at the thesis. What’s the problem you’re trying to solve?

Evan Marshall: I’ve been a developer for 15 years, and the rate of code generation has never been higher. People are pushing out probably an order of magnitude more lines of code than they used to. But how do we verify that that code actually works and does what we want it to do? I think this is still a very unsolved problem, and that’s what we’re trying to tackle.

Ben Lorica: And one of the reasons there’s obviously more code being pushed now is because of these coding agents, right? I think there’s a bit of a laissez-faire attitude around code now, in the sense that, “All right, this may have bugs here and there, but I’ll push it out because I can always kind of start over from scratch.” At the time you started the company, how mature were those coding agents? Were coding agents one of the tipping points for you folks?

Evan Marshall: I’d say the coding agents are still rapidly improving. They are one of the unlocks that make agentic testing and environment provisioning possible. But even now, it’s a lot of work to get the agents to do what you want them to do. So I’m excited for the next generation of models.

Ben Lorica: So I guess we’ve established that this is going to be a growing problem, in that code generation is outpacing testing to some extent. On the other hand, there are AI testing startups—companies that build software testing tools that use AI. You write a prompt, and then they generate test suites. But it seems like most of the discussion, honestly, is around the actual coding, not the testing. Am I missing something? The amount of investment in coding tools seems to outpace the testing tools.

Evan Marshall: Right now there are kind of two approaches to QA. One is more code: end-to-end testing, unit testing, and everything in the middle, like integration testing. More auto-generated code to fix your auto-generated code is an incremental improvement, but it doesn’t really solve the problem. How are you sure that your end-to-end tests are actually testing what you want? And if they’re fragile and you just explode your end-to-end test suite, what kind of developer experience is that? CI gets slower and more expensive, and you have to deal with more fragility.

Then there’s manual testing. Manual testing is where a lot of the bottleneck has shifted to developers. When I was at the beginning of my career, you’d have QA as a separate team, and now a lot of developers are expected to do their own QA.

Ben Lorica: I guess some people who are not following the space closely might say, “Well, at some point maybe the coding agents get so good, Evan, that they don’t introduce any bugs,” which is obviously not going to happen. But maybe they’ll introduce bugs at a rate where human testers can keep up. Is that a likely scenario?

Evan Marshall: I think the problem is going to be that even if the agentic coding loop is perfect—and we’re part of that journey of perfecting the agentic coding loop—even if that is perfect, humans aren’t perfect. Humans can’t describe what they want in one shot. And when you add a group of people—try to pick where your company wants to go for lunch. You can’t just say the first thing that comes to mind and say, “That’s where we’re going to lunch.” There’s a lot of collaboration.

Where this heads is really fast prototyping, really easy collaboration, organizing efforts across multiple people for consistency, and a lot of taste-making that comes from groups. I imagine that product and engineering in the future will look kind of like a writer’s room, where people bring their little stills and say, “Hey, this is what I’m working on. How do you think my episode fits into the season?” I think collaboration is not going to disappear, and it’s going to happen on multiple levels—both the product and user-experience side, and also the technical design side, including what trade-offs and decisions you want to make at a higher level.

Ben Lorica: Let’s drill down a little bit. We’ll go back to broader AI for software testing and QA later, but let’s drill down on what you folks do. Like I said at the top, the tagline is “Automated QA on every PR.” Obviously, testing itself has a big surface area. Why did you focus—or maybe you’re not focused, maybe it’s just the tagline—on QA for pull requests?

Evan Marshall: What I’ve noticed over my career is that more and more of the testing responsibility is on the author of a PR. Moreover, I’m a little biased here—I’m a little engineering-supremacist in terms of who’s doing the hard work—but no one really tests a feature or anything in software as much as the person creating the PR.

So by giving them leverage—if you change a button that appears in 50 different places, as a software engineer you’re probably going to check the three most important places, but you’re not going to check the other 47. That’s what Ito is for. Ito is going to check the other 47, give you screenshots, give you a video of the interaction, give you any relevant logs, and really act as a force multiplier for people testing their own code.

Ben Lorica: Broadly speaking, there are static tests and runtime execution tests. Which of those do you focus on, or do you do both?

Evan Marshall: We really focus on execution tests. The gap we identified is that basically every single review bot right now looks really hard at your code. It agentically, probabilistically pokes around your code, tries to identify issues, and then combines that with feedback from a number of static analysis tools. That’s great, but it’s like buying a car by looking at the engine really hard. Actually doing the test drive and seeing how it performs is a major gap. It also misses the behavioral side: how users are actually going to interact with an application, and what kinds of issues they’re going to face. That’s where we noticed the biggest gap, and that’s what we’re focused on.

Ben Lorica: I’m not super familiar with this area, but I think there are some open-source frameworks that over the years have gotten traction. The one that comes to mind is Playwright, and then there’s Appium. Do you guys use any of these tools?

Evan Marshall: Yeah, we definitely use a lot of them under the hood. We build on top of them. The way I view our experience is that we make it as seamless as possible for you to leverage these AI tools without having to do a lot of complicated setup yourself. This includes everything from organizing the agentic pipelines to testing really thoroughly, to building lots and lots of infrastructure to make things happen quickly and efficiently. A lot of the ingredients are out there, and we combine them in a really high-fidelity way so that all you have to do to get started is connect your repo, and we’re going.

Ben Lorica: And then your AI agents will automatically run. So who owns the test? In other words, if I hook your tool up and it starts testing, do I retain some of those tests for future use in case I don’t want to burn tokens again?

Evan Marshall: Right now, we don’t export to a Playwright script for CI. I think something that’s really interesting about where we’re at is that we’re in the plastic age of code. Code used to be something where you built it, maintained it, and wanted it to last forever. We’re entering single-use code territory, where you generate some code, run it once, and then throw it away.

This gives you a lot of possibilities. We don’t want you to have to maintain an end-to-end test suite, and we think that’s something you’ll do in parallel. But we run so many tests on a PR that it doesn’t scale. You can’t add it as part of your regression suite unless you want your CI to take hours.

Ben Lorica: And how do I know that you have good coverage and you’re clicking on all the buttons and anticipating all the insane ways users will interact with my application?

Evan Marshall: I think trust is something you have to earn over time with developers. It’s not going to be immediately given. But we have lots of visibility into everything that’s going on. We have agentic real-time summaries of exactly what the agent is doing, accompanied by video, logs, and lots of other artifacts. You can see exactly what the agent is doing in its little sandbox VM.

Ben Lorica: So you folks don’t do any static analysis? In other words, maybe my problem is something basic, like, “Hey, you’re using an outdated version of a Python library that has known security problems.”

Evan Marshall: I think we play really nicely with the other AI tools. I see us as an addition instead of a replacement right now. To be honest, static analysis is the easiest thing for code gen to fix. What’s really hard is: if I want to simulate the activity of a hundred users using your change, how do you do that on your machine? You can’t. You have to put it in the cloud somewhere, and you have to run all of those tests in parallel. That’s basically what Ito does at the infrastructure level.
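To make the parallelism point concrete, here is a minimal sketch of fanning out many simulated user sessions with a concurrency cap. It is not Ito's implementation: `simulated_user` and `run_sessions` are hypothetical names, and the session body only sleeps where a real harness would drive a browser in an isolated environment.

```python
import asyncio
import random


async def simulated_user(user_id: int, delay: float) -> dict:
    """Hypothetical stand-in for one user session against the change under test."""
    # A real harness would drive a browser here; sleeping only mimics I/O wait.
    await asyncio.sleep(delay)
    return {"user": user_id, "passed": True}


async def run_sessions(n_users: int, max_parallel: int, seed: int = 0) -> list:
    """Run n_users sessions, with at most max_parallel in flight at once."""
    rng = random.Random(seed)
    limit = asyncio.Semaphore(max_parallel)

    async def guarded(user_id: int) -> dict:
        async with limit:  # cap concurrency so the pool is not overwhelmed
            return await simulated_user(user_id, rng.uniform(0.001, 0.005))

    # gather returns results in the same order as the awaitables it was given
    return await asyncio.gather(*(guarded(i) for i in range(n_users)))


results = asyncio.run(run_sessions(n_users=100, max_parallel=20))
print(sum(r["passed"] for r in results), "of", len(results), "sessions passed")
```

On a single laptop the cap would be set by local CPU and memory; the argument in the conversation is that real browser sessions at this scale need cloud capacity instead.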

Ben Lorica: So, Ito—I have a pull request, I’m hooked up to Ito. Ito then has access to my source code?

Evan Marshall: Yes, temporarily. We pull it down, build the environment, and then delete it after we’re done running.

Ben Lorica: And that source code is isolated from other customers of Ito somehow?

Evan Marshall: Oh yeah. I was just on a call with a container specialist at AWS yesterday, and I was saying, “We do things kind of weird in AWS specifically because of our security requirements.” A lot of containers are meant to be run on a shared pool of instances. We don’t do that. We are extremely focused on security. The mental model we have is that we want agents to be fully empowered within their sandbox. We want them to be able to do whatever is necessary—seed data in the database, intercept requests, mock services, spin up servers. Within that sandbox, we want them to be extremely empowered. And then that sandbox is an iron cage where nothing gets out.

Ben Lorica: The sweet spot for your AI testing tools is what kinds of software systems? It sounds like you have to have a UI of some sort. This is not meant for heavy backend testing, right?

Evan Marshall: Well, it actually works on all of these things. It’ll work on developer tooling. If you change your ESLint config and you have a PR to do that, Ito is able to run it just like any other developer on your team. But I’ll be honest, the videos are garbage right now. The tests are really good for backend and tooling changes, but we’re still trying to figure out what a good video looks like for a curl request.

Ben Lorica: So it sounds like one of the main ways you tackle this trust and auditability question is through video, like you said. What else? Is there some sort of real-time dashboard telling me what’s happening and what kinds of tests it’s running? And what if, at the end of the test, I look at the video and the reports and say, “You know what? Why don’t you go ahead and do one, two, three, four, five other things?” Is that possible?

Evan Marshall: Yeah, you just tag us in a comment on the PR. We’re working on exposing that on our own website too, so you can have test cases where, if you want to guide testing, you can emphasize certain categories—maybe security or performance, or if you’re more on the UX-perfection side, handling more edge cases there. You can guide it that way, but you can also just tag the bot and say, “Hey, I also want you to test these three things.”

Ben Lorica: So the primary way I interact with Ito is through pull requests?

Evan Marshall: Yeah, it’s through GitHub or our website.

Ben Lorica: I see. Interesting. So this is not yet meant for the pure “vibe coder” who cannot code, huh?

Evan Marshall: I mean, an API is definitely in our future. But what we’ve seen as the bigger problem is: how do organizations, as they scale, feel the benefits of AI? A single person alone vibe-coding is like, “This is magical. Wow, look at all this stuff I can do alone.” An organization with 50-plus engineers is not feeling that same level of benefit.

The reason why is because of their release process. They may have a dev branch, promotion tiers, dogfooding—everyone at their company gets the release. They can’t just push out garbage and say, “I’ll fix it later.” There’s so much process and so much headache. So organizations are not feeling the love from AI as much as individuals are, and that’s where we see the biggest opportunity.

Ben Lorica: So you are an AI software testing agent testing AI-generated code.

Evan Marshall: Yeah. “Manual automated testing” is the contradiction we like to use sometimes.

Ben Lorica: So then, how comfortable are you talking about the AI under the hood of your tools? How does it work? Did you take an open-weights model or an API endpoint from a proprietary model and just use certain prompts, or did you do some fine-tuning? How much of the secret sauce can you talk about?

Evan Marshall: Yeah, I’m pretty open about it. In terms of the models, we cycle between the big three: Gemini, GPT, and the Opus/Claude models. We use the biggest and best. Honestly, our problem set is right on the edge of what these models are capable of. There are, of course, new models to try all the time, but none of the open-weights models are there yet. They’re probably a generation or two behind. Maybe with fine-tuning—but honestly, then the big labs are just going to push an even better one, and you’re going to have a fine-tuning pipeline just to always be behind. So we’ve really just used the top models.

In terms of balancing it, I love OpenRouter. We cycle between them very quickly.

Ben Lorica: Yeah, I’m an OpenRouter fan as well.

Evan Marshall: Yeah, I love OpenRouter. There’s not a lot of visibility. I have a conspiracy theory that Opus 4.6 and 4.5 were nerfed at the end of February, but you have to be able to switch providers and models really quickly in this space.
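As an illustration of switching providers and models quickly, the sketch below tries a list of model callables in order and falls back when one fails. It assumes nothing about OpenRouter's actual API: the provider names and the `flaky_model` and `backup_model` functions are hypothetical stubs.

```python
from typing import Callable, List, Tuple

Provider = Tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real harness would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stubs standing in for real model endpoints.
def flaky_model(prompt: str) -> str:
    raise TimeoutError("simulated timeout")


def backup_model(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = call_with_fallback(
    "hi", [("primary-model", flaky_model), ("backup-model", backup_model)]
)
print(used, reply)
```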

Ben Lorica: Where does your IP begin? If you have the models and you’re cycling through them, besides the obvious integrations you have to make, why can’t teams just do it themselves?

Evan Marshall: It’s about the harness and the infra. Getting a model to do what you want, especially for long-running tasks, is really difficult. We have tasks ranging anywhere from a few minutes to a few hours for extremely large, complicated PRs. Getting an agent to do what you want for three hours is a gigantic challenge. And it’s not just one agent, but an agent pipeline composed of swarms. Getting them to do what you want is a gigantic challenge.

And then in terms of being able to run all of that infrastructure, we have extremely unique infrastructure challenges that require very senior people at AWS to collaborate with us on how to actually address them. The infra layer is completely new. We’ve had some customers tell us, “Hey, I came up with a cool prototype in a couple of days—and then it didn’t work. It was way too slow, the model didn’t always do what I wanted, and there are just so many components, like memory.” There are so many different components where having a really good harness and really good infrastructure matters. You can homebrew it yourself, but I can tell you it’ll take a long time. And I think the biggest organizations will. The biggest companies will homebrew.

Ben Lorica: What is the boundary between what you’re doing—QA testing—and security audits? I’m not talking about penetration testing, but when you have a UI, in the past you’d have SQL injection. Now you could feed prompts to an app that interacts with a model and try to steal information from the company. What is the boundary between software testing and security in this world?

Evan Marshall: I think a truly perfect security model doesn’t exist in software at all. I used to work in areas where we used the most extreme methods of verification, like formal methods—creating a proof about the expected behavior of software. In the probabilistic, agentic world, it’s not the same. I think what you can really do is work at the infra layer, where you have to make sure that if a malicious customer does something on one machine, it stays on that one machine. You need extremely rigid boundaries in your infrastructure, like not running containers from different customers on the same instance.

Ben Lorica: But in terms of—let’s say, I’m not sure if this is a great example—but you look at a system like OpenDevin, now OpenHands, right? Vibe-coded, released, took off. And then a bunch of security folks looked at it and started filing PRs. Presumably, maybe if it had been tested better before it launched, a lot of the issues that came up later might not have made it to production. I guess my question to you is: in this era where people are pushing out code so fast, does QA start overlapping with security?

Evan Marshall: Yeah, absolutely. Security is already one of the dimensions that we’re regularly testing, more from a pentest perspective of access and role-based access controls—what are users actually able to do in a certain application? If you forget an admin hook somewhere on something that should be protected, that’s something we’ll surface. We don’t do a lot of the supply-chain stuff, like static analysis, or increasingly, outward-facing applications that involve a model. Pentesting in this case is almost like red teaming.
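One way to picture the access-control check described here is a comparison between an expected role-to-route policy and what a test run actually observed. This is a hedged sketch only: the roles, routes, and the `find_violations` helper are hypothetical, and the `observed` data is hard-coded where a real run would probe the application.

```python
# Expected policy: (role, route) -> should access be allowed? (hypothetical values)
EXPECTED = {
    ("admin", "/admin/users"): True,
    ("member", "/admin/users"): False,
    ("member", "/settings"): True,
}


def find_violations(expected: dict, observed: dict) -> list:
    """Return (role, route, expected_allowed, observed_allowed) for every mismatch."""
    violations = []
    for (role, route), allowed in sorted(expected.items()):
        got = observed.get((role, route))
        if got is not None and got != allowed:
            violations.append((role, route, allowed, got))
    return violations


# Hard-coded observations; here a member reached an admin-only page.
observed = {
    ("admin", "/admin/users"): True,
    ("member", "/admin/users"): True,
    ("member", "/settings"): True,
}

for violation in find_violations(EXPECTED, observed):
    print("access-control mismatch:", violation)
```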

Ben Lorica: Yeah, I think security always depends on what your application looks like. OpenHands is a nightmare because you want to give it access to real things that you really care about. The power is the access.

Evan Marshall: Yeah. If you instead have our access, it’s a lot nicer because everything is sandboxed. We’re not putting customer PII in that sandbox and letting an agent go wild. We’re using mock data, seed data, and test credentials. Everything the agent has access to in our use case is not sensitive information. And at the boundaries where something becomes a deterministic flow—something we care about—we don’t just let agents post to GitHub or do things with GitHub. We have them create a report, and then the actual posting is a deterministic step that happens at the end of our pipeline. It’s really use case by use case. OpenHands is sort of an absolute nightmare for security because you want to give it everything you care about. You can see what that opens you up to.
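The split between an agent that only writes a report and a deterministic step that posts it can be sketched as follows. This is an assumption-laden illustration rather than Ito's pipeline: the JSON schema, `Finding`, `parse_report`, and `render_comment` are all hypothetical, and the actual call to GitHub is deliberately left out.

```python
import json
from dataclasses import dataclass

ALLOWED_STATUSES = {"pass", "fail", "skipped"}


@dataclass(frozen=True)
class Finding:
    title: str
    status: str


def parse_report(raw: str) -> list:
    """Validate untrusted agent output against a fixed schema before anything is posted."""
    data = json.loads(raw)
    findings = []
    for item in data["findings"]:
        if item["status"] not in ALLOWED_STATUSES:
            raise ValueError(f"unexpected status: {item['status']!r}")
        findings.append(Finding(title=str(item["title"]), status=item["status"]))
    return findings


def render_comment(findings: list) -> str:
    """Deterministic rendering: the same findings always produce the same comment text."""
    lines = ["QA report", ""]
    for finding in findings:
        lines.append(f"- [{finding.status}] {finding.title}")
    return "\n".join(lines)


agent_output = '{"findings": [{"title": "Login flow", "status": "pass"}]}'
comment = render_comment(parse_report(agent_output))
print(comment)  # ordinary, non-agent code would post this text at the end of the pipeline
```

The design choice is that the agent never holds posting credentials; it can only influence the content of a report that is validated before it leaves the sandbox.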

Ben Lorica: So your strong suit is that you’re able to test user interactions. You build on top of established open-source projects like Playwright and so on. But we’re increasingly moving toward a world where the users are themselves agents. Bots are going to be interacting with agents. Does that change how you test these applications if the end user is not a person?

Evan Marshall: I think it’s just another persona to consider.

Ben Lorica: The reason I bring it up is that a person, in some ways, might be more predictable than an agent, which might be more relentless. Like we talked about with security, an agent could be more relentless than a person trying to break your app.

Evan Marshall: Something we’ve thought about is that we could probably score your application on how easy it is for an agent to use, measured by how easy it is for our agents to use it. We thought that might be an interesting metric for a lot of people: how well is your application designed for an agent?

But honestly, MCPs are great, but APIs work really well for agents, and agents are being fine-tuned on them. I don’t know if we need to totally reinvent the wheel there. There are going to be ways for humans to access digital systems forever. How does that evolve? Something people don’t really talk about is economics. Why would you make an agent go explore a website to create a fake, more probabilistic version of an API rather than just use an API?

Ben Lorica: Yeah, obviously if they have an API, you’d hit the API, right?

Evan Marshall: Yeah, just use the API. And the only reason they wouldn’t have an API is because they don’t want to, or they don’t want to expose something. And then you get into more legal questions about what’s okay.

Ben Lorica: At the start of our conversation, you alluded to the fact that at the start of your career, you had these separate QA teams, which I also experienced. QA was a separate org chart. In many ways, people took pride in the fact that the QA people didn’t report to engineering, so they had the equivalent of editorial independence. But as you alluded to, even before the rise of AI, developers were increasingly taking on more and more of these responsibilities. What was the state of the QA profession before AI, what is the state of it now, and what is likely going to happen to the QA profession?

Evan Marshall: So it’s kind of a before, present, and future. Long before, it was a separate team on a different side of the cubicles—you couldn’t even see them. And sometimes, like I said, they didn’t even report to engineering.

Now, what QA looks like at most of the organizations we work with is more of a cross-team, dynamic role. If there’s a big feature, engineering can call them in to help validate it. They’re always responsible for ensuring core flows and releases. There’s also an infrastructure component—QA engineers maintain the end-to-end testing suite and test infrastructure.

Ben Lorica: What’s the typical profile of people in this org or profession? Were they former developers? How did people get into that career path?

Evan Marshall: I think some people have some coding experience. In terms of getting into that role, it’s really about having the dedication to do a good job and being extremely collaborative around what your organization needs. I think coding is now, with GenAI, more expected. If you’re a QA engineer who doesn’t have a lot of coding experience, you can prototype stuff, bring it to engineering, and help iterate on it. But it was never really a strict requirement. There were people who migrated into that path from engineering, but I never thought it was a strict requirement.

Ben Lorica: So then today, you still have QA teams. They’re probably smaller than when we first started. What’s likely going to happen to the QA role and QA profession?

Evan Marshall: The way I think it’s going to evolve is that it’s actually going to look a lot like it does right now. But if you talk to any QA team, they feel underwater. They know there are a lot of things they should be doing. There’s a lot of infrastructure, end-to-end testing, and better reporting they should be building, but sometimes it’s hard just to keep up with releases.

Really, QA has been left behind. Who is feeling the pain the most? It’s QA. Because engineers will ship things ten times faster with ten times as many bugs, and it’s often someone else’s job to fix it. I see this agentic tooling that we’re building as a way to empower them to keep up.

I think the way it’s evolving, in the same way it’s evolving for software developers, is that software developers no longer care so much about a single line of code. They care about structure, how things play nicely together, how it can evolve in the future, the design—but not so much about a single line of code. I think QA is going to evolve similarly, where they won’t care so much about a single bug being fixed. Right now they have to walk that bug through whatever process and make sure it’s fixed, whether that requires a point release or whatever.

It’s going to be more about this taste-making of, “Hey, we noticed that we have a large class of errors that different developers on our team have different opinions about.” Some developers say, “We don’t have to worry about every single little edge case of back-and-forward navigation and rage-clicking something a hundred times.” As a QA engineer, how do you inform the global perspective in your organization about how errors should be handled and prioritized, so that all the engineers can be on the same page? Like, “Hey, here we are UI perfectionists. This does not get released.” Whereas at other organizations, it may be, “We’re security perfectionists. We fix every security issue, no matter how small, as soon as we find it.” That’s what QA is going to evolve into: a more global perspective on what kinds of errors are happening and where focus needs to be built up, instead of just walking single bugs through whatever process.

Ben Lorica: And like I said, if you read the news, a lot of funding is going into helping people build software, but not necessarily test software. I think there are some startups now, including you, but definitely, as far as the arms race goes, I guess it’s always been the case that developers have more funding than QA, right?

Evan Marshall: I mean, QA doesn’t exist without people writing code, so it makes sense that that’s the primary problem. But we quickly see the bottleneck shifting. There’s a spectrum of AI adoption, and the moving average and distribution are getting to the point for many organizations—especially those at the forefront—where they have really solid agentic flows and verification through static analysis, but then things get caught up in a human actually testing it and ensuring that it works the way they expect it to. Because humans are never going to write perfect PRDs. Just making sure something works the way you want it to, and that all the edge cases and user experience are handled the way you want—that part is where the bottleneck is.

Most developers who are on the very edge of full vibe-coding spend all of their time testing. Prompt, wait, test: is it working the way I want? Prompt, wait. And we’re trying to cut back on that inefficiency by saying, “Well, here are all 100 things you could have looked at. What do you want to be different?”

Ben Lorica: It seems like now that company and engineering leadership are very much aware of the power of these coding agents, it’s just common sense that they’ll realize rather quickly, “Hey, our testing tools need to keep up, because our engineers are now 10x, 100x more productive,” right?

Evan Marshall: I think they’re already feeling it right now. If you talk to engineering leaders, especially at larger organizations, they’re in a really tough bind. There’s a mandate from the C-suite to use AI. Some of this is public—you can look at Amazon’s mandates and KPIs that are forcing engineers to use specific AI tools. They even have token leaderboards, so if you’re not on the leaderboard, it’s, “Hey, you’re not taking this shift to AI coding agents seriously.”

And then if you look at studies, like Anthropic’s own studies on whether vibe-coding is improving efficiency in organizations, it’s not. It’s not helping the largest organizations because AI is great at being a hammer and making something really quickly—it’s not a scalpel. To understand all the business context, all the collaboration, and how painful it is when something goes wrong inside a really big process—engineering leaders are kind of stuck saying, “Okay, I’m forcing my developers to use tools that I really don’t think are moving the needle, just so I can say I’m using AI.”

So what’s the problem? Why aren’t they able to adopt AI as fast? Because their verification requirements are much higher. They need additional verification to be able to trust the output of LLM-generated code. That’s where I see QA being the biggest need. As your organization scales, you need more verification that LLM-generated code is doing what you want it to do.

Ben Lorica: And the interesting thing too is that a lot of the processes are being revamped or completely tossed aside. Remember, up until a few years ago, if you wanted to add a feature, it was, “Oh yeah, let’s do a PRD, and then we have five meetings, and then another five meetings to talk about it.” But now an engineer can have an idea, vibe-code it on his or her phone on the way to work, and here—we don’t even have to talk about it anymore. I think one of the casualties of all of this is the people whose job is to run meetings, right?

Evan Marshall: Thank God for that. But I do think there’s this secret tech debt bubble that every engineer with enough experience knows exists. Maybe it’s hard to quantify, but there is a tech debt bubble, and no one knows when your velocity drops off a cliff.

I think it’s really cool to be able to get faster to a demo, but if you’re working with agentic tools in a large organization, there’s some class of issues where, okay, you can automatically fix it. But how expensive is a bug? That’s a really great question. For organizations with 50 engineers, it’s tens of thousands of dollars. If you calculate all the meetings, all the discovery, everything flowing through the process, it could be $10,000 or more for a bug. And then you ask, “Okay, how much human engineering did this save me if it introduced these bugs?” The math just doesn’t work out in a lot of cases. The verification is never—

We have really crazy, powerful website builders now, but verification for complex software that is mission-critical and has to work within a lot of constraints, with a lot of collaborators doing things at the same time—you need verification beyond that. Right now, the only options are really slow, flaky end-to-end tests, which people are now also generating with AI, and manual testing. We’ve talked to several companies that offer manual testing as a service, where you can hire a bunch of people as contractors and scale that up as needed. Their demand is exploding. It has never been a better time to be a manual software tester.

Ben Lorica: So this is like the equivalent of Scale AI and the data labelers, where it’s a heavily manual thing. But I think the writing is on the wall for manual testing, no?

Evan Marshall: We’re trying to make humans spend less of their lives clicking those buttons. But right now, today, the demand is overwhelming for manual testing because people are moving faster, but they’re experiencing bugs and weird regressions that no human is really responsible for and no one has the context to fix. People are trying to use more tools to track it down, and those help, but the verification layer for AI doesn’t really exist yet, and that has prevented a lot of adoption.

I think a lot of organizations are being kind of bullied by non-technical people into thinking it’s a user problem. It’s like, “No, you can’t figure out how to use AI. I saw this tweet that says we have AGI, but we’re not a thousand times more effective on the engineering side.” The problem they’re running into is verification.

Ben Lorica: By the way, in closing: as these coding agents get better, you can imagine that these coding agents will include testing agents. But I think just like we used to have separate QA teams from engineering teams, engineering teams will still benefit from having separate AI testing tools. You don’t want the person who wrote the test to grade the test, sort of. And you also benefit from an independent AI testing tool being able to bring in multiple different models that are distinct from the models your AI agents relied on.

Evan Marshall: Yeah, and I think, like anything, it comes down to obsession with a particular problem. Everything we do is obsessed with QA and software verification. When designing a harness, the models have different personalities, and I haven’t seen great resources on dealing with different model personalities. So if you’re building a harness and you want a model to do something—let’s say you test Opus 4.6 or whatever—you have a prompt, and the model does some stuff, and you test it a bunch of times to see what it does. You adjust the prompt, and then it does what you want. And then you plug in a model that benchmarks better—like 5.4 Pro or Gemini 3.1 Pro—and now it’s not doing what you want anymore. Because some of the things Opus was doing, it just wanted to do. It was baked into its personality. And all the things it was doing that you didn’t want, you adapted to. Now Gemini is doing a bunch of things that you don’t want it to do, so now you’re figuring out, “Do I do model-specific prompts in my harness?”
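The question of model-specific prompts in a harness could look something like the sketch below: a shared base prompt plus per-family adjustments, with a fallback when no family matches. The family keys, the wording of the overrides, and `build_prompt` are hypothetical and not taken from any real harness.

```python
BASE_PROMPT = "Test every flow affected by this change and report failures."

# Hypothetical per-family adjustments, discovered by running internal evals.
FAMILY_OVERRIDES = {
    "opus": "Do not refactor application code while testing.",
    "gemini": "Stop after the listed flows; do not explore unrelated pages.",
}


def build_prompt(model_name: str) -> str:
    """Return the base prompt plus any adjustment registered for the model's family."""
    lowered = model_name.lower()
    for family, extra in FAMILY_OVERRIDES.items():
        if lowered.startswith(family):
            return BASE_PROMPT + "\n" + extra
    return BASE_PROMPT  # unknown families fall back to the shared prompt


print(build_prompt("gemini-example"))
```

Keeping the overrides in one table makes it easier to re-run evals and see which adjustments still matter when a new model version is plugged in.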

Ben Lorica: Do you use DSPy and prompt-optimization frameworks?

Evan Marshall: Yeah, DSPy is a toolkit for prompt optimization and prompt engineering. It’s much more principled, based on an evolutionary algorithm. It’s an open-source project. Inside it, they have all sorts of optimization algorithms that will allow you, in a very principled way, to get to what you’re trying to write.

It’s more like templating languages and experimentation. But it’s also about how you benchmark personalities and how you get insight into that. It’s sort of weird, but the model families have similar personalities, and sometimes there’s a big shift in personality and you don’t really realize it until you run your own internal evals against it.

Ben Lorica: What about multimodality? Is that something you folks also use? Increasingly these models are multimodal, so I’m assuming the applications are increasingly multimodal.

Evan Marshall: Yeah, we don’t have audio yet, but we do have vision and CLI-type agents. It is sort of a combination. It’s hard to know: do you use two models—one that’s really good at vision and one that’s really good at tool calls? Or do you use both, or use a combined model? Generally, we’re seeing benefits from model segmentation—figuring out what tasks you’re doing. Do you need a smaller model or a bigger model? Do you need a model that has vision? Expose the vision model as a tool call to your other models. We’re really seeing the swarm approach being more useful than just saying, “Hey, let’s switch to a model that has vision capabilities.”
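Exposing a vision model as a tool call to other models can be sketched with a small tool registry. Everything here is a hypothetical illustration: `describe_screenshot` returns a canned string where a real system would call a vision-capable model, and the `{"name": ..., "arguments": ...}` call shape is an assumed format rather than any specific vendor's.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function so a text-only agent can request it by name."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("describe_screenshot")
def describe_screenshot(path: str) -> str:
    # Stub: a real implementation would send the image to a vision model.
    return f"(stub) description of {path}"


def dispatch(call: dict) -> str:
    """Execute a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    if call["name"] not in TOOLS:
        raise KeyError(f"unknown tool: {call['name']}")
    return TOOLS[call["name"]](**call["arguments"])


# A tool-calling model without vision would emit a request like this one.
request = {"name": "describe_screenshot", "arguments": {"path": "checkout.png"}}
print(dispatch(request))
```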

Ben Lorica: Yeah, which is why I’d rather use a tool like yours than have to try to build it myself. Because basically, you’ll bring the best solution to the table at the most efficient cost, and save me money along the way. And with that, thank you, Evan.