Hamza Tahir on Agents, Harnesses, Orchestration, and the Road to Production.
Ben Lorica speaks with Hamza Tahir, co-founder of Kitaru, about what it takes to move AI agents from demos to production. They discuss the difference between workflows and agents, why “harness engineering” is becoming a useful frame for application builders, and why long-running agent workloads require orchestration, durable execution, retries, state management, and human-in-the-loop semantics. The conversation also covers European AI adoption, regulatory pressure, defense-sector demand, model neutrality, and the idea that “AI agents are implemented, not adopted.”
Interview highlights – key sections from the video version:
- Kitaru and the Return of Orchestration
- European AI Adoption: Compliance, Security, and Conservative Deployment Patterns
- Agentic Systems vs. Workflow-Driven LLM Applications
- EU AI Act Changes and the Question of Model Liability
- Defense, Manufacturing, and AI Momentum in Germany and Europe
- Harness Engineering: Optimizing the Environment Around the Model
- Looking Past Buzzwords: Context, Function Calls, RAG, and Verifiability
- Custom Harnesses, Agent Frameworks, and Off-the-Shelf AI Tools
- The Future of Harnesses: Open, Proprietary, or Operating-System-Like?
- Why Long-Running Agent Workflows Need More Than Benchmarks
- Kitaru as the Runtime and Orchestration Layer for Agents
- What Orchestration Means: Execution, Scheduling, Recovery, and Observability
- The Production Gap Between Agent Frameworks and Reliable Runtime Infrastructure
- Meta-Orchestration: Connecting Agent Harnesses to Existing Infrastructure
- Open-Weight Models, Vendor Neutrality, and Why AI Agents Are Implemented, Not Adopted
Related content:
- A video version of this conversation is available on our YouTube channel.
- Stop upgrading your LLM. Start fixing your data.
- The Missing Layer: Why Your AI Agent Fails — and What Actually Fixes It
- Why Your AI Agents Need Operational Memory, Not Just Conversational Memory
- Richard Garris and Barry Dauber → The Gap Between AI Hype and Enterprise Reality
- Zhou Yu → Why Your AI Agent Isn’t Ready to Ship (And How to Know When It Is)
Transcript
Below is a polished and edited transcript.
Ben Lorica: All right, so today we’re welcoming Hamza to the show — one of the founders of Kitaru, which you can find at kitaru.ai. That’s K-I-T-A-R-U, and in Japanese it means “to arrive.” Some highlights from their website: “Agents are leaving the laptop. Kitaru is the open source runtime for long-running Python agents” — which means checkpoints, replay, wait, isolated execution, versioned deployments on your cloud. And with that, Hamza, welcome back to the podcast.
Hamza Tahir: Thank you for having me back.
Ben Lorica: So we’ll get into what Kitaru does. It sounds a lot like what we used to call orchestration back in the day.
Hamza Tahir: Last time we spoke, you had the same take. So I guess it’s orchestration all the way down.
Ben Lorica: Yeah, it does seem that way — similar to how some of the tools from that era are pivoting. The Prefects and Astronomers of the world. But since you’re based in Europe, and we’re obviously in a bit of a Silicon Valley bubble here in the Bay Area — assuming you’re now completely focused on helping people build agents and generative AI applications — what are some of the key challenges you’re seeing from that side of the pond?
Hamza Tahir: Well, I think European challenges share a lot with global challenges. But what makes Europe particularly complicated is, of course, data compliance and security. Unlike the US mindset, people in Europe are a bit more conservative about how they connect systems — especially to non-deterministic systems. An agentic application can be quite consequential, so people tend to be more conservative. I’ve seen more implementations of workflow-driven LLM applications rather than truly agentic ones. Unlike in the Bay Area, where you have companies like Harvey — although maybe that’s the wrong example, since Harvey has a lot of workflows too — but there are certainly well-known examples of truly agentic systems coming out of the US.
Ben Lorica: For our audience who may not appreciate that distinction — what’s the difference between agentic and workflow-driven?
Hamza Tahir: If the control flow of your program is driven by the LLM, then it’s an agent. If you predetermine the control flow of execution, it’s a workflow.
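To make the distinction concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the two tools, the `run_workflow` and `run_agent` functions, and `scripted_model`, which stands in for a real LLM call. The only difference between the two functions is who decides the next step.

```python
from typing import Callable, Dict, List

# Hypothetical tools; the names are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda text: f"summary of {text}",
}


def run_workflow(query: str) -> str:
    """Workflow: the developer fixes the order of steps in code."""
    found = TOOLS["search"](query)
    return TOOLS["summarize"](found)


def run_agent(
    query: str,
    choose_next: Callable[[List[str]], str],
    max_steps: int = 5,
) -> str:
    """Agent: the model picks the next step on every iteration."""
    history = [query]
    for _ in range(max_steps):
        action = choose_next(history)  # the model decides the control flow
        if action == "stop":
            break
        history.append(TOOLS[action](history[-1]))
    return history[-1]


def scripted_model(history: List[str]) -> str:
    """Stand-in for an LLM call: search once, summarize once, then stop."""
    plan = ["search", "summarize", "stop"]
    return plan[min(len(history) - 1, 2)]


workflow_answer = run_workflow("eu ai act")
agent_answer = run_agent("eu ai act", scripted_model)
```

With this scripted stand-in both paths produce the same answer. With a real model, `run_agent` could take a different path on every run, which is the non-determinism that conservative deployments tend to avoid.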
Ben Lorica: And one of the key challenges in Europe is obviously the EU AI Act — which, breaking news, I haven’t read all the details, but there’s apparently been some softening of it. Do you have a high-level understanding of what exactly happened in the last few days?
Hamza Tahir: I don’t pretend to be a legal expert, but I think there has been a bit of a softening of the rules. Just like in the US, where the previous administration seemed more worried about AI proliferation, in the EU we had a similar journey where we perhaps over-regulated. There was ambiguity around who exactly is the producer of a model. I think that’s one of the main drivers of discussion, because if you train the model, you carry more liability. There are also different compliance tiers in the EU AI Act. As I understand it, if you’re training a model that directly affects everyday people, you face the highest possible compliance burden, which was sometimes very difficult to fulfill. What I believe they’ve done recently is clarify that if you’re building on top of LLMs and you’ve sourced them properly, you can build LLM-powered applications with less liability. So it’s more of a semantic clarification, much like the workflow-versus-agent distinction you just asked me about. We’re all still grappling with definitions.
Ben Lorica: For those of you in Europe who are building AI applications — is this a positive development? What’s the immediate reaction among your circle?
Hamza Tahir: I think positive. In Europe, we’ve classically been regulation-first, and innovation has lagged a little behind as a result. So it’s always good news when the bureaucracy gets out of the way, especially at the cutting edge of a field. That said, I don’t want to give a purely political answer — there are pros and cons on both sides.
Ben Lorica: You’re in Germany specifically, right? We’re hearing a lot about the revitalization of the German defense industry and manufacturing. Is that propagating to AI? Are you starting to see those sectors rushing toward it?
Hamza Tahir: Yes, actually. The defense sector — I don’t know how much your listeners are aware, but given the political climate, there’s been a massive shift toward more independent, locally developed systems, alongside significantly increased budgets. I believe it was something like 5% of GDP that the countries committed to defense, which has been a direct consequence of the Trump administration pulling back some resources from Europe. For example, Germany had a lot of US troops stationed at Ramstein, and that’s changed.
Ben Lorica: And now some of those funds will go toward modernizing systems, with AI as a big component.
Hamza Tahir: Naturally. Necessity is the mother of innovation. There’s been a massive explosion of activity inside Germany. I personally know many people who are now working in defense companies — whereas a few years ago, that wouldn’t have been a typical career path for a young person. I know founders who are traveling to Ukraine to talk to people on the ground and building based on what they learn there. And of course, AI is quite central to all of that. You may have heard that Helsing, a very prominent startup here in Germany, just raised at a massive valuation. There’s definitely interest and drive, and a growing willingness from the German and European militaries to reduce the bureaucratic barriers for getting startups involved — whereas before, it could have taken years to get that technology across the line.
Ben Lorica: Similar to what happened here several years ago, before the Obama administration made it easier for startups to work with the Pentagon. All right, so let’s talk about agents. Before we started recording, we were discussing the latest buzzword in the space: harness engineering. Broadly speaking, the philosophical insight is that you can get a lot done by optimizing the environment and surrounding pieces around the model. Yes, the model matters — it would be a lie to say otherwise — but in certain well-defined applications, you can optimize things around the model and maybe get by with a cheaper or smaller one.
Hamza Tahir: Or you can get more performance out of the existing frontier models.
Ben Lorica: Exactly. And like any buzzword — world models, big data — harness engineering means different things to different people. But broadly, you should think of it as the harness around the model. Some basic components include additional context, constraints governing input and output like guardrails, and a component that allows the agent to iteratively explore the space and converge on a solution. As I describe it, it becomes clear we’re talking about things outside of the model. And that’s the excitement around harness engineering — because frankly, most teams don’t build their own models. They build applications on top of models. The harness is something teams can actually build themselves. The most popular agents right now, like coding agents, come with harnesses. Claude Code could be considered a harness. One philosophical fork in the road: if something isn’t completely integrated end-to-end, is it still considered a harness? Anyway, that’s a long preamble — I’m curious to hear your take on what’s becoming an increasingly exciting area.
Hamza Tahir: Well, first of all, I share your apparent skepticism toward the constant rebranding of similar concepts. We were talking about orchestration earlier, and it’s gone through a similar rebrand. People now talk about things like “durable execution,” which again sounds a little…
Ben Lorica: In this case, honestly, “harness engineering” is like “world model” or “big data” — different people have different notions of it.
Hamza Tahir: Exactly. And I think what’s important for the audience is to try to understand the underlying concepts, regardless of the terminology. For me, it’s been increasingly evident that the performance gains we’ve seen — largely driven by coding agents, primarily because they’re easy to measure, have clear benchmarks, and are one of the most popular use cases —
Ben Lorica: And the people building these tools are themselves interested in coding.
Hamza Tahir: Exactly. They use it every day.
Ben Lorica: Right.
Hamza Tahir: So they’ve understood that how you call functions, how you execute them, and how you load context — what used to be called context engineering — is the biggest bang for your buck outside of actually training an LLM. And I think that’s been true since the days of RAG popularity. RAG was essentially: how do I get the right context to the LLM at the right time? The harness is just a formalization of that.
Ben Lorica: Now that you put it that way — RAG is a kind of harness, right?
Hamza Tahir: RAG is a tool that can be used by a harness. If you think of the harness as everything surrounding the model, then the elements of RAG could be considered part of a harness.
Ben Lorica: Yeah, I agree. No matter how you slice it — the way I think about it is the LLM is the brain.
Hamza Tahir: And the harness is the hands.
Ben Lorica: Yes.
Hamza Tahir: It’s the thing that actually affects the world. The model tells it what to do. And if you can optimize the handover between the brain and the hands efficiently, you can get more performance than just making a smarter brain.
Ben Lorica: Right. And part of the harness is also this ability to iteratively explore — which is where, as you mentioned, coding agents shine. Or tools for research math, where you have something like Lean or MATLAB — domains where you can verify or measure results.
Hamza Tahir: Exactly. If you have verifiability in your system, then having your own harness is even more beneficial, because you have a closed feedback loop. And coding is great for that, but there are other examples too. A lot of what I’ve been hearing about lately is how you can have a shared context for company-building — if you’re running a startup, how do you make company knowledge queryable? How do you surface decisions made between founders, engineering, product, and marketing teams? If you have a custom harness that can parse that context and deliver it just-in-time to an intelligence layer, you can go very far with that. The concepts are quite new and very exciting.
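The closed feedback loop described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular harness: `solve_with_verifier`, `fake_model`, and `check_sum` are hypothetical names, and `fake_model` stands in for a real LLM call.

```python
from typing import Callable, Optional, Tuple


def solve_with_verifier(
    propose: Callable[[str], str],
    verify: Callable[[str], Tuple[bool, str]],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Retry until a candidate passes verification or attempts run out."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(task + feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate
        # Feed the verifier's message back into the next prompt.
        feedback = f"\nPrevious attempt failed: {message}"
    return None


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: wrong at first, right once it sees feedback."""
    return "4" if "failed" in prompt else "5"


def check_sum(candidate: str) -> Tuple[bool, str]:
    """Verifier: a deterministic check, like a test suite or a proof checker."""
    return candidate == "4", f"expected the value of 2 + 2, got {candidate}"


answer = solve_with_verifier(fake_model, check_sum, "What is 2 + 2?")
```

The verifier is the part that makes a custom harness pay off: the loop only works in domains where `verify` can be written, which is why coding and formal math are the usual examples.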
Ben Lorica: So the question is: we agree you have the model as the brain, and the environment around the model — broadly speaking, the harness. There are tools and frameworks that allow companies to build agents, and in doing so, they’re essentially building a harness around a model. Tools like LangGraph, CrewAI, and so on are helping enterprises do harness engineering. And then there are off-the-shelf agents — like coding agents — that already come with a harness. But if you’re building an agent yourself, you’re pairing a model with a harness. A typical enterprise will either do that from scratch or use one of these frameworks. Am I wrong?
Hamza Tahir: You’re absolutely correct. There is a strong case for building your own custom harnesses. The distinction —
Ben Lorica: That’s the camp I’m in.
Hamza Tahir: The frameworks come with a lot of things you may not need.
Ben Lorica: Exactly.
Hamza Tahir: One hundred percent. The pre-built harnesses — take Claude Code, for example. You can programmatically use that harness with the Claude Agent SDK, but the way it reads or edits files may or may not translate well to use cases outside of coding. That’s why you might want to build your own. But Ben, where I’d love your input is this: the labs are selling their harnesses tied to their models. It’s very hard to take Claude Opus and put it into Open Code or Open Claude — a different type of harness — and get the same performance. Do you think we’ll end up with one dominant harness, or two or three, like operating systems — Windows, Linux, macOS? Or will we live in a world where everyone builds their own open harnesses, and the labs stop so tightly coupling their models to their harnesses?
Ben Lorica: That’s a good question. For one, while there are open source RL tools out there, they’re still beyond the reach of most typical enterprises — they require a certain level of expertise. So if you can build a harness without doing any RL for your use case, maybe you move forward. The other challenge I’m hearing is this: my employees discovered Claude Code, and now they just want to use Claude Code. I was actually talking to the founder of one of my favorite agents — I won’t name them — it’s in the financial space, very specific slice of financial analysis, and it automates a job that used to require ten people down to one. It collected all sorts of data and updated it in real time. But what they found was that their own employees preferred Claude Code. So they built a UX around it, and now they’re debating whether to approach Anthropic, license their data, and essentially let Claude Code become the UX and harness.
Hamza Tahir: Is that because they like the experience, or just because they’re already familiar with it?
Ben Lorica: Mostly familiarity. They were already using it for other things. So now when you tell them to use this other web UX, they have to re-familiarize themselves. And in some domains, that might work fine — imagine users already deeply familiar with Figma or Photoshop, and you just say, “We put AI in there, there’s a harness, go at it.” But then Claude Design came out and Figma’s stock tanked. So my two answers to your question are: yes, you might be able to build a harness in certain domains, but it may require RL expertise you don’t have. And the other challenge is that in the world of shadow IT, your own employees may already be using tools they prefer as their harness.
Hamza Tahir: Yeah, I sort of agree. A good analogy for predicting the future here is the world of operating systems — you have an open source variant that’s widely customized in the form of Linux, with Red Hat, Ubuntu, and various kernels, and then you have enterprises that just say, “I need macOS” or “I need Windows.” I think there’s a world where both types of harnesses coexist — custom-built and off-the-shelf. What’s interesting is the time window. It’s very easy to use Claude Code right now, but it’s very expensive for a lot of use cases. When the cost becomes too painful at scale, that’s when we’ll start to see a shift. I’m not sure when that will happen, but even at 15 people, we feel it.
Ben Lorica: It could be that the Claude Codes of the world will have to lower their prices due to competitive pressure. I personally use Open Code with OpenRouter, still with Claude models — it’s not quite the same, but I get the benefits of the more advanced models and can swap in cheaper ones for routine tasks. But it does seem like there’s real value in using something your employees are already comfortable with. That might be one of the stickier challenges. On the other hand, Kitaru operates at a different layer. The thesis is that the harness doesn’t actually help you with long-running workflows or more involved agentic tasks. I actually wrote about this in my newsletter — in the past, people would go to a benchmark, even a programming benchmark, and ask what the highest-rated coding model was. But now, because people are using Claude Code in a lot of these agents, they’ve realized that’s not enough. When you actually build a program, you want the agent to do something much more involved — a long, complicated set of tasks: installing libraries, calling tools, working in the terminal, and so on. That’s why people have started developing much more involved benchmarks. I wrote about something called Terminal Bench, which has now evolved into Harbor. Even in the most successful area of agents — coding — people have realized that one-shot benchmarks aren’t enough. So it sounds like Kitaru is a tool that helps manage these long-running, more systematic workflows. Is that correct?
Hamza Tahir: Yes. To put it in terms you’d appreciate — we’re the orchestration layer for agents, the runtime layer. A bit of background: as I mentioned, I’m the co-founder of ZenML, and Kitaru is another product of ZenML — we’re essentially the same company. We spent the first five years building machine learning orchestration, MLOps, and we still have that business. It’s still running and growing.
Ben Lorica: So the question, Hamza, is: if we define the harness broadly as everything around the agent — context management, constraints, exploration — isn’t Kitaru part of the harness?
Hamza Tahir: That’s a great question. You can divide the harness into multiple layers. Going back — we spent a lot of time working on machine learning pipelines, and ZenML still runs on that. The reason we built Kitaru was that more and more of our customers were using ZenML as the orchestration layer for running agents, but the SDK and docs were too machine learning-heavy.
Ben Lorica: By the way — what are the key things an orchestration layer actually does? Maybe you should start there.
Hamza Tahir: Simply put, orchestration is what you need when you have a workload you want running in a managed way outside of your own machine. Kubernetes is a container orchestrator: it helps run nodes, machines, and virtualizations on top of that.
Ben Lorica: And there are companies dedicated to orchestration — Prefect, Astronomer…
Hamza Tahir: Prefect, Astronomer, ZenML. And orchestration comes in different flavors. There’s data orchestration like Airflow, and machine learning orchestration like ZenML or Prefect.
Ben Lorica: In my mind, orchestration means some form of execution scheduling —
Hamza Tahir: Very important.
Ben Lorica: — monitoring and observability —
Hamza Tahir: I’d push back slightly on that one.
Ben Lorica: Okay. And within execution, the ability to retry and recover.
Hamza Tahir: Yeah. Monitoring is interesting — there’s actually a very interesting relationship between observability tools and orchestration tools. If you look at something like Datadog, or in the agent world, tools like Braintrust or LangSmith — these are passively dropped into the program to observe what comes out. Orchestration tools, by contrast, define the shape of the program itself. So they give you a different type of observability. I wouldn’t say the primary job of an orchestrator is to capture all traces — its primary job is to give you a runtime that manages the workload reliably. Observability is a consequence of that, but it’s different.
Ben Lorica: And the key difference here is that the frameworks helping you build agents are essentially helping you build something that works — almost like a demo. They excel at getting the agent to function. But reliably running it in production is where orchestration tools step in. Theoretically, framework builders could add that, but they already have their hands full defending their own turf.
Hamza Tahir: I agree. This is where we need a clear understanding of the responsibilities of different parts of the stack. Some frameworks will want to own everything — from defining the graph to deploying and monitoring it — and there’s certainly a market for that. But there are plenty of enterprises that want to separate those layers. Maybe you don’t want to tie yourself to a specific harness or commit to, say, Claude Code or a particular cloud provider. That’s where we come in.
Ben Lorica: And as we discussed earlier, you can be someone who builds an agent from scratch — meaning you’re building the harness yourself — but you’re very unlikely to also build the orchestration layer.
Hamza Tahir: Very unlikely, because orchestration has been solved in many different ways already.
Ben Lorica: Right, there are things you can use.
Hamza Tahir: Exactly. But there’s a gap right now. You have a new type of workload, and existing orchestrators like Kubernetes weren’t typically built to handle it out of the box. You need to add an event-based queue, figure out memory, understand what artifacts are, handle durable execution — and that’s the gap Kitaru wants to fill.
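As a rough illustration of what "durable execution" means in practice, the sketch below saves the result of each step to a checkpoint file and skips completed steps on the next run. It does not show Kitaru's API; `run_durably`, the JSON checkpoint format, and the sample steps are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], str]]


def run_durably(steps: List[Step], checkpoint: Path) -> Dict[str, str]:
    """Run named steps in order, skipping any whose result is already saved."""
    done: Dict[str, str] = {}
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())
    for name, step in steps:
        if name in done:
            continue  # replay: reuse the recorded result
        done[name] = step()
        checkpoint.write_text(json.dumps(done))  # persist after every step
    return done


calls = {"fetch": 0, "report": 0}


def fetch() -> str:
    calls["fetch"] += 1
    return "data"


def report() -> str:
    calls["report"] += 1
    if calls["report"] == 1:
        raise RuntimeError("simulated crash")
    return "report"


path = Path(tempfile.mkdtemp()) / "run.json"
steps = [("fetch", fetch), ("report", report)]
try:
    run_durably(steps, path)
except RuntimeError:
    pass  # the run "crashed" after the first step was checkpointed
result = run_durably(steps, path)  # the second run resumes from the checkpoint
```

On the second run `fetch` is not called again. For an agent, the skipped steps are LLM and tool calls, so replay saves both time and tokens.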
Ben Lorica: And basically, what you’re describing is that it’s within reach of most teams to build the agent and the harness, because they understand the workflow, the data sources, the context, and the constraints. But once they’re ready to productionize it, they’ll hand it off to a platform team that either already has an orchestration tool or will adopt one. And it’s unlikely that the framework they used to build the agent will also provide that orchestration layer.
Hamza Tahir: Exactly. And that’s what we learned from ZenML — you need to bridge the gap between platform engineers and application developers or data scientists. Our approach is to act as a meta-orchestrator: we connect the harness layer with whatever orchestration backend your platform team has already chosen. We don’t tie ourselves to a specific orchestration backend, whether for machine learning or agents. That seems to have resonated with a lot of our customers.
Ben Lorica: Can you describe exactly what you mean by meta-orchestration?
Hamza Tahir: It means we abstract away the concerns of the concrete runtime. A concrete example: let’s say you want to run your agent on a Lambda function — that’s a different backend than Step Functions, or Kubernetes, or Slurm. What you need is a translation layer between the harness and any of these stacks, so it can plug and play into your existing infrastructure. You don’t want to recreate another orchestration layer just for your agents.
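One way to picture that translation layer is a single backend-neutral job description that gets rendered into whatever the platform team already runs. The sketch below is an illustration only: the `AgentJob` schema, the function names, and the image name are hypothetical and say nothing about how Kitaru is implemented. The Kubernetes and Slurm output is deliberately simplified.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentJob:
    """Backend-neutral description of one agent run (hypothetical schema)."""
    name: str
    image: str
    command: List[str]
    timeout_s: int


def to_kubernetes(job: AgentJob) -> dict:
    """Render the job as a simplified Kubernetes batch/v1 Job manifest."""
    container = {"name": job.name, "image": job.image, "command": job.command}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job.name},
        "spec": {
            "activeDeadlineSeconds": job.timeout_s,
            "template": {
                "spec": {"restartPolicy": "Never", "containers": [container]}
            },
        },
    }


def to_slurm(job: AgentJob) -> List[str]:
    """Render the job as an sbatch command line (the image is ignored here)."""
    minutes = max(1, job.timeout_s // 60)
    return [
        "sbatch",
        f"--job-name={job.name}",
        f"--time={minutes}",
        "--wrap=" + " ".join(job.command),
    ]


TRANSLATORS: Dict[str, Callable[[AgentJob], object]] = {
    "kubernetes": to_kubernetes,
    "slurm": to_slurm,
}


def translate(job: AgentJob, backend: str) -> object:
    """Pick the renderer for the backend the platform team already uses."""
    return TRANSLATORS[backend](job)


job = AgentJob("support-agent", "example/agent:1", ["python", "agent.py"], 600)
k8s_manifest = translate(job, "kubernetes")
slurm_command = translate(job, "slurm")
```

The agent code only ever sees `AgentJob`. Supporting another backend means adding one more entry to `TRANSLATORS` rather than building a second orchestration layer.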
Ben Lorica: I see. But it still provides what you’d expect from an orchestration layer — scheduling, execution, retries, all of that.
Hamza Tahir: It has to. And that’s what Kitaru does well. Orchestrating agents is a slightly different beast from orchestrating machine learning models. If you have sub-agents within an agent, you want them to be reliable. You have thread-waiting, human-in-the-loop semantics — if an agent is waiting for your response and goes idle, you don’t want it burning resources in your orchestration layer. That costs a lot of money at scale. There are many nitty-gritty things you have to figure out for agentic workloads that simply weren’t considered before.
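The human-in-the-loop point can be sketched as "save state and return" rather than "block and wait". The example below is a minimal illustration with hypothetical names (`Run`, `STORE`, `start_refund`, `resume`). An in-memory dict stands in for a durable state store, and the refund threshold is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Run:
    run_id: str
    status: str = "running"  # running | waiting | done
    question: Optional[str] = None
    log: List[str] = field(default_factory=list)


STORE: Dict[str, Run] = {}  # stand-in for a durable state store


def start_refund(run_id: str, amount: int) -> Run:
    """Run the agent until it finishes or needs a human decision."""
    run = Run(run_id)
    run.log.append(f"drafted refund of {amount}")
    if amount > 100:
        # Needs approval: record the question and return instead of blocking,
        # so no worker sits idle while the human decides.
        run.status = "waiting"
        run.question = f"Approve refund of {amount}?"
    else:
        run.log.append("refund issued")
        run.status = "done"
    STORE[run_id] = run
    return run


def resume(run_id: str, approved: bool) -> Run:
    """Continue a waiting run once the human has answered."""
    run = STORE[run_id]
    if run.status != "waiting":
        raise ValueError(f"run {run_id} is not waiting for input")
    run.log.append("refund issued" if approved else "refund rejected")
    run.status, run.question = "done", None
    return run


paused_status = start_refund("r1", 250).status
finished = resume("r1", approved=True)
```

Between `start_refund` and `resume` nothing is running. The answer can arrive minutes or days later, and the only cost in the meantime is the stored state.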
Ben Lorica: So for someone already familiar with conventional orchestration tools — what’s the pitch? Because they might say, “I don’t mind using Airflow. I’ll suffer through Airflow.”
Hamza Tahir: Honestly, if you really want to use Airflow, go ahead. But I think that’s a small population. There’s currently strong demand for next-generation orchestration tools, because whenever there’s a new workload, you need a new layer of tooling. For those who want to stick with repurposed orchestration tools, we need to provide integrations for their harnesses, and they’ll probably build it themselves. But that’s not the signal we’re getting from the market. The signal is: “We have this new workload, we don’t want to re-litigate our orchestration engine decision, but we need an easier translation between our harnesses and our existing infrastructure.” That’s really the customer I typically talk to.
Ben Lorica: And as you said, generative AI and foundation models have vastly expanded the pool of builders. A lot of these new builders never touched previous orchestration tools. So at what point do they realize they need something like this — even if they don’t know the word “orchestration”?
Hamza Tahir: The moment they’ve truly deployed an agent running at a certain scale. If your agent isn’t heavily used, you won’t notice. I typically wouldn’t suggest worrying about orchestration when it’s running on your laptop, or when you have a small agent with one or two tool calls. In that case you can get away with wrapping it in a FastAPI app and deploying a Docker container. But when you scale (say, a support agent chatbot, which is the most popular use case), you hit it very quickly. Some of our customers handle hundreds of thousands of support requests per week. At that scale, you can’t just have one FastAPI server running in the back. You need event queuing, and you need a storage layer attached to all your context. The moment you start thinking about those things, you’re in the realm of orchestration.
Ben Lorica: Interesting. By the way, for our listeners — Kitaru is open source. What’s the difference between the open source version and a commercial product, if there is one?
Hamza Tahir: The commercial product is coming soon. We follow the same model as ZenML — ZenML has been open source its entire life, and we have ZenML Pro, which has enterprise features like SSO, RBAC, advanced triggers, and analytics. Kitaru doesn’t have that yet, but that’s the intent. For now, we’ve just launched this new product and we’d love to get feedback and work closely with early adopters. There’s no commercial product for Kitaru at the moment.
Ben Lorica: Given the nature of Kitaru, it sounds like you mostly speak to people who are further along — already wanting to scale.
Hamza Tahir: Yeah. I’m sure there will eventually be something like an agent ops maturity index, similar to the MLOps maturity index. From a broad perspective, we talk to people who are either about to deploy their first agents or have already deployed and expect significant execution volume. And increasingly, there’s another market: enterprises building internal agent platforms who want to standardize how different harnesses — Claude Code and others — are deployed and managed centrally. If you’re a Fortune 500 company, it’s very important that context is centralized, otherwise you end up with 20 different stacks. We talk to those people too.
Ben Lorica: So you still don’t consider yourself part of the harness?
Hamza Tahir: We’re the outer harness.
Ben Lorica: One question I’ve been asking people — I don’t know if you’ve come across this — but the Chinese open weights models?
Hamza Tahir: Oh, they’re great.
Ben Lorica: It seems like the typical conversation I have is: people have tried them, they like them, but when it comes to deploying to production — even when the conversation gets to “we don’t mind deploying it in our own clusters, we have it locked down” — people are still saying, “I’m sure they’re good, but our usage isn’t high enough that we’d abandon Gemini, Claude, or GPT.”
Hamza Tahir: I have a lot of counter-examples, actually.
Ben Lorica: So in Europe, you’re coming across people deploying them in production — not just prototypes?
Hamza Tahir: One hundred percent. And in America too. If you look at the usage of something like GLM or Kimi —
Ben Lorica: Maybe they just don’t talk about it publicly.
Hamza Tahir: Maybe. But I’m not sure why they’d hide it. Is there a regulation?
Ben Lorica: No, I don’t think so. I think it’s mostly because the names your end users are familiar with are just Gemini, Claude, and GPT. In Europe, I’ve heard of people using Mistral.
Hamza Tahir: Why has the market —
Ben Lorica: I’m not entirely sure. It’s not open weights.
Hamza Tahir: It’s European.
Ben Lorica: Yeah, it’s European and homegrown. But the better models aren’t open weights — it’s like Gemini versus Gemma.
Hamza Tahir: Just like OpenAI.
Ben Lorica: Right. So the Gemma-type models — I can see the market there for edge devices, laptops, that kind of thing.
Hamza Tahir: Coming back to the Chinese models — Jensen Huang said that the second most popular set of models are the open weights models, and the only popular open weights models are the Chinese ones. If Jensen is saying that, I think it’s pretty clear.
Ben Lorica: That’s probably true, particularly in the development world. I’m just talking about enterprise usage — publicly saying, “This is what we’re using.”
Hamza Tahir: Like Kimi — wasn’t there a well-known thing where Cursor’s Composer model was basically Kimi?
Ben Lorica: Could be, but they’re a startup, not an enterprise. I’m talking about much more conservative companies.
Hamza Tahir: Yeah. I think if you look at the token economics at play right now — if you’re the company charging per token and you can arbitrarily crank up reasoning tokens on your next big model, what would you do?
Ben Lorica: And part of the hesitation, beyond supply chain concerns, is predictability. One of the best open weights model families was Qwen, but based on what we’re seeing, it seems like Alibaba may be moving away from open weights. The predictability of your supplier matters. The commercial foundation models — Claude, Gemini, ChatGPT — refresh every four months or so. How do we know that if we go all-in on open weights models, they’ll still be around? Because the way most people design their agents is still somewhat vendor-specific in their prompting.
Hamza Tahir: But the thing is, Ben — the models are asymptoting in intelligence. That’s an open secret in the field. The gains are now at the harness and application layer. So even if open weights models stopped tomorrow — which I don’t think will happen — you could still take the oldest generation and get value out of it.
Ben Lorica: I agree with all of that, but try saying it publicly. “The legal agent you’re using is based on a two-year-old model.”
Hamza Tahir: Yeah, fair point. What we’re seeing with our customers is that they’re hedging — they don’t want to commit to one particular model or harness. And going back to our earlier conversation: do you want to build your own harness? Yes, partly because you don’t want to be tied to one model that might change overnight.
Ben Lorica: And that’s why tools like DSPy, for example, let you design prompts so you’re not too dependent on a specific model. Some agents are so tightly prompted to a specific model family that even if you offered them free credits to switch from GPT to Gemini, they’d hesitate — they’re afraid their agents will break. But there are now a lot of tools that let you hedge and minimize the risk of switching between model families and harnesses.
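A simple way to hedge against model lock-in is to keep the task definition separate from the provider and score every candidate model on the same small evaluation set before switching. The sketch below is not DSPy and uses no real provider SDK. `provider_a` and `provider_b` are hypothetical stand-ins for wrapped model calls, and the two-item evaluation set is invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]  # any provider, wrapped as prompt -> text


def classify(model: Model, ticket: str) -> str:
    """One prompt shared by every provider; only the callable changes."""
    prompt = (
        "Classify this ticket as 'billing' or 'technical'.\n"
        f"Ticket: {ticket}\nLabel:"
    )
    return model(prompt).strip().lower()


def accuracy(model: Model, labeled: List[Tuple[str, str]]) -> float:
    """Fraction of labeled tickets the model classifies correctly."""
    hits = sum(classify(model, text) == label for text, label in labeled)
    return hits / len(labeled)


def provider_a(prompt: str) -> str:
    """Stand-in for one vendor's model; note the messy formatting."""
    return " Billing\n" if "invoice" in prompt else "Technical"


def provider_b(prompt: str) -> str:
    """Stand-in for another vendor's model that always answers 'billing'."""
    return "billing"


EVAL_SET = [
    ("My invoice is wrong", "billing"),
    ("The app crashes on login", "technical"),
]
MODELS: Dict[str, Model] = {"a": provider_a, "b": provider_b}
scores = {name: accuracy(model, EVAL_SET) for name, model in MODELS.items()}
```

Here `scores` makes the cost of switching visible before anything breaks in production, which addresses the fear described above: teams hesitate to move between model families because they cannot tell in advance what will regress.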
Hamza Tahir: Exactly. I love those types of tools, and I think enterprises would do well to develop in-house expertise. It’s a bit like using coding agents right now — you don’t need to reinvent yourself completely or throw away your old knowledge, but you have to keep up. Even if the whole harness paradigm shifts in three months, the value you get from learning is immense. And enterprises know that. They know they need in-house knowledge of how to build custom agents and custom harnesses, because that’s ultimately what will differentiate them from competitors.
Ben Lorica: And here’s another argument for strengthening your own knowledge and expertise: one of the main obstacles for agents in the enterprise is data integration. A lot of these agents, in order to do their job, need data from different systems, and all those integration points need to be managed reliably. Which actually points to another trend, Hamza — AI and agents aren’t adopted, they’re implemented. That’s why there’s so much discussion these days about forward-deployed engineers. It seems like this whole wave of AI and agents will actually be good for some of the consulting companies.
Hamza Tahir: Yeah, I completely agree.
Ben Lorica: You have to go in there and get your hands dirty.
Hamza Tahir: You have to integrate. At the end of the day, you have the basic tools, but the context is missing. And there’s no magic bullet.
Ben Lorica: There’s no magic bullet. There’s no one model, one harness —
Hamza Tahir: — that you just drop in and it miraculously works. You have to sharpen it like a tool. And we keep coming back to this — on your orchestration, on your runtime, even on your harness.
Ben Lorica: Make sure you’re model-neutral.
Hamza Tahir: Model-neutral, just like you want to be vendor-neutral on the infrastructure side.
Ben Lorica: Yeah. My new thing: agents and AI — they’re implemented, not adopted.
Hamza Tahir: Good one. That’s the tagline for this podcast.
Ben Lorica: And with that, thank you, Hamza.
