Arun Kumar on Multi-Agent Systems, Ensemble Learning, Agent Optimization, and the Future of AI Engineering.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • YouTube • AntennaPod • Podcast Addict • Amazon • RSS.
In this episode, host Ben Lorica sits down with Arun Kumar (Associate Professor at UC San Diego and Co-founder/CTO of RapidFire AI) to explore the rapidly evolving discipline of agent engineering. They discuss how multi-agent workflows are essentially generalizations of classical machine learning ensembles, and why developers need to move away from “YOLO agent engineering” in favor of systematic evaluation. The conversation also dives into the hidden complexities of agent optimization, the pitfalls of anthropomorphizing AI roles, and the future of automated workflow construction.
Interview highlights – key sections from the video version:
- Why multi-agent workflows can be viewed as a generalization of ensembles
- Where the analogy breaks: memory, dynamic topologies, and human-in-the-loop
- Tools, skills, and feature engineering analogies in agent systems
- What teams can learn from ensemble methods: evals, ablations, and avoiding over-complexity
- The vocabulary gap between classical ML and agent engineering
- Practical lessons for builders: decomposition, aggregation, and simplifying inherited workflows
- AutoML for agents, workflow search, and the limits of role-based agent design
- Explainability tradeoffs and what the MAST taxonomy reveals about failure modes
- Agent optimization as an emerging discipline with cost, latency, and quality tradeoffs
- Deep research agents: key knobs, autonomy, and human-in-the-loop design choices
- Why RAG and agentic systems are hard to tune: prompts, chunking, retrieval, and model choices
- From experimentation to production: continuous optimization and AI-assisted tuning
- What agent engineering means, how it relates to Ray Tune, and when teams should stop tuning
Related content:
- A video version of this conversation is available on our YouTube channel.
- Agent Optimization: From Prompt Whispering to Platform Engineering
- Why Your AI Agents Need Operational Memory, Not Just Conversational Memory
- Are your AI agents confusing activity with achievement?
- Lior Gavish → Why Traditional Observability Falls Short for AI Agents
- Samuel Colvin (Pydantic), Aparna Dhinakaran (Arize AI), Adam Jones (Anthropic), and Jerry Liu (LlamaIndex) → The Truth About Agents in Production
- Jason Martin (of HiddenLayer) → Securing the “YOLO” Era of AI Agents
Support our work by subscribing to our newsletter📩
Transcript
Below is a polished and edited transcript.
Ben Lorica: Today we have Arun Kumar, Associate Professor of Computer Science at UC San Diego, and co-founder and CTO of RapidFire AI, which you can find at rapidfire.ai. Their tagline is, “Stop guessing, start engineering AI agent outcomes.” With that, Arun, welcome to the podcast.
Arun Kumar: Thank you for having me, Ben. It’s great to be here.
Ben Lorica: Let’s start with a recent post you wrote arguing that multi-agent workflows are really ensembles in disguise. First, could you define what you mean by “ensemble” for our listeners, and explain how that lens provides a better framing for multi-agent systems?
Arun Kumar: Certainly. For our listeners, ensembles are essentially collections of machine learning models deployed to improve the robustness and reliability of prediction outputs. This is a statistical technique introduced to the classical ML world decades ago. Researchers realized that when you decompose a model’s prediction error, you find components called bias and variance—the classic bias-variance tradeoff. Ensembles help reduce both bias and variance, thereby reducing overall prediction error. While they introduce certain complexities, researchers have spent a lot of time understanding how inference workflows can be architecturally modified to address prediction issues. This led to techniques like boosting, bagging, stacking, and majority voting.
Ben Lorica: Generally speaking, an ensemble means you have multiple models and you’re trying to improve reliability, right? The most basic approach would be to simply take the average of those models.
Arun Kumar: Yes, that’s the most basic form. You have multiple models make predictions, and then you take a majority vote for classifiers, for example. The models within the ensemble could have different architectures, like decision trees, gradient-boosted decision trees (GBDTs), or linear models. They could also be trained on different subsets of the data. There are many ways to infuse diversity into the ensemble’s components, which provides different forms of reliability.
This science of ensemble learning has been developed over many decades and has seen robust success. Even today, in the age of LLMs, ensembles like GBDTs and random forests remain the most popular and successful models for classification and prediction tasks on structured data.
It’s no coincidence that we’re seeing similar trends in LLM-based generative AI. Generation is a form of prediction—it’s generative rather than discriminative. We need robustness and reliability for LLM-based AI applications, and I believe multi-agent workflows are a generalization of ensembling. That is the crux of my blog post. I didn’t say it’s a repackaging or just ensembles in disguise; rather, it is a generalization. It expands the frontiers of what we consider ensembles and introduces new ideas to the space.
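The majority-voting scheme Arun describes is simple enough to sketch in a few lines of Python (the class labels and the three disagreeing model outputs here are hypothetical, purely for illustration):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class predictions from several models by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical classifiers disagree on one input; the majority wins 2-1.
print(majority_vote(["spam", "spam", "ham"]))
```

The same aggregation step applies whether the voters are decision trees, GBDTs, or linear models; the diversity of the ensemble's components comes from how they were trained, not from the vote itself.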
Ben Lorica: Based on what you just said, the general idea is that while an ensemble uses many models, a multi-agent system uses many agents. How you coordinate, orchestrate, and combine those agents has a direct analog in ensembles. However, according to your blog post, the overlap isn’t 100%, right?
Arun Kumar: That is correct. There are several axes where the concepts map one-to-one. For example, voting and stacking map directly, and you can build cascades in both. Bagging and boosting have multi-agent analogs that generalize them: in multi-agent systems, you aren’t typically subsampling inference data to generate variations. Instead, agents are given different system prompts corresponding to different features, which makes the setup closer to random forests, where each model sees a subset of features.
However, the correspondence isn’t perfect because agent engineering introduces new concepts with no analog in ensemble theory. Memory is the first major difference. Agents can coordinate using shared mutable state and maintain persistent memory across independent sessions. Dynamic topologies and human-in-the-loop evaluations are also new. Classical ensembles were generally fixed pieces of code; they didn’t generate new models on the fly. In contrast, even a single-agent workflow can automatically generate new computations based on the LLM’s judgment, without upfront prescription. So, while there are exciting new frontiers, many common patterns—like decomposition and reflection—are grounded in classical ensemble principles.
Ben Lorica: What about tools or skills? Are there analogs for those in ensembles?
Arun Kumar: That’s a great question. In the blog post, I noted that the use of tools is one of the generalization axes. I view tool use as a way to bring in new, relevant features. In the ensemble world, this is akin to “late fusion,” where some ensembles incorporate new features during or toward the end of the process, rather than training across all models from the start. Tool use essentially retrieves additional relevant data—perhaps through a search step, a calculator, or running generated code—and provides that extra context to the model. That additional ingested string acts as extra features infused into the inference process. Generally, the more relevant features you bring in, the better.
Ben Lorica: That’s a very generous mapping. You can imagine calling a tool that is actually an advanced, sophisticated financial calculator running complex Monte Carlo simulations.
Arun Kumar: That’s true, but the same applied to the classical ML world, where feature engineering was a massive field. Engineers would write highly sophisticated code to generate features, store them in feature stores, and ingest them into models. They often had to monitor those features for drift. The analog exists here. The tools aren’t necessarily under the LLM’s direct control; you might run an inference step twice and get entirely different features back.
Ben Lorica: Knowing this analogy exists—even if it isn’t a perfect 100% match—how does this help a team currently building agents?
Arun Kumar: I highlighted one practical benefit in the blog post: people are currently obsessed with multi-agent workflows, but recent research shows they aren’t always better. Sometimes, a single agent outperforms a multi-agent system. This happens for the same reason ensembles can overfit. Adding more models—or agents—isn’t inherently better for accuracy, reliability, maintainability, or software engineering.
When building multi-agent workflows, you must incorporate systematic evaluations for both ablation and the addition of new components. Many developers practice what I call “YOLO agent engineering”—they build a workflow, run it, do a quick “vibe check,” and deploy it if it seems to work. That carries significant risks. Ablation studies allow you to systematically test changes. If you want to add an agent component, you should run an A/B test on your evaluation set: How much did it improve? Where did it improve? Where is it now failing? This systematic evaluation process has long been standard in machine learning, and it needs to be applied to agent engineering.
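The ablation loop Arun contrasts with "YOLO agent engineering" can be sketched as follows. This is a hedged illustration only: `toy_workflow`, the component names, and the tiny evaluation set are hypothetical stand-ins for a real agentic pipeline and its eval suite.

```python
# Hedged sketch of an ablation study over agent components.

def evaluate(flags, eval_set, run_workflow):
    """Accuracy of one workflow variant on a fixed evaluation set."""
    correct = sum(run_workflow(q, flags) == gold for q, gold in eval_set)
    return correct / len(eval_set)

def ablate(components, eval_set, run_workflow):
    """Leave each component out in turn; report the accuracy drop."""
    baseline = evaluate(set(components), eval_set, run_workflow)
    return {c: baseline - evaluate(set(components) - {c}, eval_set, run_workflow)
            for c in components}

def toy_workflow(question, flags):
    # Toy stand-in: only "hard" questions need the critic component.
    if question.startswith("hard") and "critic" not in flags:
        return "wrong"
    return "right"

eval_set = [("easy-1", "right"), ("hard-1", "right"), ("hard-2", "right")]
deltas = ablate(["planner", "critic"], eval_set, toy_workflow)
print(deltas)  # dropping "critic" costs accuracy; "planner" is dead weight here
```

A component whose removal costs nothing on the eval set is a candidate for deletion; one whose removal tanks accuracy has earned its keep. The same loop, run in the other direction, is the A/B test for adding a new agent.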
Ben Lorica: You also pointed out in your post that these two fields don’t share the same vocabulary, which might prevent them from communicating or benefiting from one another.
Arun Kumar: That is true. It’s a problem in both academia and industry. Many people building multi-agent workflows come from a software engineering background, not statistical or classical ML. Understanding basic ML concepts can be incredibly helpful for software engineers entering this space. Conversely, many AI researchers still focused on classical ML or statistics are far removed from modern agent engineering. Some dislike the anthropomorphic framing—like saying “agents are thinking”—and dismiss the field entirely. Bridging this gap and sharing methodologies could have a massive impact on what is currently an exciting new frontier of computing.
Ben Lorica: In that spirit, removing ML jargon like “ablation,” could you share a few lessons from the ensemble world using language that everyday agent builders can understand?
Arun Kumar: Absolutely. Let’s look at an example. Suppose you’ve built an agentic workflow for a customer support chatbot, and the task has become highly complex. The agent is making numerous tool calls and performing both critique and reflection. Instead of forcing a single model to do everything and letting multiple invocations clutter the context window, you should decompose the task and construct multiple specialized agents. This is the exact analog of the machine learning principle where throwing too many features at a single model causes it to overfit. By decomposing the workflow, you create multiple components that each bring their own expertise. That is the core structure of decomposition and aggregation.
Ben Lorica: Let me stop you there. In the machine learning world, there must be rules of thumb or best practices for when to split these tasks. What signs do people look for?
Arun Kumar: There is no one-size-fits-all answer; it entirely depends on the modality and variety of your data. For multimodal data, it’s natural to build ensembles that span different modalities and then use fusion models to combine them. If the data is homogeneous, like plain structured data, ensembles typically rely on subsets of features, subsets of data, or different model architectures.
In a multi-agent workflow operating on homogeneous text data—like document search—you might compare a cascade of cheaper agents against more expensive ones, or test different prompt structures tuned for varying precision levels. A persistent problem with agentic workflows is the assumption that one size fits all. At query time, you encounter immense diversity. In a RAG chatbot, some questions require a highly specific, “rifle-shot” extraction of a single fact from one page. Other queries require synthesizing facts across hundreds of pages. A rigidly designed RAG pipeline may not be optimal for both query types. If you hardcode your chunking and embedding strategies upfront, you’ll likely fail on certain queries. This is where ensemble ideas help: you can create multiple retrieval methods and incorporate them into a reranking step. That’s the forward approach—moving from a single-agent workflow to a decomposed, multi-agent one.
The reverse is also true. If you inherit a complex multi-agent workflow and need to fix it, you can use ablation. This simply means leaving one component out to see if the workflow can still answer accurately without it. You observe how the errors vary. Often, you’ll track multiple metrics and find a Pareto frontier. If there is a dominant point on that frontier, you’ve found a minimal, optimal architecture. Because models are constantly gaining stronger reasoning abilities, a workflow doesn’t necessarily need to remain highly complex forever.
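The Pareto-frontier idea from that answer can be made concrete with a small sketch. The variant names, scores, and costs below are hypothetical ablation results, not measurements from any real system:

```python
def pareto_frontier(configs):
    """Return names of configs not dominated on (quality up, cost down).

    Each config is (name, quality, cost); higher quality and lower cost
    are better. A config is dominated if another is at least as good on
    both axes and strictly better on one.
    """
    frontier = []
    for name, q, c in configs:
        dominated = any(q2 >= q and c2 <= c and (q2 > q or c2 < c)
                        for n2, q2, c2 in configs if n2 != name)
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical ablation results: (variant, eval score, cost per 1K queries)
results = [("full", 0.90, 10.0), ("no-critic", 0.80, 4.0), ("minimal", 0.78, 5.0)]
print(pareto_frontier(results))  # "minimal" is dominated by "no-critic"
```

If one variant sits on the frontier and meets your latency and cost constraints, that is the "dominant point" Arun mentions: a minimal architecture you can ship without guilt.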
Ben Lorica: A couple of points on that. In the classical machine learning world, people often used AutoML tools, meaning much of what you just described might have already been abstracted away for them. Secondly, in the multi-agent world, people often start from an anthropomorphic perspective. They think, “If a human team does this, we need a product manager, a designer, etc., so each role deserves its own agent.” What is your reaction to both of those points—AutoML and the mapping of human roles to agents?
Arun Kumar: Both are great questions. Regarding AutoML, a similar concept is growing in the agent engineering world: automatic workflow construction. Google recently published the MASS (Multi-Agent System Search) paper, which explores this. Just as classical ML had automated workflow construction and deep neural architecture search, we now have analogs for agentic workflow structures. Researchers are co-optimizing prompt structures and workflow architectures using meta-heuristics that build the system for you.
Ben Lorica: What is the current state of that technology? Is it showing up in the tools people are actually using?
Arun Kumar: Based on conversations with our design partners at RapidFire AI, it’s still in the very early stages, though big tech companies are actively exploring it. A perennial issue with AutoML tools has been controlling costs—determining how long and how broadly the system should search. Everything introduces hyperparameters on top of hyperparameters. An AutoML heuristic comes with its own set of parameters, and you have to understand how they impact your specific use case. The same applies here. It’s highly unintuitive for developers to figure out how to set these meta-hyperparameters for agentic workflows. I expect we’ll see more adoption in the next year or two, as big tech innovations usually trickle down to the rest of the industry, but there is still a significant gap in both cost and usability.
Ben Lorica: And regarding the second point—anthropomorphizing the agent architecture. I assume people do this because it helps them understand how to break down a task, and it makes the final system easier to explain to others.
Arun Kumar: That tendency has existed in deep learning for a while. There was often no strict mathematical justification for why a neural architecture was designed a certain way—like AlexNet, ResNet, or DenseNet. Creators simply relied on heuristic intuition. We are seeing the same thing here. Designing an agentic workflow based on how humans collaborate might feel intuitive, but it isn’t necessarily the optimal way for LLMs to process a task.
Ben Lorica: Right. The thinking is, “For this project, I need a project manager, a designer, and so on, so each of those becomes its own agent.”
Arun Kumar: Exactly. That’s the natural inclination when coming from a software engineering background. But you have to ask, “Is every agent earning its keep? Am I adding too much complexity?” Adding unnecessary agents leads to the analog of overfitting, or it creates a mess of inter-agent coordination. The complexity grows super-linearly. If agents talk to each other peer-to-peer, the coordination cost is quadratic. If you use a central orchestrator, it’s linear. Every agent you add introduces inference costs, latency, and the potential for confusion. This is why recent reports consistently show that simpler agentic workflows often outperform systems where developers just kept adding agents to emulate human teams.
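The quadratic-versus-linear coordination cost Arun cites falls directly out of counting communication channels, which a few lines make explicit:

```python
def coordination_channels(n_agents, topology):
    """Count communication channels for a given topology.

    Peer-to-peer needs a channel per agent pair (quadratic growth);
    a central orchestrator needs one channel per agent (linear).
    """
    if topology == "peer-to-peer":
        return n_agents * (n_agents - 1) // 2
    if topology == "orchestrator":
        return n_agents
    raise ValueError(f"unknown topology: {topology}")

print(coordination_channels(10, "peer-to-peer"))  # 45 channels
print(coordination_channels(10, "orchestrator"))  # 10 channels
```

Going from 10 to 20 peer-to-peer agents roughly quadruples the channels (45 to 190), while the orchestrator topology merely doubles them, which is one concrete reason the super-linear complexity of "one agent per human role" designs bites so quickly.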
Ben Lorica: Since explainability is a major reason people design systems this way, do you think that will remain a driving factor in how multi-agent systems are built?
Arun Kumar: It depends heavily on the use case and the company. Explainability has been a core concept in ML for a long time, taking different forms across different generations of the technology. We’ll see the same spectrum here. If an automated heuristic constructs a highly efficient workflow, but the agents are handling abstract subsets of intelligence that can’t easily be named, that system won’t be explainable. Conversely, if you explicitly prompt agents with roles—”You are doing X, you are doing Y”—it becomes fully explainable, but it might not be the most accurate or reliable approach. Applications that prioritize strict evaluation metrics and accuracy might embrace black-box architectures, while applications requiring auditability and explainability will tightly control the workflow structure.
Ben Lorica: You’ve probably heard of MAST, the project from the researchers at Berkeley?
Arun Kumar: Yes, MAST.
Ben Lorica: Could you describe it for our listeners and share your take on its practical implications?
Arun Kumar: Certainly. MAST stands for Multi-Agent System Failure Taxonomy. The researchers analyzed multi-agent systems across various domains and created a comprehensive benchmark dataset from their execution traces.
Ben Lorica: And they developed a taxonomy from that, right?
Arun Kumar: Yes, they built a taxonomy. They looked at the errors and categorized them into three main groups: system design, task verification, and inter-agent misalignment. I believe they analyzed around 200 traces, which is a significant effort. The taxonomy is highly useful because it helps identify why systems fail and suggests recourses to fix them. It’s a post-hoc analysis—they looked at existing workflows to see why they broke down—but it’s incredibly complementary to the ab initio design of new workflows. By applying the knowledge from the MAST paper, you can design workflows that are inherently robust against these common failure modes.
Ben Lorica: Next, I want to discuss two topics: agent optimization and agent engineering. Let’s start with agent optimization. This essentially means treating the agent as a pipeline and optimizing it end-to-end. Tools like TextGrad from Stanford, DSPy (with optimizers like MIPRO), and OpenEvolve are emerging in this space. Am I imagining things, or is there a growing discipline around optimizing agent prompts and pipelines? Do you think this is a lasting trend?
Arun Kumar: It is definitely a real trend. Much of the work we do at RapidFire AI falls into the agent optimization space, specifically regarding experimentation—figuring out how to tune various knobs without wasting credits or GPU resources. Optimization is critical because, in the real world, you are managing a Pareto frontier of evaluation metrics, latency, and total cost of ownership. A configuration that yields the best evaluation metrics might be completely unviable in terms of cost or latency.
Ben Lorica: To anchor this discussion for our listeners, let’s look at a familiar example: Deep Research agents. Anyone can build one, but how do you actually optimize it, and what are the specific knobs you can tune?
Arun Kumar: For personal consumer use, people typically run these on foundational models from vendors like Anthropic or OpenAI.
Ben Lorica: But let’s say you aren’t using a consumer tool. You’re building an internal Deep Research agent using your company’s proprietary collection of PDFs.
Arun Kumar: Oh, I see. Are there companies doing that entirely in-house?
Ben Lorica: Glean does enterprise search.
Arun Kumar: Right, though I believe Glean sits on top of foundational model vendors.
Ben Lorica: Yes, exactly.
Arun Kumar: It’s very difficult to compete directly with OpenAI or Anthropic on the foundational models themselves, though some companies are trying—DeepSeek, for example, recently released their R1 model.
Ben Lorica: Building deep research on your own data means first ensuring you have all the necessary data integrations. Then, there is obviously a RAG-like component, which involves tuning all those specific retrieval knobs. You also have the prompt itself, which can be tuned using tools like DSPy or MIPRO. Beyond that, what other elements are involved in optimization?
Arun Kumar: The workflow structure itself is a major knob. For instance, is the research agent following a strict workflow, or is it entirely autonomous? That is a fundamental design question.
Ben Lorica: By the way, in classic online deep research, it isn’t like standard RAG that just retrieves information once. The idea is that it retrieves data, evaluates it, decides it isn’t quite right, and then goes back to search again.
Arun Kumar: Exactly, that is the autonomous part. Based on intermediate results…
Ben Lorica: Right, that’s the agentic behavior.
Arun Kumar: Right. You essentially have autonomous agents and workflow agents. With workflow agents, you specify the exact process, and the agent executes it. You might build in bounded autonomy—like allowing up to three retries—but it’s constrained. Deep research agents, however, are typically highly autonomous. They decide what additional searches or computations to run based entirely on the intermediate data they produce.
Ben Lorica: And sometimes they even come back to the user and ask clarifying questions.
Arun Kumar: Yes, that introduces the human-in-the-loop component.
Ben Lorica: They might ask, “Is this what you meant?” or “How should I structure the output?” So you can imagine how complex optimizing an agent like this becomes.
Arun Kumar: It is perhaps the most complex example. This human-in-the-loop gating is relatively new. Early deep research agents were fully autonomous, but now you can enable or disable user gating to prevent the agent from wasting tokens by going down meaningless rabbit holes. However, if a user lets the agent run overnight without being present to answer questions, the process gets stuck. So, you still need a robust, fully autonomous mode.
When building this, another challenge is that the output of deep research agents is typically large and unstructured—like generating a full presentation rather than a simple factual answer. How do you even evaluate that? Typically, evaluation is subjective, requiring a human to eyeball the results. The key optimization question is whether you can codify those evaluations into metrics, either using LLM judges or code. If you can establish a suite of input-output pairs and reliably measure how good the final output is, then every other knob we’ve discussed becomes optimizable.
Ben Lorica: Setting that aside, the first step is obviously moving away from manual prompt engineering. Using tools like MIPRO and DSPy allows for a much more principled approach to building prompts, correct?
Arun Kumar: That’s true. In many deployments, users have highly specific system instructions for their use case. Bizarrely, the way you phrase instructions and the order in which you present them can significantly affect accuracy. People are constantly discovering new model behaviors based on prompting. Exploring variations systematically is where these methods are incredibly useful. You likely want to use an LLM to generate your first-cut prompt, and then manually inspect and edit it.
Ben Lorica: By the way, a well-known hack that many people still don’t use is that you can actually threaten the model.
Arun Kumar: I’ve heard about that recently! I know several colleagues who actually yell at the model in their prompts.
Ben Lorica: They do, and apparently, it works. Before we move on to agent engineering, let’s stick with optimization. You obviously have a strong background in data management.
Arun Kumar: Yes.
Ben Lorica: In the data management world, you have automated optimizers. So why do developers have to worry about all these manual adjustments now? Even setting agents aside, why are there so many knobs to turn just for standard RAG?
Arun Kumar: That’s exactly why I started RapidFire AI. There are simply too many complex knobs. People initially claimed RAG would be much easier than fine-tuning, but it turns out agentic RAG workflows have far more variables. These aren’t just model or algorithm knobs; they are data knobs. You have to chunk, embed, retrieve, and rerank. Chunking itself is an art form. For applications dealing with multimodal documents, commodity chunking methods won’t cut it; you need proprietary approaches.
Ben Lorica: Even preceding chunking, there’s information extraction. For example, which PDF extraction library are you using?
Arun Kumar: Even that matters, yes. Many extraction libraries internally use LLMs or ML algorithms, meaning they aren’t strictly deterministic. If you look at this through a classical ML lens, there are three components: data representation engineering, hypothesis space engineering, and hyperparameters.
Data representation engineering covers chunking, embedding, indexing, and even prompt tuning. Hypothesis space engineering involves choosing your base model—are you using Opus, Sonnet, or Haiku? How much are you willing to spend, and what latency can you tolerate? Finally, you have hyperparameters, like temperature. Another crucial hyperparameter is context window size: if the model sees too much information, you risk the “lost in the middle” problem; if it sees too little, it loses vital context.
All these combinations must be constructed and optimized. One of the main reasons agent deployments fail in production is that teams haven’t systematically evaluated what works across these variables. They rely on vibe checks instead of rigorous testing. We built RapidFire AI to change exactly that.
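The combinatorial explosion behind "all these combinations" is easy to see by enumerating even a modest grid. The knob values below are hypothetical placeholders; real choices depend on your data and stack:

```python
from itertools import product

# Hypothetical knob values for a RAG pipeline; adjust for your own stack.
knobs = {
    "chunk_size":  [256, 512, 1024],
    "embedder":    ["embed-small", "embed-large"],
    "top_k":       [3, 8],
    "temperature": [0.0, 0.7],
}

configs = [dict(zip(knobs, values)) for values in product(*knobs.values())]
print(len(configs))  # 3 * 2 * 2 * 2 = 24 variants from just four knobs
```

Add a second chunking strategy, a reranker choice, and two base models, and the grid is already in the hundreds, which is why vibe checks cannot cover it and why the search itself has to be engineered.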
Ben Lorica: One of the main challenges is that testing every single knob is computationally intensive, right?
Arun Kumar: That’s correct.
Ben Lorica: You want developers to try many different embedding models or chunking strategies, but running all those variations is expensive.
Arun Kumar: Exactly. This is why distributed computing frameworks like Ray have become the foundational layer for many AI teams—you absolutely need to scale out.
Ben Lorica: And the great thing about Ray is that you can mix GPUs and CPUs, right?
Arun Kumar: That’s right, which is especially important for data preprocessing. Text preprocessing typically doesn’t run on GPUs. That’s exactly why we chose to build RapidFire AI on top of Ray. It’s a distributed Python orchestrator that offers a lot of generality.
The logic we built on top is analogous to multi-query optimization in the database systems world. When executing multiple queries, a database looks across computational units to reduce redundant data movement and share computation. Bringing that lens to agent engineering, our system automatically shares computation when you test multiple forms of chunking, embedding, retrieval, and reranking. You don’t have to manually create separate indexes; the system optimizes it for you automatically.
In terms of exploring different values for these knobs, you can follow best practices, but we are also building ways to guide the system using natural language prompts. It turns out not all knobs are created equal. For some applications, the chunking method isn’t critical, while for others, it’s make-or-break. Depending on your latency and cost constraints, you want to explore the space systematically rather than relying on YOLO engineering. Our automation logic constrains that search space in an application- and constraint-aware manner.
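The computation-sharing idea borrowed from multi-query optimization can be illustrated with a toy cache. This is only a single-process sketch of the principle, not RapidFire AI's actual implementation: configurations that agree on the expensive upstream steps (chunking, embedding) reuse one index instead of rebuilding it.

```python
from functools import lru_cache

build_calls = 0

@lru_cache(maxsize=None)
def build_index(chunking, embedder):
    """Stand-in for the expensive chunk-and-embed work."""
    global build_calls
    build_calls += 1
    return f"index({chunking}, {embedder})"

# Three configs that differ only downstream share upstream work.
configs = [
    ("fixed-512", "embed-small", "rerank-a"),
    ("fixed-512", "embed-small", "rerank-b"),  # reuses the index above
    ("semantic",  "embed-small", "rerank-a"),
]
for chunking, embedder, reranker in configs:
    index = build_index(chunking, embedder)

print(build_calls)  # 2 indexes built for 3 configurations
```

The savings compound: the more configurations overlap in their upstream knobs, the closer the marginal cost of one extra variant gets to the cost of its unique downstream steps alone.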
Ben Lorica: Otherwise, you might just throw in the towel.
Arun Kumar: Exactly. Because we control the data flow and system allocation, we can launch far more variations even on finite resources. For example, one of our design partners used a 4-GPU machine to run 2,000 configurations of supervised fine-tuning (SFT), whereas previously they could only manage a dozen because they had to wait for one run to finish before starting another. With hyper-parallel execution, you get a dashboard showing all configurations. If you realize certain runs aren’t performing well, you can use early stopping to halt them, which is a common feature. But we also support resuming runs later, and a “clone and modify” feature that allows you to inject new configurations dynamically on the fly.
Ben Lorica: Is there a distinction between using these tools to get to production—doing all the initial testing and tuning—versus using them once you’re in production? Optimization should be an ongoing process, so the same set of tools should help maintain and optimize the setup continuously.
Arun Kumar: That is true, and it’s something we are building at RapidFire AI. We are developing a promptable, natural language chatbot that works with you to optimize your system continuously. It understands the data and metrics you provide. LLMs are being infused into every part of knowledge workflows, and we can leverage these powerful reasoning models to reduce the cognitive burden on engineers.
Engineers constantly ask: “How do I ensure my application is Pareto optimal? How do I ensure it remains reliable and deployable?” We can infuse models throughout these processes to aid professionals. Today’s forward-deployed engineers are the new data scientists. They are taking pre-trained foundational models and bridging the gap to real-world applications. The principles and techniques must evolve alongside them.
Ben Lorica: I see. So, how would you define “agent engineering”?
Arun Kumar: I actually posted a snarky joke about this on LinkedIn that resonated with a lot of people: Agent engineering is the new context engineering, which was the new model engineering, which was the new feature engineering, which was the new knowledge engineering.
Jokes aside, the capabilities of AI applications have taken a quantum leap across each of these generations. People always want more—it’s a hedonic treadmill. What we accomplished with regular deep learning became a commodity, so we jumped to context engineering. When that became a commodity, we jumped to agent engineering. The complexity of the use cases just keeps increasing.
Ben Lorica: Is there an overlap with tools like Ray Tune that allow for hyperparameter tuning?
Arun Kumar: Yes, I would say our approach is a generalization of the Ray Tune concept. Let me explain the difference in the parallelization mechanism. Ray Tune does task parallelization—if you have four workers, you can do four things at once. At RapidFire AI, we use a mechanism called hyper-parallel execution. I can launch a thousand combinations simultaneously, even on those same four workers. Not all combinations will run to completion. We automatically partition the data and provide quick estimates in an online, aggregated manner, so you don’t have to wait until the very end to see how well a configuration performs.
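One way to get intuition for running a thousand combinations on four workers is a successive-halving sketch: score every config on a small data slice, keep the best fraction, and grow the slice for survivors. This is a generic hedged illustration of that family of techniques, not RapidFire AI's actual hyper-parallel mechanism:

```python
# Hedged sketch in the spirit of successive halving.

def successive_halving(configs, score_fn, keep_frac=0.5, rounds=3):
    survivors = list(configs)
    slice_size = 1
    for _ in range(rounds):
        ranked = sorted(survivors, key=lambda c: score_fn(c, slice_size),
                        reverse=True)
        survivors = ranked[:max(1, int(len(ranked) * keep_frac))]
        slice_size *= 2  # survivors earn a bigger evaluation budget
    return survivors

# Toy scoring function: config i simply scores i, independent of slice size.
best = successive_halving(range(8), lambda cfg, n: cfg)
print(best)  # the strongest of 8 configs survives three halvings
```

Most of the budget goes to configurations that earned it on early partial results, which is the same economy Arun describes: you see online estimates as data accumulates rather than waiting for every run to finish.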
Ben Lorica: I just realized we might even be oversimplifying things. When building an agent, you have tools, skills, and context files like agents.md. You have to optimize all of those as well. Interestingly, a new paper just came out suggesting that using agents.md files actually makes performance worse.
Arun Kumar: I saw that. Yes, it turns out you’re often better off without those instructions.
Ben Lorica: And that introduces yet another knob to tune, right?
Arun Kumar: Exactly. Then, to your earlier point, you might realize, “This skill is too complicated; it should actually be broken down into three separate skills.”
Ben Lorica: Yes, exactly. It becomes an unending process. AI teams obviously need to tune their applications to make them reliable, but at some point, you have to decide to stop running experiments, right?
Arun Kumar: That’s right, because you don’t have infinite money for tuning. At some point, the return on investment is satisfactory, and you just have to ship it. Once it’s in production, you can collect more real-world input-output pairs.
Ben Lorica: And then that real-world data informs your ongoing optimization.
Arun Kumar: Exactly.
Ben Lorica: On that somewhat daunting note, thank you, Arun.
Arun Kumar: Thank you for having me, Ben, and thank you to your audience as well.
