
Building the Operating System for AI Agents


Chi Wang on Multi-Agent Systems, Real-World Agentic Apps, Evaluation Methods, and AG2’s Roadmap.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Chi Wang is the co-creator of AG2 and a Senior Staff Research Scientist at Google DeepMind. This episode explores AG2, an open-source “agent OS” that provides infrastructure for developers to build sophisticated multi-agent AI systems. Chi explains how AG2 differentiates itself from other frameworks by supporting diverse interaction patterns and the full development lifecycle, from prototyping to production deployment. Real-world applications in chip design, scientific research, VC analysis, and software engineering demonstrate how AG2 has dramatically reduced the time and effort required for complex knowledge work.






Transcript

Below is a heavily edited excerpt, in Question & Answer format.

Q: What is AG2 and what problem is it trying to solve?

A: AG2 is an open-source “agent OS” that provides common infrastructure for developers to build agents and agentic software. It helps developers quickly leverage generative AI technologies to create powerful agents, enabling multiple agents to work together on complex tasks with efficient deployment paths.

The fundamental problem we’re solving is how to provide a robust framework that lets teams explore the large design space of agent-based systems and find optimal design points. We’ve approached this by creating a unified interface for different agent types that simplifies orchestration, making it intuitive to define which agent speaks first, which speaks next, how they interact, and so on.
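
To make that unified interface concrete, here is a minimal sketch of AG2’s canonical two-agent pattern, modeled on its quick-start examples (the model name and key handling are placeholders):

```python
# A minimal sketch of AG2's canonical two-agent pattern.
import os

from autogen import AssistantAgent, UserProxyAgent

# Placeholder model/key configuration; adapt to your provider.
llm_config = {"config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]}

# An LLM-backed agent that drafts answers and code.
assistant = AssistantAgent("assistant", llm_config=llm_config)

# A proxy for the user that can also execute the assistant's code locally.
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",  # fully automated for this sketch
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# Orchestration is just "who talks to whom": kick off a back-and-forth chat.
user_proxy.initiate_chat(assistant, message="Plot a sine wave and save it to sine.png.")
```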

Q: What led you to focus on multi-agent systems rather than other AI challenges?

A: It started with the AutoML project I began about six years ago, which focused on tuning hyperparameters for machine learning models. When powerful generative AI models like GPT emerged, I began exploring whether AutoML techniques could optimize these newer models.

We discovered there’s an enormous difference in performance based on how you configure and use these models. For example, on certain coding benchmarks, we saw performance ranging from 6% to 90% using the same model but with different configurations, inference approaches, and external components.

I realized developers needed a framework to efficiently explore this large design space, and agents became the most intuitive concept for reasoning about this. We started with a simple two-agent conversation and quickly found we needed to support more complex multi-agent interactions, which led to building a comprehensive framework.

Q: What are the key elements that define an “agent OS” as opposed to just an agent framework?

A: We’re taking a layered approach to defining what constitutes an agent OS. The most essential concepts are agents and agent interaction/orchestration.

For agents, the fundamental capabilities are receiving messages, generating replies (whether from an LLM backend, tool execution, or human input), and sending messages to other agents.

For agent interaction, we need to understand and support different interaction patterns – sequential chats, group chats, message-based communications, and specialized patterns like our recent “swarm” concept inspired by OpenAI’s swarm framework.
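
As a sketch of one such pattern, a group chat in AG2 delegates speaker selection to a manager agent each round; the roles and task below are invented for illustration:

```python
import os

from autogen import AssistantAgent, GroupChat, GroupChatManager, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]}

# Role-specialized agents; the roles are invented for this example.
planner = AssistantAgent("planner", llm_config=llm_config,
                         system_message="Break the task into concrete steps.")
writer = AssistantAgent("writer", llm_config=llm_config,
                        system_message="Draft the requested content.")
reviewer = AssistantAgent("reviewer", llm_config=llm_config,
                          system_message="Critique drafts and request fixes.")
user = UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)

# The manager selects the next speaker each round, up to max_round turns.
group = GroupChat(agents=[user, planner, writer, reviewer], messages=[], max_round=12)
manager = GroupChatManager(groupchat=group, llm_config=llm_config)

user.initiate_chat(manager, message="Write a short explainer on multi-agent systems.")
```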

What makes our approach unique is the connection between single agents and multi-agent systems. You can start with primitive agents (using simple backends), have them interact together, and then construct stronger meta-agents containing these multiple agents. This creates a recursive pattern where you can build increasingly sophisticated agents without changing the fundamental concepts.
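
Continuing the previous sketch, AG2’s contrib modules include an agent for exactly this wrapping step; the SocietyOfMindAgent import path and constructor shown here are assumptions that may vary across versions:

```python
from autogen.agentchat.contrib.society_of_mind_agent import SocietyOfMindAgent

# Wrap the inner group chat (the manager from the previous sketch) so it
# presents as a single agent: it runs the whole internal conversation and
# returns one summarized reply to the outside world.
meta_agent = SocietyOfMindAgent(
    "research_team",
    chat_manager=manager,  # GroupChatManager built in the previous sketch
    llm_config=llm_config,
)

# The meta-agent can now join a higher-level chat, so the pattern recurses.
user.initiate_chat(meta_agent, message="Produce a reviewed explainer on multi-agent systems.")
```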

Unlike simpler frameworks, an agent OS needs to support the full development lifecycle – from prototyping to testing to production deployment, including observability, monitoring, and evaluation capabilities.

Q: How does AG2 differ from other agent frameworks like LangGraph or CrewAI?

A: While all these frameworks help developers build agent systems, they have different design principles and starting points.

AG2 aims to support forward-looking agent capabilities beyond just sequential workflows. We support various conversation patterns including sequential chats, group chats, nested chats, and our new “swarm” feature which provides conditional handoffs and context variable sharing.

The key advantage of AG2 is that it supports all these different workflow types without changing its fundamental concepts. As the first framework in this space, we’ve also developed advanced features like teachability (allowing agents to learn over time), captain agents (which can manage other agents and assemble agent teams on the fly), and real-time agents for live communication.
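
A hedged sketch of the swarm pattern follows; the names (SwarmAgent, ON_CONDITION, initiate_swarm_chat, AfterWorkOption) are taken from AG2’s swarm documentation around the time of this episode, and the exact API has shifted across releases, so treat them as assumptions:

```python
import os

from autogen import ON_CONDITION, AfterWorkOption, SwarmAgent, initiate_swarm_chat

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]}

triage = SwarmAgent("triage", llm_config=llm_config,
                    system_message="Classify the request, then hand off.")
billing = SwarmAgent("billing", llm_config=llm_config,
                     system_message="Resolve billing questions.")

# Conditional handoff: triage passes control to billing when the condition holds.
triage.register_hand_off([ON_CONDITION(target=billing,
                                       condition="The request concerns billing.")])

chat_result, context, last_agent = initiate_swarm_chat(
    initial_agent=triage,
    agents=[triage, billing],
    messages="I was double-charged this month.",
    context_variables={"customer_tier": "pro"},  # shared, mutable state
    after_work=AfterWorkOption.TERMINATE,
)
```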

Q: What’s the typical workflow for building multi-agent systems with AG2?

A: The workflow can be simplified into two main steps:

  1. Determine what agents you need to solve your problem: how many agents, what roles each one plays, and how to define them
  2. Describe how these agents should interact, using the conversation programming patterns we provide

Even people without coding backgrounds can conceptualize this approach because it mirrors how humans solve problems. The programming itself is relatively straightforward.

We’ve recently added new features to further simplify this process. Users can start with a “captain agent” that can take a task, automatically construct a team of agents, decompose the problem, and attempt to solve it without requiring manual specification of which agents to use or which tools they need. By observing what the captain agent does, users can then refine the approach by creating more specialized agents or adjusting interaction patterns.
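
A hedged sketch of starting from a captain agent; the import path and parameters are assumptions based on AG2’s contrib documentation and may differ by version:

```python
import os

from autogen import UserProxyAgent
from autogen.agentchat.contrib.captainagent import CaptainAgent

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]}

# The captain decomposes the task, assembles a team of agents on the fly,
# and runs the resulting conversation without manual agent specification.
captain = CaptainAgent(
    name="captain",
    llm_config=llm_config,
    code_execution_config={"work_dir": "captain_work", "use_docker": False},
)
user = UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)

user.initiate_chat(captain,
                   message="Find recent work on agent evaluation and summarize it.")
```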

Q: What are some real-world examples of AG2 being used successfully?

A: Several impressive examples stand out:

  1. Nvidia’s chip design use case: They’ve reduced design tasks that previously took days or weeks down to just minutes. By carefully incorporating human expertise into the agent framework, they raised what was initially low accuracy to perfect accuracy, and are now using it in production for multiple task types.
  2. Scientific applications: MIT developed agents that can use ontological knowledge graphs to navigate and reason like human scientists, generating high-quality scientific hypotheses and full research reports. Similar approaches are being used for protein design and material science.
  3. VC research: Better Future Labs has been using AG2 for over a year to analyze business ideas, conduct market analysis, evaluate competitors, and generate investment memos with high-quality results.
  4. Cosmology laboratory: Cambridge built a self-driving cosmology laboratory using AG2 to analyze enormous volumes of telescope data (larger than most internet traffic). Their agents leverage both language capabilities and existing specialized software, reducing what used to be a day’s work to tens of seconds.
  5. Software engineering: A small team from Uruguay created a top-ranked solution for the SWE Bench Lite bug-fixing benchmark using multi-agent approaches. They can fix nearly 50% of bugs within minutes at a cost of about $1 per bug.

The common pattern across these examples is automating complex tasks that previously required significant manual effort by domain experts, whether they’re chip designers, scientists, analysts, or developers.

Q: What improvements in foundation models would most benefit multi-agent systems?

A: Currently we’re seeing benefits from reasoning-enhanced models and multi-modal capabilities, but several areas would particularly advance agent systems:

  1. GUI interaction capabilities: Many knowledge workers still interact with graphical interfaces, so models that better understand and control GUIs would be tremendously valuable.
  2. Higher-level planning capabilities: As we get more specialized models with different strengths and weaknesses, we need meta-level planning to intelligently organize and orchestrate these specialized agents. This relates to the “captain agent” concept we’ve implemented.
  3. Creativity and curiosity: The most effective multi-agent systems simulate organizational structures with different roles. For example, Better Future Labs created a hierarchical team with directors giving instructions to managers who create questions for researchers. After researchers deliver initial results, managers critique and push for improvements. This approach produces much higher quality outputs.

Interestingly, agents can actually help improve foundation models through a virtuous cycle. For example, we built a reasoning agent feature in AG2 that allows non-reasoning models to perform multi-step reasoning. The data generated by these agentic reasoning processes can then be used to train better foundation models.
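
A hedged sketch of that reasoning agent feature; the import path and reason_config keys are assumptions based on AG2’s contrib documentation and may differ by version:

```python
import os

from autogen import UserProxyAgent
from autogen.agentchat.contrib.reasoning_agent import ReasoningAgent

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]}

# Wraps a non-reasoning model in an explicit search over intermediate
# reasoning steps (tree-of-thought style); the traces it produces are the
# kind of data that can feed back into training better foundation models.
reasoner = ReasoningAgent(
    "reasoner",
    llm_config=llm_config,
    reason_config={"method": "beam_search", "beam_size": 3, "max_depth": 4},
)
user = UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)

user.initiate_chat(reasoner, message="A train covers 120 km in 1.5 hours. "
                                     "What is its average speed? Show your steps.")
```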

Q: How do you evaluate multi-agent systems?

A: Evaluation is challenging but crucial. We’ve developed two approaches:

  1. AgentEval: This uses an agent to analyze the problem-solving process, breaking evaluation down into multiple dimensions. For math problem-solving, we can evaluate not just the success rate but deeper insights like problem understanding, method selection, and calculation accuracy.
  2. System behavior analysis: We’ve developed methods to identify which agent might be a bottleneck or responsible for major issues, pinpointing where things first go wrong rather than manually reviewing the entire process.

Multi-agent systems actually provide better evaluation opportunities than monolithic architectures because we can perform more specific unit testing on individual agents for particular tasks.
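
Here is a simplified sketch in the spirit of AgentEval, built from plain AG2 agents rather than AgentEval’s actual interface; the dimensions, prompts, and trace are illustrative:

```python
import os

from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]}

# Critic proposes evaluation dimensions; quantifier scores a trace against them.
critic = AssistantAgent(
    "critic", llm_config=llm_config,
    system_message=("Given a task, list the dimensions along which a solution "
                    "should be judged, such as problem understanding, method "
                    "selection, and calculation accuracy."),
)
quantifier = AssistantAgent(
    "quantifier", llm_config=llm_config,
    system_message="Score the solution trace 1-5 on each dimension, with reasons.",
)
user = UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)

task = "A train covers 120 km in 1.5 hours; find its average speed."
trace = "Agent's answer: 120 / 1.5 = 80 km/h."

dims = user.initiate_chat(critic, message=f"Task: {task}", max_turns=1)
user.initiate_chat(quantifier,
                   message=f"Dimensions:\n{dims.summary}\n\nTrace: {trace}",
                   max_turns=1)
```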

Q: How do you approach explainability in multi-agent systems?

A: Explainability requires two key elements:

  1. Reasoning transparency: Agents need to show their reasoning steps, not just final answers. The recent reasoning-enhanced models like DeepSeek are helpful here as they explicitly output reasoning tokens.
  2. Knowledge attribution: Agents must indicate when they’re using external tools or knowledge bases rather than their internal knowledge. For technical domains like chip design, it’s crucial that models properly show when they’re leveraging external tools or human-written instructions.

When multiple agents contribute to a solution, their individual reasoning steps can be combined and summarized to provide a comprehensive explanation of the decision-making process. If this narrative doesn’t make sense to domain experts, it signals potential issues with the solution.

Q: What’s on AG2’s roadmap for the next 6-12 months?

A: We’re focusing on three main areas:

  1. Advanced capabilities: Our core strength is enabling advanced users to quickly try different ideas. We’re enhancing the captain agent direction to further reduce the manual effort required to develop agent systems, while adding support for multi-modal capabilities and other advanced features.
  2. Ease of getting started: We need to make it easier for new users to navigate AG2’s many features and quickly find what they need.
  3. Production deployment: We’re building tools to help users move beyond prototyping and connect their agent systems to different environments and infrastructure for production use.

Of these, advancing the core capabilities remains our primary focus to ensure AG2 stays ahead of the curve in supporting the most advanced AI technologies and applications.

Q: With increasingly capable agents, will there still be roles for human knowledge workers?

A: In the short term, there’s no risk of agents replacing knowledge work entirely – there are still many complex tasks where agent systems fall short. Even in the long term, when we reach a point where many tasks could be delegated to agents, the important question becomes: when do we want to delegate and when do we want to do tasks ourselves?

This question of agency and human direction will become increasingly important in the future. But for now, there are still many hard problems to solve before agents can faithfully follow our guidance and instructions for all types of knowledge work.
