Bridging the AI Agent Prototype-to-Production Chasm

Ilan Kadar on Synthetic Data, Knowledge Graphs, Reinforcement Learning, and Workflow Integration.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Ilan Kadar is co-founder and CEO of Plurai. Deploying AI agents from prototype to production presents significant hurdles, particularly in achieving reliable performance and earning user trust. This episode introduces IntellAgent, an open-source platform that addresses these challenges through synthetic data generation, knowledge graph construction, and reinforcement learning for agent optimization. IntellAgent offers a comprehensive testing methodology and practical guidance for developers who need to ship robust, trustworthy AI agents, especially in high-stakes domains. The conversation also covers the project's roadmap, including broader framework support and agent optimization capabilities.

Subscribe to the Gradient Flow Newsletter



Transcript.

Below is a heavily edited excerpt, in Question & Answer format.

Q: What is IntellAgent and what problem does it solve?

A: IntellAgent is an open-source framework for comprehensive diagnosis and optimization of AI agents using simulated, realistic, synthetic interactions. We developed it to address the primary challenge holding back AI agent adoption: performance quality and trust. According to LangChain’s public survey on the state of agents, 50% of companies are stuck in the prototype phase due to concerns about performance quality and trust. Many organizations are hesitant to deploy agents in production because they can’t confidently predict how these agents will perform, especially when given the ability to perform actions on behalf of users. IntellAgent provides a systematic approach to testing, diagnosing, and improving these agents before deployment.

Q: What types of AI agents is IntellAgent best suited for?

A: IntellAgent is particularly valuable for agents where the cost of errors is high. Customer service agents, travel booking systems, retail assistants, and agents in legal and finance domains are perfect candidates. Essentially, any agent that makes decisions or takes actions on a user’s behalf needs thorough testing. IntellAgent is especially useful for multi-turn conversational agents where the interactions are complex and the agent might need to consult knowledge sources or use tools to complete tasks. While it can also be used for simpler RAG-based systems, the real value comes when testing agents that perform complex, multi-step tasks where reliability is critical.

Q: How does IntellAgent compare to what teams are currently using to test their agents?

A: Currently, most teams use small, manually curated test sets that don’t adequately cover the full range of possible interactions. This approach creates a “you don’t know what you don’t know” problem – similar to what we faced in the autonomous driving space. Some companies simply release agents internally and rely on employees to catch issues, while others deploy externally and pull back when problems arise. There’s been no systematic way to simulate the diverse, complex interactions agents might encounter in the real world. IntellAgent addresses this by generating comprehensive test scenarios that provide much broader coverage than manually created test sets, enabling teams to identify and fix issues before deployment.

Q: Could you walk us through how IntellAgent works?

A: The typical workflow has three main inputs: policy documents that describe how the agent should behave, tools/functions the agent can use, and a database of relevant information (like user data, flight information, reservations, etc.).

First, IntellAgent automatically transforms these documents into a knowledge graph where each node represents a policy (like refund policies or booking procedures) and edges represent the likelihood of policies occurring together.
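The structure described above can be sketched as a small data type: nodes are policies, and weighted edges record how likely two policies are to appear in the same interaction. This is an illustrative sketch, not IntellAgent's actual implementation; the policy names are hypothetical.

```python
# Sketch of a policy graph: nodes are policies extracted from documents,
# weighted edges estimate how likely two policies co-occur in one interaction.
from dataclasses import dataclass, field

@dataclass
class PolicyGraph:
    # adjacency map: policy -> {neighbor: co-occurrence likelihood in [0, 1]}
    edges: dict = field(default_factory=dict)

    def add_policy(self, name: str) -> None:
        self.edges.setdefault(name, {})

    def link(self, a: str, b: str, likelihood: float) -> None:
        """Record (symmetrically) that policies a and b tend to occur together."""
        self.add_policy(a)
        self.add_policy(b)
        self.edges[a][b] = likelihood
        self.edges[b][a] = likelihood

graph = PolicyGraph()
graph.link("refund_policy", "booking_procedure", 0.7)
graph.link("refund_policy", "id_verification", 0.4)
```

The symmetric edges make the later traversal step direction-agnostic: either policy can pull the other into a scenario.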

Second, we use this graph to generate realistic scenarios by traversing the graph and creating combinations of policies that might occur together. We then simulate dialogues based on these scenarios against your agent.
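One plausible way to realize the traversal step is a weighted random walk over the adjacency map from the previous sketch: high-likelihood neighbors are sampled more often, yielding policy combinations that mirror realistic co-occurrence. Again, this is an assumed sketch, not IntellAgent's actual sampler.

```python
import random

def sample_scenario(edges, start, max_policies=3, rng=None):
    """Walk the policy graph, preferring high-likelihood neighbors, to
    assemble a plausible combination of policies for one test dialogue."""
    rng = rng or random.Random(0)
    scenario, current = [start], start
    while len(scenario) < max_policies:
        # only consider neighbors not already in this scenario
        neighbors = {p: w for p, w in edges[current].items() if p not in scenario}
        if not neighbors:
            break
        policies, weights = zip(*neighbors.items())
        current = rng.choices(policies, weights=weights)[0]
        scenario.append(current)
    return scenario

edges = {
    "refund_policy": {"booking_procedure": 0.7, "id_verification": 0.4},
    "booking_procedure": {"refund_policy": 0.7, "seat_upgrade": 0.5},
    "id_verification": {"refund_policy": 0.4},
    "seat_upgrade": {"booking_procedure": 0.5},
}
scenario = sample_scenario(edges, "refund_policy")
```

Raising `max_policies` is one simple knob for dialing up scenario complexity.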

Finally, a critic model evaluates each step the agent takes, looking for policy violations and generating a detailed report showing success rates for each policy and how the agent performs at different complexity levels. This gives you a precise understanding of your agent’s strengths and weaknesses across the full spectrum of possible interactions.
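The reporting step boils down to aggregating per-step critic verdicts into per-policy success rates. A minimal sketch of that aggregation, with hypothetical field names:

```python
from collections import defaultdict

def build_report(step_evaluations):
    """Aggregate per-step critic verdicts into per-policy success rates.
    Each evaluation is a tuple: (policy, complexity_level, passed)."""
    totals = defaultdict(lambda: [0, 0])  # policy -> [passed_count, total_count]
    for policy, _level, passed in step_evaluations:
        totals[policy][1] += 1
        totals[policy][0] += int(passed)
    return {p: passed / total for p, (passed, total) in totals.items()}

evals = [
    ("refund_policy", 1, True),
    ("refund_policy", 3, False),   # agent violated the policy at complexity 3
    ("booking_procedure", 2, True),
]
report = build_report(evals)
```

Grouping by the complexity level instead of (or in addition to) the policy yields the complexity-level breakdown mentioned above.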

Q: How is the knowledge graph constructed automatically?

A: We use a multi-step approach with LLMs to construct the knowledge graph. First, we extract potential conversation flows from the documents. Then we identify the specific policies within those flows. Finally, we create edges between policy nodes by asking the LLM to determine the likelihood of two policies occurring together in the same interaction.

The documents don’t need to be structured in any particular way – we can work with unstructured text. While it’s helpful if documents have chapters or sections, it’s not required. The system will extract entities and relationships automatically, similar to approaches like GraphRAG, but focused specifically on agent policies and behaviors.
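The edge-creation step described above can be approximated by prompting an LLM for a numeric likelihood and clamping the parsed result. The prompt wording and the `llm` callable are illustrative assumptions, not IntellAgent's actual prompt or API.

```python
def estimate_edge_likelihood(policy_a, policy_b, llm):
    """Ask an LLM how likely two policies are to co-occur in one interaction.
    `llm` is any callable that takes a prompt string and returns text."""
    prompt = (
        "On a scale from 0.0 to 1.0, how likely are these two policies to "
        f"apply in the same customer interaction?\nA: {policy_a}\nB: {policy_b}\n"
        "Answer with a single number."
    )
    # parse the model's reply and clamp it into the valid [0, 1] range
    return min(1.0, max(0.0, float(llm(prompt).strip())))

fake_llm = lambda prompt: "0.7"  # stand-in for a real model call
weight = estimate_edge_likelihood("refund_policy", "booking_procedure", fake_llm)
```

In practice a real model's reply would need more defensive parsing (e.g. extracting the first number from a longer answer).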

Q: How do you ensure you’re capturing edge cases in your testing?

A: We strike a careful balance between generating realistic scenarios that reflect real-world distributions and creating challenging, diverse test cases. We’ve validated our approach by comparing results against manually curated datasets like the TAU benchmark, showing high correlation between our synthetic data and these human-created test sets.

The key advantage is that we can generate a much more comprehensive set of test cases covering the full spectrum of complexity levels and policy combinations. We can create scenarios that are increasingly challenging while remaining realistic, allowing us to test the boundaries of an agent’s capabilities in ways that manually created test sets cannot.

Q: What foundation models does IntellAgent support?

A: We’ve designed IntellAgent to be as flexible as possible. You can use any foundation model you prefer, including models from OpenAI, Google (Gemini), Anthropic, and models available through AWS Bedrock or other providers. We’re also expanding support for models optimized for reasoning.

One interesting insight from our testing is that the most advanced and expensive models aren’t always the best choice. For example, we found that Google’s Gemini Pro outperforms GPT-4o on simpler tasks, but as tasks become more challenging, GPT-4o actually performs better. This demonstrates the importance of testing different models specifically for your use case rather than assuming the newest or most expensive model is always best.

Q: What’s on the roadmap for IntellAgent?

A: We have three main priorities for the next 6-12 months:

  1. Expanding framework support to include AutoGen, Poe AI, and other agent frameworks beyond our current support for LangChain/LangGraph.
  2. Enhancing our graph generation capabilities to incorporate user interaction data. This will allow the system to learn from historical conversations and create more accurate likelihood estimates in the policy graph based on real-world distributions.
  3. Developing optimization capabilities. Currently, IntellAgent helps identify problems, but we’re working on using the synthetic data we generate to train small models that can help reduce failures. These models will act as an external layer between user input and your agent, essentially providing dynamic system prompt optimization.
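The third roadmap item, a layer that sits between the user input and the agent, could look roughly like the following. Everything here is a hypothetical sketch of the idea as described, with stand-in components; it is not Plurai's implementation.

```python
def optimized_agent(user_input, base_system_prompt, agent, guard_model):
    """Sketch of an external optimization layer: a small model inspects the
    user input and appends targeted policy reminders to the system prompt
    before the main agent runs (dynamic system prompt optimization)."""
    reminders = guard_model(user_input)  # e.g. ["Cite the refund window."]
    system_prompt = base_system_prompt
    if reminders:
        system_prompt += "\nReminders:\n" + "\n".join(f"- {r}" for r in reminders)
    return agent(system_prompt, user_input)

# stand-ins for a trained small model and the main agent
guard = lambda text: ["Cite the refund window."] if "refund" in text else []
echo_agent = lambda sp, ui: f"[{sp.count('Reminders')} reminder block(s)] reply to: {ui}"

reply = optimized_agent("I want a refund", "Be helpful.", echo_agent, guard)
```

The appeal of this design is that the guard model can be retrained on fresh synthetic failures without touching the main agent at all.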

Q: Is IntellAgent similar to penetration testing but for AI agents?

A: Exactly! It’s like penetration testing but focused on reliability and quality rather than security. We simulate increasingly challenging user interactions to identify where your agent might break or fail to follow policies. This “stress testing” approach helps you understand your agent’s limitations before deployment and make targeted improvements.

Q: What’s the current status of IntellAgent and Plurai?

A: IntellAgent is an open source project that’s about a month old, but it’s already gaining significant traction. The open source version provides testing and evaluation capabilities, with a basic UI for visualizing results.

On the commercial side, we’re working with several companies who approached us after the open source release. We’re developing premium features including a more sophisticated simulator that can be integrated into release pipelines and advanced optimization capabilities to help companies quickly improve their agents based on test results. Our mission is to help organizations move confidently from prototype to production with capable, reliable AI agents.