Why Traditional Observability Falls Short for AI Agents

Lior Gavish on Traces & spans, LLM-as-judge, and agent reliability in production.


Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Lior Gavish, CTO and co-founder of Monte Carlo Data, joins Ben Lorica to discuss the critical transition from data observability to agent observability in production environments. They explore how data teams are evolving into AI teams, the necessity of granular telemetry for non-deterministic agent workflows, and the rise of “observability agents” designed to monitor complex systems. The conversation also covers the intersection of data quality and agent reliability, the challenges of governing unstructured data, and why cross-functional collaboration is the key to successful AI deployment. (Here’s a link to the article we refer to in the episode.)

Subscribe to the Gradient Flow Newsletter






Transcript

Below is a polished and edited transcript.

Ben Lorica: Today we have Lior Gavish, the CTO and co-founder of Monte Carlo Data. You can find them at montecarlodata.com. Their taglines are “Trust your agents in production” and “Close the loop between the data inputs and agent outputs with data and AI observability.” Welcome to the podcast, Lior.

Lior Gavish: Hi, Ben. Excited to be here.

Ben Lorica: A few years ago when we spoke, your rallying message was around data quality, data observability, and perhaps data governance. Is my recollection correct?

Lior Gavish: Yes, that’s right. We started the company about seven years ago with the mission to help data teams build trusted and reliable products. We primarily did that through data observability—capturing telemetry about pipelines and data products to understand issues proactively, rather than waiting for feedback from stakeholders.

Today, our focus has shifted toward trusting agents. This is because the data teams we serve have shifted their workloads. When we started, it was perhaps 80% analytics and 20% machine learning. Today, most data teams have transitioned into “Data and AI” teams. They are increasingly building and operating agents to automate business processes or deliver innovative user experiences. Our goal is to help these teams ensure what they build is reliable. We’ve extended our solution to cover data, AI, and specifically agents, ensuring they work reliably and are trusted by users.

Ben Lorica: To clarify, these teams used to be responsible for building and maintaining pipelines, mostly for analytics. Now, it’s much more AI-focused. You’re seeing more of these teams relying on agents. For listeners who are skeptical about how much automation is actually happening in data engineering, can you give us a high-level overview of the current reality for these teams?

Lior Gavish: The reality is shifting quickly. The entry-level data engineering job—where a fresh graduate is told to “build a pipeline” to get familiar with the platform—is being automated away.

Ben Lorica: I think engineering is leading the pack in terms of automation, perhaps alongside customer support.

Lior Gavish: Data engineering, in particular, lends itself to this.

Ben Lorica: Correct. There is a lot of code involved, and AI handles code extremely well. Grunt work can now be done quickly with code generation tools. But is this only happening in Bay Area tech startups, or are you seeing this in traditional enterprises too?

Lior Gavish: Our customers span every sector, including manufacturing, media, education, and logistics. Adoption of AI tools for workflows, like code generation, is common across every sector and geography. I hardly run into anyone who doesn’t use AI for coding.

Regarding building custom AI tools for their own business—like automating customer support or sales interactions—that is also spreading. A year ago, I would have said AI was primarily in the news and the Bay Area. Now, a huge portion of our customers are either in the early stages of production with agents or about to go live. Some have already scaled, deploying thousands of agents serving real customers. We are at an inflection point this year regarding the breadth of adoption across all sectors.

Ben Lorica: Five years ago, you built tools for data observability where humans built the pipelines. Now, pipelines are being built by agents, or agents are kicking off the jobs. Your product serves a similar purpose, but the “user” has changed.

Lior Gavish: Dramatically. AI allows teams to scale and do much more. This has introduced a new level of complexity that requires a new level of automation. We’ve introduced “observability agents” to automate the workflows of monitoring and troubleshooting these systems. People no longer have the time to manually figure out how to debug these complex systems, so we built agents to automate that work.

Ben Lorica: So, it’s basically agents monitoring agents.

Lior Gavish: Exactly. It’s a little bit meta.

Ben Lorica: Because of AI, entry-level work is being eliminated. But on the other end, is there a trend where non-coding analysts can now build things? Are there more people involved in building these agents, creating more things for you to monitor?

Lior Gavish: The biggest trend there is “conversational BI.” The idea is that we may not need to build custom pipelines and reports for every single need. Instead, we can expose a natural language interface that allows people to answer questions based on data. This eliminates much of the back-and-forth between data engineers and analysts.

Ben Lorica: This is why companies like Databricks and Snowflake are focusing on semantic layers and context stores. You can’t just plug in an AI without context. But once you do, business users start generating reports and assuming they are always right. That increases the surface area for potential errors.

Lior Gavish: Absolutely. It democratizes access to data, which creates more opportunities for things to go wrong. This increases the surface area you need to monitor. Additionally, those agents need to be monitored in and of themselves. It is very hard to understand how well agents are performing at scale. It has created a “new beast”—a workload called agents that must be observed to deliver trusted services.

Ben Lorica: Let’s pivot to agent observability. We’ve established that agents are real and widespread. What are the three to five things that are different about observability in the age of agents?

Lior Gavish: First, agents are complex and require very granular telemetry, specifically traces and spans. You need to understand how an agent arrived at its conclusion. A final answer might be the result of dozens or hundreds of small steps—tool calls to fetch data, LLM calls to reason, or decision-making steps. Collecting all that telemetry is point number one.

Ben Lorica: When a human does the work, you can just talk to them. With an agent, you need those granular traces to understand the reasoning.

Lior Gavish: Right. Agents also touch many different data sources. Understanding how those combined to produce an output is critical. Furthermore, unstructured data presents new challenges. These systems generate free-form text or images, often for open-ended use cases. The breadth of interactions is huge and hard to anticipate.

This leads to the second point: how to interpret all this telemetry. We use techniques like “LLM as a judge,” where one LLM grades the performance of another. Failure can happen for many reasons: new user questions the agent wasn’t built for, bad underlying data, or models changing unexpectedly.
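The "LLM as a judge" technique mentioned above can be sketched as follows. This is a hypothetical example, not Monte Carlo's implementation: `call_model` stands in for whatever LLM client you use, and is stubbed here so the example runs without network access.

```python
# Sketch of "LLM as a judge": one model grades another model's output.

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Answer: {answer}
Reply with a score from 1 (wrong) to 5 (fully correct) and nothing else."""

def call_model(prompt: str) -> str:
    # Stub: a real implementation would call your LLM provider here.
    return "4"

def judge(question: str, answer: str) -> int:
    """Ask the judge model to grade an answer, returning a 1-5 score."""
    raw = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {raw!r}")
    return score

score = judge("What was Q3 revenue?", "Q3 revenue was $4.2M.")
print("quality score:", score)
```

Scores like this, collected across production traffic, are what turn free-form agent output into something measurable: a dip in the aggregate score is a signal that new question types, bad underlying data, or a model change has degraded quality.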

Ben Lorica: When you go into a company that has already built agents, what is the typical state of their telemetry?

Lior Gavish: There are many tools available to help capture telemetry, so about half the time, people are collecting it. However, extracting insight from that telemetry is the real hurdle.

Ben Lorica: Some might argue we don’t need new tools because we already had telemetry before agents. Why do we need specialized tools?

Lior Gavish: Two reasons. First, agent telemetry introduces new security and compliance challenges. Conversations often contain sensitive PII or IP. Unlike traditional application telemetry, this data is extremely sensitive. Our architecture ensures that data stays within the customer’s control and environment.

Second, traditional tools struggle to measure the quality of agent outputs. How do you determine if a customer support interaction or a legal document analysis worked as intended? Traditional observability tools “shrug” at these questions. You need a new set of techniques to track execution and produce measurable quality scores.

Ben Lorica: In data engineering, when a pipeline breaks, you have an incident response playbook. For agents, you need granular traces around reasoning and tool use. I assume old tools can’t handle that data.

Lior Gavish: Exactly. Our agent observability solution is tailored to this. It understands whether an agent called tools in the right order, how relevant the context was to the user’s question, and whether the agent adhered to its instructions. We provide a “single pane of glass” for both the underlying data and the agent’s execution. Having that end-to-end visibility—knowing if an agent failed because of bad data or a bad decision—is the magical piece.

Ben Lorica: To what extent do you have to adjust your product based on new model releases, like the shift toward reasoning models?

Lior Gavish: Our implementation doesn’t depend heavily on the nature of the model, but our customers use observability to create resilience against model changes. Model providers often change models under the hood, which can cause performance degradation. Our tools allow customers to spot these regressions proactively, whether the change was intentional or an unannounced update from the provider.

Ben Lorica: Have you thought about the broader world of “agent optimization”—making workflows more dependable and smarter? Observability seems like step one of optimization.

Lior Gavish: It is a vital tool for optimization. Teams usually start by manually eyeballing results. Then they move to a beta stage where they collect human feedback or “thumbs up/down” signals. But when you move to the third stage—General Availability at scale—observability becomes indispensable. You need a systematic way to evaluate telemetry and product analytics together to understand where to focus your improvements.

Ben Lorica: There are emerging tools like DSPy or TextGrad that optimize prompts and workflows. Since you capture the traces, you seem well-positioned to eventually offer an end-to-end optimization engine.

Lior Gavish: It’s an exciting path. We’ve taken a first step with our “Troubleshooting Agent.” If there is a performance degradation, it analyzes the telemetry to create hypotheses. It asks: Is this a new set of questions? Is the model degrading? Was it a recent prompt change? Taking that a step further into a principled optimization solution is a great idea.

Ben Lorica: What about synthetic data and testing before production? Are you involved in the “pre-production” phase?

Lior Gavish: We don’t currently offer tooling for synthetic data generation. We help in CI/CD or dev environments by capturing telemetry there, but our focus is production-oriented. “Evaluation” is a broad term—it covers everything from model developers testing a new LLM to real-time guardrails that block toxic comments. Our angle is specifically evaluating production telemetry.

Ben Lorica: Let’s touch on governance and compliance. With agents everywhere, how do you handle the risk of PII or IP being leaked in the inputs?

Lior Gavish: AI governance is an entire industry. My take is that there won’t be one tool to rule them all because it spans everything from security to discovery. Observability plays a role by allowing you to audit what agents are doing and report on it consistently. While regulations are still taking shape, data quality is already a part of governance, and agent quality will be too.

Ben Lorica: Most companies use multiple models. If you use an “LLM as a judge,” you have to be careful about what you send to that judge model as well.

Lior Gavish: Exactly. We chose an architecture where both the telemetry and the “judging” stay in the customer’s environment (e.g., their Snowflake or GCP instance). If you’re analyzing sensitive agent data, you can’t just send it off to a third-party judge.

Ben Lorica: It feels like we might end up with vendor sprawl: an agent observability vendor, a governance vendor, an optimization vendor.

Lior Gavish: I suspect so. These are complex systems. Doing security is very different from doing reliability or compliance. Specialist vendors exist because most companies use multiple systems—Snowflake, Databricks, various cloud providers. A specialist can cut across all of them. While customers might wish for a single vendor, these are “meaty” problems involving different stakeholders like legal, security, and engineering.

Ben Lorica: In closing, are you seeing companies adopt a centralized AI team that services the business, or is it decentralized where marketing and sales build their own agents?

Lior Gavish: It’s changing. A year or two ago, every team was doing its own thing. Now, there is some centralization, but AI is a uniquely interdisciplinary problem. It’s not just a data or software problem; it requires product design, legal, and subject matter experts. You can’t build a great customer support agent without working closely with the actual support staff.

The most successful projects are cross-functional collaborations. My advice is not to silo AI in one team. Everyone is still learning; there are no “10-year veterans” of building agents. Success comes from pulling resources together and making it a cross-functional effort.

Ben Lorica: So, communicate internally, share what you’ve built, and don’t silo the knowledge.

Lior Gavish: Exactly.

Ben Lorica: Thank you, Lior.