Hamel Husain on Data Analysis, Error Classification, and Building Reliable AI Systems.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Hamel Husain is the founder of Parlance Labs. He discusses how successful AI implementation requires fundamental data science skills often overlooked in current educational resources that focus too heavily on tools and frameworks. Hamel emphasizes the importance of systematically analyzing data, involving domain experts in the process, and prioritizing evaluation based on actual observed failure modes rather than generic metrics. He advocates for establishing robust processes rather than just relying on tools, and shares practical techniques like synthetic data generation to build confidence in AI systems before deployment. [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]
Interview highlights – key sections from the video version:
- What Inspired the Practical LLM Series
- Key Skills and Data Literacy for LLMs
- The Importance of Error Analysis and Measuring AI
- Prototype to Production: Overcoming “Purgatory”
- Evals, Tools, and Building Trust
- Legal, Compliance, and Organizational Hurdles
- Communication and Jargon in AI Teams
- Careers and AI: Advice for Newcomers
- Rethinking Higher Education in the AI Era
Related content:
- A video version of this conversation is available on our YouTube channel.
- What AI Teams Need to Know for 2025
- AI Unlocked – Overcoming The Data Bottleneck
- The Hidden Foundation of AI Success: Why Infrastructure Strategy Matters
- AI Governance at the Crossroads: Navigating the Inference Revolution
- David Hughes → Prompts as Functions: The BAML Revolution in AI Engineering
- Vaibhav Gupta → Unleashing the Power of BAML in LLM Applications
- Andrew Burt → Why Legal Hurdles Are the Biggest Barrier to AI Adoption
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
Bridging the Gap in AI Education
What motivated you to create educational content on the practical usage of foundation models and LLMs?
We identified a significant gap in existing educational resources. While many tutorials and guides focus heavily on tools, frameworks, and API integrations (like RAG systems, function calling, or vector databases), they often miss the foundational data science skills that remain crucial even when you’re not training models yourself. There are still certain practices from data science that are absolutely necessary to make AI work in real-world applications. Our content aims to help practitioners move beyond impressive demos to actually shipping reliable AI applications by emphasizing core skills like data analysis, error analysis, and basic data literacy. We wanted to provide this missing educational layer from the perspective of ML engineers and data scientists who have experience building and deploying these systems.
What unique angle does your content bring to the AI education space?
Most resources teach the “how” of using AI tools, but we focus on the foundational practices that make AI systems work reliably beyond simple demonstrations. We emphasize a key mindset shift: “look at your data” – which means systematically examining logs of user interactions, analyzing failure modes, and identifying patterns. This approach helps practitioners who often get stuck after building a prototype, endlessly tweaking prompts without making real progress.
We also stress that this isn’t just an engineering task – domain experts need to be involved in analyzing the data. We bridge theory and practice by focusing on processes rather than just tools, helping practitioners build measurement systems they can trust and that correlate with actual business problems. These foundational data science skills remain essential regardless of which models or frameworks you’re using.
Core Mindset Shifts for AI Practitioners
What is the most crucial mindset shift needed for teams building with foundation models?
The phrase we keep repeating is “look at your data.” Though this sounds trivial, it’s frequently overlooked. When systems don’t work perfectly, many teams get stuck endlessly tweaking prompts without a systematic approach. Instead, teams should examine logs of user interactions (or create synthetic ones if needed), write detailed notes about failure modes, categorize those patterns, and perform basic data analysis to understand which problems occur most frequently. This structured error analysis helps prioritize what to fix and what to measure. It’s not just a vague suggestion but a fundamental practice carried over from traditional machine learning that grounds development and prevents aimless tinkering. If you can’t test everything, focus on addressing the failure modes you’re actually seeing.
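A minimal sketch of that last step, assuming you have already read through your traces and tagged each one with failure-mode labels (the trace structure and category names below are invented for illustration), shows that the "basic data analysis" can be as simple as counting and sorting:

```python
from collections import Counter

# Hypothetical annotated traces: each record keeps the free-form note written
# while reading the log, plus the failure-mode categories assigned to it.
annotated_traces = [
    {"id": "t1", "note": "cited a policy that doesn't exist", "failure_modes": ["hallucinated_policy"]},
    {"id": "t2", "note": "ignored the user's date range", "failure_modes": ["ignored_constraint"]},
    {"id": "t3", "note": "invented a policy, and tone was off", "failure_modes": ["hallucinated_policy", "wrong_tone"]},
    # ... in practice, dozens to hundreds of hand-reviewed traces
]

# Count how often each failure mode appears, so you know what to fix and measure first.
counts = Counter(mode for trace in annotated_traces for mode in trace["failure_modes"])

for mode, n in counts.most_common():
    share = n / len(annotated_traces)
    print(f"{mode}: {n} traces ({share:.0%})")
```

The output is a simple ranked list of failure modes, which is usually enough to decide where to spend prompt, retrieval, or data effort next.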
Beyond examining data, what other process-related shifts do teams need to make?
Focus on establishing robust processes rather than just relying on tools. Many teams think AI evaluation is primarily a tools problem, so they shop for evaluation frameworks or dashboards. But buying an eval tool is like buying a gym membership—having it doesn’t magically make you fit. The value comes from the consistent process of debugging, measuring, and iterating to create a positive feedback loop. Additionally, ensure domain experts (not just engineers) are involved in reviewing data and errors. Their understanding is key to developing meaningful solutions that address actual business needs. Use plain language rather than technical jargon when communicating across teams—this prevents alienating non-technical stakeholders whose domain expertise is crucial for success.
Data Analysis & AI Evaluation
How should teams approach evaluating complex AI systems with many components?
Most teams struggle with determining what to evaluate. AI systems have many components—RAG systems, function calling, routers, chunking strategies, embedding models—and you can’t test everything exhaustively. Your evaluation needs to be grounded in actual failure modes observed through systematic error analysis. Prioritize tests based on the errors you’re seeing most frequently in real usage. Without this data-centric approach, teams waste time measuring things that don’t actually matter to their specific application.
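To make "grounded in actual failure modes" concrete, here is a hedged sketch of what a targeted check might look like once error analysis has surfaced one specific, frequent failure. The `run_pipeline` stub, the query, and the failure mode itself are hypothetical stand-ins, not anything prescribed in the conversation:

```python
import re

def run_pipeline(query: str) -> str:
    # Stand-in for your actual RAG / agent pipeline; replace with the real call.
    return "Between 2024-01-01 and 2024-03-31 we issued 12 refunds, all in Q1 2024."

def test_respects_date_range():
    # Targeted eval for a frequently observed failure mode: answers that
    # reference results outside the date range the user asked about.
    response = run_pipeline("List refunds issued between 2024-01-01 and 2024-03-31")
    years_mentioned = set(re.findall(r"\b20\d{2}\b", response))
    assert years_mentioned <= {"2024"}, f"Out-of-range years mentioned: {years_mentioned}"

test_respects_date_range()
print("date-range eval passed")
```

A handful of narrow checks like this, each traceable to an error you actually observed, tends to be more useful than a broad suite of generic scores.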
What are common pitfalls with AI evaluation metrics and dashboards?
A major pitfall is relying on off-the-shelf evaluation frameworks or vendor dashboards filled with generic metrics like hallucination scores or usefulness ratings. These metrics rarely correlate with the critical problems your specific application faces, creating noise rather than insight. Executives may initially accept these complex metrics but quickly lose trust when they don’t reflect real progress.
For example, I worked with a talent platform that measured edit distance—how much recruiters modified AI-generated emails—assuming fewer edits meant better AI performance. However, examining the data revealed many non-native English speakers were making edits that actually made the text incorrect. The metric completely failed to capture what was happening in reality.
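For readers who want to see the mechanics, a rough proxy for "how much was edited" is easy to compute. This is a generic sketch using the Python standard library, not the platform's actual implementation, and it illustrates why the raw number can mislead:

```python
import difflib

def edit_fraction(ai_draft: str, final_email: str) -> float:
    """Rough proxy for how much was edited: 1.0 minus the similarity ratio of the two texts."""
    return 1.0 - difflib.SequenceMatcher(None, ai_draft, final_email).ratio()

draft = "Hi Sam, we think you'd be a great fit for the Staff Engineer role at Acme."
final = "Hi Sam, we thinks you is great fit for Staff Engineer role at Acme."  # edits that hurt the text

print(f"edit fraction: {edit_fraction(draft, final):.2f}")
# A low number looks like "the AI needed few edits," but reading the final text
# shows the edits made it worse, which is exactly what the metric cannot see.
```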
Focus on developing metrics that you understand, that make sense to stakeholders, and that genuinely reflect progress on issues that directly impact your business outcomes. Choose meaningful, validated metrics over dashboards filled with potentially irrelevant ones.
How can teams gain confidence to deploy AI systems, especially when lacking real user data initially?
One effective technique to bridge the gap between prototype and production is generating realistic synthetic data. Use an LLM to create diverse user inputs based on carefully brainstormed personas, scenarios, and potential edge cases relevant to your application. By systematically “perturbing” your system with this synthetic data, you can observe its behavior across a wide range of inputs, identify potential weaknesses, and bootstrap your evaluation process. This structured approach gives you more confidence before taking the leap to production, even without extensive real user data.
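A minimal sketch of that approach follows. It assumes the OpenAI Python client and a placeholder model name purely for illustration (any LLM client would work), and the personas, scenarios, and edge cases are invented examples of what you might brainstorm with domain experts:

```python
from itertools import product
from openai import OpenAI  # assumes the `openai` package and an OPENAI_API_KEY in the environment

client = OpenAI()

# Hypothetical dimensions for a customer-support assistant, brainstormed with domain experts.
personas = ["first-time user", "frustrated power user", "non-native English speaker"]
scenarios = ["billing dispute", "feature request", "account locked out"]
edge_cases = ["very terse message", "multiple questions in one message"]

synthetic_inputs = []
for persona, scenario, edge in product(personas, scenarios, edge_cases):
    prompt = (
        f"Write one realistic message a '{persona}' might send to a customer-support "
        f"assistant about a '{scenario}'. Constraint: {edge}. Return only the message."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    synthetic_inputs.append(resp.choices[0].message.content)

# Feed each synthetic input through your real pipeline, log the outputs,
# and run the same error analysis you would run on production traffic.
```

Each synthetic input then goes through the real system, and the resulting traces feed the same error-analysis loop described above, which is what lets you bootstrap evaluation before any real users arrive.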
