How to Make Your Data Truly AI-Ready

Ben Lorica

5 months ago

Yoni Leitersdorf on the Semantic Layer, AI for BI, and the Future of Data Analytics.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Yoni Leitersdorf, CEO of Solid, joins the podcast to demystify why simply pointing an LLM at a database for text-to-SQL doesn’t work. He explains the critical need for a semantic layer to provide business context, turning raw data into a “Rosetta Stone” that AI can actually understand. Yoni details how to automate the creation of this layer by leveraging signals from across the enterprise — from dbt repos to Slack conversations — and shares practical advice on setting expectations for AI adoption, emphasizing a human-in-the-loop approach to build trust and achieve real value.

Subscribe to the Gradient Flow Newsletter

Interview highlights – key sections from the video version:

Jump to transcript

Related content:

A video version of this conversation is available on our YouTube channel.
Structure Is All You Need
The Enterprise Search Reality Check
AI Unlocked – Overcoming The Data Bottleneck
Jakub Zaverl → How to Build and Optimize AI Research Agents
Mars Lan → The Security Debate: How Safe is Open-Source Software?
Anant Bhardwaj → Predictability Beats Accuracy in Enterprise AI

Support our work by subscribing to our newsletter📩

Transcript

Below is a heavily edited excerpt, in Question & Answer format.

Technical Foundations and Challenges

Why can’t we just point an LLM at a database and have it generate perfect SQL queries?

While foundational models have become remarkably proficient at SQL generation over the past 8-9 months, they lack the business and technical context specific to an organization’s data. Even with a well-structured star schema, models make educated guesses when encountering ambiguous situations. For example, if a data warehouse has three different revenue tables, the model won’t know which one is the authoritative source for a specific business question. It also won’t understand company-specific acronyms—a column named “FSU” might be interpreted as “Florida State University” instead of an internal business term. The bottleneck has shifted from model capability to supplying sufficient structure and context about your data environment.

What makes enterprise data warehouses more challenging than people assume?

Even in established enterprise warehouses, you typically encounter multiple layers of data quality: raw “landing zone” data straight from source systems like Salesforce, cleaned-up middle layers, and heavily processed “gold” layers. The challenge is that ownership and responsibility for data quality is distributed across many people and systems. Unlike public internet content where creators have direct incentives to maintain quality, enterprise data often lacks clear ownership, especially when employees change roles or leave. Additionally, you’ll find overlapping data marts, legacy artifacts, and inconsistent naming conventions that create ambiguity even within well-designed schemas.

The Semantic Layer Concept

What exactly is a semantic layer in the context of AI applications?

A semantic layer acts as a translation layer—a “Rosetta Stone” between business questions and warehouse reality. It provides the AI with institutional knowledge that human analysts would typically accumulate over months or years. Concretely, it captures:

Business terminology and definitions: What company-specific acronyms and terms mean
Data relationships: How different tables and columns connect to each other
Authoritative data sources: Which datasets/columns are trusted for specific questions
Business logic: The correct way to calculate key metrics
Data quality indicators: Information distinguishing “gold” tier data from “bronze” or “silver”
Example questions: Sample business questions with their canonical queries

The primary consumer is the AI system, but humans use it for explainability and trust-building.

In what form does the semantic layer actually exist across different platforms?

The semantic layer takes different forms depending on your data stack:

Databricks: Populate Genie “spaces” and Unity Catalog
Snowflake: Semantic models and semantic views
dbt: MetricFlow definitions and metric specifications
Direct LLM integration: Structured configurations (YAML files) describing datasets, joins, metrics, and approved queries

Automation keeps these formats synchronized as underlying schemas and models evolve to prevent drift.

Building and Automating the Semantic Layer

How do you automate the creation of the semantic layer? What signals and data sources do you use?

Automation is achieved by analyzing a wide range of enterprise signals to infer data quality, relevance, and business context without relying solely on manual documentation. Key sources include:

Data warehouse metadata and query logs: Historical SQL queries (typically 3 months back) showing which tables and columns are used most frequently and by whom
BI tool metadata: Popular and trusted dashboards and their underlying queries—if the CEO views a report daily, the underlying data inherits trust
Transformation tools: dbt repositories providing data lineage and transformation logic
Work management systems: Jira tickets containing discussions about data definitions and business requirements
Communication platforms: Slack or Teams conversations where analysts and business users discuss data, helping define terms and identify domain experts
Identity and organizational data: Understanding who uses and owns different data assets

By building graphs of these interactions, you can identify “star analysts,” understand which datasets are trusted for specific domains, and automatically generate a significant portion of the semantic layer.

What signals indicate trustworthy data in an enterprise environment?

Key trust indicators include:

Authority usage: High usage by senior stakeholders or recognized domain experts
Consistent patterns: Tables with stable usage patterns over time
Active maintenance: Models actively maintained in transformation workflows
Social signals: Data referenced by go-to people for specific business domains
Dashboard prominence: Tables underlying frequently-viewed executive dashboards

For new models without usage history, change signals from dbt repositories or Jira tickets indicate intent (e.g., “table X replaces deprecated table Y”).

Implementation and Rollout Strategy

What does a realistic implementation process look like, and how should teams manage expectations about accuracy?

Implementation should be gradual and typically takes 6-8 weeks. The system will not be 100% accurate on day one—expect 75-85% accuracy initially, with the remaining gap closed through human-in-the-loop validation.

The process includes:

Warm-up period (2 weeks): System analyzes historical data from connected sources using lookback periods of SQL logs (~3 months), communication history (~1 year), ticket systems (multi-year), and code repositories (complete history)
Staged rollout:
- Data engineering teams: Initial validation of technical accuracy
- Data analysts/modelers: Business context refinement and validation
- Business stakeholders: Final rollout only after technical teams build trust

This phased approach is essential for maintaining trust. If business users receive incorrect answers early on, they will lose confidence and abandon the tool.

Where should teams start when implementing a semantic layer?

Start narrow and specific—select one or two business domains like sales and marketing, then focus on particular use cases within those domains. You don’t need to “boil the ocean.” Instead, prove value in a focused area and expand gradually. This approach allows you to build expertise and refine processes before tackling broader organizational challenges.

Trust, Accuracy, and Human Validation

How do you build and maintain trust in the system when you know it won’t be perfect?

Trust is established through multiple mechanisms:

Transparency: Show generated SQL and reasoning to technical users (analysts and engineers) who can validate the logic
AI validation: Use separate “AI judge” models that evaluate the quality and accuracy of primary system responses
Observability: Provide data owners with reports showing what questions users ask about their models and system performance
Feedback loops: Implement thumbs-up/down ratings and continuous monitoring
Confidence scoring: Clearly indicate system confidence levels for different responses
Audit trails: Detailed tracking of which data sources were used for each answer

Business users may not understand the underlying SQL, but when internal data teams validate outputs and champion the system’s use, broader trust follows.

What role do humans play in maintaining these AI analytics systems?

Humans remain essential for validation and continuous improvement. While AI can achieve 75-80% accuracy initially, humans need to:

Validate outputs and provide feedback on edge cases
Supply business context that can’t be inferred from data patterns
Help with company-specific terminology and definitions
Identify areas where the system has low confidence and requires guidance

Unlike traditional data catalogs where humans had to document everything upfront, AI systems do most of the heavy lifting and request human input only when needed—typically for about 15-20% of semantic understanding.

Platform Integration and Data Stack Considerations

How do you handle permissions and row-level security with AI-generated queries?

Leave security enforcement to the underlying data platform (Databricks, Snowflake, BigQuery). The semantic layer tells the AI how to answer questions, while the data platform applies who can see what. This separation of concerns ensures that existing security policies remain intact and are consistently enforced regardless of how data is accessed.

Can you create one semantic layer across multiple platforms like Snowflake, Databricks, and BigQuery?

While enterprises desire unified semantic layers across platforms, most current success comes from domain-scoped layers per platform, aligned through shared business definitions. Cross-system layers are emerging but remain complicated by governance boundaries and organizational ownership patterns. The practical approach is to maintain consistency in business definitions while allowing platform-specific implementations.

Lessons from Previous Approaches

Data catalogs had similar goals but struggled with adoption. What lessons apply, and what makes semantic layers different?

Traditional data catalogs failed for two key reasons:

Manual documentation burden: They required humans to hand-document everything, creating unsustainable maintenance overhead
Wrong demand source: They were primarily data team initiatives that struggled to gain business traction

The current approach addresses both issues:

AI automation: Modern systems can automatically draft documentation and infer relationships, solving the cold-start problem
Business-driven demand: The push for “chat with data” capabilities comes directly from business stakeholders who want intuitive AI-powered access to insights, creating natural budget and organizational support

Additionally, semantic layers aren’t just about discovery—they enable direct question-answering, providing immediate value rather than requiring users to navigate to separate documentation systems.

User Experience and Education

What user education is required for effective AI analytics adoption?

Users often treat AI chat interfaces like keyword search engines, which reduces effectiveness. Successful implementations require education on:

Question formulation: Teaching users to state complete questions with context and constraints
Metric specification: Helping users articulate which specific business definitions they want applied
Clarification interaction: Training users to engage with follow-up questions rather than expecting perfect first responses
Result interpretation: Helping users understand confidence indicators and when to seek validation

Better prompting reduces back-and-forth interactions and improves system reliability.

Future Applications and Agentic Workflows

Where does this foundational work on semantic layers lead? What advanced capabilities does it unlock?

Making data accessible through reliable AI interfaces is the critical first step toward more autonomous systems. The long-term vision enables AI agents that can proactively use data to drive business outcomes.

For example, organizations are developing AI agents attached to marketing campaigns that continuously monitor performance by analyzing warehouse data. Based on their analysis, these agents propose campaign adjustments, present them for human approval, and automatically implement approved changes. The semantic layer provides the foundational, context-aware data access that makes such agentic behavior both safe and useful.

Other emerging applications include automated anomaly detection, proactive reporting, and intelligent data quality monitoring—all requiring the same foundational semantic understanding.

Build vs. Buy Considerations

Should teams build semantic layer capabilities internally or work with vendors?

You can build internally, but expect significant ongoing work: assembling signals across systems, maintaining currency as schemas evolve, monitoring chat interactions, and closing feedback loops. Successful internal implementations require treating this as a long-term capability-building exercise rather than a one-time project.

Vendors who observe patterns across multiple deployments tend to reach reliability faster and maintain lower ongoing burdens. However, whether building or buying, plan for the same implementation lifecycle: bootstrap with historical data, stage rollouts, implement human-in-the-loop validation, establish observability, and maintain continuous updates.

How should teams approach ROI evaluation for AI analytics projects?

Set appropriate expectations upfront. Many early pilots failed because organizations expected “magic”—plug-and-play solutions that would work perfectly immediately. Successful implementations treat this as capability-building that requires investment in semantic layers, user education, and ongoing maintenance.

ROI comes from enabling broader access to data insights and reducing bottlenecks on technical analysts, but it requires treating the AI system as a team member that needs training and support rather than a turnkey solution. Organizations that understand this investment model see better outcomes and more sustainable implementations.