Jure Leskovec on Relational Foundation Models, Graph-Based AI, and No-Pipeline Predictions.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Stanford professor and Kumo.ai co-founder Jure Leskovec introduces the concept of a relational foundation model for structured enterprise data. This approach treats relational data as graphs, allowing a pre-trained transformer to make instantaneous predictions directly from raw data without manual feature engineering. Jure explains how this technology provides state-of-the-art results for tasks such as churn prediction, recommendations, and fraud detection, empowering data science teams to deliver business value faster.
Interview highlights – key sections from the video version:
- Introduction and Kumo’s Focus on Structured Enterprise Data
- Why Traditional Data Science, AutoML, and Time-Series Foundation Models Fall Short
- Defining Relational Foundation Models and the Predictive Query Language
- Architecture and Speed: Graph-Based In-Context Learning for Fast Predictions
- Replacing Manual Feature Engineering and Handling Noisy, Messy Data
- Recommender Systems and the DoorDash Case Study
- Who Benefits Most: Sophisticated Data Science Teams and Flagship Use Cases
- Fine-Tuning via Time-Travel Labeling and Automatically Generated Datasets
- Multimodal Relational Models, Top Use Cases, and Entity Resolution
- Data-Rooted Explanations and Human-Readable Model Transparency
- Using Kumo as a Decision Engine for Agents and Time-Series Forecasting
- Open Source Graph Learning (PyG) Versus the Proprietary Kumo Platform
- Pretraining Data, Data Warehouse Integrations, and Why the Model Generalizes
- Speculative Applications: Quant Trading, Sports Betting, and Beating Human Baselines
Related content:
- A video version of this conversation is available on our YouTube channel.
- Time Series Foundation Models: What You Need To Know
- The PARK Stack Is Becoming the Standard for Production AI
- Reimagining the Database for AI Agents
- What is Graph Intelligence?
- Philip Rathle → Building the Knowledge Layer Your Agents Need
- GraphRAG: Design Patterns, Challenges, Recommendations
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
The Problem with AI on Structured Data
What problem is Kumo solving?
Kumo targets AI predictions and risk scoring over structured enterprise data—the kind of data stored in data warehouses: customer records, product catalogs, transaction logs, supply chain data, and other tabular business data. This is the “ground truth” of how businesses operate, but applying AI to it has remained stuck with outdated approaches.
What’s wrong with current approaches to building ML models on structured data?
There are two main approaches today, both with serious limitations.
First, some try using Large Language Models by converting tabular data into text and feeding it to the model. This performs poorly because next-token prediction is fundamentally different from numerical forecasting or structured reasoning tasks.
Second, the traditional ML status quo requires data science teams to manually engineer features, create training datasets, build task-specific models, and maintain complex feature pipelines. This process is extremely slow and expensive: a single model typically takes six months to deploy, requires two full-time employees to maintain, and more than half of these projects never reach production. You’re essentially using 30-year-old technology for modern problems.
What about AutoML and time series foundation models?
AutoML promised automation but has largely underdelivered. It’s typically a brute-force approach—running loops over hyperparameters, generating massive SQL joins for feature engineering, training dozens of challenger models in a random search, and hoping one works. It lacks the sophistication needed for high-stakes business problems.
Time series foundation models are a step forward and can perform forecasting quite well, but they’re generally limited to single time series forecasting. Kumo’s approach is more general—designed to work across any set of interconnected tables in a database for any predictive task, not just isolated time series.
Kumo’s Solution
What is a “relational foundation model” and how does it work?
Kumo provides a pre-trained, frozen model that has never seen your company’s specific data. You connect this model to your data warehouse schema (for example, 5, 10, or 20 interconnected tables), then immediately query it with a predictive task like “predict 60-day churn for this user.”
In about 200 milliseconds, the model returns a prediction through a single forward pass. This zero-shot prediction quality matches what a PhD-level data scientist could produce after approximately one month of manual model building. You can change the prompt on the fly (for example, from 60-day to 48-day churn) and get new predictions instantly. For business-critical tasks where you need maximum accuracy, you can fine-tune the model on your specific data and task to achieve even better performance—often 10-20% more accurate than traditional ML approaches.
What makes this technically possible? How does it work under the hood?
The key insight is that any relational database can be represented as a graph, where tables and rows become nodes and relationships (primary/foreign keys) become edges. Kumo uses a “Relational Graph Transformer” architecture that operates directly on this graph representation.
Instead of manually engineering features (like calculating SUM(transactions) over the last 30 days), the model’s attention mechanism attends directly over raw data—individual transaction records, for example. This allows it to capture far more nuanced signals than hand-crafted features. It’s analogous to how modern computer vision models learn from raw pixels instead of manually defined edge detectors.
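To make the table-to-graph mapping concrete, here is a minimal sketch using the open source PyTorch Geometric library discussed later in the conversation; the tables, columns, and values are invented for illustration.

```python
import torch
from torch_geometric.data import HeteroData

# Hypothetical tables: users(user_id, age) and transactions(txn_id, user_id, amount).
# Each row becomes a node; the foreign key user_id becomes an edge type.
data = HeteroData()

# Node features: one row per user / per transaction (toy values).
data["user"].x = torch.tensor([[34.0], [27.0], [51.0]])                      # 3 users
data["transaction"].x = torch.tensor([[19.99], [5.50], [120.00], [42.10]])   # 4 transactions

# Edges follow the foreign key: transaction i was made by user j.
# edge_index is [2, num_edges]: row 0 = source (transaction), row 1 = target (user).
data["transaction", "made_by", "user"].edge_index = torch.tensor(
    [[0, 1, 2, 3],
     [0, 0, 1, 2]]
)

print(data)  # a heterogeneous graph with two node types and one relation
```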
Why is inference so fast?
Inference takes around 200 milliseconds because there’s no model training or feature calculation happening at query time. The model is frozen—there are no gradient updates or parameter changes during inference. When you make a predictive query, the system generates task-specific in-context examples from your data and feeds them through the pre-trained model in a single forward pass to produce the prediction.
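The mechanics can be illustrated with a deliberately tiny stand-in model: labeled in-context examples and a query entity go through a frozen network in one forward pass, with no gradient updates. This toy scorer is not Kumo's architecture, just a sketch of the in-context pattern.

```python
import torch
import torch.nn as nn

class ToyInContextScorer(nn.Module):
    """Toy stand-in for a frozen in-context model: the query attends over
    labeled context examples and returns an attention-weighted label average."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, context_x, context_y, query_x):
        # context_x: [n, dim], context_y: [n], query_x: [dim]
        attn = torch.softmax(self.q(query_x) @ self.k(context_x).T, dim=-1)  # [n]
        return (attn * context_y).sum()  # prediction from the context labels

model = ToyInContextScorer(dim=4)
model.eval()                                    # frozen: no training at query time
context_x = torch.randn(8, 4)                   # 8 in-context examples from historical data
context_y = torch.randint(0, 2, (8,)).float()   # their observed labels (e.g. churned or not)
query_x = torch.randn(4)                        # the entity we want a prediction for

with torch.no_grad():                           # a single forward pass, no gradient updates
    print(float(model(context_x, context_y, query_x)))
```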
How do you express what you want to predict?
Kumo uses a domain-specific language that looks like SQL but starts with PREDICT instead of SELECT. The key difference is you’re asking forward-looking questions about what happens next week or next quarter, not what happened in the past. You can also use natural language prompts (like “predict 30-day churn”), which get translated to the structured predictive query language through a semantic layer.
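For illustration, here are two predictive queries of the kind described above, written as Python strings; the exact syntax, aggregation functions, and table names are assumptions rather than Kumo's documented grammar.

```python
# Illustrative forward-looking queries in the spirit of the SQL-like language
# described above; syntax and schema names here are assumptions.
churn_query = """
PREDICT COUNT(orders.*, 0, 60, days) = 0    -- no orders in the next 60 days
FOR EACH users.user_id                      -- one prediction per user
"""

ltv_query = """
PREDICT SUM(orders.total, 0, 90, days)      -- future spend, not a historical SUM
FOR EACH users.user_id
"""
```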
Setup and Implementation
What does the setup process look like and how long does it take?
Setup typically takes one to two days and requires collaboration with people who have deep knowledge of your company’s data. The process involves:
- Connecting to data: Pointing Kumo to the relevant tables in your data warehouse (Snowflake, Databricks, BigQuery, Bigtable, S3, Parquet, Iceberg, etc.)
- Defining the schema: Specifying data types, semantic types, and the primary/foreign key relationships between tables
- Building a semantic layer: For natural language prompting, you need to define what business terms mean in your organization—for example, what “churn” means (zero purchases, total spend below a threshold, etc.)
Once set up, the foundation model can be queried for any predictive task over that defined schema. This should connect to the curated warehouse or lakehouse that your analytics and data science teams already use—not directly to raw ERP systems with thousands of tables.
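A hypothetical schema definition in the spirit of this setup step might look like the following; the connector options, type names, and semantic-layer entry are illustrative, not the platform's actual configuration format.

```python
# Hypothetical setup artifact: connector, table schemas, key relationships,
# and a semantic-layer definition of "churn". Purely illustrative.
schema = {
    "connector": {"type": "snowflake", "database": "ANALYTICS", "schema": "CORE"},
    "tables": {
        "users":  {"primary_key": "user_id",
                   "columns": {"signup_date": "timestamp", "country": "categorical"}},
        "orders": {"primary_key": "order_id",
                   "time_column": "ordered_at",
                   "columns": {"total": "numerical", "status": "categorical"}},
    },
    # Primary/foreign key links that become edges in the underlying graph.
    "relationships": [("orders.user_id", "users.user_id")],
    # Semantic layer: business terms resolved to predictive-query fragments.
    "semantics": {"churn": "COUNT(orders.*, 0, 60, days) = 0"},
}
```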
What institutional knowledge needs to be captured during setup?
Information about data quality across different sources, which warehouses have better data for specific use cases, and business-specific definitions all need to be codified during the semantic model setup. For example, if your Southeast sales data is better in one warehouse than another, the data scientist or operator needs to specify this during configuration. Once configured, the model can handle various data quality issues more robustly than traditional approaches.
Fine-tuning and Customization
When should you use the zero-shot model versus fine-tuning?
The frozen model provides accuracy equivalent to about one month of PhD-level data scientist work building a custom model. For many applications, this zero-shot performance is sufficient. You should fine-tune when the prediction task is business-critical and you need maximum accuracy. Fine-tuned models can reach what they call “superhuman” performance—10-20% more accurate than what humans can build with traditional ML toolboxes. All the major customer examples (DoorDash, Reddit, the dating app) used fine-tuned models because of their high business value.
How does fine-tuning work?
The fine-tuning process is highly automated and significantly different from LLM fine-tuning. You don’t need to manually create labeled examples. Instead, you specify the predictive task you care about, and the platform automatically generates training data through “time-traveling”—sliding through your historical data with a moving window.
For each window, the system treats a past moment as “now,” looks into the future for the ground truth label (what actually happened), and uses data from the past to predict that label. For a 30-day churn prediction, it creates 30-day sliding windows across your historical data, automatically generates labels, and creates the appropriate subgraphs from past data to produce the fine-tuning dataset. You can modify or clean this data if desired, but it’s not required.
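A minimal pandas sketch of the time-travel idea, assuming a toy orders table and a 30-day churn definition; nothing here is Kumo's implementation, it only shows how labels fall out of sliding a cutoff through history.

```python
import pandas as pd

# Toy transaction log; in practice this is the raw orders table in the warehouse.
orders = pd.DataFrame({
    "user_id":    [1, 1, 2, 1, 2],
    "ordered_at": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-02-12",
                                  "2024-04-01", "2024-05-20"]),
})

def time_travel_labels(orders, cutoff, horizon_days=30):
    """Treat `cutoff` as 'now': anything after it is the future and supplies the label."""
    future = orders[(orders.ordered_at > cutoff) &
                    (orders.ordered_at <= cutoff + pd.Timedelta(days=horizon_days))]
    past_users = orders.loc[orders.ordered_at <= cutoff, "user_id"].unique()
    labels = pd.DataFrame({"user_id": past_users})
    labels["churned"] = ~labels.user_id.isin(future.user_id)   # no orders in the window
    labels["as_of"] = cutoff
    return labels

# Slide the cutoff through history to build a training dataset automatically.
cutoffs = pd.date_range("2024-02-01", "2024-05-01", freq="MS")
training_labels = pd.concat([time_travel_labels(orders, c) for c in cutoffs])
print(training_labels)
```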
How much training data is needed?
The range is enormous—from as few as 1,000 examples on the low end to tens of billions for massive applications. For example, Reddit’s use case with all users, clicks, posts, and votes across the platform handles billions of examples. The key is that you don’t manually generate these examples—you describe the task, point to the raw data, and the platform handles the rest.
Use Cases and Real-World Results
What are the main use case categories where Kumo is being deployed?
Three major categories:
- Retail and commerce: User behavior modeling including recommendations, churn prediction, and lifetime value (LTV) estimation
- Safety and fraud: Fraud detection, financial abuse detection, and identifying malicious behavior across social networks and financial institutions
- Risk modeling: Predicting various forms of risk such as underwriting in insurance or patient risk in healthcare
Can you share specific real-world performance results?
Yes, several strong examples:
DoorDash used Kumo to improve their “try something new” restaurant recommendation model—a flagship problem for the company. After connecting Kumo to their data (users, orders, restaurants, geography, web behavior) in a process that took one to two months, the fine-tuned model delivered a 30% performance improvement over their highly optimized, internally-built model. This translated to hundreds of millions of dollars in additional food orders.
Reddit used Kumo to improve their advertising models that predict ad clicks, which directly drives revenue. The graph-based approach delivered an improvement equivalent to what their internal team would have expected to achieve over four to five years of iterative development (they typically improve accuracy by 1-2% annually), all accomplished within a couple of months.
A large dating app added user profile images to their matching recommendation model, which immediately improved accuracy by 15% by leveraging computer vision embeddings. This demonstrates how easily the platform incorporates multi-modal data.
The technology is also being used by Coinbase and Expedia.
Does this work for recommender systems at scale?
Yes, with state-of-the-art results. The DoorDash example is a flagship use case. The technology enables rapid experimentation with different data sources—you can start with users, orders, and restaurants, then easily add user behavior data, geography tables, and other data sources. You can quickly test different objective functions (predicting orders next week versus next month versus next 48 hours) because the platform enables fast iteration without rebuilding entire pipelines.
Can this handle time series forecasting?
Yes. It can function as a time series foundation model, but it goes significantly beyond single time series forecasting. If you have a time series per product along with a product taxonomy, the model learns from the target product’s time series AND attends over related products’ time series through the taxonomy, borrowing information from correlated series for more accurate forecasts. This is something isolated time series foundation models cannot do.
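As a rough, hand-rolled illustration of “borrowing” signal from related series, the sketch below augments each product's history with its category's aggregate trend via a toy taxonomy; the model described above learns this kind of sharing through attention rather than through explicit features like this.

```python
import pandas as pd

# Toy per-product weekly sales plus a taxonomy mapping products to categories.
sales = pd.DataFrame({
    "product": ["a", "a", "b", "b", "c", "c"],
    "week":    [1, 2, 1, 2, 1, 2],
    "units":   [10, 12, 9, 11, 50, 60],
})
taxonomy = {"a": "snacks", "b": "snacks", "c": "drinks"}
sales["category"] = sales["product"].map(taxonomy)

# Augment each product's series with its category's average trend,
# so correlated sibling products inform the forecast.
category_trend = (sales.groupby(["category", "week"])["units"]
                  .mean().rename("category_mean").reset_index())
augmented = sales.merge(category_trend, on=["category", "week"])
print(augmented)
```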
Data Handling and Requirements
What data quality is required? How does the system handle messy or incomplete data?
While “garbage in, garbage out” still fundamentally applies, these models are more robust to data quality issues than traditional approaches. Because the model learns over the graph and relationships between entities, it can borrow information from nearby data points in the relational context to fill in gaps or overcome noise. Each data point is situated within its relational structure, allowing the model to be more robust.
For issues like messy categorical values (for example, “CA,” “CAL,” “CALIF” for California), the platform provides diagnostics and allows the data scientist to decide on the best modeling strategy—treating it as a categorical feature or as a text feature to capture semantic similarity. When modeled as text, different spellings are naturally handled together.
The key difference from traditional approaches is that you operate at the modeling level rather than spending 80% of your time on data cleaning.
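As a sketch of the text-encoding strategy for messy categories, the snippet below uses the open source sentence-transformers library as a stand-in for whatever encoder the platform actually applies; the model name and the sample values are illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Messy categorical values, several of which refer to the same state.
values = ["CA", "CAL", "CALIF", "NY"]

# Assumption: any general-purpose sentence embedding model; this is a common small choice.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(values)

# Variant spellings of California should land closer to each other than to "NY",
# so a downstream model treats them as semantically similar without manual cleanup.
print(cosine_similarity(emb))
```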
Does the model work with unstructured data like text and images?
Yes, even though the model is designed for structured data, it can handle unstructured data as long as it exists as a column within your tables. Product descriptions, comments, conversations, and images can all be incorporated. The platform uses appropriate encoders—text embeddings for descriptions, image embeddings for photos—and integrates them into the predictive models automatically.
This makes it straightforward to build multi-modal predictive models combining tabular, text, and image data. The dating app example shows how simply adding an image column to the person table and including images in the recommendation model immediately improved accuracy by 15%.
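A minimal sketch of turning an image column into an embedding column, using a generic pretrained torchvision encoder as a stand-in (the platform's actual encoders are not specified here); the profile_photos table in the final comment is hypothetical.

```python
import torch
from torchvision import models
from PIL import Image

# Assumption: a generic pretrained image encoder stands in for the platform's encoders.
encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()          # drop the classifier head, keep the embedding
encoder.eval()

preprocess = models.ResNet18_Weights.DEFAULT.transforms()

def image_column_to_embeddings(paths):
    """Turn an image column (file paths) into a fixed-size embedding column."""
    with torch.no_grad():
        batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
        return encoder(batch)             # [num_rows, 512] tensor to join back to the table

# Hypothetical usage: embeddings = image_column_to_embeddings(profile_photos["path"].tolist())
```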
What about entity resolution?
The platform allows you to build entity resolution models as a preprocessing step if needed—essentially a two-stage process. However, because the underlying representation is graph-based, the need for heavy entity resolution is often reduced. Entities that should be linked get connected through the graph structure, and the attention mechanism can learn to do entity resolution on-the-fly during prediction, rather than requiring it as a separate preprocessing step.
Deployment and Performance
What are the latency characteristics in production?
For online inference over the graph, you get approximately 100-200 millisecond response times. For applications requiring even lower latency, you can pre-materialize embeddings in batch mode and then do real-time scoring in 10-15 milliseconds using a very efficient model on top of those embeddings. Both approaches are available depending on your application requirements.
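The two-stage pattern can be sketched as follows, with random vectors standing in for the pre-materialized embeddings and a logistic regression as the very light online scorer; none of this is Kumo's internal design, it only shows the shape of the trade-off.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1 (batch, offline): pre-materialize entity embeddings from the graph model.
# Random stand-ins here; in practice they come from the foundation model.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 64))        # one row per user
labels = rng.integers(0, 2, size=10_000)          # historical outcomes to train the scorer

# Stage 2 (online): a very light model over the cached embeddings for low-latency scoring.
scorer = LogisticRegression(max_iter=1000).fit(embeddings, labels)

def score_user(user_index):
    """Real-time path: a cache lookup plus a tiny model, no graph inference in the loop."""
    return scorer.predict_proba(embeddings[user_index : user_index + 1])[0, 1]

print(score_user(42))
```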
How does deployment compare to traditional ML pipelines?
Deployment is significantly simplified because you only need to refresh the raw data—there’s no complex feature engineering pipeline to maintain. Traditional approaches struggle because keeping features up-to-date in production is extremely difficult. For temporal problems, every new event requires recomputing all features. Features either become stale or suffer from time-travel and information-leakage issues (features inadvertently computed with future data) that cause miscalibration.
With Kumo, the neural network handles temporal consistency and attends over raw data, eliminating these production pipeline challenges.
Explainability
How do you explain the model’s predictions?
Kumo offers a novel approach to explainability built on its attention mechanism architecture. Because the model attends directly over raw data to make predictions, you can run it backward to see exactly which data points—which specific rows, columns, and tables—it paid the most attention to when making a prediction.
This provides data-rooted explanations that work at two levels:
- Model-level explanations: Show what signals the model looks at overall, letting you verify it’s attending to appropriate features
- Individual prediction-level explanations: Show what specific past events or data structures led to each recommendation or forecast
These structured explanations can then be fed, along with the semantic model, into an LLM to generate clear, natural language summaries explaining the “why” behind a prediction. Because these are grounded in actual model behavior and real data, they’re accurate and non-hallucinated.
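A toy sketch of the hand-off from structured attributions to an LLM prompt; the attribution tuples, table names, and weights are invented, and the actual LLM call is omitted.

```python
# Hypothetical attention attributions: (table, row identifier, column, weight).
# In practice these would come from running the model "backward" as described above.
attributions = [
    ("orders",  "order_18231", "ordered_at", 0.41),
    ("orders",  "order_17902", "total",      0.27),
    ("support", "ticket_881",  "status",     0.19),
]

# Turn the structured explanation into a prompt an LLM can summarize in plain language.
lines = [f"- {table}.{column} on row {row} (weight {w:.2f})"
         for table, row, column, w in sorted(attributions, key=lambda t: -t[3])]
prompt = (
    "The churn model paid most attention to the following data points:\n"
    + "\n".join(lines)
    + "\nExplain in one paragraph why this user is at risk of churning."
)
print(prompt)
```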
Integration with the AI Ecosystem
How does Kumo fit into AI agent workflows?
AI agents need decision-making capabilities rooted in live, structured data. Kumo serves as the predictive engine or “tool” that agents can call. An agent’s reasoning process often requires making predictions about the world based on internal, structured data.
For example, a customer retention agent could: (1) query Kumo to get churn risk scores for all customers, (2) query it again to get “next best offer” recommendations for the high-risk segment, (3) use an LLM to draft personalized emails, and (4) send the emails. Kumo provides the critical decision-making components—the two predictive problems that would traditionally require months of data science work can be solved through queries to the relational foundation model. The platform supports Model Context Protocol (MCP) for integration with agent frameworks and other AI tooling.
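A hypothetical version of that retention loop, with plain Python functions standing in for the two predictive queries and the LLM drafting step; none of these functions is a real API.

```python
# Stand-ins for calls to the relational foundation model (e.g. via its API or an MCP tool)
# and for an LLM drafting step. All names and values here are illustrative.
def predict_churn_risk():            # would issue "predict 60-day churn" for each user
    return {"u1": 0.91, "u2": 0.12, "u3": 0.78}

def predict_next_best_offer(user):   # second predictive query, scoped to one user
    return {"u1": "free_delivery", "u3": "10_percent_off"}.get(user, "none")

def draft_email(user, offer):        # stand-in for the LLM drafting step
    return f"Hi {user}, here's a {offer.replace('_', ' ')} just for you."

high_risk = [u for u, p in predict_churn_risk().items() if p > 0.7]   # step 1
for user in high_risk:
    offer = predict_next_best_offer(user)                             # step 2
    print(draft_email(user, offer))                                   # steps 3-4 (send omitted)
```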
Is Kumo meant to replace data scientists?
No, the goal is to make data scientists significantly more productive and impactful. Rather than spending time on grunt work like feature engineering and pipeline maintenance, data scientists can become “rock stars” who drive business outcomes. With this technology, a data scientist can be approximately 20x more productive, explore more modeling options, focus on strategic modeling decisions, and directly impact business metrics.
The most sophisticated data science organizations (like DoorDash, Reddit, Coinbase, Expedia) see the most value because they can quickly measure impact through proper A/B testing and recognize when tools dramatically improve their capabilities.
Technology Foundation and Access
What data was the foundation model trained on?
The training uses a combination of open multi-table databases available on the web (not single tables) and significant synthetic data generation and augmentation. The key insight is that the model needs to learn pattern recognition—how to learn from in-context examples and generalize to unlabeled examples. It’s about teaching the transformer to recognize relational patterns rather than ingesting specific data semantics. The synthetic data generation focuses on creating diverse relational patterns the model can learn to recognize and apply.
Why does this approach generalize so well across diverse businesses and databases?
Two factors explain the surprising generalization:
First, transformers and neural networks have inherent generalization ability—they can embed similar patterns together in latent space even when they haven’t seen exact examples, going beyond training data in meaningful ways.
Second, the world may be more orderly than we think at higher levels of abstraction. Much as a first-order Taylor expansion shows the world is mostly linear, diverse databases may collapse to a smaller set of fundamental patterns at that level of generalization. While every business is unique at the surface level, predictive patterns may share underlying structure that the model learns to recognize.
What’s the relationship between Kumo and PyTorch Geometric?
Kumo’s team are the authors of PyTorch Geometric (PyG), the most popular open-source library for graph learning. However, putting graph learning into production requires years of engineering work beyond the open-source library. Kumo built a production-ready, large-scale graph learning platform with the relational foundation model on top.
Is Kumo open source?
PyTorch Geometric remains open source, but the Kumo platform and foundation model are not open source or open weights. This is because the model architecture is tightly coupled with specialized, large-scale graph learning infrastructure that’s required to run it. Releasing model weights alone would be insufficient—the graph learning model requires their specialized infrastructure to operate.
Instead, Kumo is offered as a service via SDK and API (including Model Context Protocol). There’s a free tier available for developers to experiment and try the technology, transitioning to paid only when it’s delivering business value.

