Why Foundation Models Haven’t Replaced Classical Machine Learning

Doris Xin and Moustafa Abdelbaky on Classical ML, Context Graphs, Data Agents, and the Future of Enterprise AI.


In this episode, Ben Lorica sits down with Doris Xin and Moustafa Abdelbaky, co-founders of Disarray, to discuss why classical machine learning models remain essential despite the rise of foundation models and LLMs. Doris and Moustafa explain how Disarray uses agentic systems and a proprietary context graph to navigate fragmented enterprise data — spanning legacy systems, code repositories, Slack messages, and more — to automate the full ML development lifecycle, from data engineering to model deployment. They also explore the limitations of AutoML and time series foundation models, the critical role of human oversight, and how entity resolution and long-horizon autonomy set their approach apart.



Related content:

A video version of this conversation is available on our YouTube channel.
Time Series Foundation Models: What You Need To Know
Jure Leskovec → The AI Revolution Finally Comes to Structured Data
Terrence Lee-St. John → When “Garbage In, Garbage Out” Gets It Wrong
Stop upgrading your LLM. Start fixing your data.
Data Engineering in 2026: What Changes?
Mikio Braun → Coding Agents Meet Data Science
The Missing Layer: Why Your AI Agent Fails — and What Actually Fixes It
Jeff Hawke → World Models Are Here—But It’s Still the GPT-2 Phase


Transcript

Below is a polished and edited transcript.

Ben Lorica: All right. Today we have Doris Xin and Moustafa Abdelbaky, co-founders of Disarray, which you can find at disarray.ai. The description on the website is: “Agentic systems to turn complex, proprietary data into production-quality machine learning models at a fraction of the time and cost of manual development.” With that, welcome to the podcast.

Doris Xin: Thank you so much for having us, Ben.

Ben Lorica: Let’s set the context for our listeners. When we talk about machine learning in this discussion, we mean the classic machine learning models and tasks that companies still need: forecasting, fraud detection, recommenders, and so on. There is so much hype around AI that people sometimes forget these models still power a lot of important systems. In logistics, for example, companies still need forecasting tools for their supply chain.

One quick question for our listeners: I think many people assume foundation models have replaced traditional machine learning. Why is this assumption wrong?

Doris Xin: This assumption is wrong for a couple of reasons. First, for the types of models and applications you just mentioned, LLMs, video generation, video understanding, and image understanding are not going to be able to work with your newsfeed records or your product purchase records. This is a fundamentally different type of data, and these classical models are targeted at that kind of data.

Then there is the separate issue of your own proprietary data. Sure, we talk about fine-tuning, but how do you fine-tune an LLM on your user interaction data — billions of clicks, for example? That is not something foundation models are designed to handle.

Ben Lorica: Moustafa, Doris raised two points. One is modality, or the type of data. On the other hand, foundation models are slowly becoming more multimodal. The more pressing issue is proprietary data. One of the open secrets in the space is that integration is still a problem. People want to build agents, but the reality is that a lot of this data lives in legacy systems, such as CRMs and even older systems. So I have two questions: to what extent do you tackle the integration problem and the multimodality problem?

Moustafa Abdelbaky: Beautiful question. Maybe I’ll start with the integration question. For us, the thesis has always been that data is fragmented across the enterprise ecosystem. You have a lot of different systems, many of them very legacy. You have data gathered everywhere. But it is not just the data; it is also the context around the data and how that data has been used in the past to train classical ML models. That is really important.

One of the things we focus on is how to construct a knowledge graph, or a context graph, that allows you to map all of these relationships: everything that has been built in the past, how it has been used — essentially, the enterprise brain of what has happened — and surface that as context to agents so they can build new models and applications on top of it.

That is what we do on the integration side. We excel at building semantic relationships, using techniques like static analysis and other approaches to understand lineage: how data has been used in the past, what applications have used it, and what downstream applications would be impacted if, for example, you tweak a column here or there. That answers the first question around integration. The second question was around—

Ben Lorica: Multimodality. As I mentioned, foundation models seem to be becoming more adept at handling different modes of data, right?

Moustafa Abdelbaky: Yeah.
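
The lineage question Moustafa raises above (what downstream applications are impacted if you tweak a column) can be illustrated with a minimal sketch. This is not Disarray's implementation; the asset names, the `LINEAGE` mapping, and the `downstream` helper are all hypothetical, and a real system would derive the edges from static analysis of pipeline code rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed in its value.
LINEAGE = {
    "raw.orders.amount": ["staging.orders_clean.amount_usd"],
    "staging.orders_clean.amount_usd": ["features.customer_ltv", "dash.revenue_daily"],
    "features.customer_ltv": ["models.churn_v2"],
}


def downstream(asset, lineage):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen, order, queue = {asset}, [], deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Everything that could be affected by changing the raw column.
impacted = downstream("raw.orders.amount", LINEAGE)
print(impacted)
```

Breadth-first order lists the nearest dependents first, which is usually the order in which a reviewer would want to check them.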

Doris Xin: We’ve actually tried to use LLMs for regression tasks. That is one of the major ones. The problems were twofold. First, it is very inefficient. For the kind of classical models we are talking about, you are dealing with millions or hundreds of millions of parameters. Using a massive LLM to make the same predictions is a waste.

The other issue is that, fundamentally, the type of data is very different. It is not about sequence prediction. It is not autoregressive. Even forcing an LLM to do that kind of task led it to heavily underperform classical models.

Ben Lorica: I have a couple of other questions, but first let me go back to the integration question. What is your working assumption? Is your working assumption that I have a lakehouse or warehouse — Snowflake or Databricks — and that is your starting point? Or is your working assumption that you will also help me build models where the data is not yet in Databricks or Snowflake and needs to be stitched together? In other words, there is the data engineering and pipelines piece, and then there is the model-building piece. What are you assuming I have?

Doris Xin: We are very much assuming the latter. We assume your data is all over the place. It is probably in some spreadsheets over here, some CSVs in S3, even though they probably should be tables in Snowflake or Databricks. We are very much planning for multimodality of data that lives in different places. The same kind of data may also live across different systems.

A major focus of what we do is figuring out how to bring all of that context onto a unified surface that allows us to understand the connections between these seemingly disparate pieces.

Ben Lorica: But Moustafa, even if you can build a model in the way Doris just described, where the data is living all over the place, do you know anyone who deploys anything like that in production? If the data is all over the place and in spreadsheets, isn’t the first step to get the data into one place where you can work on it, clean it, standardize it, and then build the model?

Moustafa Abdelbaky: Absolutely. Given our background and experience building in this space, we think about the holistic journey: from the point where the data is all over the place to the point where you have a working model. That is how we designed Disarray.

It is essentially a human-in-the-loop agent that allows you to run very long-range research. It starts with stitching the data, building the pipeline, doing the ETL and data engineering step by step, doing the cleaning and transformations, putting things in the right place, and doing feature engineering: essentially every stage of the machine learning cycle. We are not just starting with, “Here are my features; train a model.” We cover every step along the way, including data augmentation.

Ben Lorica: So your goal is not just to help me build a proof of concept. If my data is a mess, you will help me deploy this model in production. Along the way, some data engineering artifacts get created that I can potentially use for other models, right?

Moustafa Abdelbaky: Absolutely. And it does not have to be just machine learning models. You can think of all the data products you might want to build.

The nice thing about what we do is that we do not just look at your data. We also look at your code — everything you have built in the past — and use that to understand best practices and semantics. What does this table actually mean? What does this field mean? Things can mean different things to different people. Even different teams in the same company can have very different definitions of what revenue means.

Part of what we do is use that code to understand the semantics of how things have been used in the past and use that to construct the right context for the goal you are trying to achieve.

Ben Lorica: For our listeners and viewers: normally I do not talk to people who do not have anything available yet, but I am making an exception because these folks are RISELab alums, and I am a UC Berkeley family member in many ways.

Another question: one of the main things you are going to help me with, as Doris described, is creating a context graph — or what Databricks and Snowflake sometimes refer to as the semantic layer in their data warehouse. I want to get into that piece.

But there is another line of work around foundation models that tries to approach this problem from more of a foundation-model perspective. The original set of foundation models focused on time series. There were time-series foundation models, which, by the way, I tried. My conclusion is that if you want something super accurate — whether your use case is financial trading or some other application that requires a high level of precision — maybe these are not the right tools. But they do solve the problem that, if you have nothing, something is better than nothing in many cases.

Did you first look at some of these time-series foundation models? Granted, time series is only one slice of what you are trying to do. What is your assessment of time-series foundation models? Feel free to trash them if you feel like it. Don’t hold back.

Doris Xin: For us, time-series foundation models really come down to the fit between a foundation model and your data. LLMs make sense because language is language. Time series, by contrast, come in very different shapes and forms, and every organization will have its own.

Ben Lorica: It fits within a subset of what you are trying to do. There is a class of data scientists whose task is forecasting or anomaly detection, which inevitably involves time series. My problem with time-series foundation models is that people say they work well, but define “well” — in terms of accuracy, latency, and all sorts of things.

Doris Xin: Right. You are really trading off the effort between fine-tuning that model for your use cases and your data versus training something from scratch. I do not know that there is a clear winner, or that because you are starting from a foundation model you are going to get to a better place faster.

Ben Lorica: So did you kick the tires on these things?

Doris Xin: We did. I mentioned earlier that we used LLMs for regression before, and the performance was horrendous.

Ben Lorica: Your approach is to build a context graph or context store. It could be a graph, or it could be something else. What do you need in order to build this? Doris, in the example earlier, my data is all over the place: Excel files, whatever. A lot of the context may not exist in the files; it might exist in someone’s head. How do you build context when context is all over the place?

Doris Xin: A key piece of the puzzle is that when we think of context, we often go straight to data catalogs and try to extract meaning directly from each piece of data individually. But what we really need to do is make connections between different systems via other knowledge sources.

Maybe the context is in someone’s slide. Maybe it is in a Slack message. Maybe it is in the data pipeline code that transforms Data Source A into Data Source B. Or maybe it is in wikis.

We need to be able to identify, in these knowledge documents — code included — what pieces of data they are referring to, and then use those knowledge documents to add additional semantics and relationships across these different things to create a holistic context graph.

Ben Lorica: Moustafa, did you look into the prior generation of startups that tried to build the next generation of metadata stores? What was that system out of LinkedIn? DataHub. I advised one of these companies. One of the key things they learned is that context exists in many places, as Doris alluded to. It could be metadata, usage data from SQL logs, or organizational data. Let’s look at the org chart: Ben reports to Doris, so Ben’s work might be more valid because Doris is just a manager and not hands-on. Ben actually builds all the pipelines, and Doris manages Ben.

There are many sources of context. Did you look at that line of prior work around metadata?

Moustafa Abdelbaky: Absolutely, and it informed a lot of our thinking. But where our thesis differentiates itself is that a lot of the logic is encoded in applications. These companies have been around for a while and have been building applications left and right, where code encodes a lot of the semantics. Documentation, Slack messages, pipelines, Airflow runs, logs — all of these have pieces of context.

The part that is really compelling about what we do is entity resolution. People can be talking about the same data system or the same piece of data in many different ways, but they are all referring to the same entity. That is where we excel. We use entity resolution to understand all the relationships that can be inferred about a particular piece of data.
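
To make the idea concrete, here is a deliberately simple sketch of resolving several mentions to one entity. It uses plain string normalization, which is a much weaker technique than a production entity resolution system, and it is not Disarray's method. The mentions, the `STOPWORDS` set, and the `canonical_key` and `resolve` helpers are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mentions of data assets gathered from different sources.
MENTIONS = [
    ("snowflake", "analytics.orders"),
    ("slack", "the Orders table in analytics"),
    ("airflow_log", "ANALYTICS.ORDERS"),
    ("wiki", "crm.contacts"),
]

# Filler words that carry no identifying information in this toy example.
STOPWORDS = {"the", "table", "in", "of"}


def canonical_key(mention):
    """Lowercase, split on non-alphanumerics, drop filler words, ignore order."""
    tokens = re.split(r"[^a-z0-9]+", mention.lower())
    return frozenset(t for t in tokens if t and t not in STOPWORDS)


def resolve(mentions):
    """Group mentions that normalize to the same key into one entity."""
    entities = defaultdict(list)
    for source, text in mentions:
        entities[canonical_key(text)].append(source)
    return dict(entities)


entities = resolve(MENTIONS)
for key, sources in entities.items():
    print(sorted(key), "<-", sources)
```

Exact matching on a normalized key will miss abbreviations and renames; those are the cases where a best-effort, error-tolerant approach becomes necessary.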

Ben Lorica: Entity resolution itself is a whole area. There are startups that only do entity resolution for people because, obviously, for fraud detection and things like that, you have to be able to do this at scale in real time. What is the margin of error for your entity resolution system? I imagine it is best effort. It does not have to be 100% correct, but it still provides additional context because most systems will not try to resolve entities at all.

Doris Xin: Right. Our graph is eventually feeding into agents that have the ability to handle some amount of error or uncertainty in the context that is brought in. We aim to bring in enough context so that, even if there is a mistake, the agent can further reconcile it rather than taking it as ground truth.

Because of this agentic integration, we also have a virtuous cycle. As we observe how people work with these things, we get signals about whether we got the entity resolution right, and we can update the graph for refinement.

Ben Lorica: Once your system is built, how do you envision it being used or deployed? Will it require a bunch of forward-deployed engineers? It seems like your system requires getting into the internals of a company. You can build crawlers or integrations, but at some level there is no substitute for getting inside a company.

Doris Xin: We designed the solution to be as self-service as possible. The starting point is not a forward-deployed engineer sitting with a team for two weeks or two months to gather data. It is more like a Fivetran-style connector or Airbyte-style connector. You tell us where your data lives and give us the ability to grab metadata from these different sources, and the system itself does the heavy lifting of triangulating across different systems: Notion, GitHub, S3, Snowflake, Databricks, and so on.

Once all of this context graph is collected and used, we envision it being as simple as talking to an agent, like a coding agent. At the end of the day, ML engineers are also writing code to build these very specialized applications. Our system acts like a coding agent, but it has a lot more context about your entire data ecosystem.

Ben Lorica: So it is a coding agent for ML engineers and data scientists in some ways. Coding agents are not that great at data science. Data science requires looking at the data, realizing it needs to be cleaned, and so on. A lot of that might exist in notebooks or other things the data team has built.

Moustafa, is your target persona — or the place where you land inside the company — the data science or ML team? It does not seem like the kind of tool where you can say, “Hey, you are a marketing analyst and want to build a churn model, and you do not want to go through the data science team, so use Disarray.” It sounds like Disarray still requires an ML engineer or data scientist.

Moustafa Abdelbaky: Eventually, the plan is to allow anybody to make sense of data. The goal for Disarray has always been to democratize access to data. But the starting point is definitely ML engineers and data scientists.

The reason is that, as you know, a lot of these coding agents can make a mess if you do not have enough context or experience. Especially with data, we have deleted buckets and done a lot of different things. It has been interesting to see these failure modes. I do not know if you saw this recently, but someone posted that they were using Claude and it deleted their entire database, as well as all of their backups.

We definitely want to start with people who can guide the coding agent. The coding agent is very performant, and with context it is 10 times as performant, but it still needs guidance. This is what we think about with the human in the loop: making sure someone understands the larger objectives.

Especially if you are training a model, it is not just about cranking out the model. There are legal and ethical considerations you need to think about when building these models, and agents cannot do that. You need someone with the experience and knowledge to build these kinds of models.

Eventually, once we have enough training data and enough experience, we should be able to go back to someone on the marketing team and say, “Now you can self-onboard and build what you want.”

Ben Lorica: Doris, let’s say you land in a company and your tool is asked to build a model they do not have yet — let’s say a churn model. You unleash your agent. The agent grabs the context, gets the right data, and prepares the data. That seems like the heart of the system. Once I have the data, doesn’t the previous generation of AutoML build at least a starter model? If I am starting from no churn model to having something, it seems like where your system really helps me is getting all the data, the features, and the things I need to build a model.

Doris Xin: I’m really glad you brought up AutoML. That was actually one of the key research topics for my PhD research, where we wanted to understand where autonomy lies and where the human still needs to have control.

I think the general consensus is that AutoML has failed or been forgotten. There are a couple of reasons for that. First, until LLMs, we had a lot of trouble understanding semantics. AutoML was really just glorified hyperparameter tuning, maybe with some feature engineering — very mechanically searching through well-defined search spaces. That meant it could only automate a very limited part of the lifecycle.

Ben Lorica: It is actually automating the most fun part of the process.

Doris Xin: Right: being able to see how the model performs and then having the right “aha” moment, the intuition.

Ben Lorica: The actual building of the model is the most fun part.

Doris Xin: That’s right. The other piece is that because it did not have a lot of intelligence baked into it — Bayesian optimization was kind of the best we had — it was trying a lot of random things. The hypotheses it churned through were not super high quality, which meant it ran way more iterations than an intelligent human would. That also limited its applications.

Ben Lorica: But to my point, if I am willing to cede the model building to AutoML because I am starting from no churn model, it seems like where your system really helps me is getting to the place where AutoML can start: getting all of the data and the right data. That is actually the hard part.

Moustafa Abdelbaky: Absolutely. But there is another interesting piece we have also built. We looked at Kaggle — if you think of Kaggle as the largest enterprise in the world, with all the datasets and all the state-of-the-art code and implementations — and built a context graph out of all of Kaggle.

For use cases where you have never built this kind of model before, we can surface the state of the art and use that to bootstrap the agents. It is not just naive AutoML, where it tries everything. It actually understands: for this particular problem and this particular dataset, let’s reduce the search space to the best architectures people have used in the past.

Ben Lorica: You folks are from Berkeley, so I’m going to bring in Stanford, the rival across the bay. Last year I talked to Jure Leskovec, who has a startup called Kumo.ai. At a high level, based on prior work they did on graph neural networks, their working assumption is that you have Snowflake or Databricks, and they will grab all sorts of context out of those systems. They have a foundation model trained on lots of data, including synthetic data they created. The claim is that, with their system, you can build standard models (fraud detection, forecasting, and so on) using pipe-through prompting.

It seems like that system would tackle at least what you are trying to do if I have my data in Snowflake or Databricks. I have not used their system, but Jure swears by it, and they do have user logos on their website. Did you look into this line of work? If my data is already in a warehouse, clean and structured, and I know the columns and metadata, do I still need Disarray?

Doris Xin: In those situations, as long as your columns are canonically defined — for example, specific revenue definitions—

Ben Lorica: You mean the names are descriptive?

Doris Xin: Right. The names are descriptive, and there is nothing idiosyncratic about how you define things.

Ben Lorica: And assuming I have a catalog, though most warehouses probably do not have such a great catalog — one where the catalog is really helpful.

Doris Xin: Right. I think these foundation models that build classical models rely on your columns conforming to specific semantics. If there is anything different about how you define something that sounds similar to how other people would define it, or if there are quirks or data distribution anomalies, then you need additional mechanisms to compensate for that.

Ben Lorica: So if my columns are just called A1, A2, A3, can you derive context? If the task is churn, or forecasting a value, I suppose you can brute-force and build the model based on the columns I make available to you, just like AutoML people would brute-force it. If the column names are so bad, where else will the additional context come from?

Doris Xin: Chances are we can find the code that generated that column. The SQL query is somewhere, or maybe it is in somebody’s pandas notebook. There is a lot of hope for recovering, from surrounding context, what that column actually is. If we see that A1 came from joining a column named “first name” and another named “last name,” then we know what A1 is about.
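
The A1 example can be sketched in a few lines. This is only an illustration under strong assumptions: the query, the `users` schema, and the `column_sources` helper are made up, and the regular expressions handle only a flat `SELECT expr AS alias, ... FROM ...` statement. Real SQL (functions, subqueries, commas inside expressions) would need a proper parser.

```python
import re

# Hypothetical query found in a repository; the schema is made up for illustration.
SQL = "SELECT first_name || ' ' || last_name AS A1, signup_ts AS A2 FROM users"


def column_sources(sql):
    """Map each output alias to the identifiers in its defining expression."""
    match = re.search(r"select\s+(.*?)\s+from\s", sql, re.IGNORECASE | re.DOTALL)
    if match is None:
        return {}
    sources = {}
    for item in match.group(1).split(","):
        parts = re.split(r"\s+as\s+", item.strip(), flags=re.IGNORECASE)
        if len(parts) != 2:
            continue  # skip select items without an explicit alias
        expr, alias = parts
        no_strings = re.sub(r"'[^']*'", "", expr)  # drop string literals
        sources[alias] = re.findall(r"[A-Za-z_]\w*", no_strings)
    return sources


provenance = column_sources(SQL)
print(provenance)
```

With the source columns recovered, an opaque name like A1 can be annotated as a full name built from a first name and a last name.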

Ben Lorica: It seems like a lot of the system’s value is in the data engineering part. Once you help me with that, I may just build the model, because that is the fun part if I am a data scientist. What is the role of the data engineer in your system? Is the data engineer a reviewer, debugger, approver? What is the role?

Moustafa Abdelbaky: I would think of them more as supervisors who are now supervising many agents that can do data engineering. Data engineers in particular have a lot of backfill work. There is not enough time in the day. Now imagine having assistants that can help you build all of these pieces, while you review, orchestrate, and supervise at a much higher level than writing a SQL query or a particular ETL pipeline.

Ben Lorica: A lot of what data engineers do is build pipelines that do not exist. How can your system magically know what pipeline to build? The data engineer has to talk to several people to understand: “Oh, you want this and that. Let me first crawl the internet to get this.” There is context inside people’s heads that requires conversations and, unfortunately, multiple meetings. If one of the main tasks of a data engineer is requirements gathering or understanding what pipeline to build, how can an agent bypass that?

Moustafa Abdelbaky: It does not bypass that. There is talking to people to understand what you want to build. But part of what you just mentioned is: I need to crawl the internet, I need to look at what I already have, I need to understand this. Those are the pieces you delegate to the agent. You say, “Go crawl the internet and find the information that might be relevant for this use case.”

This goes back to the idea of supervision. The data engineer is still orchestrating because they are the people talking to stakeholders and understanding the requirements. Then they work with the agent to automate the data-drudge work.

Ben Lorica: At some point, you use the equivalent of Claude Code to help the data engineer build the actual pipeline, optimize it, and know that it works and will not break.

Moustafa Abdelbaky: Exactly. On that note, one thing that is important is that, for many coding agents, people are building relatively short features. You have a short window: “I am going to build this feature.” The agent runs for a couple of minutes, maybe a couple of hours, and then it stops.

For a lot of data work, that is not the case. The timescale is very different. We run things for hours and days, and you are training models. A lot of what we built into the agent harness itself is long-horizon autonomy. How do you allow these agents to run for a very long time without falling into the context-rot trap or running into issues around safety and permissions?

A lot of our own engineering has gone into building a much more robust version of a Claude Code-style system that understands it is working in an unexplored space. There is no clear definition of success. It is not simply writing a test that will pass or fail and then iterating. A lot of it is a research cycle: you propose a hypothesis, do experiments, come back, evaluate, and try again. That applies to the data engineering problem, the ML engineering problem, the data science problem, and a lot of AI research in general.
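
The research cycle described here (propose a hypothesis, run an experiment, evaluate, try again) can be sketched as a budgeted loop. This is a toy illustration, not Disarray's agent harness: `research_loop`, `propose`, `run_experiment`, the candidate depths, and the scoring function are all hypothetical stand-ins.

```python
def research_loop(propose, run_experiment, budget):
    """Propose a hypothesis, run it, record the result, repeat until the budget is spent."""
    history = []  # (hypothesis, score) pairs; the loop's working memory
    best = None
    for _ in range(budget):
        hypothesis = propose(history)
        if hypothesis is None:  # nothing left worth trying
            break
        score = run_experiment(hypothesis)
        history.append((hypothesis, score))
        if best is None or score > best[1]:
            best = (hypothesis, score)
    return best, history


# Toy stand-ins: the "hypothesis" is a tree depth, the "experiment" a made-up score.
CANDIDATE_DEPTHS = [2, 4, 8, 16]


def propose(history):
    """Pick the next untried depth, or None when all have been tried."""
    tried = {hypothesis for hypothesis, _ in history}
    remaining = [d for d in CANDIDATE_DEPTHS if d not in tried]
    return remaining[0] if remaining else None


def run_experiment(depth):
    """Made-up score that peaks at depth 8 by construction."""
    return 1.0 - abs(depth - 8) / 16


best, history = research_loop(propose, run_experiment, budget=10)
print(best, len(history))
```

Passing the full history into `propose` is the design choice that matters: it is the hook where prior results, or an external context source, can steer the next hypothesis instead of trying candidates blindly.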

Ben Lorica: How opinionated is your system when it comes to the model? In other words, if I am an enterprise and I tell you, “We are pre-approved to use Gemini 3.x, but not the other models,” are you flexible enough to use whatever model the company is using?

Moustafa Abdelbaky: Yes. One line of code.

Ben Lorica: Interesting. Most of these agents are very sensitive to the model. Honestly, when I talk to many startups that are building agents, the agent is really just a prompt. Are you comfortable that whatever model I am using, your system will still work?

Moustafa Abdelbaky: Absolutely. One of the interesting experiments we ran involved the Kaggle competitions Doris mentioned. We gave the system a competition description, a starting dataset, and a GPU, and said, “You have 24 hours to get a medal.”

We ran this with many different models from all the frontier labs: OpenAI, Anthropic, Gemini, and so forth. We ran more than 500 experiments and looked at whether the model made a difference. What we found was that, for a lot of these tasks, most of the models were able to get a medal. What really mattered was whether the context graph was turned on or off. That made a very big difference in performance. Some of them were not able to medal until we turned on the context graph, and then they were able to medal much faster.

The context graph — or our thesis, which is that context matters — is much more important. Ali from Databricks said a couple of weeks ago at HumanX that a lot of people are trying to go after superintelligence, but he believes AGI is already here. Models are already very smart. We actually believe that as well. The models are pretty good. It is about the context and how you connect them to workflows that will make the biggest difference.

Ben Lorica: You go into a company, particularly a data engineering or data science team, and you tell them, “If you use our tool, you can be 10 times more productive and take on 10 times more projects.” On the other hand, some might interpret that as, “If I bring in this tool, we might end up reducing the size of our team.” Therefore, they may bad-mouth Disarray, make sure it does not get used, and point out its flaws. Have you thought about how to engage these stakeholders?

Doris Xin: Human in the loop is a huge philosophy that we very much stand behind. We want to make sure the users we serve have full visibility and understanding, even though they may not be hands-on with everything that is happening. Making the human your partner, rather than replacing them, is a huge part of what we believe is needed to be successful in the data world.

It does not matter whether Claude built it or the data engineer built it. At the end of the day, the data engineer is responsible for what happens, and these are very high-stakes applications. Human accountability will never go away.

The other aspect is that I do not think we currently have a surplus of ML engineering talent. There is way more demand than the talent can handle. We see Disarray as a way for people to take on more of their backlog, rather than making things so efficient that they do not need more engineers.

Ben Lorica: One of the complaints right now about agents and MCP is that they are inefficient. It seems like you can help in the following sense: for a project I have never done before, you might help me build a churn model that does not exist. Once you build this churn model, it improves the context store or context graph. But there is also some notion of memory, specifically procedural memory.

Now I want to build a fraud detection model or something else. There may have been lessons from the churn model that carry over to fraud detection. Therefore, when I use your system, hopefully I will not burn as many tokens as if I had learned no lessons from building the churn model. One of the complaints about MCP is that it is basically dumb. It does not know anything. Each time, it burns through tons and tons of tokens. How are you going to help me with token efficiency?

Doris Xin: Efficiency really comes down to how many things you try before you land on the thing that works. The context graph is the way we make sure every iteration is a high-quality hypothesis.

The context graph not only remembers what our agent has done in the past; it also seeds the agent with the institutional knowledge of the organization we serve on day one. The agent does not need to try things that human engineers have already tried and know will not work.

Efficiency comes down to taking fewer, higher-quality shots. Given that each iteration may involve hundreds of hours of GPU, bringing the number of iterations down from 10 to two is massive, both in terms of compute and token usage.
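
The arithmetic behind that claim is short. In the sketch below, the 200 GPU-hours per iteration is an assumed figure standing in for "hundreds of hours"; the iteration counts of 10 and two come from the conversation, and `iteration_savings` is a hypothetical helper.

```python
def iteration_savings(before, after):
    """Fraction of total cost avoided when the iteration count drops,
    assuming every iteration costs roughly the same."""
    return 1 - after / before


GPU_HOURS_PER_ITERATION = 200  # assumption: stands in for "hundreds of hours"

saved_fraction = iteration_savings(before=10, after=2)
saved_gpu_hours = (10 - 2) * GPU_HOURS_PER_ITERATION
print(f"{saved_fraction:.0%} of the cost avoided, {saved_gpu_hours} GPU-hours saved")
```

The saved fraction does not depend on the per-iteration cost, so the same proportion applies to token usage as long as each iteration consumes a similar amount.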

Ben Lorica: So your system has memory?

Doris Xin: That’s right.

Ben Lorica: And this memory is stored in the Disarray proprietary layer, so I cannot leverage it for anything outside of Disarray.

Doris Xin: It is all part of the context graph. Because we construct it from your existing data assets, you would be able to construct it yourself as well. Technically, we are not walking away with any additional knowledge. We are organizing it in a way that makes it easy for the agent to take advantage of.

Ben Lorica: But that layer of your system is not accessible externally? Let’s say I want to do something else that is not related to building an ML model. Because this exists, maybe I should try to talk to this layer of Disarray and see if it might help me with another task.

Moustafa Abdelbaky: Absolutely. As long as you are in the same organization, you do have access to that context. Any context that gets built is accessible. The way to think about it is that there is the initial context we constructed before we came in, and then there is existing context that continues to be built around the applications and models you are building.

Ben Lorica: So the context store, or context graph, is a standalone system that I can talk to and use outside of building an ML model?

Moustafa Abdelbaky: Absolutely.

Ben Lorica: Is it an actual graph stored in a graph database? Or what is it?

Moustafa Abdelbaky: It is a graph. It is not stored in a graph database.

Ben Lorica: Thank God.

Moustafa Abdelbaky: Our scale is a lot larger than what we can support with a graph database. But it does manifest in memory as a graph.

Ben Lorica: Is it basically like a knowledge graph?

Doris Xin: It is like a knowledge graph-plus-plus. A knowledge graph is entity relationships. But the context you alluded to earlier — the person who works on a piece of data, for example — also brings a lot of context. The orchestrator layer tells us how something came to be. That context is key.

This notion of a context graph is a graph of decision traces. We do not only want to know what the world looks like today. We want to know how it got there so we can use lessons from the past and apply them going forward.
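To make the "graph of decision traces" idea concrete, here is a minimal Python sketch. It is an illustration only and not Disarray's implementation: the `Decision` and `ContextGraph` classes, their fields, and the example entities are all hypothetical. Each entity points to the decision that produced it, so walking upstream answers "how did this come to be?" rather than only "what exists today?"

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """One decision trace: who did what, and why (hypothetical fields)."""
    actor: str
    action: str
    rationale: str
    inputs: list[str] = field(default_factory=list)  # upstream entity names


class ContextGraph:
    """Toy in-memory graph: each entity maps to the decision that produced it."""

    def __init__(self) -> None:
        self.produced_by: dict[str, Decision] = {}

    def record(self, output: str, decision: Decision) -> None:
        self.produced_by[output] = decision

    def lineage(self, entity: str) -> list[str]:
        """Walk upstream from an entity and return readable trace lines."""
        lines: list[str] = []
        seen: set[str] = set()
        stack = [entity]
        while stack:
            current = stack.pop()
            # Skip entities already visited and raw sources with no recorded decision.
            if current in seen or current not in self.produced_by:
                continue
            seen.add(current)
            d = self.produced_by[current]
            lines.append(f"{current}: {d.action} by {d.actor} ({d.rationale})")
            stack.extend(d.inputs)
        return lines


graph = ContextGraph()
graph.record(
    "clean_events",
    Decision("data-eng", "deduplicated raw_events", "duplicate rows from retries", ["raw_events"]),
)
graph.record(
    "churn_model_v1",
    Decision("ml-eng", "trained gradient-boosted trees", "tabular data, small sample", ["clean_events"]),
)
trace = graph.lineage("churn_model_v1")
```

Here `trace` lists the model's decision first and the upstream cleaning step second, which is the kind of lineage traversal discussed later in the conversation.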

Ben Lorica: In these systems, the main example right now is coding agents. The harness can vary, but some people even just use a command line as the main interface. In this case, it seems like the data engineer, machine learning engineer, or data scientist will require something more to engender trust. How do I know the system is getting the right data sources or has done the work to standardize or clean the data? What is the UX, or what is the main interface in the harness?

Doris Xin: The UX is a key piece of the puzzle that we are actively working on. Because we can integrate with coding agents, we surface the key context we use to reach a conclusion or recommendation along with the response.

We will be able to tell you: the reason we built this model architecture is that we have seen others use this architecture in similar applications, and here is the performance. But making sense of data is much more than reading a couple of citations. You want to traverse the lineage: “This model was built with what data? How was that data created?”

There is active work around how to surface data context and model intelligence context in a user-friendly way. We have been exploring different ways of showing things as graphs or making it easy for people to traverse relationships.

Ben Lorica: Data scientists use notebooks. I am not a fan of notebooks, but they are a tool people use. A notebook is like a lab notebook; it gives you lineage. The work is there. One of the first things people do is load the data, run basic descriptive statistics, and do exploratory data analysis. Will your system have that kind of view? The reason people do that is to get comfortable with the data. But if your system already does that, maybe I do not need to see it anymore. Just tell me you did it. Tell me you kicked the tires on the data and that will reassure me.

What are some of your early thoughts around reassuring the user?

Doris Xin: A notebook is actually a great example. We generate the notebook for people so they can walk through it in this familiar format.

Ben Lorica: If they want.

Doris Xin: That’s right.

Ben Lorica: So if they want to see the step-by-step process, it is there.

Doris Xin: That’s right.

Moustafa Abdelbaky: We also cite our sources. For example: we are using this particular architecture because here is the actual GitHub notebook that someone else on your team used in the past on this particular data.

From the get-go, data is tricky to work with, and we want to make sure people trust what we and the agents or LLMs are saying. If you think of the context graph as the understanding layer for institutional knowledge, we expose that to coding agents or MCP, but we also expose it via REST endpoints to the data scientist, the manager, or the data team.

There are different modalities for how you interact with and traverse the graph and understand what happened in the past. We have also built UIs in the past, but with millions of nodes, it is really hard for people to understand that at scale.

Ben Lorica: Data engineers and machine learning engineers may not use notebooks because notebooks have the stereotype of not being production-ready. In data engineering, building a pipeline might mean touching multiple distributed systems: Kafka, data processing in Spark, and then landing the data in a lakehouse. That is what makes testing and debugging pipelines hard: you sometimes have so much infrastructure.

Your system is only as good as my existing system for testing pipelines. If my production system is something I cannot touch, I may have a development environment. Sure, I can try to build the pipeline on my laptop, but how do I know that once I deploy it, it will work? How do you cross the chasm from development — or initial pipeline development, debugging, and checking — to production, given that it often involves infrastructure?

Doris Xin: When the system connects with all of these different data sources and orchestrators, it also has the tools to invoke them. It can grab some data and try to see whether this will run on the various infrastructure pieces you have. It is not just running locally on your laptop. It can run different pieces in different development environments.

Crossing the production chasm comes down to what test harnesses are available to the developer and can be transferred onto the agent. The agent can also help you build out your A/B testing infrastructure and gradual rollout infrastructure.

Ben Lorica: It would be nice if the agent told me, “Actually, we cannot deploy this to production because we need to test certain things,” or, “Frankly, Ben, your test environment sucks. Based on what you have told me about the production environment, your test environment is not good enough.”

Doris Xin: This is where we can partner with the data engineer. Through the context graph and the harnesses we build, we can help enforce good practices. Instead of the data engineer having to supervise ML engineers or data scientists to make sure they are doing the right thing, they can set up those practices in our system so it can enforce them across the organization in perpetuity.

Ben Lorica: My working assumption, despite what Doris said earlier, is that if all my data is in Databricks or Snowflake, then Databricks and Snowflake will build something similar to Disarray for that data. The challenge, of course, is that most enterprises use many systems. Inside a JP Morgan Chase, 10 groups might use Snowflake, 10 might use BigQuery, and 10 might use Databricks.

In many ways, this is your opportunity: you can bridge across multiple vendors. Is that correct? Even if each individual vendor executes well on the data in its system, it will only work for data in that system.

Moustafa Abdelbaky: Absolutely.

Ben Lorica: I am an advisor to Databricks, and Ali will probably say, “No, no, we can work across systems as well.” Earlier, you said you are pretty happy with the state of foundation models in terms of their capability. But let’s face it: no one is completely happy with the state of foundation models. What would you like to see from foundation models? Do you want models that are smaller and cheaper to operate, or is there a specific capability that would help what you are trying to do?

Doris Xin: I think we want to see some of the harness work we have been doing pushed down into the foundation models — some of these best practices and good behaviors baked into the foundation models through reinforcement learning, for example.

Ben Lorica: What specifically?

Doris Xin: A couple of things. We have to make sure the agent fully understands what the task is and what the completion criteria look like. I think there are certain things that, through reinforcement learning, can become part of how the agent thinks.

Ben Lorica: So even if you provide all the context through your context graph, the foundation models are still lacking?

Doris Xin: That is why we had to build out the harness for long-horizon autonomy.

Moustafa Abdelbaky: I can give a concrete example. One thing we have seen — and a lot of research, including from Anthropic, has seen something similar — is that LLMs are really good at claiming success when they have not finished the task. You can say, “Go build this,” and it will come back and say, “I’m done,” when it is not actually done.

What we ended up doing is building a multi-agent system where a supervisor verifies what the first agent has or has not done. We built this out of necessity because agents have a bias to claim success early.
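The supervisor pattern described here can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the system the speakers built: `WorkerReport`, `supervise`, and the example criteria are hypothetical names. The key design choice is that the supervisor checks the artifacts the worker produced against explicit completion criteria instead of trusting the worker's own "I'm done."

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WorkerReport:
    claims_done: bool                # what the worker agent says
    artifacts: dict[str, object]     # what the worker agent actually produced


# A completion criterion is a named check over the artifacts, not over the claim.
Criterion = tuple[str, Callable[[dict[str, object]], bool]]


def supervise(report: WorkerReport, criteria: list[Criterion]) -> tuple[bool, list[str]]:
    """Accept the work only if every criterion passes; also return the failed checks."""
    failed = [name for name, check in criteria if not check(report.artifacts)]
    accepted = report.claims_done and not failed
    return accepted, failed


criteria: list[Criterion] = [
    ("model file exists", lambda a: "model_path" in a),
    ("validation AUC reported", lambda a: isinstance(a.get("val_auc"), float)),
]

# The worker claims success but never reported a validation metric.
premature = WorkerReport(claims_done=True, artifacts={"model_path": "model.pkl"})
accepted, failed = supervise(premature, criteria)
```

In this example the premature claim is rejected, and `failed` names the unmet criterion so the task can be sent back to the worker.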

Ben Lorica: Was that an agent, like Claude Code, or Claude the model?

Moustafa Abdelbaky: Claude Code.

Ben Lorica: So that has nothing to do with the foundation model. That is the harness, right?

Moustafa Abdelbaky: To an extent. I think the line is blurring between those things. Even if you go to ChatGPT and ask it to do some research, it will come back and say, “I have done everything.” Then you look at it and realize it is missing this and this and this.

What we see as the endpoint for the model is no longer just a model. Obviously, there is a much larger system in place. We do tool execution. What we interact with as frontier models now is more of an agentic system than a single model.

Ben Lorica: From your perspective, while you can build that piece, it is not core to your system. The core of your system is still context. The task-completion piece is something you wish you did not have to build. Is that what I am hearing?

Moustafa Abdelbaky: Partially. We have seen a lot of interesting failure modes. For example, we have agents and give them tools. Since we are training models, we want to make sure they are not evaluating on the test set. One of the things we do is set permissions to block MCP tools from touching the directory that contains the test set.

Because the models are optimized to do well, one agent tried to write a Python script that would read the test set and bake that into the training model. We have seen a lot of interesting behavior. At the end of the day, the models are optimized for a specific objective, and part of what we build into these harnesses is ways to prevent the models from finding reward hacks.
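One way to picture the kind of guard discussed here is a path check that every file-reading tool call must pass. The Python sketch below is a hypothetical illustration, not the speakers' harness: `guarded_path`, `HeldOutAccessError`, and the directory layout are assumptions. It resolves paths before comparing them so that `..` segments cannot slip past the check. As the anecdote shows, a tool-level check alone does not stop a generated script, so a real setup would presumably also need permissions enforced wherever that script runs.

```python
from pathlib import Path


class HeldOutAccessError(PermissionError):
    """Raised when a tool call tries to touch the held-out test set."""


def guarded_path(requested: str, workspace: Path, blocked: Path) -> Path:
    """Resolve a requested path and refuse anything inside the blocked directory."""
    resolved = (workspace / requested).resolve()
    blocked = blocked.resolve()
    # Compare resolved paths so "train/../test" style traversal is caught.
    if resolved == blocked or blocked in resolved.parents:
        raise HeldOutAccessError(f"access to {resolved} is blocked")
    return resolved


workspace = Path("/srv/project")          # hypothetical project root
blocked = workspace / "data" / "test"     # hypothetical held-out directory

ok = guarded_path("data/train/part-0.parquet", workspace, blocked)

try:
    guarded_path("data/train/../test/labels.csv", workspace, blocked)
    leaked = True
except HeldOutAccessError:
    leaked = False
```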

Ben Lorica: Interesting. In closing, do I have to worry about data privacy and data governance? When you talk about building this context graph, in a typical enterprise there is identity management and access control. Maybe I am not supposed to know about a particular piece of context because I am in a different division from Moustafa.

Moustafa Abdelbaky: Absolutely. That is part of why our background and experience in industry have helped. From the get-go, our design has accounted for these issues. We baked role-based access control and identity management into the context graph, so you only get to see the personal view you are allowed to see. We essentially inherit the user’s permissions.
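A permission-filtered view of a graph might look something like the following Python sketch. It is only an illustration of the general role-based access control idea: the `Node` class, the role names, and `view_for` are hypothetical and say nothing about how Disarray actually stores or enforces permissions. A user sees a node only if they share at least one role with it, and an edge only if both of its endpoints are visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    allowed_roles: frozenset[str]   # roles permitted to see this node


nodes = [
    Node("payments.transactions", frozenset({"risk"})),
    Node("marketing.campaigns", frozenset({"marketing"})),
    Node("shared.customers", frozenset({"risk", "marketing"})),
]
edges = [
    ("payments.transactions", "shared.customers"),
    ("marketing.campaigns", "shared.customers"),
]


def view_for(user_roles: set[str]) -> tuple[set[str], list[tuple[str, str]]]:
    """Return the node names and edges visible to a user with the given roles."""
    visible = {n.name for n in nodes if n.allowed_roles & user_roles}
    # Drop any edge that would reveal a node the user is not allowed to see.
    kept = [(a, b) for a, b in edges if a in visible and b in visible]
    return visible, kept


marketing_nodes, marketing_edges = view_for({"marketing"})
```

With these example roles, a marketing user sees the campaign and shared customer nodes and the edge between them, but nothing about the payments data.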

Ben Lorica: Many companies like Snowflake and Databricks now assume that many of the users in their systems will be agents, not people. You could be used by agents too, right? Potentially, Disarray could have non-human agents, although Doris alluded to the human-in-the-loop piece as an important part of what you do.

Once the context graph is up and running and coverage is great, maybe an agent can just use Disarray, bypassing humans. Or is that never going to happen, given that the human in the loop is core to what you do?

Doris Xin: I think that is a question for a specific organization and how it thinks about accountability. If you have only an agent responsible for pushing out a machine learning model—

Ben Lorica: It could be a low-stakes model. Nothing that important. Maybe it will be built by marketing, and they do not have the expertise to look at the paper trail anyway.

Moustafa Abdelbaky: From a technical perspective, we support that. It comes down to governance and policy. What kind of policy do you want for these agents? If you are recommending credit cards or making decisions about credit applications, you cannot just say, “The agent came up with this.” There has to be accountability at the human level.

But if you are building the equivalent of a dashboard that someone is using internally, we can totally support that.

Ben Lorica: We are recording this in late April, and I am airing this in late May. When will Disarray be available for early trial?

Doris Xin: It will probably be after you air the episode because of the UX problems we talked about earlier. We are actively working with people to figure out the right UX.

Ben Lorica: What can people do to follow what you are doing?

Doris Xin: Please sign up for our newsletter. We will continue to talk more about the technical aspects of how we are building things. Also, feel free to reach out to me directly on LinkedIn or via email. We always love to hear how people are thinking about the space and what kinds of needs they have.

Ben Lorica: With that, thank you, Moustafa and Doris.

Doris Xin: Thank you so much, Ben.

Discover more from The Data Exchange