When “Garbage In, Garbage Out” Gets It Wrong

Terrence Lee-St. John on Dirty Data, Latent Signals, Predictive Robustness, and the Limits of Data Cleaning.

Ben Lorica speaks with Terrence Lee-St. John, founder of Enli and lead author of From Garbage to Gold: A Data Architectural Theory of Predictive Robustness. They discuss why “garbage in, garbage out” does not always hold for predictive models, especially when wide, messy tabular datasets contain redundant signals that help recover underlying latent drivers. The conversation explores practical implications for healthcare, regulated industries, feature selection, tabular foundation models, and how Enli aims to operationalize a more data-centric approach to stable, explainable prediction.

Related content:

A video version of this conversation is available on our YouTube channel.
Time Series Foundation Models: What You Need To Know
Jure Leskovec → The AI Revolution Finally Comes to Structured Data
Stop upgrading your LLM. Start fixing your data.
Data Engineering in 2026: What Changes?
Mikio Braun → Coding Agents Meet Data Science
The Missing Layer: Why Your AI Agent Fails — and What Actually Fixes It
Jeff Hawke → World Models Are Here—But It’s Still the GPT-2 Phase

Transcript

Below is a polished and edited transcript.

Ben Lorica: All right, today we have Terrence Lee-St. John, founder of Enly, which you can find at enly.com.au. Their tagline reads: “Predictive systems that remain stable under change, where operating conditions shift and decisions rely on dependable inference. Enly maintains observable and governable predictive behavior over time.” And with that, Terrence, welcome to the podcast.

Terrence Lee-St. John: Thanks for having me, Ben. I’m really excited to be talking to you right now.

Ben Lorica: The reason we’re getting together is that Terrence is the lead author of an interesting new paper, which we will talk about in this episode. I will link to it in the episode notes, so make sure you check those out. The title of the paper, if you want to look it up, is From Garbage to Gold: A Data Architectural Theory of Predictive Robustness. All right, Terrence, let’s start from the basics. In this paper, what problem are you trying to address?

Terrence Lee-St. John: My main career has been in industry. In industry, you always hear “garbage in, garbage out.” It’s thrown around all the time at project kickoff meetings whenever we’re looking at new datasets. What the paper addresses is that, more recently, in the context of big data, you’re seeing more and more examples of models being trained on data that we would traditionally consider garbage, yet those models are reaching state-of-the-art predictive performance.

So it’s defying the mantra of “garbage in, garbage out.” And there hasn’t really been a comprehensive explanation for why this is possible. Most of the explanations have been based on algorithmic behavior or model characteristics, regularization, things like that. But none of them, at least to my knowledge, have really looked at the data side of the equation and focused on what characteristics of the data itself might allow this paradoxical behavior to emerge. The paper is really trying to provide a theoretical explanation for why you’re able to get good predictive results out of data with a ton of errors in it, to be frank. So that’s the background. The paper presents the foundations of the theory. It’s not a full theory—that would require a book at this point—but it’s the start of one, and I’m working to build off of it going forward.

Ben Lorica: For our listeners, there have obviously been people who have noted some of these phenomena at a high level. I think there’s a famous phrase, “the unreasonable effectiveness of data.” That phrase is basically making the observation that, ultimately, people are empirical. They care about empirical results. As long as the results seem to be good, they run with it. But for me, having been formally trained as an applied mathematician, I always like having some theoretical grounding to make my empirical confidence even higher. Terrence, why did you decide to tackle this problem? In other words, why is this problem significant? Are there really a lot of people who care about this question? What are the practical implications, for example, if you are correct?

Terrence Lee-St. John: There are a couple of things I want to unpack there. Why I really decided to tackle it was that, even though I do think practitioners in the CS and ML world are much more empirically driven and the evidence speaks for itself, there are lots of contexts where that’s still not the case. I’ve worked in healthcare in the past, and a lot of this came out of my work there. Even with empirical evidence that these types of models can work on garbage datasets, I personally received pushback from hospital administrators and clinicians who were unable to accept that a high-stakes prediction—like a stroke or heart attack—could be made accurately from hospital records.

Part of that is because they know what goes into hospital record data collection. It’s dirty, it’s messy, lots of things are coded incorrectly, and there’s a lot of missing data. And part of it is just people believing “garbage in, garbage out” without thinking about why and what’s behind it. So this is a frustration I’ve encountered personally. Even though I had empirical evidence that this would work, I felt the need to provide a more rigorous theoretical justification. So that was part one. Part two, in terms of the practical implications if this theory is correct—

Ben Lorica: Actually, before you go down that road, besides healthcare, are there other domains where you think this might be an issue as well?

Terrence Lee-St. John: The issue of disbelief that the data itself could work?

Ben Lorica: Yeah.

Terrence Lee-St. John: I do think the “garbage in, garbage out” mantra is generally cited across domains, even in business domains. Obviously, empirical evidence is weighted more heavily in some of these other domains.

Ben Lorica: Yeah, yeah. In the regulated sectors, maybe financial services might be another one of these areas.

Terrence Lee-St. John: I mean, any sector that requires explanation for decision-making. A lot of regulated sectors tend to dislike the idea of throwing a bunch of dirty data into a model and just accepting that it works and will continue to work going forward.

Ben Lorica: Terrence, when you were talking earlier, it struck me—might one of the answers be, and I suspect the answer is no, a UX answer? In other words, if you are working with clinicians and you somehow provide them a UX that shows, “This is how I arrive at this decision,” with some level of explainability and transparency, is that enough?

Terrence Lee-St. John: If an explanation is given with the prediction that makes conceptual sense to the end user, yes, that definitely has helped in my experience. It helped with accepting that the model is there and is actually working properly behind the scenes.

Ben Lorica: So in other words, you can guide them through a UX that shows, “This is how the model works.” But let’s say you still don’t have your paper. You don’t have any theoretical foundations. Do you think that would have been enough?

Terrence Lee-St. John: In the practitioner or end-user space, maybe that’s enough. It depends on the regulation and how strong it is in the industry. But I would say there’s another benefit to having the theory beyond just adoption by the end user, and I go over that in the paper as well. One of the primary benefits is that if you understand how data contributes to a robust prediction, you can actually design your datasets to leverage some of the mechanisms underlying robust prediction itself. You can take a more proactive stance. Data isn’t something that we have to accept as is; it’s something we can adapt going forward. And if you understand the mechanisms of the data, then you can adapt it in an intelligent way that facilitates robustness.

That’s one piece. The other piece is that if you understand the mechanisms, you can also design algorithms and models that are explicitly intended to exploit the mechanisms of robustness in the data itself. So it guides the algorithmic or model development process. If certain things in the data enable robustness, then my model needs to be able to leverage those. So there are benefits beyond just the end user. But you’re right: the explainability piece does help the end user, and that’s certainly something that will always be important. Even if the theory is there, having it be explainable to the end user is going to be critical in healthcare and other regulated sectors.

Ben Lorica: So before this paper existed, obviously you had this big data—a lot of noise, maybe messy, right? People used it anyway. So what were the workarounds?

Terrence Lee-St. John: What were the workarounds that I used, or what were the workarounds that others used?

Ben Lorica: Or just generally, yeah.

Terrence Lee-St. John: Generally, it really depends. For some, there wasn’t a workaround. There are entire companies and industries dedicated to data cleaning, and the workaround for them is manual and intensive.

Ben Lorica: I have a good friend who has a startup, and he’s been working on it for years and years. They’re the world’s best company on one thing: entity resolution. They can do it in real time, at scale. And it’s something you might say, “I’ll just build it myself.” Yeah, you can build something, but it’s not going to be as good as theirs. And by the way, you have to maintain it as well, right?

Terrence Lee-St. John: For sure. Data cleaning, data quality in the traditional sense, is one workaround. Depending on how you’re operationalizing that, it may be extremely manually intensive, or maybe you have some automated tools that can handle some of it for you. But that has generally been the go-to playbook.

Ben Lorica: Because it’s also something that the end user can understand, right? It’s basically, “We had all this clinical data and clinical notes, and we cleaned them up first. Then once they were clean, we used them to build the model.”

Terrence Lee-St. John: Yeah, that aligns with the traditional logic of “garbage in, garbage out.” Our notes are dirty, our data is dirty, so cleaning it obviously is the way to go. That’s partially what this paper is fighting against. It’s not to say there’s no benefit in cleaning. Clean data is better than data with errors in it, absolutely. The problem is that cleaning usually entails a large tax. It’s expensive and time-consuming. If you are going to clean, depending on how you do it, it often creates a bottleneck on the dimensionality of the predictor set you can actually use. You can only clean so many variables. Even if I spend six months, I still can only clean a couple hundred variables, depending on how big the dataset is. So inevitably, you end up shrinking the dimensionality of your dataset, especially if you’re using manual cleaning and manual validation of patient or business records. And by shrinking the dimensionality of the predictor space, the paper shows you’re actually potentially harming the end prediction more than if you had just let the predictor space be much larger and allowed some of the errors to persist in the data itself.

Ben Lorica: So your paper mainly focuses on structured data, right?

Terrence Lee-St. John: Yes, the paper is about tabular data, structured data. It could be single tables, it could be relational, as long as it’s tabular and structured. There are extensions to unstructured data, but this paper doesn’t touch that yet.

Ben Lorica: And so at a high level, or at a practical level that practitioners can understand, what are some of the key findings?

Terrence Lee-St. John: The core insight is actually quite simple. In most complex data systems, the data we observe are driven by some underlying latent structure or latent drivers. That includes the outcome variable and the predictor variables. Once you realize that, if you’re working in a context where that’s true, the path to prediction—

Ben Lorica: This is different from the old statistical concepts like principal component analysis or factor analysis?

Terrence Lee-St. John: Conceptually, it’s similar. How you operationalize it is entirely different, because factor models or principal component models are very strict structural models about how you get to the latent factors or principal components. But at a high conceptual level, the idea is the same. You have these underlying drivers, and the data you see are actually shadows of the latent truth underneath. In healthcare, for example, an underlying driver might be metabolic syndrome. I can’t measure that directly. But whatever my value is on that underlying driver, it will affect the things you can measure. So the things I measure are just shadows of metabolic syndrome. Similarly, the outcome is a shadow of these latent underlying drivers as well.

Ben Lorica: And what you folks came up with is a procedure for uncovering these latent variables? In principal component analysis, it seems like there’s a dimension reduction benefit there as well, right? Is there something similar going on here?

Terrence Lee-St. John: Just to clarify, the paper is not directly about the exact procedure you use to uncover the latent factors. The paper is about the flow of information from the observed variables through the latent factors and back to the outcome you’re trying to predict. When you analyze that structure, two core insights come out. Just focusing on the predictor side of the equation, there are two types of noise that generally get conflated. One big takeaway from the paper is that we need to separate and partition those types of noise. One is observational error. Things are just measured incorrectly, or there are errors in the missing data. The observational process introduces errors. That’s the type of noise that we all understand intuitively.

Ben Lorica: The people who do surveys understand it, right?

Terrence Lee-St. John: Yeah, everyone gets that. But there’s another type of noise. The other type is that the variables themselves are just proxies for the underlying latent drivers. BMI is just a proxy for metabolic syndrome; it’s not a direct measure of it. BMI contains some information about the latent underlying driver, but it doesn’t contain all of it. So there’s structural ambiguity in these variables. Even if they’re perfectly measured with zero observational error, what we see are just imperfect proxies for what’s actually driving the system. So there are actually two types of noise: observational error and structural uncertainty.

When you think about it that way, why “garbage in, garbage out” doesn’t always hold is pretty clear. When your goal is to clean the variables, and you’re reducing the predictor space because cleaning is so manually intensive, you might end up with a perfectly clean dataset that’s only 20 or 100 variables wide. However, those 100 variables often do not fully cover the latent ambiguity. There’s still latent ambiguity in them. Even if you know everything perfectly, can you fully recover the latent drivers from that set? If you can’t, then there’s still structural uncertainty. So cleaning, no matter what you do, isn’t going to get you past that latent ambiguity. The only way you can break past that is to expand the set itself.

So instead of 100 variables, you say, “Okay, I’m going to use 1,000 or 2,000.” The idea is that even if these variables are error-prone, they’re providing different angles on the underlying latent drivers, so you can actually triangulate them. And what the math shows in the paper is that this triangulation is asymptotically perfect. You can perfectly recover the latent drivers with an infinitely large predictor set. Obviously, we don’t have infinite data, but as you go to infinity, the asymptotic properties become perfect. So the way to overcome structural uncertainty in a small set is actually to use more data. It’s very intuitive—more data seems like it should be better—but when you think about more data that contains errors, people generally don’t think that’s better. But what the paper shows is that more data, even if it contains errors, may actually benefit you.
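
The triangulation argument above can be illustrated with a small simulation. This is a minimal sketch under assumed settings, not code from the paper: the function name `latent_recovery`, the loading of 0.5, the noise levels, and the use of a plain average as the "triangulation" step are all illustrative choices.

```python
import numpy as np

def latent_recovery(n_proxies, obs_noise_sd, n_rows=5000, seed=0):
    """Correlation between a latent driver and the mean of its noisy proxies."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_rows)  # latent driver (never observed directly)
    # Structural uncertainty: each proxy reflects z only partially (assumed loading 0.5).
    proxies = 0.5 * z[:, None] + rng.standard_normal((n_rows, n_proxies))
    # Observational error: measurement noise layered on top of every proxy.
    observed = proxies + obs_noise_sd * rng.standard_normal((n_rows, n_proxies))
    estimate = observed.mean(axis=1)  # simplest possible triangulation: average the views
    return float(np.corrcoef(z, estimate)[0, 1])

# A narrow, perfectly measured set versus a wide, error-prone set.
clean_narrow = latent_recovery(n_proxies=10, obs_noise_sd=0.0)
dirty_wide = latent_recovery(n_proxies=1000, obs_noise_sd=1.0)
print(f"10 clean proxies:   r = {clean_narrow:.3f}")
print(f"1000 dirty proxies: r = {dirty_wide:.3f}")
```

Under these assumptions the wide, dirty set should track the latent driver more closely than the narrow, clean one, because cleaning removes only the observational error while extra proxies average away both kinds of noise.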

Ben Lorica: So I get that you don’t have a precise procedure, right? But what you described got me thinking. If I have a model that relies on 10 predictors, and what you’re claiming is that I shouldn’t focus on cleaning those 10 predictors, but instead bring in more predictors—maybe 20 more, 40 more—number one, there’s the notion of confounding variables. I’m adding things that are highly correlated, so maybe they don’t really bring anything to the table. And secondly, if you don’t have an exact method, do you at least have some sort of classifier that will help me prioritize? Here are a thousand more variables, but I want to add only 20. Is there a way to prioritize those thousand? So there are two questions: one is this notion of confounding variables—adding things that are already highly correlated—and two, some sort of classifier to help me prioritize these new variables.

Terrence Lee-St. John: Okay, for the first question about collinearity: you’re right. Traditionally, people think of collinearity as waste, as redundancy, right?

Ben Lorica: I mean, people write PhD theses in economics making sure their models are as parsimonious as possible, in the spirit of Occam’s razor, right?

Terrence Lee-St. John: Exactly. There are state-of-the-art feature selection algorithms—like minimum redundancy maximum relevance, or mRMR—that basically operationalize that mindset: redundancy is wasteful. What the math in the paper shows is that if you’re working in this latent hierarchy, the redundancy is actually helpful for recovering the latent states efficiently. If you think about observational error, if you only have one variable per latent driver, and that variable is just a proxy, it’s not a perfect measure. If you add more redundant variables, they are new views of that same latent driver. So the redundancy—yes, they’re correlated with each other—but they help you get at the missing part of the picture that your single variable had. If the single variable only captures 20% of the latent driver, you can throw in redundant variables. I’m not saying throw in perfectly collinear variables, because there’s no information gain in those. But if they’re distinct variables that are just correlated because they’re influenced by the same underlying driver, they’re effectively different angles on the same driver. You can piece together the puzzle. The redundancy actually helps the efficiency.
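
The "20% of the latent driver" example can be worked through in a toy setting. The equations below are a sketch under simplifying assumptions that are not taken from the paper: a single latent driver with unit variance, equal loadings, independent noise across proxies, and "captures 20%" read as a squared correlation of 0.2.

```latex
% Assumed toy model: p proxies of one latent driver z, with Var(z) = 1
x_j = \lambda z + \varepsilon_j, \qquad
\varepsilon_j \sim (0, \sigma^2) \text{ independent}, \quad j = 1, \dots, p

% Averaging the proxies shrinks the noise variance by a factor of p
\bar{x} = \lambda z + \bar{\varepsilon}, \qquad
\operatorname{Var}(\bar{\varepsilon}) = \sigma^2 / p

% Share of the latent driver recovered by the average
\rho^2(\bar{x}, z) = \frac{\lambda^2}{\lambda^2 + \sigma^2 / p}
                   = \frac{p \lambda^2}{p \lambda^2 + \sigma^2}
                   \;\longrightarrow\; 1 \quad \text{as } p \to \infty

% One proxy capturing 20% means \sigma^2 = 4 \lambda^2, so
\rho^2(\bar{x}, z) = \frac{p}{p + 4}
```

With that reading, 4 correlated proxies recover 50% of the driver, 16 recover 80%, and 36 recover 90%. If the noise terms were identical across proxies (perfect collinearity), the averaging step would remove nothing, which matches the caveat about perfectly collinear variables.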

Ben Lorica: And then there’s the whole “correlation, not causation” issue, right? For example, the classic joke in the hedge fund industry, where I came from, is that you can build a model that is great at predicting stocks, but then it turns out the main driver is butter production in Bangladesh.

Terrence Lee-St. John: Yeah, for sure. So the correlation versus causation issue—

Ben Lorica: In other words, you think you found this perfect driver or predictor, but then it’s really not.

Terrence Lee-St. John: You’re right. If your goal is to understand the causal reality of the system you’re looking at, that’s really not what this paper is about. This paper is about maximizing prediction from the information that you have. And so whether something is causally driving—

Ben Lorica: It’s immaterial, yeah.

Terrence Lee-St. John: Yeah, it’s immaterial for this particular task. Now, for scientific theory, there’s a whole value in that, and I’m not—

Ben Lorica: And this is not about that. This is about people in industry who want to build better models, right?

Terrence Lee-St. John: Yeah, this is just about, when you’re purely trying to maximize prediction, what information is valuable? Redundant information is valuable, particularly if the proxies are not perfect. But also, if they’re not perfectly measured, having redundancy helps you triangulate past the observational error. And this isn’t a radical idea. I know it sounds very contrarian, but it’s actually grounded in the tradition of psychometrics. That field has been around for over 100 years. Psychometricians trying to measure unmeasurable human traits like intelligence or anxiety have known for decades that, in order to get at those traits, you need a wide breadth of measurements that are all loosely measuring the same thing. Then you can estimate that underlying latent trait. To them, collinearity hasn’t been this problematic thing that you’re always trying to scrub out of your data. It’s actually been the signal or the footprint of the underlying latent structure. This just takes that logic and puts it in an information-theoretic framework so that it’s more general and can apply to some of these big data ML problems.

Ben Lorica: All right, I had that other question about having some sort of classifier. But before you answer that, there’s another detour I want to take. I don’t know if you’re familiar with this field called compressed sensing.

Terrence Lee-St. John: I’m not familiar with that, actually.

Ben Lorica: It’s an interesting field. Basically, it allows you to reconstruct high-resolution signals or images from fewer measurements. It’s how you can do compression algorithms on images, for example. It’s almost like the reverse of what you’re trying to do. Anyway, the classifier for helping me prioritize—or maybe what you’re saying is, why prioritize? Just use everything.

Terrence Lee-St. John: That is the asymptotic takeaway, sure. But practically speaking, that’s not always possible. You have to remember that ultimately these things are going to be put into a model, and the model has finite capacity.

Ben Lorica: Terrence, my interpretation of what you’re saying is: don’t worry too much about the cleanliness of your data. Rather, focus on things that will improve the coverage of your model with respect to these latent signals. If that’s the case, then there’s still the notion of some extra predictors being more helpful than others, right?

Terrence Lee-St. John: Absolutely. Looking at it through this lens, the quality of data is no longer about how error-free your data are. It’s more about whether the portfolio of data you’ve accumulated covers the latent drivers comprehensively and redundantly. It’s a portfolio-level mindset. If you start with a certain set of features and you’re looking at adding more, there are going to be some features that add more information than others. Again, it all goes back to recovering that latent hierarchy.

Ben Lorica: And so, do you have a classifier to help me do that filtering?

Terrence Lee-St. John: To help you figure out which ones to choose? The paper lays out a few strategies. At Enly, we are working on operationalizing a classifier that uses the total correlation metric, which is the most rigorous version of this. But it’s also the most computationally expensive.

If you don’t want to use a total correlation metric, there are two cases to consider: is the outcome known at this point, or are we just trying to build a general-use dataset? If the outcome is known, and it’s a trusted, clean label, then you can use the most efficient method, because what you’re really trying to do is uncover the parts of the latent hierarchy that are important for prediction. Those are the parts that matter.
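
For readers unfamiliar with the total correlation metric mentioned above, here is a minimal sketch of one tractable special case. It assumes the variables are jointly Gaussian, where total correlation reduces to a function of the correlation matrix; it is not Enly's classifier, and the function name `gaussian_total_correlation` and the synthetic data are illustrative.

```python
import numpy as np

def gaussian_total_correlation(X):
    """Total correlation (in nats) of the columns of X under a Gaussian assumption.

    For a multivariate Gaussian, TC = -0.5 * log(det(R)), where R is the
    correlation matrix. It is 0 when the columns are uncorrelated and grows
    as the columns share more structure.
    """
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)  # log-determinant, numerically stable
    return float(-0.5 * logdet)

rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal(n)  # one shared latent driver

# Five noisy views of the same driver versus five unrelated columns.
shared = np.column_stack([z + rng.standard_normal(n) for _ in range(5)])
independent = rng.standard_normal((n, 5))

tc_shared = gaussian_total_correlation(shared)
tc_independent = gaussian_total_correlation(independent)
print(f"shared driver: {tc_shared:.3f} nats, independent: {tc_independent:.3f} nats")
```

Without the Gaussian shortcut, total correlation requires estimating a joint entropy over all variables, which is one plausible reason it is described as the most computationally expensive option.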

Ben Lorica: Let’s make this concrete. I’m a data scientist or an analyst in charge of improving our models for customer churn. I go to the data warehouse. There are a hundred clean variables there. I build a model. After I read your paper, I’m going, “Well, maybe I can add more predictors, and according to Terrence and team, I don’t have to go to IT and clean this data in advance.” So I can just start bringing in new data and see if it will improve the model. But obviously, I don’t have time to add a thousand new datasets or predictors to improve my churn model. So what are my options?

Terrence Lee-St. John: If you already have a model in place, that implies you know the outcome and it’s reasonably clean. The most straightforward way to do it is to use the prediction from that model and look at the residuals. The residuals represent the uncertainty that’s still left in the system. What you can do is correlate the potential variables—whether using an actual correlation or an information metric—with the residuals. The variable that has the highest correlation with the residuals is effectively the variable that provides the most information gain to the current predictor set you’re using. So if you had 100 predictors and I look at the residuals and say, “Oh, this predictor over here has the highest information gain,” you throw that in. Now you have a new model that has 101 predictors. You can repeat the process. It’s almost like boosting conceptually, except on the predictor space. You’re not increasing the model capacity; you’re increasing the predictor space through this iterative boosting method. The predictors that end up getting pulled in are the ones that add the most information to that particular prediction. Now, that only works if you know the endpoint and the endpoint is clean.
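
The residual-based loop described above can be sketched as follows. This is an illustrative version only: it assumes an ordinary least-squares model and plain Pearson correlation as the scoring metric, and the names `fit_residuals` and `residual_guided_selection` and the synthetic churn-like data are hypothetical.

```python
import numpy as np

def fit_residuals(X, y):
    """Residuals of an ordinary least-squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def residual_guided_selection(X_base, candidates, y, n_add):
    """Greedily add the candidate column most correlated with the current residuals."""
    X = X_base.copy()
    remaining = list(range(candidates.shape[1]))
    chosen = []
    for _ in range(n_add):
        resid = fit_residuals(X, y)  # what the current predictor set still misses
        scores = [abs(np.corrcoef(candidates[:, j], resid)[0, 1]) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
        X = np.column_stack([X, candidates[:, best]])  # grow the predictor space, then refit
    return chosen

rng = np.random.default_rng(0)
n = 2000
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)  # two latent drivers
y = z1 + z2 + 0.1 * rng.standard_normal(n)               # clean, known outcome

# The existing model only has (noisy) views of the first driver.
X_base = np.column_stack([z1 + rng.standard_normal(n) for _ in range(5)])
candidates = np.column_stack([
    rng.standard_normal(n),       # column 0: pure noise
    z2 + rng.standard_normal(n),  # column 1: noisy proxy of the uncovered driver
    rng.standard_normal(n),       # column 2: pure noise
])

picked = residual_guided_selection(X_base, candidates, y, n_add=2)
print("candidate columns in order of selection:", picked)
```

In this setup the first column pulled in should be the noisy proxy of the second driver, since it is the only candidate carrying information the current predictors lack. An information-theoretic score could be swapped in for the correlation without changing the loop.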

If your endpoint is dirty, the paper goes into a more general idea where you pull in predictors based on something like factor loadings. I don’t want to call them that, because it’s not strictly speaking a factor score, but if you think of it in terms of factor analysis, you pull in predictors that load highly on the factors in your dataset. You’re reinforcing the latent signal by pulling in variables that lift the signal higher. So your signal-to-noise ratio will ultimately look better. Those are the two general strategies. Again, there are lots of ways to operationalize how you calculate the correlation, whether it’s a true correlation or a total correlation metric from information theory. But conceptually, that’s what we’re doing.
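
The outcome-free strategy can be sketched in a similar way. Note the substitution: this example uses principal components as a stand-in for the "factors," which the speaker explicitly says is only an analogy, so it should be read as a rough illustration rather than the paper's procedure. The function name `loading_scores` and the synthetic data are hypothetical.

```python
import numpy as np

def loading_scores(X_current, candidates, n_factors=1):
    """Score each candidate by its strongest absolute correlation with the
    leading principal components of the current predictor set."""
    Z = (X_current - X_current.mean(axis=0)) / X_current.std(axis=0)
    # Principal component scores from the SVD of the standardized data.
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    factors = U[:, :n_factors] * S[:n_factors]
    scores = []
    for j in range(candidates.shape[1]):
        corrs = [abs(np.corrcoef(candidates[:, j], factors[:, k])[0, 1])
                 for k in range(n_factors)]
        scores.append(max(corrs))
    return np.array(scores)

rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal(n)  # latent driver behind the current predictors

X_current = np.column_stack([z + rng.standard_normal(n) for _ in range(6)])
candidates = np.column_stack([
    z + rng.standard_normal(n),  # another noisy view of the same driver
    rng.standard_normal(n),      # unrelated noise
])

scores = loading_scores(X_current, candidates, n_factors=1)
print("candidate scores:", np.round(scores, 3))
```

No outcome label is used anywhere: the candidate that reinforces the existing latent signal scores high, and the unrelated column scores near zero.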

Ben Lorica: Some listeners may hear this conversation and say, “Wow, this is really old school.” Because now we’re in the age of AI. We have foundation models even for tabular data. We have startups like Jure Leskovec’s Kumo.ai and the folks at Relational.ai, which will go into your data warehouse or data lakehouse and use all of the context available there to improve their foundation models for tabular data. And then, boom, you can just start building models without worrying about all of these things. What do you say to that? People are actually deriving benefit from this, so maybe your approach can complement theirs. As I understand it, they start from using Databricks or Snowflake. You have a lakehouse or a warehouse. Based on all sorts of metadata from your warehouse, including query patterns, they build as much context as they can—a context graph, a knowledge graph—to help improve their foundation model, which was built specifically for structured data. I don’t know if you’ve been following this line of work, but how do you fit into that world?

Terrence Lee-St. John: Professor Leskovec’s work is amazing. I have so much respect for it.

Ben Lorica: By the way, for our listeners, I’ll link to an episode where I talked to him late last year.

Terrence Lee-St. John: Yeah, that’s a great episode. His core premise—that you need to leverage the raw tabular data in all its glory; you don’t feature engineer, you don’t reduce it—I think that’s absolutely correct. And that actually aligns with the theory in the From Garbage to Gold paper. When they’re ingesting a massive amount of uncleaned relational databases, they’re basically leveraging what the paper calls “dirty breadth.” It’s error-prone, but it’s wide-breadth data. The network is using high-dimensional structural redundancy to create a representation or to triangulate the latent signal underneath. I actually think the theory paper provides a principled explanation for why that general strategy is potentially successful. And honestly, how he operationalizes the graph representation is truly genius. It gets you past this ETL piece, and so much effort in data science and machine learning is in that first ETL piece. It’s amazing work. I would say we diverge, however, on the deployment of that concept and technology.

His premise relies on a centralized foundation model. For me, that’s hard, and here’s why. My formal background is in inferential statistics. So when I’m looking at architectures and strategies around data, the data-generating process is always at the forefront of my mind. It’s a different way to look at it from a computer scientist’s vantage point. In inference, the most fundamental rule is that your sample data needs to be representative of the specific context you’re making that inference to. If it’s not representative, your inference won’t be great. If you believe a foundation model can apply to all tabular data, you have to fundamentally believe that there’s a universal underlying latent topology to all tabular data systems, regardless of domain, context, or time.

Ben Lorica: It depends on the stakes too, right? For example, if you’re going to apply this to a recommender system, that’s very different from a diagnostic system. If the foundation model was trained on synthetic data for recommenders, and then I expose it to my data warehouse, where it can use my data to fine-tune or post-train the foundation model, I’d be okay with that because it’s a recommender system. It’s not going to kill anyone.

Terrence Lee-St. John: Right. If it’s a lower-stakes project, then maybe you can handle some mismatch.

Ben Lorica: Or the alternative might be nothing at all. I might be working in a company where the data scientists are so slammed that I need a new model for marketing, but data science has no bandwidth. I use this foundation model approach supplemented with a context graph, and it produces something decent enough that I can use it.

Terrence Lee-St. John: Absolutely. There are all sorts of practical, pragmatic reasons why an out-of-the-box, one-shot foundation model, if it works well enough for the intended use case, is great.

Ben Lorica: The one area where I’ve actually had more hands-on experience is in foundation models for time series. I came from the hedge fund world, so for a specific model, maybe I won’t use it, but for learning and exploring broadly, it’s great. Kind of like how mathematicians use these foundation models for ideation and exploration. Once I have some candidates for where to really look, then maybe I’ll start building some things myself.

Terrence Lee-St. John: So I think it depends. If we’re talking about a foundation model for a very specific domain, that’s one thing. If we’re talking about a universal foundation model for all tabular data—

Ben Lorica: By the way, what their claim is, is that you start with a foundation model, and then there’s some sort of post-training and additional context that you provide the model. They’re not saying, “Here’s a foundation model for tabular data; you don’t have to do anything, it’ll just work with your data.” That’s not what they’re saying.

Terrence Lee-St. John: I’d say the issue with tabular data is that it’s less universal. Foundation models in language and vision have universal underlying principles governing the data. English has grammar, spelling, and universal rules that bound the possibility space. Vision has universal physics. Tabular data, in general, is not that. It’s just a modality for holding information. And because it’s just a modality, the universe of possibilities it can take is infinite.

Ben Lorica: That’s why you do the post-training and the knowledge graph and the context, right?

Terrence Lee-St. John: Yeah, so if you have a foundation model and whatever patterns it has encoded are somewhat represented in your local data, that’s great. And the argument is that to get that extra oomph, you can fine-tune it.

Ben Lorica: And like I said, it’s not like they’re deploying this for medical domains at this point. It’s recommenders, marketing, those kinds of things.

Terrence Lee-St. John: Fine-tuning is a step in the right direction. It allows your local context to come to the surface. However, those foundation models have massive pre-training. In order to really overwrite weights in a fine-tuning exercise, the amount of data required might be beyond the scope of a single enterprise. If that’s the case, your predictions are going to be biased toward the foundation weights, even if they’re pulled in the direction of your local data. Now, if the stakes are low, that may be good enough. And if your data science team is slammed, that’s a practical, real-world constraint.
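The pull toward the pre-trained weights described above can be illustrated with a toy analogy. The sketch below is not how fine-tuning actually updates a network; it assumes a simple precision-weighted (shrinkage) average in which the pre-trained model counts as a fixed number of "pseudo-observations". The function name and the parameter `pretrained_strength` are hypothetical and chosen only for illustration.

```python
def blended_estimate(pretrained_value, pretrained_strength, local_values):
    """Toy shrinkage estimate: a weighted average of a pre-trained value and
    the local sample mean, where the pre-trained value counts as
    `pretrained_strength` pseudo-observations (an assumption of this sketch)."""
    n = len(local_values)
    local_mean = sum(local_values) / n
    weight_local = n / (n + pretrained_strength)
    return weight_local * local_mean + (1.0 - weight_local) * pretrained_value


# Hypothetical numbers: the pre-trained model "believes" 0.0,
# while every local observation says 1.0.
PRETRAINED_VALUE = 0.0
PRETRAINED_STRENGTH = 10_000  # stands in for massive pre-training

small_local = blended_estimate(PRETRAINED_VALUE, PRETRAINED_STRENGTH, [1.0] * 100)
large_local = blended_estimate(PRETRAINED_VALUE, PRETRAINED_STRENGTH, [1.0] * 100_000)

print(f"100 local rows     -> estimate {small_local:.3f}")
print(f"100,000 local rows -> estimate {large_local:.3f}")
```

Under these assumed numbers, a small local sample barely moves the estimate away from the pre-trained value, while a much larger one lets the local data dominate. That is the trade-off being described: whether the remaining bias is acceptable depends on the stakes.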

Ben Lorica: Both of those are true in the real world. There are a lot of models where the stakes are lower, and a lot of companies where the people building the models don’t have the bandwidth.

Terrence Lee-St. John: Exactly. I would say if your goal is really high stakes, really mission-critical, especially if your context is very localized—

Ben Lorica: Like I said, similar to how mathematicians use them for ideation and exploration. You have 20 ideas, and you can use it to filter down to the two that you will really work on.

Terrence Lee-St. John: Absolutely. I’m in agreement. There’s immense value in these systems. It just depends on how critical the use case is. For me, if I’m a hospital and it’s a critical use case, maybe I want to focus more on letting my local data speak for itself rather than tweaking a foundation model. That’s kind of what the paper implies. It becomes more possible to do this when you don’t have to clean everything. The data reduction strategies outlined in the paper—we call them proactive data-centric AI strategies—are all designed to reduce the predictor space efficiently, meaning what you have left is still capable of high-fidelity recovery of those latent signals.

Terrence Lee-St. John: In a concrete example, when I was working at Cleveland Clinic Abu Dhabi, I started with 588,000 patients, millions of time points, and over 32,000 potential predictors. If I tried to throw all of that into a model, I literally couldn’t do it. It wouldn’t fit into memory. I was able to compress the predictor space and get it to a point where, instead of 32,000 columns, it was anywhere from 900 to 4,000 columns, depending on the time point. You’re still talking high dimensionality in a traditional sense, but it’s lower dimensionality than the original set.

Ben Lorica: So you went from 30,000 variables to a few thousand variables. Now that you have the paper, you have a theoretical reason for being comfortable with that. But is it something you can explain to the end user? Do you lose trust along the way because it becomes a black box?

Terrence Lee-St. John: If you think about how we traditionally explain models, people think about the columns themselves—the input variables. They say, “This model was looking at features A, B, and C, so those are highly relevant.”

Ben Lorica: Or it could be a composition of those columns and say, “We came up with a factor to represent BMI.”

Terrence Lee-St. John: Right, exactly. What’s implied by my research is that if you’re using thousands of variables and they have errors in them, the paper shows you can uncover the latent states reliably. Explanation then should be made at that latent layer. It shouldn’t necessarily be made at the observed variable layer. Instead of talking about BMI and its effect on stroke, or clicks and their effect on churn, you’re talking about what the latent drivers are, and you can semantically describe them based on the columns that load on them. The explanation ends up being a higher-level factor or latent-state explanation.
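The idea that many error-prone columns can jointly recover a latent driver, and that the driver can be described by the columns that load on it, can be sketched with simulated data. This is a minimal illustration and not the method from the paper: it assumes a single latent driver, independent Gaussian noise, and uses the first principal component as a stand-in for latent-state recovery. All sizes, noise levels, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 2000, 200

# Assumed setup: one hidden driver and many noisy columns that each reflect it.
latent = rng.normal(size=n_rows)
loadings = rng.uniform(0.3, 0.8, size=n_cols)
noise = rng.normal(scale=1.5, size=(n_rows, n_cols))
X = latent[:, None] * loadings[None, :] + noise

# Standardize the columns, then take the first principal component via SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
latent_score = U[:, 0] * S[0]


def abs_corr(a, b):
    """Absolute Pearson correlation (the sign of a component is arbitrary)."""
    return abs(np.corrcoef(a, b)[0, 1])


# How well does the best single noisy column track the hidden driver,
# compared with the score built from all columns together?
best_single = max(abs_corr(X[:, j], latent) for j in range(n_cols))
recovered = abs_corr(latent_score, latent)

# Columns with the largest weights on the component are the ones you would
# read to give the latent driver a semantic label.
top_columns = np.argsort(-np.abs(Vt[0]))[:5]

print(f"best single column vs latent: {best_single:.2f}")
print(f"combined score vs latent:     {recovered:.2f}")
print(f"highest-loading columns:      {top_columns.tolist()}")
```

In this simulated setting, no individual column is a reliable proxy for the driver, but the combined score tracks it closely, and the highest-loading columns are where a label for the driver would come from.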

Ben Lorica: Which is very similar to what traditionally is done in stats anyway. People are able to come up with a label for the driver somehow.

Ben Lorica: We have to wrap up, but you have this company, enly.com.au, which will try to operationalize this. I suggest that you try to fit into this modern world of AI where it can’t be too statistical, and I shouldn’t have to write code to use any of these tools.

Terrence Lee-St. John: We understand that ease of use is key to anything. The product we’re working on: you point it at a tabular dataset, and it goes through the process of feature selection—

Ben Lorica: You should assume that if you’re going to sell to enterprise right now, I’m using Snowflake, Databricks, BigQuery, one of these things.

Terrence Lee-St. John: Yeah, of course. This technology we’re building is data warehouse native. It will sit where the data lives and have direct access to it. The idea is that you can churn out these predictive models from your data in a semi-automated way. I don’t want to call it AutoML, because AutoML was really brute force. It’s a principled version of AutoML.

Ben Lorica: But implemented for this age of AI, where the interfaces are the familiar ones that people use, right?

Terrence Lee-St. John: Yeah, that is the goal. It will leverage all the data that’s available and potentially useful for prediction. You don’t need domain experts and content experts to tell the model what to do. It leverages the data. The goal is similar to Professor Leskovec’s work; the difference is in how it is operationalized: the model is built locally rather than transferred in from a foundation model. You end up with a model that represents your local space and your local time, so it doesn’t get stale. It updates automatically, and that’s why the system remains stable under change. It could even get better over time as more data comes in. It’s a method-transfer paradigm. With this generalizable method, you can use your own data warehouses, even without cleaning them fully, and still get really good predictions. And explainability is at the latent level. We’re working on operationalizing that with LLMs, so you can actually communicate with the data at the latent level. The LLM doesn’t touch the raw data, which has errors in it.

Ben Lorica: Oh, yes. Just to wrap up—sorry for the abrupt—

Terrence Lee-St. John: No problem. The LLM doesn’t touch the raw data with errors in it. It touches the verified latent layer. So the problem of throwing an LLM at error-prone tabular data goes away. It is very much a traditional framework, but operationalized in a modern AI sense.

Ben Lorica: And with that, thank you, Terrence.

Terrence Lee-St. John: Thanks, Ben.
