Richard Garris and Barry Dauber on context, cost, multimodality, agents, and model choice in enterprise AI.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • YouTube • AntennaPod • Podcast Addict • Amazon • RSS.
In this episode, Ben Lorica talks with Richard Garris and Barry Dauber from Databricks about what enterprises are actually struggling with as they move AI from demo to deployment. The conversation explores the difficult shift from deterministic software to probabilistic models, highlighting the critical need for robust evaluation, proper context management, and cross-functional collaboration. They also dive into practical use cases for AI agents beyond coding, the realities of multimodal data, and how organizations are navigating data governance, shadow IT, and model agnosticism.
Interview highlights – key sections from the video version:
- Introductions and the gap between Silicon Valley assumptions and enterprise reality
- Why enterprises still struggle to get reliable quality from LLMs and agents
- Execution challenges: ownership, evaluation, and where teams get stuck
- RAG is not enough: the need for harnesses, testing, and better context
- Governance, organizational ownership, and the need for data–software collaboration
- Better context vs. fine-tuning: where each approach fits
- Token budgets, prompt design, and why iterative prompting gets expensive
- Context windows, context rot, and how too much information hurts performance
- From unstructured documents to structured data: parsing, preprocessing, and explainability
- GraphRAG, knowledge graphs, SQL, and choosing the right data structure for the task
- Agents beyond coding: enterprise use cases and how teams evaluate them
- Multimodality in practice: video, images, manufacturing, and data infrastructure
- Model agnosticism, prompt portability, MCP, and shadow IT concerns
- Governance, regulation, data sovereignty, and involving legal teams earlier
- Chinese open-weight models, model choice, and where reinforcement learning fits
Related content:
- A video version of this conversation is available on our YouTube channel.
- Why Your AI Agents Need Operational Memory, Not Just Conversational Memory
- What Is An AI Delegate?
- Your AI model isn’t the problem. Its environment is.
- Kay Zhu → Your First AI Employee Is Already Clocking In
- Mikio Braun → Coding Agents Meet Data Science
- Matthew Glickman → The Junior Data Engineer is Now an AI Agent
- Ben Lorica and Evangelos Simoudis → Why Your AI Committee Might Be Your Biggest AI Problem
Support our work by subscribing to our newsletter📩
Transcript
Below is a polished and edited transcript.
Ben Lorica: All right, so today we have Richard Garris and Barry Dauber, both from Databricks. And as you folks know, full disclosure: I’m still an advisor to Databricks. So, guys, welcome to the podcast.
Barry Dauber: Thanks for having us, Ben.
Ben Lorica: Today, we’re not really going to talk about Databricks in particular. I have them on because they talk to a lot of enterprises. Since we tend to, here in Silicon Valley, talk among ourselves and not to real companies, I thought it would be great to have people who are actually on the front lines interacting with regular companies—some mix of tech and non-tech companies.
So with that, first things first: what’s the one thing you see that we take for granted out here that people are still struggling with in the real world?
Richard Garris: Yeah, so the thing I’m seeing is that customers are still struggling with getting the quality they need from their agentic systems or from their LLMs in general. LLMs are non-deterministic models, and since we came from a world where computers are deterministic—1 plus 1 always equals 2—we’re now in a world that’s non-deterministic. There’s still some learning happening in the enterprise around the idea that this is a new paradigm, a new kind of system, and we need to build things around it to understand that and actually accommodate it.
I was literally just on a call right before this one, Ben, with one of our large media partners. They’re processing invoices, and the expectation was 100% accuracy between extracting the data from the fields and putting it into their SAP ERP system. That’s not realistic. We got them down to 70%, because it’s non-deterministic, and said, “Hey, you’ve got to build the testing and the harness around it to get to that point.” They’re just now learning that in order to get to that level of quality, they need to build the right testing, the right infrastructure, and the right approach to providing context to these models so they produce the right answers.
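The “testing and harness” Richard describes can be as simple as scoring extracted fields against human-verified values. The sketch below is illustrative only (not Databricks code, and the field names are made up); it shows the shape of a field-level accuracy harness for an extraction pipeline:

```python
# Minimal sketch of an extraction-quality harness: score each document's
# extracted fields against human-verified gold values, and flag documents
# that fall below an acceptance threshold instead of assuming accuracy.

def field_accuracy(extracted: dict, gold: dict) -> float:
    """Fraction of expected fields the extractor got exactly right."""
    if not gold:
        return 0.0
    correct = sum(1 for k, v in gold.items() if extracted.get(k) == v)
    return correct / len(gold)

def evaluate(extractor, test_cases, threshold=0.95):
    """Run the extractor over (document, gold_fields) pairs.

    Returns the mean field accuracy and the list of documents that
    scored below the threshold, for human review.
    """
    scores, failures = [], []
    for doc, gold in test_cases:
        score = field_accuracy(extractor(doc), gold)
        scores.append(score)
        if score < threshold:
            failures.append(doc)
    return sum(scores) / len(scores), failures
```

Because the model is non-deterministic, a harness like this is typically rerun on every prompt or model change, turning “is it accurate enough?” into a measured number rather than a hope.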
Ben Lorica: So, Barry, when you talk to people about things like what Richard described, is your sense that they’re already trying things and hitting a wall, or are they completely lost about what to do?
Barry Dauber: To use the bad answer—the consulting answer—it definitely depends. But obviously Richard and I do a lot of similar meetings, and I think he’s spot on. It depends on the organization and where they’re starting.
Everybody is doing something, right? Their board and their C-suite are telling them, “Hey, you’ve got to get on this AI train, and we need to be doing more.” I think what people are forgetting—and what they’re now starting to really understand—is that when ChatGPT came out, it was one of the greatest things ever, right? Everybody started deploying it, and a lot of access came via Azure OpenAI, and you saw a lot of non-Azure customers stand that up pretty quickly.
What people realized pretty quickly was that LLMs, as Richard just said, are very probabilistic. They can’t answer everything, especially when it comes to your proprietary information. People also forgot about the software development lifecycle work they’ve been doing for decades. It comes down to evaluation: how well does that approach, that system, that agent work for whatever it is you’re trying to do? There are different approaches for different use cases, and you’re not going to have one solution that solves everything in your enterprise. You have to figure out where to get started and then grow from there.
Ben Lorica: But at a high level, Richard, ChatGPT came out in late 2022. One of the first “hello world” examples was retrieval-augmented generation. So at a high level, I suspect that most people understand that taking their existing data and supplying it to the LLM as additional context is generally useful, right? Is this not the case? What’s the problem?
Richard Garris: I think a lot of people got used to the idea of a conversational agent like ChatGPT, because it was the first experience they had with these models. It’s kind of hard to believe, but vibe coding with tools like Claude Code and Cursor has only been around for maybe two years. That’s actually a completely different harness around those models—and a much better harness.
When people start building their first app with just the API layer, they forget they have to build the harness around it. That includes the things Barry mentioned: testing, validating that the output is correct, and, as I’ll talk about more, the context you provide to the model so it produces the right output. I have an example of that I can go into more detail on if we have time.
Ben Lorica: Yeah, it seems like people are aware of the basics, right? So: I need extra context, maybe I need some rules to govern the input and output—guardrails, whatever you want to call them—and evals. Broadly speaking, Barry, they’re aware of these things, right? So it’s all just execution now, no?
Barry Dauber: It’s execution, but we struggle with this too when talking to a lot of organizations. If you walk into an organization and ask, “Who owns your data and/or your AI strategy?”—depending on how big that organization is, A, people might not know; B, those might not be the same people; and C, in very decentralized organizations, that might be multiple people.
So as you think about the governance and the approach to deploying these things, getting decisions made around who’s actually going to decide how this is handled is huge. IT obviously wants to own it, but a lot of business users are now driving it. At the same time, the backend infrastructure—access to something like Databricks within an organization or access to the hyperscalers—is typically owned by IT. So how do you partner with them more closely to enable the business to do what it’s trying to do?
Richard Garris: Yeah, if I could add quickly, Ben: I’m a data person at heart. I’ve been doing this at Databricks for 11 years, 20 years overall. So while I’m a data and AI person, I’m really a data person. A lot of the people adopting this technology are more like software engineers who understand how to use APIs but have never actually been data scientists or data practitioners.
Getting organizations to partner better between their core data team and their software development team that’s building these agentic systems is one of the keys to unlocking these problems. When you have the right data, the models produce the right results. That seems very obvious, but a lot of organizations struggle with the organizational politics required to make that happen.
Ben Lorica: Yeah, so pre-LLM, at least a big percentage of organizations had started to establish data science teams. They had roles that were roughly what we would consider data engineers, right? So there was awareness about pipelines and rudimentary predictive models or machine learning models. I guess the difference now is that it’s much more democratized—regular developers can build AI apps.
What Barry was alluding to, I think, is that now it creates confusion. If all these engineers are building AI apps but no one actually understands basic eval, then how do we know what to do? Is one of the strategies to establish—I hate to say this, and I hate this concept, by the way—a center of excellence or an AI committee? Is that one of the things these companies are doing?
Richard Garris: It’s definitely a mechanism. I don’t think it’s the best mechanism, to your point, Ben. I think the easier thing to do is actually to have a data team representative inside these projects along with the software developers. A lot of software developers may not have a data person involved.
I think it’s two parts, though. Eval is one, but a lot of software engineers understand eval because they understand test-driven development, and we’ve had that for many years. What they don’t understand is that the question they’re trying to answer requires data from whatever platform they use—Databricks being one, or a competitor that may have their data. That data is essential context for those models. If you don’t have that context, you’re not going to get the right result. Also, those software developers may not even know what data is available in their lakes.
Ben Lorica: So, Barry, do people still struggle with, “Should I work on supplying better context, or should I invest in fine-tuning?”
Barry Dauber: I’m going to say “it depends” for everything. I know it’s a terrible answer. Richard and I both came from consulting, so that was beaten into me 20-plus years ago.
Ben Lorica: But at a high level, when do you advise someone, “Hey, your problem can be solved by fine-tuning,” as opposed to—
Richard Garris: Here’s the answer I’d give, Ben. I think of fine-tuning like memorization. If you have a model that needs to do one thing really well, fine-tuning it is basically memorizing that task and making it do that one thing really well. But that’s all it can do.
What we’ve found, though, is that you can get a long way with what used to be called prompt engineering, which was often done by hand. Today, what Jonathan recommends is: don’t do prompt engineering by hand. Instead, start with the prompts and the output—or the expected task results—and then reverse engineer to get to the best prompt to produce that result every single time. That way you’re not leaving prompt engineering to chance. You’re starting with the end in mind and then working backwards.
We have two algorithms to help with that. One is called MIPROv2—open source, by the way. It uses DSPy to actually make that estimation.
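The “start with the end in mind” idea can be made concrete with a toy search loop. This is not the MIPROv2 algorithm itself (which uses an LLM to propose instructions and few-shot demonstrations inside DSPy); it is a deliberately simplified stand-in showing the core inversion Richard describes, where expected outputs drive prompt selection instead of hand-tuning:

```python
# Toy illustration of output-driven prompt optimization: instead of
# hand-editing a prompt, score candidate prompts against a set of
# (input, expected_output) examples and keep the best performer.
# MIPROv2/DSPy automate the candidate-generation step this toy omits.

def score_prompt(run_model, prompt, examples):
    """Fraction of examples whose expected output the prompt reproduces."""
    hits = sum(1 for x, want in examples if run_model(prompt, x) == want)
    return hits / len(examples)

def best_prompt(run_model, candidates, examples):
    """Pick the candidate prompt with the highest score on the examples."""
    return max(candidates, key=lambda p: score_prompt(run_model, p, examples))
```

In a real setup, `run_model` would call an LLM and `score_prompt` would use a task metric (exact match, an LLM judge, etc.), but the control flow is the same: the evaluation set, not intuition, chooses the prompt.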
Ben Lorica: Yeah, and also all the key people in DSPy are in Databricks, right?
Richard Garris: Yeah. Omar went back to MIT, but we still have a lot of developers working on DSPy. That core technology came into Databricks.
Ben Lorica: DSPy is one of my go-to tools, and I think a lot of people love DSPy, right?
Richard Garris: Absolutely.
Barry Dauber: And to finish my thought and agree with what Richard was saying: I think fine-tuning fits a very specific task. Ultimately, especially as you start to hear terms like “token maxing” and “token rich,” the question becomes: how do you optimize for whatever that goal is—quality, accuracy, the metric you’re setting—while doing it as quickly as possible, with as low latency and cost as possible?
As you start putting this into production, especially at large organizations, that becomes a huge challenge. I was talking to a senior leader at an organization a week or so ago, and they had already hit their token allotment spend for one of the major foundation models for the year by the second or third week of April. That’s insane. So how do you prepare and build out these agentic platforms and capabilities—what we like to call intelligent applications—so you can maximize the outcome but do it as cheaply and quickly as possible? Going back to eval: how do you measure to figure out where that is?
Richard Garris: Yeah. Can I give an example, Ben, if that’s okay? A small example of where these tokens get lost.
I literally used an LLM over the weekend to reset my password. I’m like most people—I have some variation of my password, but I hadn’t made a wholesale change in a while. So I thought, “Hey, an LLM is a great tool for that,” because it’s non-deterministic and produces somewhat random results. I used the open-source Llama model because I didn’t want to share that sensitive information with OpenAI or Anthropic.
Even when I was doing this, though, I went through multiple iterations. First I said, “Give me a password,” and it returned the word “password,” because that’s the most common password on the internet. Then I said, “Okay, I want something that’s hard to hack or crack using a GPU-based method,” and it gave me a really complicated string of symbols, numbers, and things I’d never be able to memorize. Then I said, “No, here are three key pieces of information about me that no one else would know. Create a password I can memorize, but that still meets these criteria.”
That cost me about 1,500 tokens going back and forth with the LLM. If I had just sat down and provided all the constraints up front—memorizable, hard to crack, all the requirements—it would have only produced about 400 tokens, roughly 30% of the original token budget. So providing context up front helps it produce the best result. People aren’t even understanding that yet. Instead, there’s a lot of guess-and-check, rather than thinking about how to design these systems so they consistently reach the right result at the lowest possible cost.
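Richard’s arithmetic is worth making explicit. Using his anecdotal token counts (1,500 iterative vs. 400 up-front) and a made-up illustrative price, a quick sketch shows how guess-and-check prompting compounds cost:

```python
# Back-of-the-envelope cost comparison: iterative guess-and-check
# prompting vs. stating all constraints up front. Token counts come
# from Richard's anecdote; the per-1K-token price is illustrative.

def session_cost(tokens: int, price_per_1k: float) -> float:
    """Cost of a session at a flat per-1,000-token rate."""
    return tokens / 1000 * price_per_1k

iterative = session_cost(1500, price_per_1k=0.01)  # three rounds of back-and-forth
upfront = session_cost(400, price_per_1k=0.01)     # constraints given in one shot
savings = 1 - upfront / iterative                  # fraction of spend avoided
```

At these numbers the up-front prompt uses roughly 27% of the iterative budget, a savings of about 73%; multiplied across thousands of users and calls per day, that gap is what blows through annual token allotments.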
Ben Lorica: Or you can just use a password manager and—
Richard Garris: Well, that’s true. That’s actually what I ended up doing too, but—
Barry Dauber: But I guess it was your master password, your primary password for the password manager. I was going to ask Richard what his three prompts were about himself, but I figure he’s not going to tell us for obvious reasons.
Richard Garris: [Laughs]
Ben Lorica: So one of the problems is that we all love the fact that LLM providers are increasing the context window, right? But it’s well known that if you overstuff the context window, not only is it inefficient, but it can actually get confused. It may not really read everything.
What specific pieces of advice do you give people in terms of taking advantage of the big context window without losing money and also losing performance and reliability?
Richard Garris: Yes. Bigger context windows are a mixed bag. It’s great that you have them, but it doesn’t mean you should use all of it, because context rot is real.
Ben Lorica: By the way, the people who seem to really love it are the ones who want to dump the entire codebase into the thing, right? Civilians don’t actually—
Richard Garris: Yeah.
Barry Dauber: Most people don’t have access to that much information. For them, it’s more like, “Let’s give it the entire data store.”
Richard Garris: But it’s no different from a human being. If you dumped everything in your head into another person’s mind, they wouldn’t be able to respond. Human beings thrive on clarity.
Ben Lorica: But obviously the LLM providers feel like this is something they need to do to set their models apart, right?
Richard Garris: Of course. But there’s a practical reality. Even though you might have a million-token context window, there’s still the concept of an effective token window, which may only be 128,000. That’s because you have context rot, and you have a certain attention budget.
Ben Lorica: So is this notion of context rot still true with this generation of models? I mean, I read the papers in the past saying there’s context rot, but presumably the models have improved too, no?
Richard Garris: Of course they have. And I think the frontier model labs we partner with would say that too. But the reality—at least in my experience, when I vibe code every day—is that the more context I give, the more likely the model is to hallucinate. It’s more likely to look at the context and not know which piece it should rely on to address the task at hand. So the more specific you make it, the more likely you are to get the right answer—and, obviously, to use less of your token budget to solve that quality problem.
Barry Dauber: Just look at all the examples popping up in the legal world. There have been state Supreme Court cases where lawyers came to court thinking they had written these amazing briefs. I don’t know if it’s directly due to context rot, but they dump in a bunch of legal cases and, unfortunately, these LLMs can make up cases. This is where human-in-the-loop is obviously very important. Don’t go to a judge with something that wasn’t read by a human first. In these cases, the system is making up legal briefs and legal cases that didn’t exist. It’s because they’re putting in the entire legal history of whatever case they’re working on, and when there isn’t a direct match, it makes something up to support the point they’re trying to make. So it’s definitely still real.
Ben Lorica: So are there tools out there people can use where, all right, I have knowledge bases or PDF files or whatever, and I know I should be helping the LLM with my raw data sources—but is there a way to structure or semi-structure those data sources so they’re more helpful? So I don’t have to give the LLM too much, but because it’s somewhat structured, it can still provide the right response.
Richard Garris: Yeah, and this is not a sales pitch, but that’s really the vision of Agent Bricks: pre-built agents that do specific tasks. We have one—I was literally on a call about this a minute ago—called AI Parse Document. You give it a document and say, “Hey, I want to extract these key fields,” and it will do that for you. We’ve already done all the prompt engineering, the model testing to see what does the best job, and it produces that result every time as a managed service. We have others as well, but that’s the goal.
Ben Lorica: So this could be like a thousand documents in the same format and then it’ll—
Richard Garris: Actually, the value is that they’re all in different formats. I’m working with one of the largest mortgage producers in the U.S., and every mortgage document is different for every county, every city. It’s able to handle non-deterministic forms because it can use its intelligence to find the right fields to extract even when the format isn’t the same.
Ben Lorica: So if I work in a real estate office and I have a bunch of real estate analysts, why not do that pre-processing upfront? Instead of every time a real estate analyst types a prompt and it fires off Agent Bricks to parse all the documents from scratch—can I do what Agent Bricks is doing upfront and load it into a format that’s much more easily queryable?
Barry Dauber: Yeah, exactly. That’s exactly what we’re doing here. There are thousands of documents, and these organizations know the very specific pieces of information they want: deed number, county, execution date, loan number, and so on. In some cases there are a couple hundred pieces of information they want. What this does is move the data from a very unstructured state to a structured state so it can be queried easily.
Richard Garris: Right. This is like a pre-processing step.
Barry Dauber: Yeah. Then when you’re using an LLM, you can ask something like, “Which of our mortgages over the last N months did X, Y, and Z?” and the probabilistic model plus a function or tool call can understand those parameters and pull back the very specific answer. You don’t need the model to guess or hallucinate. We know what the date means. Go get the issuance date and bring it back. Don’t guess what the issuance date is.
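The pattern Barry describes, extract once into a structured store, then answer with an exact lookup instead of a guess, can be sketched in a few lines. The schema and values below are hypothetical stand-ins for the mortgage fields mentioned (SQLite stands in for the warehouse):

```python
# Sketch of "don't guess the issuance date, go get it": document fields
# are extracted once into a structured table, and the model answers
# questions via a deterministic tool call against that table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE mortgages
                (loan_number TEXT, county TEXT, issuance_date TEXT)""")
conn.executemany("INSERT INTO mortgages VALUES (?, ?, ?)", [
    ("L-001", "King", "2024-11-02"),
    ("L-002", "Pierce", "2025-01-15"),
])

def lookup_issuance_date(loan_number: str) -> str:
    """Tool the LLM calls with parameters it parsed from the user's
    question; the answer comes from the table, not from the model."""
    row = conn.execute(
        "SELECT issuance_date FROM mortgages WHERE loan_number = ?",
        (loan_number,)).fetchone()
    return row[0] if row else "unknown"
```

The probabilistic part of the system is confined to understanding the question and choosing the tool; the actual value returned is deterministic and auditable.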
Ben Lorica: And how do you do evals in this example to make the end user comfortable? In other words, how do I know it’s not leaving information on the table? Is there a UI that provides some level of explainability so a non-technical user is comfortable that it’s doing what it’s supposed to do?
Richard Garris: Oh yes, there’s a review UI. But more importantly, we don’t just take the document and say, “Here are the fields,” as a black box. We show the bounding boxes, descriptions of what it’s doing, and a full trace of what it does.
More importantly, though, a lot of companies are trying to do this with just LLMs and hand-crafted prompts. That’s never going to work—or it’ll work, but it’ll take three to five years to do all the prompts for every combination of documents. The better way is to take all those real estate analysts who’ve been doing this for a decade—the knowledge the company has that isn’t on the internet—and make that your test cases. Here’s an example document they handled, here’s the extracted data they produced manually, and that becomes your training data. That can then be used to optimize AI Parse Document so it does the job well, because that information isn’t available to these LLMs. You can use that to improve the results.
Ben Lorica: And AI Parse Document is doing what—fine-tuning or prompt engineering?
Richard Garris: It’s doing a couple of things. There’s a whole paper on this, but specifically it uses a two-pass approach. First, it uses one open-source model for the initial pass, because that’s cheaper than doing it on a foundation model. That step handles the bounding boxes and some of the preprocessing. Then it uses a foundation model from one of our partners—which I can’t disclose—to do the reasoning and extract the key fields. Our research team says that’s the best way to do it, and the two-pass approach is worth it in terms of both latency and quality. We’ve done a lot of research studies showing that.
Third, one of the things we’ve found that improves quality is synthetic data generation. People are familiar with the concept, but in this case, one of the big issues is that the stamp on the mortgage can appear in different places, or even upside down. They don’t have enough training examples to teach the model how to handle that. So we synthetically generate lots of variations of the document and use those as training data to improve quality, using the samples and test cases they already have.
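The synthetic-augmentation step Richard mentions can be sketched abstractly: take each labeled example and fan it out into layout variants (stamp position, orientation) while keeping the gold labels fixed. The field names here are illustrative, not the actual pipeline’s schema:

```python
# Toy sketch of synthetic variation generation: expand each labeled
# document into position/rotation variants so the extractor sees
# layouts (e.g., an upside-down stamp) that real training data lacks.
# The gold labels carry over unchanged, since only layout varies.
import itertools

POSITIONS = ["top-left", "top-right", "bottom-left", "bottom-right"]
ROTATIONS = [0, 90, 180, 270]  # degrees; 180 covers the upside-down stamp

def synthesize_variants(doc: dict) -> list[dict]:
    """Expand one labeled document into layout variants sharing its labels."""
    return [{**doc, "stamp_position": pos, "stamp_rotation": rot}
            for pos, rot in itertools.product(POSITIONS, ROTATIONS)]
```

In a real pipeline the variation would be applied to the document image itself (rendering the stamp at each position), but the bookkeeping, one gold label set shared across many synthetic layouts, is the same.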
Ben Lorica: So is what Richard described, Barry, basically a bunch of forward-deployed engineers and consulting? Am I going to be paying for a lot of consulting time here?
Barry Dauber: With some organizations, potentially. But I think the way we approached it was: how do we take the knowledge, intuition, and expertise of the Databricks research team and instill it into these capabilities so they become much more self-serve? At the end of the day, we want to make it easy for partners and customers to optimize for whatever the use case is.
And yes, the stamp on the deed is definitely a challenge. If you talk to mortgage providers, there are something like 1,800 counties in the United States alone, and every single county does this differently. So instead of having an army of humans show up, as you asked, how do you generically—but also specifically—build that information into the brick so it can essentially take care of itself?
But going back to what I said earlier about the software development lifecycle: this isn’t a one-and-done situation where you deploy a model, a system, or an agent and you’re done. You go through a loop and see how well it works. As Richard said, there are review interfaces that give the mortgage person or mortgage analyst the ability to go in and say, “Hey, this one is actually a little off. This isn’t the issuance date—it’s the accepted date,” and so on. Then the operation can run again. So that all happens automatically, while still giving humans the ability to adjudicate how well it works.
Ben Lorica: Speaking of context, last year there was a lot of discussion around taking seemingly unstructured data and giving it some structure. One way to do that is through knowledge graphs. There was all that attention around GraphRAG, but as best I can tell it’s still somewhat niche. A lot of that seems to be because, one, hardly anyone has a knowledge graph, and two, the tools for turning unstructured data into a knowledge graph automatically are still questionable. So the question is: is an inferior knowledge graph better than no knowledge graph? Do graphs and knowledge graphs come up in your conversations?
Richard Garris: GraphRAG and knowledge graphs used to come up all the time, Ben—though not recently. For me, not in the past year or more.
It started to come out of vogue, but that’s also because it hasn’t solved the core problem. I was doing knowledge graphs back in 2006 at my first startup, and we’re still trying to figure out how to make this work. The graph is a good construct for data, but it’s also very hard to build, as you said.
Ben Lorica: Right—the construction and maintenance.
Richard Garris: Exactly. And maintenance too. Also, there are performance challenges. Graphs are not a very natural format for storage or computation.
Ben Lorica: And also for how people think. You can show someone a table and they’ll understand it, but a graph can quickly become a hairball.
Richard Garris: It can definitely become a hairball. One of the benefits of these newer models over the last year to year and a half is that they’re all good at tool calling. Because they can call tools, they can use the right data structure for the right task.
For example, skills are basically markdown files of instructions or additional prompting to help the model do something specific. That’s important. But we’ve had SQL for 40 or 50 years, and SQL is still one of the best formats for very large-scale data analysis. So when you give an LLM SQL as a tool, it becomes a very powerful agent because it can pull in context using SQL as the query language, execute against your database, and summarize the results. That’s obviously something Databricks does very well, since that’s the world we came from.
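“Giving an LLM SQL as a tool” reduces to a small dispatch loop: the model emits a tool call naming a query, the harness executes it against the warehouse, and the rows flow back as context to summarize. The sketch below uses an in-memory SQLite table as a stand-in for the warehouse, and the table and tool-call payload are invented for illustration:

```python
# Minimal sketch of SQL-as-a-tool: the agent harness exposes a run_sql
# tool; the model requests a query, the harness executes it, and the
# resulting rows are handed back to the model for summarization.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("west", 120.0), ("east", 80.0), ("west", 50.0)])

def run_sql(query: str) -> list[tuple]:
    """Tool exposed to the agent: execute SQL, return the rows."""
    return conn.execute(query).fetchall()

TOOLS = {"run_sql": run_sql}  # what a tool-calling loop dispatches on

# Suppose the model emits: {"tool": "run_sql", "args": {"query": ...}}
rows = TOOLS["run_sql"](
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
```

A production harness would add read-only enforcement, row limits, and governance checks before execution, but the division of labor is the point: the model writes and interprets queries, while the database does the large-scale computation it has been optimized for over decades.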
Ben Lorica: Let me tell you what Neo4j will tell you.
Richard Garris: Oh, I know. My old manager is the head of field engineering at Neo4j.
Ben Lorica: They’ll say: yes, SQL is fine, but it’s super awkward to write graph queries in SQL. And secondly, their perspective is that graphs are more natural than tables because relationships are more naturally represented that way. That might be true, but the reality is most people use Excel. They understand tables at a glance.
Barry Dauber: Yeah, I think when people hear “graphs,” they think of a visual tree representation or some kind of 3D graph. If you can store and operate on that data in a table, but still render it in a 3D tree-like UI, that solves most of the challenge. What people want is to understand the relationships and the associated attributes. Supply chain is a great use case here too because there are so many pieces of information. If you can store it in a flat table and render it that way, it solves a lot of use cases.
Ben Lorica: And the beauty of where we are now is that I don’t even need to write that complicated SQL statement anymore. The LLM will write that awkward graph query in SQL for me. So it’s not really a problem.
Richard Garris: Exactly. That’s definitely true. And to your point, Neo4j is also a partner of ours, so you don’t have to choose. You can have one tool call for data represented as a graph and another for SQL.
Ben Lorica: Both Swedish CEOs, after all.
Richard Garris: Exactly. Very true.
Ben Lorica: All right. So, agents. Agents are a big thing in my world, but mostly around coding and programming. My question is: are people talking about agents in the enterprise outside of coding? And what are the best examples? In coding, we obviously have great agents now. I’m assuming that in other domains they don’t have the equivalent of Claude Code or Codeium, right?
Barry Dauber: They don’t, but organizations are increasingly using them. One of the reasons Anthropic is such a well-run company, and one reason Claude revenue is skyrocketing, is because of Claude Code. Claude Code has become incredibly prolific in the enterprise in a good way. Now business users outside of IT and outside the technical bubble you and we live in are embracing it quickly. As they use it, they’re using it to help accomplish specific tasks and challenges that are unique to whatever they’re trying to do. A lot of it comes down to individual and organizational productivity.
Ben Lorica: So when you say outside of programming, do you mean something like a marketing team made up of marketing professionals who want to build agents for the things they do—email campaigns, social media ads, and so on?
Barry Dauber: Campaign optimization.
Richard Garris: Those are easier examples. One of the things we do is work with Gartner. Gartner has this 400-page RFP with all these questions around databases, data platforms, AI platforms, and so on. We use agents to answer those questions, review them, and double-check them. On top of that, they ask for product demonstrations, and we have sub-agents that actually go into the product, perform those steps through a Chrome plugin, record it, and put it back in the document. Human beings don’t have to do all that manually. And marketing is doing that, not necessarily technical professionals.
Ben Lorica: And they built it themselves?
Richard Garris: Yes.
Barry Dauber: Yeah, it’s pretty cool.
Ben Lorica: Interesting. So going back to eval: eval is hard even for people who understand it. Now we have agents, potentially multi-agent systems. Who’s doing eval well, from what you’ve seen? You don’t have to name companies, but can you describe a non-tech company that impressed you?
Barry Dauber: One that impressed me—I’m gearing up for a call with the C-suite of a large Canadian insurer right after this, so I was going through our book of Canadian customers—is Royal Bank of Canada. They’re obviously a technically sophisticated bank, but they have lots of researchers and capital markets people going through financial statements, research documents, and other materials. It all starts with gathering the data. If I remember the stats correctly, it used to take them something like six weeks to consolidate the data into the format they needed. Now we do that in about 30 minutes.
Ben Lorica: But in this example, Barry, they’re inherently data-savvy and quantitative.
Barry Dauber: They are, but they had to get there. It went from six weeks to 30 minutes. Then for the agent specifically: if I’m a researcher trying to call you and Richard to sell mutual funds or other wealth-management products, there’s a huge amount of information coming out from every public company. That research comes out daily, and every company has different timings for quarterly earnings. It used to take a researcher about 45 minutes to go through all of that. We’ve dropped that to about 15 minutes. That may not sound like the biggest drop, but at scale, across an organization that size, the ROI becomes very large.
Ben Lorica: So how do they know when to pull the trigger—when it’s good enough?
Barry Dauber: They’re able to use capabilities that measure how well the system works and compare it against the benchmarks they already have. Because, to Richard’s earlier point, they have lots and lots of human-curated research summaries they’ve been producing for decades.
Ben Lorica: So this is inherently qualitative, then?
Barry Dauber: It’s qualitative and quantitative. You have to make sure the numbers are right. If you say a company’s earnings were 123 and they were actually 122, that’s a big difference. So you need source information to evaluate it properly.
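The quantitative side Barry describes, making sure a generated summary's figures match the source filings, can be sketched as a simple grounding check. Everything below is illustrative (the function names, the regex, the exact-match tolerance), not a Databricks API:

```python
import re

def numbers(text):
    """Extract numeric figures from a text span (illustrative regex)."""
    return [float(t) for t in re.findall(r"\d+(?:\.\d+)?", text)]

def grounded(summary, source, tol=0.0):
    """Every number in the generated summary must appear (within `tol`)
    somewhere in the source document it claims to summarize."""
    src = numbers(source)
    return all(any(abs(n - s) <= tol for s in src) for n in numbers(summary))

# A one-digit slip ("123" vs. the reported "122") is exactly the kind
# of error this catches before a summary reaches a researcher.
```

In practice this numeric check would run alongside the qualitative, judge-based evaluation rather than replace it.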
Ben Lorica: There’s a guy at NYU, Aswath Damodaran—the valuation god. If you want the “right” valuation for NVIDIA, he’ll show up on CNBC. A friend of mine at NYU has been trying to build an agent to replicate what he does. They haven’t quite gotten there yet, because he still uses a lot of intuition. They can automate some of the information gathering, but getting to the final one-pager hasn’t happened yet.
Richard Garris: I have a slightly different example I was going to use: Fox Sports, because that one is public and on our website. The use case there is a chatbot, but the hardest problem was actually getting all the sports experts in the room with the data scientists—who, it turns out, are not sports experts—and producing all the combinations of complicated questions hardcore fans ask, around things like sports betting and statistics, and writing all that down.
Ben Lorica: And this is where synthetic data might help as well, right?
Richard Garris: Exactly. But we actually do something slightly different from just synthetic data. We do something called a judge builder. The goal is to take those sports experts—super busy people, podcast hosts, media personalities, people like Colin Cowherd, the kinds of people who get these questions all the time—and ask them to provide just 25 really high-quality examples of questions they get on their shows or on their X feeds.
We use that to build a judge model using an algorithm called MemoLine, so it approximates how they think. Once that judge builder is created, it becomes almost a digital twin of them, which can then evaluate responses instead of requiring them to manually check everything by hand. We also use it to generate new examples they would plausibly come up with, which then become part of the testing process.
Ben Lorica: Another question I have is about multimodality. Is that something enterprises increasingly require or ask about? Beyond text, what are the common modes of information? Audio and video, I guess. And images in certain domains. Is that a capability they’re asking for? Some models, like those from OpenAI and Google, are fully multimodal; Anthropic’s are more limited. Talk to me about multimodality.
Richard Garris: We see it a lot, but it depends—going back to Barry’s consulting answer. Some companies build their whole business around processing video or audio. For example, we have a number of companies in manufacturing that record videos of what’s happening on the factory floor. They also create recordings that show incorrect behavior that could lead to a safety violation—things like not wearing protective equipment or walking in the wrong areas. They use that training data along with live video to detect and alert on safety violations so they can address them. So we definitely see those use cases. I’d say maybe 20–30%, but it really depends on the use case.
Ben Lorica: But my question around multimodality is more about whether enterprises are actually ready for multimodal data. A lot of what we discussed assumes a certain level of sophistication in your data platform, your data pipelines, your evals, and so on. Particularly the data infrastructure. I suspect most enterprises aren’t ready for multimodal data, no?
Richard Garris: Yeah, maybe the 80/20 Pareto principle is a good default answer there. Barry?
Barry Dauber: I’d agree. You see it a lot in manufacturing. We work with a large tire manufacturer, for example, and they want to know, every nth tire, how to scan them all and make sure—
Ben Lorica: So what do they do? Just dump everything in—
Barry Dauber: They use multimodal models to identify things like, “That tread that just went by doesn’t match the depth we expected.” So it’s the combination of video sensors filming a conveyor belt and models that understand what they’re looking at.
Ben Lorica: Yeah, but that’s just one pass. What if I want to keep asking questions afterwards?
Barry Dauber: That’s where the combination comes in: take the multimodal data and then use natural language on top of it. So you can say, “Show me every tread that failed over the last two weeks.”
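The pattern Barry describes works because each vision-model pass writes a structured record per tread, and the natural-language question then compiles down to an ordinary filter over those records. A minimal sketch, with field names that are purely illustrative:

```python
from datetime import datetime, timedelta

def failed_treads(detections, window_days=14, now=None):
    """Answer "show me every tread that failed over the last two weeks"
    as a filter over structured rows emitted by the multimodal model.
    `detections` is a list of dicts like {"status": ..., "ts": ...}."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=window_days)
    return [d for d in detections
            if d["status"] == "fail" and d["ts"] >= cutoff]
```

In a real deployment the rows would live in a table and the filter would be generated SQL, but the division of labor is the same: the model interprets pixels once, and follow-up questions hit the structured output.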
Ben Lorica: Right, but then your data infrastructure itself needs to be multimodal.
Richard Garris: Maybe I’ll put it this way. We do have volumes for unstructured data in files—we’ve had that for at least five years. So there’s a place to put the data into the system and work on it.
Ben Lorica: The Lance file format is kind of like Parquet for multimodal, though.
Richard Garris: Yeah, Parquet isn’t the best file format for video and images. But because Databricks was built on the lakehouse and Hadoop-style foundations, we do have the ability to put that data into the system and work with it.
Ben Lorica: Whoever wrote that lakehouse post must be a genius.
Richard Garris: Yeah, very smart guy, Ben. But I think we’re seeing two things converge. There are individual use cases for video, images, and audio, and separate use cases for text. We’re still early in combining them and getting real intelligence from multiple modalities together—say, video plus sensor measurements like machine temperature and all the other contextual information. Right now I see more single-modality use cases, but eventually people will realize that combining those contexts can help models come up with the right answer, or at least a more novel one.
Ben Lorica: So how easy is it, Barry, for an enterprise to be model-agnostic? In other words, not tied to a single model provider. Is that increasingly hard to do?
Barry Dauber: I actually think it’s getting much simpler. Again, not to advertise Databricks, but we have an AI gateway where you can evaluate across all the major model providers. It comes down to evaluation: what’s the best model for what you’re trying to do? How do you evaluate it against open-source options? And then how do you swap it?
Ben Lorica: But earlier, Barry, part of our conversation was about things like DSPy or GEPA being used to optimize a prompt. I imagine that’s specific to a given model. So what happens when my application becomes very prompt-sensitive? I’ve talked to agent startups where I ask, “What model are you using?” and they say OpenAI. I say, “Switch to Gemini, it’s cheaper,” and they say, “Oh no, I can’t, because my prompt is too tied to OpenAI.” Is that increasingly a problem?
Richard Garris: I think we’re getting there, Ben. The key thing is that if you have really good evaluation, then the exact prompt matters less. If you know exactly what the result should be, you can reverse engineer the prompt for Gemini versus OpenAI. But if you don’t have that, and you’ve hardcoded all your prompts, then yes, it’s very hard.
Ben Lorica: So that’s the one thing you definitely need if you want to be model-agnostic. Anything else?
Richard Garris: The other thing is that you have to be very careful with how frontier model providers are designing their APIs, because they’re increasingly diverging. For a while everyone standardized on the OpenAI SDK, and that became the de facto spec. Now Anthropic has artifacts, Gemini has other concepts, and so on. There are a number of open-source projects that help make this more agnostic. If you standardize on the OpenAI SDK or some version of that, ideally you’re not locked into individual models, because these models are all competing. Sometimes one is cheaper or better for a given use case. You want to be able to test them against each other so you don’t get locked into one model that won’t be the right fit for the entire life of your application—or your startup, for that matter.
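Richard's eval-first version of model agnosticism can be sketched as a selection loop: run the same eval set through every candidate and keep the cheapest one that clears the quality bar. The model names, cost units, and judge below are all stand-ins, not any vendor's API:

```python
def pick_model(candidates, eval_set, judge, min_score):
    """Eval-driven model choice: `candidates` maps a model name to a
    (generate_fn, relative_cost) pair; `judge` scores one (question,
    output, reference) triple in [0, 1]. Returns the cheapest candidate
    whose average score clears `min_score`, or None if none do."""
    passing = []
    for name, (generate, cost) in candidates.items():
        avg = sum(judge(q, generate(q), ref) for q, ref in eval_set) / len(eval_set)
        if avg >= min_score:
            passing.append((cost, name))
    return min(passing)[1] if passing else None
```

With a good eval set in place, swapping providers becomes re-running this loop and, if needed, re-tuning the prompt until the scores recover, rather than a leap of faith.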
Ben Lorica: What’s the level of MCP usage in the enterprise, Barry?
Barry Dauber: I think it’s pretty rampant. The bigger question—the one corporate IT and C-suites are asking—is how rampant it is and where these shadow IT, shadow MCP servers are being stood up, and how to get ahead of that.
Richard Garris: Yeah, I’m surprised you didn’t ask about H2O, Ben.
Ben Lorica: So is there going to be a Lake Claw or a Bricks Claw?
Richard Garris: I think you’ll have to wait for Data + AI Summit for that announcement. Although there was a post by Mike Lo about something in that vein. But we’ll have some bigger announcements at Data + AI Summit.
Ben Lorica: It seems like this is another area where there’s a bit of a shadow IT phenomenon happening, because people are starting to love these AI delegates—my term for AI systems that do things on your behalf.
Richard Garris: You mean personal assistants?
Ben Lorica: For personal use, yes. I would imagine that’s the next thing in the enterprise. But Barry, you really have to do this well. There’s a lot that can go wrong.
Barry Dauber: You have to do it well. And I think it comes back to what we started with. Not to beat a dead horse, but this is probably the most important thing we talk about with every organization. It all comes down to governance. Who has access to that information? Who has access to that agent? Who has access to that MCP server? Whether it’s connecting to another agent or to a data source, how do I make sure the permissions associated with what I’m allowed to see—my role, my attribute-based access control—get passed on to whatever those third-party agents are doing?
If Richard does the same thing, how does his access get passed on? Increasingly, that’s what people are trying to figure out: this agent is now going to act on my behalf. Does it have read access, write access, or both? Maybe for Richard it has both, but for me only read. Figuring that out at scale, and doing it in a way that can be configured once and applied consistently, is what corporate IT is increasingly asking about.
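The on-behalf-of pattern Barry describes, where an agent inherits exactly its human caller's entitlements and nothing more, reduces to an authorization check in front of every tool or MCP call. A minimal role-based sketch under invented policy and principal shapes (real ABAC systems also evaluate attributes, not just roles):

```python
def authorize(principal, action, resource, policies):
    """True if any policy grants one of the principal's roles this
    action on this resource. Policy shape here is illustrative."""
    return any(p["role"] in principal["roles"]
               and action in p["actions"]
               and p["resource"] == resource
               for p in policies)

def agent_tool_call(principal, action, resource, policies, tool):
    """Gate every downstream call on the originating user's permissions:
    the agent acting for Barry must not gain Richard's write access."""
    if not authorize(principal, action, resource, policies):
        raise PermissionError(f"{principal['name']} cannot {action} {resource}")
    return tool()
```

Configuring these policies once and having every agent, sub-agent, and MCP server enforce them consistently is the "at scale" problem corporate IT is asking about.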
Barry Dauber: So Richard, you didn’t use OpenAI or Claude to help with your password this weekend?
Richard Garris: No, I did not.
Ben Lorica: So Barry’s brought up governance a few times. And obviously we’re in a confusing state around AI regulation. There’s the EU AI Act, but in the U.S. it’s a bit confusing. The Trump administration told the states not to pass laws, but the states are passing laws. My question is more general: when people build AI applications internally—and I guess it depends on whether they’re inward- or outward-facing—to what extent do they even talk about regulation and compliance before they go live? Are lawyers increasingly part of these conversations, or do people just push the button and go live?
Barry Dauber: It definitely depends on where they’re operating and the kind of application they’re building. Lawyers are involved, but not always.
Ben Lorica: And how early are they involved?
Barry Dauber: Again, it depends on the application. If you’re in a heavily regulated industry like health or life sciences, with HIPAA and similar regulations, lawyers and regulators are going to be involved pretty early. If you’re in a digital-native, technology-first company, lawyers are still involved, but they may let the teams innovate first.
Ben Lorica: And as I said, it’s a confusing landscape because certain states are passing rules, right?
Barry Dauber: And certain states are passing rules, yes. I don’t think anybody really knows what to do in the U.S. at a macro level. But we’re obviously a global firm, and Richard and I cover customers everywhere. This is a topic that comes up very quickly when talking to anyone in the EU. There’s the EU AI Act, and they ask about it right away. A lot of that concern is actually tied to data sovereignty and data access. If I’m a European company operating in France and Germany, I want to make sure my French data never leaves France and my German data never leaves Germany.
Ben Lorica: Oh, I see. So that’s mainly their concern.
Barry Dauber: It’s definitely one of the top concerns. I don’t know if it’s the main one for every organization, but it’s a big one. That’s why you hear more discussion about sovereign clouds in those countries.
Ben Lorica: Because the EU AI Act is kind of a confusing mess. But it also speaks directly about the models themselves. So do people not have conversations early on about whether their models will comply with provisions of the EU AI Act?
Richard Garris: The big question is often: where are the GPUs physically located? We have GPUs available across the U.S., Europe, and Asia-Pacific, but some companies are getting very specific: Japan has to use GPUs in Japan, Australia has to use GPUs in Australia. That’s been the main concern.
On top of that, there are questions about security: what audit mechanisms are in place, how do you track everything going in and out of the model, what guardrails exist to filter out PII? Those are the main things happening. From a legal perspective, it’s both the data sovereignty issue—which we’ve had for a long time—and the auditability and tracking mechanisms for the platform or application they’re building.
Ben Lorica: Because I think that moving forward, lawyers should be part of these conversations much earlier, right?
Barry Dauber: Most likely, yes. It depends on the organization. They typically don’t want lawyers getting in the way of innovation. But how do you protect access to your crown jewels—your corporate information—while still enabling innovation, whether the goal is making money, reducing costs, or something else? How do you let practitioners operate without stifling them with rules so strict that they can’t get anything done?
Ben Lorica: Because the naive approach a lot of teams are taking is: we’ll just solve this with guardrails—input guardrails and output guardrails. But you still need lawyers, because how do you know your guardrails actually cover what—
Richard Garris: Exactly. Legal is about making sure you’re not violating laws, and ultimately about risk management. The big thing is appointing at least one lawyer who really understands the technology and what it does. The biggest problem is that legal teams are often far behind in understanding how the technology works and what the actual risks are. So having one person who really learns it is probably the first step.
And to your point, it’s not about guardrails alone. There should be evaluation on the developer side to make sure the system does what it’s supposed to do. Guardrails are your last resort to prevent something bad from happening. They shouldn’t be your first line of defense for blocking toxic output, mistakes, or the disclosure of PII. They should be the backup option.
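Richard's layering, offline evals establish quality and runtime guardrails are only the backstop, can be sketched as a thin wrapper around generation. The PII check and fallback message below are illustrative, not a recommended production filter:

```python
import re

def no_ssn(text):
    """Illustrative PII backstop: block anything shaped like a US SSN."""
    return not re.search(r"\b\d{3}-\d{2}-\d{4}\b", text)

def with_guardrails(generate, checks, fallback="I can't share that."):
    """Wrap a generation function so output is screened by backstop
    checks. These catch worst-case failures that slipped past the
    offline evals; they are not the primary quality mechanism."""
    def guarded(prompt):
        out = generate(prompt)
        return out if all(check(out) for check in checks) else fallback
    return guarded
```

The legal review question then becomes tractable: lawyers can inspect the small, named list of backstop checks, while the eval suite carries the burden of proving the system does what it should.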
Ben Lorica: So, Barry, what’s the level of comfort enterprises have with Chinese open-weight models? By “using,” I mean not necessarily calling the API endpoint of a Chinese provider, but taking the models and running them in-house. Are they doing this?
Barry Dauber: You’re asking the provocative questions. Again, it depends on how regulated the organization is. I think most organizations—curious what Richard thinks—are comfortable in a sandboxed, closed environment, running on their own infrastructure or infrastructure like ours, to kick the tires and see how well it works. Running them in production is a different story.
Ben Lorica: So do they hire a security company to do additional audits and make sure it’s not phoning home?
Barry Dauber: They generally address that before they even start, by putting it in a sandboxed, closed VPC or a similar environment where it doesn’t have access to proprietary or source-system information.
Ben Lorica: But even then, how do they know it isn’t sending their queries somewhere?
Richard Garris: Prompt injection and phoning home are real concerns. That said, we are seeing more comfort with models like Qwen and GLM-4.
Ben Lorica: And you attribute that to what—leaderboards?
Barry Dauber: Leaderboards and cost. Going back to what I said earlier: how do you optimize for the quality and accuracy you need while doing it as cheaply and efficiently as possible? And can you experiment to find a way to get there?
Richard Garris: And going back to your legal example, Ben: you have to read the legal terms for these open-source models very carefully. There are some models—we haven’t published on them yet, so I won’t name them—that contain specific terms that make them almost a poison pill for production use. Before you even start sandbox testing, a lawyer needs to review that legal language carefully to make sure it won’t create problems for your company later.
Ben Lorica: Do enterprises care whether the model is closed or open-weight?
Barry Dauber: They do. It depends on the enterprise, but they definitely care. We talk to some that say, “I’m an OpenAI shop, that’s all I use.” Others say, “I’m an Anthropic shop.” In Europe you see more Mistral, often because of sovereignty concerns. The open-vs-closed question comes back to what they’re trying to do, and again to evaluation. If I can get an open-source model that’s a fraction of the size and run a test set against it and get the same or better accuracy at a fraction of the cost, why wouldn’t I do that? So a lot of people start with the big foundation model and then work their way down, testing along the way to see whether they can get the same outcome with something cheaper and faster.
Ben Lorica: Yeah, because the supply of open-weight models keeps changing. It feels like Alibaba might get out of the game, Meta seems out, and then you have lesser-known Chinese open-weight providers who have to keep up a release cadence every six months or even faster. There’s no guarantee they’ll keep going. So I guess this is another reason to be model-agnostic. Even if you’re determined to use open-weight models, you don’t know whether your provider will still be releasing them a year from now.
Richard Garris: I’d also argue that sometimes you don’t need to keep changing models. A lot of these models are getting so good that you should upgrade in some cases, but for simple use cases like sentiment analysis, a Llama 8B model is perfectly fine. It hasn’t changed in a year, it still works, and we still have customers using it today.
Barry Dauber: And you can fine-tune that to your specific use case very cheaply and be super effective—at a fraction of the cost—while still getting the same sentiment scores or directional classification you’re looking for.
Richard Garris: The other benefit of open-weight models is that providers like OpenAI and Anthropic are very quick to deprecate versions. They move on from one version to the next because GPUs are limited and they need that capacity for newer models. For enterprises, that’s stressful because they build something around one version, prompt it in a certain way, and then suddenly it’s moving to another version and they don’t know whether it’ll still work. By using an open-weight model, you have more control, and for some use cases that makes sense.
Ben Lorica: So every six months or once a year I write a post saying, basically, reinforcement learning is here. I just did it again a few months ago, this time around agents. Is there reason to hope that we’ll finally have reinforcement learning tools that are more usable by non-experts—and actually useful? Not as easy as fine-tuning has become, but getting closer. Why are people looking at reinforcement learning now anyway?
Richard Garris: I’m going to say no for the enterprise. That doesn’t mean reinforcement learning isn’t important, Ben.
Ben Lorica: There are so many open-source tools now, right?
Richard Garris: I know. But what our research team has found is that most companies still don’t have eval. If they don’t have eval, how can they even think about reinforcement learning? I think of it as step by step. First, do eval and prompt engineering. If that works, great. If not, try fine-tuning, because that’s easier. And only if that’s still not enough should you move to reward models or reinforcement learning using methods like REINFORCE or PPO. That’s harder. It’s the third step in the process, not the first step for most enterprises.
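Richard's ordering can be written down as a checklist, which is roughly how these conversations go with enterprises. The thresholds and labels are mine, not a Databricks decision tree:

```python
def next_step(has_evals, score=None, target=0.9,
              tried_prompting=False, tried_finetune=False):
    """Step-by-step escalation: evals first, then prompt engineering,
    then fine-tuning, and reinforcement learning only as a third resort."""
    if not has_evals:
        return "build evals"
    if score is not None and score >= target:
        return "ship"
    if not tried_prompting:
        return "prompt engineering"
    if not tried_finetune:
        return "fine-tuning"
    return "reinforcement learning (reward model, REINFORCE/PPO)"
```

The point of the ordering is that each later step is harder and more expensive than the one before it, and none of them can be judged without the evals built in step one.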
Barry Dauber: We released something called RLHF—the Knowledge Assistant for Reinforcement Learning—which is focused on a lot of this. But it still requires someone with more of a research background. It’s not yet a push-button solution. I think we’ll get there. But again, the goal is to push against that Pareto frontier of quality, cost, and latency.
Ben Lorica: Have you gotten any actual enterprise usage?
Barry Dauber: We do, but it tends to be more tech-first organizations, because you’re usually dealing with researchers. As the ability to use this becomes more accessible to less technical people, I think we’ll see more uptake.
Richard Garris: Internally we talk a lot about what the right surface area is to expose these products to our enterprise customers. For RLHF, that’s going to show up in our multi-agent supervisor solution. It’s used there because the problem it’s trying to solve is making sure agents call the right tools with the right API specs the first time. Rather than exposing all the research, the papers, and the open-source project directly, we package it into the product so it just works, instead of making customers do all that themselves.
Ben Lorica: And with that, thank you, Richard and Barry.
Barry Dauber: Thank you for having us.
Richard Garris: Thank you very much, Ben.
