Predictability Beats Accuracy in Enterprise AI

Anant Bhardwaj on Agents, RAG, and the Pitfalls of Enterprise AI.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.


In this episode, Anant Bhardwaj, CEO of Instabase, provides a pragmatic guide for AI practitioners building enterprise solutions. He shares a controversial take on AI agents, arguing they are best suited for designing predictable workflows rather than autonomous runtime execution. Anant also debunks the hype around RAG, explaining why data quality and system design are more critical than the models themselves, and why trust in AI stems from predictability, not just accuracy.

Subscribe to the Gradient Flow Newsletter



Transcript

Below is a heavily edited excerpt, in Question & Answer format.

Understanding AI Agents and Automation

What is your definition of an AI agent?

An agent takes only two inputs: a goal and a set of tools. That’s it. It has complete autonomy to figure out a potential plan, execute steps, and even change the plan if it realizes the initial approach was wrong. It can use the tools you provide, ask for more information, succeed, fail, or even misuse the tools—just like a human. The agent makes a self-determination of when it’s done. Your only real control is to give it the initial goal and tools, and then to kill the agent if you need to stop it.

How are AI agents different from Robotic Process Automation (RPA)?

RPA records exact human actions and replays them robotically—like clicking exactly 5 pixels from the top and 2 pixels from the right. It’s essentially an Excel macro for user interfaces, offering predictability but no adaptability. RPA works for mundane, repetitive tasks where the process never changes.

Agents, powered by large language models, can understand a goal and dynamically create a plan to achieve it. They can handle variations and unstructured problems—for example, if it’s raining, an agent knows it needs to find an umbrella before proceeding. Most real-world problems are unstructured, where tomorrow isn’t exactly the same as today. Agents can accomplish everything RPA does but without the rigid, brittle nature of “record and replay.”

Will AI agents replace RPA completely?

Yes, but with important nuances. The main concern is predictability—RPA is 100% predictable while agents have autonomy. The solution is to use agents to create predictable workflows during build time. An agent can build an automated process, a human can review and approve it, and then that predictable workflow is what runs in production. This gives you the intelligence of agents with the reliability enterprises need. Think of it like coding agents—they show you what files they’ll edit and what changes they’ll make. You review and approve before deployment.
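The build-time/runtime split described here can be sketched as a simple approval gate: an agent proposes a workflow at design time, a human reviews and approves it, and only the frozen, approved version ever runs in production. This is a hypothetical illustration, not Instabase's implementation; all class and function names are invented.

```python
from dataclasses import dataclass

@dataclass
class Workflow:
    """A fixed sequence of named steps; frozen once approved."""
    steps: list
    approved: bool = False

def agent_propose_workflow(goal: str) -> Workflow:
    # Stand-in for an LLM agent drafting a workflow at build time.
    # In practice this would call a model; here it returns a canned plan.
    return Workflow(steps=["extract_fields", "validate_totals", "route_for_signoff"])

def human_review(wf: Workflow) -> Workflow:
    # A human inspects the proposed steps (like reviewing a coding
    # agent's planned edits) and approves before deployment.
    wf.approved = True
    return wf

def run_in_production(wf: Workflow, document: dict) -> list:
    # Runtime executes only the approved, fixed plan: no dynamic re-planning.
    if not wf.approved:
        raise RuntimeError("unapproved workflow cannot run in production")
    return [f"{step}({document['id']})" for step in wf.steps]

wf = human_review(agent_propose_workflow("automate loan intake"))
print(run_in_production(wf, {"id": "doc-001"}))
```

The key design choice is that the agent's autonomy ends at the approval boundary; everything after it is deterministic replay of the approved plan.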

Agent Deployment Strategy

Where should agents be deployed in enterprise systems?

This is our controversial point of view: agents should primarily be a build-time and design-time phenomenon, used in collaboration with humans, not a runtime one, especially for broad operational enterprise processes.

Consider a hospital workflow. A group of smart people designs the exact workflow for admitting a patient. They don’t want the receptionist making dynamic, independent decisions every time. That initial design process is where agents can be incredibly useful. A human can collaborate with an agent to brainstorm, design, and refine a complex workflow, pointing out blind spots and iterating quickly. This is human-AI collaboration to build a robust, predictable process.

Once that “AI workflow” is designed and approved, it’s put into production. At runtime, you want predictability. You don’t want an agent in a lending process to randomly decide to change a step, even for the better, because it introduces unpredictability at massive scale. Even humans don’t have complete autonomy in organizations—a hotel receptionist follows specific rules and pricing rather than dynamically deciding based on their “intelligence.”

How does this apply to individual versus operational workflows?

Operational workflows (like lending) that go through multiple teams and systems require complete predictability. Use agents at design time to co-create workflows, then run fixed, predictable processes in production.

Individual workflows (like an analyst researching earnings) can use agents as “executive assistants” at runtime, gathering data and suggesting new sources. But even then, the human is the final checkpoint. The agent does the legwork, but the human reviews and takes responsibility for the final output. It’s still human-AI collaboration, not fully autonomous runtime agents.

Can agents fully automate complex tasks like multi-city travel booking?

Technically possible, but user specification is the bottleneck. Humans often underspecify goals, so agents work best gathering options and executing with approval, rather than booking everything autonomously.

RAG, Fine-tuning, and Data Strategies

What’s your stance on the “RAG vs. fine-tuning” debate for enterprise applications?

My answer is: neither. We wrote back in 2022 that fine-tuning is not the future for most applications. Fine-tuning has limited value—models perform best as large, general-purpose systems with context provided at inference time. Fine-tuning works only for data very similar to the training set.

As for Retrieval-Augmented Generation (RAG), it has been overhyped. RAG is just one tool, and its effectiveness depends entirely on the quality of the underlying retrieval system. The biggest problem with RAG is recall—if your top-k retrieved chunks miss the actual answer, it doesn’t matter how intelligent your model is. You’ve prevented it from finding the right answer. This can be worse than hallucination because you might get no answer at all, or an answer based on incomplete context.

You shouldn’t just throw RAG at every problem. If you’re querying a structured database, you should run a reliable SQL query, not try to “rag” on the database. Think of RAG as one step in a larger workflow, not a complete solution.

Why is enterprise search still unsolved despite all the AI advances?

People confuse internet search with enterprise search. Internet search works because it has three key attributes:

  1. Curation: Web pages (URLs) are maintained by their owners who have incentive to keep them accurate
  2. Ranking Signals: Strong signals exist for ranking, like trusting mit.edu more than a random Reddit post, plus PageRank
  3. Scale and Ownership: Clear ownership and responsibility for content

None of this applies to an enterprise’s internal documents. How do you rank a collection of Google Docs? Is a document from the CEO more important? Maybe, but it might contain contradictory internal thoughts about future direction. Is a newer document more relevant? Not necessarily—it could be a rough draft. Sales materials might differ from engineering documentation.

The core issue is that enterprise data quality is an unsolved problem, and without solving it, enterprise search will remain unsolved.

How can enterprises make RAG effective?

Limit retrieval to vetted, well-maintained datasets—what we call “answer engines.” For example, an HR answer engine where the HR team is responsible for maintaining the accuracy and timeliness of all policy documents. In that controlled environment with curated corpuses where specific teams maintain quality and own the accuracy of results, RAG works well.
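One way to read the "answer engine" idea in code: retrieval is restricted to a corpus that a named team owns and keeps current, rather than searching everything. This is a hypothetical sketch; the metadata fields (`corpus`, `owner`, `reviewed`) and the freshness window are illustrative assumptions, not a real schema.

```python
from datetime import date, timedelta

# Illustrative document store with ownership and review metadata.
DOCS = [
    {"text": "PTO accrues at 1.5 days/month.", "corpus": "hr_policies",
     "owner": "hr-team", "reviewed": date.today() - timedelta(days=20)},
    {"text": "Old draft: PTO accrues at 1 day/month.", "corpus": "shared_drive",
     "owner": None, "reviewed": date.today() - timedelta(days=900)},
]

def answer_engine_search(query: str, corpus: str, max_age_days: int = 180):
    """Return only documents from the vetted corpus that an owning team
    has reviewed recently; everything else is out of scope by design."""
    cutoff = date.today() - timedelta(days=max_age_days)
    words = query.lower().split()
    return [d["text"] for d in DOCS
            if d["corpus"] == corpus
            and d["owner"] is not None          # someone owns its accuracy
            and d["reviewed"] >= cutoff         # kept current, not stale
            and any(w in d["text"].lower() for w in words)]

print(answer_engine_search("pto accrual", "hr_policies"))
```

The stale, ownerless draft on the shared drive is never a candidate, which is exactly the curation property that internet search gets for free and enterprise search does not.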

Building Reliable AI Systems

How do you achieve the accuracy needed for enterprise applications?

Don’t focus on component-level metrics. Vendors claim “90% field extraction accuracy,” but with 60 fields per document, that’s only about 0.2% document-level accuracy (0.9^60 ≈ 0.002). This means almost every document has errors, and if you don’t know where the errors are, you have to review everything, negating the value of automation.

A much better system is one that can process a million documents and tell you with high confidence, “These 60% are 100% correct and require no human review.” Getting from 0.2% to 60% straight-through processing is where the real value is.
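The compounding arithmetic is easy to verify: at 90% per-field accuracy, a document is fully correct only if all 60 independent fields are correct, which works out to roughly 0.2% of documents.

```python
# Per-field accuracy compounds multiplicatively across a document:
# a document is fully correct only if every one of its fields is correct.
field_accuracy = 0.90
fields_per_doc = 60

doc_accuracy = field_accuracy ** fields_per_doc
print(f"Document-level accuracy: {doc_accuracy:.2%}")  # ~0.18%

# Inverting the formula: per-field accuracy needed so that 60% of
# documents come out fully correct.
required_field_acc = 0.60 ** (1 / fields_per_doc)
print(f"Per-field accuracy needed for 60% doc accuracy: {required_field_acc:.2%}")
```

The inversion makes the point sharply: reaching 60% straight-through processing requires per-field accuracy above 99%, which is why system-level checks matter more than headline model accuracy.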

This requires designing systems with checks and balances, just like we do with humans. Use multiple validation steps, adversarial checking between models, and clear understanding of which outputs need review. It’s about system-level workflow design, not just a single model’s accuracy.
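The checks-and-balances idea can be sketched as agreement gating: run two independent extractors and auto-approve only documents where they agree on every field, routing disagreements to human review. A hypothetical sketch under the assumption that independent models rarely make the same mistake; the extractors here are trivial stand-ins.

```python
def extractor_a(doc: dict) -> dict:
    return doc["fields"]                       # stand-in for model A

def extractor_b(doc: dict) -> dict:
    # Stand-in for an independent model B (or an adversarial checker).
    return doc.get("fields_b", doc["fields"])

def triage(docs: list) -> tuple:
    """Split documents into auto-approved and needs-human-review."""
    auto, review = [], []
    for doc in docs:
        a, b = extractor_a(doc), extractor_b(doc)
        # Full agreement on every field -> high confidence,
        # straight-through processing; any disagreement -> human review.
        (auto if a == b else review).append(doc["id"])
    return auto, review

docs = [
    {"id": "d1", "fields": {"amount": "120.00"}},
    {"id": "d2", "fields": {"amount": "75.00"}, "fields_b": {"amount": "75.08"}},
]
print(triage(docs))  # d1 auto-approved, d2 flagged for review
```

This is the workflow-level design the text describes: the system does not merely emit answers, it tells you which answers need no review.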

How do you build trust in AI systems?

Trust doesn’t come from high accuracy—it comes from predictability. A system that is 70% accurate but completely predictable (meaning you know exactly when it will be right and when it will be wrong) is more trustworthy than a 90% accurate but unpredictable system. You trust a predictable horse more than an unpredictable one, even if the unpredictable one runs faster.

This is the core reason we advocate for using agents at build-time to create predictable workflows for runtime. When the system’s behavior is predictable, users can trust it. They know the exact path the process will follow, what happens when certain checks fail, and when human intervention is required. It’s like airport security—everyone follows the same predictable rules, no exceptions for “intelligent” decisions.

Enterprise Implementation

What are the biggest barriers to enterprise AI adoption?

Beyond data quality, the biggest barrier is organizational: too many people, too many opinions, too much management, which makes it very hard to get anything done. There’s a lot of talking about AI and not enough doing.

To truly understand new tools like AI, you have to use them and practice with them. Many decision-makers have a good picture of what AI is but have never “tasted” it by getting their hands dirty on a real problem. It’s like Steve Jobs said about consultants—they have beautiful pictures of bananas and apples but have never tasted them.

There’s also too much noise about being “AI-first” without understanding workflows, predictability needs, or appropriate tool selection. You wouldn’t use AI to multiply 20 by 30 when a calculator is more reliable.

How should enterprises start their AI initiatives?

Start with the problem, not the technology. Don’t make the Big Data mistake of “let’s collect everything and figure it out later” by creating AI initiatives and building data lakes hoping value will appear.

The right way is to identify 5-6 massive, core business problems:

  • Developer productivity? Get tools like Cursor
  • Operational workflow automation? Understand your data and processes first
  • Contract analysis or customer service? Department-specific needs require specific solutions
  • Information finding? Fix data quality first

Let leadership define the core areas for massive impact, then work backward from desired outcomes to figure out what data, tools, and processes are needed. Each problem requires different tools and approaches.

Recommendations for Foundation Model Builders

What capabilities should foundation model builders prioritize for enterprise adoption?

Three critical needs:

  1. Predictability: The same question asked three times shouldn’t give three different answers. This is the number one priority. Enterprise applications are built on reliability. I care more about predictability than raw accuracy—70% accuracy with predictability (knowing when you’re right or wrong) is more valuable than 80% accuracy that’s unpredictable.
  2. Explainability/Auditability: Show how answers are derived—not just token probabilities, but actual reasoning steps, data sources considered, and what was missed. Users need to know the model’s reasoning process. Tools like Cursor succeed because they show their approach—which files they’ll examine, what analysis they’ll perform.
  3. Smaller, efficient models: Reduce the gap between small and large models so teams can run them cheaper, on smaller devices, maintain privacy, and enable reasonable offline functionality without burning through cloud compute. This would enable more powerful on-device, offline, and privacy-preserving applications at much lower cost.