Douwe Kiela on RAG 2.0, Document Intelligence, and The Future of Agents.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Douwe Kiela, Founder and CEO of Contextual AI, discusses why RAG isn’t obsolete despite massive context windows, explaining how RAG 2.0 represents a fundamental shift to treating retrieval-augmented generation as an end-to-end trainable system. He covers critical topics including document intelligence as the linchpin of effective RAG, grounded language models for eliminating hallucinations, and how reasoning models and agents are changing the retrieval paradigm. The conversation also explores multimodal RAG challenges, fine-tuning strategies, and what foundation model improvements would most benefit enterprise RAG practitioners. [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]
Interview highlights – key sections from the video version:
- Large Context Windows vs RAG
- Why Long Context Alone Won’t Kill RAG
- Enter “RAG 2.0”: From Models to Systems
- Building the End-to-End Pipeline: Data Store, Chunking & Embeddings
- PDF & Document Parsing: The Unseen Bottleneck
- Scaling Up: Infrastructure Choices and High-Volume Ingestion
- Next-Gen Retrieval: Hybrid, Graph RAG & Re-Ranking Cascades
- Latency, Observability and Retrieval Pipelines
- Grounded Language Models to Tame Hallucinations
- Multimodal RAG and Reasoning Agents (“Think Mode”)
- Fine-Tuning for Extra Gain: Synthetic Data Pipelines
- Agents, MCP and Active Retrieval—Is RAG Still Needed?
- From Search to Generative Answers: Organising Information
- Open Research & Component Releases at Contextual AI
- Future Wishlist: Reliable Long Context & Stronger VLMs
Related content:
- A video version of this conversation is available on our YouTube channel.
- RAG Reimagined: 5 Breakthroughs You Should Know
- Structure Is All You Need
- GraphRAG: Design Patterns, Challenges, Recommendations
- AI Unlocked – Overcoming The Data Bottleneck
- Tom Smoker → Why ‘Structure’ Is All You Need: A Deep Dive into Next-Gen AI Retrieval
- Manos Koukoumidis → How a Public-Benefit Startup Plans to Make Open Source the Default for Serious AI
- Vaibhav Gupta → Unleashing the Power of BAML in LLM Applications
- Semih Salihoglu → The Intersection of LLMs, Knowledge Graphs, and Query Generation
- Mars Lan → The Security Debate: How Safe is Open-Source Software?
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
RAG and Long Context Windows
With frontier models now advertising massive context windows (some up to 10 million tokens), is RAG becoming obsolete?
No, RAG is definitely not obsolete. While long-context models and RAG address the same core problem—getting relevant information into the language model—it’s incredibly wasteful and often counterproductive to use the full context all the time. To use Douwe’s example: if you want to find out who the headmaster is in Harry Potter, you don’t need to read all seven books.
More critically, performance actually degrades with extremely long contexts. Recent empirical studies, including tests on Gemini models with 2 million token contexts, show performance degradation after just a few thousand tokens. Simply cramming everything in doesn’t guarantee it will work effectively.
The optimal solution is RAG plus long context models. You use RAG to efficiently narrow down to the most relevant information, then leverage the model’s long context capabilities on that sensibly sized window. This approach is particularly important when considering cost and latency—especially with reasoning models that compound these issues—and for high-repeat usage scenarios that most teams are building for.
What guidance do you give teams wrestling with the temptation to “just shove it all in”?
If latency or cost matter at all, treat long context as a fallback, not a first resort. The key is making RAG easy to implement. If you can automate ingestion, retrieval, and prompt assembly, developers won’t feel that “RAG is annoying while long context is easy.” The goal is to make the efficient choice also the easy choice.
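To make that concrete, here is a minimal sketch of "RAG plus long context": retrieve a small relevant window first, then hand only that window to the model. The helpers `search_index` and `llm_complete` are hypothetical stand-ins, not any specific SDK.

```python
# Minimal sketch of "RAG plus long context": retrieve a small, relevant window
# first, then let the model's long-context capacity work on that window only.
# `search_index` and `llm_complete` are hypothetical stand-ins, not a specific SDK.

def answer(question: str, search_index, llm_complete, top_k: int = 20) -> str:
    # 1. Narrow millions of tokens down to the top-k relevant chunks.
    chunks = search_index(question, top_k=top_k)

    # 2. Assemble a prompt from only those chunks (a few thousand tokens,
    #    not the full corpus), keeping cost and latency bounded.
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Long context is still useful here: it lets top_k be generous
    #    without shoving the entire corpus into every request.
    return llm_complete(prompt)
```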
RAG 2.0: A Systems Approach
What exactly is “RAG 2.0,” and how does it differ from early RAG implementations?
RAG 2.0 represents a fundamental shift from treating RAG as a collection of off-the-shelf parts to treating it as one trainable, integrated system. Instead of the “Frankenstein” approach many early adopters used—piecing together random embeddings and language models—RAG 2.0 means document parsing, chunking, embeddings, hybrid retrieval, re-ranking, and the generator are optimized together.
The core insight is that the language model is only a small part of a much larger system. If the overall system doesn’t work harmoniously, even an amazing language model won’t produce the right answers. This systems thinking approach has been proven in other domains like speech recognition, where end-to-end optimization dramatically outperforms modular systems.
How does Contextual AI implement this end-to-end philosophy in practice?
We expose only two main APIs: ingest (push data) and query (OpenAI-style chat completion). Under the hood, there’s a sophisticated data-store layer handling document intelligence—layout segmentation, ML-based chunking, metadata extraction. At query time, we use a “mixture of retrievers,” cascaded re-rankers, and a grounded language model (GLM) that work as a cohesive unit.
All components were trained jointly so their failure modes align. This opinionated orchestration saves developers from having to evaluate thousands of embedding models or figure out optimal chunking strategies. The system is designed so that the way you extract information is optimal for your retriever, your retriever and re-ranker are tightly coupled, and your language model is strongly grounded in the retrieved information.
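For a sense of what a two-API surface looks like from the developer's side, here is an illustrative client sketch. The endpoint paths, payload shapes, and auth header are assumptions for illustration only; they are not the actual Contextual AI SDK or REST contract.

```python
import requests

# Hypothetical client for a two-API surface ("ingest" and "query"). Paths,
# payloads, and auth are illustrative, not a real vendor API.
BASE_URL = "https://api.example.com/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def ingest(datastore_id: str, file_path: str) -> dict:
    # Push a raw document; parsing, chunking, embedding, and indexing
    # all happen behind this single call.
    with open(file_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/datastores/{datastore_id}/documents",
            headers=HEADERS,
            files={"file": f},
        )
    resp.raise_for_status()
    return resp.json()

def query(agent_id: str, question: str) -> dict:
    # OpenAI-style chat completion; retrieval, re-ranking, and grounded
    # generation are orchestrated server-side.
    resp = requests.post(
        f"{BASE_URL}/agents/{agent_id}/query",
        headers=HEADERS,
        json={"messages": [{"role": "user", "content": question}]},
    )
    resp.raise_for_status()
    return resp.json()
```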
Document Intelligence & Data Ingestion
Why do you insist that document parsing is the linchpin of effective RAG?
Document parsing is critically underestimated. If extraction is noisy or incomplete, better chunking or embeddings can’t rescue you. Many people obsess over embeddings and chunking, but if you can’t extract information properly in the first place, what are you going to chunk?
Complex PDFs with tables, circuit diagrams, charts, or mixed layouts require sophisticated extraction involving OCR, Vision Language Models (VLMs), and hierarchy detection. We found that for production use cases—especially in sectors like finance dealing with millions of complex documents—off-the-shelf solutions simply weren’t adequate. Parsing tools like Unstructured, DocETL, or LlamaParse are fine for demos with a few documents, but they crumble when you need production-grade accuracy at scale.
What specific structures and metadata do you extract that matter most for downstream retrieval?
Beyond raw text extraction, we focus heavily on document hierarchy and structure. For a research paper, knowing whether text belongs to the “introduction,” “methods,” or “results” section is crucial metadata. In enterprise documents, understanding organizational hierarchy (is this from the CEO or an intern?) becomes vital for ranking.
Our extraction system:
- Identifies and processes different modalities (tables, charts, code blocks, circuit diagrams) with specialized models
- Preserves hierarchical cues (sections, subsections, table captions)
- Extracts metadata that guides the retriever and re-ranker
- Converts everything into a unified representation optimized for RAG
This metadata becomes essential for the re-ranker to apply business rules—for example, prioritizing recent documents, preferring certain data sources, or handling conflicting information appropriately.
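As a concrete illustration of "text plus the metadata that ranking can act on," here is a sketch of a unified chunk representation. This is an assumed schema for illustration, not Contextual AI's actual internal format.

```python
from dataclasses import dataclass, field

# Illustrative unified representation for an extracted chunk: the text-forward
# content plus the structural and business metadata that the retriever and
# re-ranker can act on. Not Contextual AI's actual schema.
@dataclass
class ExtractedChunk:
    text: str                      # text-forward rendering of the content
    modality: str                  # "prose", "table", "chart", "code", "diagram"
    hierarchy: list[str]           # e.g. ["Results", "Table 3 caption"]
    source: str                    # originating document or system
    metadata: dict = field(default_factory=dict)  # author role, date, version...

chunk = ExtractedChunk(
    text="Revenue grew 12% quarter over quarter ...",
    modality="table",
    hierarchy=["Q2 Report", "Financial Highlights"],
    source="finance/q2_report.pdf",
    metadata={"author_role": "CFO", "published": "2024-07-15"},
)
```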
Retrieval & Re-ranking Best Practices
Is hybrid retrieval (BM25 + neural) still state-of-the-art, or have we moved beyond that?
Hybrid retrieval is still important, but it’s now just one component of what we call a “mixture of retrievers.” This can include dense vectors, sparse keywords, graph-augmented paths, and even text-to-SQL, depending on the query and data characteristics. The system intelligently chooses different retrieval modalities based on the specific needs.
The retrieval process works in stages:
- First-stage retrieval casts a wide net quickly using techniques like query expansion and reformulation
- Smarter but computationally intensive re-rankers then narrow down the results
- Potentially cascaded re-rankers apply increasingly sophisticated filtering
This approach manages the trade-off between computational cost and accuracy—you can’t run a very smart model on millions of documents initially, but you can afford it on the top 1,000 chunks.
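A minimal sketch of that cascade is below; the scorers are toy stand-ins (keyword overlap for the first stage, a placeholder for a cross-encoder in the second), chosen only to show where the cheap and expensive models sit.

```python
# Sketch of a retrieve-then-re-rank cascade: a cheap first stage casts a wide
# net over the whole corpus, then a slower, smarter scorer runs only on the
# survivors. Both scorers here are toy stand-ins for real retrievers/re-rankers.

def cheap_score(query: str, doc: str) -> float:
    # First stage: fast lexical overlap (stand-in for BM25 / dense retrieval).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def expensive_score(query: str, doc: str) -> float:
    # Second stage: stand-in for a cross-encoder re-ranker that reads the
    # query and document together (too costly to run on millions of docs).
    return cheap_score(query, doc) * (1.0 + len(doc) ** 0.01)  # placeholder

def cascade(query: str, corpus: list[str], wide_k: int = 1000, final_k: int = 10):
    # Stage 1: score everything cheaply, keep the top `wide_k` candidates.
    wide = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:wide_k]
    # Stage 2: spend the expensive model only on the shortlist.
    return sorted(wide, key=lambda d: expensive_score(query, d), reverse=True)[:final_k]
```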
How do you incorporate business rules without killing latency?
The key innovation is instruction-tunable re-rankers. Similar to how you instruct a language model, you can tell the re-ranker: “Favor recent documents; prefer source A over B; penalize anything older than Q2.” The re-ranker learns these ranking policies during training.
Because re-ranking operates only on the top-K candidates (typically around 1,000 chunks), latency remains acceptable. This is vital in enterprise settings where you need to handle messy, conflicting data with specific business logic about source credibility, recency, or departmental hierarchy.
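The pattern looks roughly like the sketch below: a natural-language ranking policy travels with the query and the top-K candidates into the re-ranker. The `rerank` function is hypothetical, and the heuristic scoring inside it merely mimics where a trained instruction-following re-ranker would apply the policy; it is not any vendor's API.

```python
# Sketch of the instruction-tunable re-ranking pattern: the ranking policy is
# expressed in natural language and passed alongside the query and candidates.
# `rerank` is a hypothetical function, not a specific vendor API.

instruction = (
    "Favor documents from 2024 or later; prefer the engineering wiki over "
    "Slack exports; penalize anything marked 'draft'."
)

def rerank(instruction: str, query: str, candidates: list[dict]) -> list[dict]:
    # A trained re-ranker would score (instruction, query, candidate) jointly.
    # Here, metadata heuristics stand in for the learned policy, just to show
    # where the instruction takes effect.
    def score(c: dict) -> float:
        s = c["relevance"]                       # first-stage relevance score
        if c.get("year", 0) >= 2024: s += 0.2    # recency preference
        if c.get("source") == "wiki": s += 0.1   # source preference
        if c.get("draft"): s -= 0.3              # penalize drafts
        return s
    return sorted(candidates, key=score, reverse=True)

top_candidates = [
    {"text": "...", "relevance": 0.82, "year": 2023, "source": "slack"},
    {"text": "...", "relevance": 0.79, "year": 2024, "source": "wiki"},
]
print(rerank(instruction, "What changed in the Q2 deployment?", top_candidates))
```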
What about observability and debugging in such a complex pipeline?
Observability is core to RAG systems—you need to understand what’s happening at each step. Every stage logs inputs, outputs, scores, and reasoning traces. When answers go wrong, you can diagnose whether:
- Extraction missed a crucial table
- Retrieval skipped an important chunk
- The re-ranker applied rules incorrectly
- The GLM ignored the evidence
This granular observability is essential for both debugging and continuous improvement of the system.
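One lightweight way to get that attribution is a per-stage trace record, sketched below. The schema is an assumption for illustration, not any particular product's telemetry format.

```python
from dataclasses import dataclass, asdict
import json, time

# Illustrative per-stage trace record: every pipeline stage logs what it saw,
# what it produced, and its scores, so a bad answer can be attributed to a
# specific stage. Not a specific product's telemetry schema.
@dataclass
class StageTrace:
    stage: str            # "extraction", "retrieval", "rerank", "generation"
    inputs: dict
    outputs: dict
    scores: dict
    latency_ms: float

def log_stage(trace: StageTrace) -> None:
    print(json.dumps(asdict(trace)))  # ship to your logging/observability stack

t0 = time.time()
retrieved = ["chunk_17", "chunk_42"]
log_stage(StageTrace(
    stage="retrieval",
    inputs={"query": "warranty terms for model X"},
    outputs={"chunk_ids": retrieved},
    scores={"chunk_17": 0.91, "chunk_42": 0.74},
    latency_ms=(time.time() - t0) * 1000,
))
```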
Grounding & Hallucination Prevention
Does RAG eliminate hallucinations, or do general-purpose LLMs still “know better” than the provided context?
RAG alone doesn’t eliminate hallucinations. General-purpose language models are trained to be versatile—useful for creative writing, brainstorming, and many other tasks. That versatility means they are willing to talk about things beyond the provided context, which is exactly what counts as a “hallucination” in a RAG setting.
Our solution is a specialized Grounded Language Model (GLM) that’s fine-tuned specifically to answer only from retrieved passages. These models are trained to:
- Be strongly grounded in the provided context only
- Say “I don’t know” when information isn’t in the context
- Never talk about anything they don’t have explicit context for
This dramatically reduces hallucinations without blocking creative use-cases elsewhere in the organization—you use the GLM for RAG applications and general-purpose models for other tasks.
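The GLM described here is a fine-tuned model, but the same grounding contract can be approximated with a general-purpose model through prompting. A minimal sketch follows; `llm_complete` is a hypothetical completion call.

```python
# Approximating the grounding contract with a prompted general-purpose model.
# A fine-tuned GLM enforces this behavior in its weights; prompting is weaker
# but illustrates the contract. `llm_complete` is a hypothetical completion call.

GROUNDED_SYSTEM_PROMPT = (
    "Answer strictly and only from the provided context. "
    "If the context does not contain the answer, reply exactly: "
    "\"I don't know based on the available documents.\" "
    "Do not use outside knowledge."
)

def grounded_answer(question: str, retrieved_chunks: list[str], llm_complete) -> str:
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        f"{GROUNDED_SYSTEM_PROMPT}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_complete(prompt)
```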
Reasoning, Agents & Test-Time Compute
How do reasoning models and agent frameworks change the RAG paradigm?
Reasoning models and test-time compute represent a paradigm shift that’s still underestimated. They enable the system to decide when and how to retrieve, treating RAG as one tool among many. Our domain-agnostic planners can:
- Decide what to retrieve based on the query
- Determine how to retrieve it (which retrieval method to use)
- Evaluate results and decide on further actions
- Chain multiple operations (fetch docs, call APIs, write to databases) until the goal is met
The key evolution is from “passive retrieval” (always retrieve for every query) to “active retrieval” (the model reasons about whether, when, and what to retrieve).
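An active-retrieval loop, stripped to its skeleton, looks something like the sketch below. The `plan` function and the tool callables are hypothetical stand-ins for a reasoning model and its retrieval/API tools.

```python
# Sketch of "active retrieval": a planner decides whether to retrieve at all,
# which tool to use, and whether the evidence gathered so far is sufficient.
# `plan` and the entries in `tools` are hypothetical stand-ins.

def run_agent(question: str, plan, tools: dict, max_steps: int = 5) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        # The planner (a reasoning model) inspects the question and the
        # evidence so far, then emits the next action.
        action = plan(question, evidence)  # e.g. {"tool": "doc_search", "arg": "..."}
        if action["tool"] == "answer":
            return action["arg"]           # enough evidence: produce the answer
        # Otherwise call the chosen tool (doc retrieval, SQL, web search, API...)
        result = tools[action["tool"]](action["arg"])
        evidence.append(result)
    return "Could not complete within the step budget."
```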
Do agents eliminate the need for RAG? Some argue they just need agents, not RAG.
This is a fundamental misunderstanding. Agents use RAG under the hood. When deep research tools or other agent systems work with your data, they’re still retrieving information to augment their generation—that’s exactly what RAG is.
The confusion often comes from viewing agents and RAG as alternatives. In reality, agents are sophisticated generators (the “G” in RAG) that use retrieval (the “R”) as one of their tools. Whether it’s web search, database queries, or document retrieval, the agent is augmenting its generation with retrieved information.
How do you handle the increased latency and cost of reasoning models?
Our platform offers flexibility through different modes. Users can choose:
- “Think mode” for queries requiring deep multi-step reasoning (accepting higher latency and cost)
- “Normal RAG mode” for standard queries with lower latency
Interestingly, users seem more tolerant of higher latency when it produces significantly better or more comprehensive answers. It’s a trade-off that depends on the use case—real-time applications need low latency, while complex research queries might justify waiting for a perfect answer.
Fine-Tuning & Continuous Optimization
With an end-to-end optimized system, is fine-tuning still necessary?
You don’t have to fine-tune, but you can if those last few percentage points matter. Out of the box, a well-optimized RAG 2.0 system provides production-ready accuracy for most use cases. However, fine-tuning can provide 10-20% accuracy improvements on harder domains, which might be the difference between:
- Meeting production requirements
- Satisfying compliance thresholds
- Unlocking significant business value
Fine-tuning in this context isn’t about extensive labeling. It involves:
- Providing a small set of domain examples showing desired behavior
- Using synthetic data pipelines to expand the training set
- Jointly training all components (embeddings, re-rankers, generator) to work together optimally
The decision depends on your ROI calculation—for some customers, those extra percentage points are worth millions; for others, the out-of-the-box performance is sufficient.
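For intuition on the synthetic data step, here is a sketch of expanding a small seed set into training pairs by prompting an LLM over real chunks. `llm_complete`, the prompt, and the crude filter are all illustrative assumptions, not the actual pipeline.

```python
# Sketch of a synthetic data pipeline: a handful of labeled domain examples is
# expanded by prompting an LLM to write question/answer pairs grounded in real
# chunks. `llm_complete` is a hypothetical completion call.

def synthesize_pairs(chunks: list[str], seed_examples: list[dict], llm_complete) -> list[dict]:
    examples = "\n".join(f"Q: {e['q']}\nA: {e['a']}" for e in seed_examples)
    pairs = []
    for chunk in chunks:
        prompt = (
            "Given the passage below, write one question a domain expert would "
            "ask and the grounded answer, in the style of these examples:\n"
            f"{examples}\n\nPassage:\n{chunk}\n\nQ:"
        )
        out = llm_complete(prompt)          # expected to contain "...\nA: ..."
        if "\nA:" in out:                   # crude quality filter
            q, a = out.split("\nA:", 1)
            pairs.append({"q": q.strip(), "a": a.strip(), "context": chunk})
    return pairs  # used to jointly tune retriever, re-ranker, and generator
```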
Multimodality in Practice
What does multimodal RAG look like in production?
Multimodal RAG is crucial because much real-world data is inherently multimodal. Consider Qualcomm’s documentation with circuit diagrams, code snippets, tables, and charts all in the same documents. Effective RAG must handle all these modalities.
Our approach:
- The extractor identifies different content types (tables, charts, code blocks, diagrams)
- Specialized models process each modality into a text-forward representation
- Retrievers operate over a unified index containing all modalities
- The system maintains awareness of the original modality for proper rendering
Current Vision Language Models (VLMs) are good but still struggle with fine-grained tasks like extracting specific values from complex charts. This remains an active area of improvement.
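A rough sketch of the modality-routing idea is shown below: detect each region's content type and hand it to a specialized handler that emits a text-forward representation. The handlers are stubs standing in for real table, chart, and code models (a VLM would sit behind the chart handler, for example).

```python
# Sketch of modality routing during extraction: each document region is typed
# and handed to a specialized handler that emits a text-forward representation
# for a unified index. Handlers are stubs for real table/chart/code models.

def describe_table(region) -> str:
    return "TABLE: " + "; ".join(",".join(row) for row in region["rows"])

def describe_chart(region) -> str:
    return "CHART: " + region.get("caption", "no caption")  # a VLM would go here

def describe_code(region) -> str:
    return "CODE:\n" + region["source"]

HANDLERS = {"table": describe_table, "chart": describe_chart, "code": describe_code}

def extract_region(region: dict) -> dict:
    handler = HANDLERS.get(region["type"], lambda r: r.get("text", ""))
    return {
        "text": handler(region),          # what gets embedded and retrieved
        "modality": region["type"],       # kept so answers can render it properly
        "page": region.get("page"),
    }
```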
Platform Architecture & Implementation
Why build your own infrastructure instead of using existing orchestration layers like Ray?
We run on Kubernetes, which provides the control we need over our pipeline. While Ray is excellent for many use cases, we found that for our specific requirements—tight integration between components, custom scheduling, and optimization—building on Kubernetes gave us the flexibility we needed. The overhead of adopting another orchestration layer wasn’t justified by our needs.
What parts of your stack are open source versus proprietary?
We follow more of an “open research” model than pure open source. We’ve published significant research on better RLHF methods, fine-grained evaluation techniques, and contributed models like our state-of-the-art mixture of experts model developed with the Allen Institute.
The production pipeline itself is commercial, but individual components (like our document parser or GLM) can be used independently. This allows teams to adopt parts of our system while maintaining flexibility.
Looking Ahead
What developments from foundation model builders would most benefit RAG practitioners in the next 6-12 months?
Two main areas need attention:
- Long context that actually works: current long-context claims don’t match reality. Models that could reliably use longer contexts without quality degradation would simplify development, though RAG would still be necessary for efficiency at scale.
- Better Vision Language Models: Current VLMs struggle with fine-grained understanding, especially extracting specific values from charts or understanding complex diagrams. This is critical for handling real-world multimodal documents in sectors like engineering, finance, and healthcare.
Are enterprises adopting Chinese frontier models like DeepSeek for production use?
Chinese models like DeepSeek have provided impressive proof-of-concepts and pushed the field forward. However, Western enterprises remain cautious about production deployments due to security, compliance, and geopolitical concerns. While we use them for research and benchmarking, we don’t typically deploy them for customer workloads. The capabilities are there, but practical adoption faces non-technical barriers.
