Jakub Zavrel on Deep Research, Multi-Agent Systems, and Prompt Evolution and Optimization.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Jakub Zavrel, CEO of Zeta Alpha, joins the podcast to discuss the practical evolution from traditional enterprise search to powerful “deep research” systems. He explains why standard RAG is insufficient for complex knowledge work and outlines the necessity of building upon a solid search foundation with iterative, multi-agent AI systems. For practitioners, Zavrel details the real-world challenges of customization and evaluation, sharing advanced, accessible techniques like using LLMs as judges and genetic algorithms to automatically optimize agent prompts for superior performance.
Interview highlights – key sections from the video version:
- Defining “Deep Research” vs. Traditional Search
- The Agentic Approach: How Deep Research Delivers Completeness Over Speed
- Gaining Control with Modular, Multi-Agent Systems
- Enterprise Search as the Foundation for Deep Research
- The Practical Path: Why You Must Fix Search Before Building Agents
- Implementation Strategy: Focusing on Core, Differentiating Use Cases
- Why Standard RAG is Insufficient for Complex Research Tasks
- Beyond Documents: Integrating Structured Data with Agentic Systems
- The Real-World Challenges: Customization and Quality Assurance
- A Practical Evaluation Framework: Using “LLM as a Judge” and Elo Ratings
- Choosing the Right Models: Frontier vs. Fine-Tuned Open-Weights
- A Breakthrough in Optimization: Reflective Prompt Evolution (GEPA)
- Practical Workflow for Evolving Agents with Example-Based Optimization
- Rapid Fire: The Role of Knowledge Graphs and a Wishlist for Foundation Models
Related content:
- A video version of this conversation is available on our YouTube channel.
- Beyond RL: A New Paradigm for Agent Optimization
- The Enterprise Search Reality Check
- AI Deep Research Tools: Landscape, Future, and Comparison
- Anant Bhardwaj → Predictability Beats Accuracy in Enterprise AI
- Josh Pantony → How Agentic AI is Transforming Wall Street
- Douwe Kiela → Building Production-Grade RAG at Scale
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
Understanding Deep Research
What’s the fundamental difference between traditional search and deep research?
Traditional search, whether web or enterprise, returns a list of links optimized for speed and navigation. Deep research is fundamentally different in both process and output. Instead of links, it generates comprehensive, multi-page reports that synthesize information from numerous sources—like what you’d ask a research assistant or management consultant to produce.
Deep research uses agentic AI systems that reason about the task, break it down into sub-questions, perform multiple iterative searches across different aspects of the topic, and assemble findings into complete answers. It’s intentionally slower than search but aims to reduce research work that would normally take hours or days down to minutes. The goal isn’t just finding information—it’s solving part of the actual work that knowledge workers need to do.
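To make that loop concrete, here is a minimal sketch in Python of the decompose / search / synthesize cycle described above. The helper functions are hypothetical stand-ins for LLM and search-index calls, not any particular product’s API.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    question: str
    findings: list[str] = field(default_factory=list)

# The three helpers below are stand-ins for LLM and search-index calls.
def decompose(question: str) -> list[str]:
    return [f"{question} (background)", f"{question} (recent developments)"]

def search(query: str) -> list[str]:
    return [f"snippet retrieved for: {query}"]

def synthesize(question: str, findings: list[str]) -> str:
    return f"Report on '{question}' built from {len(findings)} findings."

def deep_research(question: str, max_rounds: int = 3) -> str:
    state = ResearchState(question)
    queries = decompose(question)             # break the task into sub-questions
    for _ in range(max_rounds):               # iterate rather than answer in one shot
        for q in queries:
            state.findings.extend(search(q))  # gather evidence per sub-question
        # In a real system an LLM would inspect the findings here and
        # propose follow-up queries for whatever is still missing.
        queries = [f"follow-up angle on {question}"]
    return synthesize(question, state.findings)

print(deep_research("impact of solid-state batteries on EV supply chains"))
```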
What should teams expect from the deep research user experience?
Expect an agent to “run off” for minutes, explore multiple facets of your question, then return with a structured report containing findings, supporting citations, internal context, and recommended next steps. Unlike search where speed is everything, deep research prioritizes completeness and quality over velocity.
The output should be tunable—you should control format (1-page executive brief vs. 3-5 page technical review), inclusion rules (internal vs. public sources), and domain lenses (chemistry vs. market analysis). The key is high signal-to-noise ratio, not word count.
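As a rough illustration of what “tunable output” can look like in practice, the sketch below models those knobs as a small configuration object. The field names are illustrative assumptions, not any specific product’s schema.

```python
from dataclasses import dataclass

@dataclass
class ReportConfig:
    # "executive_brief" (~1 page) or "technical_review" (3-5 pages)
    report_format: str = "executive_brief"
    # which source pools the agent may draw on
    allowed_sources: tuple[str, ...] = ("internal", "public")
    # domain lens steering terminology and emphasis, e.g. "chemistry" or "market_analysis"
    domain_lens: str | None = None
    require_citations: bool = True

config = ReportConfig(report_format="technical_review",
                      allowed_sources=("internal",),
                      domain_lens="chemistry")
print(config)
```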
Enterprise Implementation Foundations
Is enterprise search a prerequisite for deep research, and why is it still challenging?
Yes, enterprise search is absolutely foundational; deep research is only as good as your retrieval layer. After 30 years of working in AI, I can say enterprise search remains challenging because it is not just a technical problem: it is organizational, process-oriented, and bound up with knowledge management. You need to understand how companies manage knowledge, connect different siloed applications securely, handle access control, and still deliver high-quality search relevance.
No magical LLM will solve enterprise search. You need secure connectors across data silos, robust access control (ABAC/RBAC), and high-quality relevance—often requiring domain-tuned embeddings. Without this foundation, the deep research agent will spend cycles on the wrong material and produce unreliable outputs.
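A minimal sketch of why access control belongs inside the retrieval layer: documents carry permissions from the source systems, and results are filtered against the caller’s roles before anything reaches the model. The permission model here is a deliberately simplified stand-in for a real ABAC/RBAC integration.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str
    allowed_roles: frozenset[str]   # populated from the source system's ACLs

def search_with_acl(query: str, user_roles: set[str], index: list[Document]) -> list[Document]:
    """Naive keyword match, filtered by the caller's roles.

    In production this filter belongs in the search engine itself, so that
    restricted documents never reach the LLM context window.
    """
    hits = [d for d in index if query.lower() in d.text.lower()]
    return [d for d in hits if d.allowed_roles & user_roles]

index = [
    Document("1", "Q3 roadmap for the battery program", frozenset({"rnd"})),
    Document("2", "Public press release about the battery program", frozenset({"rnd", "all"})),
]
print([d.doc_id for d in search_with_acl("battery", {"all"}, index)])  # -> ['2']
```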
What percentage of enterprises are ready to support deep research?
Surprisingly low. While everyone has some form of enterprise search, over the past decades many organizations gave up on it because they couldn’t achieve Google-quality search internally. Progress stalled until AI and LLMs renewed investment, since you can’t make meaningful progress with AI applications unless you connect them to internal knowledge effectively.
Should companies start with specific domains like HR or finance, or go enterprise-wide?
Start where your competitive differentiation lives: internal R&D, complex engineering challenges, product development roadmaps, process knowledge, and proposal reuse. While high-volume use cases like HR or finance FAQs are tempting, they often don’t require deep research and can be handled adequately by generic tools like Microsoft Copilot.
The most differentiating value comes from tackling problems involving complex internal knowledge where proprietary information is key. That’s where customization creates significant competitive advantage and drives adoption among knowledge workers.
Technical Architecture and Approaches
Why can’t teams just use standard RAG for deep research?
Standard RAG is the simplest form of connecting internal documents to generative AI—you enrich prompts with a few document snippets before sending them to the LLM. It works well for simple, direct questions but breaks down for complex research tasks.
Deep research requires iteration and reasoning. Your initial query rarely finds all the documents needed to write a comprehensive report unless someone has already written that exact report. You need agentic approaches that reason about missing information, break down complex questions, generate different keywords and search strategies, and iterate until the answer is complete. RAG is a core building block of this process, but it’s just one component, not the entire solution.
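To make the contrast concrete, the sketch below places a single-shot RAG call next to a coverage-driven loop that keeps searching until a (placeholder) LLM check judges the evidence sufficient. The helpers are hypothetical; no specific framework is implied.

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    return [f"snippet {i} for '{query}'" for i in range(k)]  # stand-in for a vector/keyword index

def llm(prompt: str) -> str:
    return "STUB ANSWER"                                     # stand-in for a model call

# Standard RAG: one retrieval, one generation. Fine for direct questions.
def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

# Agentic variant: keep retrieving until the evidence is judged sufficient.
def iterative_answer(question: str, max_rounds: int = 4) -> str:
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        evidence.extend(retrieve(query))
        verdict = llm(f"Is this enough to answer '{question}'?\n" + "\n".join(evidence))
        if verdict.startswith("YES"):
            break
        query = llm(f"Propose a new search query for the missing parts of: {question}")
    return llm(f"Write a report answering '{question}' from:\n" + "\n".join(evidence))
```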
How should deep research systems handle both structured and unstructured data?
An effective agentic system should be equipped with multiple specialized tools. Document search handles unstructured content, but for structured data (customer interview logs, sensor measurements, financial data, analytics) you want direct database queries, not RAG. These become different tools that deep research agents can access, such as generating SQL queries or calling analytics APIs.
To provide complete answers, you must integrate all relevant information sources. Nobody wants partial answers, so the system needs a unified approach that can seamlessly combine insights from both structured and unstructured sources.
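A sketch of the “multiple specialized tools” idea: unstructured content goes through document search, while structured records are queried directly. sqlite3 is used only to keep the example self-contained; the tool names and routing are illustrative.

```python
import sqlite3
from typing import Callable

# --- two specialized tools the agent can call ---------------------------------
def document_search(query: str) -> list[str]:
    # stand-in for retrieval over unstructured reports, wikis, PDFs, ...
    return [f"document snippet mentioning '{query}'"]

def sql_query(sql: str) -> list[tuple]:
    # structured data is queried directly instead of being chunked for RAG
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE interviews (customer TEXT, sentiment REAL)")
    conn.execute("INSERT INTO interviews VALUES ('Acme', 0.8), ('Globex', 0.3)")
    return conn.execute(sql).fetchall()

TOOLS: dict[str, Callable] = {"document_search": document_search, "sql_query": sql_query}

# In a real agent, an LLM decides which tool to call; here the routing is hard-coded.
print(TOOLS["document_search"]("churn drivers"))
print(TOOLS["sql_query"]("SELECT customer, sentiment FROM interviews WHERE sentiment < 0.5"))
```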
What’s driving the shift toward multi-agent systems instead of monolithic LLM applications?
This is like applying good software engineering principles (microservices) to AI. Instead of one frontier model doing everything—reasoning, tool use, planning, synthesis—you create modular systems where each agent specializes in a specific sub-task with tailored prompts and tools.
The practical benefits are significant:
- Greater control and customization: It’s easier to optimize a specialized agent than a giant, all-purpose one
- Improved transparency: You can better understand and debug the system’s reasoning by examining agent interactions
- Maintainability: Modular design is easier to update and maintain over time
- Targeted optimization: Each agent can be tuned for its specific function without affecting others
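A minimal sketch of this modular composition, assuming nothing beyond plain Python: each agent is a small unit with its own prompt that can be tuned independently, and the agents are chained into a pipeline. No particular agent framework is implied.

```python
from dataclasses import dataclass

def llm(prompt: str) -> str:
    return f"[output for: {prompt[:40]}...]"  # stand-in for a model call

@dataclass
class Agent:
    name: str
    system_prompt: str  # tailored per sub-task, and independently tunable

    def run(self, task: str) -> str:
        return llm(f"{self.system_prompt}\n\nTask: {task}")

planner  = Agent("planner",  "Break the research question into concrete sub-questions.")
searcher = Agent("searcher", "For each sub-question, find and summarize relevant sources.")
writer   = Agent("writer",   "Assemble the findings into a structured report with citations.")

def pipeline(question: str) -> str:
    plan     = planner.run(question)
    findings = searcher.run(plan)
    return writer.run(findings)

print(pipeline("How will solid-state batteries affect EV supply chains?"))
```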
Evaluation and Optimization
What are the biggest practical challenges when implementing deep research?
The primary challenges aren’t with foundation models themselves—their quality constantly improves. The real difficulties lie in customization and quality assurance. You can spin up a generic deep research agent relatively quickly, but optimizing it for specific, complex domains requires robust evaluation and improvement processes.
How do you evaluate and improve these systems without overwhelming busy domain experts?
You cannot improve what you don’t measure, but getting experts to spend weeks annotating data is impossible. The practical approach involves bootstrapping with minimal expert time:
- Bootstrap with expert feedback: Get a small amount of expert time to provide initial examples of good vs. bad outputs (think hours, not weeks)
- Calibrate LLM-as-judge: Use expert feedback to train an automated evaluation system where another LLM acts as a proxy for human experts
- Automate evaluation at scale: Use tournament-style Elo ratings (like chess rankings) to compare different system versions continuously
This approach provides stable evaluation signals with minimal ongoing expert involvement while enabling continuous improvement.
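A sketch of how such a tournament could work: a placeholder LLM-as-judge picks the better of two system outputs, and a standard Elo update turns those pairwise verdicts into a running score per system version.

```python
import random

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Stand-in for an LLM-as-judge call calibrated on expert feedback.
    Returns 'A' or 'B' for the preferred answer."""
    return random.choice(["A", "B"])

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

ratings = {"system_v1": 1000.0, "system_v2": 1000.0}
for question in ["q1", "q2", "q3"]:
    answer_a, answer_b = f"v1 answer to {question}", f"v2 answer to {question}"
    a_won = judge(question, answer_a, answer_b) == "A"
    ratings["system_v1"], ratings["system_v2"] = elo_update(
        ratings["system_v1"], ratings["system_v2"], a_won
    )
print(ratings)
```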
When is the system “good enough” to deploy?
You’ll typically see a classic machine-learning learning curve: quick improvements that double or triple accuracy over the initial baseline, followed by diminishing returns. Deploy when expert review confirms the system clears the bar for your initial use cases, but continue improving in production with ongoing feedback.
Don’t equate usage frequency with value. A user running 1-2 deep research tasks per day can still see high ROI if each task displaces hours of manual work. Track time saved, decision quality, downstream actions (proposals created, experiments prioritized), and expert satisfaction rather than just usage volume.
Advanced Optimization Techniques
What problem does reflective prompt evolution (like the GEPA technique) solve?
In multi-agent systems, each agent’s behavior is controlled by human-written prompts. The challenge is that the prompts developers write initially are almost certainly not optimal for the system as a whole. GEPA (short for Genetic-Pareto) provides a way to automatically evolve and improve entire collections of prompts to create much more effective systems, without needing access to model weights or massive computational resources.
How does GEPA differ from traditional reinforcement learning approaches?
Traditional RL for optimizing LLMs often requires direct access to model weights and significant data and compute resources, putting it out of reach for many teams. GEPA is more accessible—it uses LLMs themselves as “operators” to propose new, potentially better versions of prompts, then uses genetic algorithms to search the vast space of possible prompt combinations.
Empirical results show it can outperform RL-based fine-tuning on relevant benchmarks while using fewer iterations and far less engineering and computational overhead. This makes advanced prompt optimization practical for more teams.
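A heavily simplified sketch of the reflective-evolution idea, not the GEPA implementation: an LLM “operator” rewrites a prompt after reflecting on observed failures, candidates are scored by the automated judge, and the best survive into the next generation. GEPA itself additionally uses Pareto-based selection across examples, which is omitted here for brevity.

```python
import random

def llm(prompt: str) -> str:
    return prompt + " [revised]"   # stand-in for a reflective LLM call

def score(candidate_prompt: str) -> float:
    return random.random()         # stand-in for LLM-as-judge evaluation on a dev set

def mutate(parent_prompt: str, failure_notes: str) -> str:
    # The LLM reflects on observed failures and proposes an improved prompt.
    return llm(
        "Here is the current agent prompt:\n" + parent_prompt +
        "\nHere is feedback on where it failed:\n" + failure_notes +
        "\nRewrite the prompt to fix these issues."
    )

def evolve(seed_prompt: str, generations: int = 5, population: int = 4) -> str:
    pool = [seed_prompt]
    for _ in range(generations):
        children = [mutate(p, "missed citations; too verbose") for p in pool for _ in range(population)]
        # keep the top candidates by judged score (greedy selection instead of Pareto)
        pool = sorted(pool + children, key=score, reverse=True)[:population]
    return pool[0]

best = evolve("You are a research agent. Answer the question using retrieved documents.")
print(best)
```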
How would practitioners implement this approach?
The workflow leverages examples of desired outcomes:
- Provide examples: Domain experts provide a small set of high-quality outputs (excellent research reports, preferred analysis formats)
- Create prototype: Developers build a simple first-pass multi-agent system designed to produce similar outputs
- Calibrate evaluation: Use expert examples to calibrate automated LLM-as-judge evaluation
- Run optimizer: The system generates thousands of prompt variations, tests them, and evolves the entire agent ecosystem to match expert-provided examples in quality and style
This enables the system to keep learning from user feedback in production through Elo-style scoring without requiring full benchmark re-runs.
Practical Implementation Guide
What models should teams use for different components of deep research systems?
Use a hybrid approach:
- Generation and reasoning: Deploy the best available frontier models within your IT policy constraints (OpenAI through Azure, Anthropic through Bedrock, or open-source models for highly security-conscious environments)
- Retrieval: Fine-tune smaller, state-of-the-art open-weights embedding models on your specific data to optimize search relevance for your domain and terminology
What’s the essential checklist for getting started safely and seeing value quickly?
- Establish secure foundations: Wire secure connectors and implement proper access control (ABAC/RBAC) first
- Optimize retrieval: Tune embedding models and search relevance on your domain-specific data
- Define high-value use cases: Focus on 1-2 scenarios where deep research provides clear competitive advantage
- Bootstrap evaluation: Create expert examples and calibrate LLM-as-judge systems early
- Build modularly: Develop specialized agents rather than monolithic prompts for better control and optimization
- Integrate structured data: Add SQL and analytics tools early to avoid partial answers
- Close the feedback loop: Implement continuous evaluation with Elo-style scoring to evolve the system over time
What improvements would most help enterprise adoption in the next 6-12 months?
From foundation model builders: more predictable outputs across model versions (consistency matters for enterprise reliability), continued access options that satisfy strict data-control policies, and sustained development of strong open-weights models to maintain customization flexibility.
The key insight is that deep research isn’t just about better search—it’s about creating AI systems that can genuinely augment knowledge work by producing complete, trustworthy reports that integrate both public and proprietary information sources.
