Beyond the Chatbot: What Actually Works in Enterprise AI

Jay Alammar on RAG Systems, Enterprise Evaluation, and AI Agents.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

Jay Alammar is Director and Engineering Fellow at Cohere and co-author of the O’Reilly book “Hands-On Large Language Models.” We discuss the evolution from simple RAG systems to sophisticated LLM-powered agents, emphasizing how enterprises are moving beyond basic chatbot implementations to deploy more practical applications like document summarization and data extraction. Jay shares insights on evaluation as intellectual property, the importance of grounding models in company data for security, and emerging trends in multi-modality and smaller, more efficient models for enterprise deployment. [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]

Transcript

Below is a heavily edited excerpt, in Question & Answer format.

Enterprise Adoption Strategy

What are the key lessons enterprises need to learn when first adopting large language models?

First and most critically, don’t think of the chat experience as your first deployment with LLMs. Chat interfaces are complex and unpredictable. Companies should start with simpler, more constrained use cases like summarization or extraction – for example, extracting names of people from a contract. Think of these as individual building blocks for assembling applications.
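
As a concrete illustration of such a constrained building block, here is a minimal extraction sketch. The `call_llm` helper is a hypothetical placeholder for whatever model client you actually use, hosted or privately deployed; the constrained prompt and the output check are the point.

```python
import json


def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in your actual model client here."""
    raise NotImplementedError("wire up a hosted or privately deployed model")


EXTRACTION_PROMPT = (
    "Extract the full names of every person mentioned in the contract below. "
    "Respond with a JSON array of strings and nothing else.\n\n"
    "Contract:\n{contract}"
)


def extract_people(contract_text: str) -> list[str]:
    raw = call_llm(EXTRACTION_PROMPT.format(contract=contract_text))
    try:
        names = json.loads(raw)
    except json.JSONDecodeError:
        # Unlike open-ended chat, a constrained task is easy to validate:
        # anything that is not a JSON list is simply a failed call to retry or flag.
        return []
    return [name for name in names if isinstance(name, str)]
```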

Second, understand that the most valuable AI capabilities aren’t actually generative in nature. The representation capabilities through embeddings enable superior classification, categorization, and clustering – helping companies make sense of vast amounts of unstructured data. Some of the most powerful applications rely on the model’s ability to create powerful representations of data rather than generate new text.
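
A small sketch of that representation side, assuming the open-source sentence-transformers and scikit-learn packages as stand-ins for whatever embedding provider you actually use: embed a handful of documents and cluster them, with no text generation involved.

```python
# Cluster unstructured documents via embeddings rather than generation.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = [
    "Invoice for Q3 cloud hosting services",
    "Employment agreement for a senior data engineer",
    "Quarterly earnings summary and revenue breakdown",
    "Purchase order for office laptops",
    "Offer letter for a machine learning intern",
]

# Any embedding endpoint works here; this open model is just a convenient stand-in.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(documents)  # one dense vector per document

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for label, doc in sorted(zip(labels, documents)):
    print(label, doc)
```

The same vectors can feed classification or semantic search without any change to the pipeline.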

Third, never rely solely on a model’s internal pre-trained information. Early in 2023, people mistakenly treated models like search engines, expecting truthful results. But models can be confident even when speaking about something untrue – they can “hallucinate” information. This realization led to the widespread adoption of retrieval augmented generation (RAG), where you retrieve context from reliable sources before the model generates an answer.
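
The resulting pattern, in a minimal sketch: retrieve relevant passages first, then ask the model to answer only from that context. Both `search` and `call_llm` are hypothetical placeholders for your retrieval layer and model client.

```python
def search(query: str, top_k: int = 3) -> list[str]:
    """Hypothetical placeholder for your retrieval layer (vector store,
    keyword index, internal search API, ...)."""
    raise NotImplementedError


def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for your model client."""
    raise NotImplementedError


def grounded_answer(question: str) -> str:
    context = "\n\n".join(search(question))
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```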

Finally, for enterprises, data security and privacy are paramount. Companies need models that don’t require data to leave their network security boundaries – this means private deployment where the model comes to the data, not the other way around.

RAG Systems Evolution

How should companies think about the evolution from simple RAG to more sophisticated agent systems?

It’s a natural progression based on complexity needs. Basic RAG systems start with searching one data source and providing top documents as context. As requirements grow more complex, teams add enhancements:

Query rewriting comes first – where a language model reformulates the user’s question into better, clearer search queries. Instead of using the raw user question, the LLM generates optimized queries that make retrieval more effective.
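
A minimal sketch of that step, reusing the same hypothetical `search` and `call_llm` placeholders as in the earlier sketch:

```python
def call_llm(prompt: str) -> str: raise NotImplementedError  # hypothetical model client
def search(query: str) -> list[str]: raise NotImplementedError  # hypothetical retriever


def rag_with_query_rewriting(user_question: str) -> str:
    # Step 1: the model turns the raw, possibly conversational question
    # into a compact search query (e.g. "Nvidia fiscal 2023 annual revenue").
    query = call_llm(
        "Rewrite the following user question as a short, self-contained search "
        f"query. Return only the query.\n\nQuestion: {user_question}"
    )
    # Step 2: retrieve with the rewritten query, then answer from that context.
    context = "\n\n".join(search(query))
    return call_llm(
        f"Context:\n{context}\n\nUsing only this context, answer: {user_question}"
    )
```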

Next comes multi-query RAG for comparative questions like “compare Nvidia’s 2020 vs 2023 financial results.” A simple system would fail unless a single document contained both pieces of information. Advanced systems understand they need to conduct two separate searches and synthesize results from both, searching multiple queries simultaneously.
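
A multi-query sketch under the same assumptions: the model decomposes the comparative question into separate search queries, each one is retrieved independently, and the combined context goes back to the model.

```python
import json

def call_llm(prompt: str) -> str: raise NotImplementedError  # hypothetical model client
def search(query: str) -> list[str]: raise NotImplementedError  # hypothetical retriever


def multi_query_answer(question: str) -> str:
    # "compare Nvidia's 2020 vs 2023 financial results" ->
    # ["Nvidia 2020 financial results", "Nvidia 2023 financial results"]
    queries = json.loads(call_llm(
        "Break this question into the separate search queries needed to answer "
        f"it. Return a JSON array of strings only.\n\nQuestion: {question}"
    ))
    context = "\n\n".join(doc for q in queries for doc in search(q))
    return call_llm(
        f"Context:\n{context}\n\nUsing only this context, answer: {question}"
    )
```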

The most sophisticated level involves sequential, multi-step questioning, like “who are the top car manufacturers in 2024 and do they each make EVs?” Here the system first gets the list of manufacturers, then queries each one individually about EVs. This sequential, multi-step behavior where the model orchestrates multiple retrieval and reasoning steps is what evolves into LLM-backed agents.
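
And a sketch of that sequential level, same placeholders: one retrieval-plus-generation step produces the list of manufacturers, the system then loops over them with follow-up searches, and a final call synthesizes the answer.

```python
import json

def call_llm(prompt: str) -> str: raise NotImplementedError  # hypothetical model client
def search(query: str) -> list[str]: raise NotImplementedError  # hypothetical retriever


def sequential_answer() -> str:
    # Step 1: get the entities the follow-up questions depend on.
    makers = json.loads(call_llm(
        "Context:\n" + "\n".join(search("top car manufacturers 2024")) +
        "\n\nList the top car manufacturers in 2024 as a JSON array of names."
    ))
    # Step 2: one follow-up retrieval and question per entity.
    findings = []
    for maker in makers:
        context = "\n".join(search(f"{maker} electric vehicle models"))
        findings.append(call_llm(
            f"Context:\n{context}\n\nDoes {maker} make EVs? Answer in one sentence."
        ))
    # Step 3: synthesize the per-entity findings into the final answer.
    return call_llm("Combine these findings into one answer:\n" + "\n".join(findings))
```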

What are the emerging trends in context engineering that teams are actually implementing?

The most practical trend is giving models an “onboarding phase” when connecting to new data sources. For databases, this means providing metadata about tables, columns, and data types. For code repositories, it’s creating markdown summaries of the tech stack, testing frameworks, deployment processes, and commands to run. This initial snapshot helps the model operate more effectively.
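
A small sketch of the database-onboarding idea, using the standard-library sqlite3 module and a toy in-memory schema purely for illustration: collect table and column metadata once, then prepend that snapshot to the model's context before it writes any SQL.

```python
import sqlite3

# Toy schema standing in for a real enterprise database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, placed_at TEXT);
""")


def schema_snapshot(conn: sqlite3.Connection) -> str:
    """Build the 'onboarding' text the model sees before writing any SQL."""
    lines = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
        described = ", ".join(f"{name} {ctype}" for _, name, ctype, *_ in columns)
        lines.append(f"- {table}({described})")
    return "Database schema:\n" + "\n".join(lines)


print(schema_snapshot(conn))
# This snapshot is prepended to prompts so the model knows which tables and
# columns exist before it attempts a query.
```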

This approach leverages the structure that document creators already built in – chapters, abstracts, footnotes, hierarchies. Rather than waiting for perfect knowledge graphs (which most companies don’t have and can’t easily build), teams are exploiting existing document organization.

The key insight is that even with million-token context windows, managing context strategically is still crucial. It’s expensive, slows things down, and models exhibit human-like behavior – they pay more attention to the beginning and end of contexts, often missing middle information. Smart preprocessing beats brute force.
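
One widely used preprocessing trick that follows from this: after ranking retrieved documents, place the strongest ones at the edges of the assembled context rather than letting them sink into the middle. A purely illustrative sketch:

```python
def order_for_context(docs_by_relevance: list[str]) -> list[str]:
    """docs_by_relevance: most relevant first. Returns the same documents
    arranged so the best ones sit at the start and end of the context."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["doc A (best)", "doc B", "doc C", "doc D", "doc E (weakest)"]
print(order_for_context(ranked))
# ['doc A (best)', 'doc C', 'doc E (weakest)', 'doc D', 'doc B']
```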

What’s the reality of graph RAG adoption in enterprises?

Honestly, there isn’t much direct evidence of widespread graph RAG adoption in practice. While there are interesting developments and lots of conference presentations about it, the practical reality is that most companies don’t have mature knowledge graphs and building quality ones remains challenging.

This follows a common pattern in emerging technologies – initial excitement from vendors and researchers, but actual enterprise adoption lags because the foundational requirements (in this case, good knowledge graphs) aren’t in place yet. The focus remains on more direct methods of context engineering like leveraging metadata and improving the core retrieval pipeline.

Evaluation as Core Capability

How should companies approach evaluation as a core capability?

Evaluation should be treated as intellectual property – it’s what differentiates you from competitors and helps you choose the right vendors and models for your specific use cases. Think of your evaluation framework and datasets as the equivalent of unit tests in software development.

Start by clearly defining your use cases and gathering internal examples using your actual data. Even a small dataset of 50-100 examples gives tremendous direction for vendor and model selection. This dataset should contain representative inputs and desired outputs for your specific use case.

Having this internal benchmark allows you to objectively compare different models and vendors to see which performs best on tasks that matter to you, set realistic expectations for what the technology can do, and detect when new model versions regress on specific capabilities you rely on. Companies developing strong evaluation capabilities can catch regressions early, compare different approaches objectively, and make data-driven decisions about their AI implementations. Without this, you’re essentially flying blind.
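
A minimal sketch of what such an internal benchmark can look like in code. The two examples and the exact-match check are deliberately simplistic placeholders; a real suite uses your own documents and task-appropriate scoring (contains-checks, rubrics, LLM-as-judge, and so on).

```python
from typing import Callable

# A few representative (input, expected) pairs drawn from your own data;
# grow this toward 50-100 examples over time.
eval_set = [
    {"input": "Extract the invoice total: 'Total due: $4,200'", "expected": "$4,200"},
    {"input": "Extract the invoice total: 'Amount payable is $310.50'", "expected": "$310.50"},
]


def score(model_fn: Callable[[str], str]) -> float:
    """Fraction of examples the candidate model gets exactly right."""
    hits = sum(model_fn(ex["input"]).strip() == ex["expected"] for ex in eval_set)
    return hits / len(eval_set)


# Run the same suite against every candidate model or vendor, and again on
# every new model version, much like a unit-test suite in CI:
# results = {name: score(fn) for name, fn in candidate_models.items()}
```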

Build vs Buy Decisions

Should enterprises build their own RAG systems or use specialized providers?

This depends on company capacity and timeline pressures. Building a production-ready RAG system from scratch – with proper chunking strategies, embedding approaches, query rewriting, re-ranking, and all the enterprise features like security and access control – typically takes one to two years. Each step in the pipeline has numerous options, leading to a combinatorial explosion of choices that need to be optimized.

Most companies are under pressure to show ROI quickly and don’t have that timeline. Specialized providers have already worked through that optimization problem, so using a specialized vendor or platform is often the more practical approach.

It’s similar to databases – technically you could build one from scratch, but for most companies, using a proven vendor makes more sense so they can focus on defining their use case and building the final application rather than reinventing the underlying infrastructure.

Do most companies need their own AI platform for post-training?

For most companies, no. You can achieve significant value through prompt engineering and context management with state-of-the-art foundation models. The benefit is that you ride the “rising tide” of model improvements without changing your implementation – as underlying models from providers continue to improve, your application benefits automatically.

Post-training techniques like fine-tuning make sense only after you have exhausted prompt engineering and RAG and are still not achieving desired performance. It’s most appropriate for companies with machine learning engineering capacity where AI is core to their product and where they have high-quality data to do it effectively. The most advanced companies are doing reinforcement fine-tuning, but this requires considerable sophistication.

Multimodality and Code Generation

Where do enterprises stand with multi-modal AI implementation?

Text remains dominant – probably 90% of use cases – but multi-modality is crucial for companies dealing with PDFs containing charts and images, medical organizations with imaging data, or any industry with significant visual documentation.

The most practical advancement is embedding models that can handle both images and text in the same embedding space, enabling cross-modal retrieval. This allows searching for text and getting relevant images back, or vice versa. Video and audio remain more challenging due to computational intensity, though audio products are expected to see significant development this year.
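
A small cross-modal retrieval sketch, assuming the open sentence-transformers package and its CLIP checkpoint as a stand-in for an enterprise-grade multimodal embedder; the image filenames are hypothetical.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# One model embeds both images and text into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

image_paths = ["chart_q3_revenue.png", "org_structure_diagram.png"]  # hypothetical files
image_embeddings = model.encode([Image.open(path) for path in image_paths])

# A text query retrieves images directly, no captions or OCR required.
query_embedding = model.encode("bar chart of quarterly revenue")
scores = util.cos_sim(query_embedding, image_embeddings)[0]

best = scores.argmax().item()
print("Best match:", image_paths[best])
```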

What’s the current state of AI-assisted software development adoption?

The use of AI in coding has matured significantly beyond simple function writing (like LeetCode-style problems) to more advanced software engineering tasks. Modern evaluations use benchmarks like SWE-bench, where models receive GitHub issues and must solve them across multiple files, with verification through existing unit tests.

Most developers are open to AI-assisted tools, with junior developers being most enthusiastic and senior developers also generally accepting. The resistance tends to come from mid-career developers. It’s also important to remember that code generation is a key capability for agents serving non-technical users – when an analyst asks an agent to create a bar chart, the model is writing and executing Python code in the background.
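
A small sketch of that chart example: the model's job is to produce plotting code, and the agent runtime executes it. The generated code is hard-coded here for illustration, and a real deployment would run it in an isolated sandbox rather than a bare exec.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display

# Stand-in for what the model would actually generate from an analyst request.
generated_code = """
import matplotlib.pyplot as plt
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [12.1, 13.4, 15.0, 16.2]
plt.bar(quarters, revenue)
plt.title("Revenue by quarter ($M)")
plt.savefig("revenue.png")
"""


def run_generated_code(code: str) -> None:
    # In production this belongs in a sandbox (container, restricted
    # subprocess), never a bare exec inside the agent's own process.
    exec(code, {"__name__": "__generated__"})


run_generated_code(generated_code)
print("Chart written to revenue.png")
```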

The key is understanding both strengths and limitations of current tools, and maintaining ownership of your code while leveraging AI assistance.

AI Agents in Practice

What’s the realistic state of agent deployment in enterprises?

We’re still in early days. Most companies are implementing simple, task-specific agents or multiple specialized agents working in parallel. Multi-agent systems where agents coordinate and communicate among themselves aren’t quite production-ready yet.

The most mature agent implementations are actually advanced RAG systems – tools like Perplexity or specialized research tools are essentially agents doing very sophisticated information gathering and synthesis. These systems demonstrate the practical reality of agents today: deep research and information gathering rather than complex multi-agent orchestration.

For practical deployment, the pattern emerging is single primary agents that can delegate well-defined tasks to specialized sub-agents, rather than complex swarms of agents coordinating with each other. For example, a developer might have multiple instances of a coding agent working on different parts of a problem, then integrate the results.
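
A minimal sketch of that delegate-and-integrate pattern. Everything here is a hypothetical placeholder rather than any particular framework's API.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your model client")  # hypothetical placeholder


# Specialized sub-agents behind a simple, well-defined interface.
SUB_AGENTS = {
    "research": lambda task: call_llm(f"Gather and summarize sources on: {task}"),
    "coding": lambda task: call_llm(f"Write and test code that: {task}"),
}


def primary_agent(goal: str, plan: list[tuple[str, str]]) -> str:
    """plan: (sub_agent_name, subtask) pairs, typically produced by the
    primary model itself in an earlier planning step."""
    results = [f"[{name}] {SUB_AGENTS[name](task)}" for name, task in plan]
    return call_llm(
        f"Goal: {goal}\n\nSub-agent outputs:\n" + "\n".join(results)
        + "\n\nIntegrate these into a final answer and note which sub-agent "
        "produced each part."
    )
```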

How should teams think about the transition from tools to agents?

Many teams are already using agents without realizing it. If you’re using advanced research tools or have multiple instances of coding assistants working on different parts of a problem, you’re essentially working with agents.

The key requirements for effective agents are: access to the right tools and information, clear scopes of operation, and transparency – they should show users their work and reasoning process. Crucially, the end user should remain the final arbiter, empowered with visibility into the agent’s decision-making process and the sources it used to arrive at answers. A well-designed agent system is transparent rather than a black box.

Smaller Models and Future Developments

What role will smaller models play moving forward?

Smaller models (2-4B parameters) will prove sufficient for most well-defined enterprise tasks. They can now handle tasks that once required frontier models, and they’re cheaper, have lower latency, and are suitable for offline or edge deployment.

The key is evaluation – if you’ve clearly defined your use cases and a smaller model can satisfy them reliably, you get significant benefits in cost and performance. Larger models tend to be better general problem solvers with superior reasoning capabilities for complex tasks. But as you identify specific, well-scoped tasks, smaller specialized models become increasingly viable.

The pattern emerging is using larger models for complex reasoning and coordination, while delegating specific, well-scoped tasks to smaller, faster, cheaper models. It’s often more exciting to see a new, highly capable small model released than another massive one.
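
A minimal routing sketch for that pattern; the model names and the client call are hypothetical placeholders.

```python
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your provider's client here")  # hypothetical


SMALL_MODEL = "small-4b-instruct"  # hypothetical 2-4B parameter model
LARGE_MODEL = "frontier-large"     # hypothetical frontier model

# Task types your evaluation suite has shown the small model handles reliably.
WELL_SCOPED_TASKS = {"classify", "extract", "summarize"}


def route(task_type: str, prompt: str) -> str:
    model = SMALL_MODEL if task_type in WELL_SCOPED_TASKS else LARGE_MODEL
    return call_model(model, prompt)
```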

What emerging foundation model capabilities should practitioners watch for?

Beyond the obvious multi-modality and reasoning improvements, text diffusion models represent an interesting under-the-radar development. Unlike autoregressive transformers that generate tokens sequentially (committing to each word one at a time), diffusion models work by starting with a noisy rough draft of the entire output and then iteratively refining it over several steps until it becomes clear and coherent.

Early demos show dramatically faster output speeds since models don’t have to generate token by token. This architecture could enable new paradigms where models can reconsider and refine entire responses rather than being locked into early token choices. While still early, this could introduce fundamentally new model behaviors and dramatically increase generation speed.