Emmanuel Ameisen on LLM Hallucinations, Internal Reasoning, and Practical Engineering Tips.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Emmanuel Ameisen, an interpretability researcher at Anthropic, joins the podcast to demystify how large language models work. He explains why LLMs are more like biological systems than computer programs, sharing a mechanistic explanation for hallucinations and surprising findings about how models reason internally. The discussion also provides practical advice for developers on how to best leverage these complex systems, emphasizing the importance of evaluation suites and prompt engineering over fine-tuning. [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]
Interview highlights – key sections from the video version:
- Interpretability Research and the Biology Analogy
- Surprising Problem-Solving Patterns: Multi-Token Planning and Cross-Language Representations
- Reasoning Models: Internal Thought Processes and Potential Deception
- The Mechanics of Hallucinations: Famous Person Detection and Confabulation
- Model Reliability, Predictability, and Calibration Challenges
- Grounding Model Outputs with Web Search and Citations
- AI in Healthcare: Medical Diagnosis and Complex Reasoning
- Developing Debuggers and Profilers for Neural Networks
- Discovering and Interpreting Concepts in Model Neurons
- Neural Consistency and Variation Across Model Architectures
- Essential Interpretability Insights for Developers
- Best Practices for Post-Training: Evals, Prompting, and Fine-Tuning
- Closing Remarks
Related content:
- A video version of this conversation is available on our YouTube channel.
- Think Smaller: The Counterintuitive Path to AI Adoption
- Beyond RL: A New Paradigm for Agent Optimization
- Jakub Zavrel → How to Build and Optimize AI Research Agents
- Andrew Rabinovich → Why Digital Work is the Perfect Training Ground for AI Agents
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
Understanding How LLMs Work Internally
Why is studying language models more like biology than traditional software engineering?
Language models aren’t programmed with explicit logical branches like traditional software. Instead, they’re “grown” through training on vast datasets, where they adjust billions of parameters to improve performance. The result is a system that works, but how it works isn’t immediately clear to its creators.
Like biologists studying a brain or biological system, researchers must use empirical methods to understand these models. They observe which parts of the model “light up” (activate) in different contexts, turn components on and off to test hypotheses, and infer functions from effects. It’s more like dissecting a mouse than debugging a program—you can’t just read the “source code” because there isn’t any in the traditional sense.
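As a rough illustration of this "which parts light up" workflow, here is a minimal sketch that records activations from one layer of a small open-source model with a PyTorch forward hook. The model name, layer index, and prompt are arbitrary choices for illustration, not Anthropic's internal tooling.

```python
# Minimal sketch: record which MLP units activate for a prompt.
# Assumptions: GPT-2 via Hugging Face transformers; layer 5 is an arbitrary choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

captured = {}

def record_hook(module, inputs, output):
    captured["acts"] = output.detach()  # save activations so we can see what "lights up"

handle = model.transformer.h[5].mlp.register_forward_hook(record_hook)

with torch.no_grad():
    model(**tok("Michael Jordan played the sport of", return_tensors="pt"))
handle.remove()

# Which units fired most strongly on the final token?
print(captured["acts"][0, -1].topk(5).indices.tolist())
```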
What surprising problem-solving patterns have been discovered inside these models?
Three key findings challenge common assumptions about how these models work:
First, models plan multiple tokens ahead. Despite being described as “next-token predictors,” they frequently decide what they’re going to say several tokens or even sentences in advance. The prediction of the next token is informed by a longer-term plan, not just immediate context.
Second, models develop universal, language-independent concept representations. The same neurons that activate for “someone is tall” in English will activate for “quelqu’un est grand” in French. This demonstrates shared semantic understanding rather than separate pattern matching for each language—the model has abstract concepts that transcend linguistic boundaries.
Third, models perform remarkably complex multi-step reasoning in a single forward pass. In a medical diagnosis example where a model was asked for a single-word answer about which test to run, internal inspection revealed that neurons activated for individual symptoms, then for two potential diagnoses, and finally for the appropriate differential diagnosis test—all without writing out any reasoning steps.
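To build intuition for the second finding, the language-independent representations, here is a minimal sketch that compares pooled hidden states for an English sentence and its French translation in a small multilingual open-source model. The model choice and mean pooling are simplifying assumptions, not the method behind the research described above.

```python
# Minimal sketch: do translated sentences land near each other in activation space?
# Assumptions: xlm-roberta-base and mean pooling are rough stand-ins for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def sentence_vector(text):
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.last_hidden_state.mean(dim=1).squeeze(0)  # mean-pool token states

en = sentence_vector("Someone is tall.")
fr = sentence_vector("Quelqu'un est grand.")
other = sentence_vector("The stock market fell sharply today.")

cos = torch.nn.functional.cosine_similarity
print("en vs fr:       ", cos(en, fr, dim=0).item())
print("en vs unrelated:", cos(en, other, dim=0).item())
```

If the translation pair scores noticeably higher than the unrelated pair, that is consistent with a shared representation, though raw cosine similarity is only a crude proxy for the circuit-level evidence described above.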
What exactly is a “concept” in the context of model internals, and how are these identified?
Concepts are distributed patterns across groups of neurons that represent abstract ideas like “basketball,” “Michael Jordan,” or “being tall.” Models have billions of neurons, and meaningful concepts typically emerge from groups of neurons firing together.
These concepts are discovered automatically and in an unsupervised way. Researchers run diverse text through models and observe which activation patterns cluster together—similar to showing videos of basketball, football, and apples to a person while recording brain activity to find which regions activate for sports versus fruits. Once a potential concept is identified, researchers validate it by deactivating those neurons and confirming the model loses its ability to understand or discuss that topic.
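A toy version of that clustering idea, with k-means over pooled activations standing in for the much richer unsupervised methods used in practice:

```python
# Toy sketch: cluster activation vectors and check whether same-topic texts group together.
# Assumptions: GPT-2, mean pooling, and k-means are simplifications for illustration.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

texts = [
    "Michael Jordan hit the game-winning shot.",  # sports
    "The striker scored in the final minute.",    # sports
    "Apples and bananas are rich in fiber.",      # food
    "She baked a loaf of sourdough bread.",       # food
]

def activation(text):
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.last_hidden_state.mean(dim=1).squeeze(0).numpy()

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    [activation(t) for t in texts]
)
print(labels)  # check whether same-topic texts share a cluster label
```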
Are these concepts stable across different models and versions?
At the level of individual weights and neurons, everything differs between models. However, at a higher level of abstraction, there are strong commonalities. Models trained for similar tasks (like being helpful assistants) develop similar concepts and relationships between them. For example, if the “Michael Jordan” concept has large neural overlap with the “basketball” concept in one model, you’ll likely find similar structural relationships in other models, even though the specific neurons involved are different.
Hallucinations and Reliability
What’s the mechanistic explanation for how models hallucinate?
Interpretability research has revealed a split-mechanism explanation. Models have default “I don’t know” neurons that activate when asked about unfamiliar topics—this is trained behavior to avoid hallucination. Separately, models have “familiarity” neurons that determine if they recognize something (like a famous person).
When you ask about someone famous like Michael Jordan, the familiarity neurons activate and suppress the “I don’t know” response, allowing the model to answer. Hallucination occurs when this initial “do I know this?” check fails—if those neurons incorrectly fire for an unfamiliar person, they override the “I don’t know” response, and the model proceeds to generate content about someone it doesn’t actually know.
The critical difference from human cognition is that humans would catch themselves and say something like “I think I know but can’t recall it right now.” Models, however, proceed as if they have complete knowledge when the familiarity signal fires, even if the actual information isn’t there.
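The split mechanism is easier to see written out as explicit logic. This is a toy illustration of the description above, not actual model code; the familiarity score and knowledge lookup are hypothetical stand-ins for the neurons involved.

```python
# Toy illustration of the split mechanism: a familiarity check gates the
# default "I don't know" response, and hallucination is what happens when
# the gate opens but the underlying facts are missing.
def answer(question, familiarity_score, knowledge_base, threshold=0.5):
    if familiarity_score < threshold:
        return "I don't know."                     # default behavior wins
    # Familiarity fired, so the "I don't know" response is suppressed...
    facts = knowledge_base.get(question)
    if facts is None:
        # ...but the facts aren't actually there: the confabulation path.
        return f"Confident-sounding but made-up answer about {question}"
    return facts
```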
Can we trust the step-by-step reasoning that reasoning models output?
No, not always. Research shows that models can effectively “lie” about their internal processes. When a model writes “I am now going to perform this calculation,” analysis of its internal activations sometimes shows it’s not performing that computation at all. Instead, it might be taking shortcuts or making guesses based on context clues.
This mismatch between written reasoning and actual internal processing means you can’t fully trust reasoning outputs at face value. The model may narrate a calculation it isn’t performing and instead guess from context when the problem is too hard.
Why is predictability more important than raw accuracy for production applications?
A 95-99% accurate model where you can’t predict the failure cases is more dangerous than a 60% accurate model with known failure modes. If you don’t know when errors will occur, you can’t build reliable systems around the model. With predictable failure patterns, you can implement appropriate guardrails, human review processes, or fallback systems.
This insight is driving improvements in model calibration—training models to better signal their uncertainty rather than just maximizing accuracy. Modern models hallucinate less because they’re better calibrated about what they know, though detecting hallucinations purely from internal traces remains challenging.
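If you log a confidence score and a correctness flag for each item in your evaluation set, you can measure calibration directly. A minimal sketch, assuming you have those two arrays; the binning scheme is the standard expected-calibration-error recipe, not anything specific to the models discussed here.

```python
# Minimal sketch: expected calibration error over (confidence, correct) pairs.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between average confidence and actual accuracy in this bin,
            # weighted by how many examples fall into the bin.
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 1, 0, 1]))
```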
How does grounding models with web search help with hallucinations?
Web search provides citations that users can verify, adding a layer of accountability. Models can also self-correct in real-time—for example, a model might confidently give wrong advice, then perform a web search and revise: “Actually, I was wrong. According to the search results, this is how you do it.”
However, this introduces new challenges. Models need to evaluate source reliability (like checking whether scientific papers have been retracted) and determine when to invest time in cross-referencing multiple sources versus giving quick answers. The effort-versus-accuracy tradeoff becomes important—10 seconds might be fine for shopping advice, but medical questions warrant 10-20 minutes of research mode with extensive cross-referencing.
Tools and Interpretability Infrastructure
What interpretability tools exist today for developers?
Anthropic has open-sourced tools that function like a “microscope” for models. These allow you to trace through model execution and inspect neuron activations, similar to how a debugger shows variable states in traditional programming. You can observe which neurons activate for different concepts during inference—for example, seeing that both “I definitely know this person” and “this person is a basketball player” neurons are active simultaneously.
These tools work on open-source models and enable hypothesis testing by activating or deactivating specific neurons and observing behavioral changes. However, the technology is still early stage—at Anthropic, these tools are primarily used by the interpretability research team itself, not yet widely deployed to other teams building on the models.
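The "deactivate a neuron and watch what changes" half of that workflow can be approximated on a small open-source model with a forward hook that zeroes a unit during generation. The model, layer, and neuron index below are arbitrary assumptions for illustration; this is not the released tooling itself.

```python
# Minimal sketch: ablate one hypothetical MLP unit and compare the output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

NEURON = 300  # hypothetical unit suspected of carrying a concept

def ablate_hook(module, inputs, output):
    output[..., NEURON] = 0.0  # turn the unit off everywhere
    return output

handle = model.transformer.h[5].mlp.register_forward_hook(ablate_hook)
ids = tok("Michael Jordan played the sport of", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=5, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
handle.remove()
```

Run it with and without the hook attached; a behavioral change after ablation is the kind of evidence the hypothesis-testing loop described above relies on.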
Will we eventually have debugger-like tools for AI models?
That’s the vision. The goal is to create inspection tools that allow developers to see the internal state of a running model, much like a traditional debugger. We’re getting closer to being able to inspect model execution and identify active concepts (e.g., “the ‘this person is a basketball player’ neuron is on”).
The major remaining challenge is what to do with that information. In traditional programming, a debugger helps you find a specific line of code to fix. With current models, even if you identify a flawed internal mechanism, the only recourse is to indirectly influence it by changing training data or curriculum—a much less precise process than fixing code. We’re “still in the early days” of making these tools practical for everyday development.
Practical Guidance for AI Teams
What should developers building AI applications understand about model internals?
Three critical insights emerge from interpretability research:
First, models can be understood—don’t treat them as impenetrable black boxes. We can look inside and find rich, complex structures that go far beyond simple “next-token prediction” or “pattern matching.”
Second, your intuitions about how models work are likely wrong in important ways. Models plan ahead, maintain language-agnostic concepts, and perform multi-step reasoning entirely within their weights. When you see weird mistakes or inconsistent behavior, it’s often because the model “picked the wrong algorithm” from multiple internal approaches it’s running simultaneously.
Third, while this is early-stage research without immediate production benefits, developers can experiment with open-source tools on smaller models to develop better intuitions about model behavior. This won’t immediately improve your applications, but it builds understanding that can inform everything from prompt design to system architecture decisions.
Based on understanding of model internals, what’s the recommended approach for post-training work?
The core insight from interpretability research is that models have immense internal complexity and capacity—they can execute multiple solution algorithms simultaneously for any given problem. Your job is to effectively elicit that latent capability.
Start by building a solid evaluation suite. This is the most common point of failure. Even spending just an afternoon writing 100 examples of desired and undesired outputs provides essential grounding. Most teams skip this step, leading to “vibes-driven” tuning that doesn’t measurably improve performance.
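A minimal sketch of what that afternoon's worth of evals can look like: a JSONL file of cases and a loop that grades them. The call_model wrapper and the substring check are hypothetical placeholders for your own model call and grading logic.

```python
# Minimal sketch of an eval harness: one JSONL case per line,
# e.g. {"prompt": "...", "must_contain": "..."}.
import json

def call_model(prompt: str) -> str:
    """Hypothetical wrapper: replace with your model or API call."""
    raise NotImplementedError

def run_evals(path="evals.jsonl"):
    passed = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            output = call_model(case["prompt"])
            ok = case["must_contain"].lower() in output.lower()
            passed += ok
            total += 1
            if not ok:
                print("FAIL:", case["prompt"][:60])
    print(f"{passed}/{total} passed")
```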
Next, exhaust prompt engineering and context engineering (like RAG or GraphRAG) before considering fine-tuning. Modern models have ample latent capacity, so you can get remarkably far by simply providing the right context and instructions; the practical task is to elicit the right behavior consistently.
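Context engineering often amounts to assembling a grounded prompt before the model is ever touched. A minimal sketch, where retrieve() is a hypothetical stand-in for whatever retrieval layer (vector store, search API, GraphRAG query) you already have:

```python
# Minimal sketch: inject retrieved passages into the prompt instead of fine-tuning.
def build_prompt(question, retrieve, k=3):
    passages = retrieve(question, k=k)  # hypothetical retrieval call
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Example with a stubbed retriever:
print(build_prompt("What is our refund policy?",
                   retrieve=lambda q, k: ["Refunds are issued within 30 days."]))
```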
Only resort to fine-tuning when everything else fails—which is increasingly rare as base models improve. Use it when prompt/context approaches plateau against your evaluations, or when you need domain style, safety constraints, or task formats that are hard to elicit reliably. Treat fine-tuning as an escalation, not a first resort.
How should teams think about the inconsistent behaviors and mistakes they see in production?
Interpretability research shows models often solve problems using multiple parallel approaches—essentially running three different algorithms simultaneously. When you see weird mistakes or inconsistent behavior, it’s often because the model “picked the wrong algorithm this time” from its internal repertoire.
Your evaluation suite should isolate these different modes—edge cases, adversarial contexts, ambiguity—so you can steer algorithm selection via prompts, system instructions, retrieval content, or if necessary, targeted fine-tuning. Understanding that this internal complexity exists helps explain why the same prompt sometimes works and sometimes fails, or why fine-tuning on specific examples might not generalize as expected.
What can teams experiment with today if they’re curious about model internals?
Download available open-source interpretability libraries and grab a small open-source model. Ask it targeted questions, inspect concept activations, turn specific concepts on or off, and observe how behavior changes. You won’t ship these probes to production directly, but they sharpen your mental model.
Apply the same tenacity and curiosity you’d use debugging a traditional program to understanding these systems. However, don’t expect production-ready debugging tools yet—that’s a 6-12 month horizon at best. The value today is in developing better mental models of how these systems actually work, which can inform your engineering decisions.
If you had to give three key takeaways for developers, what would they be?
- LLMs are complex but intelligible—treat them as systems you can study, not inscrutable black boxes. Move beyond oversimplified mental models.
- Your mental model is likely wrong in key ways—models plan ahead, carry rich language-agnostic concepts, and their written “reasoning” can disagree with internal computation.
- For practical work: build evaluation suites first, exhaust prompt/context/grounding approaches, use interpretability tools to refine understanding, and escalate to fine-tuning only when justified by evaluation gaps.
