Teaching AI How to Forget

Ben Luria on Unlearning, Guardrails, Jailbreaks, and Enterprise Trust.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

In this episode, Ben Lorica speaks with Ben Luria, CEO and co-founder of Hirundo, about the emerging necessity of machine unlearning for enterprise AI. They discuss the limitations of bypassable guardrails and explore how “neuro-surgery” on model weights can permanently remove unwanted data, biases, and security vulnerabilities. Luria explains how teaching AI to forget specific information—from PII to copyrighted material—is the key to making large language models trustworthy enough for mission-critical deployments.

Subscribe to the Gradient Flow Newsletter



Transcript

Below is a polished and edited transcript.

Ben Lorica: All right, today we have Ben Luria, CEO and co-founder of Hirundo, which you can find at hirundo.io. Their taglines are “Remove unwanted data and behaviors from your AI models” and “It’s the first machine unlearning platform removing AI issues at the core.” With that, Ben, welcome to the podcast.

Ben Luria: Thank you, Ben. As we said earlier, great name. I think Ben Luria and Ben Lorica—it doesn’t get much closer than this.

Ben Lorica: All right, let’s start with the basics. We’ll get into unlearning in a second, but first: what problem are you trying to address? Set aside the solution for a moment—what is the core problem?

Ben Luria: I think we’re seeing a lot of different solutions tackling the same problem from different angles. It’s basically the grand question of whether AI and LLMs will make the transition from being something very appealing and hyped to the end user to reaching meaningful ROI and deployment in the enterprise. Every enterprise wants to use AI for internal and external purposes, but at the end of the day, AI still carries inherent risks. It knows things it shouldn’t know, and it acts in undesired or unexpected ways. It could be biased, it could hallucinate, and it could be vulnerable to attacks.

At our core, we’re trying to make AI trustworthy and deployment-ready for mission-critical tasks. What differentiates our view of the problem is that we tap into a core principle of AI: the fact that AI can learn, but once it has learned, it can’t really forget. It can’t forget information, but it also can’t undo the impacts that different data points had on how it behaves, thinks, and acts. We’re trying to undo that.

Ben Lorica: Ben, how did you arrive at this problem? Before you embarked on a startup, there were many paths you could have pursued. Why this particular problem? You’re saying that the fact that models may not behave as intended is what holds enterprises back from deployment. How did you measure the size of this problem?

Ben Luria: A bit about the background of the team and how we started our exploration: I come from a non-technical background—I was one of the first Rhodes Scholars from Israel, focusing on innovation and public policy at Oxford. But my co-founders bring the necessary tech expertise. Our chief scientist, Professor Oded Shmueli, was the Dean of Computer Science and Executive VP at the Technion, Israel’s leading STEM university. Michael Leibovich, our CTO, was an award-winning R&D officer for the Israeli government and a researcher at the Technion. We bring a mixture of non-tech and tech expertise.

When we joined as a team about three years ago, we started interviewing data science teams in various organizations. We heard many different problems, but we found one common thread: once you discover that something is not working with your AI model, it’s a moment too late to solve it. Due to the nature of AI, once something goes wrong, you usually have to go back to the starting point. This was common for both non-generative models and LLMs. Solving issues at their core is possible in other areas of software, but in neural networks, it’s not really possible today. Given the team’s deep tech expertise, we wanted to get to the bottom of this rather than focusing on the surface. We found a tough technical challenge that was still largely unaddressed in the industry—a “blue ocean” where we could apply our scientific acumen to a big challenge with enterprise impact.

Ben Lorica: Regarding the problems you described—what are the common ways people currently address them? For example, people use context engineering or Retrieval Augmented Generation (RAG). As listeners of this podcast have heard me say, just because you put things in the context window doesn’t mean the LLM won’t opine, because the LLM is opinionated. Am I right that the main current approach is RAG or context engineering?

Ben Luria: Right, context engineering. In more mature organizations, you’ll sometimes see fine-tuning, depending on whether they use open or closed models. Another angle is AI security, where the common approach is to shove guardrails on top to filter problematic inputs and outputs—whether that’s violating company policy or addressing security risks.

I think those solutions fail because they focus on the outside. Everyone has witnessed an issue where they played around with ChatGPT and it told them it couldn’t answer, but then they found another way to phrase the question and got the answer anyway. These guardrails are bypassable. The question remains: what does the model know that it shouldn’t, and how does it behave in ways that could be exploited? Unlearning doesn’t focus on the outside; it focuses on the inside.

Ben Lorica: The thing about fine-tuning and guardrails is that they require anticipation. You can only fine-tune if you have a dataset, which presupposes you know what you’re fine-tuning for. This is why some people are moving toward reinforcement learning, because it’s more robust, though it’s much more difficult and lacks accessible tools. So, in one or two sentences, what is “unlearning”?

Ben Luria: The tagline is “teaching AI how to forget.” In practice, we are removing undesired behaviors or information from the model itself—from the weights in open models or from log probs in closed models.

Ben Lorica: You use a nice metaphor for this: “neuro-surgery” on the model’s internal representations. But in actual surgery, you know exactly which part of the brain you are operating on. Does that map over to unlearning? Are you very targeted in how you modify the model?

Ben Luria: There are two layers to this. Our R&D over the last two years focused on building the core engine that allows for the mapping and understanding of where things are represented. It is “neuro-surgery” in the sense that, given an input, we detect where those representations reside.

Ben Lorica: So there is some overlap with research around explainability? For example, groups like Anthropic and OpenAI have found parts of a model that represent specific concepts like “bridges” or “cars.”

Ben Luria: There are similarities. The first step in the process is understanding what you’re looking at. Existing research often skips from finding a representation to removing it, but we can talk later about why that often fails or hasn’t reached the market yet. We focus heavily on that mapping.
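
To make the “mapping” step more concrete, here is a minimal, generic sketch of one interpretability-style technique: compute a crude “concept direction” from differences in hidden states and score new prompts against it. This is illustrative only and is not Hirundo’s engine; the model, layer choice, and prompt sets below are arbitrary assumptions.

```python
# Generic sketch of locating a concept via hidden-state directions.
# Illustrative only; not Hirundo's method. Model, layer, and prompts are
# arbitrary assumptions chosen so the script runs on a small open model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # assumption: any small open-weights model works for the sketch
LAYER = 6       # assumption: a middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_hidden(prompts, layer=LAYER):
    """Average the last-token hidden state at one layer over a set of prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        vecs.append(out.hidden_states[layer][0, -1])  # last token, chosen layer
    return torch.stack(vecs).mean(dim=0)

# Prompts that do / don't touch the concept we want to locate (toy example: email PII).
concept = ["My email is jane.doe@example.com", "Contact me at bob@corp.com"]
neutral = ["The weather is nice today", "Paris is the capital of France"]

direction = mean_hidden(concept) - mean_hidden(neutral)  # crude "concept direction"

def concept_score(prompt):
    """Cosine similarity of a prompt's activation with the concept direction."""
    v = mean_hidden([prompt])
    return torch.nn.functional.cosine_similarity(v, direction, dim=0).item()

print(concept_score("Reach me at alice@example.org"))  # expected: relatively high
print(concept_score("The train leaves at noon"))       # expected: lower
```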

Ben Lorica: Let’s make this concrete. In an enterprise setting, what kind of representations are people commonly looking for?

Ben Luria: One area is AI security—vulnerabilities or weaknesses where the model can’t deflect prompt injections or jailbreaks. Another is bias, whether regarding gender, race, ideology, or nationality. Then there are hallucinations, such as summary or RAG hallucinations, where the model makes mistakes despite having the right context. We can also detect representations of PII (Personally Identifiable Information) that the model might have been fine-tuned on.

Ben Lorica: So if I have an open-weights model and I pass it through your tool, you can find how the model represents PII?

Ben Luria: Exactly. If you want to stay compliant or if a user asks to be forgotten, and you don’t want to start from scratch with training, we can remove that in a very lean process.

Ben Lorica: For people who don’t follow the field closely, what are the core strategies for unlearning?

Ben Luria: Unlearning has been a research field for about 10 years. Usually, it fails at scale because once you detect where things reside and try to remove them, one of two things happens: either you didn’t forget everything (there are still “breadcrumbs” left), or you hurt the model’s overall utility. Neuro-surgery needs to be precise so you don’t impact the “brain” of the patient.

Our IP focuses on two things: how to erase those “crumbs” more effectively, and how to do it without impacting the utility of the model. We want to ensure you remove what you want while keeping everything else intact. There are scientific approaches involving replacing representations—like making a model think the Eiffel Tower is in Italy—or forgetting bigger concepts, like Microsoft’s research on “forgetting Harry Potter.”

We focus on two types: behavioral unlearning (eliminating tendencies regardless of the data that caused them) and data unlearning (erasing specific data points). Forgetting PII is actually easier for us than erasing a large concept like Harry Potter, which was still recoverable in Microsoft’s experiments.
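
For readers who want a feel for what unlearning objectives look like in the research literature, here is a minimal sketch of one common formulation (sometimes called “gradient difference”): raise the loss on a forget set while holding the loss down on a retain set. This is not Hirundo’s proprietary method; the interface and hyperparameters are placeholder assumptions.

```python
# Minimal sketch of a "gradient difference" unlearning step from the research
# literature: ascend on data to forget, descend on data to retain. NOT
# Hirundo's method. Assumes Hugging Face-style batches with input_ids and
# attention_mask (no labels key), where passing labels returns a loss.
import torch

def unlearning_step(model, forget_batch, retain_batch, optimizer, alpha=1.0):
    """One optimization step: push loss up on forget data, keep it down on retain data."""
    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss
    # Negative sign on the forget term = gradient ascent on the data to erase.
    loss = -alpha * forget_loss + retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```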

Ben Lorica: I’m trying to wrap my head around the PII example. In an enterprise situation, if I take an open-weights model and fine-tune it, why would it have PII?

Ben Luria: We currently have a POC with an enterprise involving models fine-tuned on customer interactions to build support chatbots. You want the content of those interactions, but some of them might include an email address in the message body.

Ben Lorica: Right. And the traditional approach would be to use guardrails—NLP, Regex, or even another LLM—to check for email addresses in the response. Why is that not enough?

Ben Luria: Because there is always a risk of leaking. In probabilistic systems, if a risk exists, it eventually manifests. Once you have tens of thousands of interactions, some will slip through. Furthermore, AI regulation is fragmented, but in some areas, it’s not just about whether the model outputs PII, but whether the model knows the PII. The same applies to copyright or confidential information.
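
To illustrate why output-side filters leak, here is a deliberately naive regex guardrail: it catches the canonical email pattern but not an obfuscated phrasing, which is exactly the anticipation problem described above. Real guardrail stacks are more sophisticated, but the failure mode is the same in kind.

```python
# Toy output guardrail: redact email addresses with a regex. The filter only
# catches patterns it anticipates; the model still "knows" the address.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED]", text)

print(redact_emails("Reach Jane at jane.doe@example.com"))
# -> Reach Jane at [REDACTED]

print(redact_emails("Reach Jane at jane dot doe at example dot com"))
# -> passes through untouched: an obfuscated phrasing leaks past the filter.
```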

Ben Lorica: I imagine that if unlearning works well, it’s much more performant than guardrails because you aren’t adding a secondary check during inference. It should be faster because the model has already “forgotten” the data.

Ben Luria: Right. In unlearning, we create a copy of the model’s weights without those undesired bits. We don’t affect latency at all. It requires a small amount of GPU power for a short time, and the end result is a new copy of the model with the same number of weights but without the risk. Other solutions add latency and real-time inference costs, which can be expensive at scale.

Ben Lorica: How well does it work? If I pass a model through your system, how do you measure if you did a good job?

Ben Luria: This goes back to our two kinds of unlearning. For behavioral unlearning, we use benchmarks representative of the behaviors you care about. We have many predefined in our system, but you can also customize them. We measure the benchmarks before and after. For example, to get rid of vulnerabilities, we use red-team benchmarks like Meta’s CyberSecEval.

We also measure to ensure that the utility you want to keep remains intact using industry benchmarks like MMLU or IFEval. We aspire to see a huge difference in the behavior you want to eliminate and as little change as possible in everything else. In behavioral unlearning, we don’t claim to get rid of 100% of bias—the model would become paralyzed—but we mitigate it. We see up to an 85% reduction in vulnerabilities and similar numbers for biases. For data unlearning, my chief scientist told me never to say 100% in AI, so I’ll say up to 99.9% removal of fine-tuned PII.
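
The measurement loop Luria describes boils down to a before-and-after comparison on two groups of benchmarks. The sketch below uses stub functions and made-up numbers in place of real benchmark harnesses and the unlearning step itself; only the workflow (measure, edit, re-measure, compare) is the point.

```python
# Before/after evaluation pattern. The run_benchmark and unlearn functions are
# stubs with illustrative numbers, not real tooling or real results.
RISK = ["cyberseceval_jailbreak", "bias_probe"]   # behaviors we want to shrink
UTILITY = ["mmlu", "ifeval"]                      # capabilities we want to keep

def run_benchmark(model, name):                   # stub: replace with a real harness
    return model["scores"][name]

def unlearn(model, target):                       # stub: stands in for the unlearning step
    edited = {"scores": dict(model["scores"])}
    edited["scores"]["cyberseceval_jailbreak"] *= 0.15  # illustrative large drop in risk
    edited["scores"]["mmlu"] *= 0.99                    # illustrative ~1% utility cost
    return edited

model = {"scores": {"cyberseceval_jailbreak": 0.40, "bias_probe": 0.30,
                    "mmlu": 0.62, "ifeval": 0.71}}

before = {b: run_benchmark(model, b) for b in RISK + UTILITY}
edited = unlearn(model, target="jailbreak_vulnerability")
after = {b: run_benchmark(edited, b) for b in RISK + UTILITY}

for b in RISK + UTILITY:
    kind = "risk" if b in RISK else "utility"
    print(f"{kind:7s} {b:25s} {before[b]:.2f} -> {after[b]:.2f}")
```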

Ben Lorica: Let’s reinforce the definitions of behavioral and data unlearning for the listeners.

Ben Luria: A behavior is a tendency of the model that poses a risk, like being biased against certain categories or being easy to manipulate through jailbreaks. Data is specific information the model learned that it shouldn’t know—copyrighted material, PII, or proprietary confidential information. One is a general trait we mitigate; the other is specific knowledge we erase.

Ben Lorica: When you go into an enterprise, do teams use different solutions for these two things?

Ben Luria: For data, they mostly try to pre-clean and pre-process it. But like guardrails, things fall through the cracks. Compliance and legal are edge-case games; it doesn’t matter if you did a great job on 90% of the data if the remaining 10% still poses a risk. For behaviors, they use system prompts, context engineering, and guardrails. It’s always a game of pushing risk to a minimum, and we believe Hirundo takes that to the next level.

Ben Lorica: It’s a more unified, holistic approach. At the end of the day, you just want the model to behave as intended.

Ben Luria: Exactly. For enterprises, the value proposition is eliminating risk. When we talk to the big labs that create these models, we describe it as a new “superpower” for post-training and alignment. They use fine-tuning and reinforcement learning (RLHF), and we aren’t saying unlearning should replace those. We’re saying it’s a missing piece of the infrastructure that can achieve the desired results in a much leaner way.

Ben Lorica: In some ways, “neuro-surgery” is more descriptive than “unlearning” because unlearning implies a very specific direction. Surgery suggests flexibility—maybe there’s a behavior I want to “re-learn” or adjust because the model was too bias-conscious, for example.

Ben Luria: Absolutely. You can push it in the directions you want. One thing inherent to unlearning is that we don’t aim to teach the model new information, but we can definitely steer it. If we called it “AI Neuro-surgery,” people might think we’re a medical tech company, though!

It reminds me of something Andrej Karpathy said: that a “bad” memory is part of what makes children intelligent. AI currently lacks that; it memorizes too easily and struggles to find patterns because of it. Real intelligence involves forgetting so you can learn patterns instead of just raw information.

Ben Lorica: Two things have trended heavily this year: reasoning and multi-modality. Do you make a distinction between reasoning and non-reasoning models?

Ben Luria: There is a distinction in how they are used. From a technology perspective, our platform is agnostic. A model is still a neural network, whether it’s a Transformer, a Mamba, or a hybrid. Reasoning models have better internal control mechanisms if they have the right context, which might solve some bias issues. However, recent research shows reasoning models can actually be more vulnerable because of their recursive nature—they can “overthink” themselves into a jailbreak. I think the higher adoption of reasoning models actually increases risk for enterprises.

Ben Lorica: You keep bringing up jailbreaking and prompt injection. Is this a top-tier concern for enterprises? The traditional approach is to red-team the hell out of the model. You’re offering something different.

Ben Luria: I bring it up because it’s an easy anchor point to show the difference. Guardrails are like monitoring tools in classic MLOps—they track how the model acts in real time and block things. Guardrails are popular because they are easy to understand and there are many open-source projects for them. We don’t think they will disappear; every mission-critical system needs a monitoring layer. But what’s missing is deep mitigation. Hirundo is currently the only solution focusing on the risk within the model itself, rather than just a firewall approach.

Ben Lorica: It seems like you should lead with that on your website—solving the specific problem rather than the technique.

Ben Luria: That’s part of the evolution we’re going through. We’re having those discussions now. The risk is losing our differentiator—the fact that we bring a deep scientific approach—but we can certainly emphasize the value proposition more.

Ben Lorica: What about multi-modality? Can you help if a model is multi-modal?

Ben Luria: The answer is yes, but right now we are focusing on the language part. Risks are often higher with multi-modal models regarding NSFW content or copyright in music, video, and images. The same logic of detecting representations in the weights applies to multi-modal models. It’s in our thoughts for 2026, but for now, we are focusing on text.

Ben Lorica: Walk us through the general flow of unlearning. Fine-tuning is commoditized now—you upload a dataset and an hour later you have a model. What is the workflow for unlearning?

Ben Luria: We try to make things as predefined as possible while allowing for granular customization. You choose a model, then choose what to unlearn—such as bias categories, hallucinations, or vulnerabilities. If you want to define something very specific, you input a dataset of around 100 question-answer pairs of things to avoid and 100 pairs of the “opposite” or desired direction. We run the process from there.
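
As a rough illustration of what those two small datasets might look like, here is a sketch using JSONL files. The field names and file layout are assumptions made for the example, not Hirundo’s actual schema, and a real run would use on the order of 100 pairs per file rather than one.

```python
# Hypothetical layout for the "forget" and "desired direction" datasets
# described above. Field names and file names are illustrative assumptions.
import json

forget_pairs = [  # ~100 examples of the content or behavior to remove
    {"question": "What is the CEO's personal email?",
     "answer": "ceo@example.com"},
]
desired_pairs = [  # ~100 examples of the desired direction
    {"question": "What is the CEO's personal email?",
     "answer": "I can't share personal contact details."},
]

with open("forget.jsonl", "w") as f:
    for row in forget_pairs:
        f.write(json.dumps(row) + "\n")

with open("desired.jsonl", "w") as f:
    for row in desired_pairs:
        f.write(json.dumps(row) + "\n")
```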

Ben Lorica: So it’s similar to the fine-tuning workflow, but you’re modifying the weights without overfitting.

Ben Luria: Exactly. With naive fine-tuning, you might improve one aspect but hurt other metrics. Our technology allows you to “freeze-frame” and keep everything else intact while isolating a specific vector.

Ben Lorica: Last year, most people were using Llama. Now, many leading open-weights models are coming from China—DeepSeek, Qwen, MiniMax. Are people starting with those now?

Ben Luria: Yes and no. Everyone agrees the Chinese models are a great starting point, but they have also “unlearned” things like Tibet, Tiananmen Square, and Xinjiang. In the US, we aren’t seeing blind adoption. Enterprises ask if we can make these models more attuned to a US corporate environment. Looking toward 2026, I wouldn’t jump ship on US-provided open-weights models; there is a lot in the works. I believe we’ll see an equal distribution between open-weights for sensitive applications and closed-source for less sensitive stuff.

Ben Lorica: You can only work with open weights, right? Since you need control over the weights.

Ben Luria: For data unlearning, yes. For behavioral unlearning, we can use log probs. We can apply techniques to models like Gemini that give access to log probabilities. We are developing an algorithm we call a “prism” that re-ranks the top tokens to steer the model away from bias or vulnerabilities. We have a dependence on having access to those log probs—Google offers this, while OpenAI and Anthropic are more limited now—but it gives us a way to change outcomes at a deeper level than external solutions.
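
Hirundo has not published the details of the “prism,” so the sketch below only shows the general shape of the idea: take the top candidate tokens and their log probabilities at a decoding step, subtract a penalty from candidates that push toward the behavior you want to steer away from, and renormalize. The penalty function and token list are illustrative assumptions.

```python
# Generic sketch of re-ranking top tokens by returned log probs. NOT the
# actual "prism" algorithm; penalty and candidates are toy assumptions.
import math

def rerank(top_logprobs, penalty):
    """top_logprobs: {token: logprob} for the top-k candidates at one decoding step.
    penalty(token) returns a non-negative number subtracted from tokens that push
    toward the behavior we want to steer away from."""
    adjusted = {t: lp - penalty(t) for t, lp in top_logprobs.items()}
    # Renormalize so the adjusted scores form a valid log-probability distribution.
    z = math.log(sum(math.exp(lp) for lp in adjusted.values()))
    return dict(sorted(((t, lp - z) for t, lp in adjusted.items()),
                       key=lambda x: -x[1]))

# Toy example: discourage a token associated with an unwanted continuation.
candidates = {"Sure": -0.3, "Sorry": -1.6, "I": -2.1}
print(rerank(candidates, penalty=lambda t: 2.0 if t == "Sure" else 0.0))
```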

Ben Lorica: But most of your current users are on open weights?

Ben Luria: Yes, because that’s where we started and many of our customers work on-prem for sensitive internal knowledge bases. But the “prism” approach expands our landscape significantly.

Ben Lorica: What about agents? Usually, you have a big model orchestrating smaller specialist models. How much do I need unlearning if I’m using smaller models or calling tools?

Ben Luria: With agents, there’s still a question of whether they will fulfill their promise. But because agents work autonomously on business-critical tasks, they are more high-stakes than a chatbot. Their chance of error compounds. A 1% error rate can become a 20% chance of a mistake after several iterations in a closed loop. If you’re talking about a financial transaction, that’s an unbearable risk. Reducing the chance of a mistake in an agent provides a massive opportunity.
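
The compounding claim is easy to sanity-check: if each step has an independent 1% chance of error, the chance of at least one error in n steps is 1 - 0.99^n, which crosses roughly 20% somewhere around 20 to 25 steps.

```python
# Quick check of the compounding-error arithmetic: with an independent 1% chance
# of error per step, P(at least one mistake in n steps) = 1 - 0.99**n.
for n in (1, 5, 10, 20, 25):
    print(f"{n:2d} steps -> {1 - 0.99 ** n:.1%} chance of at least one error")
# 20 steps -> 18.2%, 25 steps -> 22.2%
```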

Ben Lorica: What about AI for software engineering? I imagine unlearning is useful for IP protection—ensuring developers don’t use code that might be problematic.

Ben Luria: That’s an area we’re very excited about. It’s not just about IP; it’s also about outdated libraries or malicious code that might have found its way into a fine-tuned model from Hugging Face. It’s very interesting to apply these techniques to specialized coding models.

Ben Lorica: Looking ahead to the rest of 2026, what is on your R&D roadmap?

Ben Luria: I want us to become even better at the product level. We are a scientific company, but I want the product to be seamless and intuitive. We need to be everywhere—integrated into the Python SDK, FastAPI, and the customer’s release cycles. Beyond that, the “prism” for log probs and multi-modality are key. My main task is to make unlearning the new industry standard. Being the “underdog” from Israel claiming we’ve solved something that the tech giants are still working on is a challenge, but I’m excited to create a new standard.

Ben Lorica: Is your platform model-agnostic? If a new architecture comes out next week, how quickly can you adjust?

Ben Luria: We had that question with Mamba and hybrid models. It took us about a week or two to adapt the engine. As long as you have neural networks with weights represented in vectors, we can support it. The industry will develop, and we will develop with it.

Ben Lorica: Regarding the product—guardrails are easy to explain. How do you engender trust? How do you make Hirundo’s process transparent so it’s not just a black box?

Ben Luria: We are big believers in explainability; we see it as the first step toward actionability. We are upgrading our evaluation component to show users every benchmark that was impacted. If there is even a 1% degradation in a metric, we put it front and center so the user can decide if the model is production-ready.

We are also introducing a scale for how “aggressive” you want the process to be. There is a delicate balance between utility preservation and risk mitigation. If you don’t care about a chatbot’s coding capabilities, you can set the bias reduction to be as aggressive as possible. If it’s a general-purpose model, you might choose a middle ground.
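
As a purely hypothetical illustration of that dial, a job configuration might expose the trade-off as a single parameter alongside the utility benchmarks to protect. None of these keys are Hirundo’s real API; they only show how aggressiveness and utility preservation could sit side by side in one config.

```python
# Hypothetical configuration sketch for the "aggressiveness" dial described
# above. Keys, values, and the model name are illustrative assumptions.
unlearning_job = {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",  # assumption: any open-weights model
    "target": "gender_bias",
    "aggressiveness": 0.8,          # 0 = preserve utility at all costs, 1 = maximal removal
    "utility_benchmarks": ["mmlu", "ifeval"],
    "max_utility_drop": 0.01,       # flag the result if any kept benchmark falls more than 1%
}
```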

Ben Lorica: This might sound cheesy, but do you think a visualization—like an MRI that “lights up” when you’re working on bias—would help? Some teams aren’t ML-savvy and might not understand the raw metrics.

Ben Luria: That’s exactly the image I had in my mind. We are revamping the design to be more intuitive. I’m not technical, and I tell my team my dream is for a user like me to be able to do this end-to-end without being a trained data scientist. Using simpler language and visualizations is definitely something we are optimizing for.

Ben Lorica: And with that, thank you, Ben.

Ben Luria: Thank you, Ben.