Raiza Martin on NotebookLM Origins, Audio-First AI, Privacy Concerns, and Consumer Companionship.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
Raiza Martin, co-founder of Huxe and former leader of Google’s NotebookLM team, discusses leaving Google at the height of NotebookLM’s success to build an audio-first personal AI companion. The conversation explores Huxe’s vision of creating personalized interactive audio experiences for users’ daily routines, the challenges and opportunities of voice-based AI interfaces, privacy considerations when handling personal data, and the emerging trend of AI being used more for personal companionship than traditional productivity tasks. [This episode originally aired on Generative AI in the Real World, a podcast series I’m hosting for O’Reilly.]
Interview highlights – key sections from the video version:
- Leaving Google NotebookLM at Its Peak
- NotebookLM Explanation for Newcomers
- The Need for Personal Daily Podcasts
- Audio-First, Mobile-First Approach
- Voice Format Innovations and Modalities
- Learning from Smart Speakers’ Limitations
- Privacy and Surveillance Concerns
- Privacy-First Data Handling Approach
- Managing Filter Bubbles and Content Diversity
- The Boundary Between Content and Companionship
- External Models and Privacy Considerations
- Measuring Success and User Engagement
- What Foundation Model Builders Should Focus On
- Prediction: AI Companionship is Bigger Than We Think
- Model Stickiness and the Data Lock-In Effect
Related content:
- A video version of this conversation is available on our YouTube channel.
- Yishay Carmiel and Roy Zanbel: Why Voice Security Is Your Next Big Problem
- Your AI playbook for the rest of 2025
- “Massive Scrum” of Models: New Data on China’s AI Gold Rush
- The Knowledge Work Agent Ecosystem
- Hjalmar Gislason: Unlocking Spreadsheet Intelligence with AI
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
Founding Story & Vision
Why did you leave Google NotebookLM at the height of its success to start Huxe?
We realized AI is really blowing up, and the speed at which you can build and iterate at a startup is fundamentally different. Seven months after leaving in December, I think we made the right choice. Working on NotebookLM was incredible, and it was a pleasure to take it from an idea to something that resonated so widely. But at a startup, we wanted to experience that speed and try building products from a different angle: not just solving acute pain points, but building specifically delightful experiences for people. The scrappy iteration loop of a startup lets us chase product spaces that would have been much harder to explore inside a big company.
For those unfamiliar, what is NotebookLM and what were its key use cases?
NotebookLM provides contextualized intelligence: you give it sources you care about, and it stays grounded in those sources. You upload materials like class notes, research papers, or user manuals, and it becomes an expert grounded only in that information. A prominent use case was students uploading their class materials and instantly having an expert in all of them that they could talk with back and forth. It resonated strongly because it provided personalized, grounded AI assistance. A great power-user tip is to create a notebook for your home and upload all the user manuals for your appliances.
What unmet need does Huxe address with a “personal daily podcast”?
Rather than solving acute pain points, we approached this by asking: what if we could build specifically delightful things for people? We mapped out people’s daily lives and looked for the dull moments, like commutes, lines, and walks, where we could make something more interesting or useful. We landed on commute time, when people are sitting in traffic or walking somewhere. The core provocation was: what if we could create really personalized, novel audio content for that time? We started with basics like news, email, and calendar summaries, but we see a huge opportunity in that space.
Audio-First Design Philosophy
Why choose an audio-first approach over text-based interfaces?
You learn fundamentally different things when you change the modality. I noticed this from my own use of ChatGPT’s voice mode: when I go on walks and talk about my day, it’s a very different interaction from typing, a much more personal one. With NotebookLM, the audio format changed the types of sources people uploaded, too. By focusing on voice in, voice out, we’ll learn different use cases from the chat use cases everyone is experiencing now. Audio lets people engage hands-free during transit or walks, and it can unlock sources and interaction patterns that text interfaces rarely touch. Voice is still a largely untapped modality, and we think there’s a lot to learn about how people want to interact with AI this way, which will lead to new products.
What different audio formats are you exploring?
I think about this in two categories: passive and interactive formats. With passive formats, you can play with what the content is about and how flexible it is: how short, long, or fast it is, and how malleable it is to user feedback.
Interactive formats are where you want to participate: maybe you’re listening to an audio stream but want to jump in, have friends join, or even publish a conversation you had with an AI. Both concepts are new, particularly in AI. We don’t have established terms for these formats yet, but I think they’ll emerge in the next 5-10 years, maybe sooner, and they’ll be fundamentally different from what we use chat or video for.
What lessons did you take from the smart speaker era?
Two key lessons. First, the technology wasn’t ready for what people really wanted to do; it’s hard to see how that could have worked without AI as we know it today. The capabilities were too narrow.
Second, audio has no UI – the learning curve was steep. You had limited technology that only did 5-10 things, with no UI telling you what those things were. A smart speaker is a physical device, but it doesn’t tell you what to do. When you combine limited technology with a non-existent UI, users don’t know what to use it for, so it defaults to being a timer. Today’s AI can do much more, so maybe users can just try things and it will work, creating a delightful experience, but it still requires user input. The challenge remains to build a system so supportive that the user doesn’t have to figure out how to use it.
Privacy & Data Management
How do you handle privacy concerns, especially compared to always-listening smart speakers?
Smart speakers presume they’re always on with wake words, which feels creepy: users know that if they say something, it’s probably logged. With Huxe being an app, you have to turn it on to use it. That’s clunkier, but it lets us push boundaries carefully. We’re seeing more tools that are always on and listening, like Granola and Cluely. You can get the most utility from something that’s always on, but I’m not sure consumers are ready for that yet, so we’ll push the boundary of ambient capture carefully, matching consumer readiness rather than leaping past it.
What’s your approach to handling sensitive user data like email and calendars?
We’re very privacy-focused and privacy-first, similar to our approach with NotebookLM, where we didn’t train on user data. Email and calendar data is really sensitive, so we use it to improve your personal experience, not to train global models or improve the experience for all users. We’re building personalized recommendation models where the utility of your data doesn’t go beyond your own usage of the app. It’s harder in the short term, but it’s more privacy-respecting and builds user trust.
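To make that concrete, here is a minimal, hypothetical sketch (the names and structure are illustrative, not Huxe’s actual system): each user’s feedback updates only their own preference weights, and nothing is pooled into a shared model.

```python
from collections import defaultdict


class PerUserRanker:
    """Toy per-user recommendation model: feedback from one user only ever
    changes that user's own weights; there is no shared/global model."""

    def __init__(self) -> None:
        self._weights: dict[str, dict[str, float]] = defaultdict(dict)

    def record_feedback(self, user_id: str, topic: str, liked: bool) -> None:
        # Update only this user's weights for the topic.
        w = self._weights[user_id]
        w[topic] = w.get(topic, 0.0) + (1.0 if liked else -1.0)

    def score(self, user_id: str, topic: str) -> float:
        # Scoring never looks at other users' data.
        return self._weights[user_id].get(topic, 0.0)

    def delete_user(self, user_id: str) -> None:
        # Deleting a user removes everything the system learned from them.
        self._weights.pop(user_id, None)
```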
How do you provide services like flight status updates without compromising privacy?
I think about it like a human assistant in your pocket: if I were there personally and wanted to help you, what would I do? If I saw you have a flight at 4 PM, I could check that flight without leaking your personal information, just by looking up the flight number. There are many cases like that, such as looking up a flight by number or checking traffic, where you can provide utility without exposing personal identifiers to another service. We try to be action-oriented while staying respectful of user privacy.
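As a rough illustration of that pattern (the function and status endpoint below are hypothetical, not Huxe’s implementation), the assistant can parse the flight number out of a calendar event locally and send only that token, never the event itself, to a public lookup service:

```python
import datetime
import re

import requests

FLIGHT_NUMBER = re.compile(r"\b([A-Z]{2}\d{2,4})\b")


def flight_status_from_event(event_title: str, event_date: datetime.date) -> dict | None:
    """Extract a flight number locally and query a status service with only
    that number and the date; names, emails, and event text stay on-device."""
    match = FLIGHT_NUMBER.search(event_title)
    if not match:
        return None
    flight_number = match.group(1)
    # Hypothetical public status endpoint; only two non-identifying fields are sent.
    resp = requests.get(
        "https://flight-status.example.com/v1/status",
        params={"flight": flight_number, "date": event_date.isoformat()},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```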
How do you maintain privacy when using external foundation models?
When we use commercial models, we specifically opt for providers that don’t train on the data we send them; that’s a P0 requirement for us.
Technical Architecture & Working with Foundation Models
How do you handle context and memory to create a continuous experience?
People expect AI to remember past conversations, so we build a durable context layer at the application layer. We couldn’t pass everything to the model anyway because of context window limitations, so we store conversation history ourselves, figure out what the user is trying to talk about, and pass only the minimal, necessary subset to the model for that specific task.

It’s similar to how human memory works: in a given conversation, you only retrieve the parts of your knowledge that are relevant. That context layer lives at our application layer; it isn’t handed wholesale to the models.
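Here is a minimal sketch of that idea, with hypothetical names and a simple tag-overlap relevance score standing in for whatever retrieval a real system would use: the full history lives in an application-side store, and only a small relevant slice is selected for each turn.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    topic_tags: set[str] = field(default_factory=set)


@dataclass
class ContextStore:
    """Application-layer memory: the full history stays here, never in the prompt."""
    items: list[MemoryItem] = field(default_factory=list)

    def remember(self, text: str, tags: set[str]) -> None:
        self.items.append(MemoryItem(text, tags))

    def slice_for_turn(self, turn_tags: set[str], limit: int = 5) -> list[str]:
        """Pick only the few memories relevant to what the user is talking about now."""
        relevant = [m for m in self.items if m.topic_tags & turn_tags]
        relevant.sort(key=lambda m: len(m.topic_tags & turn_tags), reverse=True)
        return [m.text for m in relevant[:limit]]
```

Only the strings returned by `slice_for_turn` would be appended to the model prompt for that turn; everything else stays in the store.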
What improvements would you like to see from foundation model providers?
Three main areas. First, latency: for a natural, human-like conversation where you can interrupt each other, we need to drive latency down significantly.
Second, better tool calling, and critically, parallelizing it with voice model synthesis; that’s essential for a good user experience (a rough sketch of the parallelization idea follows below).
Third, diversity of voices, languages, and accents. It sounds basic, but it’s actually quite hard and very important.
These are well-known needs that will probably take the rest of the year to see more progress on.
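To illustrate the parallelization point, here is a minimal asyncio sketch with hypothetical stand-in functions (not any provider’s actual API): the slow tool call runs concurrently with a spoken acknowledgement, so the user hears something immediately instead of waiting in silence.

```python
import asyncio


async def fetch_tool_result(query: str) -> str:
    """Stand-in for a slow tool call (search, calendar lookup, etc.)."""
    await asyncio.sleep(1.2)
    return f"Here is what I found for {query!r}."


async def speak(text: str) -> None:
    """Stand-in for streaming text-to-speech playback."""
    await asyncio.sleep(0.8)
    print(f"[audio] {text}")


async def answer(query: str) -> None:
    # Kick off the tool call and a spoken acknowledgement at the same time,
    # so synthesis overlaps the tool latency instead of adding to it.
    tool_task = asyncio.create_task(fetch_tool_result(query))
    await speak("Sure, let me check that for you.")
    await speak(await tool_task)


asyncio.run(answer("today's top headlines"))
```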
What about the gap between cloud and on-device models?
We’re making a lot of progress on smaller, on-device models. But when you think about what it takes for an app to support a full LLM plus a voice model on-device and still deliver a good experience, it gets hairy; most developers would abort and go back to commercial models and APIs. The technical complexity is still too high for most practical applications, so the gap remains significant.
Product Strategy & User Experience
How do you prevent creating a filter bubble without social components?
That’s a philosophical question about what a person should or shouldn’t consume. I think about it from a relevance perspective rather than an editorial one: I don’t necessarily need to show someone the opposite opinion if they skew one way. Relevance is probably enough for utility, so we bias toward giving you what’s useful, along with clear controls for “more of this, less of that.”
The key is surfacing the right controls for users to say “I like this, here’s why” or “I don’t like this, here’s why not.” We’re still learning where to inject tension or diversity and where the system should push back; user feedback loops will shape that.
What is the line between an AI assistant and an AI companion?
Fundamentally, I’m not sure the question matters from a capabilities perspective; users will use a tool however they intend to. I see people who treat ChatGPT as a best friend and others who treat it like an employee. The capabilities overlap, so the distinction is mostly packaging and which use cases you prioritize. Users decide whether something is a friend, a coach, or an employee; our job is to make the paths they choose excellent rather than force a role.
For a builder, the question helps you decide which set of use cases to prioritize and make work extremely well. Beyond simple conversation, the capabilities of an assistant and a companion will start to diverge. We’ll likely build the shared capabilities first.
Metrics & Early Stage Product Development
How do you measure success in the early stages?
I think about metrics differently in the super early stages. Of course I care about basic quality metrics like onboarding rates and whether people can actually try the product. But early on there’s a lot of curiosity-driven “tourist traffic” that’s hard to interpret from raw metrics, so you have to separate curiosity from durable value.
For knowing whether you’ve crossed the chasm, I look for two types of people: those who really love it and rave about it (even if you can barely understand why), and those who hate it after actually using it. The strong negative reaction usually comes from a high expectation that wasn’t met, which tells you what they really wanted the product to do. These two groups help you triangulate how your product looks to the outside world and whether the market is ready for something this weird.
With NotebookLM, we saw that students “got it” instantly, even before ChatGPT existed, because their need was so great. That’s the first chasm to cross: finding people who understand the product’s potential and feel strongly about it. The goal isn’t to measure if it’s a “hit,” but rather to measure if the market is ready for something like this.
Future of Consumer AI
What’s your prediction about consumer AI that people would find surprising?
More people use AI for companionship than we think, but not in the ways we imagine. With almost everyone I talk to about their ChatGPT or Claude usage, the utility is very personal. The work use cases are well-known and somewhat finite, but the personal use cases are where there’s a huge area for discovery.
For example, I use ChatGPT as my running coach: it ingests my running data, creates plans, and I use voice mode while running. I don’t know where to slot that. It isn’t productivity, but I’m not talking to it like my best friend; it’s a coach. More people are doing these complicated personal things that sit closer to companionship than to traditional enterprise use cases. People are becoming more curious, and AI is a natural outlet for that curiosity. These are sticky, high-frequency touchpoints that don’t fit classic productivity buckets.
Do these personal relationships create model stickiness?
Absolutely. I’ve talked to ChatGPT so much that there’s no way for me to port my data; it would be really inconvenient. UX quality plus accumulated personal history creates data lock-in. The personal use cases are stickier than the work use cases because people become more curious and have immediate access to knowledge, not just information. If you want to win somebody’s customer away, you have to solve the data lock-in problem and port their context and memories seamlessly.
Was Character.AI ahead of its time?
Yes. Companionship-style interactions are mainstreaming; Character.AI tapped that vein early.
