The Data Exchange

Why Voice Security Is Your Next Big Problem


Yishay Carmiel and Roy Zanbel on Voice Cloning, Deepfake Detection, and the Future of Audio LLMs.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

In this episode, Yishay Carmiel and Roy Zanbel of Apollo Defend discuss the rapidly evolving landscape of voice AI and its emerging security threats. They explain how accessible voice cloning technology has created a new attack vector for social engineering and identity theft, making voice a unique biometric risk. The conversation covers defensive strategies like real-time deepfake detection and voice anonymization, and looks ahead to the security challenges of next-generation speech-to-speech models.

Subscribe to the Gradient Flow Newsletter






Transcript

Below is a heavily edited excerpt, in Question & Answer format.

The State of Voice AI and Foundation Models

How does the current state of voice foundation models compare to the large language model (LLM) space?

Voice AI is not yet as mature as the LLM space, where a few dominant foundation models are used off-the-shelf. However, the industry is moving in that direction. Currently, most voice applications use a “cascading model” with three separate steps:

  1. Speech-to-Text (ASR): OpenAI’s Whisper is the de facto open-source foundation model that most developers use or build upon
  2. Language Processing: An LLM processes the transcribed text
  3. Text-to-Speech (TTS): The LLM’s text output is converted back into speech

The TTS space is more fragmented, with commercial options like ElevenLabs and various open-source models. The next evolution is the move to end-to-end “speech-to-speech” or “audio LLM” models that treat speech as both input and output, using internal token representations instead of converting to text.
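The cascading architecture described above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the three stage functions below are hypothetical stand-ins, where a production system would invoke an ASR model such as Whisper, an LLM, and a TTS engine respectively.

```python
# Sketch of the three-step "cascading model": ASR -> LLM -> TTS.
# Each stage here is a stub that mimics the data flow only.

def speech_to_text(audio: bytes) -> str:
    """Stub ASR stage; a real pipeline would run a model like Whisper here."""
    return audio.decode("utf-8")  # pretend the audio bytes carry the transcript

def language_model(prompt: str) -> str:
    """Stub LLM stage; a real pipeline would call a language model here."""
    return f"Reply to: {prompt}"

def text_to_speech(text: str) -> bytes:
    """Stub TTS stage; a real pipeline would synthesize audio here."""
    return text.encode("utf-8")

def cascading_pipeline(audio_in: bytes) -> bytes:
    """Chain the three stages. Every hop converts through text -- the
    latency and fidelity cost that end-to-end speech-to-speech models
    aim to remove by operating on token representations directly."""
    transcript = speech_to_text(audio_in)
    response_text = language_model(transcript)
    return text_to_speech(response_text)

print(cascading_pipeline(b"what time is it?"))
```

The point of the sketch is structural: because each stage only sees its predecessor's text output, prosody, speaker identity, and emotion are lost at the ASR boundary, which is why end-to-end audio LLMs are the next step.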

What foundation models are available from major players and international sources?

Beyond Whisper, the landscape includes:

While these models indicate rapid global innovation, most are not yet as widely available to general developers as the leading LLMs.

Is real-time speech-to-speech (speech-in, speech-out) technology generally available to developers?

Not yet. While companies like Apollo Defend have demonstrated this capability, it remains largely proprietary. Most current architectures still rely on cascading through text. The “holy grail” of pure speech-to-speech processing is coming, but it isn’t generally available to developers today in the way LLMs are.

For a full transcript, see our newsletter.
