The Data Exchange

Why Voice Security Is Your Next Big Problem


Yishay Carmiel and Roy Zanbel on Voice Cloning, Deepfake Detection, and the Future of Audio LLMs.

Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.

In this episode, Yishay Carmiel and Roy Zanbel of Apollo Defend discuss the rapidly evolving landscape of voice AI and its emerging security threats. They explain how accessible voice cloning technology has created a new attack vector for social engineering and identity theft, making voice a unique biometric risk. The conversation covers defensive strategies like real-time deepfake detection and voice anonymization, and looks ahead to the security challenges of next-generation speech-to-speech models.

Subscribe to the Gradient Flow Newsletter






Transcript

Below is a heavily edited excerpt, in Question & Answer format.

The State of Voice AI and Foundation Models

How does the current state of voice foundation models compare to the large language model (LLM) space?

Voice AI is not yet as mature as the LLM space, where a few dominant foundation models are used off-the-shelf. However, the industry is moving in that direction. Currently, most voice applications use a “cascading model” with three separate steps:

  1. Speech-to-Text (ASR): OpenAI’s Whisper is the de facto open-source foundation model that most developers use or build upon
  2. Language Processing: An LLM processes the transcribed text
  3. Text-to-Speech (TTS): The LLM’s text output is converted back into speech

The TTS space is more fragmented, with commercial options like ElevenLabs and various open-source models. The next evolution is the move to end-to-end “speech-to-speech” or “audio LLM” models that treat speech as both input and output, using internal token representations instead of converting to text.
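The cascading architecture described above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the three stage functions below are hypothetical stand-ins, where a production system would invoke an ASR model such as Whisper, an LLM, and a TTS engine respectively.

```python
# Sketch of the three-step "cascading model": ASR -> LLM -> TTS.
# Each stage here is a stub that mimics the data flow only.

def speech_to_text(audio: bytes) -> str:
    """Stub ASR stage; a real pipeline would run a model like Whisper here."""
    return audio.decode("utf-8")  # pretend the audio bytes carry the transcript

def language_model(prompt: str) -> str:
    """Stub LLM stage; a real pipeline would call a language model here."""
    return f"Reply to: {prompt}"

def text_to_speech(text: str) -> bytes:
    """Stub TTS stage; a real pipeline would synthesize audio here."""
    return text.encode("utf-8")

def cascading_pipeline(audio_in: bytes) -> bytes:
    """Chain the three stages. Every hop converts through text -- the
    latency and fidelity cost that end-to-end speech-to-speech models
    aim to remove by operating on token representations directly."""
    transcript = speech_to_text(audio_in)
    response_text = language_model(transcript)
    return text_to_speech(response_text)

print(cascading_pipeline(b"what time is it?"))
```

The point of the sketch is structural: because each stage only sees its predecessor's text output, prosody, speaker identity, and emotion are lost at the ASR boundary, which is why end-to-end audio LLMs are the next step.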

What foundation models are available from major players and international sources?

Beyond Whisper, the landscape includes:

While these models indicate rapid global innovation, most are not yet as widely available to general developers as the leading LLMs.

Is real-time speech-to-speech (speech-in, speech-out) technology generally available to developers?

Not yet. While companies like Apollo Defend have demonstrated this capability, it remains largely proprietary. Most current architectures still rely on cascading through text. The “holy grail” of pure speech-to-speech processing is coming, but it isn’t generally available to developers today in the way LLMs are.

For a full transcript, see our newsletter.
