Yishay Carmiel and Roy Zanbel on Voice Cloning, Deepfake Detection, and the Future of Audio LLMs.
Subscribe: Apple • Spotify • Overcast • Pocket Casts • AntennaPod • Podcast Addict • Amazon • RSS.
In this episode, Yishay Carmiel and Roy Zanbel of Apollo Defend discuss the rapidly evolving landscape of voice AI and its emerging security threats. They explain how accessible voice cloning technology has created a new attack vector for social engineering and identity theft, making voice a unique biometric risk. The conversation covers defensive strategies like real-time deepfake detection and voice anonymization, and looks ahead to the security challenges of next-generation speech-to-speech models.
Interview highlights – key sections from the video version:
- Framing Voice AI vs. NLP Foundations
- Are Foundation Models Ready for Voice?
- Whisper, Cascaded Pipelines & Today’s Tooling
- Chinese & Big-Tech Voice Foundation Efforts
- Toward Real-Time Speech-to-Speech Systems
- Human-Level TTS: Quality Breakthroughs
- Rise of Voice Agents & Real-Time Architectures
- Lightning-Fast Voice Cloning & Its Risks
- Voice as a Biometric Threat: Impersonation Tactics
- Threat Models: White-, Gray- & Black-Box Attacks
- Government Demand & Enterprise Security Outlook
- Anti-Cloning Defenses and Deepfake Detection
- Consumer Exposure & Need for Protective Layers
- Audio LLMs on the Horizon & Security Implications
Related content:
- A video version of this conversation is available on our YouTube channel.
- 2025 AI Governance Survey Results
- New Threat Vector: Prompt Injection at the Raw Signal Level
- The Rise of Voice as AI’s Interface Layer: Why AI Security Must Come First
- Securing Generative AI: Beyond Traditional Playbooks
- Shreya Rajpal → The Essential Guide to AI Guardrails
- Mars Lan → The Security Debate: How Safe is Open-Source Software?
- Manos Koukoumidis → How a Public-Benefit Startup Plans to Make Open Source the Default for Serious AI
Support our work by subscribing to our newsletter📩
Transcript
Below is a heavily edited excerpt, in Question & Answer format.
The State of Voice AI and Foundation Models
How does the current state of voice foundation models compare to the large language model (LLM) space?
Voice AI is not yet as mature as the LLM space, where a few dominant foundation models are used off-the-shelf. However, the industry is moving in that direction. Currently, most voice applications use a “cascading model” with three separate steps:
- Speech-to-Text (ASR): OpenAI’s Whisper is the de facto open-source foundation model that most developers use or build upon
- Language Processing: An LLM processes the transcribed text
- Text-to-Speech (TTS): The LLM’s text output is converted back into speech
The TTS space is more fragmented, with commercial options like ElevenLabs and various open-source models. The next evolution is the move to end-to-end “speech-to-speech” or “audio LLM” models that treat speech as both input and output, using internal token representations instead of converting to text.
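To make the cascading architecture concrete, here is a minimal sketch in Python. It assumes the open-source `openai-whisper` package for the ASR step; `call_llm` and `synthesize_speech` are hypothetical placeholders standing in for whichever LLM and TTS providers an application actually uses.

```python
# Minimal sketch of a cascaded voice pipeline (ASR -> LLM -> TTS).
# Assumes the open-source `openai-whisper` package for transcription;
# `call_llm` and `synthesize_speech` are hypothetical stand-ins for
# whatever LLM and TTS services an application uses.

import whisper  # pip install openai-whisper


def call_llm(prompt: str) -> str:
    """Placeholder for the language-processing step (any chat/completions API)."""
    return f"(LLM reply to: {prompt})"


def synthesize_speech(text: str) -> bytes:
    """Placeholder for the TTS step (commercial or open-source engine)."""
    return text.encode("utf-8")  # stand-in for synthesized audio bytes


def respond_to_audio(audio_path: str) -> bytes:
    # Step 1: Speech-to-Text (ASR) with Whisper.
    asr_model = whisper.load_model("base")
    transcript = asr_model.transcribe(audio_path)["text"]

    # Step 2: Language processing on the transcribed text.
    reply_text = call_llm(f"Respond conversationally to: {transcript}")

    # Step 3: Convert the LLM's text output back into speech.
    return synthesize_speech(reply_text)
```

An end-to-end speech-to-speech model would collapse these three stages into a single model that operates on audio tokens directly, which is the shift the guests describe next.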
What foundation models are available from major players and international sources?
Beyond Whisper, the landscape includes:
- Amazon: Recently released Amazon Nova for speech-to-speech
- Meta: Working on VoiceBox and AudioBox for speech synthesis; rumors suggest an upcoming “Voice Llama” for speech-to-speech tasks
- Google: Demonstrated near real-time translation systems
- Chinese companies: Models like CosyVoice for speech synthesis and voice conversion, AudioLabs from StepFunction, and initiatives from Alibaba and Baidu
While these indicate rapid global innovation, most are not yet as widely available to general developers as LLMs are.
Is real-time speech-to-speech (speech-in, speech-out) technology generally available to developers?
Not yet. While companies like Apollo Defend have demonstrated this capability, it remains largely proprietary. Most current architectures still cascade through text. The “holy grail” of pure speech-to-speech processing is coming but isn’t generally available to developers today in the way LLMs are.
