Yishay Carmiel on Generative AI for audio, voice cloning, real-time speech translation, and more.
Subscribe: Apple • Spotify • Overcast • Google • AntennaPod • Podcast Addict • Amazon • RSS.
Yishay Carmiel is the CEO of Meaning[1], a startup at the forefront of building real-time speech applications for enterprises. We discuss the state of AI for speech and audio, including trends in Generative AI, automatic speech recognition, diarization, restoration, voice cloning, speech synthesis, and more.

A video version of this conversation is available on our YouTube channel.

Interview highlights – key sections from the video version:
- Generative AI for Audio (text-to-speech; text-to-music; speech synthesis)
- Speech Translation
- Automatic Speech Recognition and other models that use audio inputs
- Speech Emotion Recognition
- Restoration
- Similarities in recent trends in NLP and Speech
- Diarization (determining who spoke when), and implementation challenges
- Voice cloning and risk mitigation
❛ I believe the primary concern at the forefront is what we refer to as voice cloning. So, if you consider recent advancements, such as text-to-speech technology adapting to specific user voices, one of the major challenges we faced was determining how much new user data was required to tailor the system to a new individual’s voice. Until about two years ago, this process demanded anywhere from 30 to 60 minutes of immediate recording. However, asking someone to speak for 30 to 60 minutes with high-quality results proved quite challenging.
Now, with the latest and most advanced models, we can achieve remarkable quality with as little as three to five seconds of voice data. Yes, just three to five seconds! For example, models like VALL-E or VALL-E X exhibit this capability, and similar models exist. This represents a significant risk.
To address this issue, there are ways to mitigate the risk by implementing robust detectors to verify the authenticity of a voice. Researchers in the field are actively exploring this avenue. Several factors come into play: how easy it is to clone a voice, distinguishing between synthetic and authentic voices, and even identifying the specific individual behind the voice.
Moreover, there’s the possibility of protecting against voice cloning by introducing watermarks or other identifying information. The concern here is that our voices could become readily available for download or misuse, a serious matter indeed.
Think about the potential consequences; people could impersonate others, even posing as your friends or acquaintances to obtain sensitive information through social engineering. This is undoubtedly a significant threat. Some major companies are withholding their models, but they do publish research papers, and in many cases, the models can be replicated.
However, it’s worth noting that while replication attempts of these powerful models exist in open-source communities, the performance might not always match the claims in the research papers, although it’s often very close. So, while we’ve moved far beyond the 30 to 60-minute requirement, the risks remain substantial.❜
– Yishay Carmiel, CEO of Meaning
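The detectors Carmiel mentions are an active research area. As a rough illustration of the idea only (not any production system), here is a minimal sketch of a synthetic-voice classifier: it summarizes each clip as MFCC statistics and fits a logistic-regression model on labeled examples. The `voice_data/real` and `voice_data/cloned` directory layout and the feature choice are hypothetical; real anti-spoofing systems use far richer features and models.

```python
# Minimal sketch of a synthetic-voice detector: a binary classifier over
# MFCC statistics. Assumes hypothetical directories of labeled clips
# (voice_data/real and voice_data/cloned).
from pathlib import Path

import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def embed(path: str, sr: int = 16000) -> np.ndarray:
    """Summarize a clip as the mean and std of its MFCCs."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def load_dataset(root: str):
    X, y = [], []
    for label, name in enumerate(["real", "cloned"]):  # 0 = real, 1 = cloned
        for wav in Path(root, name).glob("*.wav"):
            X.append(embed(str(wav)))
            y.append(label)
    return np.array(X), np.array(y)

X, y = load_dataset("voice_data")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```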
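The watermarking he alludes to can likewise be illustrated with a toy spread-spectrum scheme: embed a low-amplitude pseudorandom carrier derived from a secret key, then detect it later by correlation. The strength and threshold values below are illustrative assumptions; deployed audio watermarks must also survive compression, resampling, and re-recording, which this sketch does not attempt.

```python
# Toy spread-spectrum audio watermark: add a keyed pseudorandom carrier,
# then detect it by correlating with the same key.
import numpy as np

def watermark(audio: np.ndarray, seed: int, strength: float = 0.02) -> np.ndarray:
    rng = np.random.default_rng(seed)
    carrier = rng.standard_normal(audio.shape)  # keyed pseudorandom carrier
    return audio + strength * carrier           # low amplitude relative to the audio

def detect(audio: np.ndarray, seed: int, threshold: float = 0.01) -> bool:
    rng = np.random.default_rng(seed)
    carrier = rng.standard_normal(audio.shape)
    # Correlation estimates the embedded strength; it stays near zero
    # unless this key's carrier is actually present.
    return float(np.dot(audio, carrier) / len(audio)) > threshold

# Usage: mark a clip with a secret key, then test marked and unmarked audio.
clip = np.random.default_rng(1).standard_normal(160000)  # stand-in for 10 s at 16 kHz
marked = watermark(clip, seed=42)
print(detect(marked, seed=42), detect(clip, seed=42))    # expected: True False
```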
Related content:
- Call Center Survey Results: Exploring Solutions for Call Center Agent Challenges
- New open source tools to unlock speech and audio data
- Piotr Żelasko: The Unreasonable Effectiveness of Speech Data
- Yishay Carmiel: End-to-end deep learning models for speech applications
- Casey Ellis: The Future of Cybersecurity – Generative AI and its Implications
- What We Can Learn from the FTC’s OpenAI Probe
- Large Language Models in Cybersecurity
- Entity Resolution: Insights and Implications for AI Applications
If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter:
[1] Ben Lorica is an advisor to Meaning and other startups.