AI and the Future of Speech Technologies

Ben Lorica

3 years ago

Yishay Carmiel on Generative AI for audio, voice cloning, real-time speech translation, and more.

Yishay Carmiel is the CEO of Meaning¹, a startup at the forefront of building real-time speech applications for enterprises. We discuss the state of AI for speech and audio, including trends in Generative AI, automatic speech recognition, diarization, restoration, voice cloning, speech synthesis and more.

Yishay Carmiel will be speaking at the AI Conference in San Francisco (Sep 26-27). Use the discount code FriendsofBen18 to save 18% on your registration.

Interview highlights – key sections from the video version:

❛Now, with the latest and most advanced models, we can achieve remarkable quality with as little as three to five seconds of voice data. Yes, just three to five seconds! For example, models like VALL-E or VALL-E X exhibit this capability, and similar models exist. This represents a significant risk.

To address this issue, there are ways to mitigate the risk by implementing robust detectors to verify the authenticity of a voice. Researchers in the field are actively exploring this avenue. Several factors come into play: how easy it is to clone a voice, distinguishing between synthetic and authentic voices, and even identifying the specific individual behind the voice.

Moreover, there’s the possibility of protecting against voice cloning by introducing watermarks or other identifying information. The concern here is that our voices could become readily available for download or misuse, a serious matter indeed.

Think about the potential consequences; people could impersonate others, even posing as your friends or acquaintances to obtain sensitive information through social engineering. This is undoubtedly a significant threat. Some major companies are withholding their models, but they do publish research papers, and in many cases, the models can be replicated.

However, it’s worth noting that while replication attempts of these powerful models exist in open-source communities, the performance might not always match the claims in the research papers, although it’s often very close. So, while we’ve moved far beyond the 30 to 60-minute requirement, the risks remain substantial.❜
– Yishay Carmiel, CEO of Meaning

[1] Ben Lorica is an advisor to Meaning and other startups.
