AI and the Future of Speech Technologies

Yishay Carmiel on Generative AI for audio, voice cloning, real-time speech translation, and more.

Subscribe: Apple • Spotify • Overcast • Google • AntennaPod • Podcast Addict • Amazon • RSS.

Yishay Carmiel is the CEO of Meaning [1], a startup at the forefront of building real-time speech applications for enterprises. We discuss the state of AI for speech and audio, including trends in Generative AI, automatic speech recognition, diarization, restoration, voice cloning, speech synthesis, and more.

Subscribe to the Gradient Flow Newsletter

Yishay Carmiel will be speaking at the AI Conference in San Francisco (Sep 26-27). Use the discount code FriendsofBen18 to save 18% on your registration.

Interview highlights – key sections from the video version:

    ❛ I believe the primary concern at the forefront is what we refer to as voice cloning. So, if you consider recent advancements, such as text-to-speech technology adapting to specific user voices, one of the major challenges we faced was determining how much new user data was required to tailor the system to a new individual’s voice. Until about two years ago, this process demanded anywhere from 30 to 60 minutes of immediate recording. However, asking someone to speak for 30 to 60 minutes with high-quality results proved quite challenging.

    Now, with the latest and most advanced models, we can achieve remarkable quality with as little as three to five seconds of voice data. Yes, just three to five seconds! For example, models like VALL-E or VALL-E X exhibit this capability, and similar models exist. This represents a significant risk.

    To address this issue, there are ways to mitigate the risk by implementing robust detectors to verify the authenticity of a voice. Researchers in the field are actively exploring this avenue. Several factors come into play: how easy it is to clone a voice, distinguishing between synthetic and authentic voices, and even identifying the specific individual behind the voice.

    Moreover, there’s the possibility of protecting against voice cloning by introducing watermarks or other identifying information. The concern here is that our voices could become readily available for download or misuse, a serious matter indeed.

    Think about the potential consequences; people could impersonate others, even posing as your friends or acquaintances to obtain sensitive information through social engineering. This is undoubtedly a significant threat. Some major companies are withholding their models, but they do publish research papers, and in many cases, the models can be replicated.

    However, it’s worth noting that while replication attempts of these powerful models exist in open-source communities, the performance might not always match the claims in the research papers, although it’s often very close. So, while we’ve moved far beyond the 30 to 60-minute requirement, the risks remain substantial.❜
    Yishay Carmiel, CEO of Meaning
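The watermark-based protection Carmiel mentions can be illustrated with a toy spread-spectrum scheme: embed a key-derived pseudo-random noise pattern in the audio at an inaudibly low amplitude, then detect it later by correlating against the same pattern. The sketch below is a minimal illustration under assumed function names and parameters, not a production watermarking system (real systems must survive compression, resampling, and deliberate removal):

```python
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.01) -> np.ndarray:
    """Add a key-derived pseudo-random +/-1 pattern at low amplitude."""
    rng = np.random.default_rng(key)
    mark = rng.choice([-1.0, 1.0], size=audio.shape)
    return audio + strength * mark

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 0.005) -> bool:
    """Correlate against the key's pattern; only watermarked audio with the
    matching key produces a correlation near `strength` instead of near zero."""
    rng = np.random.default_rng(key)
    mark = rng.choice([-1.0, 1.0], size=audio.shape)
    score = float(np.mean(audio * mark))
    return score > threshold

# Usage with one second of a 440 Hz tone standing in for speech.
sr = 16000
t = np.arange(sr) / sr
clean = 0.1 * np.sin(2 * np.pi * 440.0 * t)
marked = embed_watermark(clean, key=42)
# detect_watermark(marked, key=42) should be True; on `clean`, False.
```

Because the pattern is derived from a secret key, a detector can flag audio carrying the watermark even after the voice itself sounds unchanged; the hard part in practice is making the mark robust while keeping it imperceptible.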


    If you enjoyed this episode, please support our work by encouraging your friends and colleagues to subscribe to our newsletter.

    [1] Ben Lorica is an advisor to Meaning and other startups.