Friday, August 18, 2023

Meta announces Voicebox, a generative model for multiple voice synthesis tasks

‘Flow Matching’

Voicebox is a generative model that can synthesize speech across six languages: English, French, Spanish, German, Polish and Portuguese. Like large language models (LLMs), it has been trained on a very general task that can be used for many applications. But while LLMs try to learn the statistical regularities of words and text sequences, Voicebox has been trained to learn the patterns that map voice audio samples to their transcripts.
 
Replicating voices across languages, editing out mistakes in speech, and more

Unlike generative models that are trained for a specific application, Voicebox can perform many tasks that it was not explicitly trained for. For example, the model can use a two-second voice sample to generate speech for new text. Meta says this capability can be used to bring speech to people who are unable to speak, or to customize the voices of non-player characters in games and of virtual assistants.

Voicebox also performs style transfer in different ways. For example, you can provide the model with two samples, each consisting of audio and its text. It will use the first audio sample as a style reference and modify the second one to match the voice and tone of the reference. Interestingly, the model can do the same thing across different languages, which could be used to “help people communicate in a natural, authentic way,” even when they do not speak the same language.

The model can also do a variety of editing tasks. For example, if a dog barks in the background while you’re recording your voice, you can provide the audio and transcript to Voicebox and mask out the segment with the background noise. The model will use the transcript to generate the missing portion of the audio without the background noise.
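The masking step described above can be pictured as marking a span of audio frames for regeneration while keeping everything else as context. The following is a minimal Python sketch of that idea only. Voicebox has not been released, so no real API is involved: the frame rate, the function names and the `silent_generator` stand-in are all assumptions made for illustration.

```python
from typing import Callable, List

FRAMES_PER_SECOND = 100  # assumption: one feature frame per 10 ms

Generator = Callable[[List[float], List[bool]], List[float]]


def span_to_mask(num_frames: int, start_s: float, end_s: float,
                 fps: int = FRAMES_PER_SECOND) -> List[bool]:
    """Return a mask that is True for frames inside [start_s, end_s)."""
    first = max(0, int(round(start_s * fps)))
    last = min(num_frames, int(round(end_s * fps)))
    return [first <= i < last for i in range(num_frames)]


def infill(frames: List[float], mask: List[bool],
           generate_frames: Generator) -> List[float]:
    """Keep unmasked frames as they are; take masked frames from the generator."""
    generated = generate_frames(frames, mask)
    return [g if m else f for f, g, m in zip(frames, generated, mask)]


def silent_generator(frames: List[float], mask: List[bool]) -> List[float]:
    """Hypothetical stand-in for a generative model: it just outputs silence."""
    return [0.0 for _ in frames]


frames = [1.0] * 300                         # three seconds of placeholder audio
mask = span_to_mask(len(frames), 1.0, 1.5)   # e.g. a dog bark from 1.0 s to 1.5 s
cleaned = infill(frames, mask, silent_generator)
```

In this sketch only the frames between 1.0 s and 1.5 s are replaced. A real model would condition on both the surrounding audio and the transcript to produce speech for the masked span rather than silence.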

The same technique can be used to edit speech. For example, if you have misspoken a word, you can mask that portion of the audio sample and pass it to Voicebox along with a transcript of the edited text. The model will generate the missing part with the new text in a way that matches the surrounding voice and tone.
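To make the speech-editing workflow concrete, here is a small Python sketch of the preparation step: given word-level timestamps, it picks the span to mask and builds the edited transcript. The timestamps, the `plan_word_edit` helper and the padding value are hypothetical. The article does not say how Voicebox aligns words to audio, so this is an illustration under stated assumptions, not Meta's method.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start in seconds, end in seconds)


def plan_word_edit(words: List[Word], index: int, new_word: str,
                   padding_s: float = 0.05) -> Tuple[str, Tuple[float, float]]:
    """Return the edited transcript and the time span to mask in the audio."""
    _, start, end = words[index]
    edited = [text for text, _, _ in words]
    edited[index] = new_word
    # Pad the span slightly so the word boundaries are regenerated as well.
    span = (max(0.0, start - padding_s), end + padding_s)
    return " ".join(edited), span


# Hypothetical alignment for a recording in which "Tuesday" was misspoken.
aligned = [("meet", 0.0, 0.3), ("me", 0.3, 0.45),
           ("on", 0.45, 0.6), ("Tuesday", 0.6, 1.1)]
transcript, span = plan_word_edit(aligned, 3, "Thursday")
```

The resulting `transcript` and `span` are what would be handed to the model along with the audio. A practical system would also have to decide how long the regenerated segment should be, since the new word may not take the same time to say; this sketch ignores that.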

Model not released


There is growing concern about the threats of AI-generated content. For example, cybercriminals recently tried to scam a woman by calling her and using an AI-generated voice to impersonate her grandson. Advanced speech synthesis systems such as Voicebox could be used for similar purposes or other nefarious deeds, such as creating fake evidence or manipulating real audio.

“As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm,” Meta wrote on its AI blog. Due to these concerns, Meta did not release the model but provided details on the architecture and training process in a technical paper. The paper also describes a classifier model that can detect speech and audio generated by Voicebox, a measure intended to mitigate the risk of misuse.
