Meta has recently developed Voicebox, a state-of-the-art generative speech model that can be applied to various speech tasks such as speech synthesis, denoising, and content editing. Voicebox generates high-quality speech and is touted as the first model capable of handling diverse speech generation tasks without task-specific training. To mitigate the risk of misuse due to its powerful capabilities, Meta has decided not to release the Voicebox model and source code publicly but provides audio samples and research papers for academic research purposes.
Just as generative models exist for images and text, Voicebox is a generative model for speech. It can create speech in six different languages from scratch and perform tasks like noise removal, content editing, style transfer, and diverse sample generation. Researchers note that prior to Voicebox, AI-based speech generation required specially curated training data for each speech task. In contrast, Voicebox builds on the novel method of Flow Matching, enabling it to learn directly from raw audio and transcribed text.
Learning directly from raw audio and transcribed text gives Voicebox an advantage in audio processing and speech generation. Many current speech synthesis and speech recognition models can only be trained on data that has first gone through labor-intensive preprocessing, which significantly increases training costs.
Additionally, unlike autoregressive models, Voicebox can modify any part of a sample, not just the end of an audio segment. This makes Voicebox more effective at editing audio and at creating long, continuous speech content. Autoregressive models generate audio sequentially, one small piece at a time, so they can only extend or modify the end of a clip and need considerable computation time to produce sufficiently long segments.
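The idea of editing any part of a sample can be illustrated with a small masking sketch. It is not Voicebox code: the frame values are made up, and `neighbor_average` is a hypothetical toy stand-in for a trained model. It only shows that a masked span in the middle of a sequence can be regenerated from the context on both sides while every other frame is left untouched.

```python
def build_infill_mask(num_frames, start, end):
    """Return a list of booleans; True marks the frames to regenerate."""
    if not 0 <= start < end <= num_frames:
        raise ValueError("span must lie inside the sequence")
    return [start <= i < end for i in range(num_frames)]


def infill(frames, mask, generate_frame):
    """Regenerate masked frames and keep unmasked frames unchanged.

    generate_frame is a hypothetical stand-in for a trained model. It
    receives the frame index and the context, where masked positions
    are hidden as None.
    """
    context = [None if m else f for f, m in zip(frames, mask)]
    return [generate_frame(i, context) if m else f
            for i, (f, m) in enumerate(zip(frames, mask))]


def neighbor_average(i, context):
    """Toy generator: average the nearest visible frame on each side."""
    left = next((context[j] for j in range(i - 1, -1, -1)
                 if context[j] is not None), None)
    right = next((context[j] for j in range(i + 1, len(context))
                  if context[j] is not None), None)
    visible = [v for v in (left, right) if v is not None]
    if not visible:  # everything was masked, nothing to condition on
        return 0.0
    return sum(visible) / len(visible)


# Made-up one-dimensional "frames"; positions 3 and 4 are treated as corrupted.
frames = [0.0, 1.0, 2.0, 9.0, 9.0, 5.0, 6.0]
mask = build_infill_mask(len(frames), 3, 5)
repaired = infill(frames, mask, neighbor_average)
print(repaired)
```

A purely left-to-right generator would only see the frames before the gap, which is why the sketch passes context from both sides to the generator.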
The power of Voicebox lies in its Flow Matching method, which enables the model to learn the highly non-deterministic mapping between text and speech. Non-deterministic here means that the relationship between text and speech is not one-to-one. The same text can be spoken in many ways, with different speech rates, tones, accents, and emphasis, and these differences can change the meaning conveyed.
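For readers who want a concrete picture of flow matching, the following is a minimal sketch of how one training example can be built under the optimal-transport path commonly used in the flow matching literature. The article does not give Voicebox's exact formulation, so the function names, the `SIGMA_MIN` value, and the toy data are all assumptions for illustration. The model would be trained to predict `target` from `x_t`, `t`, and its conditioning inputs.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; the article gives no value


def flow_matching_pair(x1, t, rng):
    """Build one (x_t, target velocity) training pair.

    x1  : list of floats, a real data sample (for example one audio frame)
    t   : float in [0, 1], position on the path from noise (0) to data (1)
    rng : random.Random instance used to draw the noise sample
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]        # Gaussian noise sample
    scale = 1.0 - (1.0 - SIGMA_MIN) * t           # weight on the noise
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    # Velocity that moves x_t along a straight line toward the data sample.
    target = [d - (1.0 - SIGMA_MIN) * n for n, d in zip(x0, x1)]
    return x_t, target


def mean_squared_error(pred, target):
    """Regression loss between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]   # made-up data sample
t = 0.25
x_t, target = flow_matching_pair(x1, t, rng)

# Following the target velocity for the remaining time (1 - t) should
# land very close to the data sample x1.
endpoint = [a + (1.0 - t) * v for a, v in zip(x_t, target)]
print(endpoint)
```

Because a fresh noise sample is drawn for every pair, the same text can be associated with many different valid outputs at generation time, which is the non-one-to-one behavior described above.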
Traditional speech models require meticulous annotation of training data, such as marking the position of each phoneme and prosodic feature, or having humans read the text with a specific intonation and speech rate. These processes not only consume significant time but also require expertise. Because Voicebox can handle non-deterministic mappings, it can learn from variations that have not been labeled. In other words, researchers can train the model on larger and more diverse data, resulting in more natural and expressive speech generation.
Voicebox was trained on 50,000 hours of recorded speech from public domain sources in English, French, Spanish, German, Polish, and Portuguese, along with the corresponding transcripts. Once Voicebox has learned to infer speech from its surrounding context, it can be used for various speech generation tasks. Given a speech sample and its corresponding transcript, Voicebox can read new text in the style of the provided speech. Additionally, Voicebox can edit speech segments, such as re-synthesizing corrupted sections or replacing misspoken sentences.
The significance of this research lies in Voicebox being the first successful and versatile model for generalized speech tasks. Meta has published detailed papers describing the methods and achievements of Voicebox, including the development of an effective classifier capable of distinguishing between Voicebox-generated and real speech. Speech generation models have many emerging applications, but they also carry a risk of misuse. For now, Meta has chosen not to release the Voicebox model and source code but aims to foster research within the AI community by providing audio samples and research papers.