Microsoft’s new VALL-E 2 text-to-speech synthesis achieves human-level performance

You’re thinking about Morgan Freeman reading you bedtime stories, right?


Published on June 11, 2024


Microsoft has come up with VALL-E 2, a new model that takes human-like speech synthesis to another level. This is not just an improvement; it’s a big step forward in making computer-generated voices sound more natural and higher in quality. It marks significant progress over previous versions like VALL-E, which already came close to human speech patterns but still lacked some crucial elements, such as intonation control and the ability to avoid monotonous, repetitive output.

The researchers fixed the token repetition issue and more

The new model overcomes these limitations with two additions, Repetition Aware Sampling and Grouped Code Modeling, both aimed at making machine-learning-based speech generation more stable and efficient. But what does this mean? Well, let’s dive into the details and find out.

One of the problems with sampling is token repetition. Sometimes the model produces repetitive sequences, which can cause stability issues and even infinite loops during decoding. Repetition Aware Sampling takes the decoding history into account to give more stable and reliable results. Have you ever heard speech synthesis that doesn’t sound quite right? This feature is here to fix that.
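To make the idea more concrete, here is a minimal Python sketch of a sampler that looks at its own decoding history. It is an illustration only, not Microsoft’s implementation: the function names (`nucleus_sample`, `repetition_aware_sample`) and the parameters `top_p`, `window` and `max_ratio` are hypothetical, and their default values are assumptions chosen for the example. The sketch first draws a token with nucleus sampling, then checks how often that token already appears in the recent history; if it shows up too often, the token is redrawn from the full distribution.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample a token id from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: List[int],
    rng: random.Random,
    top_p: float = 0.8,      # assumed value, for illustration only
    window: int = 10,        # assumed value: how many past tokens to inspect (> 0)
    max_ratio: float = 0.1,  # assumed value: tolerated share of repeats
) -> int:
    """Pick the next token, redrawing it if it repeats too much in recent history."""
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Too repetitive: fall back to sampling from the full distribution.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

With an empty history the function behaves like plain nucleus sampling; the fallback only kicks in once the same token starts to dominate the recent window.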

Next is Grouped Code Modeling, a method that focuses on efficiency. By grouping together codec codes, it can greatly shorten the sequence length. This approach speeds up inference and deals with issues related to modeling of long sequences. Think about a situation where you have to quickly synthesize a long speech; this feature makes it possible without losing quality.
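The effect of grouping on sequence length is easy to see in a few lines of Python. This is a simplified sketch under stated assumptions, not the model’s actual code: `group_codes`, `ungroup_codes`, `group_size` and `pad_id` are hypothetical names, and padding the last group with a filler code is just one possible way to handle leftovers.

```python
from typing import List


def group_codes(codes: List[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Partition a flat sequence of codec codes into fixed-size groups."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Pad the tail so the sequence length is a multiple of group_size.
    padding = (-len(codes)) % group_size
    padded = codes + [pad_id] * padding
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


def ungroup_codes(groups: List[List[int]], original_length: int) -> List[int]:
    """Flatten the groups again and drop any padding added by group_codes."""
    flat = [code for group in groups for code in group]
    return flat[:original_length]


codes = [1, 2, 3, 4, 5, 6, 7]
groups = group_codes(codes, group_size=2)
print(groups)       # [[1, 2], [3, 4], [5, 6], [7, 0]]
print(len(groups))  # 4 steps instead of 7
```

A model that handles one group per step needs roughly `len(codes) / group_size` steps, which is where the shorter sequence and the faster inference come from.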

VALL-E 2 will talk just like a human

These are not merely technical terms; they empower VALL-E 2 to produce speech that is extremely natural, even for intricate sentences. The model’s elegance lies in its simplicity: it only needs a simple set of speech-transcription pairs for training. This makes the process of collecting and handling data much easier.

According to the technical paper on VALL-E 2, the new model showed better results on the LibriSpeech and VCTK datasets in terms of speech robustness, naturalness and speaker similarity. It is the first model to reach human parity on these benchmarks. The new version can produce high-quality speech, even for complicated sentences and sentences with repeated phrases.

VALL-E 2 holds great promise for aiding people who have difficulty speaking, but its possible uses are not limited to that area. Think of being able to give a voice to someone who struggles to talk because of conditions like aphasia or amyotrophic lateral sclerosis. Yet we should not overlook the dangers of misuse, like voice spoofing or impersonation. For practical uses of this technology, it is very important to have rules for obtaining the speaker’s approval and ways of detecting whether speech is real or computer-generated.

Could you, for instance, have all the e-books on your PC narrated by Morgan Freeman? You probably could. Publishing them online? That would be a totally different story, and you shouldn’t be able to do that, for obvious reasons.

What do you think about VALL-E 2 and speech synthesis? Let’s talk about that in the comments below. We learned about this from AIM.
