This AI only needs three seconds of your voice to imitate it
For all the advances in AI video generation, creating a convincing fake of someone's likeness still requires a significant amount of source material, such as headshots from multiple angles or video footage. Faking a voice is a different story: Microsoft researchers recently revealed a new AI tool that can simulate someone's voice using a sample of just three seconds of them talking.
The new tool, a “neural codec language model” called VALL-E, builds on EnCodec, the AI-powered audio compression technology Meta unveiled late last year, which compresses better-than-CD-quality audio to data rates 10 times smaller than even MP3 files, with no noticeable loss in quality. Meta envisioned EnCodec as a way to improve phone call quality in areas with spotty cellular coverage, or to reduce bandwidth requirements for music streaming services, but Microsoft is leveraging the technology to make text-to-speech synthesis sound realistic from a very limited source sample.
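EnCodec itself is open source, and Meta's `encodec` Python package makes it easy to see the discrete "audio tokens" that a codec language model like VALL-E works with. The snippet below is a minimal sketch based on the package's published API; the file name is a placeholder for any short speech clip.

```python
# pip install encodec torchaudio
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load Meta's pretrained 24 kHz EnCodec model and pick a target bitrate (kbps).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# "sample.wav" is a placeholder path for a short speech recording.
wav, sr = torchaudio.load("sample.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add a batch dimension

# Encode to discrete codes: a [batch, n_codebooks, time] tensor of integers.
# These compact token streams are what VALL-E models instead of raw audio.
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)

# Decoding the tokens reconstructs a waveform close to the original.
with torch.no_grad():
    reconstructed = model.decode(encoded_frames)
```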
Today’s text-to-speech systems can already produce very realistic speech, which is why the voices of smart assistants sound so authentic even though their verbal responses are generated on the fly. But they require clean, high-quality training data, which is usually captured in a recording studio with professional equipment. Microsoft’s approach lets VALL-E simulate almost anyone’s voice without that person spending weeks in a studio. Instead, the tool is trained on Meta’s LibriLight dataset, which contains 60,000 hours of recorded English speech from more than 7,000 unique speakers, “extracted and processed” from LibriVox audiobooks, all in the public domain.
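Conceptually, VALL-E treats zero-shot text-to-speech as a language-modeling problem over those codec tokens: the three-second clip is encoded into tokens that act as an acoustic prompt, the model generates a continuation of that token stream conditioned on the target text's phonemes, and the codec's decoder turns the result back into a waveform. The sketch below is illustrative pseudocode only, since Microsoft has not released an implementation; `codec`, `acoustic_lm`, and `phonemize` are hypothetical stand-ins.

```python
# Illustrative sketch of a VALL-E-style pipeline. None of these objects
# correspond to a released Microsoft API; all names are hypothetical.

def synthesize(target_text: str, enrollment_wav, codec, acoustic_lm, phonemize):
    # 1. Compress the ~3-second enrollment clip into discrete codec tokens.
    #    These serve as an "acoustic prompt" capturing the speaker's voice.
    prompt_tokens = codec.encode(enrollment_wav)

    # 2. Convert the target text into phonemes, the model's text conditioning.
    phonemes = phonemize(target_text)

    # 3. The codec language model generates new audio tokens that continue the
    #    prompt in the same voice. (Per the paper, an autoregressive stage
    #    predicts the first codebook and a non-autoregressive stage the rest.)
    generated_tokens = acoustic_lm.generate(phonemes, prompt_tokens)

    # 4. Decode the generated tokens back into a waveform.
    return codec.decode(generated_tokens)
```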
Microsoft has shared a wide range of samples generated by VALL-E so you can hear for yourself just how capable its voice simulation is, though the results are currently mixed. The tool sometimes has difficulty recreating accents, including subtle ones from source samples where the speaker sounds Irish, and its ability to change the emotion of a given sentence is occasionally laughable. But more often than not, the samples VALL-E generates sound natural, warm, and almost impossible to distinguish from the original speakers in the three-second source clips.
In its current form, trained on LibriLight, VALL-E is limited to simulating English speech, and while its performance is far from flawless, it will likely improve as its training dataset expands. Improving VALL-E will be up to Microsoft's researchers, however, as the team has not released the tool's source code. In a recently published research paper detailing VALL-E's development, its creators acknowledge the risks it poses:
Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when continuing to develop the models.