Impressive Artificial Intelligence program that recreates faces from audio

Speech2Face is a study showing that it is possible to estimate what a person's face looks like from just a short fragment of their voice.

Technology continues to grow by leaps and bounds, drawing on several fields to explore new capabilities and functions. One of them is the ability to “reconstruct” a person's face from a fragment of their voice.

The Speech2Face study, presented in 2019 at the Conference on Computer Vision and Pattern Recognition (CVPR), showed that an Artificial Intelligence (AI) model can infer aspects of a person's appearance from short audio segments.

The paper explains that the goal of researchers Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman and Michael Rubinstein of MIT's Computer Science and Artificial Intelligence Laboratory is not to reconstruct people's faces identically, but to produce an image with the physical characteristics that correlate with the analyzed audio.

To achieve this, they designed and trained a deep neural network on millions of YouTube video clips of people speaking. During training, the model learned to correlate voices with faces, allowing it to produce images with physical attributes similar to those of the speakers, including age, gender and ethnicity.

The training relied on the natural co-occurrence of faces and voices in Internet videos, without the need to explicitly model detailed physical characteristics of the face.

They noted that because this study touches on aspects sensitive to ethnicity, as well as privacy, no specific physical attributes were added to the recreated faces. They also point out that, like any other machine learning system, it improves over time, since each use expands its training library.

While the published tests show that Speech2Face achieves a high number of matches between faces and voices, it also had some failures, where the ethnicity, age or gender did not match the voice sample used.

The model is designed to capture the statistical correlations that exist between facial features and the voice. It should be remembered that the AI learned from YouTube videos, which do not represent a true sample of the world's population; for some languages, for example, the results show discrepancies with the training data.

Accordingly, the study itself recommends, at the end of its results, that those who decide to explore and improve the system consider a wider sample of people and voices, so that the machine learning model has a broader repertoire for matching and recreating faces.

The program was also able to render the recreated faces as cartoons, which bear a striking resemblance to the speakers in the analyzed audio.

Because this technology could also be put to malicious use, the recreation only approximates the person's appearance and does not produce exact faces, as that could pose a problem for people's privacy. Even so, what the technology can do from audio samples alone is surprising.
