Hyderabad: Imagine watching Justin Bieber singing on TV in Telugu, Marathi, Assamese or any other Indian language of the viewer’s choice! Such automatic translation for regional viewers could soon be a possibility, thanks to a Machine Learning model from the International Institute of Information Technology, Hyderabad (IIIT-H) that can automatically translate a video of any person speaking or singing in one language into another. It could also do away with badly dubbed movies with out-of-sync lip movements, in turn helping us watch Spiderman speak flawless Hindi or Telugu.
According to a recent post on the official IIIT-H blog, the research was carried out by a team led by Prof CV Jawahar, Dean (Research and Development), comprising his students Prajwal KR, Rudrabha Mukhopadhyay, Jerrin Phillip and Abhishek Jha, in collaboration with Prof Vinay Namboodiri from IIT Kanpur. Their work culminated in a research paper titled ‘Towards Automatic Face-to-Face Translation’, which was presented at the ACM International Conference on Multimedia in Nice, France.
“Earlier when we spoke about automatic translation, it used to be text-to-text, then came speech-to-speech translation. In this research, we have gone further by introducing a novel approach to translating video content, known as face-to-face translation,” said Prof. Jawahar.
The team built upon existing speech-to-speech translation systems and developed a pipeline that takes a video of a person speaking in a source language and outputs a video of the same speaker speaking in a target language, with the voice style and lip movements matched to the target-language speech.
How it works
The system first transcribes the speech into sentences using automatic speech recognition (ASR). This is the same technology used in voice assistants (Google Assistant, for example) on mobile devices. The transcribed sentences are then translated into the desired target language using Neural Machine Translation models. Finally, the translated text is spoken out using a text-to-speech synthesizer.
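The three-stage pipeline described above can be sketched as follows. This is a minimal illustration of the structure only: the function names are hypothetical placeholders, and the toy lookup table stands in for the actual ASR, Neural Machine Translation and text-to-speech models the researchers used.

```python
def speech_to_text(audio):
    # Stage 1: automatic speech recognition (ASR).
    # A real system would run an acoustic model over the audio; for
    # illustration we assume the input is already a clean transcript.
    return audio

def translate_text(text, target_lang):
    # Stage 2: Neural Machine Translation, mocked here with a toy
    # lookup table. Unknown phrases pass through unchanged.
    toy_dictionary = {("hello", "hi"): "namaste"}
    return toy_dictionary.get((text, target_lang), text)

def text_to_speech(text):
    # Stage 3: text-to-speech synthesis, represented as a tagged string
    # in place of an actual synthesized waveform.
    return f"<synthesized audio: {text}>"

def translate_speech(audio, target_lang):
    # Chain the three stages: transcribe, translate, then synthesize.
    transcript = speech_to_text(audio)
    translated = translate_text(transcript, target_lang)
    return text_to_speech(translated)
```

For example, `translate_speech("hello", "hi")` walks a source utterance through all three stages and returns a placeholder for the synthesized Hindi audio.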
To obtain a fully translated video with accurate lip synchronization, the researchers introduced a novel visual module called LipGAN. This can also correct the lip movements in an original video to match the translated speech obtained above. For example, badly dubbed movies with out-of-sync lip movements can be corrected with LipGAN. What makes LipGAN unique is that since it has been trained on large video datasets, it works for any voice, any language and any identity.
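The role LipGAN plays in the pipeline can be sketched as below. The `LipSyncGenerator` class and its method are hypothetical stand-ins: the real LipGAN is a trained generative adversarial network that, conditioned on a face image and a window of audio, outputs a face whose lip pose matches that audio. This sketch only shows how each video frame is paired with its time-aligned segment of translated speech.

```python
class LipSyncGenerator:
    """Placeholder for a LipGAN-style generator (hypothetical API)."""

    def generate_frame(self, face_frame, audio_window):
        # A real generator would output a corrected face image whose lip
        # movements match the audio; this stub just records the pairing.
        return f"frame({face_frame})+lips({audio_window})"

def lip_sync_video(frames, audio_windows, model):
    # Pair each original video frame with the time-aligned window of the
    # translated audio and generate a lip-corrected frame for each pair.
    return [model.generate_frame(f, a) for f, a in zip(frames, audio_windows)]
```

Because the generator is conditioned only on a face and an audio window, nothing in this structure ties it to a particular speaker or language, which reflects the paper's claim that the module works for any voice, language and identity.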
Apart from making content such as movies, educational videos, TV news and interviews available to diverse audiences in various languages, there are potential futuristic applications such as real-time cross-lingual video calls, where two people speaking different languages could converse naturally.
Prof. Jawahar also draws attention to a simpler but pressing need today that this model could soon address: translating English video content into Indian-accented English.