Abstract
We interact with the world around us through multiple sensory streams of information such as audio, vision, and text (language). These streams complement each other, but they also contain redundant information, albeit in different forms. For example, the content of a person's speech can be captured by listening to the sounds, partially understood by watching the speaker's lip movements, or gathered by reading the text transcribed from the speech. This redundancy across modalities is exploited in human perceptual understanding and helps us solve various practical problems. However, in the real world, more often than not, the information in individual streams is corrupted by various types of degradation, such as electronic transmission artifacts, background noise, and blurring, which deteriorate the content quality. In this work, we aim to recover the distorted signal in a given stream by exploiting the redundant information in another stream. Specifically, we deal with talking-face videos, which involve both vision and speech signals. We propose two core ideas to explore cross-modal redundancy: (i) denoising speech using visual assistance, and (ii) upsampling very low-resolution talking-face videos using audio assistance.
The first part focuses on the task of speech denoising. We show that the visual stream helps distill the clean speech from the corrupted signal by suppressing the background noise. We identify the key issues in existing state-of-the-art speech enhancement works: (i) most current works use only the audio stream and are limited in their performance across the wide range of real-world noises, and (ii) a few recent works use lip movements as additional cues to improve the quality of the generated speech over “audio-only” methods, but they cannot be applied in the many situations where the visual stream is unreliable or completely absent. Thus, in this work, we propose a new paradigm for speech enhancement: a “pseudo-visual” approach, in which the visual stream is synthetically generated from the noisy speech input. We demonstrate that the robustness and accuracy gains obtained from our model enable various real-world applications that were previously not possible.
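To make the two-stage structure of this idea concrete, the following is a minimal, illustrative PyTorch sketch: a first module hallucinates a “pseudo-visual” (lip-movement) feature stream from the noisy spectrogram alone, and a second module fuses the two streams to predict a denoising mask. The module names, feature sizes, and the GRU-based design are assumptions made purely for exposition; they are not the implementation proposed in this work.

```python
# Illustrative sketch only: a pseudo-visual enhancement pipeline under assumed shapes.
import torch
import torch.nn as nn

class PseudoVisualEnhancer(nn.Module):
    def __init__(self, n_mels=80, lip_dim=96, hidden=256):
        super().__init__()
        # Stage 1: synthesize a lip-movement feature stream from the noisy speech alone.
        self.speech_to_lips = nn.GRU(n_mels, lip_dim, batch_first=True)
        # Stage 2: fuse the noisy audio with the generated "pseudo-visual" stream
        # and predict a soft mask over the noisy spectrogram.
        self.fusion = nn.GRU(n_mels + lip_dim, hidden, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_mels), nn.Sigmoid())

    def forward(self, noisy_mel):                  # noisy_mel: (batch, frames, n_mels)
        pseudo_lips, _ = self.speech_to_lips(noisy_mel)
        fused, _ = self.fusion(torch.cat([noisy_mel, pseudo_lips], dim=-1))
        mask = self.mask_head(fused)               # per-frequency mask in [0, 1]
        return mask * noisy_mel                    # enhanced spectrogram estimate

# Usage: enhance a batch of two 100-frame mel spectrograms.
model = PseudoVisualEnhancer()
clean_est = model(torch.randn(2, 100, 80))
print(clean_est.shape)                             # torch.Size([2, 100, 80])
```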
In the second part, we explore an interesting question: how much can be recovered from an 8 × 8 pixel video sequence by utilizing the corresponding speech of the person talking? Surprisingly, it turns out to be quite a lot. We show that, when processed with the right set of audio and image priors, we can obtain a full-length talking-face video sequence at a 32× scale factor. Even though the semantic information about the identity, including basic attributes such as age and gender, is almost entirely lost in the low-resolution input, we show that utilizing the speech that accompanies the low-resolution video aids in recovering the key face attributes. Our proposed audio-visual upsampling network generates realistic, accurate, and high-resolution (256 × 256 pixels) talking-face videos from an 8 × 8 input video. Finally, we demonstrate that our model can be used in video conferencing applications, where it can drastically reduce network bandwidth consumption. We hope that our work on cross-modal content recovery enables exciting applications such as smoother video calling, accessibility of video content in low-bandwidth situations, and restoration of old historical videos. Our work can also pave the way for future research on cross-modal enhancement of talking-face videos.
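For intuition, the 32× scale factor corresponds to five successive ×2 upsampling stages (8 → 16 → 32 → 64 → 128 → 256). The sketch below shows one way an audio-conditioned upsampler of this overall shape could be wired in PyTorch; the layer choices, feature sizes, and the assumed speech embedding are illustrative placeholders rather than the network proposed in this work.

```python
# Illustrative sketch only: audio-conditioned 32x face upsampling (8x8 -> 256x256).
import torch
import torch.nn as nn

class AudioVisualUpsampler(nn.Module):
    def __init__(self, audio_dim=128, ch=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, ch)        # speech embedding -> feature channels
        self.encode = nn.Conv2d(3 + ch, ch, 3, padding=1)
        # Five x2 upsampling stages: 8 -> 16 -> 32 -> 64 -> 128 -> 256 (32x overall).
        self.up = nn.Sequential(*[
            nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU())
            for _ in range(5)
        ])
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, lr_frame, audio_emb):               # lr_frame: (B,3,8,8), audio_emb: (B,audio_dim)
        # Broadcast the speech embedding over the 8x8 spatial grid and fuse with the frame.
        a = self.audio_proj(audio_emb)[:, :, None, None].expand(-1, -1, 8, 8)
        x = torch.relu(self.encode(torch.cat([lr_frame, a], dim=1)))
        return torch.sigmoid(self.to_rgb(self.up(x)))     # (B, 3, 256, 256)

# Usage: upsample two 8x8 frames conditioned on their speech embeddings.
model = AudioVisualUpsampler()
hr = model(torch.randn(2, 3, 8, 8), torch.randn(2, 128))
print(hr.shape)                                            # torch.Size([2, 3, 256, 256])
```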