Abstract
Videos have become an integral part of our daily digital consumption. With the widespread adoption
of mobile devices, internet connectivity, and social media platforms, the number of online users and
consumers has risen exponentially in recent years. This has led to an unprecedented surge in video
content consumption and creation, ranging from short-form content on TikTok to educational material
on Coursera and entertainment videos on YouTube. Consequently, there is an urgent need to study
videos as a modality in Computer Vision, as doing so can enable a multitude of applications across various
domains, including virtual reality, education, and entertainment. By understanding the intricacies of
video content, we can unlock its potential and leverage its benefits to enhance user experiences and
create innovative solutions.
Producing video content at scale can be challenging due to various practical issues. The recording
process can take several hours of practice, and setting up the right studio and camera equipment can
be time-consuming and expensive. Moreover, recording requires manual effort, and any mistakes made
during the shoot can be difficult to rectify or modify, often requiring the entire video to be re-shot.
In this thesis, we ask the question “Can synthetically generated videos take the place of real
videos?” Automatic content creation can significantly scale digital media production and ease the
process of content creation, which in turn can aid several applications. A human-centric form of such
content that is becoming increasingly popular in the research community is the automatically generated
talking-head video. Talking-head generation refers to the ability to generate realistic videos of a
person speaking, where the generated video can be of a person that may not exist in reality or may
exhibit significantly different characteristics than the original person. Recent deep learning approaches
can synthesize talking-head videos at tremendous scale and quality, with diverse content and
styles, that are visually indistinguishable from real videos. Therefore, it is imperative to study the
process of generating talking-head videos, as these videos can be used for a variety of applications,
such as video conferencing, movie-making, broadcasting news, vlogging, and language learning, among
others. Consider, for example, a digital avatar that reads the news from a text transcript on a news broadcast.
In this vein, this thesis explores two prominent use cases of automatically generating synthetic
talking heads. The first use case is generating large-scale synthetic content to aid people in
learning to lipread at scale. The second use case is automating the task of actor-double face-swapping in the
moviemaking industry. We study and elucidate the challenges and limitations of the existing approaches,
propose solutions based on synthetic talking-head generation, and show the superiority of our methods
through extensive experimental evaluation and user studies.
In the first task, we address the challenges associated with learning to lipread. Lipreading is a primary
mode of communication for people with some form of hearing loss. Therefore, learning to
lipread is an important skill for hard-of-hearing people. However, learning to lipread is not an easy
task, and finding resources to improve one’s lipreading skills can be challenging. Existing lipreading
training websites, which provide basic online resources for improving lipreading skills, are unfortunately
limited in the real-world variations of the talking faces they present, cover only a limited vocabulary, and are available
in only a few select languages and accents. This leaves the vast majority of users without access to adequate
lipreading training resources. To address this challenge, we propose an end-to-end pipeline to develop
an online lipreading training platform using state-of-the-art talking-head video generator networks,
text-to-speech models, and computer vision techniques, to increase the amount of online content on
lipreading training (LRT) platforms in an automated and cost-effective manner. We show that incorporating existing
talking-head generator networks for the task of lipreading is not trivial and requires careful adaptation.
For instance, we develop an audio-video alignment module that aligns the speech utterance with the
region containing the mouth movements and adds silence around the aligned utterance. Such modifications
are necessary to generate realistic-looking videos that do not cause distress to the lipreaders. We also
carefully design lipreading training exercises, conduct extensive user studies, and perform
statistical analysis to show the effectiveness of the generated content in replacing the manually recorded
lipreading training videos.
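As a rough illustration of the silence-padding step mentioned above, the following Python sketch places a speech utterance at a chosen offset inside a longer clip and fills the remainder with silence. It is not the alignment module developed in this thesis: the function name, its parameters, and the assumption that the onset time of the mouth movements is already known are all hypothetical and chosen only for illustration.

```python
def pad_utterance_with_silence(utterance, sample_rate, mouth_start_s, total_duration_s):
    """Place `utterance` at `mouth_start_s` inside a clip of `total_duration_s` seconds.

    Hypothetical helper for illustration only. `utterance` is a sequence of
    audio samples, and silence is represented by samples equal to 0.0.
    """
    total_samples = int(round(total_duration_s * sample_rate))
    start = int(round(mouth_start_s * sample_rate))

    # The utterance must fit completely inside the target clip.
    if start < 0 or start + len(utterance) > total_samples:
        raise ValueError("utterance does not fit inside the target clip")

    leading_silence = [0.0] * start
    trailing_silence = [0.0] * (total_samples - start - len(utterance))
    return leading_silence + list(utterance) + trailing_silence


# Toy example: a 2-sample utterance placed 0.5 s into a 2 s clip at 4 samples/s.
padded = pad_utterance_with_silence([0.5, -0.5], sample_rate=4,
                                    mouth_start_s=0.5, total_duration_s=2.0)
```

In this toy setting, the padded clip has the same duration as the target video, so the audible speech coincides with the assumed mouth-movement interval and the rest of the clip is silent.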
In the second task, we address challenges in the entertainment industry. Body doubles play
an indispensable role in the moviemaking industry. They take