Abstract
Videos form an integral part of human lives and are one of the most natural forms of perception, spanning both the spatial and the temporal dimensions: the spatial dimension emphasizes content, whereas the temporal dimension emphasizes change. Naturally, studying this modality is an important area of computer vision. Crucially, one must efficiently capture this high-dimensional modality to
perform different downstream tasks robustly. In this thesis, we study representation learning for videos, targeting two key aspects of video-based tasks: classification and generation. In a classification task, a
video is compressed to a latent space that captures the key discriminative properties of the video relevant to the task. On the other hand, generation involves starting with a latent space (often a known space, such as a standard normal distribution) and learning a valid mapping between the latent space and the video manifold. This
thesis explores complementary representation techniques to develop robust representation spaces useful for diverse downstream tasks. In this vein, we start by tackling video classification, where we concentrate on the specific task of “lipreading” (transcribing videos to text), or, in technical terms, classifying videos of mouth movements. Through this work, we propose a compressed generative space that self-augments the dataset, improving the discriminative capabilities of the classifier. Motivated by the findings of this work, we move on to finding an improved generative space, in which we touch upon several key elements of video generation, including unconditional video generation, video inversion, and video super-resolution.
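Purely as an illustration (the notation below is introduced here for exposition and is not taken from the thesis), the two settings can be summarized as an encoder that compresses a video into a task-relevant latent code and a generator that maps samples from a known prior onto the video manifold:

% Classification: encode the video v into a discriminative latent z, then predict the label.
% Generation: sample z from a known prior and decode it onto the video manifold.
\begin{align}
  \text{classification:} \quad & z = f_{\phi}(v), \qquad \hat{y} = h(z), \\
  \text{generation:}     \quad & z \sim \mathcal{N}(0, I), \qquad \hat{v} = g_{\theta}(z).
\end{align}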
In the classification task, we aim to study lipreading (or visually recognizing speech from the mouth
movements of a speaker), a challenging and mentally taxing task for humans to perform. Unfortunately,
multiple medical conditions force people to depend on this skill in their day-to-day lives for essential
communication. Patients suffering from Amyotrophic Lateral Sclerosis (ALS) often lose muscle control and, consequently, their ability to generate speech, leaving them to communicate via lip movements. Existing large
datasets do not focus on medical patients or curate personalized vocabulary relevant to an individual.
However, collecting the large-scale, patient-specific data needed to train modern data-hungry deep learning models is extremely challenging. We propose a personalized network designed to lipread for an ALS
patient using only one-shot examples. We rely on synthetically generated lip movements to augment the limited one-shot data. A Variational Encoder-based domain adaptation technique is used to bridge the
real-synthetic domain gap. Our approach yields a significant improvement, achieving a top-5 accuracy of 83.2% for the patient, compared to 62.6% achieved by comparable methods. Apart from evaluating our approach on the ALS patient, we also extend it to people with hearing impairment, who rely extensively on lip movements to communicate.
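As a rough sketch of the variational-encoder-based adaptation idea (the module names, feature dimensions, and vocabulary size below are illustrative assumptions, not the thesis's exact architecture), per-clip features from both the real and the synthetic domains are encoded into a shared latent space that is regularized towards a standard normal prior, and the word classifier operates on that latent:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalEncoder(nn.Module):
    """Encodes per-clip features into a KL-regularized latent shared by both domains."""
    def __init__(self, feat_dim=512, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar

def adaptation_loss(encoder, classifier, clip_feats, labels, kl_weight=0.01):
    """Classification loss on a batch of (real or synthetic) clips plus a KL term
    that pulls both domains towards the same standard-normal latent prior."""
    z, mu, logvar = encoder(clip_feats)
    ce = F.cross_entropy(classifier(z), labels)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return ce + kl_weight * kl

# Usage with dummy tensors: synthetic clips augment the one-shot real examples,
# and both pass through the same encoder and classifier.
encoder, classifier = VariationalEncoder(), nn.Linear(128, 100)  # e.g. a 100-word vocabulary
feats = torch.randn(8, 512)            # per-clip visual features (real + synthetic mix)
labels = torch.randint(0, 100, (8,))
adaptation_loss(encoder, classifier, feats, labels).backward()

Because real and synthetic clips share the same encoder and the same prior, their latent distributions are encouraged to overlap, which is one simple way to narrow the real-synthetic gap.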
In the next part of the thesis, we focus on representation spaces for video-based generative tasks.
Generating videos is a complex task that is typically accomplished by generating a set of temporally coherent
images frame-by-frame. This approach confines the expressivity of videos to image-based operations
on individual frames, necessitating network designs that can achieve temporally coherent trajectories in
the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs): a multi-layer perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted by a meta-network, a hypernetwork trained on neural representations of multiple video instances. The meta-network can later be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space, exhibiting many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate
intermediate videos between known video instances (such as intermediate identities, expressions, and
poses in face videos). It can also inpaint missing portions of videos to recover temporally coherent full videos. We evaluate the space learned by INR-V against existing baselines on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting. INR-V significantly outperforms the baselines on several of these tasks, clearly showcasing the
potential of the proposed representation space.
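To make the architecture description above concrete, the following sketch illustrates the core INR-V idea under assumed layer sizes and latent dimensions (the actual INR-V network, its conditional regularization, and its progressive weight initialization are not reproduced here): a coordinate MLP maps a pixel location (x, y, t) to an RGB value, and a hypernetwork predicts that MLP's weights from a per-video latent code, so sampling latent codes decodes novel videos:

import torch
import torch.nn as nn

COORD_DIM, HIDDEN, RGB = 3, 64, 3  # (x, y, t) -> (r, g, b); sizes chosen for illustration

# Parameter shapes of the implicit network: two hidden layers and an RGB output layer.
LAYER_SHAPES = [(HIDDEN, COORD_DIM), (HIDDEN,),
                (HIDDEN, HIDDEN), (HIDDEN,),
                (RGB, HIDDEN), (RGB,)]
N_PARAMS = sum(torch.Size(s).numel() for s in LAYER_SHAPES)

class HyperNetwork(nn.Module):
    """Meta-network: maps a video-level latent code to the weights of the coordinate MLP."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, N_PARAMS))

    def forward(self, z):
        return self.net(z)  # flat parameter vector of one video's INR

def render_inr(flat_params, coords):
    """Runs the implicit MLP (whose weights come from the hypernetwork) on coordinates."""
    params, offset = [], 0
    for shape in LAYER_SHAPES:
        n = torch.Size(shape).numel()
        params.append(flat_params[offset:offset + n].view(shape))
        offset += n
    w1, b1, w2, b2, w3, b3 = params
    h = torch.relu(coords @ w1.t() + b1)
    h = torch.relu(h @ w2.t() + b2)
    return torch.sigmoid(h @ w3.t() + b3)  # RGB values in [0, 1]

# Usage: sample a latent code and decode a tiny 4-frame, 8x8 video.
hyper = HyperNetwork()
z = torch.randn(128)                       # e.g. a sampled or inverted video code
t, y, x = torch.meshgrid(torch.linspace(0, 1, 4),
                         torch.linspace(0, 1, 8),
                         torch.linspace(0, 1, 8), indexing="ij")
coords = torch.stack([x, y, t], dim=-1).reshape(-1, 3)   # (num_pixels, 3) locations
video = render_inr(hyper(z), coords).reshape(4, 8, 8, 3)

In this formulation the entire video lives in the latent code z, so operations such as interpolation, inversion, and inpainting reduce to manipulating or optimizing z rather than editing individual frames.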
In summary, this thesis makes a significant contribution to representation learning for videos, spanning both the discriminative (classification) and the generative settings studied above.