Abstract
Video is a complex modality consisting of multiple events, complex actions, humans, objects, and their
interactions densely entangled over time. Understanding videos has been one of the core and most
challenging problems in computer vision and machine learning. What makes it even harder is the lack
of a structured formulation of the task, especially when long videos consisting of multiple events and
diverse scenes are considered. Prior works in video understanding have addressed the problem only in a
sparse and uni-dimensional way, for example through action recognition, spatio-temporal grounding,
question answering, and free-form captioning. However, fully capturing all the events, actions, and
relations between entities, and representing any natural scene in the most faithful detail, requires
holistic understanding: answering questions such as who is doing what to whom, with what, how, why,
and where.
Recently, Video Situation Recognition (VidSitu) through semantic role labeling has been framed as a task
for structured prediction of multiple events, their relationships, actions, and various verb-role pairs
attached to descriptive entities. This is one of the densest video understanding tasks, posing several
challenges not only in identifying, disambiguating, and co-referencing entities across multiple verb-role
pairs, but also in evaluation, since roles are represented with free-form captions.
In this work, we propose the addition of spatio-temporal grounding as an essential component of the
structured prediction task in a weakly supervised setting, without requiring ground truth bounding boxes.
Since evaluating free-form captions can be difficult and imprecise, this not only improves the current
formulation and the evaluation setup, but also improves the interpretability of the model's decisions,
as grounding allows us to visualise where the model is looking while generating a caption.
To this end, we present a novel three-stage Transformer model, VideoWhisperer, that is empowered to
make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel
with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The
second stage sees verb-role queries attend to and pool information from object embeddings, localising
answers to questions posed about the action. The final stage generates these answers as captions to
describe each verb-role pair present in the video. Our model operates on a group of events (clips)
simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When
evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in
entity captioning accuracy, as well as the ability to localise verb-roles without grounding annotations at
training time.
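For a concrete picture of the pipeline summarised above, the following is a minimal PyTorch-style sketch of the three stages. All module choices (standard Transformer encoder/decoder layers, a single multi-head attention block for role pooling), tensor shapes, and hyperparameters are illustrative assumptions and not the actual VideoWhisperer implementation; in particular, the stage-two attention weights are only a stand-in for the weakly supervised grounding mechanism.

```python
# Illustrative sketch only: module names, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn


class ThreeStageSketch(nn.Module):
    def __init__(self, d_model=256, n_verbs=100, vocab_size=1000):
        super().__init__()
        # Stage 1: contextualise clip features jointly with object features.
        self.context_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.verb_head = nn.Linear(d_model, n_verbs)
        # Stage 2: verb-role queries attend to and pool from object embeddings.
        self.role_attention = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Stage 3: decode a caption for each verb-role pair from its pooled embedding.
        self.caption_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.caption_embed = nn.Embedding(vocab_size, d_model)
        self.caption_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, object_feats, role_queries, caption_tokens):
        # video_feats:    (B, T, d)  clip-level features for a group of events
        # object_feats:   (B, O, d)  detected object features across the clips
        # role_queries:   (B, R, d)  one learned query per verb-role pair
        # caption_tokens: (B, R, L)  teacher-forced caption tokens per role
        B, R, L = caption_tokens.shape

        # Stage 1: joint contextualisation of video and object tokens.
        tokens = torch.cat([video_feats, object_feats], dim=1)
        ctx = self.context_encoder(tokens)
        video_ctx = ctx[:, : video_feats.size(1)]
        object_ctx = ctx[:, video_feats.size(1):]
        verb_logits = self.verb_head(video_ctx)  # one verb prediction per clip

        # Stage 2: role queries pool object evidence; the attention weights
        # serve here as a proxy for weakly supervised grounding over objects.
        role_emb, grounding = self.role_attention(role_queries, object_ctx, object_ctx)

        # Stage 3: generate one caption per verb-role pair, conditioned on its
        # pooled embedding (roles are flattened into the batch dimension).
        memory = role_emb.reshape(B * R, 1, -1)
        tgt = self.caption_embed(caption_tokens.reshape(B * R, L))
        dec = self.caption_decoder(tgt, memory)
        caption_logits = self.caption_head(dec).reshape(B, R, L, -1)
        return verb_logits, grounding, caption_logits
```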