Abstract
Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) was framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, and also faces some challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. In the second stage, verb-role queries attend to and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localise verb-roles without grounding annotations at training time.
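The second stage described above, in which a verb-role query attends to and pools from object embeddings, can be illustrated with a minimal sketch. This is not the VideoWhisperer implementation: it assumes single-head scaled dot-product attention, and it assumes that the object receiving the highest attention weight is taken as the grounding for the role. The names `attend_and_pool` and `ground`, and the toy vectors, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_and_pool(query, object_embeddings):
    """One role query attends over all object embeddings (assumed single head).

    Returns the attention-weighted sum of the object embeddings and the
    attention weights themselves.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, obj)) / math.sqrt(dim)
        for obj in object_embeddings
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * obj[i] for w, obj in zip(weights, object_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


def ground(weights):
    """Assumption: the most-attended object is treated as the role's grounding."""
    return max(range(len(weights)), key=weights.__getitem__)


# Toy example: three object embeddings and one query for a hypothetical role.
objects = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
role_query = [0.1, 2.0, 0.1]

pooled, weights = attend_and_pool(role_query, objects)
grounded_index = ground(weights)
```

In this sketch the pooled vector would feed a caption decoder (the third stage), while the attention weights provide a localisation signal that needs no box-level supervision, which is one plausible reading of the weakly supervised setting.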