Previously On ... From Recaps to Story Summarization
Aditya Kumar Singh, Dhruv Srivastava, Makarand Tapaswi
Computer Vision and Pattern Recognition, CVPR, 2024
@inproceedings{bib_Prev_2024, AUTHOR = {Aditya Kumar Singh, Dhruv Srivastava, Makarand Tapaswi}, TITLE = {Previously On ... From Recaps to Story Summarization}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2024}}
We introduce multimodal story summarization by leveraging TV episode recaps – short video sequences interweaving key visual moments and dialog from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps and long-form episodes of over 40 minutes each. Recap shots are mapped to corresponding sub-stories that serve as labels for story summarization. We propose a hierarchical model TaleSumm that (i) processes entire episodes by creating compact shot and dialog representations, and (ii) predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization tasks, our method extracts multiple plot points from long-form videos. We present a thorough evaluation on this task, including promising cross-series generalization. TaleSumm also performs well on video summarization benchmarks.
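To make the hierarchical design above concrete, the following is a minimal PyTorch sketch of a two-level importance scorer in the spirit of TaleSumm: token features within each shot or dialog are pooled into compact unit embeddings, and an episode-level encoder scores every unit. The dimensions, layer counts, use of plain Transformer encoders, and mean pooling are illustrative assumptions; the actual model additionally restricts interactions to local story groups.

import torch
import torch.nn as nn

class HierarchicalImportanceScorer(nn.Module):
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        # Level 1: contextualise tokens within a single shot or dialog utterance.
        self.local_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)
        # Level 2: let compact shot/dialog embeddings interact across the episode.
        self.episode_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, units):
        # units: list of (num_tokens, dim) tensors, one per shot or dialog utterance.
        pooled = torch.stack([
            self.local_encoder(u.unsqueeze(0)).mean(dim=1).squeeze(0) for u in units
        ])                                                      # (num_units, dim)
        context = self.episode_encoder(pooled.unsqueeze(0)).squeeze(0)
        return torch.sigmoid(self.score_head(context)).squeeze(-1)  # importance in [0, 1]

scores = HierarchicalImportanceScorer()([torch.randn(20, 512) for _ in range(6)])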
MICap: A Unified Model for Identity-aware Movie Descriptions
Haran S K Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi
Computer Vision and Pattern Recognition, CVPR, 2024
@inproceedings{bib_MICa_2024, AUTHOR = {Haran S K Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi}, TITLE = {MICap: A Unified Model for Identity-aware Movie Descriptions}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2024}}
Characters are an important aspect of any storyline, and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single-stage approach that can seamlessly switch between id-aware caption generation and FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on the Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy, and a 1-2% bump in classic captioning metrics.
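A hedged sketch of the single-stage idea described above: one shared auto-regressive decoder handles both objectives, and the only difference between full caption generation and FITB is whether the caption with blanks is encoded as extra context. Module names, feature sizes, and the concatenation-based conditioning are assumptions for illustration, not MICap's released implementation.

import torch
import torch.nn as nn

class UnifiedIdCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, dim=512, video_feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.video_proj = nn.Linear(video_feat_dim, dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, 8, batch_first=True), 3)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, video_feats, target_tokens, blanked_caption=None):
        # Shared memory: projected video features, optionally concatenated with the
        # embedded caption-with-blanks (FITB mode); omitted for full caption generation.
        memory = self.video_proj(video_feats)
        if blanked_caption is not None:
            memory = torch.cat([memory, self.embed(blanked_caption)], dim=1)
        tgt = self.embed(target_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)  # next-token logits for either training objective

model = UnifiedIdCaptioner()
logits = model(torch.randn(2, 16, 2048), torch.randint(0, 10000, (2, 12)))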
Major Entity Identification: A Generalizable Alternative to Coreference Resolution
S Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2024
@inproceedings{bib_Majo_2024, AUTHOR = {S Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi}, TITLE = {Major Entity Identification: A Generalizable Alternative to Coreference Resolution}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}, YEAR = {2024}}
The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative referential task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, MEI fits the classification framework, which enables the use of robust and intuitive classification-based metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.
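Since MEI fits a classification framework, a minimal sketch of that formulation may help: each mention representation is scored against embeddings of the user-specified major entities plus a "none of these" class. The projection-based scorer and the extra null class are assumptions used purely to illustrate the classification view, not the paper's models.

import torch
import torch.nn as nn

class MentionToEntityClassifier(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.mention_proj = nn.Linear(dim, dim)
        self.entity_proj = nn.Linear(dim, dim)
        self.none_logit = nn.Parameter(torch.zeros(1))  # "not a major entity" class

    def forward(self, mention_embs, entity_embs):
        # mention_embs: (M, dim) span representations from any text encoder;
        # entity_embs: (E, dim) representations of the user-specified major entities.
        scores = self.mention_proj(mention_embs) @ self.entity_proj(entity_embs).T  # (M, E)
        none = self.none_logit.expand(mention_embs.size(0), 1)
        return torch.cat([scores, none], dim=1)  # (M, E + 1) classification logits

logits = MentionToEntityClassifier()(torch.randn(12, 768), torch.randn(4, 768))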
Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability
Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar
Winter Conference on Applications of Computer Vision, WACV, 2024
@inproceedings{bib_Seei_2024, AUTHOR = {Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar}, TITLE = {Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}, YEAR = {2024}}
Understanding what makes a video memorable has important applications in advertising and education technology. Towards this goal, we investigate spatio-temporal attention mechanisms underlying video memorability. Different from previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study where humans perform the video memory task. We uncover the following insights: (i) Quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits similar spatial attention patterns to human gaze, especially for more memorable videos. (ii) The model assigns greater importance to initial frames in a video, mimicking human attention patterns. (iii) Panoptic segmentation reveals that both the model and humans assign a greater share of attention to things and less attention to stuff as compared to their occurrence probability.
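As an illustration of the quantitative comparison mentioned in insight (i), the snippet below scores a model attention map against a human fixation map with a standard saliency metric (linear correlation coefficient, CC). This is a generic metric implementation, not the paper's exact evaluation code.

import numpy as np

def correlation_coefficient(model_attention: np.ndarray, gaze_map: np.ndarray) -> float:
    """Pearson correlation (CC saliency metric) between two spatial maps of equal shape."""
    a = (model_attention - model_attention.mean()) / (model_attention.std() + 1e-8)
    g = (gaze_map - gaze_map.mean()) / (gaze_map.std() + 1e-8)
    return float((a * g).mean())

# Toy usage: random maps stand in for one frame's attention and fixation density.
rng = np.random.default_rng(0)
print(correlation_coefficient(rng.random((7, 7)), rng.random((7, 7))))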
Unsupervised Audio-Visual Lecture Segmentation
Darshan Singh S, Anchit Gupta, Jawahar C V, Makarand Tapaswi
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_Unsu_2023, AUTHOR = {Darshan Singh S, Anchit Gupta, Jawahar C V, Makarand Tapaswi}, TITLE = {Unsupervised Audio-Visual Lecture Segmentation}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}, YEAR = {2023}}
Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing body of online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation that splits lectures into bite-sized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1-nearest-neighbor algorithm, TW-FINCH [44]. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
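For intuition about unsupervised segmentation over temporally ordered clip embeddings, here is a deliberately simplified stand-in (not TW-FINCH itself): adjacent clips whose representations stay similar are grouped into the same topic segment, and a similarity drop starts a new one. The cosine threshold is an assumption for illustration only.

import numpy as np

def segment_by_similarity(clip_embs: np.ndarray, threshold: float = 0.7) -> list:
    """clip_embs: (num_clips, dim) in temporal order; returns a segment id per clip."""
    normed = clip_embs / (np.linalg.norm(clip_embs, axis=1, keepdims=True) + 1e-8)
    labels, current = [0], 0
    for i in range(1, len(normed)):
        if float(normed[i] @ normed[i - 1]) < threshold:  # similarity drop = topic change
            current += 1
        labels.append(current)
    return labels

print(segment_by_similarity(np.random.rand(10, 128)))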
Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek
Computer Vision and Pattern Recognition, CVPR, 2023
@inproceedings{bib_Test_2023, AUTHOR = {Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek}, TITLE = {Test of Time: Instilling Video-Language Models with a Sense of Time}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2023}}
Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness. We observe encouraging performance gains, especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data- and compute-intensive training from scratch.
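The following is a hedged sketch of the kind of time-order objective such a post-pretraining recipe can use: the video embedding should match the correctly ordered before/after caption better than its time-reversed counterpart. The exact loss form, temperature, and choice of negatives in the paper may differ.

import torch
import torch.nn.functional as F

def time_order_contrastive_loss(video_emb, caption_emb, reversed_caption_emb, temperature=0.07):
    # All inputs: (batch, dim). The correctly ordered caption should outscore the
    # time-reversed one for the same video.
    v = F.normalize(video_emb, dim=-1)
    pos = (v * F.normalize(caption_emb, dim=-1)).sum(-1) / temperature
    neg = (v * F.normalize(reversed_caption_emb, dim=-1)).sum(-1) / temperature
    logits = torch.stack([pos, neg], dim=1)  # class 0 = correct temporal order
    return F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))

loss = time_order_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))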
How you feelin? Learning Emotions and Mental States in Movie Scenes
Dhruv Srivastava, Aditya Kumar Singh, Makarand Tapaswi
Computer Vision and Pattern Recognition, CVPR, 2023
@inproceedings{bib_How__2023, AUTHOR = {Dhruv Srivastava, Aditya Kumar Singh, Makarand Tapaswi}, TITLE = {How you feelin? Learning Emotions and Mental States in Movie Scenes}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2023}}
Movie story analysis requires understanding characters’ emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions. By leveraging annotations from the MovieGraphs dataset [72], we aim to predict classic emotions (e.g. happy, angry) and other mental states (e.g. honest, helpful). We conduct experiments on the most frequently occurring 10 and 25 labels, and a mapping that clusters 181 labels into 26. Ablation studies and comparison against adapted state-of-the-art emotion recognition approaches show the effectiveness of EmoTx. Analyzing EmoTx’s self-attention scores reveals that expressive emotions often look at character tokens while other mental states rely on video and dialog cues.
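A minimal sketch of the multi-label formulation: multimodal tokens (video, characters, dialog) are fused with a Transformer encoder and a pooled representation feeds independent sigmoid outputs, one per emotion or mental-state label, trained with binary cross-entropy. Dimensions, mean pooling, and the absence of per-character classification tokens are simplifying assumptions relative to EmoTx.

import torch
import torch.nn as nn

class SceneEmotionClassifier(nn.Module):
    def __init__(self, dim=512, num_labels=25):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True), 2)
        self.head = nn.Linear(dim, num_labels)

    def forward(self, tokens):
        # tokens: (batch, seq, dim) concatenated video, character, and dialog features.
        fused = self.encoder(tokens).mean(dim=1)  # simple mean pooling over the sequence
        return self.head(fused)                   # one logit per emotion/mental-state label

model = SceneEmotionClassifier()
logits = model(torch.randn(2, 60, 512))
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (2, 25)).float())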
Grounded Video Situation Recognition
Zeeshan Khan, Jawahar C V, Makarand Tapaswi
Neural Information Processing Systems, NeurIPS, 2022
@inproceedings{bib_Grou_2022, AUTHOR = {Zeeshan Khan, Jawahar C V, Makarand Tapaswi}, TITLE = {Grounded Video Situation Recognition}, BOOKTITLE = {Neural Information Processing Systems}, YEAR = {2022}}
Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) was framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task not only poses challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, but also raises challenges for evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The second stage sees verb-role queries attend and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.
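To illustrate the second stage described above, here is a rough sketch (not the released VideoWhisperer code) of verb-role queries cross-attending over contextualised video and object embeddings: the pooled features can feed a captioning decoder, and the attention weights hint at grounding. Query count and sizes are assumptions.

import torch
import torch.nn as nn

class VerbRoleQueryPooler(nn.Module):
    def __init__(self, dim=512, num_queries=12, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # one per verb-role slot
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, context):
        # context: (batch, num_video_and_object_tokens, dim) from the first-stage encoder.
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        pooled, attn = self.cross_attn(q, context, context)
        # pooled feeds a captioning decoder; attn indicates which objects ground each role.
        return pooled, attn

pooled, attn = VerbRoleQueryPooler()(torch.randn(2, 40, 512))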
Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations
Jaidev Shriram, Makarand Tapaswi, Vinoo A R
International Society for Music Information Retrieval, ISMIR, 2022
@inproceedings{bib_Sonu_2022, AUTHOR = {Jaidev Shriram, Makarand Tapaswi, Vinoo A R}, TITLE = {Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations}, BOOKTITLE = {International Society for Music Information Retrieval}, YEAR = {2022}}
Reading, much like music listening, is an immersive experience that transports readers while taking them on an emotional journey. Listening to complementary music has the potential to amplify the reading experience, especially when the music is stylistically cohesive and emotionally relevant. In this paper, we propose the first fully automatic method to build a dense soundtrack for books, which can play high-quality instrumental music for the entirety of the reading duration. Our work employs a unique text processing and music weaving pipeline that determines the context and emotional composition of scenes in a chapter. This allows our method to identify and play relevant excerpts from the soundtrack of the book’s movie adaptation. By relying on the movie composer’s craftsmanship, our book soundtracks include expert-made motifs and other scene-specific musical characteristics. We validate the design decisions of our approach through a perceptual study. Our readers note that the book soundtrack greatly enhanced their reading experience, due to the high immersiveness granted by uninterrupted and style-consistent music, and a heightened emotional state attained via high-precision emotion and scene-context recognition.
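As a toy illustration of the matching step, the snippet below assigns each scene an emotion vector and retrieves the soundtrack cue from the movie adaptation with the highest cosine similarity. The real pipeline's text processing, scene-context cues, and music weaving are far richer; this only conveys the retrieval idea.

import numpy as np

def pick_cue_for_scene(scene_emotion: np.ndarray, cue_emotions: np.ndarray) -> int:
    """scene_emotion: (dim,); cue_emotions: (num_cues, dim); returns index of the best cue."""
    scene = scene_emotion / (np.linalg.norm(scene_emotion) + 1e-8)
    cues = cue_emotions / (np.linalg.norm(cue_emotions, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(cues @ scene))

print(pick_cue_for_scene(np.random.rand(8), np.random.rand(20, 8)))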