IIITH

System and method for identifying soundtrack for a digital book using a movie adaptation technique

United States Patent, Us patent, 2024

Core Rank : - Google Rank :-

Abs PDF bibTex

@inproceedings{bib_Syst_2024, AUTHOR = {R, Vinoo Alluri and Tapaswi, Makarand and Shriram, Jaidev }, TITLE = {System and method for identifying soundtrack for a digital book using a movie adaptation technique}, BOOKTITLE = {United States Patent}. YEAR = {2024}}

System and method for identifying soundtrack for a digital book using a movie adaptation technique

Abstract

A method for identifying a soundtrack for a digital book by automatically aligning the digital book and the soundtrack with a movie using a movie adaptation technique is provided. The method includes (i) receiving media content through a user device that includes the digital book, the soundtrack, and the movie,(ii) segmenting (a) the chapters of the digital book into paragraph segments,(b) the movie into scene boundaries, and (c) tracks of the soundtrack into cohesive track segments,(iii) aligning the scene boundaries with the paragraph segments to generate aligned paragraph segments,(iv) aligning a background soundtracks of the movie with the cohesive track segments of the soundtrack using a majority key and a minority key of the cohesive track segments and the background soundtracks to generate aligned cohesive track segments and (v) aligning the aligned paragraph segments with the aligned cohesive

Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Conference on Empirical Methods in Natural Language Processing, EMNLP, 2024

Core Rank : A* Google Rank :193

Abs PDF DOI bibTex

@inproceedings{bib_Majo_2024, AUTHOR = {Manikantan, S Kawshik and Toshniwal, Shubham and Tapaswi, Makarand and Gandhi, Vineet }, TITLE = {Major Entity Identification: A Generalizable Alternative to Coreference Resolution}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}. YEAR = {2024}}

Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Abstract

The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative referential task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, MEI fits the classification framework, which enables the use of robust and intuitive classification-based metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.

MICap: A Unified Model for Identity-aware Movie Descriptions

Computer Vision and Pattern Recognition, CVPR, 2024

Core Rank : A* Google Rank :440

Abs PDF bibTex

@inproceedings{bib_MICa_2024, AUTHOR = {Raajesh, Haran S K and Desanur, Naveen Reddy and Khan, Zeeshan and Tapaswi, Makarand }, TITLE = {MICap: A Unified Model for Identity-aware Movie Descriptions}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2024}}

MICap: A Unified Model for Identity-aware Movie Descriptions

Abstract

Characters are an important aspect of any storyline and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single stage approach that can seamlessly switch between id-aware caption generation or FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy, and a 1-2% bump in classic captioning metrics.

Previously On ... From Recaps to Story Summarization

Computer Vision and Pattern Recognition, CVPR, 2024

Core Rank : A* Google Rank :440

Abs PDF DOI bibTex

@inproceedings{bib_Prev_2024, AUTHOR = {Singh, Aditya Kumar and Srivastava, Dhruv and Tapaswi, Makarand }, TITLE = {Previously On ... From Recaps to Story Summarization}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2024}}

Previously On ... From Recaps to Story Summarization

Abstract

We introduce multimodal story summarization by leveraging TV episode recaps – short video sequences interweaving key visual moments and dialog from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps, and long-form content over 40 minutes per episode. Recap shots are mapped to corresponding sub-stories that serve as labels for story summarization. We propose a hierarchical model TaleSumm that (i) processes entire episodes by creating compact shot and dialog representations, and (ii) predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization tasks, our method extracts multiple plot points from long-form videos. We present a thorough evaluation on this task, including promising cross-series generalization. TaleSumm shows good results on video summarization benchmarks.

How you feelin? Learning Emotions and Mental States in Movie Scenes

Computer Vision and Pattern Recognition, CVPR, 2023

Core Rank : A* Google Rank :440

Abs PDF bibTex

@inproceedings{bib_How__2023, AUTHOR = {Srivastava, Dhruv and Singh, Aditya Kumar and Tapaswi, Makarand }, TITLE = {How you feelin? Learning Emotions and Mental States in Movie Scenes}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2023}}

How you feelin? Learning Emotions and Mental States in Movie Scenes

Abstract

Movie story analysis requires understanding characters’ emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions. By leveraging annotations from the MovieGraphs dataset [72], we aim to predict classic emotions (e.g. happy, angry) and other mental states (e.g. honest, helpful). We conduct experiments on the most frequently occurring 10 and 25 labels, and a mapping that clusters 181 labels to 26. Ablation studies and comparison against adapted stateof-the-art emotion recognition approaches shows the effectiveness of EmoTx. Analyzing EmoTx’s self-attention scores reveals that expressive emotions often look at character tokens while other mental states rely on video and dialog cues.

GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering

WWW Workshop on Natural Language Processing for Knowledge Graph Construction, NLP4KGc, 2023

Core Rank : - Google Rank :-

Abs PDF bibTex

@inproceedings{bib_Grap_2023, AUTHOR = {Taunk, Dhaval and Khanna, Lakshya and Kumar, Kandru Siri Venkata Pavan and Kalidindi, Vasudeva Varma and Sharma, Charu and Tapaswi, Makarand }, TITLE = {GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering}, BOOKTITLE = {WWW Workshop on Natural Language Processing for Knowledge Graph Construction}. YEAR = {2023}}

GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering

Abstract

Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG). A typical approach collects nodes relevant to the QA pair from a KG to form a Working Graph (WG) followed by reasoning using Graph Neural Networks (GNNs). This faces two major challenges: (i) it is difficult to capture all the information from the QA in the WG, and (ii) the WG contains some irrelevant nodes from the KG. To address these, we propose GrapeQA with two simple improvements on the WG: (i) Prominent Entities for Graph Augmentation identifies relevant text chunks from the QA pair and augments the WG with corresponding latent representations from the LM, and (ii) ContextAware Node Pruning removes nodes that are less relevant to the QA pair. We evaluate our results on OpenBookQA, CommonsenseQA and MedQA-USMLE and see that GrapeQA shows consistent improvements over its LM + KG predecessor (QA-GNN in particular) and large improvements on OpenBookQA.

DO VIDEO-LANGUAGE FOUNDATION MODELS HAVE A SENSE OF TIME?

workshop on International Conference on Learning Representations, ICLR-W, 2023

Core Rank : - Google Rank :-

Abs PDF bibTex

@inproceedings{bib_DO_V_2023, AUTHOR = {Bagad, Piyush and Tapaswi, Makarand and Snoek, Cees G. M. }, TITLE = {DO VIDEO-LANGUAGE FOUNDATION MODELS HAVE A SENSE OF TIME?}, BOOKTITLE = {workshop on International Conference on Learning Representations}. YEAR = {2023}}

DO VIDEO-LANGUAGE FOUNDATION MODELS HAVE A SENSE OF TIME?

Abstract

Modelling and understanding time remains a challenge in contemporary video understanding models. Time also appears in language through temporal relations. Video-language models can benefit from having a sense of time, especially since language provides an interface for generalization. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We construct a simple synthetic dataset to measure such temporal understanding in video-language models and find that six existing models struggle to understand even such simple relations. We then posit whether it is feasible to equip these foundation models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on postpretraining on a small amount of video-text data. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without needing data- and compute-intense training from scratch. Project page:

Test of Time: Instilling Video-Language Models with a Sense of Time

Computer Vision and Pattern Recognition, CVPR, 2023

Core Rank : A* Google Rank :440

Abs PDF bibTex

@inproceedings{bib_Test_2023, AUTHOR = {Bagad, Piyush and Tapaswi, Makarand and Snoek, Cees G. M. }, TITLE = {Test of Time: Instilling Video-Language Models with a Sense of Time}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2023}}

Test of Time: Instilling Video-Language Models with a Sense of Time

Abstract

Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness. We observe encouraging performance gains especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.

Unsupervised Audio-Visual Lecture Segmentation

Winter Conference on Applications of Computer Vision, WACV, 2023

Core Rank : - Google Rank :109

Abs PDF bibTex

@inproceedings{bib_Unsu_2023, AUTHOR = {S, Darshan Singh and Gupta, Anchit and V, Jawahar C and Tapaswi, Makarand }, TITLE = {Unsupervised Audio-Visual Lecture Segmentation}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2023}}

Unsupervised Audio-Visual Lecture Segmentation

Abstract

Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation that splits lectures into bitesized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext selfsupervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1- nearest neighbor algorithm, TW-FINCH [44]. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

European Conference on Computer Vision, ECCV, 2022

Core Rank : A* Google Rank :206

Abs bibTex

@inproceedings{bib_Lear_2022, AUTHOR = {Chen, Shizhe and Guhur, Pierre-Louis and Tapaswi, Makarand and Schmid, Cordelia and Laptev, Ivan }, TITLE = {Learning from Unlabeled 3D Environments for Vision-and-Language Navigation}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2022}}

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

Abstract

In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and prevents scalability. In this work, we address the data scarcity issue by proposing to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D. We generate a navigation graph for each building and transfer object predictions from 2D to generate pseudo 3D object labels by cross-view consistency. We then fine-tune a pretrained language model using pseudo object labels as prompts to alleviate the cross-modal gap in instruction generation. Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions. We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of resulting VLN models. On the SPL metric, our approach improves over state of the art by 7.1% and 8.1% on the unseen validation splits of REVERIE and SOON datasets respectively.