The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad,Makarand Tapaswi,Cees G. M. Snoek,Andrew Zisserman
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2025
@inproceedings{bib_The__2025, AUTHOR = {Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek, Andrew Zisserman}, TITLE = {The Sound of Water: Inferring Physical Properties from Pouring Liquids}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}, YEAR = {2025}}
We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model with supervision from simulated data and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos for a systematic study; (iv) show that the trained model can indeed infer these physical properties for real data; and finally, (v) we demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work presents a keen understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning. It opens up applications to enhance multisensory perception in robotic pouring.
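The key acoustic relation behind this inference can be illustrated with a short, self-contained calculation. Below is a minimal sketch (not the paper's code) that treats the air column above the liquid in a roughly cylindrical container as a quarter-wave resonator, so the fundamental frequency f0 = c / (4 L_air); the liquid level, pouring rate, and time to fill then follow from the pitch trajectory. The container height, pitch values, and end-correction-free formula are illustrative assumptions.

import numpy as np

C = 343.0  # speed of sound in air (m/s), assumed constant

def air_column_length(f0_hz):
    # quarter-wave resonator closed at the liquid surface, open at the top: f0 = c / (4 L)
    return C / (4.0 * np.asarray(f0_hz))

def liquid_level(f0_hz, container_height_m):
    # liquid height = container height minus the resonating air column (end correction ignored)
    return container_height_m - air_column_length(f0_hz)

# toy example: pitch rising from 500 Hz to 1200 Hz while filling a 20 cm container
t = np.linspace(0.0, 5.0, 50)                  # seconds
f0 = np.linspace(500.0, 1200.0, t.size)        # hypothetical pitch trajectory
level = liquid_level(f0, 0.20)                 # liquid level in metres
rate = np.gradient(level, t)                   # pouring rate as level rise (m/s)
time_to_fill = t[-1] + (0.20 - level[-1]) / max(rate[-1], 1e-6)
print(f"final level: {level[-1]:.3f} m, estimated time to fill: {time_to_fill:.1f} s")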
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur,Darshan Singh S,Makarand Tapaswi
Transactions on Machine Learning Research, TMLR, 2025
@inproceedings{bib_No_D_2025, AUTHOR = {Manu Gaur, Darshan Singh S, Makarand Tapaswi}, TITLE = {No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning}, BOOKTITLE = {Transactions on Machine Learning Research}, YEAR = {2025}}
Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that more optimally leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g. +4.8% - 7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters. We also outperform vanilla SR by +14.4% to +19.5%.
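For readers unfamiliar with the self-retrieval (SR) reward mentioned above, the sketch below shows the basic idea in a hypothetical form: the generated caption is rewarded when it retrieves its own image from a bag of distractors under a frozen contrastive text/image encoder. The embeddings here are random stand-ins; the actual encoders, bag construction, and reward shaping in the paper may differ.

import numpy as np

def self_retrieval_reward(caption_emb, image_embs, target_idx):
    # reward 1.0 if the caption ranks its own image first by cosine similarity, else 0.0
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ caption_emb
    return float(np.argmax(sims) == target_idx)

# toy usage: random vectors standing in for encoder outputs (1 target + 9 distractors)
rng = np.random.default_rng(0)
images = rng.normal(size=(10, 512))
caption = images[3] + 0.1 * rng.normal(size=512)   # caption embedding close to image 3
print(self_retrieval_reward(caption, images, target_idx=3))   # 1.0 when retrieval succeeds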
Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability
Prajneya Kumar,Eshika Khandelwal,Makarand Tapaswi,Vishnu Sreekumar
Winter Conference on Applications of Computer Vision, WACV, 2025
@inproceedings{bib_Seei_2025, AUTHOR = {Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar}, TITLE = {Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}, YEAR = {2025}}
Understanding what makes a video memorable has important applications in advertising and education technology. Towards this goal, we investigate spatio-temporal attention mechanisms underlying video memorability. Different from previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study where humans perform the video memory task. We uncover the following insights: (i) Quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits similar spatial attention patterns to human gaze, especially for more memorable videos. (ii) The model assigns greater importance to initial frames in a video, mimicking human attention patterns. (iii) Panoptic segmentation reveals that both the model and humans assign a greater share of attention to 'things' and less attention to 'stuff' as compared to their occurrence probability.
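As an illustration of the quantitative saliency comparison referred to in (i), the snippet below computes Pearson's correlation coefficient (CC), a standard saliency agreement measure, between a model attention map and a human fixation density map. This is a generic sketch; the exact metrics and map preprocessing used in the paper may differ.

import numpy as np

def saliency_cc(model_attn, gaze_map):
    # Pearson's r between two 2D maps defined over the same frame
    a = (model_attn - model_attn.mean()) / (model_attn.std() + 1e-8)
    g = (gaze_map - gaze_map.mean()) / (gaze_map.std() + 1e-8)
    return float((a * g).mean())

# toy usage with a synthetic, partially correlated gaze map
attn = np.random.rand(32, 32)
gaze = 0.7 * attn + 0.3 * np.random.rand(32, 32)
print(saliency_cc(attn, gaze))   # positive and close to 1 when the maps agree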
Localizing Auditory Concepts in CNNs
Pratyaksh Gautam,Makarand Tapaswi,Vinoo Alluri R
ICML Mechanistic Interpretability Workshop, ICMLMI-W, 2024
@inproceedings{bib_Loca_2024, AUTHOR = {Pratyaksh Gautam, Makarand Tapaswi, Vinoo Alluri R}, TITLE = {Localizing Auditory Concepts in CNNs}, BOOKTITLE = {ICML Mechanistic Interpretability Workshop}, YEAR = {2024}}
Deep learning models are capable of complex auditory processing tasks such as keyword spotting, genre classification, and audio captioning, yet remain opaque. While several works have explored interpretability of neural networks for computer vision and natural language processing, the audio modality has been largely ignored. In this paper, we study the behavior of the audio CNN encoder used in the contrastively trained language-audio model, CLAP. In the domain of music and human speech sounds, we localize and identify the layers of the network that perform well on tasks of varying complexity, sometimes even outperforming the model's final outputs. Digging deeper, we also localize specific dataset classes to neuron clusters within a layer and analyze a cluster’s contribution to the model’s discriminability for that class. To perform these analyses, we propose an automated framework that can leverage a small dataset of a few thousand samples to evaluate and score neuron clusters for their role in classification. Our findings provide insights into the hierarchical nature of representations in audio CNNs, paving the way for improved interpretability of audio models.
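A generic way to carry out the kind of layer-wise localization described above is to capture intermediate activations with forward hooks and score each layer with a small linear probe. The sketch below is illustrative only; the model loader, layer names, and probing protocol are assumptions, not the paper's framework.

import torch
import torch.nn as nn

def collect_activations(model, layer_names, batch):
    # run one forward pass and return {layer_name: pooled activation} using forward hooks
    feats, handles = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            def hook(_module, _inputs, output, name=name):
                # global-average-pool spatial/temporal dims to a fixed-size vector per clip
                feats[name] = output.flatten(2).mean(-1).detach()
            handles.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(batch)
    for h in handles:
        h.remove()
    return feats

# usage sketch (names are placeholders): probe each chosen layer with a linear classifier
# model = load_pretrained_audio_cnn()                        # assumed helper
# feats = collect_activations(model, ["layer2", "layer4"], spectrogram_batch)
# probe = nn.Linear(feats["layer4"].shape[-1], num_classes)  # train on a few thousand samples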
System and method for identifying soundtrack for a digital book using a movie adaptation technique
Vinoo Alluri R,Makarand Tapaswi,Jaidev Shriram
United States Patent, US Patent, 2024
@inproceedings{bib_Syst_2024, AUTHOR = {Vinoo Alluri R, Makarand Tapaswi, Jaidev Shriram}, TITLE = {System and method for identifying soundtrack for a digital book using a movie adaptation technique}, BOOKTITLE = {United States Patent}, YEAR = {2024}}
A method for identifying a soundtrack for a digital book by automatically aligning the digital book and the soundtrack with a movie using a movie adaptation technique is provided. The method includes (i) receiving media content through a user device that includes the digital book, the soundtrack, and the movie, (ii) segmenting (a) the chapters of the digital book into paragraph segments, (b) the movie into scene boundaries, and (c) tracks of the soundtrack into cohesive track segments, (iii) aligning the scene boundaries with the paragraph segments to generate aligned paragraph segments, (iv) aligning background soundtracks of the movie with the cohesive track segments of the soundtrack using a majority key and a minority key of the cohesive track segments and the background soundtracks to generate aligned cohesive track segments, and (v) aligning the aligned paragraph segments with the aligned cohesive track segments.
Major Entity Identification: A Generalizable Alternative to Coreference Resolution
S Kawshik Manikantan,Shubham Toshniwal,Makarand Tapaswi,Vineet Gandhi
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2024
@inproceedings{bib_Majo_2024, AUTHOR = {S Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi}, TITLE = {Major Entity Identification: A Generalizable Alternative to Coreference Resolution}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}, YEAR = {2024}}
The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative referential task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, MEI fits the classification framework, which enables the use of robust and intuitive classification-based metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.
MICap: A Unified Model for Identity-aware Movie Descriptions
Haran S K Raajesh,Naveen Reddy Desanur,Zeeshan Khan,Makarand Tapaswi
Computer Vision and Pattern Recognition, CVPR, 2024
@inproceedings{bib_MICa_2024, AUTHOR = {Haran S K Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi}, TITLE = {MICap: A Unified Model for Identity-aware Movie Descriptions}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2024}}
Characters are an important aspect of any storyline and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single-stage approach that can seamlessly switch between id-aware caption generation and FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on the Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy, and a 1-2% bump in classic captioning metrics.
Previously On ... From Recaps to Story Summarization
Aditya Kumar Singh,Dhruv Srivastava,Makarand Tapaswi
Computer Vision and Pattern Recognition, CVPR, 2024
@inproceedings{bib_Prev_2024, AUTHOR = {Aditya Kumar Singh, Dhruv Srivastava, Makarand Tapaswi}, TITLE = {Previously On ... From Recaps to Story Summarization}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2024}}
We introduce multimodal story summarization by leveraging TV episode recaps – short video sequences interweaving key visual moments and dialog from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps, and long-form content over 40 minutes per episode. Recap shots are mapped to corresponding sub-stories that serve as labels for story summarization. We propose a hierarchical model TaleSumm that (i) processes entire episodes by creating compact shot and dialog representations, and (ii) predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization tasks, our method extracts multiple plot points from long-form videos. We present a thorough evaluation on this task, including promising cross-series generalization. TaleSumm shows good results on video summarization benchmarks.
How you feelin? Learning Emotions and Mental States in Movie Scenes
Dhruv Srivastava,Aditya Kumar Singh,Makarand Tapaswi
Computer Vision and Pattern Recognition, CVPR, 2023
@inproceedings{bib_How__2023, AUTHOR = {Dhruv Srivastava, Aditya Kumar Singh, Makarand Tapaswi}, TITLE = {How you feelin? Learning Emotions and Mental States in Movie Scenes}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2023}}
Movie story analysis requires understanding characters’ emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions. By leveraging annotations from the MovieGraphs dataset [72], we aim to predict classic emotions (e.g. happy, angry) and other mental states (e.g. honest, helpful). We conduct experiments on the most frequently occurring 10 and 25 labels, and a mapping that clusters 181 labels to 26. Ablation studies and comparison against adapted state-of-the-art emotion recognition approaches show the effectiveness of EmoTx. Analyzing EmoTx’s self-attention scores reveals that expressive emotions often look at character tokens while other mental states rely on video and dialog cues.
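To make the prediction setup concrete, here is a schematic sketch of a multi-label scene/character emotion classifier of the kind described above: a transformer encoder over video, character, and dialog tokens with a shared multi-label head trained with binary cross-entropy. Dimensions, token counts, and the readout scheme are illustrative assumptions, not the released EmoTx architecture.

import torch
import torch.nn as nn

class SceneEmotionModel(nn.Module):
    def __init__(self, dim=256, num_labels=25, nhead=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.scene_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable scene query
        self.head = nn.Linear(dim, num_labels)                   # shared multi-label head

    def forward(self, video_tok, char_tok, dialog_tok):
        b, n_char = char_tok.size(0), char_tok.size(1)
        scene = self.scene_token.expand(b, -1, -1)
        x = self.encoder(torch.cat([scene, char_tok, video_tok, dialog_tok], dim=1))
        scene_logits = self.head(x[:, 0])                 # (B, num_labels) per scene
        char_logits = self.head(x[:, 1:1 + n_char])       # (B, n_char, num_labels) per character
        return scene_logits, char_logits

# toy usage: 2 scenes, 3 characters, 25 candidate emotion/mental-state labels
model = SceneEmotionModel()
v, c, d = torch.randn(2, 10, 256), torch.randn(2, 3, 256), torch.randn(2, 5, 256)
scene_logits, char_logits = model(v, c, d)
loss = nn.BCEWithLogitsLoss()(scene_logits, torch.randint(0, 2, (2, 25)).float())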
GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering
Dhaval Taunk,Lakshya Khanna,Kandru Siri Venkata Pavan Kumar,Vasudeva Varma Kalidindi,Charu Sharma,Makarand Tapaswi
WWW Workshop on Natural Language Processing for Knowledge Graph Construction, NLP4KGc, 2023
@inproceedings{bib_Grap_2023, AUTHOR = {Dhaval Taunk, Lakshya Khanna, Kandru Siri Venkata Pavan Kumar, Vasudeva Varma Kalidindi, Charu Sharma, Makarand Tapaswi}, TITLE = {GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering}, BOOKTITLE = {WWW Workshop on Natural Language Processing for Knowledge Graph Construction}, YEAR = {2023}}
Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG). A typical approach collects nodes relevant to the QA pair from a KG to form a Working Graph (WG) followed by reasoning using Graph Neural Networks (GNNs). This faces two major challenges: (i) it is difficult to capture all the information from the QA in the WG, and (ii) the WG contains some irrelevant nodes from the KG. To address these, we propose GrapeQA with two simple improvements on the WG: (i) Prominent Entities for Graph Augmentation identifies relevant text chunks from the QA pair and augments the WG with corresponding latent representations from the LM, and (ii) Context-Aware Node Pruning removes nodes that are less relevant to the QA pair. We evaluate our results on OpenBookQA, CommonsenseQA and MedQA-USMLE and see that GrapeQA shows consistent improvements over its LM + KG predecessor (QA-GNN in particular) and large improvements on OpenBookQA.
Do Video-Language Foundation Models Have a Sense of Time?
Piyush Bagad,Makarand Tapaswi,Cees G. M. Snoek
Workshop at the International Conference on Learning Representations, ICLR-W, 2023
@inproceedings{bib_DO_V_2023, AUTHOR = {Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek}, TITLE = {Do Video-Language Foundation Models Have a Sense of Time?}, BOOKTITLE = {Workshop at the International Conference on Learning Representations}, YEAR = {2023}}
Modelling and understanding time remains a challenge in contemporary video understanding models. Time also appears in language through temporal relations. Video-language models can benefit from having a sense of time, especially since language provides an interface for generalization. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We construct a simple synthetic dataset to measure such temporal understanding in video-language models and find that six existing models struggle to understand even such simple relations. We then ask whether it is feasible to equip these foundation models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without needing data- and compute-intensive training from scratch.
Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad,Makarand Tapaswi,Cees G. M. Snoek
Computer Vision and Pattern Recognition, CVPR, 2023
@inproceedings{bib_Test_2023, AUTHOR = {Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek}, TITLE = {Test of Time: Instilling Video-Language Models with a Sense of Time}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2023}}
Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness. We observe encouraging performance gains especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data- and compute-intensive training from scratch.
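The before/after consistency test described above can be pictured with a small data-construction routine: a caption "A before B" should match the clips in order (A, B) and mismatch the reversed order. The sketch below is illustrative of how such probe pairs can be built; it is not the authors' pipeline, whose adaptation itself is post-pretraining on top of VideoCLIP.

def make_time_order_pairs(clip_a, clip_b, event_a, event_b):
    # return (video clip order, caption, consistency label) triples for before/after probing
    text_fwd = f"{event_a} before {event_b}"
    text_rev = f"{event_b} before {event_a}"
    ordered = [clip_a, clip_b]       # clips in temporal order
    reversed_ = [clip_b, clip_a]     # time-reversed clip order
    return [
        (ordered, text_fwd, 1),      # consistent pair      -> should score high
        (ordered, text_rev, 0),      # swapped caption      -> should score low
        (reversed_, text_fwd, 0),    # swapped clip order   -> should score low
        (reversed_, text_rev, 1),    # both swapped         -> consistent again
    ]

# a video-language model with a sense of time should rank the label-1 pairs above the label-0 pairs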
Unsupervised Audio-Visual Lecture Segmentation
Darshan Singh S,Anchit Gupta,Jawahar C V,Makarand Tapaswi
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_Unsu_2023, AUTHOR = {Darshan Singh S, Anchit Gupta, Jawahar C V, Makarand Tapaswi}, TITLE = {Unsupervised Audio-Visual Lecture Segmentation}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}, YEAR = {2023}}
Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation that splits lectures into bite-sized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH [44]. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
Shizhe Chen,Pierre-Louis Guhur,Makarand Tapaswi,Cordelia Schmid,Ivan Laptev
European Conference on Computer Vision, ECCV, 2022
@inproceedings{bib_Lear_2022, AUTHOR = {Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev}, TITLE = {Learning from Unlabeled 3D Environments for Vision-and-Language Navigation}, BOOKTITLE = {European Conference on Computer Vision}, YEAR = {2022}}
In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and prevents scalability. In this work, we address the data scarcity issue by proposing to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D. We generate a navigation graph for each building and transfer object predictions from 2D to generate pseudo 3D object labels by cross-view consistency. We then fine-tune a pretrained language model using pseudo object labels as prompts to alleviate the cross-modal gap in instruction generation. Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions. We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of resulting VLN models. On the SPL metric, our approach improves over state of the art by 7.1% and 8.1% on the unseen validation splits of REVERIE and SOON datasets respectively.
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
Shizhe Chen,Pierre-Louis Guhur,Makarand Tapaswi,Cordelia Schmid,Ivan Laptev
Computer Vision and Pattern Recognition, CVPR, 2022
@inproceedings{bib_Thin_2022, AUTHOR = {Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev}, TITLE = {Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2022}}
Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent not only needs to ground languages in visual scenes, but also should explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on-the-fly to enable efficient exploration in global action space. To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers. The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON. It also improves the success rate on the fine-grained VLN benchmark R2R.
Learning Object Manipulation Skills from Video via Approximate Differentiable Physics
Vladimír Petrík,Mohammad Nomaan Qureshi,Josef Sivic,Makarand Tapaswi
International Conference on Intelligent Robots and Systems, IROS, 2022
@inproceedings{bib_Lear_2022, AUTHOR = {Vladimír Petrík, Mohammad Nomaan Qureshi, Josef Sivic, Makarand Tapaswi}, TITLE = {Learning Object Manipulation Skills from Video via Approximate Differentiable Physics}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}, YEAR = {2022}}
We aim to teach robots to perform simple object manipulation tasks by watching a single video demonstration. Towards this goal, we propose an optimization approach that outputs a coarse and temporally evolving 3D scene to mimic the action demonstrated in the input video. Similar to previous work, a differentiable renderer ensures perceptual fidelity between the 3D scene and the 2D video. Our key novelty lies in the inclusion of a differentiable approach to solve a set of Ordinary Differential Equations (ODEs) that allows us to approximately model laws of physics such as gravity, friction, and hand-object or object-object interactions. This not only enables us to dramatically improve the quality of estimated hand and object states, but also produces physically admissible trajectories that can be directly translated to a robot without the need for costly reinforcement learning. We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions such as pull something from right to left or put something in front of something. Our approach improves over previous state-of-the-art by almost 30%, demonstrating superior quality on especially challenging actions involving physical interactions of two objects such as put something onto something. Finally, we showcase the learned skills on a Franka Emika Panda robot.
Instruction-driven history-aware policies for robotic manipulations
Pierre-Louis Guhur,Shizhe Chen,Ricardo Garcia,Makarand Tapaswi,Ivan Laptev,Cordelia Schmid
Conference on Robot Learning, CORL, 2022
@inproceedings{bib_Inst_2022, AUTHOR = {Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid}, TITLE = {Instruction-driven history-aware policies for robotic manipulations}, BOOKTITLE = {Conference on Robot Learning}, YEAR = {2022}}
In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging as it requires fine-grained motor control, long-term memory as well as generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that takes into account multiple inputs. In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations while (iii) keeping track of the full history of observations and actions. Such an approach enables learning dependencies between history and instructions and improves manipulation precision using multiple views. We evaluate our method on the challenging RLBench benchmark and on a real-world robot. Notably, our approach scales to 74 diverse RLBench tasks and outperforms the state of the art. We also address instruction-conditioned tasks and demonstrate excellent generalization to previously unseen variations.
Can we Adopt Self-supervised Pretraining for Chest X-Rays?
Arsh Verma,Makarand Tapaswi
Machine Learning for Health Workshop, ML4H, 2022
@inproceedings{bib_Can__2022, AUTHOR = {Arsh Verma, Makarand Tapaswi}, TITLE = {Can we Adopt Self-supervised Pretraining for Chest X-Rays?}, BOOKTITLE = {Machine Learning for Health Workshop}, YEAR = {2022}}
Chest radiograph (or Chest X-Ray, CXR) is a popular medical imaging modality that is used by radiologists across the world to diagnose heart or lung conditions. Over the last decade, Convolutional Neural Networks (CNNs) have seen success in identifying pathologies in CXR images. Typically, these CNNs are pretrained on the standard ImageNet classification task, but this assumes availability of large-scale annotated datasets. In this work, we analyze the utility of pretraining on unlabeled ImageNet or Chest X-Ray (CXR) datasets using various algorithms and in multiple settings. Some findings of our work include: (i) supervised training with labeled ImageNet learns strong representations that are hard to beat; (ii) self-supervised pretraining on ImageNet (∼1M images) shows performance similar to self-supervised pretraining on a CXR dataset (∼100K images); and (iii) the CNN trained on supervised ImageNet can be trained further with self-supervised CXR images leading to improvements, especially when the downstream dataset is on the order of a few thousand images.
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Shizhe Chen,Pierre-Louis Guhur,Makarand Tapaswi,Cordelia Schmid,Ivan Laptev
Neural Information Processing Systems, NeurIPS, 2022
@inproceedings{bib_Lang_2022, AUTHOR = {Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev}, TITLE = {Language Conditioned Spatial Relation Reasoning for 3D Object Grounding}, BOOKTITLE = {Neural Information Processing Systems}, YEAR = {2022}}
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred to by the text, such as "the left most chair" and "a chair next to the window". In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations. To this end, we design a spatial self-attention layer that accounts for relative distances and orientations between objects in input 3D point clouds. Training such a layer with visual and language inputs enables the model to disambiguate spatial relations and to localize objects referred to by the text. To facilitate the cross-modal learning of relations, we further propose a teacher-student approach where the teacher model is first trained using ground-truth object labels, and then helps to train a student model using point cloud inputs. We perform ablation studies showing the advantages of our approach. We also demonstrate that our model significantly outperforms the state of the art on the challenging Nr3D, Sr3D and ScanRefer 3D object grounding datasets.
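As a rough illustration of the spatial self-attention layer described above, the sketch below biases attention between 3D objects with features of their pairwise offsets and distances. It is a simplified stand-in; the feature sizes, relative-position encoding, and bias injection are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, dim=128, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        # maps pairwise (dx, dy, dz, distance) to one additive bias per attention head
        self.rel_mlp = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, nhead))

    def forward(self, obj_feats, centers):
        # obj_feats: (B, N, dim) object features; centers: (B, N, 3) object centres
        diff = centers[:, :, None, :] - centers[:, None, :, :]      # (B, N, N, 3)
        dist = diff.norm(dim=-1, keepdim=True)                      # (B, N, N, 1)
        bias = self.rel_mlp(torch.cat([diff, dist], dim=-1))        # (B, N, N, nhead)
        bias = bias.permute(0, 3, 1, 2)                             # (B, nhead, N, N)
        b, n = obj_feats.shape[:2]
        mask = bias.reshape(b * self.attn.num_heads, n, n)          # additive attention bias
        out, _ = self.attn(obj_feats, obj_feats, obj_feats, attn_mask=mask)
        return out

# toy usage: 4 objects in a scene with 128-D features and 3D centres
layer = SpatialSelfAttention()
print(layer(torch.randn(2, 4, 128), torch.randn(2, 4, 3)).shape)   # torch.Size([2, 4, 128])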
Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations
Jaidev Shriram,Makarand Tapaswi,Vinoo A R
International Society for Music Information Retrieval, ISMIR, 2022
@inproceedings{bib_Sonu_2022, AUTHOR = {Jaidev Shriram, Makarand Tapaswi, Vinoo A R}, TITLE = {Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations}, BOOKTITLE = {International Society for Music Information Retrieval}, YEAR = {2022}}
Reading, much like music listening, is an immersive experience that transports readers while taking them on an emotional journey. Listening to complementary music has the potential to amplify the reading experience, especially when the music is stylistically cohesive and emotionally relevant. In this paper, we propose the first fully automatic method to build a dense soundtrack for books, which can play high-quality instrumental music for the entirety of the reading duration. Our work employs a unique text processing and music weaving pipeline that determines the context and emotional composition of scenes in a chapter. This allows our method to identify and play relevant excerpts from the soundtrack of the book’s movie adaptation. By relying on the movie composer’s craftsmanship, our book soundtracks include expert-made motifs and other scene-specific musical characteristics. We validate the design decisions of our approach through a perceptual study. Our readers note that the book soundtrack greatly enhanced their reading experience, due to high immersiveness granted via uninterrupted and style-consistent music, and a heightened emotional state attained via high precision emotion and scene context recognition.
Grounded Video Situation Recognition
Zeeshan Khan,Jawahar C V,Makarand Tapaswi
Neural Information Processing Systems, NeurIPS, 2022
@inproceedings{bib_Grou_2022, AUTHOR = {Zeeshan Khan, Jawahar C V, Makarand Tapaswi}, TITLE = {Grounded Video Situation Recognition}, BOOKTITLE = {Neural Information Processing Systems}, YEAR = {2022}}
Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) has been framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, and also raises challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The second stage sees verb-role queries attend and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.
Long term spatio-temporal modeling for action detection
Makarand Tapaswi,Vijay Kumar,Ivan Laptev
Computer Vision and Image Understanding, CVIU, 2021
@inproceedings{bib_Long_2021, AUTHOR = {Makarand Tapaswi, Vijay Kumar, Ivan Laptev}, TITLE = {Long term spatio-temporal modeling for action detection}, BOOKTITLE = {Computer Vision and Image Understanding}, YEAR = {2021}}
Modeling person interactions with their surroundings has proven to be effective for recognizing and localizing human actions in videos. While most recent works focus on learning short term interactions, in this work, we consider long-term person interactions and jointly localize actions of multiple actors over an entire video shot. We construct a graph with nodes that correspond to keyframe actor instances and connect them with two edge types. Spatial edges connect actors within a keyframe, and temporal edges connect multiple instances of the same actor over a video shot. We propose a Graph Neural Network that explicitly models spatial and temporal states for each person instance and learns to effectively combine information from both modalities to make predictions at the same time. We conduct experiments on the AVA dataset and show that our graph-based model provides consistent improvements over several video descriptors, achieving state-of-the-art performance without any fine-tuning.
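The two edge types described above can be made concrete with a small graph-construction routine: spatial edges connect actor detections that share a keyframe, while temporal edges link consecutive detections of the same actor track across the shot. This is an illustrative sketch, not the paper's code; keyframe and track identifiers are assumed to come from an upstream detector and tracker.

from itertools import combinations

def build_actor_graph(detections):
    # detections: list of dicts with keys 'node_id', 'frame', 'track'
    spatial, temporal = [], []
    # spatial edges: all pairs of actors co-occurring in the same keyframe
    by_frame = {}
    for d in detections:
        by_frame.setdefault(d["frame"], []).append(d["node_id"])
    for nodes in by_frame.values():
        spatial.extend(combinations(sorted(nodes), 2))
    # temporal edges: consecutive instances of the same actor track over the shot
    by_track = {}
    for d in detections:
        by_track.setdefault(d["track"], []).append((d["frame"], d["node_id"]))
    for instances in by_track.values():
        instances.sort()
        temporal.extend((a, b) for (_, a), (_, b) in zip(instances, instances[1:]))
    return spatial, temporal

# toy shot: actors A and B in keyframes 0-1, only A in keyframe 2
dets = [{"node_id": i, "frame": f, "track": t}
        for i, (f, t) in enumerate([(0, "A"), (0, "B"), (1, "A"), (1, "B"), (2, "A")])]
print(build_actor_graph(dets))   # spatial: [(0, 1), (2, 3)]; temporal: [(0, 2), (2, 4), (1, 3)]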
Airbert: In-domain Pretraining for Vision-and-Language Navigation
Pierre-Louis Guhur,Makarand Tapaswi,Shizhe Chen,Ivan Laptev,Cordelia Schmid
International Conference on Computer Vision, ICCV, 2021
@inproceedings{bib_Airb_2021, AUTHOR = {Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid}, TITLE = {Airbert: In-domain Pretraining for Vision-and-Language Navigation}, BOOKTITLE = {International Conference on Computer Vision}, YEAR = {2021}}
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model that can be adapted to discriminative and generative settings and show that it outperforms state of the art for Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.
Feature Generation for Long-tail Classification
Rahul Vigneswaran,Marc T. Law,Vineeth N. Balasubramanian,Makarand Tapaswi
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2021
@inproceedings{bib_Feat_2021, AUTHOR = {Rahul Vigneswaran, Marc T. Law, Vineeth N. Balasubramanian, Makarand Tapaswi}, TITLE = {Feature Generation for Long-tail Classification}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}, YEAR = {2021}}
The visual world naturally exhibits an imbalance in the number of object or scene instances resulting in a long-tailed distribution. This imbalance poses significant challenges for classification models based on deep learning. Oversampling instances of the tail classes attempts to solve this imbalance. However, the limited visual diversity results in a network with poor representation ability. A simple counter to this is decoupling the representation and classifier networks and using oversampling only to train the classifier. In this paper, instead of repeatedly re-sampling the same image (and thereby features), we explore a direction that attempts to generate meaningful features by estimating the tail category’s distribution. Inspired by ideas from recent work on few-shot learning [53], we create calibrated distributions to sample additional features that are subsequently used to train the classifier. Through several experiments on the CIFAR-100-LT (long-tail) dataset with varying imbalance factors and on mini-ImageNet-LT (long-tail), we show the efficacy of our approach and establish a new state-of-the-art. We also present a qualitative analysis of generated features using t-SNE visualizations and analyze the nearest neighbors used to calibrate the tail class distributions. Our code is available at https://github.com/rahulvigneswaran/TailCalibX.
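The feature-generation step described above can be sketched in a few lines: calibrate a tail-class Gaussian by borrowing statistics from the nearest head classes, then sample extra features for the classifier stage. This follows the spirit of the cited distribution-calibration idea and is not the released TailCalibX code; the choice of k, the blending rule, and the regularizer alpha are illustrative assumptions.

import numpy as np

def calibrate_and_sample(tail_feats, head_means, head_covs, k=2, n_samples=100, alpha=0.5):
    # tail_feats: (m, d) few features of one tail class; head_means: (C, d); head_covs: (C, d, d)
    mu_tail = tail_feats.mean(axis=0)
    # pick the k head classes whose means lie closest to the tail-class mean
    nearest = np.argsort(np.linalg.norm(head_means - mu_tail, axis=1))[:k]
    mu = (mu_tail + head_means[nearest].sum(axis=0)) / (k + 1)       # blended mean
    cov = head_covs[nearest].mean(axis=0) + alpha * np.eye(mu.size)  # borrowed, regularized covariance
    return np.random.default_rng(0).multivariate_normal(mu, cov, size=n_samples)

# toy usage: 3 head classes with precomputed statistics, one 5-shot tail class in 16-D
d = 16
head_means = np.random.randn(3, d)
head_covs = np.stack([np.eye(d)] * 3)
tail_feats = head_means[0] + 0.1 * np.random.randn(5, d)
extra = calibrate_and_sample(tail_feats, head_means, head_covs)
print(extra.shape)   # (100, 16) generated features used to train the classifier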