Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability
Prajneya Kumar,Eshika Khandelwal,Makarand Tapaswi,Vishnu Sreekumar
Winter Conference on Applications of Computer Vision, WACV, 2025
@inproceedings{bib_Seei_2025, AUTHOR = {Prajneya Kumar and Eshika Khandelwal and Makarand Tapaswi and Vishnu Sreekumar}, TITLE = {Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}, YEAR = {2025}}
Understanding what makes a video memorable has important applications in advertising and education technology. Towards this goal, we investigate spatio-temporal attention mechanisms underlying video memorability. Unlike previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study in which humans perform the video memory task. We uncover the following insights: (i) Quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits spatial attention patterns similar to human gaze, especially for more memorable videos. (ii) The model assigns greater importance to initial frames of a video, mimicking human attention patterns. (iii) Panoptic segmentation reveals that both the model and humans assign a greater share of attention to "things" and less to "stuff" than their occurrence probability would suggest.
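To make the architecture concrete, below is a minimal sketch of a CNN+Transformer memorability regressor of the kind described above; layer choices and sizes are assumptions, not the authors' released model. Spatial maps comparable to gaze can then be read from backbone activations and temporal weighting from the encoder's attention.

```python
# Hedged sketch (assumed layer choices and sizes, not the authors' released model):
# a CNN encodes each frame, a Transformer aggregates frames, and a single score
# in [0, 1] is regressed as the memorability prediction.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MemorabilityNet(nn.Module):
    def __init__(self, d_model=512, num_heads=8, num_layers=2):
        super().__init__()
        cnn = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-1])    # per-frame CNN features
        enc_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)  # temporal aggregation
        self.head = nn.Linear(d_model, 1)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).flatten(1).view(B, T, -1)
        ctx = self.encoder(feats)                   # attention over frames
        return torch.sigmoid(self.head(ctx.mean(dim=1))).squeeze(-1)

scores = MemorabilityNet()(torch.randn(2, 8, 3, 112, 112))   # two clips -> two scores
```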
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur,Darshan Singh S,Makarand Tapaswi
Transactions on Machine Learning Research, TMLR, 2025
@inproceedings{bib_No_D_2025, AUTHOR = {Manu Gaur and Darshan Singh S and Makarand Tapaswi}, TITLE = {No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning}, BOOKTITLE = {Transactions on Machine Learning Research}, YEAR = {2025}}
Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training, which encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning tends to reduce caption faithfulness and even introduce hallucinations. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that more optimally leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects of the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023) and by +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by approaching self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch and find that our SR approach outperforms them all by a significant margin (e.g., +4.8% to +7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters. We also outperform vanilla SR by +14.4% to +19.5%.
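As a rough illustration of the self-retrieval signal discussed above, the snippet below rewards a caption only if it retrieves its own image from a bag of distractors under a generic text-image similarity; the embeddings and the 0/1 reward are assumptions, not the paper's exact setup.

```python
# Hedged sketch of a self-retrieval (SR) style reward: a generated caption earns
# reward only if it ranks its own image above the distractors under some
# text-image similarity model. The embeddings and binary reward are assumptions.
import torch
import torch.nn.functional as F

def self_retrieval_reward(caption_emb, image_embs, target_idx):
    """caption_emb: (D,); image_embs: (N, D) with the true image at target_idx."""
    sims = F.cosine_similarity(caption_emb.unsqueeze(0), image_embs)
    return float(sims.argmax().item() == target_idx)

cap = torch.randn(256)
bag = torch.randn(100, 256)        # 1 target image + 99 random distractors
print(self_retrieval_reward(cap, bag, target_idx=0))
```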
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad,Makarand Tapaswi,Cees G. M. Snoek,Andrew Zisserman
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2025
@inproceedings{bib_The__2025, AUTHOR = {Piyush Bagad and Makarand Tapaswi and Cees G. M. Snoek and Andrew Zisserman}, TITLE = {The Sound of Water: Inferring Physical Properties from Pouring Liquids}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}, YEAR = {2025}}
We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model with supervision from simulated data and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos for a systematic study; (iv) show that the trained model can indeed infer these physical properties for real data; and finally, (v) we demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work presents a keen understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning. It opens up applications to enhance multisensory perception in robotic pouring.
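A simplified version of the underlying physics: treating the air column above the liquid as a tube closed at the liquid surface and open at the top gives a fundamental frequency of roughly f = c / (4L), so the rising pitch during pouring reveals the liquid level. The sketch below inverts this relation; end corrections and container-shape effects handled in the paper are ignored here.

```python
# Simplified quarter-wavelength model (an assumption, not the paper's full
# formulation): the air column above the liquid resonates at f ~ c / (4 * L),
# so measuring pitch gives the air-column length and hence the liquid level.
SPEED_OF_SOUND = 343.0  # m/s in air at ~20 C

def air_column_length(pitch_hz: float) -> float:
    return SPEED_OF_SOUND / (4.0 * pitch_hz)

def liquid_level(pitch_hz: float, container_height_m: float) -> float:
    return max(0.0, container_height_m - air_column_length(pitch_hz))

# e.g. a 0.20 m container with a 700 Hz fundamental -> roughly 0.08 m of liquid
print(round(liquid_level(700.0, 0.20), 3))
```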
Previously On ... From Recaps to Story Summarization
Aditya Kumar Singh,Dhruv Srivastava,Makarand Tapaswi
Computer Vision and Pattern Recognition, CVPR, 2024
@inproceedings{bib_Prev_2024, AUTHOR = {Aditya Kumar Singh and Dhruv Srivastava and Makarand Tapaswi}, TITLE = {Previously On ... From Recaps to Story Summarization}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2024}}
We introduce multimodal story summarization by leveraging TV episode recaps – short video sequences interweaving key visual moments and dialog from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps and long-form content of over 40 minutes per episode. Recap shots are mapped to corresponding sub-stories that serve as labels for story summarization. We propose TaleSumm, a hierarchical model that (i) processes entire episodes by creating compact shot and dialog representations, and (ii) predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization tasks, our method extracts multiple plot points from long-form videos. We present a thorough evaluation on this task, including promising cross-series generalization. TaleSumm also shows good results on video summarization benchmarks.
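The extractive flavor of the task can be pictured with a small sketch: score every shot/dialog representation for importance with a Transformer and keep the top-k units. Dimensions and the top-k rule below are assumptions, not the TaleSumm implementation.

```python
# Illustrative sketch (assumed dimensions, not the TaleSumm code): a Transformer
# scores every shot/dialog representation for importance and the top-k units
# form the extractive story summary.
import torch
import torch.nn as nn

class ImportanceScorer(nn.Module):
    def __init__(self, d_model=256, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.score = nn.Linear(d_model, 1)

    def forward(self, tokens):                                 # tokens: (B, N, D)
        return self.score(self.encoder(tokens)).squeeze(-1)    # (B, N) importance

episode = torch.randn(1, 120, 256)                             # 120 shot/dialog units
summary_idx = ImportanceScorer()(episode).topk(k=10, dim=1).indices
```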
MICap: A Unified Model for Identity-aware Movie Descriptions
Haran S K Raajesh,Naveen Reddy Desanur,Zeeshan Khan,Makarand Tapaswi
Computer Vision and Pattern Recognition, CVPR, 2024
@inproceedings{bib_MICa_2024, AUTHOR = {Haran S K Raajesh and Naveen Reddy Desanur and Zeeshan Khan and Makarand Tapaswi}, TITLE = {MICap: A Unified Model for Identity-aware Movie Descriptions}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2024}}
Characters are an important aspect of any storyline, and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single-stage approach that can seamlessly switch between id-aware caption generation and FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on the Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy and a 1-2% bump in classic captioning metrics.
Major Entity Identification: A Generalizable Alternative to Coreference Resolution
S Kawshik Manikantan,Shubham Toshniwal,Makarand Tapaswi,Vineet Gandhi
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2024
@inproceedings{bib_Majo_2024, AUTHOR = {S Kawshik Manikantan and Shubham Toshniwal and Makarand Tapaswi and Vineet Gandhi}, TITLE = {Major Entity Identification: A Generalizable Alternative to Coreference Resolution}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}, YEAR = {2024}}
The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative referential task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, MEI fits the classification framework, which enables the use of robust and intuitive classification-based metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.
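A hedged sketch of the classification framing: each mention is scored against the user-specified major entities plus a learned "other" class, which is what makes standard classification metrics applicable. Names and dimensions below are illustrative, not the paper's implementation.

```python
# Hedged sketch of MEI as classification (illustrative names and dimensions):
# each mention is scored against the user-specified major-entity embeddings
# plus a learned "other" class, so accuracy/F1-style metrics apply directly.
import torch
import torch.nn as nn

class MentionEntityScorer(nn.Module):
    def __init__(self, d_model=768):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.other = nn.Parameter(torch.zeros(1, d_model))   # "not a major entity"

    def forward(self, mention_embs, entity_embs):
        # mention_embs: (M, D); entity_embs: (E, D) -> logits of shape (M, E + 1)
        candidates = torch.cat([entity_embs, self.other], dim=0)
        return self.proj(mention_embs) @ candidates.t()

logits = MentionEntityScorer()(torch.randn(4, 768), torch.randn(6, 768))
pred = logits.argmax(dim=-1)   # value 6 here means "other"
```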
Unsupervised Audio-Visual Lecture Segmentation
Darshan Singh S,Anchit Gupta,Jawahar C V,Makarand Tapaswi
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_Unsu_2023, AUTHOR = {Darshan Singh S and Anchit Gupta and Jawahar C V and Makarand Tapaswi}, TITLE = {Unsupervised Audio-Visual Lecture Segmentation}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}, YEAR = {2023}}
Over the last decade, online lecture videos have become increasingly popular and experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing number of online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation that splits lectures into bite-sized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH [44]. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
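To illustrate the segmentation step, the sketch below links each clip to its nearest neighbour under a temporally discounted similarity and merges the links into groups; this follows the spirit of a temporally consistent 1-NN procedure but is not the TW-FINCH implementation.

```python
# Spirit-of-the-idea sketch (not the TW-FINCH implementation): link each clip to
# its nearest neighbour under a temporally discounted cosine similarity, then
# merge the links into groups that act as lecture segments.
import numpy as np

def temporal_1nn_segments(feats):
    """feats: (N, D) clip representations in temporal order -> segment label per clip."""
    N = len(feats)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    sims = (feats @ feats.T) / (norms * norms.T + 1e-8)
    t = np.arange(N)
    sims *= 1.0 / (1.0 + np.abs(t[:, None] - t[None, :]))   # temporal discount
    np.fill_diagonal(sims, -np.inf)
    nearest = sims.argmax(axis=1)                            # 1-NN per clip

    parent = list(range(N))                                  # union-find over 1-NN links
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in enumerate(nearest):
        parent[find(i)] = find(int(j))
    return [find(i) for i in range(N)]

print(temporal_1nn_segments(np.random.rand(12, 32)))
```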
Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad,Makarand Tapaswi,Cees G. M. Snoek
Computer Vision and Pattern Recognition, CVPR, 2023
@inproceedings{bib_Test_2023, AUTHOR = {Piyush Bagad and Makarand Tapaswi and Cees G. M. Snoek}, TITLE = {Test of Time: Instilling Video-Language Models with a Sense of Time}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2023}}
Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require varying degrees of time awareness. We observe encouraging performance gains, especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data- and compute-intensive training from scratch.
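One way to picture such an adaptation recipe is a contrastive objective in which the correctly ordered video-text pair must beat order-reversed negatives; the loss below is an assumed illustration, not the exact VideoCLIP-based post-pretraining objective.

```python
# Assumed illustration of a time-order contrastive objective (not the exact
# VideoCLIP-based recipe): the matched video-text pair must score higher than
# pairs where either the caption's before/after order or the clips are reversed.
import torch
import torch.nn.functional as F

def time_order_loss(video, text, video_rev, text_rev, tau=0.07):
    # all inputs: (B, D) L2-normalized embeddings
    pos = (video * text).sum(-1) / tau            # correct order
    neg_text = (video * text_rev).sum(-1) / tau   # reversed caption
    neg_video = (video_rev * text).sum(-1) / tau  # reversed clips
    logits = torch.stack([pos, neg_text, neg_video], dim=1)
    return F.cross_entropy(logits, torch.zeros(len(pos), dtype=torch.long))

B, D = 8, 512
embs = [F.normalize(torch.randn(B, D), dim=-1) for _ in range(4)]
print(time_order_loss(*embs).item())
```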
How you feelin? Learning Emotions and Mental States in Movie Scenes
Dhruv Srivastava,Aditya Kumar Singh,Makarand Tapaswi
Computer Vision and Pattern Recognition, CVPR, 2023
@inproceedings{bib_How__2023, AUTHOR = {Dhruv Srivastava and Aditya Kumar Singh and Makarand Tapaswi}, TITLE = {How you feelin? Learning Emotions and Mental States in Movie Scenes}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2023}}
Movie story analysis requires understanding characters’ emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions. By leveraging annotations from the MovieGraphs dataset [72], we aim to predict classic emotions (e.g. happy, angry) and other mental states (e.g. honest, helpful). We conduct experiments on the most frequently occurring 10 and 25 labels, and a mapping that clusters 181 labels into 26. Ablation studies and comparisons against adapted state-of-the-art emotion recognition approaches show the effectiveness of EmoTx. Analyzing EmoTx’s self-attention scores reveals that expressive emotions often look at character tokens while other mental states rely on video and dialog cues.
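Since predictions are multi-label, a minimal sketch of the training signal is independent binary decisions per label with binary cross-entropy, as below; the token layout and dimensions are assumptions rather than the EmoTx code.

```python
# Minimal multi-label sketch (assumed token layout and sizes, not the EmoTx code):
# every scene/character token gets K independent binary emotion decisions,
# trained with binary cross-entropy.
import torch
import torch.nn as nn

K = 25                                   # e.g. the top-25 label setting
head = nn.Linear(512, K)
tokens = torch.randn(3, 512)             # 1 scene token + 2 character tokens (illustrative)
targets = torch.randint(0, 2, (3, K)).float()

logits = head(tokens)
loss = nn.BCEWithLogitsLoss()(logits, targets)
probs = torch.sigmoid(logits)            # per-token multi-label probabilities
```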
Grounded Video Situation Recognition
Zeeshan Khan,Jawahar C V,Makarand Tapaswi
Neural Information Processing Systems, NeurIPS, 2022
@inproceedings{bib_Grou_2022, AUTHOR = {Zeeshan Khan and Jawahar C V and Makarand Tapaswi}, TITLE = {Grounded Video Situation Recognition}, BOOKTITLE = {Neural Information Processing Systems}, YEAR = {2022}}
Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) has been framed as a task for structured prediction of multiple events, their relationships and actions, and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, but also faces some challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The second stage sees verb-role queries attend to and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.
Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations
Jaidev Shriram,Makarand Tapaswi,Vinoo A R
International Society for Music Information Retrieval, ISMIR, 2022
@inproceedings{bib_Sonu_2022, AUTHOR = {Jaidev Shriram and Makarand Tapaswi and Vinoo A R}, TITLE = {Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations}, BOOKTITLE = {International Society for Music Information Retrieval}, YEAR = {2022}}
Reading, much like music listening, is an immersive experience that transports readers while taking them on an emotional journey. Listening to complementary music has the potential to amplify the reading experience, especially when the music is stylistically cohesive and emotionally relevant. In this paper, we propose the first fully automatic method to build a dense soundtrack for books, which can play high-quality instrumental music for the entirety of the reading duration. Our work employs a unique text processing and music weaving pipeline that determines the context and emotional composition of scenes in a chapter. This allows our method to identify and play relevant excerpts from the soundtrack of the book’s movie adaptation. By relying on the movie composer’s craftsmanship, our book soundtracks include expert-made motifs and other scene-specific musical characteristics. We validate the design decisions of our approach through a perceptual study. Our readers note that the book soundtrack greatly enhanced their reading experience, due to high immersiveness granted via uninterrupted and style-consistent music, and a heightened emotional state attained via high precision emotion and scene context recognition.
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Shizhe Chen,Pierre-Louis Guhur,Makarand Tapaswi,Cordelia Schmid,Ivan Laptev
Neural Information Processing Systems, NeurIPS, 2022
@inproceedings{bib_Lang_2022, AUTHOR = {Shizhe Chen and Pierre-Louis Guhur and Makarand Tapaswi and Cordelia Schmid and Ivan Laptev}, TITLE = {Language Conditioned Spatial Relation Reasoning for 3D Object Grounding}, BOOKTITLE = {Neural Information Processing Systems}, YEAR = {2022}}
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred to by the text, such as "the left most chair" and "a chair next to the window". In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations. To this end, we design a spatial self-attention layer that accounts for relative distances and orientations between objects in input 3D point clouds. Training such a layer with visual and language inputs enables the model to disambiguate spatial relations and to localize objects referred to by the text. To facilitate the cross-modal learning of relations, we further propose a teacher-student approach where the teacher model is first trained using ground-truth object labels, and then helps to train a student model using point cloud inputs. We perform ablation studies showing the advantages of our approach. We also demonstrate that our model significantly outperforms the state of the art on the challenging Nr3D, Sr3D, and ScanRefer 3D object grounding datasets.
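The key idea of the spatial self-attention layer can be sketched as a pairwise geometric bias added to the attention logits, as below; this single-head illustration with assumed feature sizes is not the paper's exact layer.

```python
# Simplified single-head illustration (assumed sizes, not the paper's exact layer):
# pairwise offsets and distances between object centers are embedded and added
# as a bias to the self-attention logits, letting language-conditioned features
# reason about relations such as "left of" or "next to".
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.pair_bias = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
        self.scale = d_model ** -0.5

    def forward(self, obj_feats, centers):          # (N, D) features, (N, 3) centroids
        q, k, v = self.qkv(obj_feats).chunk(3, dim=-1)
        rel = centers.unsqueeze(1) - centers.unsqueeze(0)              # (N, N, 3)
        dist = rel.norm(dim=-1, keepdim=True)                          # (N, N, 1)
        bias = self.pair_bias(torch.cat([rel, dist], dim=-1)).squeeze(-1)
        attn = torch.softmax(q @ k.t() * self.scale + bias, dim=-1)
        return attn @ v

out = SpatialSelfAttention()(torch.randn(5, 256), torch.randn(5, 3))
```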
Instruction-driven history-aware policies for robotic manipulations
Pierre-Louis Guhur,Shizhe Chen,Ricardo Garcia,Makarand Tapaswi,Ivan Laptev,Cordelia Schmid
Conference on Robot Learning, CORL, 2022
@inproceedings{bib_Inst_2022, AUTHOR = {Pierre-Louis Guhur and Shizhe Chen and Ricardo Garcia and Makarand Tapaswi and Ivan Laptev and Cordelia Schmid}, TITLE = {Instruction-driven history-aware policies for robotic manipulations}, BOOKTITLE = {Conference on Robot Learning}, YEAR = {2022}}
In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging as it requires fine-grained motor control, long-term memory as well as generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that takes into account multiple inputs. In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations while (iii) keeping track of the full history of observations and actions. Such an approach enables learning dependencies between history and instructions and improves manipulation precision using multiple views. We evaluate our method on the challenging RLBench benchmark and on a real-world robot. Notably, our approach scales to 74 diverse RLBench tasks and outperforms the state of the art. We also address instruction-conditioned tasks and demonstrate excellent generalization to previously unseen variations.
Learning Object Manipulation Skills from Video via Approximate Differentiable Physics
Vladimír Petřík,Mohammad Nomaan Qureshi,Josef Sivic,Makarand Tapaswi
International Conference on Intelligent Robots and Systems, IROS, 2022
@inproceedings{bib_Lear_2022, AUTHOR = {Vladimír Petřík and Mohammad Nomaan Qureshi and Josef Sivic and Makarand Tapaswi}, TITLE = {Learning Object Manipulation Skills from Video via Approximate Differentiable Physics}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}, YEAR = {2022}}
We aim to teach robots to perform simple object manipulation tasks by watching a single video demonstration. Towards this goal, we propose an optimization approach that outputs a coarse and temporally evolving 3D scene to mimic the action demonstrated in the input video. Similar to previous work, a differentiable renderer ensures perceptual fidelity between the 3D scene and the 2D video. Our key novelty lies in the inclusion of a differentiable approach to solve a set of Ordinary Differential Equations (ODEs) that allows us to approximately model laws of physics such as gravity, friction, and hand-object or object-object interactions. This not only enables us to dramatically improve the quality of estimated hand and object states, but also produces physically admissible trajectories that can be directly translated to a robot without the need for costly reinforcement learning. We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions such as pull something from right to left or put something in front of something. Our approach improves over previous state-of-the-art by almost 30%, demonstrating superior quality on especially challenging actions involving physical interactions of two objects such as put something onto something. Finally, we showcase the learned skills on a Franka Emika Panda robot.
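The role of the differentiable ODE component can be illustrated with a toy differentiable rollout: friction and gravity shape a trajectory whose error backpropagates to physical parameters. The 1D simulation below is a heavily simplified assumption, not the paper's solver.

```python
# Heavily simplified, assumed toy (1D sliding object), not the paper's solver:
# a differentiable Euler rollout with gravity-scaled friction lets trajectory
# error backpropagate to physical parameters such as the friction coefficient.
import torch

def simulate(x0, v0, mu, steps=30, dt=1.0 / 30, g=9.81):
    x, v, traj = x0, v0, []
    for _ in range(steps):
        v = v - mu * g * torch.tanh(50.0 * v) * dt   # smooth, differentiable friction
        x = x + v * dt
        traj.append(x)
    return torch.stack(traj)

mu = torch.tensor(0.3, requires_grad=True)
traj = simulate(torch.tensor(0.0), torch.tensor(1.0), mu)
loss = ((traj - 0.9 * traj.detach()) ** 2).mean()    # toy target trajectory
loss.backward()
print(mu.grad)                                       # gradient reaches the physics parameter
```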
Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
Shizhe Chen,Pierre-Louis Guhur,Makarand Tapaswi,Cordelia Schmid,Ivan Laptev
European Conference on Computer Vision, ECCV, 2022
@inproceedings{bib_Lear_2022, AUTHOR = {Shizhe Chen and Pierre-Louis Guhur and Makarand Tapaswi and Cordelia Schmid and Ivan Laptev}, TITLE = {Learning from Unlabeled 3D Environments for Vision-and-Language Navigation}, BOOKTITLE = {European Conference on Computer Vision}, YEAR = {2022}}
In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and prevents scalability. In this work, we address the data scarcity issue by proposing to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D. We generate a navigation graph for each building and transfer object predictions from 2D to generate pseudo 3D object labels by cross-view consistency. We then fine-tune a pretrained language model using pseudo object labels as prompts to alleviate the cross-modal gap in instruction generation. Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions. We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of resulting VLN models. On the SPL metric, our approach improves over state of the art by 7.1% and 8.1% on the unseen validation splits of REVERIE and SOON datasets respectively.
Feature Generation for Long-tail Classification
Rahul Vigneswaran,Marc T. Law,Vineeth N. Balasubramanian,Makarand Tapaswi
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2021
@inproceedings{bib_Feat_2021, AUTHOR = {Rahul Vigneswaran and Marc T. Law and Vineeth N. Balasubramanian and Makarand Tapaswi}, TITLE = {Feature Generation for Long-tail Classification}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}, YEAR = {2021}}
The visual world naturally exhibits an imbalance in the number of object or scene instances resulting in a long-tailed distribution. This imbalance poses significant challenges for classification models based on deep learning. Oversampling instances of the tail classes attempts to solve this imbalance. However, the limited visual diversity results in a network with poor representation ability. A simple counter to this is decoupling the representation and classifier networks and using oversampling only to train the classifier. In this paper, instead of repeatedly re-sampling the same image (and thereby features), we explore a direction that attempts to generate meaningful features by estimating the tail category’s distribution. Inspired by ideas from recent work on few-shot learning [53], we create calibrated distributions to sample additional features that are subsequently used to train the classifier. Through several experiments on the CIFAR-100-LT (long-tail) dataset with varying imbalance factors and on mini-ImageNet-LT (long-tail), we show the efficacy of our approach and establish a new state-of-the-art. We also present a qualitative analysis of generated features using t-SNE visualizations and analyze the nearest neighbors used to calibrate the tail class distributions. Our code is available at https://github.com/rahulvigneswaran/TailCalibX.
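The calibrated-distribution idea can be sketched as follows: a tail class borrows mean and variance statistics from its nearest head classes, and synthetic features are sampled from the resulting Gaussian. The choice of k, the widening term, and the diagonal variances below are assumptions, not the released TailCalibX code.

```python
# Hedged sketch of calibrated feature generation (k, the widening term, and
# diagonal variances are assumptions, not the released TailCalibX code): a tail
# class borrows statistics from its nearest head classes and new features are
# sampled from the calibrated Gaussian to train the classifier.
import numpy as np

def calibrate_and_sample(tail_feats, head_means, head_vars, k=2, n_samples=50, alpha=0.1):
    """tail_feats: (n_tail, D); head_means, head_vars: (C_head, D) per-class statistics."""
    tail_mean = tail_feats.mean(axis=0)
    nearest = np.argsort(np.linalg.norm(head_means - tail_mean, axis=1))[:k]
    mean = (tail_mean + head_means[nearest].sum(axis=0)) / (k + 1)
    var = head_vars[nearest].mean(axis=0) + alpha        # borrowed, slightly widened variance
    return np.random.normal(mean, np.sqrt(var), size=(n_samples, mean.shape[0]))

rng = np.random.default_rng(0)
tail = rng.normal(size=(5, 64))
head_means, head_vars = rng.normal(size=(10, 64)), rng.random((10, 64))
new_feats = calibrate_and_sample(tail, head_means, head_vars)   # (50, 64) synthetic features
```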
Airbert: In-domain Pretraining for Vision-and-Language Navigation
Pierre-Louis Guhur,Makarand Tapaswi,Shizhe Chen,Ivan Laptev,Cordelia Schmid
International Conference on Computer Vision, ICCV, 2021
@inproceedings{bib_Airb_2021, AUTHOR = {Pierre-Louis Guhur and Makarand Tapaswi and Shizhe Chen and Ivan Laptev and Cordelia Schmid}, TITLE = {Airbert: In-domain Pretraining for Vision-and-Language Navigation}, BOOKTITLE = {International Conference on Computer Vision}, YEAR = {2021}}
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model that can be adapted to discriminative and generative settings and show that it outperforms state of the art for Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.
Long term spatio-temporal modeling for action detection
Makarand Tapaswi,Vijay Kumar,Ivan Laptev
Computer Vision and Image Understanding, CVIU, 2021
@inproceedings{bib_Long_2021, AUTHOR = {Makarand Tapaswi and Vijay Kumar and Ivan Laptev}, TITLE = {Long term spatio-temporal modeling for action detection}, BOOKTITLE = {Computer Vision and Image Understanding}, YEAR = {2021}}
Modeling person interactions with their surroundings has proven to be effective for recognizing and localizing human actions in videos. While most recent works focus on learning short term interactions, in this work, we consider long-term person interactions and jointly localize actions of multiple actors over an entire video shot. We construct a graph with nodes that correspond to keyframe actor instances and connect them with two edge types. Spatial edges connect actors within a keyframe, and temporal edges connect multiple instances of the same actor over a video shot. We propose a Graph Neural Network that explicitly models spatial and temporal states for each person instance and learns to effectively combine information from both modalities to make predictions at the same time. We conduct experiments on the AVA dataset and show that our graph-based model provides consistent improvements over several video descriptors, achieving state-of-the-art performance without any fine-tuning.
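The graph construction is easy to make concrete: nodes are (keyframe, actor) instances, spatial edges connect co-occurring actors, and temporal edges connect an actor's instances across keyframes. The sketch below builds such edge lists; it is an illustration, not the paper's code.

```python
# Illustrative graph construction (not the paper's code): nodes are (keyframe,
# actor) instances; spatial edges connect different actors in the same keyframe,
# temporal edges connect the same actor across keyframes of a shot.
from itertools import combinations

def build_edges(instances):
    """instances: list of (keyframe_id, actor_id) node tuples."""
    spatial, temporal = [], []
    for i, j in combinations(range(len(instances)), 2):
        (f1, a1), (f2, a2) = instances[i], instances[j]
        if f1 == f2 and a1 != a2:
            spatial.append((i, j))
        elif a1 == a2 and f1 != f2:
            temporal.append((i, j))
    return spatial, temporal

nodes = [(0, "A"), (0, "B"), (1, "A"), (1, "B"), (2, "A")]
print(build_edges(nodes))   # edge lists that can feed any message-passing GNN
```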