IIITH

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Computer Vision and Pattern Recognition, CVPR, 2022

Core Rank: A*, Google Rank: 440

@inproceedings{bib_Thin_2022, AUTHOR = {Chen, Shizhe and Guhur, Pierre-Louis and Tapaswi, Makarand and Schmid, Cordelia and Laptev, Ivan}, TITLE = {Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2022}}

Abstract

Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent not only needs to ground language in visual scenes, but must also explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on-the-fly to enable efficient exploration in a global action space. To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers. The proposed approach, DUET, significantly outperforms state-of-the-art methods on the goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON. It also improves the success rate on the fine-grained VLN benchmark R2R.
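The idea of building a topological map on-the-fly and treating unvisited nodes as a global action space can be illustrated with a minimal sketch. This is not the DUET implementation: the `TopoMap` class, its method names, and the breadth-first planner are all hypothetical, and the graph transformer that scores the candidates is omitted entirely.

```python
from collections import deque


class TopoMap:
    """Toy topological map: nodes are viewpoints, edges are navigable links.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self):
        self.edges = {}       # node id -> set of neighbouring node ids
        self.visited = set()  # nodes the agent has actually stood on

    def observe(self, current, navigable):
        """Record a visit to `current` and the neighbours seen from it."""
        self.visited.add(current)
        self.edges.setdefault(current, set())
        for nxt in navigable:
            self.edges[current].add(nxt)
            self.edges.setdefault(nxt, set()).add(current)

    def global_actions(self):
        """Every observed-but-unvisited node is a candidate global action."""
        return sorted(n for n in self.edges if n not in self.visited)

    def shortest_path(self, start, goal):
        """Breadth-first search over the part of the map built so far."""
        parents = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            for nxt in sorted(self.edges.get(node, ())):
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return None  # goal not reachable in the current map


topo = TopoMap()
topo.observe("A", ["B", "C"])   # agent starts at A and sees B and C
topo.observe("B", ["A", "D"])   # agent moves to B and sees D
candidates = topo.global_actions()        # nodes seen but not yet visited
path_to_c = topo.shortest_path("B", "C")  # backtrack through A to reach C
```

In this toy version, choosing a distant unvisited node such as `C` from `B` is a single global action that expands into a multi-step path, which is the kind of long-range move a purely local action space cannot express directly.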

Learning Object Manipulation Skills from Video via Approximate Differentiable Physics

International Conference on Intelligent Robots and Systems, IROS, 2022

Core Rank: A, Google Rank: 86

@inproceedings{bib_Lear_2022, AUTHOR = {Petřík, Vladimír and Qureshi, Mohammad Nomaan and Sivic, Josef and Tapaswi, Makarand}, TITLE = {Learning Object Manipulation Skills from Video via Approximate Differentiable Physics}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}, YEAR = {2022}}

Abstract

We aim to teach robots to perform simple object manipulation tasks by watching a single video demonstration. Towards this goal, we propose an optimization approach that outputs a coarse and temporally evolving 3D scene to mimic the action demonstrated in the input video. Similar to previous work, a differentiable renderer ensures perceptual fidelity between the 3D scene and the 2D video. Our key novelty lies in the inclusion of a differentiable approach to solve a set of Ordinary Differential Equations (ODEs) that allows us to approximately model laws of physics such as gravity, friction, and hand-object or object-object interactions. This not only enables us to dramatically improve the quality of estimated hand and object states, but also produces physically admissible trajectories that can be directly translated to a robot without the need for costly reinforcement learning. We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions such as "pull something from right to left" or "put something in front of something". Our approach improves over the previous state of the art by almost 30%, demonstrating superior quality on especially challenging actions involving physical interactions of two objects such as "put something onto something". Finally, we showcase the learned skills on a Franka Emika Panda robot.

Instruction-driven history-aware policies for robotic manipulations

Conference on Robot Learning, CORL, 2022

Core Rank: -, Google Rank: 88

@inproceedings{bib_Inst_2022, AUTHOR = {Guhur, Pierre-Louis and Chen, Shizhe and Garcia, Ricardo and Tapaswi, Makarand and Laptev, Ivan and Schmid, Cordelia}, TITLE = {Instruction-driven history-aware policies for robotic manipulations}, BOOKTITLE = {Conference on Robot Learning}, YEAR = {2022}}

Abstract

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging as it requires fine-grained motor control, long-term memory as well as generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that takes into account multiple inputs. In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations while (iii) keeping track of the full history of observations and actions. Such an approach enables learning dependencies between history and instructions and improves manipulation precision using multiple views. We evaluate our method on the challenging RLBench benchmark and on a real-world robot. Notably, our approach scales to 74 diverse RLBench tasks and outperforms the state of the art. We also address instruction-conditioned tasks and demonstrate excellent generalization to previously unseen variations.

Can we Adopt Self-supervised Pretraining for Chest X-Rays?

Machine Learning for Health Workshop, ML4H, 2022

Core Rank: -, Google Rank: -

@inproceedings{bib_Can__2022, AUTHOR = {Verma, Arsh and Tapaswi, Makarand}, TITLE = {Can we Adopt Self-supervised Pretraining for Chest X-Rays?}, BOOKTITLE = {Machine Learning for Health Workshop}, YEAR = {2022}}

Abstract

The chest radiograph (or Chest X-Ray, CXR) is a popular medical imaging modality that is used by radiologists across the world to diagnose heart or lung conditions. Over the last decade, Convolutional Neural Networks (CNNs) have seen success in identifying pathologies in CXR images. Typically, these CNNs are pretrained on the standard ImageNet classification task, but this assumes availability of large-scale annotated datasets. In this work, we analyze the utility of pretraining on unlabeled ImageNet or Chest X-Ray (CXR) datasets using various algorithms and in multiple settings. Some findings of our work include: (i) supervised training with labeled ImageNet learns strong representations that are hard to beat; (ii) self-supervised pretraining on ImageNet (∼1M images) shows performance similar to self-supervised pretraining on a CXR dataset (∼100K images); and (iii) the CNN trained on supervised ImageNet can be trained further with self-supervised CXR images leading to improvements, especially when the downstream dataset is on the order of a few thousand images.

Keywords: Chest X-Ray, Self-Supervised Pretraining

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

Neural Information Processing Systems, NeurIPS, 2022

Core Rank: A*, Google Rank: 337

@inproceedings{bib_Lang_2022, AUTHOR = {Chen, Shizhe and Guhur, Pierre-Louis and Tapaswi, Makarand and Schmid, Cordelia and Laptev, Ivan}, TITLE = {Language Conditioned Spatial Relation Reasoning for 3D Object Grounding}, BOOKTITLE = {Neural Information Processing Systems}, YEAR = {2022}}

Abstract

Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred to by the text, such as "the left most chair" and "a chair next to the window". In this work, we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations. To this end, we design a spatial self-attention layer that accounts for relative distances and orientations between objects in input 3D point clouds. Training such a layer with visual and language inputs makes it possible to disambiguate spatial relations and to localize objects referred to by the text. To facilitate the cross-modal learning of relations, we further propose a teacher-student approach where the teacher model is first trained using ground-truth object labels, and then helps to train a student model using point cloud inputs. We perform ablation studies showing the advantages of our approach. We also demonstrate that our model significantly outperforms the state of the art on the challenging Nr3D, Sr3D and ScanRefer 3D object grounding datasets.
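The relative distances and orientations that the spatial self-attention layer accounts for can be illustrated with a small sketch. Everything below is an assumption made for illustration: the function names, the choice of horizontal and vertical angles as the orientation encoding, and the distance-only attention bias are hypothetical, and the language conditioning described in the abstract is left out.

```python
import math


def pairwise_spatial_features(centers):
    """Return feats[i][j] = (distance, horizontal angle, vertical angle) from object i to j.

    `centers` is a list of (x, y, z) object centers. The angle encoding is an
    assumption for illustration, not necessarily the one used in the paper.
    """
    n = len(centers)
    feats = [[None] * n for _ in range(n)]
    for i, (xi, yi, zi) in enumerate(centers):
        for j, (xj, yj, zj) in enumerate(centers):
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            dist = math.sqrt(dx * dx + dy * dy + dz * dz)
            horiz = math.atan2(dy, dx)                  # direction in the ground plane
            vert = math.atan2(dz, math.hypot(dx, dy))   # elevation above that plane
            feats[i][j] = (dist, horiz, vert)
    return feats


def distance_biased_attention(feats, i, sigma=1.0):
    """Softmax over -distance/sigma for query object i (hypothetical toy bias).

    Nearer objects receive larger weights; a learned layer would instead map
    the full feature tuple, together with the text, to an attention bias.
    """
    logits = [-f[0] / sigma for f in feats[i]]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
feats = pairwise_spatial_features(centers)
weights = distance_biased_attention(feats, 0)
```

Here object 0 attends most strongly to itself, then to the object 1 m away along the x axis, then to the object 2 m away along the y axis.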

Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations

International Society for Music Information Retrieval, ISMIR, 2022

Core Rank: -, Google Rank: 40

@inproceedings{bib_Sonu_2022, AUTHOR = {Shriram, Jaidev and Tapaswi, Makarand and Alluri, Vinoo}, TITLE = {Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations}, BOOKTITLE = {International Society for Music Information Retrieval}, YEAR = {2022}}

Abstract

Reading, much like music listening, is an immersive experience that transports readers while taking them on an emotional journey. Listening to complementary music has the potential to amplify the reading experience, especially when the music is stylistically cohesive and emotionally relevant. In this paper, we propose the first fully automatic method to build a dense soundtrack for books, which can play high-quality instrumental music for the entirety of the reading duration. Our work employs a unique text processing and music weaving pipeline that determines the context and emotional composition of scenes in a chapter. This allows our method to identify and play relevant excerpts from the soundtrack of the book’s movie adaptation. By relying on the movie composer’s craftsmanship, our book soundtracks include expert-made motifs and other scene-specific musical characteristics. We validate the design decisions of our approach through a perceptual study. Our readers note that the book soundtrack greatly enhanced their reading experience, due to high immersiveness granted via uninterrupted and style-consistent music, and a heightened emotional state attained via high precision emotion and scene context recognition.

Grounded Video Situation Recognition

Neural Information Processing Systems, NeurIPS, 2022

Core Rank: A*, Google Rank: 337

@inproceedings{bib_Grou_2022, AUTHOR = {Khan, Zeeshan and Jawahar, C. V. and Tapaswi, Makarand}, TITLE = {Grounded Video Situation Recognition}, BOOKTITLE = {Neural Information Processing Systems}, YEAR = {2022}}

Abstract

Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Video Situation Recognition (VidSitu) was recently framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, and also faces some challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The second stage sees verb-role queries attend and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.

Long term spatio-temporal modeling for action detection

Computer Vision and Image Understanding, CVIU, 2021

Core Rank: -, Google Rank: 48

@article{bib_Long_2021, AUTHOR = {Tapaswi, Makarand and Kumar, Vijay and Laptev, Ivan}, TITLE = {Long term spatio-temporal modeling for action detection}, JOURNAL = {Computer Vision and Image Understanding}, YEAR = {2021}}

Abstract

Modeling person interactions with their surroundings has proven to be effective for recognizing and localizing human actions in videos. While most recent works focus on learning short-term interactions, in this work, we consider long-term person interactions and jointly localize actions of multiple actors over an entire video shot. We construct a graph with nodes that correspond to keyframe actor instances and connect them with two edge types. Spatial edges connect actors within a keyframe, and temporal edges connect multiple instances of the same actor over a video shot. We propose a Graph Neural Network that explicitly models spatial and temporal states for each person instance and learns to effectively combine information from both modalities to make predictions at the same time. We conduct experiments on the AVA dataset and show that our graph-based model provides consistent improvements over several video descriptors, achieving state-of-the-art performance without any fine-tuning.
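The graph construction described above, with one node per keyframe actor instance and two edge types, can be sketched as follows. The function name and the `(keyframe_index, actor_id)` input format are hypothetical, and connecting every pair of instances of the same actor (rather than, say, only consecutive ones) is an assumption.

```python
def build_actor_graph(detections):
    """Build spatial and temporal edge lists over keyframe actor instances.

    `detections` is a list of (keyframe_index, actor_id) tuples, one per actor
    instance; the list position of each tuple serves as its node index.
    Returns (spatial_edges, temporal_edges) as lists of node-index pairs.
    """
    spatial, temporal = [], []
    for a in range(len(detections)):
        for b in range(a + 1, len(detections)):
            frame_a, actor_a = detections[a]
            frame_b, actor_b = detections[b]
            if frame_a == frame_b and actor_a != actor_b:
                # Different actors in the same keyframe: spatial edge.
                spatial.append((a, b))
            elif actor_a == actor_b and frame_a != frame_b:
                # Same actor in different keyframes: temporal edge.
                temporal.append((a, b))
    return spatial, temporal


# Two actors tracked over three keyframes; "p2" leaves before the last one.
detections = [(0, "p1"), (0, "p2"), (1, "p1"), (1, "p2"), (2, "p1")]
spatial_edges, temporal_edges = build_actor_graph(detections)
```

A Graph Neural Network would then pass messages along the two edge lists separately, which is what allows spatial and temporal states to be modeled explicitly per person instance.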

Airbert: In-domain Pretraining for Vision-and-Language Navigation

International Conference on Computer Vision, ICCV, 2021

Core Rank: A*, Google Rank: 291

@inproceedings{bib_Airb_2021, AUTHOR = {Guhur, Pierre-Louis and Tapaswi, Makarand and Chen, Shizhe and Laptev, Ivan and Schmid, Cordelia}, TITLE = {Airbert: In-domain Pretraining for Vision-and-Language Navigation}, BOOKTITLE = {International Conference on Computer Vision}, YEAR = {2021}}

Abstract

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization, however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model that can be adapted to discriminative and generative settings and show that it outperforms state of the art for Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.

Feature Generation for Long-tail Classification

Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2021

Core Rank: -, Google Rank: -

@inproceedings{bib_Feat_2021, AUTHOR = {Vigneswaran, Rahul and Law, Marc T. and Balasubramanian, Vineeth N. and Tapaswi, Makarand}, TITLE = {Feature Generation for Long-tail Classification}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}, YEAR = {2021}}

Abstract

The visual world naturally exhibits an imbalance in the number of object or scene instances resulting in a long-tailed distribution. This imbalance poses significant challenges for classification models based on deep learning. Oversampling instances of the tail classes attempts to solve this imbalance. However, the limited visual diversity results in a network with poor representation ability. A simple counter to this is decoupling the representation and classifier networks and using oversampling only to train the classifier. In this paper, instead of repeatedly re-sampling the same image (and thereby features), we explore a direction that attempts to generate meaningful features by estimating the tail category’s distribution. Inspired by ideas from recent work on few-shot learning [53], we create calibrated distributions to sample additional features that are subsequently used to train the classifier. Through several experiments on the CIFAR-100-LT (long-tail) dataset with varying imbalance factors and on mini-ImageNet-LT (long-tail), we show the efficacy of our approach and establish a new state of the art. We also present a qualitative analysis of generated features using t-SNE visualizations and analyze the nearest neighbors used to calibrate the tail class distributions. Our code is available at https://github.com/rahulvigneswaran/TailCalibX.
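The idea of calibrating a tail class distribution with its nearest neighbors and then sampling additional features can be sketched roughly as below. This is not the TailCalibX code: the function name, the per-dimension (diagonal) Gaussian, the plain averaging of means and variances, and the toy class statistics are all assumptions chosen to keep the example short.

```python
import math
import random


def calibrate_and_sample(tail_feats, head_stats, k=2, n_samples=5, seed=0):
    """Sample extra features for one tail class from a calibrated Gaussian.

    tail_feats: list of feature vectors belonging to the tail class.
    head_stats: dict mapping class name -> (mean vector, variance vector).
    Hypothetical simplification: diagonal covariance and unweighted averages.
    """
    dim = len(tail_feats[0])
    tail_mean = [sum(f[d] for f in tail_feats) / len(tail_feats) for d in range(dim)]

    def dist_to_tail(mean):
        return math.sqrt(sum((m - t) ** 2 for m, t in zip(mean, tail_mean)))

    # The k classes whose means lie closest to the tail class mean.
    nearest = sorted(head_stats, key=lambda c: dist_to_tail(head_stats[c][0]))[:k]

    # Calibrated mean: average of the tail mean and the neighbours' means.
    means = [tail_mean] + [head_stats[c][0] for c in nearest]
    calib_mean = [sum(m[d] for m in means) / len(means) for d in range(dim)]
    # Calibrated variance: average of the neighbours' variances.
    calib_var = [sum(head_stats[c][1][d] for c in nearest) / len(nearest)
                 for d in range(dim)]

    rng = random.Random(seed)
    samples = [[rng.gauss(calib_mean[d], math.sqrt(calib_var[d])) for d in range(dim)]
               for _ in range(n_samples)]
    return nearest, calib_mean, samples


tail_feats = [[1.0, 1.0], [3.0, 1.0]]  # only two examples of the tail class
head_stats = {
    "cat": ([2.0, 2.0], [1.0, 1.0]),
    "dog": ([4.0, 1.0], [0.5, 0.5]),
    "car": ([10.0, 10.0], [2.0, 2.0]),
}
nearest, calib_mean, samples = calibrate_and_sample(tail_feats, head_stats)
```

The sampled vectors would then be added to the tail class's real features when training the classifier, in place of repeatedly oversampling the same few images.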