IIITH

Lupus Nephritis Subtype Classification with only Slide Level labels

Medical Imaging with Deep Learning, MIDL, 2024

Google Rank :41

@inproceedings{bib_Lupu_2024, AUTHOR = {Sharma, Amit and Chauhan, Ekansh and Uppin, Megha S and Rajasekhar, Liza and V, Jawahar C and Krishnanunni, Vinod Palakkad}, TITLE = {Lupus Nephritis Subtype Classification with only Slide Level labels}, BOOKTITLE = {Medical Imaging with Deep Learning}, YEAR = {2024}}

Abstract

Lupus Nephritis (LN) classification has historically relied on labor-intensive, meticulous glomerular-level labeling of renal structures in whole slide images (WSIs). This approach is tedious and resource-intensive, which limits its scalability and practicality in clinical settings. In response to this challenge, our work introduces a novel methodology that uses only slide-level labels, eliminating the need for granular glomerular-level labeling. A comprehensive multi-stained lupus nephritis digital histopathology WSI dataset was created from the Indian population, which is the largest of its kind. LupusNet, a deep learning model based on multiple instance learning (MIL), was developed to classify LN subtypes. The results underscore its effectiveness, achieving an AUC score of 91.0%, an F1 score of 77.3%, and an accuracy of 81.1% on our dataset in distinguishing membranous and diffused classes of LN.

IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic

International Conference on Robotics and Automation, ICRA, 2024

Core Rank : A* Google Rank :122

@inproceedings{bib_IDD-_2024, AUTHOR = {Parikh, Chirag and Saluja, Rohit and V, Jawahar C and Sarvadevabhatla, Ravi Kiran}, TITLE = {IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic}, BOOKTITLE = {International Conference on Robotics and Automation}, YEAR = {2024}}

Abstract

Intelligent vehicle systems require a deep understanding of the interplay between road conditions, surrounding entities, and the ego vehicle's driving behavior for safe and efficient navigation. This is particularly critical in developing countries where traffic situations are often dense and unstructured with heterogeneous road occupants. Existing datasets, predominantly geared towards structured and sparse traffic scenarios, fall short of capturing the complexity of driving in such environments. To fill this gap, we present IDD-X, a large-scale dual-view driving video dataset. With 697K bounding boxes, 9K important object tracks, and 1-12 objects per video, IDD-X offers comprehensive ego-relative annotations for multiple important road objects covering 10 categories and 19 explanation label categories. The dataset also incorporates rearview information to provide a more complete representation of the driving environment. We also introduce custom-designed deep networks aimed at multiple important object localization and per-object explanation prediction. Overall, our dataset and introduced prediction models form the foundation for studying how road conditions and surrounding entities affect driving behavior in complex traffic situations.

System and method for automatically generating a sign language video with an input speech using a machine learning model

United States Patent, US patent, 2023

@inproceedings{bib_Syst_2023, AUTHOR = {V, Jawahar C and Kapoor, Parul and Hegde, Sindhu Balachandra and Mukhopadhyay, Rudrabha and Namboodiri, Vinay}, TITLE = {System and method for automatically generating a sign language video with an input speech using a machine learning model}, BOOKTITLE = {United States Patent}, YEAR = {2023}}

Abstract

Embodiments herein provide a system and method for automatically generating a sign language video from an input speech using a machine learning model. The method includes (i) extracting a plurality of spectrograms of an input speech by (a) encoding, using an encoder, a time domain series of the input speech to a frequency domain series, and (b) decoding, using a decoder, a plurality of tokens for time steps of the frequency domain series, (ii) generating a plurality of pose sequences for a current time step of the plurality of spectrograms using a first machine learning model, and (iii) automatically generating, using a discriminator of a second machine learning model, a sign language video for the input speech using the plurality of pose sequences and the plurality of spectrograms when the plurality of pose sequences are matched with the corresponding plurality of spectrograms that are extracted.

Towards Efficient Semantic Segmentation via Meta Pruning

International Conference on Computer vision and Image Processing, CVIP, 2023

Core Rank : - Google Rank :14

@inproceedings{bib_Towa_2023, AUTHOR = {Mishra, Ashutosh and Rai, Shyam Nandan and Varma, Girish and V, Jawahar C}, TITLE = {Towards Efficient Semantic Segmentation via Meta Pruning}, BOOKTITLE = {International Conference on Computer vision and Image Processing}, YEAR = {2023}}

Abstract

Semantic segmentation provides a pixel-level understanding of an image, which is essential for various scene-understanding vision tasks. However, semantic segmentation models demand significant computational resources during training and inference. These requirements pose a challenge in resource-constrained scenarios. To address this issue, we present a compression algorithm based on differentiable meta-pruning through hypernetworks: MPHyp. MPHyp uses hypernetworks that take latent vectors as input and output weight matrices for the segmentation model. L1 sparsification with a proximal gradient optimizer updates the latent vectors and introduces sparsity, leading to automatic model pruning. The proposed method offers controllable compression during training and significantly reduces the training time. We compare our methodology with a popular pruning approach and demonstrate its efficacy by reducing the number of parameters and floating point operations while maintaining the mean Intersection over Union (mIoU) metric. We conduct experiments on two widely accepted semantic segmentation architectures: UNet and ERFNet. Our experiments and ablation study demonstrate the effectiveness of our proposed methodology by achieving efficient and reasonable segmentation results.
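
As a rough illustration of the proximal-gradient L1 step mentioned in this abstract, the sketch below applies the standard soft-thresholding operator to a toy latent vector. It is not the MPHyp implementation: the function names, the toy values, and the learning rate and L1 weight are all assumptions chosen for the example, and the hypernetwork itself is omitted.

```python
def soft_threshold(values, threshold):
    """Element-wise proximal operator of threshold * ||v||_1."""
    result = []
    for v in values:
        if v > threshold:
            result.append(v - threshold)
        elif v < -threshold:
            result.append(v + threshold)
        else:
            # Entries with small magnitude are set exactly to zero.
            result.append(0.0)
    return result


def proximal_gradient_step(latent, grad, lr, l1_weight):
    """Gradient step on the smooth loss, followed by L1 shrinkage."""
    moved = [z - lr * g for z, g in zip(latent, grad)]
    return soft_threshold(moved, lr * l1_weight)


# Toy latent vector and gradient (hypothetical values).
latent = [0.8, -0.02, 0.3, -0.6]
grad = [0.1, 0.1, -0.1, 0.0]

updated = proximal_gradient_step(latent, grad, lr=0.5, l1_weight=0.2)

# Indices whose latent entry became exactly zero; in a pruning setting
# these would mark the units that can be removed.
pruned = [i for i, z in enumerate(updated) if z == 0.0]
print(updated, pruned)
```

Here only the second entry is shrunk to zero, which is how an L1 proximal step can produce exact sparsity rather than merely small values.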

System and method for detecting object in an adaptive environment using a machine learning model

United States Patent, US patent, 2023

@inproceedings{bib_Syst_2023, AUTHOR = {Saluja, Rohit and Arora, Chetan and Balasubramanian, Vineeth N and Khindkar, Vaishnavi and V, Jawahar C}, TITLE = {System and method for detecting object in an adaptive environment using a machine learning model}, BOOKTITLE = {United States Patent}, YEAR = {2023}}

Abstract

A method for detecting an object in an image in a target environment that is adapted to a source environment using a machine learning model is provided. The method includes (i) extracting features from a source image associated with the source environment and a target image associated with the target environment, (ii) generating a feature map based on the features, (iii) generating a pixel-wise probability output map, (iv) determining a first environment invariant feature map by combining the feature map with the pixel-wise probability output map, (v) determining a second environment invariant feature map by combining the first environment invariant feature map and the features, (vi) generating environment invariant feature maps at different instances, (vii) extracting environment invariant features based on the environment invariant feature maps, and (viii) detecting the object in the image in the target environment that is adapted to the source environment by training the machine learning model using the environment invariant features.

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Technical Report, arXiv, 2023

Core Rank : - Google Rank :-

@inproceedings{bib_Ego-_2023, AUTHOR = {Grauman, Kristen and Boote, Bikram and Byrne, Eugene and Chavis, Zach and Chen, Joya and Cheng, Feng and V, Jawahar C and Westbury, Andrew and Torresani, Lorenzo and Kitani, Kris and Malik, Jitendra and Afouras, Triantafyllos and Ashutosh, Kumar and Baiyya, Vijay and Bansal, Siddhant}, TITLE = {Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives}, BOOKTITLE = {Technical Report}, YEAR = {2023}}

Abstract

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel “expert commentary” done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Technical Report, arXiv, 2023

Core Rank : - Google Rank :-

@inproceedings{bib_Ego-_2023, AUTHOR = {Grauman, Kristen and Bansal, Siddhant and Newcombe, Richard and Rehg, James M. and Wray, Michael and Shi, Jianbo and Kuo, Robert and Westbury, Andrew and Torresani, Lorenzo and Kitani, Kris and Malik, Jitendra and Afouras, Triantafyllos and V, Jawahar C and Ashutosh, Kumar and Baiyya, Vijay}, TITLE = {Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives}, BOOKTITLE = {Technical Report}, YEAR = {2023}}

Abstract

United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos

Technical Report, arXiv, 2023

Core Rank : - Google Rank :-

@inproceedings{bib_Unit_2023, AUTHOR = {Bansal, Siddhant and Arora, Chetan and V, Jawahar C}, TITLE = {United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos}, BOOKTITLE = {Technical Report}, YEAR = {2023}}

Abstract

Given multiple videos of the same task, procedure learning addresses identifying the key-steps and determining their order to perform the task. For this purpose, existing approaches use the signal generated from a pair of videos. This makes key-steps discovery challenging as the algorithms lack inter-videos perspective. Instead, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL consists of the novel UnityGraph that represents all the videos of a task as a graph to obtain both intra-video and inter-videos context. Further, to obtain similar embeddings for the same key-steps, the embeddings of UnityGraph are updated in an unsupervised manner using the Node2Vec algorithm. Finally, to identify the key-steps, we cluster the embeddings using KMeans. We test GPL on benchmark ProceL, CrossTask, and EgoProceL datasets and achieve an average improvement of 2% on third-person datasets and 3.6% on EgoProceL over the state-of-the-art.
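
The final clustering stage described in this abstract can be illustrated with a small, self-contained k-means sketch. This is only an assumption-laden toy: the 2-D points stand in for Node2Vec embeddings, the helper names (`sq_dist`, `farthest_first`, `kmeans`) are hypothetical, the farthest-first initialization is a choice made here for determinism, and none of it is taken from the GPL code.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic init: start at points[0], then add the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=10):
    """Plain Lloyd iterations: assign to nearest centroid, then re-average."""
    centroids = farthest_first(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        for i, p in enumerate(points):
            assignments[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assignments, centroids


# Toy "embeddings": two well-separated groups standing in for two key-steps.
embeddings = [
    (0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
    (5.0, 5.1), (5.1, 5.0), (4.9, 5.0),
]
assignments, centroids = kmeans(embeddings, k=2)
print(assignments)
```

In this toy setting, frames whose embeddings fall in the same cluster would be treated as belonging to the same key-step.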

CueCAn: Cue-driven Contextual Attention for Identifying Missing Traffic Signs on Unconstrained Roads

International Conference on Robotics and Automation, ICRA, 2023

Core Rank : A* Google Rank :122

@inproceedings{bib_CueC_2023, AUTHOR = {Gupta, Varun and Subramanian, Anbumani and V, Jawahar C and Saluja, Rohit}, TITLE = {CueCAn: Cue-driven Contextual Attention for Identifying Missing Traffic Signs on Unconstrained Roads}, BOOKTITLE = {International Conference on Robotics and Automation}, YEAR = {2023}}

Abstract

Unconstrained Asian roads often involve poor infrastructure, affecting overall road safety. Missing traffic signs are a regular part of such roads. Missing or non-existing object detection has been studied for locating missing curbs and estimating reasonable regions for pedestrians on road scene images. Such methods involve analyzing task-specific single object cues. In this paper, we present the first and most challenging video dataset for missing objects, with multiple types of traffic signs for which the cues are visible without the signs in the scenes. We refer to it as the Missing Traffic Signs Video Dataset (MTSVD). MTSVD is challenging compared to previous works in two respects: (i) the traffic signs are generally not present in the vicinity of their cues, and (ii) the traffic signs’ cues are diverse and unique. Also, MTSVD is the first publicly available missing object dataset. To train the models for identifying missing signs, we complement our dataset with 10K traffic sign tracks, with 40% of the traffic signs having cues visible in the scenes. For identifying missing signs, we propose the Cue-driven Contextual Attention units (CueCAn), which we incorporate in our model’s encoder. We first train the encoder to classify the presence of traffic sign cues and then train the entire segmentation model end-to-end to localize missing traffic signs. Quantitative and qualitative analysis shows that CueCAn significantly improves the performance of base models.

Understanding Video Scenes through Text Insights from Text-based Video Question Answering

International Conference on Computer Vision Workshops, ICCV-W, 2023

Core Rank : - Google Rank :80

@inproceedings{bib_Unde_2023, AUTHOR = {Jahagirdar, Soumya Shamarao and Mathew, Minesh and Karatzas, Dimosthenis and V, Jawahar C}, TITLE = {Understanding Video Scenes through Text Insights from Text-based Video Question Answering}, BOOKTITLE = {International Conference on Computer Vision Workshops}, YEAR = {2023}}

Abstract

Researchers have extensively studied the field of vision and language, discovering that both visual and textual content is crucial for understanding scenes effectively. Particularly, comprehending text in videos holds great significance, requiring both scene text understanding and temporal reasoning. This paper focuses on exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content. The NewsVideoQA dataset contains question-answer pairs related to the text in news videos, while M4-ViteVQA comprises question-answer pairs from diverse categories like vlogging, traveling, and shopping. We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions. Additionally, the study includes experimentation with BERT-QA, a text-only model, which demonstrates comparable performance to the original methods on both datasets, indicating the shortcomings in the formulation of these datasets. Furthermore, we also look into the domain adaptation aspect by examining the effectiveness of training on M4-ViteVQA and evaluating on NewsVideoQA and vice-versa, thereby shedding light on the challenges and potential benefits of out-of-domain training.