MAdVerse: A Hierarchical Dataset of Multi-Lingual Ads from Diverse Sources and Categories
Keralapura Nagaraju Amruth Sagar, Rishabh Srivastava, Rakshitha R T, Venkata Kesav Venna, Ravi Kiran Sarvadevabhatla
IEEE Workshop on Applications of Computer Vision, IEEE WACV, 2024
@inproceedings{bib_MAdV_2024, AUTHOR = {Keralapura Nagaraju Amruth Sagar, Rishabh Srivastava, Rakshitha R T, Venkata Kesav Venna, Ravi Kiran Sarvadevabhatla}, TITLE = {MAdVerse: A Hierarchical Dataset of Multi-Lingual Ads from Diverse Sources and Categories}, BOOKTITLE = {IEEE Workshop on Applications of Computer Vision}, YEAR = {2024}}
The convergence of computer vision and advertising has sparked substantial interest lately. Existing advertisement datasets are either subsets of existing datasets with specialized annotations or feature diverse annotations without a cohesive taxonomy among ad images. Notably, no datasets encompass diverse advertisement styles or semantic grouping at various levels of granularity. Our work addresses this gap by introducing MAdVerse, an extensive, multilingual compilation of more than 50,000 ads from the web, social media websites, and e-newspapers. Advertisements are hierarchically grouped with uniform granularity into 11 categories, 51 sub-categories, and 524 fine-grained brands at the leaf level, each featuring ads in various languages. We provide comprehensive baseline classification results for prediction tasks within the realm of advertising analysis. These tasks include hierarchical ad classification, source classification, multilingual classification, and inducing hierarchy in existing ad datasets.
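To make the hierarchical classification task concrete, below is a minimal sketch of a three-level head matching the dataset's 11/51/524 taxonomy. This is an illustrative PyTorch baseline, not the authors' released model; the feature dimension and the parent-logit conditioning are assumptions.

```python
# A minimal sketch (not the authors' released model) of a three-level
# hierarchical head matching MAdVerse's 11/51/524 taxonomy. The feature
# dimension and the parent-logit conditioning are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalAdHead(nn.Module):
    def __init__(self, feat_dim=2048, n_cat=11, n_subcat=51, n_brand=524):
        super().__init__()
        self.cat_head = nn.Linear(feat_dim, n_cat)
        # Each deeper level also sees its parent level's logits as context.
        self.subcat_head = nn.Linear(feat_dim + n_cat, n_subcat)
        self.brand_head = nn.Linear(feat_dim + n_subcat, n_brand)

    def forward(self, feats):  # feats: (batch, feat_dim) from any image backbone
        cat = self.cat_head(feats)
        subcat = self.subcat_head(torch.cat([feats, cat], dim=1))
        brand = self.brand_head(torch.cat([feats, subcat], dim=1))
        return cat, subcat, brand

# Training would sum cross-entropy losses at all three levels, e.g.
# loss = ce(cat, y_cat) + ce(subcat, y_subcat) + ce(brand, y_brand)
```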
IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic
Chirag Parikh, Rohit Saluja, Jawahar C V, Ravi Kiran Sarvadevabhatla
International Conference on Robotics and Automation, ICRA, 2024
@inproceedings{bib_IDD-_2024, AUTHOR = {Chirag Parikh, Rohit Saluja, Jawahar C V, Ravi Kiran Sarvadevabhatla}, TITLE = {IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic}, BOOKTITLE = {International Conference on Robotics and Automation}, YEAR = {2024}}
Intelligent vehicle systems require a deep understanding of the interplay between road conditions, surrounding entities, and the ego vehicle's driving behavior for safe and efficient navigation. This is particularly critical in developing countries where traffic situations are often dense and unstructured with heterogeneous road occupants. Existing datasets, predominantly geared towards structured and sparse traffic scenarios, fall short of capturing the complexity of driving in such environments. To fill this gap, we present IDD-X, a large-scale dual-view driving video dataset. With 697K bounding boxes, 9K important object tracks, and 1-12 objects per video, IDD-X offers comprehensive ego-relative annotations for multiple important road objects covering 10 categories and 19 explanation label categories. The dataset also incorporates rearview information to provide a more complete representation of the driving environment. We also introduce custom-designed deep networks aimed at multiple important object localization and per-object explanation prediction. Overall, our dataset and introduced prediction models form the foundation for studying how road conditions and surrounding entities affect driving behavior in complex traffic situations.
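The per-object explanation task can be illustrated with a small sketch: pool features along an important object's track and classify one of the 19 explanation labels. The module below is a hypothetical baseline, not the paper's network.

```python
# A hypothetical per-object explanation head (assumed dimensions, not the
# paper's network): pool features along an important object's track with a
# GRU and classify one of the 19 explanation labels.
import torch.nn as nn

class TrackExplainer(nn.Module):
    def __init__(self, feat_dim=256, n_explanations=19):
        super().__init__()
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.cls = nn.Linear(feat_dim, n_explanations)

    def forward(self, track_feats):        # (batch, T, feat_dim), one row per track
        _, h = self.temporal(track_feats)  # final hidden state summarizes the track
        return self.cls(h.squeeze(0))      # explanation logits per object
```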
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing
Pranav Gupta, Rishubh Singh, Pradeep Shenoy, Ravi Kiran Sarvadevabhatla
European Conference on Computer Vision, ECCV, 2024
@inproceedings{bib_OLAF_2024, AUTHOR = {Pranav Gupta, Rishubh Singh, Pradeep Shenoy, Ravi Kiran Sarvadevabhatla}, TITLE = {OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing}, BOOKTITLE = {European Conference on Computer Vision}, YEAR = {2024}}
Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot Action Generation
Kalakonda Sai Shashank, Shubh Maheshwari, Santosh Ravi Kiran
International Conference on Multimedia and Expo, ICME, 2023
@inproceedings{bib_Acti_2023, AUTHOR = {Kalakonda Sai Shashank, Shubh Maheshwari, Santosh Ravi Kiran}, TITLE = {Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot Action Generation}, BOOKTITLE = {International Conference on Multimedia and Expo}, YEAR = {2023}}
We introduce Action-GPT, a plug-and-play framework for incorporating Large Language Models (LLMs) into text-based action generation models. Action phrases in current motion capture datasets contain minimal and to-the-point information. By carefully crafting prompts for LLMs, we generate richer, fine-grained descriptions of the action. We show that utilizing these detailed descriptions instead of the original action phrases leads to better alignment of text and motion spaces. Our experiments show qualitative and quantitative improvement in the quality of synthesized motions produced by recent text-to-motion models. Code, pretrained models and sample videos will be made available.
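The core prompt-enrichment step can be sketched in a few lines. The prompt wording below and the `llm` callable are assumptions for illustration, not the paper's exact prompt.

```python
# A minimal sketch of the prompt-enrichment step. The prompt wording and the
# `llm` callable are assumptions for illustration, not the paper's exact prompt.
def enrich_action_phrase(phrase: str, llm) -> str:
    prompt = (
        f"Describe in detail how a person performs the action '{phrase}', "
        "step by step, focusing on the movement of individual body parts."
    )
    return llm(prompt)  # `llm` is any text-completion function you supply

# The enriched description, rather than the raw phrase (e.g. "jump"), is then
# fed to the text encoder of the text-to-motion model.
```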
SeamFormer: High Precision Text Line Segmentation for Handwritten Documents
Niharika, Rahul Krishna, Ravi Kiran Sarvadevabhatla
International Conference on Document Analysis and Recognition, ICDAR, 2023
@inproceedings{bib_Seam_2023, AUTHOR = {Niharika, Rahul Krishna, Ravi Kiran Sarvadevabhatla}, TITLE = {SeamFormer: High Precision Text Line Segmentation for Handwritten Documents}, BOOKTITLE = {International Conference on Document Analysis and Recognition}, YEAR = {2023}}
Historical manuscripts often contain dense, unstructured text lines. The large diversity in sizes, scripts and appearance makes precise text line segmentation extremely challenging. Existing line segmentation approaches often associate diacritic elements with the wrong text lines and address the above-mentioned challenges inadequately. To tackle these issues, we introduce SeamFormer, a novel approach for high precision text line segmentation in handwritten manuscripts. In the first stage of our approach, a multi-task Transformer deep network outputs coarse line identifiers, which we term ‘scribbles’, and the binarized manuscript image. In the second stage, a scribble-conditioned seam generation procedure utilizes outputs from the first stage and feature maps derived from the manuscript image to generate tight-fitting line segmentation polygons. In the process, we incorporate a novel diacritic feature map which enables improved diacritic and text line associations. Via experiments and evaluations on new and existing challenging palm leaf manuscript datasets, we show that SeamFormer outperforms competing approaches and generates precise text line segmentations.
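The seam idea underlying the second stage can be illustrated with the classic dynamic-programming primitive: a minimum-cost left-to-right seam over an energy map. The sketch below is the textbook formulation under an assumed energy map; the paper's procedure additionally conditions on scribbles and a diacritic feature map.

```python
# The textbook minimum-cost seam via dynamic programming, under an assumed
# 2-D energy map; the paper's procedure additionally conditions on scribbles
# and a diacritic feature map.
import numpy as np

def min_horizontal_seam(energy: np.ndarray) -> np.ndarray:
    """Row index per column of the cheapest left-to-right seam (step <= 1 row)."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    back = np.zeros((h, w), dtype=int)
    for x in range(1, w):
        for y in range(h):
            lo, hi = max(0, y - 1), min(h, y + 2)
            prev = lo + int(np.argmin(cost[lo:hi, x - 1]))
            back[y, x] = prev
            cost[y, x] += cost[prev, x - 1]
    seam = np.zeros(w, dtype=int)
    seam[-1] = int(np.argmin(cost[:, -1]))
    for x in range(w - 1, 0, -1):
        seam[x - 1] = back[seam[x], x]
    return seam  # a seam between two scribbles separates adjacent text lines
```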
FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing
Rishubh Singh, Pranav Gupta, Pradeep Shenoy, Santosh Ravi Kiran
Computer Vision and Pattern Recognition, CVPR, 2022
@inproceedings{bib_FLOA_2022, AUTHOR = {Rishubh Singh, Pranav Gupta, Pradeep Shenoy, Santosh Ravi Kiran}, TITLE = {FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2022}}
Multi-object multi-part scene parsing is a challenging task which requires detecting multiple object classes in a scene and segmenting the semantic parts within each object. In this paper, we propose FLOAT, a factorized label space framework for scalable multi-object multi-part parsing. Our framework involves independent dense prediction of object category and part attributes which increases scalability and reduces task complexity compared to the monolithic label space counterpart. In addition, we propose an inference-time ‘zoom’ refinement technique which significantly improves segmentation quality, especially for smaller objects/parts. Compared to the state of the art, FLOAT obtains an absolute improvement of 2.0% for mean IOU (mIOU) and 4.8% for segmentation quality IOU (sqIOU) on the Pascal-Part-58 dataset. For the larger Pascal-Part-108 dataset, the improvements are 2.1% for mIOU and 3.9% for sqIOU.
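The factorization can be sketched as two parallel dense heads, so the label count grows additively rather than multiplicatively. Module and dimension names below are illustrative assumptions, not the released implementation.

```python
# A minimal sketch (assumed names, not the released implementation) of the
# factorized label space: separate dense heads for object category and part
# attribute, so label count grows additively instead of multiplicatively.
import torch.nn as nn

class FactorizedParseHead(nn.Module):
    def __init__(self, feat_ch=256, n_obj=20, n_part=10):
        super().__init__()
        self.obj_head = nn.Conv2d(feat_ch, n_obj, kernel_size=1)   # e.g. "cow", "bus"
        self.part_head = nn.Conv2d(feat_ch, n_part, kernel_size=1) # e.g. "head", "leg"

    def forward(self, feats):  # feats: (batch, feat_ch, H, W)
        # The final object-part label at each pixel is the pair
        # (argmax object, argmax part).
        return self.obj_head(feats), self.part_head(feats)
```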
DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games
Nikhil Bansal, Kartik Gupta, Kiruthika K, Pentapati Sivani, Santosh Ravi Kiran
ACM international conference on Multimedia, ACMMM, 2022
@inproceedings{bib_Draw_2022, AUTHOR = {Nikhil Bansal, Kartik Gupta, Kiruthika K, Pentapati Sivani, Santosh Ravi Kiran}, TITLE = {DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games}, BOOKTITLE = {ACM international conference on Multimedia}, YEAR = {2022}}
Pictionary, the popular sketch-based guessing game, provides an opportunity to analyze shared-goal cooperative game play in restricted communication settings. However, some players occasionally draw atypical sketch content. While such content is occasionally relevant in the game context, it sometimes represents a rule violation and …
A Fine-Grained Vehicle Detection (FGVD) Dataset for Unconstrained Roads
Prafful Kumar Khoba, Chirag Parikh, Rohit Saluja, Santosh Ravi Kiran, Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2022
@inproceedings{bib_A_Fi_2022, AUTHOR = {Prafful Kumar Khoba, Chirag Parikh, Rohit Saluja, Santosh Ravi Kiran, Jawahar C V}, TITLE = {A Fine-Grained Vehicle Detection (FGVD) Dataset for Unconstrained Roads}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}, YEAR = {2022}}
Previous fine-grained datasets mainly focus on classification and are often captured in a controlled setup, with the camera focusing on the objects. We introduce the first Fine-Grained Vehicle Detection (FGVD) dataset in the wild, captured from a moving camera mounted on a car. It contains 5502 scene images with 210 unique fine-grained labels of multiple vehicle types organized in a three-level hierarchy. While previous classification datasets also include makes for different kinds of cars, the FGVD dataset introduces new class labels for categorizing two-wheelers, autorickshaws, and trucks. The FGVD dataset is challenging as it has vehicles in complex traffic scenarios with intra-class and inter-class variations in types, scale, pose, occlusion, and lighting conditions. Current object detectors like YOLOv5 and Faster R-CNN perform poorly on our dataset due to a lack of hierarchical modeling. Along with providing baseline results for existing object detectors on the FGVD dataset, we also present the results of a combination of an existing detector and the recent Hierarchical Residual Network (HRN) classifier for the FGVD task. Finally, we show that FGVD vehicle images are the most challenging to classify among the fine-grained datasets.
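The detector-plus-hierarchical-classifier baseline can be sketched as a simple two-stage loop. `detector` and `hrn_classifier` below are assumed callables standing in for, e.g., YOLOv5 and an HRN classifier.

```python
# A minimal sketch of the two-stage baseline: a generic detector proposes
# vehicle boxes and a hierarchical classifier refines each crop. `detector`
# and `hrn_classifier` are assumed callables (e.g. YOLOv5 and an HRN).
def fine_grained_detect(image, detector, hrn_classifier):
    results = []
    for box in detector(image):                    # coarse vehicle boxes
        x0, y0, x1, y1 = box
        crop = image[y0:y1, x0:x1]                 # assumes an HxWxC array
        vtype, make, model = hrn_classifier(crop)  # three-level hierarchy
        results.append((box, vtype, make, model))
    return results
```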
UAV-based Visual Remote Sensing for Automated Building Inspection
Kushagra Srivastava, Aditya Kumar Jha, Mohhit Kumar Jha, Jaskirat Singh, Santosh Ravi Kiran, Pradeep Kumar Ramancharla, Harikumar K, K Madhava Krishna
European Conference on Computer Vision Workshops, ECCV-W, 2022
@inproceedings{bib_UAV-_2022, AUTHOR = {Kushagra Srivastava, Aditya Kumar Jha, Mohhit Kumar Jha, Jaskirat Singh, Santosh Ravi Kiran, Pradeep Kumar Ramancharla, Harikumar K, K Madhava Krishna}, TITLE = {UAV-based Visual Remote Sensing for Automated Building Inspection}, BOOKTITLE = {European Conference on Computer Vision Workshops}, YEAR = {2022}}
Unmanned Aerial Vehicle (UAV)-based remote sensing systems incorporating computer vision have demonstrated potential for assisting building construction and disaster management tasks such as damage assessment during earthquakes. The vulnerability of a building to earthquakes can be assessed through inspection that takes into account the expected damage progression of the associated component and the component’s contribution to structural system performance. Most of these inspections are done manually, leading to high utilization of manpower, time, and cost. This paper proposes a methodology to automate these inspections through UAV-based image data collection and a software library for post-processing that helps in estimating the seismic structural parameters. The key parameters considered here are the distances between adjacent buildings, building plan shape, building plan area, objects on the rooftop, and rooftop layout. The accuracy of the proposed methodology in estimating the above-mentioned parameters is verified through field measurements taken using a distance measuring sensor and also from data obtained through Google Earth. Additional details and code can be accessed from …
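One of the estimated parameters, building plan area, reduces to simple arithmetic once a rooftop mask and ground sampling distance are available. Both inputs in the sketch below are assumptions.

```python
# A minimal sketch of one estimated parameter: building plan area from a
# nadir-view rooftop mask, assuming a known ground sampling distance (GSD,
# metres per pixel). Both inputs are assumptions.
import numpy as np

def plan_area_m2(roof_mask: np.ndarray, gsd_m_per_px: float) -> float:
    """roof_mask: boolean array, True on rooftop pixels."""
    return float(roof_mask.sum()) * gsd_m_per_px ** 2

# Example: a 50x40 px rooftop at 0.05 m/px covers 50 * 40 * 0.05**2 = 5 m^2.
```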
Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting
S Sravya Vardhani, Mansi Pradeep Khamkar, Divij Bajaj, Ganesh Ramakrishnan, Santosh Ravi Kiran
ACM international conference on Multimedia, ACMMM, 2021
@inproceedings{bib_Wisd_2021, AUTHOR = {S Sravya Vardhani, Mansi Pradeep Khamkar, Divij Bajaj, Ganesh Ramakrishnan, Santosh Ravi Kiran}, TITLE = {Wisdom of (Binned) Crowds: A Bayesian Stratification Paradigm for Crowd Counting}, BOOKTITLE = {ACM international conference on Multimedia}, YEAR = {2021}}
Datasets for training crowd counting deep networks are typically heavy-tailed in count distribution and exhibit discontinuities across the count range. As a result, the de facto statistical measures (MSE, MAE) exhibit large variance and tend to be unreliable indicators of performance across the count range. To address these concerns in a holistic manner, we revise processes at various stages of the standard crowd counting pipeline. To enable principled and balanced …
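The stratification idea can be sketched as count-binned minibatch sampling. The fixed bin edges below are illustrative; the paper instead derives its strata with a smoothed Bayesian procedure.

```python
# A minimal sketch of count-stratified minibatch sampling: partition training
# images into count bins and draw evenly from each, so the heavy tail does not
# dominate. The fixed bin edges are illustrative; the paper instead derives
# its strata with a smoothed Bayesian procedure.
import random

def stratified_batch(samples, counts, bin_edges, batch_size):
    bins = [[] for _ in range(len(bin_edges) + 1)]
    for s, c in zip(samples, counts):
        bins[sum(c >= e for e in bin_edges)].append(s)
    bins = [b for b in bins if b]              # drop empty strata
    per_bin = max(1, batch_size // len(bins))
    batch = []
    for b in bins:
        batch.extend(random.choices(b, k=per_bin))
    return batch[:batch_size]

# e.g. stratified_batch(images, gt_counts, bin_edges=[50, 500, 2000], batch_size=16)
```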
MeronymNet: A Hierarchical Model for Unified and Controllable Multi-Category Object Generation
Janmejay Pratap Singh Baghel, Abhishek Trivedi, Tejas Ravichandran, Santosh Ravi Kiran
ACM international conference on Multimedia, ACMMM, 2021
@inproceedings{bib_Mero_2021, AUTHOR = {Janmejay Pratap Singh Baghel, Abhishek Trivedi, Tejas Ravichandran, Santosh Ravi Kiran}, TITLE = {MeronymNet: A Hierarchical Model for Unified and Controllable Multi-Category Object Generation}, BOOKTITLE = {ACM international conference on Multimedia}, YEAR = {2021}}
We introduce MeronymNet, a novel hierarchical approach for controllable, part-based generation of multi-category objects using a single unified model. We adopt a guided coarse-to-fine strategy involving semantically conditioned generation of bounding box layouts, pixel-level part layouts and ultimately, the object depictions themselves. We use Graph Convolutional Networks, Deep Recurrent Networks along with custom-designed Conditional Variational Autoencoders to enable flexible, diverse and category-aware generation of 2-D objects in a controlled manner. The performance scores for generated objects reflect MeronymNet’s superior performance compared to multiple strong baselines and ablative variants.
An OCR for Classical Indic Documents Containing Arbitrarily Long Words
Agam Dwivedi, Rohit Saluja, Santosh Ravi Kiran
Computer Vision and Pattern Recognition Conference workshops, CVPR-W, 2020
@inproceedings{bib_An_O_2020, AUTHOR = {Agam Dwivedi, Rohit Saluja, Santosh Ravi Kiran}, TITLE = {An OCR for Classical Indic Documents Containing Arbitrarily Long Words}, BOOKTITLE = {Computer Vision and Pattern Recognition Conference workshops}, YEAR = {2020}}
OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, lack of datasets and long-length words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for the application of OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
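The reported metrics follow the standard edit-distance definitions, sketched below; this is not the paper's evaluation code.

```python
# Standard edit-distance definitions of the reported metrics (not the paper's
# evaluation code): character error rate over characters, word error rate over
# whitespace-separated tokens.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return levenshtein(ref, hyp) / max(1, len(ref))

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / max(1, len(r))
```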
HInDoLA: A Unified Cloud-based Platform for Annotation, Visualization and Machine Learning-based Layout Analysis of Historical Manuscripts
Abhishek Trivedi, Santosh Ravi Kiran
International Conference on Document Analysis and Recognition Workshops, ICDAR-W, 2019
@inproceedings{bib_HInD_2019, AUTHOR = {Abhishek Trivedi, Santosh Ravi Kiran}, TITLE = {HInDoLA: A Unified Cloud-based Platform for Annotation, Visualization and Machine Learning-based Layout Analysis of Historical Manuscripts}, BOOKTITLE = {International Conference on Document Analysis and Recognition Workshops}, YEAR = {2019}}
Palm-leaf manuscripts are one of the oldest media of inscription in many Asian countries. In particular, manuscripts from the Indian subcontinent form an important part of the world's literary and cultural heritage. Despite their significance, large-scale datasets for layout parsing and targeted annotation systems do not exist. Addressing this, we propose a web-based layout annotation and analytics system. Our system, called HInDoLA, features an intuitive annotation GUI, a graphical analytics dashboard and interfaces with machine-learning based intelligent modules on the backend. HInDoLA has successfully helped us create the first ever large-scale dataset for layout parsing of Indic palm-leaf manuscripts. This dataset, in turn, has been used to train and deploy deep-learning based modules for fully automatic and semi-automatic instance-level layout parsing.
Indiscapes: Instance segmentation networks for layout parsing of historical Indic manuscripts
Abhishek Prusty, Aitha Sowmya, Abhishek Trivedi, Santosh Ravi Kiran
International Conference on Document Analysis and Recognition, ICDAR, 2019
@inproceedings{bib_Indi_2019, AUTHOR = {Abhishek Prusty, Aitha Sowmya, Abhishek Trivedi, Santosh Ravi Kiran}, TITLE = {Indiscapes: Instance segmentation networks for layout parsing of historical Indic manuscripts}, BOOKTITLE = {International Conference on Document Analysis and Recognition}, YEAR = {2019}}
Historical palm-leaf manuscripts and early paper documents from the Indian subcontinent form an important part of the world's literary and cultural heritage. Despite their importance, large-scale annotated Indic manuscript image datasets do not exist. To address this deficiency, we introduce Indiscapes, the first ever dataset with multi-regional layout annotations for historical Indic manuscripts. To address the challenge of large diversity in scripts and presence of dense, irregular layout elements (e.g. text lines, pictures, multiple documents per image), we adapt a Fully Convolutional Deep Neural Network architecture for fully automatic, instance-level spatial layout parsing of manuscript images. We demonstrate the effectiveness of the proposed architecture on images from the Indiscapes dataset. For annotation flexibility and keeping the non-technical nature of domain experts in mind, we also contribute a custom, web-based …
Game of sketches: Deep recurrent models of Pictionary-style word guessing
Santosh Ravi Kiran, Shiv Surya, Trisha Mittal, R. Venkatesh Babu
AAAI Conference on Artificial Intelligence, AAAI, 2018
@inproceedings{bib_Game_2018, AUTHOR = {Santosh Ravi Kiran, Shiv Surya, Trisha Mittal, R. Venkatesh Babu}, TITLE = {Game of sketches: Deep recurrent models of Pictionary-style word guessing}, BOOKTITLE = {AAAI Conference on Artificial Intelligence}, YEAR = {2018}}
The ability of machine-based agents to play games in human-like fashion is considered a benchmark of progress in AI. In this paper, we introduce the first computational model aimed at Pictionary, the popular word-guessing social game. We first introduce Sketch-QA, an elementary version of the Visual Question Answering task. Styled after Pictionary, Sketch-QA uses incrementally accumulated sketch stroke sequences as visual data. Notably, Sketch-QA involves asking a fixed question ("What object is being drawn?") and gathering open-ended guess-words from human guessers. To mimic Pictionary-style guessing, we propose a deep neural model which generates guess-words in response to temporally evolving human-drawn sketches. Our model even makes human-like mistakes while guessing, thus amplifying the human mimicry factor. We evaluate our model on the large-scale guess-word dataset generated via the Sketch-QA task and compare with various baselines. We also conduct a Visual Turing Test to obtain human impressions of the guess-words generated by humans and our model. Experimental results demonstrate the promise of our approach for Pictionary and similarly themed games.
SketchParse: Towards rich descriptions for poorly drawn sketches using multi-task hierarchical deep networks
Santosh Ravi Kiran, Isht Dwivedi, Abhijat Biswas, Sahil Manocha, Venkatesh Babu R.
ACM international conference on Multimedia, ACMMM, 2017
@inproceedings{bib_Sket_2017, AUTHOR = {Santosh Ravi Kiran, Isht Dwivedi, Abhijat Biswas, Sahil Manocha, Venkatesh Babu R.}, TITLE = {SketchParse: Towards rich descriptions for poorly drawn sketches using multi-task hierarchical deep networks}, BOOKTITLE = {ACM international conference on Multimedia}, YEAR = {2017}}
The ability to semantically interpret hand-drawn line sketches, although very challenging, can pave the way for novel applications in multimedia. We propose SKETCHPARSE, the first deep-network architecture for fully automatic parsing of freehand object sketches. SKETCHPARSE is configured as a two-level fully convolutional network. The first level contains shared layers common to all object categories. The second level contains a number of expert sub-networks. Each expert specializes in parsing sketches from object categories which contain structurally similar parts. Effectively, the two-level configuration enables our architecture to scale up efficiently as additional categories are added. We introduce a router layer which (i) relays sketch features from shared layers to the correct expert and (ii) eliminates the need to manually specify object category during inference. To bypass laborious part-level annotation, we sketchify photos from semantic object-part image datasets and use them for training. Our architecture also incorporates object pose prediction as a novel auxiliary task which boosts overall performance while providing supplementary information regarding the sketch. We demonstrate SKETCHPARSE's abilities (i) on two challenging large-scale sketch datasets, (ii) in parsing unseen, semantically related object categories and (iii) in improving fine-grained sketch-based image retrieval. As a novel application, we also outline how SKETCHPARSE's output can be used to generate caption-style descriptions for hand-drawn sketches.
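The router-and-experts configuration can be sketched as follows. Module names and the argmax dispatch are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch (assumed module names) of the two-level configuration:
# shared layers feed a router that dispatches each sketch to a per-group
# expert, so no category label is needed at inference.
import torch
import torch.nn as nn

class RoutedParser(nn.Module):
    def __init__(self, shared, experts, router):
        super().__init__()
        self.shared = shared                   # layers common to all categories
        self.experts = nn.ModuleList(experts)  # one expert per category group
        self.router = router                   # predicts the category group

    def forward(self, sketch):
        feats = self.shared(sketch)
        group = self.router(feats).argmax(dim=1)   # expert index per sample
        outs = [self.experts[int(g)](feats[i:i + 1]) for i, g in enumerate(group)]
        return torch.cat(outs, dim=0)
```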
Enabling my robot to play Pictionary: Recurrent neural networks for sketch recognition
Santosh Ravi Kiran, Jogendra Kundu, Venkatesh Babu R
ACM international conference on Multimedia, ACMMM, 2016
@inproceedings{bib_Enab_2016, AUTHOR = {Santosh Ravi Kiran, Jogendra Kundu, Venkatesh Babu R}, TITLE = {Enabling my robot to play Pictionary: Recurrent neural networks for sketch recognition}, BOOKTITLE = {ACM international conference on Multimedia}, YEAR = {2016}}
Freehand sketching is an inherently sequential process. Yet, most approaches for hand-drawn sketch recognition either ignore this sequential aspect or exploit it in an ad-hoc manner. In our work, we propose a recurrent neural network architecture for sketch object recognition which exploits the long-term sequential and structural regularities in stroke data in a scalable manner. Specifically, we introduce a Gated Recurrent Unit based framework which leverages deep sketch features and weighted per-timestep loss to achieve state-of-the-art results on a large database of freehand object sketches across a large number of object categories. The inherently online nature of our framework is especially suited for on-the-fly recognition of objects as they are being drawn. Thus, our framework can enable interesting applications such as camera-equipped robots playing the popular party game Pictionary with human players and generating sparsified yet recognizable sketches of objects.
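The GRU framework with a weighted per-timestep loss can be sketched as below; dimensions and the linear weighting schedule are assumptions, not the paper's exact configuration.

```python
# A minimal sketch (assumed dimensions) of a GRU over per-timestep sketch
# features with a cross-entropy loss weighted toward later timesteps, when
# more of the drawing is visible.
import torch
import torch.nn as nn

class SketchGRU(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, n_classes=160):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, x):          # x: (batch, T, feat_dim) deep sketch features
        h, _ = self.gru(x)
        return self.cls(h)         # class logits at every timestep

def weighted_per_timestep_loss(logits, target):
    B, T, C = logits.shape
    w = torch.linspace(1.0 / T, 1.0, T, device=logits.device)  # later steps weigh more
    ce = nn.functional.cross_entropy(
        logits.reshape(B * T, C), target.repeat_interleave(T), reduction="none"
    ).reshape(B, T)
    return (ce * w).mean()
```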