No Prompting Frozen Foundation Models: Interactive Medical Volume Segmentation using Continual Test Time Adaptation of Compact Models
@inproceedings{bib_No_P_2025, AUTHOR = {Kushal Borkar, Abhilaksh Singh Reen, Jawahar C V, Chetan Arora}, TITLE = {No Prompting Frozen Foundation Models: Interactive Medical Volume Segmentation using Continual Test Time Adaptation of Compact Models}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2025}}
Automated segmentation of medical image volumes promises to reduce costly medical experts’ time for annotation. However, using machine learning for the task is challenging due to variations in imaging modalities and scarcity of patient data. While interactive image segmentation methods and foundational models incorporating user-provided prompts to refine segmentation masks have shown promise, they overlook crucial sequential information between the slices in 3D medical image volumes and videos, resulting in discontinuities in the segmentation results. This paper proposes a new framework that dynamically updates model parameters during inference in a test time training framework using user-provided scribbles. Our framework preserves acquired knowledge from the previous slices of the current medical volume and the training dataset via student-teacher learning. We evaluate our method on diverse CT, MRI, and microscopic cell datasets. Our framework significantly reduces user annotation time by a factor of 6.72×. Compared to other interactive segmentation methods, we reduce the time by a factor of 2.64×. Our method also outperforms prompting foundation models for segmentation by achieving a dice score of 0.9 in 3-4 interactions compared to 5-8 user interactions for the foundation model, significantly reducing annotation time for the CT and MRI volumes.
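The adaptation loop summarized above amounts to a small gradient update per user interaction. Below is a minimal, illustrative sketch of one such scribble-driven test-time step for a compact PyTorch segmentation model, with an EMA teacher as a stand-in for the knowledge-preservation component; all names (student, teacher, scribble_mask) are hypothetical and the framework's full schedule is omitted.

```python
# Minimal sketch of one scribble-driven test-time adaptation step with an EMA
# teacher to retain prior knowledge. Illustrative only; not the paper's exact API.
import torch
import torch.nn.functional as F

def tta_step(student, teacher, optimizer, image, scribble_labels, scribble_mask,
             ema_decay=0.99, consistency_weight=0.1):
    """image: (1, C, H, W); scribble_labels: (1, H, W) class ids;
    scribble_mask: (1, H, W) bool, True where the user drew a scribble."""
    student.train()
    logits = student(image)                       # (1, num_classes, H, W)

    # Supervised loss only on the scribbled pixels (partial cross-entropy).
    ce = F.cross_entropy(logits, scribble_labels, reduction="none")  # (1, H, W)
    sup_loss = (ce * scribble_mask).sum() / scribble_mask.sum().clamp(min=1)

    # Consistency with the teacher discourages forgetting what was learned on
    # earlier slices and on the training set.
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(image), dim=1)
    cons_loss = F.kl_div(F.log_softmax(logits, dim=1), teacher_probs,
                         reduction="batchmean")

    loss = sup_loss + consistency_weight * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Slowly move the teacher toward the student (exponential moving average).
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(ema_decay).add_(s_p, alpha=1 - ema_decay)
    return logits.argmax(dim=1)                   # updated segmentation mask
```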
Multiple Instance Learning for Glioma Diagnosis using Hematoxylin and Eosin Whole Slide Images: An Indian Cohort Study
@inproceedings{bib_Mult_2025, AUTHOR = {Ekansh Chauhan, Amit Sharma, Megha S Uppin, Jawahar C V, Vinod Palakkad Krishnanunni}, TITLE = {Multiple Instance Learning for Glioma Diagnosis using Hematoxylin and Eosin Whole Slide Images: An Indian Cohort Study}, BOOKTITLE = {IEEE Journal of Biomedical and Health Informatics}. YEAR = {2025}}
CHART-Info 2024: A dataset for Chart Analysis and Recognition
Kenny Davila,Rupak Lazarus,Fei Xu,Nicole Rodríguez Alcántara,Srirangaraj Setlur,Venu Govindaraju,Ajoy Mondal,Jawahar C V
@inproceedings{bib_CHAR_2024, AUTHOR = {Kenny Davila, Rupak Lazarus, Fei Xu, Nicole Rodríguez Alcántara, Srirangaraj Setlur, Venu Govindaraju, Ajoy Mondal, Jawahar C V}, TITLE = {CHART-Info 2024: A dataset for Chart Analysis and Recognition}, BOOKTITLE = {Pattern Recognition}. YEAR = {2024}}
Charts are tools for data communication used in a wide range of documents. Recently, the pattern recognition community has shown interest in developing methods for automatically processing charts found in the wild. Following previous efforts on ICPR’s CHART-Infographics competitions, here we propose a newer, larger dataset and benchmark for analyzing and recognizing charts. Inspired by the steps required to make sense of a chart image, the benchmark is divided into 7 different tasks: chart image classification, chart text detection and recognition, text role classification, axis analysis, legend analysis, data extraction, and end-to-end data extraction. We also show the performance of different baselines for the first five tasks. We expect that the increased scale of the proposed dataset will enable the development of better chart recognition systems.
Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives
Kristen Grauman,Bikram Boote,Eugene Byrne,Zach Chavis,Joya Chen,Jawahar C V,Andrew Westbury,Lorenzo Torresani,Kris Kitani,Jitendra Malik,Triantafyllos Afouras,Kumar Ashutosh,Vijay Baiyya,Siddhant Bansal
@inproceedings{bib_Ego-_2024, AUTHOR = {Kristen Grauman, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Jawahar C V, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal}, TITLE = {Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2024}}
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions—including a novel “expert commentary” done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community.
IPD-Brain: An Indian histopathology dataset for glioma subtype classification
Ekansh Chauhan,Amit Sharma,Megha Saha Uppin,Manasa Kondamadugu,Jawahar C V,Vinod Palakkad Krishnanunni
@inproceedings{bib_IPD-_2024, AUTHOR = {Ekansh Chauhan, Amit Sharma, Megha Saha Uppin, Manasa Kondamadugu, Jawahar C V, Vinod Palakkad Krishnanunni}, TITLE = {IPD-Brain: An Indian histopathology dataset for glioma subtype classification}, BOOKTITLE = {Scientific Data}. YEAR = {2024}}
The effective management of brain tumors relies on precise typing, subtyping, and grading. We present the IPD-Brain Dataset, a crucial resource for the neuropathological community, comprising 547 high-resolution H&E stained slides from 367 patients for the study of glioma subtypes and immunohistochemical biomarkers. Scanned at 40x magnification, this dataset is one of the largest in Asia, specifically focusing on the Indian demographics. It encompasses detailed clinical annotations, including patient age, sex, radiological findings, diagnosis, CNS WHO grade, and IHC biomarker status (IDH1R132H, ATRX and TP53 along with proliferation index, Ki67), providing a rich foundation for research. The dataset is open for public access and is designed for various applications, from machine learning model training to the exploration of regional and ethnic disease variations. Preliminary validations utilizing Multiple Instance Learning for tasks such as glioma subtype classification and IHC biomarker identification underscore its potential to significantly contribute to global collaboration in brain tumor research, enhancing diagnostic precision and understanding of glioma variability across different populations.
IndicOCR: A Pipeline for Recognizing Printed Documents for Indian Languages
@inproceedings{bib_Indi_2024, AUTHOR = {Krishna Tulsyan, Tessy Flemin, Ajoy Mondal, Jawahar C V}, TITLE = {IndicOCR: A Pipeline for Recognizing Printed Documents for Indian Languages}, BOOKTITLE = {India Joint International Conference on Data Science & Management of Data}. YEAR = {2024}}
India’s linguistic diversity is a testament to the cultural richness of the subcontinent, with 22 officially recognized languages, each possessing its unique script and identity. In this context, the need for efficient Optical Character Recognition (OCR) tools tailored to the complexities of these languages is paramount. IndicOCR is introduced as an innovative OCR pipeline designed to address this challenge. This paper aims to shed light on the capabilities of IndicOCR, underlining its role in bridging the gap between India’s linguistic heritage and the ever-expanding digital world. Finally, IndicOCR is a powerful tool for inclusivity, preservation, and progress in a linguistically diverse society.
Unlocking the Potential of Unstructured Data in Business Documents Through Document Intelligence
@inproceedings{bib_Unlo_2024, AUTHOR = {Sriranjani Ramakrishnan, Himanshu Sharad Bhatt, Sachin Raja, Jawahar C V}, TITLE = {Unlocking the Potential of Unstructured Data in Business Documents Through Document Intelligence}, BOOKTITLE = {India Joint International Conference on Data Science & Management of Data}. YEAR = {2024}}
With the recent advancements, organizations have brought data to the forefront of their digital transformation journeys. The financial services industry is also moving towards adopting data-driven strategies for improved and faster decision making and providing enhanced customer experience. While advances generally in Artificial Intelligence (AI) and specifically in Machine Learning (ML) have fueled a lot of analytics, it is largely restricted to structured data as it is well organized and is easy to work with. This tutorial presents the opportunities to unlock the potential in unstructured documents in the financial domain. These forms of data are more challenging to interpret, but can deliver a more comprehensive and holistic understanding of the bigger picture. While there are challenges around processing such documents, the ability to quickly make decisions by leveraging such data can provide differentiated value propositions and competitive benefits. This tutorial starts with select problems & challenges in the Document AI space and use cases involving such documents, shows the business opportunities present, and describes the technical challenges involved. Subsequently, we discuss techniques and algorithms for several document processing requirements and real-world applications.
Semantic Labels-Aware Transformer Model for Searching over a Large Collection of Lecture-Slides
@inproceedings{bib_Sema_2024, AUTHOR = {JOBIN K V, Anand Mishra, Jawahar C V}, TITLE = {Semantic Labels-Aware Transformer Model for Searching over a Large Collection of Lecture-Slides}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2024}}
https://jobinkv.github.io/lecsd
Massive Open Online Courses (MOOCs) enable easy access to many educational materials, particularly lecture slides, on the web. Searching through them based on user queries becomes an essential problem due to the availability of such vast information. To address this, we present Lecture Slide Deck Search Engine – a model that supports natural language queries and hand-drawn sketches and performs searches on a large collection of slide images on computer science topics. This search engine is trained using a novel semantic label-aware transformer model that extracts the semantic labels in the slide images and seamlessly encodes them with the visual cues from the slide images and textual cues from the natural language query. Further, to study the problem in a challenging setting, we introduce a novel dataset, namely the Lecture Slide Deck (LecSD) Dataset containing 54K slide images from the Data Structure, Computer Networks, and Optimization courses and provide associated manual annotation for the query in the form of natural language or hand-drawn sketch. The proposed Lecture Slide Deck Search Engine outperforms the competitive baselines and achieves nearly 4% superior Recall@1 on an absolute scale compared to the state-of-the-art approach. We firmly believe that this work will open up promising directions for improving the accessibility and usability of educational resources, enabling students and educators to find and utilize lecture materials more effectively.
ICPR 2024 Competition on Word Image Recognition from Indic Scene Images
@inproceedings{bib_ICPR_2024, AUTHOR = {Harsh Lunia, Ajoy Mondal, Jawahar C V}, TITLE = {ICPR 2024 Competition on Word Image Recognition from Indic Scene Images}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2024}}
Scene text recognition has historically concentrated on English, with limited advancements in developing solutions that perform well across multiple languages. Previous efforts in multilingual scene text recognition have predominantly targeted languages with considerable syntactic and semantic differences. However, Indian languages, while diverse, share numerous common features that remain largely underutilized. This competition aims to address the often-overlooked challenge of scene text recognition within the Indian context and to advance robust word image recognition across ten Indian languages. The dataset provided for this competition is one of the most comprehensive multilingual datasets, encompassing 10 languages, each with 17,500 training samples, 2,500 validation samples and 5,000 test word-image samples. The task was to correctly recognize the word-images, for which we received forty-nine registrations and five final submissions from industrial and research communities. The winning team achieved an average Character Recognition Rate (CRR) of 92.85% and a Word Recognition Rate (WRR) of 84.01% across the ten languages. This paper details the proposed dataset and summarizes the submissions for the competition, WIRIndic-2024.
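For context, the CRR and WRR figures quoted above are typically derived from character-level edit distance and exact word matches. The sketch below is one common formulation; the competition's official evaluation script may differ in detail.

```python
# Minimal sketch of Character Recognition Rate (CRR) and Word Recognition Rate
# (WRR): CRR from Levenshtein edit distance, WRR from exact word matches.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def crr_wrr(predictions, ground_truths):
    total_edits = sum(edit_distance(p, g) for p, g in zip(predictions, ground_truths))
    total_chars = sum(len(g) for g in ground_truths)
    crr = 1.0 - total_edits / max(total_chars, 1)
    wrr = sum(p == g for p, g in zip(predictions, ground_truths)) / len(ground_truths)
    return crr, wrr

# Toy usage with two word images: one recognized exactly, one with an error.
print(crr_wrr(["नमस्ते", "भरत"], ["नमस्ते", "भारत"]))
```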
Towards Digitization Filled Indic Handwritten Forms
@inproceedings{bib_Towa_2024, AUTHOR = {Shaon Bhattacharyya, Ajoy Mondal, Jawahar C V}, TITLE = {Towards Digitization Filled Indic Handwritten Forms}, BOOKTITLE = {International Conference on Computer vision and Image Processing}. YEAR = {2024}}
The demand for daily form digitization requires extensive manual effort. This paper presents an efficient pipeline for digitizing Indic handwritten forms, minimizing human intervention. We validate the pipeline using Hindi and Bengali forms, creating a dedicated test set, IIIT-Indic-Form-Test. Our approach enables form capture and orientation alignment via smartphone, extracting printed and handwritten fields with OCR enhanced by template annotations. A predefined word list supports post-processing for fields like name, state, and country. We compare our pipeline with tools like Google Parser and Microsoft Azure and conduct ablation studies on form style, rotation, and handwriting variations. A GUI-based application for digitization is also developed. The code and model are publicly available at https://github.com/shaoncvit/Indic_Handwritten_Form_dataset.
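The dictionary-based post-processing step mentioned above can be pictured as snapping a noisy OCR reading of a constrained field to the closest entry of a predefined word list. The sketch below uses only the standard library; the word lists and field names are illustrative, not the ones shipped with the pipeline.

```python
# Minimal sketch: correct OCR output for constrained fields (state, country, ...)
# by fuzzy matching against a predefined vocabulary.
import difflib

FIELD_VOCAB = {
    "state": ["Telangana", "West Bengal", "Maharashtra", "Karnataka"],
    "country": ["India", "Bangladesh", "Nepal"],
}

def correct_field(field_name: str, ocr_text: str, cutoff: float = 0.6) -> str:
    vocab = FIELD_VOCAB.get(field_name, [])
    matches = difflib.get_close_matches(ocr_text.strip(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_text  # fall back to the raw reading

print(correct_field("state", "Telangena"))   # -> "Telangana"
print(correct_field("country", "Indla"))     # -> "India"
```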
Understanding the Generalization of Pretrained Diffusion Models on Out-of-Distribution Data
R Sai Niranjan,Rudrabha Mukhopadhyay,Madhav Agarwal,Jawahar C V,Vinay Namboodiri
AAAI Conference on Artificial Intelligence, AAAI, 2024
@inproceedings{bib_Unde_2024, AUTHOR = {R Sai Niranjan, Rudrabha Mukhopadhyay, Madhav Agarwal, Jawahar C V, Vinay Namboodiri}, TITLE = {Understanding the Generalization of Pretrained Diffusion Models on Out-of-Distribution Data}, BOOKTITLE = {AAAI Conference on Artificial Intelligence}. YEAR = {2024}}
This work tackles the important task of understanding out-of-distribution behavior in two prominent types of generative models, i.e., GANs and Diffusion models. Understanding this behavior is crucial in understanding their broader utility and risks as these systems are increasingly deployed in our daily lives. Our first contribution is demonstrating that diffusion spaces outperform GANs' latent spaces in inverting high-quality OOD images. We also provide a theoretical analysis attributing this to the lack of prior holes in diffusion spaces. Our second significant contribution is to provide a theoretical hypothesis that diffusion spaces can be projected onto a bounded hypersphere, enabling image manipulation through geodesic traversal between inverted images. Our analysis shows that different geodesics share common attributes for the same manipulation, which we leverage to perform various image manipulations. We conduct thorough empirical evaluations to support and validate our claims. Finally, our third and final contribution introduces a novel approach to the few-shot sampling for out-of-distribution data by inverting a few images to sample from the cluster formed by the inverted latents. The proposed technique achieves state-of-the-art results for the few-shot generation task in terms of image quality. The effectiveness of the generated samples is further validated when used to augment data for underrepresented classes in a classification task, leading to improved performance. Our research underscores the promise of diffusion spaces in out-of-distribution imaging and offers avenues for further exploration.
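The geodesic traversal described above can be illustrated with spherical linear interpolation (slerp) between two inverted latents that are assumed to lie on, or be projected onto, a hypersphere. This is a generic slerp sketch, not the paper's exact projection or editing procedure, and the latent dimensionality is made up.

```python
# Minimal sketch: move along the great circle between two inverted latents.
import numpy as np

def slerp(z0, z1, t):
    """Interpolate along the geodesic between unit-norm latents z0 and z1."""
    z0, z1 = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0, z1), -1.0, 1.0))  # angle between latents
    if omega < 1e-6:                                        # nearly identical latents
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
z_a, z_b = rng.standard_normal(512), rng.standard_normal(512)  # stand-in inverted latents
path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 8)]  # points on the geodesic
```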
Enhancing Road Safety: Predictive Modeling of Accident-Prone Zones with ADAS-Equipped Vehicle Fleet Data
Ravi Shankar Mishra,Dev Singh Thakur,Anbumani Subramanian,Mukti Advani,S. Velmurugan,Juby Jose,Jawahar C V,Ravi Kiran Sarvadevabhatla
Intelligent Vehicles symposium, IV, 2024
@inproceedings{bib_Enha_2024, AUTHOR = {Ravi Shankar Mishra, Dev Singh Thakur, Anbumani Subramanian, Mukti Advani, S. Velmurugan, Juby Jose, Jawahar C V, Ravi Kiran Sarvadevabhatla}, TITLE = {Enhancing Road Safety: Predictive Modeling of Accident-Prone Zones with ADAS-Equipped Vehicle Fleet Data}, BOOKTITLE = {Intelligent Vehicles symposium}. YEAR = {2024}}
This work presents a novel approach to identifying possible early accident-prone zones in a large city-scale road network using geo-tagged collision alert data from a vehicle fleet. The alert data has been collected for a year from 200 city buses installed with the Advanced Driver Assistance System (ADAS). To the best of our knowledge, no research paper has used ADAS alerts to identify early accident-prone zones. A nonparametric technique called Kernel Density Estimation (KDE) is employed to model the distribution of alert data across stratified time intervals. A novel recall-based measure is introduced to assess the degree of support provided by our density-based approach for existing, manually determined accident-prone zones (‘blackspots’) provided by civic authorities. This shows that our KDE approach significantly outperforms existing approaches in terms of the recall-based measure. We also introduce a novel linear assignment Earth Mover Distance based measure to predict previously unidentified accident-prone zones. The results and findings support the feasibility of utilizing alert data from vehicle fleets to aid civic planners in assessing accident-zone trends and deploying traffic calming measures, thereby improving overall road safety and saving lives.
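The density-modelling step above can be sketched as fitting a KDE over geo-tagged alert locations and checking how many known blackspots fall inside the densest regions. The snippet below is illustrative: coordinates are treated as plain 2D points, and the bandwidth, percentile threshold, and synthetic data are assumptions, not the paper's settings.

```python
# Minimal sketch: KDE over ADAS alert locations and a recall-style check
# against known blackspots.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
alerts = rng.normal(loc=[17.4, 78.4], scale=0.02, size=(500, 2))     # (lat, lon) alerts
blackspots = rng.normal(loc=[17.4, 78.4], scale=0.02, size=(20, 2))  # known hotspots

kde = KernelDensity(kernel="gaussian", bandwidth=0.005).fit(alerts)

# A blackspot is "supported" if its density exceeds a high percentile of the
# density observed at the alert locations themselves.
alert_density = kde.score_samples(alerts)
threshold = np.percentile(alert_density, 75)
supported = kde.score_samples(blackspots) >= threshold
print(f"recall of known blackspots under the KDE hot zones: {supported.mean():.2f}")
```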
Lupus Nephritis Subtype Classification with only Slide Level labels
Amit Sharma,Ekansh Chauhan,Megha S Uppin,Liza Rajasekhar,Jawahar C V,Vinod Palakkad Krishnanunni
Medical Imaging with Deep Learning, MIDL, 2024
@inproceedings{bib_Lupu_2024, AUTHOR = {Amit Sharma, Ekansh Chauhan, Megha S Uppin, Liza Rajasekhar, Jawahar C V, Vinod Palakkad Krishnanunni}, TITLE = {Lupus Nephritis Subtype Classification with only Slide Level labels}, BOOKTITLE = {Medical Imaging with Deep Learning}. YEAR = {2024}}
Lupus Nephritis classification has historically relied on labor-intensive and meticulous glomerular-level labeling of renal structures in whole slide images (WSIs). However, this approach presents a formidable challenge due to its tedious and resource-intensive nature, limiting its scalability and practicality in clinical settings. In response to this challenge, our work introduces a novel methodology that utilizes only slide-level labels, eliminating the need for granular glomerular-level labeling. A comprehensive multi-stained lupus nephritis digital histopathology WSI dataset was created from the Indian population, which is the largest of its kind. LupusNet, a deep learning MIL-based model, was developed to classify LN subtypes. The results underscore its effectiveness, achieving an AUC score of 91.0%, an F1 score of 77.3%, and an accuracy of 81.1% on our dataset in distinguishing membranous and diffused classes of LN.
IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic
Chirag Parikh,Rohit Saluja,Jawahar C V,Ravi Kiran Sarvadevabhatla
International Conference on Robotics and Automation, ICRA, 2024
@inproceedings{bib_IDD-_2024, AUTHOR = {Chirag Parikh, Rohit Saluja, Jawahar C V, Ravi Kiran Sarvadevabhatla}, TITLE = {IDD-X: A Multi-View Dataset for Ego-relative Important Object Localization and Explanation in Dense and Unstructured Traffic}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2024}}
Intelligent vehicle systems require a deep understanding of the interplay between road conditions, surrounding entities, and the ego vehicle's driving behavior for safe and efficient navigation. This is particularly critical in developing countries where traffic situations are often dense and unstructured with heterogeneous road occupants. Existing datasets, predominantly geared towards structured and sparse traffic scenarios, fall short of capturing the complexity of driving in such environments. To fill this gap, we present IDD-X, a large-scale dual-view driving video dataset. With 697K bounding boxes, 9K important object tracks, and 1-12 objects per video, IDD-X offers comprehensive ego-relative annotations for multiple important road objects covering 10 categories and 19 explanation label categories. The dataset also incorporates rearview information to provide a more complete representation of the driving environment. We also introduce custom-designed deep networks aimed at multiple important object localization and per-object explanation prediction. Overall, our dataset and introduced prediction models form the foundation for studying how road conditions and surrounding entities affect driving behavior in complex traffic situations.
System and method for automatically generating a sign language video with an input speech using a machine learning model
Jawahar C V,Parul Kapoor,Sindhu Balachandra Hegde,Rudrabha Mukhopadhyay,Vinay Namboodiri
United States Patent, Us patent, 2023
@inproceedings{bib_Syst_2023, AUTHOR = {Jawahar C V, Parul Kapoor, Sindhu Balachandra Hegde, Rudrabha Mukhopadhyay, Vinay Namboodiri}, TITLE = {System and method for automatically generating a sign language video with an input speech using a machine learning model}, BOOKTITLE = {United States Patent}. YEAR = {2023}}
Embodiments herein provide a system and method for automatically generating a sign language video from an input speech using the machine learning model. The method includes (i) extracting a plurality of spectrograms of an input speech by (a) encoding, using an encoder, a time domain series of the input speech to a frequency domain series, and (b) decoding, using a decoder, a plurality of tokens for time steps of the frequency domain series, (ii) generating a plurality of pose sequences for a current time step of the plurality of spectrograms using a first machine learning model, and (iii) automatically generating, using a discriminator of a second machine learning model, a sign language video for the input speech using the plurality of pose sequences and the plurality of spectrograms when the plurality of pose sequences are matched with corresponding the plurality of spectrograms that are extracted.
Towards Efficient Semantic Segmentation via Meta Pruning
Ashutosh Mishra,Shyam Nandan Rai,Girish Varma,Jawahar C V
International Conference on Computer vision and Image Processing, CVIP, 2023
@inproceedings{bib_Towa_2023, AUTHOR = {Ashutosh Mishra, Shyam Nandan Rai, Girish Varma, Jawahar C V}, TITLE = {Towards Efficient Semantic Segmentation via Meta Pruning}, BOOKTITLE = {International Conference on Computer vision and Image Processing}. YEAR = {2023}}
Semantic segmentation provides a pixel-level understanding of an image essential for various scene-understanding vision tasks. However, semantic segmentation models demand significant computational resources during training and inference. These requirements pose a challenge in resource-constrained scenarios. To address this issue, we present a compression algorithm based on differentiable meta-pruning through a hypernetwork: MPHyp. Our proposed method MPHyp utilizes hypernetworks that take latent vectors as input and output weight matrices for the segmentation model. L1 sparsification, applied via a proximal gradient optimizer, updates the latent vectors and introduces sparsity, leading to automatic model pruning. The proposed method offers the benefit of achieving controllable compression during training and significantly reducing the training time. We compare our methodology with a popular pruning approach and demonstrate its efficacy by reducing the number of parameters and floating point operations while maintaining the mean Intersection over Union (mIoU) metric. We conduct experiments on two widely accepted semantic segmentation architectures: UNet and ERFNet. Our experiments and ablation study demonstrate the effectiveness of our proposed methodology by achieving efficient and reasonable segmentation results.
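The hypernetwork-plus-proximal-L1 recipe above can be illustrated with a toy layer: a latent vector generates the layer's weights through a small hypernetwork, and a soft-thresholding (proximal L1) step after each gradient update drives latent entries to zero, which is what makes filters prunable. This is an illustrative toy in PyTorch, not the paper's architecture or loss.

```python
# Minimal sketch of differentiable meta-pruning through a hypernetwork.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvHyperLayer(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, latent_dim=16):
        super().__init__()
        self.z = nn.Parameter(torch.randn(latent_dim))               # pruned latent
        self.hyper = nn.Linear(latent_dim, out_ch * in_ch * k * k)   # hypernetwork
        self.shape = (out_ch, in_ch, k, k)

    def forward(self, x):
        weight = self.hyper(self.z).view(self.shape)                 # generated weights
        return F.conv2d(x, weight, padding=1)

def proximal_l1_(z: torch.Tensor, lam: float, lr: float):
    """In-place soft-thresholding: prox of lam*||z||_1 after a gradient step."""
    with torch.no_grad():
        z.copy_(torch.sign(z) * torch.clamp(z.abs() - lr * lam, min=0.0))

layer = ConvHyperLayer(3, 8)
opt = torch.optim.SGD(layer.parameters(), lr=0.01)
x, target = torch.randn(2, 3, 32, 32), torch.randn(2, 8, 32, 32)
for _ in range(10):
    loss = F.mse_loss(layer(x), target)      # stand-in for the segmentation loss
    opt.zero_grad(); loss.backward(); opt.step()
    proximal_l1_(layer.z, lam=0.1, lr=0.01)  # zeroed latent entries => prunable filters
print("non-zero latent entries:", int((layer.z != 0).sum()))
```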
System and method for detecting object in an adaptive environment using a machine learning model
Rohit Saluja,Chetan Arora,Vineeth N Balasubramanian,Vaishnavi Khindkar,Jawahar C V
United States Patent, Us patent, 2023
@inproceedings{bib_Syst_2023, AUTHOR = {Rohit Saluja, Chetan Arora, Vineeth N Balasubramanian, Vaishnavi Khindkar, Jawahar C V}, TITLE = {System and method for detecting object in an adaptive environment using a machine learning model}, BOOKTITLE = {United States Patent}. YEAR = {2023}}
A method for detecting object in an image in a target environment that is adapted to a source environment using a machine learning model is provided. The method includes (i) extracting features from source image associated with source environment and target image associated with target environment, (ii) generating a feature map based on the features, (iii) generating a pixel-wise probability output map (iv) determining a first environment invariant feature map by combining the feature map with the pixel-wise probability output map, (v) determining a second environment invariant feature map by combining the first environment invariant feature map and the features, (vi) generating environment invariant feature maps at different instances, (vii) extracting environment invariant features based on the environment invariant feature maps, (viii) detecting the object in the image in the target environment that is adapted to the source environment by training the machine learning model using the environment invariant features.
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman,Bikram Boote,Eugene Byrne,Zach Chavis,Joya Chen,Feng Cheng,Jawahar C V,Andrew Westbury,Lorenzo Torresani,Kris Kitani,Jitendra Malik,Triantafyllos Afouras,Kumar Ashutosh,Vijay Baiyya,Siddhant Bansal
Technical Report, arXiv, 2023
@inproceedings{bib_Ego-_2023, AUTHOR = {Kristen Grauman, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Jawahar C V, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal}, TITLE = {Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions—including a novel “expert commentary” done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.
United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos
Siddhant Bansal,Chetan Arora,Jawahar C V
Technical Report, arXiv, 2023
@inproceedings{bib_Unit_2023, AUTHOR = {Siddhant Bansal, Chetan Arora, Jawahar C V}, TITLE = {United We Stand, Divided We Fall: UnityGraph for Unsupervised Procedure Learning from Videos}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Given multiple videos of the same task, procedure learning addresses identifying the key-steps and determining their order to perform the task. For this purpose, existing approaches use the signal generated from a pair of videos. This makes key-steps discovery challenging as the algorithms lack inter-videos perspective. Instead, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL consists of the novel UnityGraph that represents all the videos of a task as a graph to obtain both intra-video and inter-videos context. Further, to obtain similar embeddings for the same key-steps, the embeddings of UnityGraph are updated in an unsupervised manner using the Node2Vec algorithm. Finally, to identify the key-steps, we cluster the embeddings using KMeans. We test GPL on benchmark ProceL, CrossTask, and EgoProceL datasets and achieve an average improvement of 2% on third-person datasets and 3.6% on EgoProceL over the state-of-the-art.
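The graph-then-cluster recipe above can be sketched as: connect frame embeddings within and across videos of the same task, embed the graph nodes, and cluster the node embeddings into key-steps. The paper uses Node2Vec for the node embeddings; the sketch below substitutes scikit-learn's spectral embedding as a dependency-light stand-in, and the frame features are random placeholders.

```python
# Minimal sketch of building a unified graph over frames from several videos
# and clustering its node embeddings into key-steps.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in frame features: 3 videos x 40 frames x 128-d (e.g. from a video encoder).
frames = rng.standard_normal((3 * 40, 128))

# A k-NN graph over all frames gives both intra-video and inter-video edges.
adjacency = kneighbors_graph(frames, n_neighbors=10, include_self=False)
adjacency = 0.5 * (adjacency + adjacency.T)          # symmetrise
adjacency = adjacency.toarray()                      # small graph, dense is fine

node_emb = SpectralEmbedding(n_components=16, affinity="precomputed").fit_transform(adjacency)
key_step_ids = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(node_emb)
print(key_step_ids[:20])   # per-frame key-step assignment
```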
CueCAn: Cue-driven Contextual Attention for Identifying Missing Traffic Signs on Unconstrained Roads
Varun Gupta,Anbumani Subramanian,Jawahar C V,Rohit Saluja
International Conference on Robotics and Automation, ICRA, 2023
@inproceedings{bib_CueC_2023, AUTHOR = {Varun Gupta, Anbumani Subramanian, Jawahar C V, Rohit Saluja}, TITLE = {CueCAn: Cue-driven Contextual Attention for Identifying Missing Traffic Signs on Unconstrained Roads}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2023}}
Unconstrained Asian roads often involve poor infrastructure, affecting overall road safety. Missing traffic signs are a regular part of such roads. Missing or non-existing object detection has been studied for locating missing curbs and estimating reasonable regions for pedestrians on road scene images. Such methods involve analyzing task-specific single object cues. In this paper, we present the first and most challenging video dataset for missing objects, with multiple types of traffic signs for which the cues are visible without the signs in the scenes. We refer to it as the Missing Traffic Signs Video Dataset (MTSVD). MTSVD is challenging compared to the previous works in two aspects i) The traffic signs are generally not present in the vicinity of their cues, ii) The traffic signs’ cues are diverse and unique. Also, MTSVD is the first publicly available missing object dataset. To train the models for identifying missing signs, we complement our dataset with 10K traffic sign tracks, with 40% of the traffic signs having cues visible in the scenes. For identifying missing signs, we propose the Cue-driven Contextual Attention units (CueCAn), which we incorporate in our model’s encoder. We first train the encoder to classify the presence of traffic sign cues and then train the entire segmentation model end-to-end to localize missing traffic signs. Quantitative and qualitative analysis shows that CueCAn significantly improves the performance of base models.
Understanding Video Scenes through Text: Insights from Text-based Video Question Answering
Soumya Shamarao Jahagirdar,Minesh Mathew,Dimosthenis Karatzas,Jawahar C V
International Conference on Computer Vision Workshops, ICCV-W, 2023
@inproceedings{bib_Unde_2023, AUTHOR = {Soumya Shamarao Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, Jawahar C V}, TITLE = {Understanding Video Scenes through Text Insights from Text-based Video Question Answering}, BOOKTITLE = {International Conference on Computer Vision Workshops}. YEAR = {2023}}
Researchers have extensively studied the field of vision and language, discovering that both visual and textual content is crucial for understanding scenes effectively. Particularly, comprehending text in videos holds great significance, requiring both scene text understanding and temporal reasoning. This paper focuses on exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content. The NewsVideoQA dataset contains question-answer pairs related to the text in news videos, while M4-ViteVQA comprises question-answer pairs from diverse categories like vlogging, traveling, and shopping. We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions. Additionally, the study includes experimentation with BERT-QA, a text-only model, which demonstrates comparable performance to the original methods on both datasets, indicating the shortcomings in the formulation of these datasets. Furthermore, we also look into the domain adaptation aspect by examining the effectiveness of training on M4-ViteVQA and evaluating on NewsVideoQA and vice-versa, thereby shedding light on the challenges and potential benefits of out-of-domain training.
An Approach for Speech Enhancement in Low SNR Environments using Granular Speaker Embedding
Jayashree Saha,Rudrabha Mukhopadhyay,Agrawal Aparna Nitin,Surabhi Jain,Jawahar C V
Joint International Conference on Data Science & Management of Data, CODS-COMAD, 2023
@inproceedings{bib_An_A_2023, AUTHOR = {Jayashree Saha, Rudrabha Mukhopadhyay, Agrawal Aparna Nitin, Surabhi Jain, Jawahar C V}, TITLE = {An Approach for Speech Enhancement in Low SNR Environments using Granular Speaker Embedding}, BOOKTITLE = {Joint International Conference on Data Science & Management of Data}. YEAR = {2023}}
The proliferation of speech technology applications has led to an unprecedented demand for effective speech enhancement techniques, particularly in low Signal-to-Noise Ratio (SNR) conditions. This research presents a novel approach to speech enhancement, specifically designed for very low SNR scenarios. Our technique focuses on speaker embedding at a granular level and highlights its consistent impact on enhancing speech quality and improving Automatic Speech Recognition (ASR) performance, a significant downstream task. Experimental findings demonstrate competitive speech quality and substantial enhancements in ASR accuracy compared to alternative methods in low SNR situations. The proposed technique offers promising advancements in addressing the challenges posed by low SNR conditions in speech technology applications.
Towards Accurate Lip-to-Speech Synthesis in-the-Wild
Sindhu Balachandra Hegde,Rudrabha Mukhopadhyay,Jawahar C V,Vinay Namboodiri
ACM international conference on Multimedia, ACMMM, 2023
@inproceedings{bib_Towa_2023, AUTHOR = {Sindhu Balachandra Hegde, Rudrabha Mukhopadhyay, Jawahar C V, Vinay Namboodiri}, TITLE = {Towards Accurate Lip-to-Speech Synthesis in-the-Wild}, BOOKTITLE = {ACM international conference on Multimedia}. YEAR = {2023}}
In this paper, we introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements. The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone, resulting in unsatisfactory outcomes. To overcome this issue, we propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model. The noisy text is generated using a pre-trained lip-to-text model, enabling our approach to work without text annotations during inference. We design a visual text-to-speech network that utilizes the visual stream to generate accurate speech, which is in-sync with the silent input video. We perform extensive experiments and ablation studies, demonstrating our approach's superiority over the current state-of-the-art methods on various benchmark datasets. Further, we demonstrate an essential practical application of our method in assistive technology by generating speech for an ALS patient who has lost the voice but can make mouth movements. Our demo video, code, and additional details can be found at http://cvit.iiit.ac.in/research/projects/cvit-projects/ms-l2s-itw.
Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation
Avijit Dasgupta,Jawahar C V,Karteek Alahari
Technical Report, arXiv, 2023
@inproceedings{bib_Over_2023, AUTHOR = {Avijit Dasgupta, Jawahar C V, Karteek Alahari}, TITLE = {Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts in source and target domains remain source-dependent as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. Thus, we treat the problem of source-free video domain adaptation as learning from noisy labels and argue that the samples with correct pseudo-labels can help us in adaptation. To this end, we leverage the crossentropy loss as an indicator of the correctness of the pseudo-labels and use the resulting small-loss samples from the target domain for fine-tuning the model. We further enhance the adaptation performance by implementing a teacher-student framework, in which the teacher, which is updated gradually, produces reliable pseudo-labels. Meanwhile, the student undergoes fine-tuning on the target domain videos using these generated pseudo-labels to improve its performance. Extensive experimental evaluations show that our methods, termed as CleanAdapt, CleanAdapt + TS, achieve state-of-the-art results, outperforming the existing approaches on various open datasets. Our source code is publicly available at https://avijit9.github.io/CleanAdapt.
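The small-loss selection step at the heart of the approach above can be sketched as: score every target-domain clip with the source-pretrained model, keep the fraction with the lowest cross-entropy under its own pseudo-label, and fine-tune on those. Names and the selection ratio below are illustrative, not the paper's values.

```python
# Minimal sketch of small-loss pseudo-label selection for source-free adaptation.
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_clean_samples(model, target_clips, keep_ratio=0.5):
    """target_clips: (N, ...) batch of target-domain inputs; returns the indices
    of the presumed-clean samples and their pseudo-labels."""
    model.eval()
    logits = model(target_clips)                       # (N, num_classes)
    pseudo_labels = logits.argmax(dim=1)
    losses = F.cross_entropy(logits, pseudo_labels, reduction="none")  # (N,)
    num_keep = max(1, int(keep_ratio * len(losses)))
    keep_idx = torch.argsort(losses)[:num_keep]        # smallest-loss samples
    return keep_idx, pseudo_labels[keep_idx]

# Fine-tuning then proceeds on target_clips[keep_idx] with these pseudo-labels,
# optionally with a gradually updated teacher regenerating the labels.
```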
System and method for automatically generating synthetic head videos using a machine learning model
Jawahar C V,Aditya Agarwal,Bipasha Sen,Rudrabha Mukhopadhyay,Vinay Namboodiri
United States Patent, Us patent, 2023
@inproceedings{bib_Syst_2023, AUTHOR = {Jawahar C V, Aditya Agarwal, Bipasha Sen, Rudrabha Mukhopadhyay, Vinay Namboodiri}, TITLE = {System and method for automatically generating synthetic head videos using a machine learning model}, BOOKTITLE = {United States Patent}. YEAR = {2023}}
Embodiments herein provide a system and a method for automatically generating at least one synthetic talking head video using a machine learning model. The method includes (i) extracting features from each frame of a video that is extracted from data sources,(ii) analyzing, using a face-detection model, the video to determine a driving face video if a number of identities, and faces of speakers are equal to one in all frames of the video,(iii) generating, using a text to speech model, synthetic speech utterances by automatically selecting a vocabulary of words and sentences from the data sources,(iv) modifying lip movements that are originally present in the driving face video corresponding to the synthetic speech utterances, and (v) generating, using machine learning model, synthetic talking head video based on the lip movements that are modified corresponding to the synthetic speech utterances.
Explaining Deep Face Algorithms Through Visualization: A Survey
THRUPTHI ANN JOHN, Vineeth N Balasubramanian,Jawahar C V
IEEE Transactions on Biometrics, Behavior, and Identity Science, TBOIM, 2023
@inproceedings{bib_Expl_2023, AUTHOR = {THRUPTHI ANN JOHN, Vineeth N Balasubramanian, Jawahar C V}, TITLE = {Explaining Deep Face Algorithms Through Visualization: A Survey}, BOOKTITLE = {IEEE Transactions on Biometrics, Behavior, and Identity Science}. YEAR = {2023}}
Although current deep models for face tasks surpass human performance on some benchmarks, we do not understand how they work. Thus, we cannot predict how they will react to novel inputs, resulting in catastrophic failures and unwanted biases in the algorithms. Explainable AI helps bridge the gap, but currently, there are very few visualization algorithms designed for faces. This work undertakes a first-of-its-kind meta-analysis of explainability algorithms in the face domain. We explore the nuances and caveats of adapting general-purpose visualization algorithms to the face domain, illustrated by computing visualizations on popular face models. We review existing face explainability works and reveal valuable insights into the structure and hierarchy of face networks. We also determine the design considerations for practical face visualizations accessible to AI practitioners by conducting a user study on the utility of various explainability algorithms.
Dataset agnostic document object detection
Ajoy Mondal,Madhav Agarwal,Jawahar C V
Pattern Recognition, PR, 2023
@inproceedings{bib_Data_2023, AUTHOR = {Ajoy Mondal, Madhav Agarwal, Jawahar C V}, TITLE = {Dataset agnostic document object detection}, BOOKTITLE = {Pattern Recognition}. YEAR = {2023}}
Localizing document objects such as tables, figures, and equations is a primary step for extracting information from document images. We propose a novel end-to-end trainable deep network, termed Document Object Localization Network (DOLNet), for detecting various objects present in document images. The proposed network is a multi-stage extension of Mask R-CNN with a dual backbone having deformable convolution for detecting document objects with high detection accuracy at a higher IoU threshold. We also empirically evaluate the proposed DOLNet on publicly available benchmark datasets. The proposed DOLNet achieves state-of-the-art performance for most of the benchmark datasets under various existing experimental environments. Our solution has three important properties: (i) a single trained DOLNet model that performs well across all the popular benchmark datasets, (ii) it reports excellent performance across multiple benchmarks, including at higher IoU thresholds, and (iii) it consistently demonstrates superior quantitative performance by following the same protocol as the recent works for each of the benchmarks.
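For orientation, a plain Mask R-CNN baseline for document objects can be set up as below. DOLNet itself is a multi-stage, dual-backbone extension with deformable convolutions; this sketch only configures the standard torchvision model with document classes (the class list and count are illustrative).

```python
# Minimal sketch of a vanilla Mask R-CNN configured for document-object classes,
# following the usual torchvision fine-tuning recipe. Baseline only, not DOLNet.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 5  # background + table, figure, equation, text block (illustrative)

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box and mask heads so they predict the document classes.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, NUM_CLASSES)
in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, NUM_CLASSES)

# Training then follows the standard torchvision detection recipe: the model
# takes a list of image tensors and target dicts with "boxes", "labels", and
# "masks", and returns a dict of losses in train mode.
```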
IndicSTR12: A Dataset for Indic Scene Text Recognition
Harsh Lunia,Ajoy Mondal,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2023
@inproceedings{bib_Indi_2023, AUTHOR = {Harsh Lunia, Ajoy Mondal, Jawahar C V}, TITLE = {IndicSTR12: A Dataset for Indic Scene Text Recognition}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2023}}
The importance of Scene Text Recognition (STR) in today’s increasingly digital world cannot be overstated. Given the significance of STR, data-intensive deep learning approaches that auto-learn feature mappings have primarily driven the development of STR solutions. Several benchmark datasets and substantial work on deep learning models are available for Latin languages to meet this need. For Indian languages, which are syntactically and semantically more complex and are spoken and read by 1.3 billion people, less work and fewer datasets are available. This paper aims to address the Indian space’s lack of a comprehensive dataset by proposing the largest and most comprehensive real dataset - IndicSTR12 - and benchmarking STR performance on 12 major Indian languages (Assamese, Bengali, Odia, Marathi, Hindi, Kannada, Urdu, Telugu, Malayalam, Tamil, Gujarati, and Punjabi). A few works have addressed the same …
AI-Assisted Screening of Oral Potentially Malignant Disorders Using Smartphone-Based Photographic Images
Talwar Vivek Jayant,Jawahar C V,Vinod Palakkad Krishnanunni, Pragya Singh,Nirza Mukhia,Anupama Shetty,Praveen Birur,Karishma M. Desai,Chinnababu Sunkavall,Konala S. Varma,Ramanathan Sethuraman
@inproceedings{bib_AI-A_2023, AUTHOR = {Talwar Vivek Jayant, Jawahar C V, Vinod Palakkad Krishnanunni, Pragya Singh, Nirza Mukhia, Anupama Shetty, Praveen Birur, Karishma M. Desai, Chinnababu Sunkavall, Konala S. Varma, Ramanathan Sethuraman}, TITLE = {AI-Assisted Screening of Oral Potentially Malignant Disorders Using Smartphone-Based Photographic Images}, BOOKTITLE = {Cancers}. YEAR = {2023}}
The prevalence of oral potentially malignant disorders (OPMDs) and oral cancer is surging in low- and middle-income countries. A lack of resources for population screening in remote locations delays the detection of these lesions in the early stages and contributes to higher mortality and a poor quality of life. Digital imaging and artificial intelligence (AI) are promising tools for cancer screening. This study aimed to evaluate the utility of AI-based techniques for detecting OPMDs in the Indian population using photographic images of oral cavities captured using a smartphone. A dataset comprising 1120 suspicious and 1058 non-suspicious oral cavity photographic images taken by trained front-line healthcare workers (FHWs) was used for evaluating the performance of different deep learning models based on convolution (DenseNets) and Transformer (Swin) architectures. The best-performing model was also tested on an additional independent test set comprising 440 photographic images taken by untrained FHWs (set I). DenseNet201 and Swin Transformer (base) models show high classification performance with an F1-score of 0.84 (CI 0.79–0.89) and 0.83 (CI 0.78–0.88) on the internal test set, respectively. However, the performance of models decreases on test set I, which has considerable variation in the image quality, with the best F1-score of 0.73 (CI 0.67–0.78) obtained using DenseNet201. The proposed AI model has the potential to identify suspicious and non-suspicious oral lesions using photographic images. This simplified image-based AI solution can assist in screening, early detection, and prompt referral for OPMDs.
Reading Between the Lanes: Text VideoQA on the Road
George Tom,MINESH MATHEW,Sergi Garcia-Bordils, Dimosthenis Karatzas,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2023
@inproceedings{bib_Read_2023, AUTHOR = {George Tom, MINESH MATHEW, Sergi Garcia-Bordils, Dimosthenis Karatzas, Jawahar C V}, TITLE = {Reading Between the Lanes: Text VideoQA on the Road}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2023}}
Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, while textual cues typically appear for a short time span, and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this issue, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on our RoadTextVQA dataset, highlighting the significant potential for improvement in this domain and the usefulness of the dataset in advancing research on in-vehicle support systems and text-aware multimodal question answering. The dataset is available at http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtextvqa
ICDAR 2023 Competition on RoadText Video Text Detection, Tracking and Recognition
George Tom,MINESH MATHEW,Sergi Garcia-Bordils,Dimosthenis Karatzas,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2023
@inproceedings{bib_ICDA_2023, AUTHOR = {George Tom, MINESH MATHEW, Sergi Garcia-Bordils, Dimosthenis Karatzas, Jawahar C V}, TITLE = {ICDAR 2023 Competition on RoadText Video Text Detection, Tracking and Recognition}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2023}}
In this report, we present the final results of the ICDAR 2023 Competition on RoadText Video Text Detection, Tracking and Recognition. The RoadText challenge is based on the RoadText-1K dataset and aims to assess and enhance current methods for scene text detection, recognition, and tracking in videos. The RoadText-1K dataset contains 1000 dash cam videos with annotations for text bounding boxes and transcriptions in every frame. The competition features an end-to-end task, requiring systems to accurately detect, track, and recognize text in dash cam videos. The paper presents a comprehensive review of the submitted methods along with a detailed analysis of the results obtained by the methods. The analysis provides valuable insights into the current capabilities and limitations of video text detection, tracking, and recognition systems for dashcam videos.
ICDAR 2023 Competition on Visual Question Answering on Business Document Images
Sachin Raja,Ajoy Mondal,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2023
@inproceedings{bib_ICDA_2023, AUTHOR = {Sachin Raja, Ajoy Mondal, Jawahar C V}, TITLE = {ICDAR 2023 Competition on Visual Question Answering on Business Document Images}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2023}}
This paper presents the competition report on Visual Question Answering (VQA) on Business Document Images (VQAonBD) held at the 17th International Conference on Document Analysis and Recognition (ICDAR 2023). Understanding business documents is a crucial step toward making an important financial decision. It remains a manual process in most industrial applications. Given the requirement for a large-scale solution to this problem, it has recently seen a surge in interest from the document image research community. Credit underwriters and business analysts often look for answers to a particular set of questions to reach a decisive conclusion. This competition is designed to encourage research in this broader area to find answers to questions with minimal human supervision. Some problem-specific challenges include an accurate understanding of the questions/queries, figuring out cross-document
ICDAR 2023 Competition on Indic Handwriting Text Recognition
Ajoy Mondal,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2023
@inproceedings{bib_ICDA_2023, AUTHOR = {Ajoy Mondal, Jawahar C V}, TITLE = {ICDAR 2023 Competition on Indic Handwriting Text Recognition}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2023}}
This paper presents the competition report on Indic Handwriting Text Recognition (IHTR) held at the 17th International Conference on Document Analysis and Recognition (ICDAR 2023 IHTR). Handwriting text recognition is an essential component for analyzing handwritten documents. Several good recognizers are available for English handwriting text in the literature. In the case of Indic languages, limited work is available due to several challenging factors. (i) Two or more characters are often combined to form conjunct characters, (ii) most Indic scripts have around 100 unique Unicode characters, (iii) diversity in handwriting styles, (iv) varying ink density around the words, (v) challenging layouts with overlap between words
Towards Real-Time Analysis of Broadcast Badminton Videos
Nitin Nilesh,Tushar Sharma,ANURAG GHOSH,Jawahar C V
Technical Report, arXiv, 2023
@inproceedings{bib_Towa_2023, AUTHOR = {Nitin Nilesh, Tushar Sharma, ANURAG GHOSH, Jawahar C V}, TITLE = {Towards Real-Time Analysis of Broadcast Badminton Videos}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Analysis of player movements is a crucial subset of sports analysis. Existing player movement analysis methods use recorded videos after the match is over. In this work, we propose an end-to-end framework for player movement analysis for badminton matches on live broadcast match videos. We only use the visual inputs from the match and, unlike other approaches which use multi-modal sensor data, our approach uses only visual cues. We propose a method to calculate the on-court distance covered by both the players from the video feed of a live broadcast badminton match. To perform this analysis, we focus on the gameplay by removing replays and other redundant parts of the broadcast match. We then perform player tracking to identify and track the movements of both players in each frame. Finally, we calculate the distance covered by each player and the average speed with which they move on the court. We further show a heatmap of the areas covered by the player on the court which is useful for analyzing the gameplay of the player. Our proposed framework was successfully used to analyze live broadcast matches in real-time during the Premier Badminton League 2019 (PBL 2019), with commentators and broadcasters appreciating the utility
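The distance computation described above reduces to mapping per-frame player positions from image pixels to court coordinates and summing successive displacements. The sketch below assumes a fixed image-to-court homography; the homography matrix and pixel track are made up for illustration, and in practice the homography would come from court-line calibration.

```python
# Minimal sketch: project tracked pixel positions to court coordinates with a
# homography, then accumulate on-court distance and average speed.
import numpy as np

def to_court(H, pts_px):
    """Apply a 3x3 homography to (N, 2) pixel points, returning court metres."""
    pts = np.hstack([pts_px, np.ones((len(pts_px), 1))])   # homogeneous coords
    proj = pts @ H.T
    return proj[:, :2] / proj[:, 2:3]

def distance_covered(H, track_px, fps=25.0):
    court_pts = to_court(H, np.asarray(track_px, dtype=float))
    steps = np.linalg.norm(np.diff(court_pts, axis=0), axis=1)
    total = steps.sum()
    avg_speed = total / (len(track_px) / fps)               # metres per second
    return total, avg_speed

H = np.array([[0.01, 0.0, -3.0], [0.0, 0.012, -1.0], [0.0, 0.0, 1.0]])  # toy homography
track = [(400, 600), (420, 610), (460, 630), (500, 640)]                 # pixel positions
print(distance_covered(H, track))
```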
Reading Between the Lanes: Text VideoQA on the Road
George Tom,MINESH MATHEW,Sergi Garcia-Bordils,Dimosthenis Karatzas,Jawahar C V
Technical Report, arXiv, 2023
@inproceedings{bib_Read_2023, AUTHOR = {George Tom, MINESH MATHEW, Sergi Garcia-Bordils, Dimosthenis Karatzas, Jawahar C V}, TITLE = {Reading Between the Lanes: Text VideoQA on the Road}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, while textual cues typically appear for a short time span, and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this issue, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on our RoadTextVQA dataset, highlighting the significant potential for improvement in this domain and the usefulness of the dataset in advancing research on in-vehicle support systems and text-aware multimodal question answering.
CueCAn: Cue Driven Contextual Attention For Identifying Missing Traffic Signs on Unconstrained Roads
VARUN GUPTA,Anbumani Subramanian,Jawahar C V,Rohit Saluja
International Conference on Robotics and Automation, ICRA, 2023
@inproceedings{bib_CueC_2023, AUTHOR = {VARUN GUPTA, Anbumani Subramanian, Jawahar C V, Rohit Saluja}, TITLE = {CueCAn: Cue Driven Contextual Attention For Identifying Missing Traffic Signs on Unconstrained Roads}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2023}}
Unconstrained Asian roads often involve poor infrastructure, affecting overall road safety. Missing traffic signs are a regular part of such roads. Missing or non-existent object detection has been studied for locating missing curbs and estimating reasonable regions for pedestrians in road scene images. Such methods involve analyzing task-specific single-object cues. In this paper, we present the first and most challenging video dataset for missing objects, with multiple types of traffic signs for which the cues are visible without the signs in the scenes. We refer to it as the Missing Traffic Signs Video Dataset (MTSVD). MTSVD is challenging compared to previous works in two aspects: i) the traffic signs are generally not present in the vicinity of their cues, and ii) the traffic sign cues are diverse and unique. MTSVD is also the first publicly available missing-object dataset. To train models for identifying missing signs, we complement our dataset with 10K traffic sign tracks, with 40 percent of the traffic signs having cues visible in the scenes. For identifying missing signs, we propose Cue-driven Contextual Attention units (CueCAn), which we incorporate in our model encoder. We first train the encoder to classify the presence of traffic sign cues and then train the entire segmentation model end-to-end to localize missing traffic signs. Quantitative and qualitative analysis shows that CueCAn significantly improves the performance of base models.
HWNet v3: a joint embedding framework for recognition and retrieval of handwritten text
PRAVEEN KRISHNAN,KARTIK DUTTA,Jawahar C V
International Journal on Document Analysis and Recognition, IJDAR, 2023
@inproceedings{bib_HWNe_2023, AUTHOR = {PRAVEEN KRISHNAN, KARTIK DUTTA, Jawahar C V}, TITLE = {HWNet v3: a joint embedding framework for recognition and retrieval of handwritten text}, BOOKTITLE = {International Journal on Document Analysis and Recognition}. YEAR = {2023}}
Learning an efficient label embedding framework for word images enables effective word spotting in handwritten documents. In this work, we propose different schemes of label embedding for word images using deep neural architectures and their representations. We refer to our first scheme as the two-stage label embedding technique, which projects both word images and their corresponding textual strings into a common subspace. We further introduce an end-to-end label embedding scheme using a deep neural architecture, which simplifies the embedding process and reports state-of-the-art performance for the tasks of word spotting and recognition. We also validate the role of synthetic data as a complementary modality to further enhance the embedding process. On the challenging IAM handwritten dataset, we report an mAP of 0.9753 for query-by-string word spotting, while under lexicon-based word recognition, our proposed method reports character and word error rates of 1.67 and 3.62, respectively. We also present a detailed ablation study on various variants of our end-to-end embedding architecture and analyze performance under varying embedding sizes. We further validate the embedding scheme on degraded printed document datasets from both Latin and Indic scripts.
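As a rough illustration of query-by-string word spotting in such a joint space, the sketch below ranks word images by cosine similarity to an embedded query string; embed_string and embed_image are hypothetical stand-ins for the learned text and image encoders, not the paper's API.

import torch
import torch.nn.functional as F

def spot_by_string(query_string, word_images, embed_string, embed_image, top_k=10):
    # Embed the query string and all word images into the common subspace,
    # then rank gallery items by cosine similarity to the query.
    with torch.no_grad():
        q = F.normalize(embed_string(query_string), dim=-1)                                    # (D,)
        gallery = F.normalize(torch.stack([embed_image(im) for im in word_images]), dim=-1)    # (N, D)
        scores = gallery @ q                                                                   # cosine similarities
        return torch.topk(scores, k=min(top_k, len(word_images)))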
Towards Generating Ultra-High Resolution Talking-Face Videos With Lip Synchronization
Anchit Gupta,Rudrabha Mukhopadhyay,Sindhu Balachandra Hegde,Faizan Farooq Khan,Vinay P. Namboodiri,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_Towa_2023, AUTHOR = {Anchit Gupta, Rudrabha Mukhopadhyay, Sindhu Balachandra Hegde, Faizan Farooq Khan, Vinay P. Namboodiri, Jawahar C V}, TITLE = {Towards Generating Ultra-High Resolution Talking-Face Videos With Lip Synchronization}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2023}}
Talking-face video generation works have achieved state-of-the-art results in synthesizing videos with lip synchronization. However, most previous works deal with low-resolution talking-face videos (up to 256×256 pixels), and generating extremely high-resolution videos remains a challenge. We take a giant leap in this work and propose a novel method to synthesize talking-face videos at resolutions as high as 4K! Our task presents several key challenges: (i) scaling the existing methods to such high resolutions is resource-constrained, both in terms of compute and the availability of very high-resolution datasets, and (ii) the synthesized videos need to be spatially and temporally coherent. The sheer number of pixels that the model needs to generate while maintaining temporal consistency at the video level makes this task non-trivial and has never been attempted before in the literature. To address these issues, we propose to train the lip-sync generator in a compact Vector Quantized (VQ) space for the first time.
Watching the News Towards VideoQA Models that can Read
Soumya Shamarao Jahagirdar,MINESH MATHEW,Dimosthenis Karatzas,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_Watc_2023, AUTHOR = {Soumya Shamarao Jahagirdar, MINESH MATHEW, Dimosthenis Karatzas, Jawahar C V}, TITLE = {Watching the News Towards VideoQA Models that can Read}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2023}}
Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented by combining visual and textual cues in the video. We introduce the “NewsVideoQA” dataset that comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world. We demonstrate the limitations of current Scene Text VQA and VideoQA methods and propose ways to incorporate scene text information into VideoQA methods.
Unsupervised Audio-Visual Lecture Segmentation
Darshan Singh S,Anchit Gupta,Jawahar C V,Makarand Tapaswi
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_Unsu_2023, AUTHOR = {Darshan Singh S, Anchit Gupta, Jawahar C V, Makarand Tapaswi}, TITLE = {Unsupervised Audio-Visual Lecture Segmentation}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2023}}
Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation that splits lectures into bite-sized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH [44]. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
IDD-3D: Indian Driving Dataset for 3D Unstructured Road Scenes
Shubham Dokania,A. H. Abdul Hafez,Anbumani Subramanian,Manmohan Chandraker,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_IDD-_2023, AUTHOR = {Shubham Dokania, A. H. Abdul Hafez, Anbumani Subramanian, Manmohan Chandraker, Jawahar C V}, TITLE = {IDD-3D: Indian Driving Dataset for 3D Unstructured Road Scenes}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2023}}
Autonomous driving and assistance systems rely on annotated data from traffic and road scenarios to model and learn the various object relations in complex real-world scenarios. Preparation and training of deployable deep learning architectures require the models to be suited to different traffic scenarios and adapt to different situations. Currently, existing datasets, while large-scale, lack such diversity and are geographically biased towards mainly developed cities. The unstructured and complex driving layouts found in several developing countries such as India pose a challenge to these models due to the sheer degree of variation in object types, densities, and locations. To facilitate better research toward accommodating such scenarios, we build a new dataset, IDD-3D, which consists of multi-modal data from multiple cameras and LiDAR sensors with 12k annotated driving LiDAR frames across various traffic scenarios. We discuss the need for this dataset through statistical comparisons with existing datasets and highlight benchmarks on standard 3D object detection and tracking tasks in complex layouts. Code and data are available.
Audio-visual face reenactment
Madhav Agarwal,Rudrabha Mukhopadhyay,Vinay Namboodiri,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_Audi_2023, AUTHOR = {Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, Jawahar C V}, TITLE = {Audio-visual face reenactment}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2023}}
This work proposes a novel method to generate realistic talking-head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated from learnable keypoints. We improve the quality of lip sync using audio as an additional input, helping the network to attend to the mouth region. We use additional priors from face segmentation and face mesh to improve the structure of the reconstructed faces. Finally, we improve the visual quality of the generations by incorporating a carefully designed identity-aware generator module. The identity-aware generator takes the source image and the warped motion features as input to generate a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperform the current techniques both qualitatively and quantitatively. Our work opens up several applications, including enabling low-bandwidth video calls. We release a demo video and additional information at http://cvit.iiit.ac.in/
ScribbleNet: Efficient interactive annotation of urban city scenes for semantic segmentation
Bhavani S,Ashutosh Gupta,Jawahar C V,Chetan Arora
Pattern Recognition, PR, 2023
@inproceedings{bib_Scri_2023, AUTHOR = {Bhavani S, Ashutosh Gupta, Jawahar C V, Chetan Arora}, TITLE = {ScribbleNet: Efficient interactive annotation of urban city scenes for semantic segmentation}, BOOKTITLE = {Pattern Recognition}. YEAR = {2023}}
Annotation is a crucial first step in the semantic segmentation of urban images that facilitates the development of autonomous navigation systems. However, annotating complex urban images is time-consuming and challenging. It requires significant human effort, making it expensive and error-prone. To reduce human effort during annotation, multiple images need to be annotated in a short time span. In this paper, we introduce ScribbleNet, an interactive image segmentation algorithm to address this issue. Our approach provides users with a pre-segmented image and iteratively improves the segmentation using scribbles as annotation input. The method is based on conditional inference and exploits the learnt correlations in a deep neural network (DNN). ScribbleNet can: (1) work with urban city scenes captured in unseen environments, (2) annotate new classes not present in the training set, and (3) correct several labels at once. We compare this method with other interactive segmentation approaches on multiple datasets such as CityScapes, BDD, Mapillary Vistas, KITTI, and IDD. ScribbleNet reduces the annotation time of an image by up to 14.7× over manual annotation and up to 5.4× over current approaches. The algorithm is integrated into the publicly available LabelMe image annotation tool and will be released as open-source software.
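A minimal sketch of the underlying idea, scribble-driven refinement at annotation time, is shown below: the loss is computed only at scribbled pixels and a few gradient steps update the segmentation network. The loss, optimizer, and step count here are assumptions for illustration; ScribbleNet's exact conditional-inference procedure may differ.

import torch
import torch.nn.functional as F

def refine_with_scribbles(model, image, scribble_labels, steps=5, lr=1e-4):
    # scribble_labels: (H, W) long tensor holding a class id at scribbled pixels and -1 elsewhere.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(image.unsqueeze(0))                       # (1, C, H, W)
        loss = F.cross_entropy(logits, scribble_labels.unsqueeze(0),
                               ignore_index=-1)                  # supervise only scribbled pixels
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model(image.unsqueeze(0)).argmax(dim=1)[0]            # refined label map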
Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale
Aditya Agarwal,Bipasha Sen,Rudrabha Mukhopadhyay,Vinay Namboodiri,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_Towa_2023, AUTHOR = {Aditya Agarwal, Bipasha Sen, Rudrabha Mukhopadhyay, Vinay Namboodiri, Jawahar C V}, TITLE = {Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2023}}
Many people with some form of hearing loss consider lipreading their primary mode of day-to-day communication. However, finding resources to learn or improve one’s lipreading skills can be challenging. This was further exacerbated during the COVID-19 pandemic due to restrictions on direct interactions with peers and speech therapists. Today, online MOOC platforms like Coursera and Udemy have become the most effective form of training for many types of skill development. However, online lipreading resources are scarce, as creating such resources is an extensive process needing months of manual effort to record hired actors. Because of the manual pipeline, such platforms are also limited in vocabulary, supported languages, accents, and speakers, and have a high usage cost. In this work, we investigate the possibility of replacing real human talking videos with synthetically generated videos. Synthetic data can easily incorporate larger vocabularies, variations in accent, local languages, and many speakers. We propose an end-to-end automated pipeline to develop such a platform using state-of-the-art talking head video generator networks, text-to-speech models, and computer vision techniques. We then perform an extensive human evaluation using carefully designed lipreading exercises to validate the quality of our platform against existing lipreading platforms. Our studies concretely point toward the potential of our approach in developing a large-scale lipreading MOOC platform that can impact millions of people with hearing loss.
FaceOff: A Video-to-Video Face Swapping System
Aditya Agarwal,Bipasha Sen,Rudrabha Mukhopadhyay,Vinay Namboodiri,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2023
@inproceedings{bib_Face_2023, AUTHOR = {Aditya Agarwal, Bipasha Sen, Rudrabha Mukhopadhyay, Vinay Namboodiri, Jawahar C V}, TITLE = {FaceOff: A Video-to-Video Face Swapping System}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2023}}
Doubles play an indispensable role in the movie industry. They take the place of the actors in dangerous stunt scenes or scenes where the same actor plays multiple characters. The double’s face is later replaced with the actor’s face and expressions manually using expensive CGI technology, costing millions of dollars and taking months to complete. An automated, inexpensive, and fast alternative is to use face-swapping techniques that aim to swap an identity from a source face video (or an image) to a target face video. However, such methods cannot preserve the source expressions of the actor, which are important for the scene’s context. To tackle this challenge, we introduce video-to-video (V2V) face-swapping, a novel face-swapping task that preserves (1) the identity and expressions of the source (actor) face video and (2) the background and pose of the target (double) video. We propose FaceOff, a V2V face-swapping system that operates by learning a robust blending operation to merge two face videos following the constraints above. It reduces the videos to a quantized latent space and then blends them in the reduced space. FaceOff is trained in a self-supervised manner and robustly tackles the non-trivial challenges of V2V face-swapping. As shown in the experimental section, FaceOff significantly outperforms alternate approaches qualitatively and quantitatively.
Generating Personalized Summaries of Day Long Egocentric Videos
Pravin Nagar,Anuj Rathore,Jawahar C V,Chetan Arora
IEEE Transaction on Pattern Analysis Machine Intelligence, TPAMI, 2023
@inproceedings{bib_Gene_2023, AUTHOR = {Pravin Nagar, Anuj Rathore, Jawahar C V, Chetan Arora}, TITLE = {Generating Personalized Summaries of Day Long Egocentric Videos}, BOOKTITLE = {IEEE Transaction on Pattern Analysis Machine Intelligence}. YEAR = {2023}}
The popularity of egocentric cameras and their always-on nature has led to an abundance of day-long first-person videos. The highly redundant nature of these videos and extreme camera shakes make them difficult to watch from beginning to end. These videos require efficient summarization tools for consumption. However, traditional summarization techniques developed for static surveillance videos or highly curated sports videos and movies are either not suitable or simply do not scale to such hours-long videos in the wild. On the other hand, specialized summarization techniques developed for egocentric videos limit their focus to important objects and people. This paper presents a novel unsupervised reinforcement learning framework to summarize egocentric videos both in terms of length and content. The proposed framework facilitates incorporating various prior preferences such as faces, places, or …
System and method for lip-syncing a face to target speech using a machine learning model
Jawahar C V,Rudrabha Mukhopadhyay,Prajwal K R,Vinay Namboodiri
United States Patent, Us patent, 2022
@inproceedings{bib_Syst_2022, AUTHOR = {Jawahar C V, Rudrabha Mukhopadhyay, Prajwal K R, Vinay Namboodiri}, TITLE = {System and method for lip-syncing a face to target speech using a machine learning model}, BOOKTITLE = {United States Patent}. YEAR = {2022}}
A processor-implemented method is provided for generating a lip-sync for a face to a target speech in a live session, in one or more languages, in sync and with improved visual quality, using a machine learning model and a pre-trained lip-sync model. The method includes (i) determining a visual representation of the face and an audio representation, where the visual representation includes crops of the face; (ii) modifying the crops of the face to obtain masked crops; (iii) obtaining a reference frame from the visual representation at a second timestamp; (iv) combining the masked crops at the first timestamp with the reference frame to obtain lower-half crops; (v) training the machine learning model by providing historical lower-half crops and historical audio representations as training data; (vi) generating lip-synced frames for the face to the target speech; and (vii) generating in-sync lip-synced frames using the pre-trained lip-sync model.
ETL: Efficient Transfer Learning for Face Tasks
THRUPTHI ANN JOHN,Isha Dua,Vineeth N Balasubramanian,Jawahar C V
International Conference on Computer Vision Theory and Applications, VISAPP, 2022
@inproceedings{bib_ETL:_2022, AUTHOR = {THRUPTHI ANN JOHN, Isha Dua, Vineeth N Balasubramanian, Jawahar C V}, TITLE = {ETL: Efficient Transfer Learning for Face Tasks}, BOOKTITLE = {International Conference on Computer Vision Theory and Applications}. YEAR = {2022}}
Transfer learning is a popular method for obtaining deep trained models for data-scarce face tasks such as head pose and emotion. However, current transfer learning methods are inefficient and time-consuming, as they do not fully account for the relationships between related tasks. Moreover, the transferred model is large and computationally expensive. As an alternative, we propose ETL: a technique that efficiently transfers a pre-trained model to a new task by retaining only cross-task aware filters, resulting in a sparse transferred model. We demonstrate the effectiveness of ETL by transferring VGGFace, a popular face recognition model, to four diverse face tasks. Our experiments show that we attain a size reduction of up to 97% and an inference time reduction of up to 94% while retaining 99.5% of the baseline transfer learning accuracy.
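The sketch below illustrates the general notion of retaining only a subset of convolutional filters when transferring a model; it ranks filters by a simple L1-norm proxy and zeroes out the rest. The actual ETL criterion is cross-task awareness, not L1 norm, so this is only a toy stand-in.

import torch
import torch.nn as nn

def sparsify_conv_filters(model, keep_ratio=0.25):
    # Keep the top fraction of output filters in every conv layer and zero the rest,
    # yielding a sparse transferred model (importance proxy here: per-filter L1 norm).
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            w = module.weight.data                         # (out_channels, in_channels, kH, kW)
            importance = w.abs().flatten(1).sum(dim=1)     # one score per output filter
            k = max(1, int(keep_ratio * len(importance)))
            mask = torch.zeros_like(importance, dtype=torch.bool)
            mask[importance.topk(k).indices] = True
            w[~mask] = 0.0
    return model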
Plant Disease Classification Using Hybrid Features
Muthireddy Vamsidhar,Jawahar C V
International Conference on Computer vision and Image Processing, CVIP, 2022
@inproceedings{bib_Plan_2022, AUTHOR = {Muthireddy Vamsidhar, Jawahar C V}, TITLE = {Plant Disease Classification Using Hybrid Features}, BOOKTITLE = {International Conference on Computer vision and Image Processing}. YEAR = {2022}}
Deep learning has shown remarkable performance in image classification, including classification of plants and leaves. However, networks that achieve high accuracy may not be using the salient regions for making predictions and could be prone to biases. This work proposes a neural network architecture incorporating handcrafted features and fusing them with the learned features. Using hybrid features provides better control and understanding of the feature space while leveraging deep learning capabilities. Furthermore, a new IoU-based metric is introduced to assess the CNN-based classifier’s performance based on the regions it focuses on when making predictions. Experiments over multiple leaf disease datasets demonstrate the performance improvement of the model using hybrid features. Classification using hybrid features performed better than the baseline models in terms of P@1 and also on the IoU …
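A hybrid-feature model of this kind can be pictured as concatenating precomputed handcrafted descriptors with CNN features before the classifier, as in the hedged sketch below; the backbone, feature dimensions, and class count are placeholders rather than the paper's configuration.

import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    def __init__(self, backbone, cnn_dim=512, handcrafted_dim=64, num_classes=10):
        super().__init__()
        self.backbone = backbone                       # any CNN returning (B, cnn_dim) features
        self.head = nn.Sequential(
            nn.Linear(cnn_dim + handcrafted_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, images, handcrafted):
        # Fuse learned and handcrafted features by concatenation, then classify.
        fused = torch.cat([self.backbone(images), handcrafted], dim=1)
        return self.head(fused)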
Mobile Captured Glass Board Image Enhancement
BODDAPATI MAHESH,Ajoy Mondal,Jawahar C V
International Conference on Computer vision and Image Processing, CVIP, 2022
@inproceedings{bib_Mobi_2022, AUTHOR = {BODDAPATI MAHESH, Ajoy Mondal, Jawahar C V}, TITLE = {Mobile Captured Glass Board Image Enhancement}, BOOKTITLE = {International Conference on Computer vision and Image Processing}. YEAR = {2022}}
Note-taking methods and devices have improved tremendously over the past few decades, and people are finding new ways to write notes and take photos. Automatic extraction, recognition, and retrieval are necessary to process the huge volume of digitized document data. An important step in all of these pipelines is pre-processing, mainly image enhancement or clean-up, which enhances the text regions and suppresses the non-text regions. In this article, we look at the problem of image enhancement or clean-up on one such important class of images: mobile-captured glass board images. We present a simple yet efficient algorithm based on classical image processing techniques to solve the problem, and the obtained results are promising in comparison to Office Lens.
Deep semantic binarization for document images
Ajoy Mondal,chetan Reddy,Jawahar C V
Multimedia Tools and Applications, MT&A, 2022
@inproceedings{bib_Deep_2022, AUTHOR = {Ajoy Mondal, chetan Reddy, Jawahar C V}, TITLE = {Deep semantic binarization for document images}, BOOKTITLE = {Multimedia Tools and Applications}. YEAR = {2022}}
Binarization is an essential pre-processing step for many document image analysis tasks. Binarization of handwritten documents is more challenging than that of printed documents because of the non-uniform density of ink and the variable thickness of strokes. Instead of traditional scanners, people nowadays use mobile cameras to capture documents, including text written on whiteboards and glass boards. The quality of camera-captured document images is often poor compared with scanned document images, which impacts binarization accuracy. This paper presents a deep learning-based binarization framework called Deep Semantic Binarization (DSB) to binarize various document images. We pose the document image binarization problem as a pixel-wise two-class classification task. Deep networks, including DSB, require many training images. However, the publicly available benchmark datasets contain only a limited number of training images. We explore various training strategies, including transfer learning, to handle this data scarcity. Due to the unavailability of mobile-captured whiteboard and glass board images, we created two datasets, namely WBIDS-IIIT and GBIDS-IIIT, with associated ground truths. We validate DSB on the public benchmark DIBCO dataset and on the WBIDS-IIIT and GBIDS-IIIT datasets. We empirically demonstrate that DSB outperforms the state-of-the-art techniques on WBIDS-IIIT, GBIDS-IIIT, and the public datasets.
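Posing binarization as pixel-wise two-class classification can be sketched with a tiny encoder-decoder trained under binary cross-entropy, as below; the architecture is deliberately minimal and is not the DSB network itself.

import torch
import torch.nn as nn

class TinyBinarizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1))                      # one foreground logit per pixel

    def forward(self, gray_image):                    # (B, 1, H, W), values in [0, 1]
        return self.decoder(self.encoder(gray_image))

model = TinyBinarizer()
criterion = nn.BCEWithLogitsLoss()                    # pixel-wise two-class objective
logits = model(torch.rand(2, 1, 64, 64))              # dummy batch of grayscale crops
loss = criterion(logits, torch.randint(0, 2, (2, 1, 64, 64)).float())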
Automatic Annotation of Handwritten Document Images at Word Level
Ajoy Mondal,Krishna Tulsyan,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2022
@inproceedings{bib_Auto_2022, AUTHOR = {Ajoy Mondal, Krishna Tulsyan, Jawahar C V}, TITLE = {Automatic Annotation of Handwritten Document Images at Word Level}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2022}}
Recent deep learning-based recognizers need a large annotated corpus for creating the model. Manually annotating a large corpus is time-consuming, costly, and tedious. In this work, we propose a framework for automatic annotation at the word level for given handwritten data and corresponding text sequences (or corpora). The proposed framework consists of five modules: (i) pre-processing, (ii) word detection, (iii) word recognition, (iv) alignment, and (v) manual correction and verification. The pre-processing module cleans the image and crops the text region from an image. Word detection and recognition modules localize and recognize words. Because of errors in writing, it is necessary to align words in the text sequence with the word images obtained during detection and recognition. The alignment module aligns words in the text sequence to the word images. A human annotator then corrects the errors in the automatic annotation process and verifies the document. Finally, we created an annotated dataset containing word images and their corresponding ground-truth transcriptions. In this work, we demonstrate the proposed tool by annotating 14 sets corresponding to 13 Indic languages and English. Each set contains 15,000 handwritten document images. On this extensive collection of handwritten document images in 14 languages, 80% of words are correctly annotated by the automatic annotation tool, while the remaining 20% are corrected manually.
An empirical study of CTC based models for OCR of Indian languages
MINESH MATHEW,Jawahar C V
Technical Report, arXiv, 2022
@inproceedings{bib_An_e_2022, AUTHOR = {MINESH MATHEW, Jawahar C V}, TITLE = {An empirical study of CTC based models for OCR of Indian languages}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Recognition of text on word or line images, without the need for sub-word segmentation, has become the mainstream of research and development of text recognition for Indian languages. Modelling unsegmented sequences using Connectionist Temporal Classification (CTC) is the most commonly used approach for segmentation-free OCR. In this work we present a comprehensive empirical study of various neural network models that use CTC for transcribing step-wise predictions in the neural network output to a Unicode sequence. The study is conducted for 13 Indian languages, using an internal dataset that has around 1000 pages per language. We study the choice of line vs. word as the recognition unit and the use of synthetic data to train the models. We compare our models with popular publicly available OCR tools for end-to-end document image recognition. Our end-to-end pipeline that employs our recognition models and existing text segmentation tools outperforms these public OCR tools for 8 out of the 13 languages. We also introduce a new public dataset called Mozhi for word and line recognition in Indian languages. The dataset contains more than 1.2 million annotated word images (120 thousand text lines) across 13 Indian languages. Our code, trained models and the Mozhi dataset will be made available at cvit-projects
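A representative CTC-trained recognizer of the kind studied here pairs convolutional features with a bidirectional LSTM and per-timestep character logits; the sketch below uses placeholder sizes and a placeholder vocabulary, not the models from the paper.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_chars, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.rnn = nn.LSTM(128 * 8, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_chars + 1)           # +1 for the CTC blank symbol

    def forward(self, line_image):                               # (B, 1, 32, W) grayscale strip
        f = self.conv(line_image)                                # (B, 128, 8, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)                     # (B, W/4, 128*8): one step per column
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)                      # (B, T, num_chars+1)

model = CRNN(num_chars=100)                                      # roughly 100 Unicode symbols per script
log_probs = model(torch.rand(2, 1, 32, 128))                     # T = 32 timesteps for this width
targets = torch.randint(1, 101, (2, 10))                         # dummy label sequences
loss = nn.CTCLoss(blank=0)(log_probs.permute(1, 0, 2),           # CTC expects (T, B, C)
                           targets,
                           input_lengths=torch.full((2,), 32),
                           target_lengths=torch.full((2,), 10))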
System and method of scribble based segmentation for medical imaging using machine learning
Jawahar C V,Bhavani S,Ashutosh Gupta,Chetan Arora
United States Patent, Us patent, 2022
@inproceedings{bib_Syst_2022, AUTHOR = {Jawahar C V, Bhavani S, Ashutosh Gupta, Chetan Arora}, TITLE = {System and method of scribble based segmentation for medical imaging using machine learning}, BOOKTITLE = {United States Patent}. YEAR = {2022}}
A system and method for generating an optimized medical image using a machine learning model are provided. The method includes (i) receiving one or more medical images, (ii) segmenting to generate a transformed medical image for detecting a plurality of target elements, (iii) displaying the transformed medical image, (iv) receiving markings and scribblings associated with scribble locations from a user, (v) identifying errors associated with an outline of a target element, (vi) computing a loss function for a location of pixels where the target element is located on the transformed medical image, (vii) modifying the pre-defined weights (w) to match the segmentation output and the determined target element, (viii) determining whether the segmentation output is matched with the target element, and (ix) generating the optimized medical image if the segmentation output is matched with the determined target element.
System and method for generating an optimized image with scribble-based annotation of images using a machine learning model
Jawahar C V,Bhavani S,Ashutosh Gupta,Chetan Arora
United States Patent, Us patent, 2022
@inproceedings{bib_Syst_2022, AUTHOR = {Jawahar C V, Bhavani S, Ashutosh Gupta, Chetan Arora}, TITLE = {System and method for generating an optimized image with scribble-based annotation of images using a machine learning model}, BOOKTITLE = {United States Patent}. YEAR = {2022}}
A system and method for generating an optimized image with a scribble-based interactive image segmentation model using machine learning are provided. The method includes (i) segmenting, using a machine learning model, an image to classify it into classes, where each class is represented with a label, (ii) displaying the classified image, which specifies the classes on the classified image with outlines, (iii) enabling a user to scribble on the classified image to annotate the classes if an area is not classified, (iv) assigning a color mask for each scribbled area, (v) computing, using the machine learning model, a loss function for a location of pixels based on the color mask, (vi) modifying pre-defined weights for each scribbled area to match the annotated image and a determined class on the classified image, and (vii) generating the optimized image if the annotated image is matched with the determined class on the classified image.
Generalized Keyword Spotting using ASR embeddings
Kirandevraj R,Vinod K Kurmi,Vinay P Namboodiri,Jawahar C V
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022
@inproceedings{bib_Gene_2022, AUTHOR = {Kirandevraj R, Vinod K Kurmi, Vinay P Namboodiri, Jawahar C V}, TITLE = {Generalized Keyword Spotting using ASR embeddings}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2022}}
Keyword Spotting (KWS) detects a set of pre-defined spoken keywords. Building a KWS system for an arbitrary set requires massive training datasets. We propose to use the text transcripts from an Automatic Speech Recognition (ASR) system alongside triplets for KWS training. The intermediate representation from the ASR system trained on a speech corpus is used as acoustic word embeddings for keywords. Triplet loss is added to the Connectionist Temporal Classification (CTC) loss in the ASR while training. This method achieves an Average Precision (AP) of 0.843 over 344 words unseen by the model trained on the TIMIT dataset. In contrast, the Multi-View recurrent method that learns jointly on the text and acoustic embeddings achieves only 0.218 for out-of-vocabulary words. This method is also applied to low-resource languages such as Tamil by converting Tamil characters to English using transliteration. This is a very challenging novel task for which we provide a dataset of transcripts for the keywords. Despite our model not generalizing well, we achieve a benchmark AP of 0.321 over 38 words unseen by the model on the MSWC Tamil keyword set. The model also produces an accuracy of 96.2% for classification tasks on the Google Speech Commands dataset.
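The joint objective can be pictured as a weighted sum of the CTC loss on the ASR branch and a triplet loss on pooled intermediate representations used as acoustic word embeddings; the weighting and pooling in the sketch below are illustrative assumptions, not the paper's exact settings.

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
triplet = nn.TripletMarginLoss(margin=1.0)

def joint_loss(log_probs, targets, input_lengths, target_lengths,
               anchor_emb, positive_emb, negative_emb, alpha=0.5):
    # log_probs: (T, B, C) CTC log-probabilities from the ASR head.
    # *_emb: (B, D) pooled intermediate representations serving as acoustic word embeddings.
    l_ctc = ctc(log_probs, targets, input_lengths, target_lengths)
    l_triplet = triplet(anchor_emb, positive_emb, negative_emb)
    return l_ctc + alpha * l_triplet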
Compressing Video Calls using Synthetic Talking Heads
Madhav Agarwal,Anchit Gupta,Rudrabha Mukhopadhyay,Vinay P. NamboodirI,Jawahar C V
British Machine Vision Conference, BMVC, 2022
@inproceedings{bib_Comp_2022, AUTHOR = {Madhav Agarwal, Anchit Gupta, Rudrabha Mukhopadhyay, Vinay P. NamboodirI, Jawahar C V}, TITLE = {Compressing Video Calls using Synthetic Talking Heads}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2022}}
We leverage the modern advancements in talking head generation to propose an end-to-end system for talking head video compression. Our algorithm transmits pivot frames intermittently while the rest of the talking head video is generated by animating them. We use a state-of-the-art face reenactment network to detect key points in the non-pivot frames and transmit them to the receiver. A dense flow is then calculated to warp a pivot frame to reconstruct the non-pivot ones. Transmitting key points instead of full frames leads to significant compression. We propose a novel algorithm to adaptively select the best-suited pivot frames at regular intervals to provide a smooth experience. We also propose a frame interpolator at the receiver's end to further improve the compression levels. Finally, a face enhancement network improves reconstruction quality, significantly improving several aspects like the sharpness of the generations. We evaluate our method both qualitatively and quantitatively on benchmark datasets and compare it with multiple compression techniques. We release a demo video and additional information at https://cvit.iiit.ac.in/research/projects/cvit-projects/talking-video-compression.
Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors
Sindhu Balachandra Hegde,Rudrabha Mukhopadhyay,Vinay P Namboodiri,Jawahar C V
ACM international conference on Multimedia, ACMMM, 2022
@inproceedings{bib_Extr_2022, AUTHOR = {Sindhu Balachandra Hegde, Rudrabha Mukhopadhyay, Vinay P Namboodiri, Jawahar C V}, TITLE = {Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors}, BOOKTITLE = {ACM international conference on Multimedia}. YEAR = {2022}}
In this paper, we explore an interesting question of what can be obtained from an 8×8 pixel video sequence. Surprisingly, it turns out to be quite a lot. We show that when we process this 8×8 video with the right set of audio and image priors, we can obtain a full-length, 256×256 video. We achieve this 32× scaling of an extremely low-resolution input using our novel audio-visual upsampling network. The audio prior helps to recover the elemental facial details and precise lip shapes, and a single high-resolution target identity image prior provides us with rich appearance details. Our approach is an end-to-end multi-stage framework. The first stage produces a coarse intermediate output video that can then be used to animate the single target identity image and generate realistic, accurate and high-quality outputs. Our approach is simple and performs exceedingly well (an 8× improvement in FID score) compared to previous super-resolution methods. We also extend our model to talking-face video compression, and show that we obtain a 3.5× improvement in terms of bits/pixel over the previous state-of-the-art. The results from our network are thoroughly analyzed through extensive ablation experiments (in the paper and supplementary material). We also provide the demo video along with code and models on our website.
Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
Sindhu Balachandra Hegde,K R Prajwal,Rudrabha Mukhopadhyay,Vinay P Namboodiri,Jawahar C V
ACM international conference on Multimedia, ACMMM, 2022
@inproceedings{bib_Lip-_2022, AUTHOR = {Sindhu Balachandra Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, Jawahar C V}, TITLE = {Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild}, BOOKTITLE = {ACM international conference on Multimedia}. YEAR = {2022}}
In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works, our method (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges, with the key one being that many features of the desired target speech, like voice, pitch and linguistic content, cannot be entirely inferred from the silent face video. In order to handle these stochastic variations, we propose a new VAE-GAN
Document Image Analysis Using Deep Multi-modular Features
JOBIN K V,Ajoy Mondal,Jawahar C V
SN Computer Science, SNCS, 2022
@inproceedings{bib_Docu_2022, AUTHOR = {JOBIN K V, Ajoy Mondal, Jawahar C V}, TITLE = {Document Image Analysis Using Deep Multi-modular Features}, BOOKTITLE = {SN Computer Science}. YEAR = {2022}}
Texture or repeating patterns, discriminative patches, and shapes are the salient features for various document image analysis problems. This article proposes a deep network architecture that independently learns texture patterns, discriminative patches, and shapes to solve various document image analysis tasks. The considered tasks are document image classification, genre identification from book covers, scientific document figure classification, and script identification. The presented network learns global, texture, and discriminative features and combines them judiciously based on the nature of the problem to be solved. We compare the performance of the proposed approach with state-of-the-art techniques on multiple publicly available datasets such as Book-Cover, rvl-cdip, cvsi and docfigure. Experiments show that our approach outperforms the state-of-the-art for genre and document figure classification and obtains comparable results for document image and script classification tasks.
Grounded Video Situation Recognition
Zeeshan Khan,Jawahar C V,Makarand Tapaswi
Neural Information Processing Systems, NeurIPS, 2022
@inproceedings{bib_Grou_2022, AUTHOR = {Zeeshan Khan, Jawahar C V, Makarand Tapaswi}, TITLE = {Grounded Video Situation Recognition}, BOOKTITLE = {Neural Information Processing Systems}. YEAR = {2022}}
Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) was framed as a task for structured prediction of multiple events, their relationships, and actions, with various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, and also faces challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The second stage sees verb-role queries attend and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.
New Objects on the Road? No Problem, We'll Learn Them Too
Deepak Kumar Singh,Shyam Nandan Rai,K J Joseph,Rohit Saluja,Vineeth N Balasubramanian,Chetan Arora,Anbumani Subramanian,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2022
@inproceedings{bib_New__2022, AUTHOR = {Deepak Kumar Singh, Shyam Nandan Rai, K J Joseph, Rohit Saluja, Vineeth N Balasubramanian, Chetan Arora, Anbumani Subramanian, Jawahar C V}, TITLE = {New Objects on the Road? No Problem, We'll Learn Them Too}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2022}}
Object detection plays an essential role in providing localization, path planning, and decision-making capabilities in autonomous navigation systems. However, existing object detection models are trained and tested on a fixed number of known classes. This setting makes it difficult for the object detection model to generalize well in real-world road scenarios when encountering an unknown object. We address this problem by introducing a framework that handles unknown object detection and updates the model when unknown object labels become available. Our solution includes three major components that address inherent problems present in road scene datasets: a) Feature-Mix, which improves unknown object detection by widening the gap between known and unknown classes in the latent feature space, b) a focal regression loss that improves small object detection and handles intra-class scale variation, and c) curriculum learning, which further enhances the detection of small objects. We use the Indian Driving Dataset (IDD) and the Berkeley Deep Drive (BDD) dataset for evaluation. Our solution provides state-of-the-art performance on open-world evaluation metrics. We hope this work will create new directions for open-world object detection in road scenes, making autonomous systems more reliable and robust.
INR-V: A Continuous Representation Space for Video-based Generative Tasks
Bipasha Sen,Aditya Agarwal,Vinay P Namboodiri,Jawahar C V
Transactions in Machine Learning Research, TMLR, 2022
@inproceedings{bib_INR-_2022, AUTHOR = {Bipasha Sen, Aditya Agarwal, Vinay P Namboodiri, Jawahar C V}, TITLE = {INR-V: A Continuous Representation Space for Video-based Generative Tasks}, BOOKTITLE = {Transactions in Machine Learning Research}. YEAR = {2022}}
Generating videos is a complex task that is accomplished by generating a set of temporally coherent images frame-by-frame. This limits the expressivity of videos to only image-based operations on the individual video frames, requiring network designs to obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs), a multi-layered perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted using a meta-network, which is a hypernetwork trained on neural representations of multiple video instances. Later, the meta-network can be sampled to generate diverse novel videos enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space, showcasing many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also in-paint missing portions in videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against the existing baselines. INR-V significantly outperforms the …
My View is the Best View: Procedure Learning from Egocentric Videos
Siddhant Bansal,Chetan Arora,Jawahar C V
European Conference on Computer Vision, ECCV, 2022
@inproceedings{bib_My_V_2022, AUTHOR = {Siddhant Bansal, Chetan Arora, Jawahar C V}, TITLE = {My View is the Best View: Procedure Learning from Egocentric Videos}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2022}}
Procedure learning involves identifying the key-steps and determining their logical order to perform a task. Existing approaches commonly use third-person videos for learning the procedure, making the manipulated object small in appearance and often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. However, procedure learning from egocentric videos is challenging because (a) the camera view undergoes extreme changes due to the wearer’s head motion, and (b) unrelated frames are present due to the unconstrained nature of the videos. Because of this, the assumptions made by current state-of-the-art methods, namely that the actions occur at approximately the same time and are of the same duration, do not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively. Furthermore, for procedure learning using egocentric videos, we propose the EgoProceL dataset consisting of 62 hours of videos captured by 130 subjects performing 16 tasks. The source code and the dataset are available on the project page https://sid2697.github.io/egoprocel/
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments
Shubham Dokania,Anbumani Subramanian ,Manmohan Chandraker,Jawahar C V
European Conference on Computer Vision, ECCV, 2022
@inproceedings{bib_TRoV_2022, AUTHOR = {Shubham Dokania, Anbumani Subramanian , Manmohan Chandraker, Jawahar C V}, TITLE = {TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2022}}
High-quality structured data with rich annotations are critical components in intelligent vehicle systems dealing with road scenes. However, data curation and annotation require intensive investments and yield low-diversity scenarios. The recently growing interest in synthetic data raises questions about the scope of improvement in such systems and the amount of manual work still required to produce high volumes and variations of simulated data. This work proposes a synthetic data generation pipeline that utilizes existing datasets, like nuScenes, to address the difficulties and domain gaps present in simulated datasets. We show that, using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation, mimicking real scene properties with high fidelity, along with mechanisms to diversify samples in a physically meaningful way. We demonstrate improvements in mIoU metrics by presenting qualitative and quantitative experiments with real and synthetic data for semantic segmentation on the Cityscapes and KITTI-STEP datasets. All relevant code and data are released on GitHub.
Enhancing Indic Handwritten Text Recognition Using Global Semantic Information
Ajoy Mondal,Jawahar C V
International Conference on Frontiers in Handwriting Recognition, ICFHR, 2022
@inproceedings{bib_Enha_2022, AUTHOR = {Ajoy Mondal, Jawahar C V}, TITLE = {Enhancing Indic Handwritten Text Recognition Using Global Semantic Information}, BOOKTITLE = {International Conference on Frontiers in Handwriting Recognition}. YEAR = {2022}}
Handwritten Text Recognition (HTR) is more interesting and challenging than printed text recognition due to uneven variations in the handwriting style of writers, content, and time. HTR becomes more challenging for the Indic languages because (i) multiple characters are combined to form conjuncts, which increases the number of characters in the respective languages, and (ii) each Indic script has nearly 100 unique basic Unicode characters. Recently, many recognition methods based on the encoder-decoder framework have been proposed to handle such problems. They still face many challenges, such as image blur and incomplete characters due to varying writing styles and ink density. We argue that most encoder-decoder methods are based on local visual features without explicit global semantic information. In this work, we enhance the performance of Indic handwritten text recognizers using global semantic information. We use a semantic module in an encoder-decoder framework for extracting global semantic information to recognize Indic handwritten texts. The semantic information is used both in the encoder for supervision and in the decoder for initialization. The semantic information is predicted from the word embedding of a pre-trained language model. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art results on handwritten texts of ten Indic languages.
Information Retrieval from the Digitized Books
Riya Gupta,Jawahar C V
Technical Report, arXiv, 2022
@inproceedings{bib_Info_2022, AUTHOR = {Riya Gupta, Jawahar C V}, TITLE = {Information Retrieval from the Digitized Books}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Extracting the relevant information out of a large number of documents is a challenging and tedious task. The quality of results generated by traditionally available full-text search engines and text-based image retrieval systems is not optimal. Information retrieval (IR) tasks become more challenging with non-traditional language scripts, as in the case of Indic scripts. The authors have developed an OCR (Optical Character Recognition) Search Engine to build an Information Retrieval & Extraction (IRE) system that replicates the current state-of-the-art methods using IRE and Natural Language Processing (NLP) techniques. Here we present a study of the methods used for performing search and retrieval tasks. The details of this system, along with the statistics of the dataset (source: National Digital Library of India, NDLI), are also presented. Additionally, ideas to further explore and add value to research in IRE are discussed.
Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation
Avijit Dasgupta,Jawahar C V,Karteek Alahari
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2022
@inproceedings{bib_Over_2022, AUTHOR = {Avijit Dasgupta, Jawahar C V, Karteek Alahari}, TITLE = {Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2022}}
Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts between source and target domains remain source-dependent, as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach (without bells and whistles) to address this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. We treat the problem of source-free video domain adaptation as learning from noisy labels and argue that the samples with correct pseudo-labels can help in the adaptation stage. To this end, we leverage the cross-entropy loss as an indicator of the correctness of pseudo-labels, and use the resulting small-loss samples from the target domain for fine-tuning the model. Extensive experimental evaluations show that our method, termed CleanAdapt, achieves ∼7% gain over the source-only model and outperforms the state-of-the-art approaches on various open datasets.
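The small-loss selection step can be sketched as follows: pseudo-label the target clips with the source model, keep the fraction with the lowest cross-entropy against their own pseudo-labels, and fine-tune on that subset. The keep ratio below is a placeholder, not the paper's value.

import torch
import torch.nn.functional as F

@torch.no_grad()
def select_clean_samples(model, target_clips, keep_ratio=0.5):
    # target_clips: (N, ...) batch of target-domain clips; the model outputs class logits.
    logits = model(target_clips)                                # (N, num_classes)
    pseudo_labels = logits.argmax(dim=1)
    losses = F.cross_entropy(logits, pseudo_labels, reduction="none")
    k = max(1, int(keep_ratio * len(losses)))
    clean_idx = losses.topk(k, largest=False).indices           # smallest-loss (likely correct) samples
    return target_clips[clean_idx], pseudo_labels[clean_idx]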
Towards Robust Handwritten Text Recognition with On-the-fly User Participation
Ajoy Mondal,Rohit Saluja,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2022
@inproceedings{bib_Towa_2022, AUTHOR = {Ajoy Mondal, Rohit Saluja, Jawahar C V}, TITLE = {Towards Robust Handwritten Text Recognition with On-the-fly User Participation}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2022}}
Long-term OCR services aim to provide high-quality output to their users at competitive costs. It is essential to upgrade the models because of the complex data uploaded by the users. The service providers encourage users who provide data on which the OCR model fails by rewarding them based on data complexity, readability, and available budget. Hitherto, OCR works have prepared models on standard datasets without considering the end users. We propose a strategy of consistently upgrading an existing handwritten Hindi OCR model three times on the dataset of 15 users. We fix the budget at 4 users for each iteration. For the first iteration, the model trains directly on the dataset from the first four users. For each subsequent iteration, all remaining users write a page each, which the service provider analyzes to select the 4 (new) best users based on the quality of predictions on the human-readable words. The selected users write 23 more pages for upgrading the model. We upgrade the model with Curriculum Learning (CL) on the data available in the current iteration and a subset from previous iterations. The upgraded model is tested on a held-out set of one page each from all 23 users. We provide insights into our investigations on the effect of CL, user selection, and especially data from unseen writing styles. Our work can be used for long-term OCR services in crowd-sourcing scenarios by service providers and end users.
A Fine-Grained Vehicle Detection (FGVD) Dataset for Unconstrained Roads
Prafful Kumar Khoba,Chirag Parikh,Rohit Saluja,Santosh Ravi Kiran,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2022
@inproceedings{bib_A_Fi_2022, AUTHOR = {Prafful Kumar Khoba, Chirag Parikh, Rohit Saluja, Santosh Ravi Kiran, Jawahar C V}, TITLE = {A Fine-Grained Vehicle Detection (FGVD) Dataset for Unconstrained Roads}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2022}}
The previous fine-grained datasets mainly focus on classification and are often captured in a controlled setup, with the camera focusing on the objects. We introduce the first Fine-Grained Vehicle Detection (FGVD) dataset in the wild, captured from a moving camera mounted on a car. It contains 5502 scene images with 210 unique fine-grained labels of multiple vehicle types organized in a three-level hierarchy. While previous classification datasets also include makes for different kinds of cars, the FGVD dataset introduces new class labels for categorizing two-wheelers, autorickshaws, and trucks. The FGVD dataset is challenging as it has vehicles in complex traffic scenarios with intra-class and inter-class variations in types, scale, pose, occlusion, and lighting conditions. The current object detectors like yolov5 and faster RCNN perform poorly on our dataset due to a lack of hierarchical modeling. Along with providing baseline results for existing object detectors on FGVD Dataset, we also present the results of a combination of an existing detector and the recent Hierarchical Residual Network (HRN) classifier for the FGVD task. Finally, we show that FGVD vehicle images are the most challenging to classify among the fine-grained datasets.
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Siddhant Bansal,Durga Nagendra Raghava Kumar M,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2022
@inproceedings{bib_Ego4_2022, AUTHOR = {Siddhant Bansal, Durga Nagendra Raghava Kumar M, Jawahar C V}, TITLE = {Ego4D: Around the World in 3,000 Hours of Egocentric Video}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2022}}
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/
Read while you drive - multilingual text tracking on the road
Sergi Garcia-Bordils,George Tom,Sangeeth Reddy Battu,MINESH MATHEW,Marcal Rusinol,Jawahar C V,Dimosthenis Karatzas
International Workshop on Document Analysis Systems, DAS, 2022
@inproceedings{bib_Read_2022, AUTHOR = {Sergi Garcia-Bordils, George Tom, Sangeeth Reddy Battu, MINESH MATHEW, Marcal Rusinol, Jawahar C V, Dimosthenis Karatzas}, TITLE = {Read while you drive - multilingual text tracking on the road}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2022}}
Visual data obtained during driving scenarios usually contain large amounts of text that conveys semantic information necessary to analyse the urban environment and is integral to the traffic control plan. Yet, research on autonomous driving or driver assistance systems typically ignores this information. To advance research in this direction, we present RoadText-3K, a large driving video dataset with fully annotated text. RoadText-3K is three times bigger than its predecessor and contains data from varied geographical locations, unconstrained driving conditions and multiple languages and scripts. We offer a comprehensive analysis of tracking by detection and detection by tracking methods exploring the limits of state-of-the-art text detection. Finally, we propose a new end-to-end trainable tracking model that yields state-of-the-art results on this challenging dataset. Our experiments demonstrate the complexity and variability of RoadText-3K and establish a new, realistic benchmark for scene text tracking in the wild.
Exploring Histological Similarities Across Cancers From a Deep Learning Perspective
Ashish Menon,Piyush Singh,Vinod Palakkad Krishnanunni,Jawahar C V
Frontiers in Oncology, FIO, 2022
@inproceedings{bib_Expl_2022, AUTHOR = {Ashish Menon, Piyush Singh, Vinod Palakkad Krishnanunni, Jawahar C V}, TITLE = {Exploring Histological Similarities Across Cancers From a Deep Learning Perspective}, BOOKTITLE = {Frontiers in Oncology}. YEAR = {2022}}
Histopathology image analysis is widely accepted as a gold standard for cancer diagnosis. The Cancer Genome Atlas (TCGA) contains large repositories of histopathology whole slide images spanning several organs and subtypes. However, not much work has gone into analyzing all the organs and subtypes and their similarities. Our work attempts to bridge this gap by training deep learning models to classify cancer vs. normal patches for 11 subtypes spanning seven organs (9,792 tissue slides) to achieve high classification performance. We used these models to investigate their performances in the test set of other organs (cross-organ inference). We found that every model had a good cross-organ inference accuracy when tested on breast, colorectal, and liver cancers. Further, high accuracy is observed between models trained on the cancer subtypes originating from the same organ (kidney and lung). We also validated these performances by showing the separability of cancer and normal samples in a high-dimensional feature space. We further hypothesized that the high cross-organ inferences are due to shared tumor morphologies among organs. We validated the hypothesis by showing the overlap in the Gradient-weighted Class Activation Mapping (GradCAM) visualizations and similarities in the distributions of nuclei features present within the high-attention regions.
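Grad-CAM, referred to in this abstract, weighs the activation maps of a convolutional layer by the spatially averaged gradients of the class score. A minimal sketch of the general technique follows (PyTorch assumed; the model, image, and layer arguments are placeholders, not the paper's pipeline).

```python
# Minimal Grad-CAM sketch: capture activations and gradients of a chosen
# convolutional layer via hooks, then form a ReLU-ed weighted sum of the maps.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, conv_layer):
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["value"] = output                     # feature maps of the layer

    def bwd_hook(module, grad_input, grad_output):
        gradients["value"] = grad_output[0]               # gradients w.r.t. those maps

    h1 = conv_layer.register_forward_hook(fwd_hook)
    h2 = conv_layer.register_full_backward_hook(bwd_hook)
    model.zero_grad()
    score = model(image.unsqueeze(0))[0, target_class]    # class score for one image
    score.backward()
    h1.remove(); h2.remove()

    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled grads
    cam = F.relu((weights * activations["value"]).sum(dim=1))    # weighted sum of maps
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:], mode="bilinear")
    return cam.squeeze()                                  # heatmap at input resolution
```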
Improving Scene Text Recognition for Indian Languages with Transfer Learning and Font Diversity
G SANJANA,Rohit Saluja,Jawahar C V
Journal of Imaging, JIma, 2022
@inproceedings{bib_Impr_2022, AUTHOR = {G SANJANA, Rohit Saluja, Jawahar C V}, TITLE = {Improving Scene Text Recognition for Indian Languages with Transfer Learning and Font Diversity}, BOOKTITLE = {Journal of Imaging}. YEAR = {2022}}
Reading Indian scene texts is complex due to the use of regional vocabulary, multiple fonts/scripts, and text size. This work investigates the significant differences between Indian and Latin Scene Text Recognition (STR) systems. Recent STR works rely on synthetic generators that involve diverse fonts to ensure robust reading solutions. We propose utilizing additional non-Unicode fonts along with the generally employed Unicode fonts to cover font diversity in such synthesizers for Indian languages. We also perform experiments on transfer learning among six different Indian languages. Our transfer learning experiments on synthetic images with common backgrounds provide an exciting insight that Indian scripts can benefit more from each other than from the extensive English datasets. Our evaluations in real settings help us achieve significant improvements over previous methods on four Indian languages from standard datasets like IIIT-ILST, MLT-17, and the new dataset (which we release) containing 440 scene images with 500 Gujarati and 2535 Tamil words. Further enriching the synthetic dataset with non-Unicode fonts and multiple augmentations helps us achieve a remarkable Word Recognition Rate gain of over 33% on the IIIT-ILST Hindi dataset. We also present the results of lexicon-based transcription approaches for all six languages.
Detecting, Tracking and Counting Motorcycle Rider Traffic Violations on Unconstrained Roads
Aman Goyal,Dev Agarwa,Anbumani Subramanian,Jawahar C V,Santosh Ravi Kiran,Rohit Saluja
Technical Report, arXiv, 2022
@inproceedings{bib_Dete_2022, AUTHOR = {Aman Goyal, Dev Agarwa, Anbumani Subramanian, Jawahar C V, Santosh Ravi Kiran, Rohit Saluja}, TITLE = {Detecting, Tracking and Counting Motorcycle Rider Traffic Violations on Unconstrained Roads}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
In many Asian countries with unconstrained road traffic conditions, driving violations such as not wearing helmets and triple-riding are a significant source of fatalities involving motorcycles. Identifying and penalizing such riders is vital in curbing road accidents and improving citizens’ safety. With this motivation, we propose an approach for detecting, tracking, and counting motorcycle riding violations in videos taken from a vehicle-mounted dashboard camera. We employ a curriculum learning-based object detector to better tackle challenging scenarios such as occlusions. We introduce a novel trapezium-shaped object boundary representation to increase robustness and tackle the rider-motorcycle association. We also introduce an amodal regressor that generates bounding boxes for the occluded riders. Experimental results on a large-scale unconstrained driving dataset demonstrate the superiority of our approach compared to existing approaches and other ablative variants.
FLUID: Few-Shot Self-Supervised Image Deraining
Shyam Nandan Rai,Rohit Saluja,Chetan Arora,Vineeth N Balasubramanian,Anbumani Subramanian,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2022
@inproceedings{bib_FLUI_2022, AUTHOR = {Shyam Nandan Rai, Rohit Saluja, Chetan Arora, Vineeth N Balasubramanian, Anbumani Subramanian, Jawahar C V}, TITLE = {FLUID: Few-Shot Self-Supervised Image Deraining}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2022}}
Self-supervised methods have shown promising results in denoising and dehazing tasks, where the collection of paired datasets is challenging and expensive. However, we find that these methods fail to remove the rain streaks when applied to image deraining tasks. The methods' poor performance is due to their explicit assumptions: (i) the distribution of noise or haze is uniform and (ii) the value of a noisy or hazy pixel is independent of its neighbors. Rainy pixels are non-uniformly distributed, and their values are not necessarily independent of neighboring pixels. Hence, we conclude that a self-supervised method needs some prior knowledge about the rain distribution to perform the deraining task. To provide this knowledge, we hypothesize that a network trained with minimal supervision can estimate the likelihood of rainy pixels. This leads us to our proposed method, FLUID: Few-Shot Self-Supervised Image Deraining. We perform extensive experiments and comparisons with existing image deraining and few-shot image-to-image translation methods on the Rain 100L and DDN-SIRR datasets containing real and synthetic rainy images. In addition, we use the Rainy Cityscapes dataset to show that our method trained in a few-shot setting can improve semantic segmentation and object detection in rainy conditions. Our approach obtains an mIoU gain of 51.20 over the current best-performing deraining method.
Removing Atmospheric Turbulence via Deep Adversarial Learning
Shyam Nandan Rai,Jawahar C V
IEEE Transactions on Image Processing, TIP, 2022
@inproceedings{bib_Remo_2022, AUTHOR = {Shyam Nandan Rai, Jawahar C V}, TITLE = {Removing Atmospheric Turbulence via Deep Adversarial Learning}, BOOKTITLE = {IEEE Transactions on Image Processing}. YEAR = {2022}}
Restoring images degraded due to atmospheric turbulence is challenging as it consists of several distortions. Several deep learning methods have been proposed to minimize atmospheric distortions that consist of a single-stage deep network. However, we find that a single-stage deep network is insufficient to remove the mixture of distortions caused by atmospheric turbulence. We propose a two-stage deep adversarial network that minimizes atmospheric turbulence to mitigate this. The first stage reduces the geometrical distortion and the second stage minimizes the image blur. We improve our network by adding channel attention and a proposed sub-pixel mechanism, which utilizes the information between the channels and further reduces the atmospheric turbulence at the finer level. Unlike previous methods, our approach neither uses any prior knowledge about atmospheric turbulence conditions at inference time nor requires the fusion of multiple images to get a single restored image. Our final restoration models DT-GAN+ and DTD-GAN+ outperform the general state-of-the-art image-to-image translation models and baseline restoration models. We synthesize turbulent image datasets to train the restoration models. Additionally, we also curate a natural turbulent dataset from YouTube to show the generalisability of the proposed model. We perform extensive experiments on restored images by utilizing them for downstream tasks such as classification, pose estimation, semantic keypoint estimation, and depth estimation. We observe that our restored images outperform turbulent images in downstream tasks by a significant margin demonstrating the restoration model’s applicability in real-world problems.
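The channel attention and sub-pixel ideas mentioned in this abstract can be illustrated with standard building blocks; the sketch below (PyTorch assumed) shows a generic squeeze-and-excitation style channel attention module and a PixelShuffle-based sub-pixel upsampler, not the paper's exact DT-GAN+/DTD-GAN+ modules.

```python
# Generic channel attention (squeeze-and-excitation style) and sub-pixel upsampling.
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                 # squeeze: one value per channel
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)                                        # excite: re-weight channels

class SubPixelUp(nn.Module):
    def __init__(self, channels, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)                        # channels rearranged into space

    def forward(self, x):
        return self.shuffle(self.conv(x))
```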
InfographicVQA
MINESH MATHEW,Viraj Bagal,Ruben Tito,Dimosthenis Karatzas,Ernest Valveny,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2022
@inproceedings{bib_Info_2022, AUTHOR = {MINESH MATHEW, Viraj Bagal, Ruben Tito, Dimosthenis Karatzas, Ernest Valveny, Jawahar C V}, TITLE = {InfographicVQA}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2022}}
Infographics communicate information using a combination of textual, graphical and visual elements. This work explores the automatic understanding of infographic images by using a Visual Question Answering technique. To this end, we present InfographicVQA, a new dataset comprising a diverse collection of infographics and question-answer annotations. The questions require methods that jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with an emphasis on questions that require elementary reasoning and basic arithmetic skills. For VQA on the dataset, we evaluate two strong Transformer-based baselines. Both baselines yield unsatisfactory results compared to near-perfect human performance on the dataset. The results suggest that VQA on infographics, images that are designed to communicate information quickly and clearly to the human brain, is ideal for benchmarking machine understanding of complex document images. The dataset is available for download at docvqa.org.
Multi-Domain Incremental Learning for Semantic Segmentation
Prachi Garg,Rohit Saluja,Vineeth N Balasubramanian,Chetan Arora,Anbumani Subramanian,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2022
@inproceedings{bib_Mult_2022, AUTHOR = {Prachi Garg, Rohit Saluja, Vineeth N Balasubramanian, Chetan Arora, Anbumani Subramanian, Jawahar C V}, TITLE = {Multi-Domain Incremental Learning for Semantic Segmentation}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2022}}
Recent efforts in multi-domain learning for semantic segmentation attempt to learn multiple geographical datasets in a universal, joint model. A simple fine-tuning experiment performed sequentially on three popular road scene segmentation datasets demonstrates that existing segmentation frameworks fail at incrementally learning on a series of visually disparate geographical domains. When learning a new domain, the model catastrophically forgets previously learned knowledge. In this work, we pose the problem of multi-domain incremental learning for semantic segmentation. Given a model trained on a particular geographical domain, the goal is to (i) incrementally learn a new geographical domain, (ii) while retaining performance on the old domain, (iii) given that the previous domain’s dataset is not accessible. We propose a dynamic architecture that assigns universally shared, domain-invariant parameters to capture homogeneous semantic features present in all domains, while dedicated domain-specific parameters learn the statistics of each domain. Our novel optimization strategy helps achieve a good balance between retention of old knowledge (stability) and acquiring new knowledge (plasticity). We demonstrate the effectiveness of our proposed solution on domain incremental settings pertaining to real-world driving scenes from roads of Germany (Cityscapes), the United States (BDD100k), and India (IDD).
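One simple way to realize the "shared, domain-invariant parameters plus dedicated domain-specific parameters" idea from this abstract is to share convolutions across domains while giving each domain its own normalization layer; the sketch below (PyTorch assumed) is only an illustration of that pattern, not the paper's architecture.

```python
# Shared convolution with per-domain BatchNorm: a toy instance of mixing
# domain-invariant and domain-specific parameters.
import torch.nn as nn

class DomainAwareBlock(nn.Module):
    def __init__(self, in_ch, out_ch, num_domains):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)           # shared across all domains
        self.bns = nn.ModuleList(nn.BatchNorm2d(out_ch) for _ in range(num_domains))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, domain_id):
        return self.act(self.bns[domain_id](self.conv(x)))           # domain-specific statistics
```

When a new domain arrives, a new normalization layer (and any other domain-specific parameters) is added and trained while the shared weights are kept frozen or regularized, so earlier domains are not forgotten.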
Visual Understanding of Complex Table Structures from Document Images
Sachin Raja,Ajoy Mondal,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2022
@inproceedings{bib_Visu_2022, AUTHOR = {Sachin Raja, Ajoy Mondal, Jawahar C V}, TITLE = {Visual Understanding of Complex Table Structures from Document Images}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2022}}
Table structure recognition is necessary for a comprehensive understanding of documents. Tables in unstructured business documents are tough to parse due to the high diversity of layouts, varying alignments of contents, and the presence of empty cells. The problem is particularly difficult because of challenges in identifying individual cells using visual or linguistic contexts or both. Accurate detection of table cells (including empty cells) simplifies structure extraction and hence, it becomes the prime focus of our work. We propose a novel object-detection-based deep model that captures the inherent alignments of cells within tables and is fine-tuned for fast optimization. Despite accurate detection of cells, recognizing structures for dense tables may still be challenging because of difficulties in capturing long-range row/column dependencies in presence of multi-row/column spanning cells. Therefore, we also aim to improve structure recognition by deducing a novel rectilinear graph-based formulation. From a semantics perspective, we highlight the significance of empty cells in a table. To take these cells into account, we suggest an enhancement to a popular evaluation criterion. Finally, we introduce a modestly sized evaluation dataset with an annotation style inspired by human cognition to encourage new approaches to the problem. Our framework improves the previous state-of-the-art performance by a 2.7% average F1-score on benchmark datasets.
To miss-attend is to misalign! Residual Self-Attentive Feature Alignment for Adapting Object Detectors
Vaishnavi Khindkar,Chetan Arora,Vineeth N Balasubramanian,Anbumani Subramanian,Rohit Saluja,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2022
@inproceedings{bib_To_m_2022, AUTHOR = {Vaishnavi Khindkar, Chetan Arora, Vineeth N Balasubramanian, Anbumani Subramanian, Rohit Saluja, Jawahar C V}, TITLE = {To miss-attend is to misalign! Residual Self-Attentive Feature Alignment for Adapting Object Detectors}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2022}}
Advancements in adaptive object detection can lead to tremendous improvements in applications like autonomous navigation, as they alleviate the distributional shifts along the detection pipeline. Prior works adopt adversarial learning to align image features at global and local levels, yet the instance-specific misalignment persists. Also, adaptive object detection remains challenging due to visual diversity in background scenes and intricate combinations of objects. Motivated by structural importance, we aim to attend prominent instance-specific regions, overcoming the feature misalignment issue. We propose a novel resIduaL seLf-attentive featUre alignMEnt (ILLUME) method for adaptive object detection. ILLUME comprises a Self-Attention Feature Map (SAFM) module that enhances structural attention to object-related regions and thereby generates domain invariant features. Our approach significantly reduces the domain distance with the improved feature alignment of the instances. Qualitative results demonstrate the ability of ILLUME to attend important object instances required for alignment. Experimental results on several benchmark datasets show that our method outperforms the existing state-of-the-art approaches.
Transductive Weakly-Supervised Player Detection Using Soccer Broadcast Videos
Chris Andrew Gadde,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2022
@inproceedings{bib_Tran_2022, AUTHOR = {Chris Andrew Gadde, Jawahar C V}, TITLE = {Transductive Weakly-Supervised Player Detection Using Soccer Broadcast Videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2022}}
Player detection lays the foundation for many applications in the field of sports analytics including player recognition, player tracking, and activity detection. In this work, we study player detection in continuous long shot broadcast videos. Broadcast match videos are easy to obtain, and detection on these videos is much more challenging. We propose a transductive approach for player detection that treats it as a domain adaptation problem. We show that instance-level domain labels are significant for sufficient adaptation in the case of soccer broadcast videos. An efficient multi-model greedy labelling scheme based on visual features is proposed to annotate domain labels on bounding box predictions made by our inductive model. We use reliable instances from the inductive model inferences to train a transductive copy of the model. We create and release a fully annotated player detection dataset comprising soccer broadcast videos from the FIFA 2018 World Cup matches to evaluate our method. Our method shows significant improvements in player detection over the baseline and existing state-of-the-art methods on our dataset. We show, on average, a 16 point improvement in mAP for soccer broadcast videos by annotating domain labels for around 100 samples per video.
Classification of histopathology images using ConvNets to detect Lupus Nephritis
R Anirudh Reddy,Jawahar C V,Vinod Palakkad Krishnanunni
Neural Information Processing Systems Workshops, NeurIPS-W, 2021
@inproceedings{bib_Clas_2021, AUTHOR = {R Anirudh Reddy, Jawahar C V, Vinod Palakkad Krishnanunni}, TITLE = {Classification of histopathology images using ConvNets to detect Lupus Nephritis}, BOOKTITLE = {Neural Information Processing Systems Workshops}. YEAR = {2021}}
Systemic lupus erythematosus (SLE) is an autoimmune disease in which the immune system of the patient starts attacking healthy tissues of the body. Lupus Nephritis (LN) refers to the inflammation of kidney tissues resulting in renal failure due to these attacks. The International Society of Nephrology/Renal Pathology Society (ISN/RPS) has released a classification system based on various patterns observed during renal injury in SLE. Traditional methods require meticulous pathological assessment of the renal biopsy and are time-consuming. Recently, computational techniques have helped to alleviate this issue by using virtual microscopy or Whole Slide Imaging (WSI). With the use of deep learning and modern computer vision techniques, we propose a pipeline that is able to automate the process of 1) detection of various glomeruli patterns present in these whole slide images and 2) classification of each image using the extracted glomeruli features.
Computer Vision for Capturing Flora
Muthireddy Vamsidhar,Jawahar C V
Digital Techniques for Heritage Presentation and Preservation, DTHPP, 2021
@inproceedings{bib_Comp_2021, AUTHOR = {Muthireddy Vamsidhar, Jawahar C V}, TITLE = {Computer Vision for Capturing Flora}, BOOKTITLE = {Digital Techniques for Heritage Presentation and Preservation}. YEAR = {2021}}
The identification of plant species by looking at their leaves, flowers, and seeds is a crucial component in the conservation of endangered plants. Traditional identification methods are manual and time-consuming and require domain knowledge to operate. Owing to an increased interest in automated plant identification systems, we propose one that utilizes modern convolutional neural network architectures. This approach helps in the recognition of leaf images and can be integrated into mobile platforms like smartphones. Such a system can also be employed in aiding plant-related education, promoting ecotourism, and creating a digital heritage for plant species, among many others. Our proposed solution achieves state-of-the-art performance for plant classification in the wild. An exhaustive set of experiments is performed to classify 112 species of plants from the challenging Indic-Leaf dataset. The best-performing model gives a Top-1 precision of 90.08 and a Top-5 precision of 96.90. We discuss and elaborate on our crowdsourcing web application that is used to collect and regulate data. We explain how the automated plant identification system can be integrated into a smartphone by detailing the flow of our mobile application.
Canonical Saliency Maps: Decoding Deep Face Models
THRUPTHI ANN JOHN,Vineeth N. Balasubramanian,Jawahar C V
IEEE Transactions on Biometrics, Behavior, and Identity Science, TBOIM, 2021
@inproceedings{bib_Cano_2021, AUTHOR = {THRUPTHI ANN JOHN, Vineeth N. Balasubramanian, Jawahar C V}, TITLE = {Canonical Saliency Maps: Decoding Deep Face Models}, BOOKTITLE = {IEEE Transactions on Biometrics, Behavior, and Identity Science}. YEAR = {2021}}
As Deep Neural Network models for face processing tasks approach human-like performance, their deployment in critical applications such as law enforcement and access control has seen an upswing, where any failure may have far-reaching consequences. We need methods to build trust in deployed systems by making their working as transparent as possible. Existing visualization algorithms are designed for object recognition and do not give insightful results when applied to the face domain. In this work, we present ‘Canonical Saliency Maps’, a new method which highlights relevant facial areas by projecting saliency maps onto a canonical face model. We present two kinds of Canonical Saliency Maps: image-level maps and model-level maps. Image-level maps highlight facial features responsible for the decision made by a deep face model on a given image, thus helping to understand how a DNN made a prediction on the image. Model-level maps provide an understanding of what the entire DNN model focuses on in each task, and thus can be used to detect biases in the model. Our qualitative and quantitative results show the usefulness of the proposed canonical saliency maps, which can be used on any deep face model regardless of the architecture.
Towards Label-Free Few-Shot Learning: How Far Can We Go?
Aditya Bharti,Vineeth N. B,Jawahar C V
International Conference on Computer vision and Image Processing, CVIP, 2021
@inproceedings{bib_Towa_2021, AUTHOR = {Aditya Bharti, Vineeth N. B, Jawahar C V}, TITLE = {Towards Label-Free Few-Shot Learning: How Far Can We Go?}, BOOKTITLE = {International Conference on Computer vision and Image Processing}. YEAR = {2021}}
Few-shot learners aim to recognize new categories given only a small number of training samples. The core challenge is to avoid overfitting to the limited data while ensuring good generalization to novel classes. Existing literature makes use of vast amounts of annotated data by simply shifting the label requirement from novel classes to base classes. Since data annotation is time-consuming and costly, reducing the label requirement even further is an important goal. To that end, our paper presents a more challenging few-shot setting with almost no class label access. By leveraging self-supervision to learn image representations and similarity for classification at test time, we achieve competitive baselines while using almost zero (0-5) class labels. Compared to existing state-of-the-art approaches which use 60,000 labels, this is a four orders of magnitude (10,000 times) difference. This work is a step towards developing few-shot learning methods that do not depend on annotated data. Our code is publicly released at https://github.com/adbugger/FewShot.
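The "similarity for classification at test time" step in this abstract can be pictured as nearest-centroid matching in the embedding space of a self-supervised encoder; a minimal sketch follows (NumPy), with the encoder itself outside the snippet.

```python
# Cosine nearest-centroid classification over precomputed embeddings.
import numpy as np

def cosine_nearest_centroid(support_emb, support_labels, query_emb):
    """support_emb: (N, D) embeddings; support_labels: (N,); query_emb: (M, D)."""
    classes = np.unique(support_labels)
    centroids = np.stack([support_emb[support_labels == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    queries = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = queries @ centroids.T                        # cosine similarity to each class centroid
    return classes[sims.argmax(axis=1)]                 # predicted class per query
```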
Handwritten Text Retrieval from Unlabeled Collections
Santhoshini Reddy Gongidi,Jawahar C V
International Conference on Computer vision and Image Processing, CVIP, 2021
@inproceedings{bib_Hand_2021, AUTHOR = {Santhoshini Reddy Gongidi, Jawahar C V}, TITLE = {Handwritten Text Retrieval from Unlabeled Collections}, BOOKTITLE = {International Conference on Computer vision and Image Processing}. YEAR = {2021}}
Handwritten documents from communities like cultural heritage, judiciary, and modern journals remain largely unexplored even today. To a great extent, this is due to the lack of retrieval tools for such unlabeled document collections. This work considers such collections and presents a simple, robust retrieval framework for easy information access. We achieve retrieval on unlabeled novel collections through invariant features learned for handwritten text. These feature representations enable zero-shot retrieval for novel queries on unlabeled collections. We improve the framework further by supporting search via text and exemplar queries. Four new collections written in English, Malayalam, and Bengali are used to evaluate our text retrieval framework. These collections comprise 2957 handwritten pages and over 300K words. We report promising results on these collections, despite the zero-shot constraint and huge collection size. Our framework allows the addition of new collections without any need for specific finetuning or labeling. Finally, we also present a demonstration of the retrieval framework. [Project Page] Keywords: document retrieval, keyword spotting, zero-shot retrieval.
Evaluation of Detection and Segmentation Tasks on Driving Datasets
Deepak Kumar Singh,Ameet Rahane,Ajoy Mondal,Anbumani Subramanian,Jawahar C V
International Conference on Computer vision and Image Processing, CVIP, 2021
@inproceedings{bib_Eval_2021, AUTHOR = {Deepak Kumar Singh, Ameet Rahane, Ajoy Mondal, Anbumani Subramanian, Jawahar C V}, TITLE = {Evaluation of Detection and Segmentation Tasks on Driving Datasets}, BOOKTITLE = {International Conference on Computer vision and Image Processing}. YEAR = {2021}}
Object detection, semantic segmentation, and instance segmentation form the basis for many computer vision tasks in autonomous driving. The complexity of these tasks increases as we shift from object detection to instance segmentation. The state-of-the-art models are evaluated on standard datasets such as pascal-voc and ms-coco, which do not consider the dynamics of road scenes. Driving datasets such as Cityscapes and Berkeley Deep Drive (bdd) are captured in a structured environment with better road markings and fewer variations in the appearance of objects and background. However, the same does not hold for Indian roads. The Indian Driving Dataset (idd) is captured in unstructured driving scenarios and is highly challenging for a model due to its diversity. This work presents a comprehensive evaluation of state-of-the-art models on object detection, semantic segmentation, and instance segmentation on road scene datasets. We present our analyses and compare their quantitative and qualitative performance on structured driving datasets (Cityscapes and bdd) and the unstructured driving dataset (idd); understanding the behavior on these datasets helps in addressing various practical issues and helps in creating real-life applications.
ORDER: Open World Object Detection on Road Scenes
Deepak Kumar Singh,Shyam Nandan Rai,K J Joseph,Rohit Saluja,Vineeth N Balasubramanian,Chetan Arora,Anbumani Subramanian,Jawahar C V
Workshop on Machine Learning for Autonomous Driving, ML4AD, 2021
@inproceedings{bib_ORDE_2021, AUTHOR = {Deepak Kumar Singh, Shyam Nandan Rai, K J Joseph, Rohit Saluja, Vineeth N Balasubramanian, Chetan Arora, Anbumani Subramanian, Jawahar C V}, TITLE = {ORDER: Open World Object Detection on Road Scenes}, BOOKTITLE = {Workshop on Machine Learning for Autonomous Driving}. YEAR = {2021}}
Object detection is a key component in autonomous navigation systems that enables localization and classification of the objects in a road scene. Existing object detection methods are trained and inferred on a fixed number of known classes present in road scenes. However, in real-world or open-world road scenes, during inference we come across unknown objects that the detection model hasn’t seen while training. Hence, we propose Open World Object Detection on Road Scenes (ORDER) to address the aforementioned problem for road scenes. Firstly, we introduce Feature-Mix to improve the unknown object detection capabilities of an object detector. Feature-Mix widens the gap between known and unknown classes in the latent feature space, which helps improve unknown object detection. Next, we identify that road scene datasets, compared to generic object datasets, contain a significant proportion of small objects and have higher intra-class bounding box scale variations, making it challenging to detect the known and unknown objects. We propose a novel loss, the Focal regression loss, that collectively addresses the problems of small object detection and intra-class bounding box scale variation by penalizing small bounding boxes more and dynamically changing the loss according to object size. Further, the detection of small objects is improved by curriculum learning. Finally, we present an extensive evaluation on two road scene datasets: BDD and IDD. Our experimental evaluations on BDD and IDD show consistent improvement over the current state-of-the-art method. We believe that this work will lay the foundation for real-world object detection for road scenes.
Evaluating Computer Vision Techniques for Urban Mobility on Large-Scale, Unconstrained Roads
Harish Rithish,Durga Nagendra Raghava Kumar M,B RANJITH REDDY,Rohit Saluja,Jawahar C V
Technical Report, arXiv, 2021
@inproceedings{bib_Eval_2021, AUTHOR = {Harish Rithish, Durga Nagendra Raghava Kumar M, B RANJITH REDDY, Rohit Saluja, Jawahar C V}, TITLE = {Evaluating Computer Vision Techniques for Urban Mobility on Large-Scale, Unconstrained Roads}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
Conventional approaches for addressing road safety rely on manual interventions or immobile CCTV infrastructure. Such methods are expensive in enforcing compliance to traffic rules and do not scale to large road networks. This paper proposes a simple mobile imaging setup to address several common problems in road safety at scale. We use recent computer vision techniques to identify possible irregularities on roads, the absence of street lights, and defective traffic signs using videos from a moving camera-mounted vehicle. Beyond the inspection of static road infrastructure, we also demonstrate the mobile imaging solution’s applicability to spot traffic violations. Before deploying our system in the real world, we investigate the strengths and shortcomings of computer vision techniques on thirteen condition-based hierarchical labels. These conditions include different timings, road type, traffic density, and state of road damage. Our demonstrations are then carried out on 2000 km of unconstrained road scenes, captured across an entire city. Through this, we quantitatively measure the overall safety of roads in the city through carefully constructed metrics. We also show an interactive dashboard for visually inspecting and initiating action in a time, labor and cost-efficient manner. Code, models, and datasets used in this work will be publicly released.
Classroom Slide Narration System
JOBIN K V,Ajoy Mondal,Jawahar C V
International Conference on Computer vision and Image Processing, CVIP, 2021
@inproceedings{bib_Clas_2021, AUTHOR = {JOBIN K V, Ajoy Mondal, Jawahar C V}, TITLE = {Classroom Slide Narration System}, BOOKTITLE = {International Conference on Computer vision and Image Processing}. YEAR = {2021}}
Slide presentations are an effective and efficient tool used by the teaching community for classroom communication. However, this teaching model can be challenging for blind and visually impaired (vi) students, who require personal human assistance to understand the presented slide. This shortcoming motivates us to design a Classroom Slide Narration System (csns) that generates audio descriptions corresponding to the slide content. We pose this problem as an image-to-markup language generation task. The initial step is to extract logical regions such as title, text, equation, figure, and table from the slide image. In classroom slide images, the logical regions are distributed based on their location within the image. To utilize the location of the logical regions for slide image segmentation, we propose the architecture Classroom Slide Segmentation Network (cssn). The unique attributes of this architecture differ from those of most other semantic segmentation networks. Publicly available benchmark datasets such as WiSe and SPaSe are used to validate the performance of our segmentation architecture. We obtain a 9.54% segmentation accuracy improvement on the WiSe dataset. We extract content (information) from the slide using four well-established modules: optical character recognition (ocr), figure classification, equation description, and table structure recognition. With this information, we build the Classroom Slide Narration System (csns) to help vi students understand the slide content. Users have given better feedback on the quality of the output of the proposed csns in comparison to existing systems like Facebook’s Automatic Alt-Text (aat) and Tesseract.
Exploring pan-cancer similarities from a deep learning perspective
Ashish R,Piyush Singh,Vinod Palakkad Krishnanunni,Jawahar C V
Frontiers in Oncology, FIO, 2021
@inproceedings{bib_Expl_2021, AUTHOR = {Ashish R, Piyush Singh, Vinod Palakkad Krishnanunni, Jawahar C V}, TITLE = {Exploring pan-cancer similarities from a deep learning perspective}, BOOKTITLE = {Frontiers in Oncology}. YEAR = {2021}}
Histopathology image analysis is widely accepted as a gold standard for cancer diagnosis. The Cancer Genome Atlas (TCGA) contains large repositories of histopathology whole slide images spanning several organs and subtypes. However, not much work has gone into the analysis of all the organs and subtypes and their similarities. Our work attempts to bridge this gap by training deep learning models to classify cancer vs. normal patches for 11 subtypes spanning 7 organs (9,792 tissue slides) to achieve near-perfect classification performance. We used these models to investigate their performance on the test sets of other organs (cross-organ inference). We found that every model had a good cross-organ inference accuracy when tested on breast, colorectal and liver cancers. Further, a high accuracy is observed between models trained on the cancer subtypes originating from the same organ (kidney and lung). We also validated these performances by showing the separability of cancer and normal samples in a high-dimensional feature space. We further hypothesized that the high cross-organ inferences are due to shared tumor morphologies among organs. We validated the hypothesis by showing the overlap in the Gradient-weighted Class Activation Mapping (GradCAM) visualizations and similarities in the distributions of nuclei geometrical features present within the high-attention regions. Keywords: TCGA, histopathology, cancer classification, cross-organ inference, deep learning, tissue morphology, class activation map.
Interactive Learning for Assisting Whole Slide Image Annotation
Ashish R,Piyush Singh,Vinod Palakkad Krishnanunni,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2021
@inproceedings{bib_Inte_2021, AUTHOR = {Ashish R, Piyush Singh, Vinod Palakkad Krishnanunni, Jawahar C V}, TITLE = {Interactive Learning for Assisting Whole Slide Image Annotation}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2021}}
Owing to the large dimensions of histopathology whole slide images (WSI), visually searching for clinically significant regions (patches) is a tedious task for a medical expert. Sequential analysis of several such images further increases the workload, resulting in poor diagnosis. A major impediment towards automating this task using deep learning models is that it requires a huge chunk of laboriously annotated data in the form of WSI patches. Our work suggests a novel CNN-based, expert feedback-driven interactive learning technique to mitigate this issue. The proposed method seeks to acquire labels of the most informative patches in small increments with multiple feedback rounds to maximize the throughput. It requires the expert to query a patch of interest from one slide and provide feedback to a set of unlabelled patches chosen using the proposed sampling strategy from a ranked list. The experiments on a large patient cohort of colorectal cancer histological patches (100K images with nine classes of tissues) show a significant reduction (≈ 95%) in the amount of labelled data required to achieve state-of-the-art results when compared to other existing interactive learning methods (35% − 50%). We also demonstrate the utility of the proposed technique to assist a WSI tumor segmentation annotation task using the ICIAR breast cancer challenge dataset (≈ 12.5K patches per slide). The proposed technique reduces the scanning and searching area to about 2% of the total area of the WSI (by seeking labels of ≈ 250 informative patches only) and achieves segmentation outputs with 85% IOU. Thus our work helps avoid the routine procedure of exhaustive scanning and searching during annotation and diagnosis in general.
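The feedback rounds described in this abstract can be sketched as repeatedly ranking unlabeled patch embeddings against the expert's query patch and asking for labels on a small top-ranked batch; the plain cosine ranking below (NumPy) is an illustrative stand-in for the paper's sampling strategy, and `ask_expert` is a hypothetical callback.

```python
# Toy expert-in-the-loop labelling rounds over precomputed patch embeddings.
import numpy as np

def feedback_rounds(query_emb, unlabeled_emb, ask_expert, rounds=5, batch=16):
    """Collect expert labels over several rounds; returns (patch indices, labels)."""
    pool = np.arange(len(unlabeled_emb))
    picked_all, labels_all = [], []
    q = query_emb / np.linalg.norm(query_emb)
    for _ in range(rounds):
        u = unlabeled_emb[pool]
        scores = (u / np.linalg.norm(u, axis=1, keepdims=True)) @ q
        picked = pool[np.argsort(-scores)[:batch]]       # most query-like patches first
        labels_all.extend(ask_expert(picked))            # expert feedback on this small batch
        picked_all.extend(picked.tolist())
        pool = np.setdiff1d(pool, picked)                # remove them from the unlabeled pool
    return np.array(picked_all), np.array(labels_all)
```

After each round, the labeled patches would be used to update the classifier so that the next ranking is better informed.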
Personalized One-Shot Lipreading for an ALS Patient
Bipasha Sen,Aditya Agarwal,Rudrabha Mukhopadhyay,Vinay P Namboodiri,Jawahar C V
British Machine Vision Conference, BMVC, 2021
@inproceedings{bib_Pers_2021, AUTHOR = {Bipasha Sen, Aditya Agarwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, Jawahar C V}, TITLE = {Personalized One-Shot Lipreading for an ALS Patient}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2021}}
Lipreading, or visually recognizing speech from the mouth movements of a speaker, is a challenging and mentally taxing task. Unfortunately, multiple medical conditions force people to depend on this skill in their day-to-day lives for essential communication. Patients suffering from ‘Amyotrophic Lateral Sclerosis’ (ALS) often lose muscle control and, consequently, their ability to generate speech and communicate via lip movements. Existing large datasets do not focus on medical patients or curate personalized vocabulary relevant to an individual. Collecting the large-scale dataset of a patient needed to train modern data-hungry deep learning models is, however, extremely challenging. In this work, we propose a personalized network to lipread an ALS patient using only one-shot examples. We depend on synthetically generated lip movements to augment the one-shot scenario. A Variational Encoder based domain adaptation technique is used to bridge the real-synthetic domain gap. Our approach significantly improves and achieves a high top-5 accuracy of 83.2% compared to 62.6% achieved by comparable methods for the patient. Apart from evaluating our approach on the ALS patient, we also extend it to people with hearing impairment relying extensively on lip movements to communicate.
Multi-Domain Incremental Learning for Semantic Segmentation
Jawahar C V,Prachi Garg,Rohit Saluja,Vineeth N Balasubramanian,Chetan Arora,Anbumani Subramanian
Winter Conference on Applications of Computer Vision, WACV, 2021
@inproceedings{bib_Mult_2021, AUTHOR = {Jawahar C V, Prachi Garg, Rohit Saluja, Vineeth N Balasubramanian, Chetan Arora, Anbumani Subramanian}, TITLE = {Multi-Domain Incremental Learning for Semantic Segmentation}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2021}}
Recent efforts in multi-domain learning for semantic segmentation attempt to learn multiple geographical datasets in a universal, joint model. A simple fine-tuning experiment performed sequentially on three popular road scene segmentation datasets demonstrates that existing segmentation frameworks fail at incrementally learning on a series of visually disparate geographical domains. When learning a new domain, the model catastrophically forgets previously learned knowledge. In this work, we pose the problem of multi-domain incremental learning for semantic segmentation. Given a model trained on a particular geographical domain, the goal is to (i) incrementally learn a new geographical domain, (ii) while retaining performance on the old domain, (iii) given that the previous domain’s dataset is not accessible. We propose a dynamic architecture that assigns universally shared, domain-invariant parameters to capture homogeneous semantic features present in all domains, while dedicated domain-specific parameters learn the statistics of each domain. Our novel optimization strategy helps achieve a good balance between retention of old knowledge (stability) and acquiring new knowledge (plasticity). We demonstrate the effectiveness of our proposed solution on domain incremental settings pertaining to real-world driving scenes from roads of Germany (Cityscapes), the United States (BDD100k), and India (IDD).
Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor
Anchit Gupta,Faizan Farooq Khan,Rudrabha Mukhopadhyay,Vinay P Namboodiri,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2021
@inproceedings{bib_Inte_2021, AUTHOR = {Anchit Gupta, Faizan Farooq Khan, Rudrabha Mukhopadhyay, Vinay P Namboodiri, Jawahar C V}, TITLE = {Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2021}}
This paper proposes a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities. Our editor provides an easy-to-use interface to apply modern
Efficient and Generic Interactive Segmentation Framework to Correct Mispredictions during Clinical Evaluation of Medical Images
Bhavani S,Ashutosh Gupta,Jawahar C V,Chetan Arora
International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI, 2021
@inproceedings{bib_Effi_2021, AUTHOR = {Bhavani S, Ashutosh Gupta, Jawahar C V, Chetan Arora}, TITLE = {Efficient and Generic Interactive Segmentation Framework to Correct Mispredictions during Clinical Evaluation of Medical Images}, BOOKTITLE = {International Conference on Medical Image Computing and Computer Assisted Intervention}. YEAR = {2021}}
Semantic segmentation of medical images is an essential first step in computer-aided diagnosis systems for many applications. However, given many disparate imaging modalities and inherent variations in the patient data, it is difficult to consistently achieve high accuracy using modern deep neural networks (DNNs). This has led researchers to propose interactive image segmentation techniques where a medical expert can interactively correct the output of a DNN to the desired accuracy. However, these techniques often need separate training data with the associated human interactions, and do not generalize to various diseases and types of medical images. In this paper, we suggest a novel conditional inference technique for DNNs which takes the intervention by a medical expert as test-time constraints and performs inference conditioned upon these constraints. Our technique is generic and can be used for medical images from any modality. Unlike other methods, our approach can correct multiple structures simultaneously and add structures missed at initial segmentation. We report an improvement of 13.3, 12.5, 17.8, 10.2, and 12.4 times in user annotation time over full human annotation for the nucleus, multiple cells, liver and tumor, organ, and brain segmentation respectively. We report a time saving of 2.8, 3.0, 1.9, 4.4, and 8.6 fold compared to other interactive segmentation techniques. Our method can be useful to clinicians for diagnosis and post-surgical follow-up with minimal intervention from the medical expert. The source code and the detailed results are available here [1].
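As a rough picture of how expert input can act as a test-time constraint, the sketch below (PyTorch assumed) optimizes a segmentation network at inference so that scribbled pixels match the expert's labels while a consistency term keeps the rest of the prediction close to the initial output; this is a generic illustration of the idea, not the authors' conditional inference procedure.

```python
# Toy test-time refinement driven by expert scribbles.
import torch
import torch.nn.functional as F

def refine_with_scribbles(model, image, scribble_mask, scribble_labels, steps=20, lam=1.0):
    """image: (C, H, W); scribble_mask: (H, W) bool; scribble_labels: (H, W) long."""
    with torch.no_grad():
        initial = model(image.unsqueeze(0)).softmax(dim=1)           # frozen initial prediction
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(steps):
        logits = model(image.unsqueeze(0))
        ce = F.cross_entropy(logits.permute(0, 2, 3, 1)[0][scribble_mask],
                             scribble_labels[scribble_mask])         # obey the expert's corrections
        consistency = F.kl_div(logits.log_softmax(dim=1), initial,
                               reduction="batchmean")                # stay close to the initial output
        (ce + lam * consistency).backward()
        opt.step()
        opt.zero_grad()
    return model(image.unsqueeze(0)).argmax(dim=1)[0]                # refined segmentation mask
```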
ICDAR 2021 Competition on Document Visual Question Answering
Rubèn Tito,MINESH MATHEW,Jawahar C V,Ernest Valveny,Dimosthenis Karatzas
International Conference on Document Analysis and Recognition Workshops, ICDAR-W, 2021
@inproceedings{bib_ICDA_2021, AUTHOR = {Rubèn Tito, MINESH MATHEW, Jawahar C V, Ernest Valveny, Dimosthenis Karatzas}, TITLE = {ICDAR 2021 Competition on Document Visual Question Answering}, BOOKTITLE = {International Conference on Document Analysis and Recognition Workshops}. YEAR = {2021}}
In this report we present results of the ICDAR 2021 edition of the Document Visual Question Answering Challenges. This edition complements the previous tasks on Single Document VQA and Document Collection VQA with a newly introduced task on Infographics VQA. Infographics VQA is based on a new dataset of more than 5,000 infographics images and 30,000 question-answer pairs. The winner methods have scored 0.6120 ANLS in the Infographics VQA task, 0.7743 ANLSL in the Document Collection VQA task and 0.8705 ANLS in Single Document VQA. We present a summary of the datasets used for each task, a description of each of the submitted methods, and the results and analysis of their performance. A summary of the progress made on Single Document VQA since the first edition of the DocVQA 2020 challenge is also presented.
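The ANLS scores quoted in this abstract are Average Normalized Levenshtein Similarities: per question, the best similarity over the ground-truth answers is taken, scores below the usual 0.5 threshold are zeroed, and the result is averaged over questions. A small self-contained sketch of the metric:

```python
# Average Normalized Levenshtein Similarity (ANLS), with the customary 0.5 threshold.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """predictions: list[str]; ground_truths: list[list[str]], one list per question."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        sims = []
        for gt in answers:
            nl = levenshtein(pred.lower(), gt.lower()) / max(len(pred), len(gt), 1)
            sims.append(1 - nl if nl < tau else 0.0)     # zero out low-similarity matches
        scores.append(max(sims))
    return sum(scores) / len(scores)
```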
Towards Boosting the Accuracy of Non-latin Scene Text Recognition
G SANJANA,Rohit Saluja,Jawahar C V
International Conference on Document Analysis and Recognition Workshops, ICDAR-W, 2021
@inproceedings{bib_Towa_2021, AUTHOR = {G SANJANA, Rohit Saluja, Jawahar C V}, TITLE = {Towards Boosting the Accuracy of Non-latin Scene Text Recognition}, BOOKTITLE = {International Conference on Document Analysis and Recognition Workshops}. YEAR = {2021}}
Scene-text recognition is remarkably better in Latin languages than the non-Latin languages due to several factors like multiple fonts, simplistic vocabulary statistics, updated data generation tools, and writing systems. This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages. We compare various features like the size (width and height) of the word images and word length statistics. Over the last decade, generating synthetic datasets with powerful deep learning techniques has tremendously improved scene-text recognition. Several controlled experiments are performed on English, by varying the number of (i) fonts to create the synthetic data and (ii) created word images. We discover that these factors are critical for the scene-text recognition systems. The English synthetic datasets utilize over 1400 fonts while Arabic and other non-Latin datasets utilize less than
Transfer Learning for Scene Text Recognition in Indian Languages
G SANJANA,Rohit Saluja,Jawahar C V
International Conference on Document Analysis and Recognition Workshops, ICDAR-W, 2021
@inproceedings{bib_Tran_2021, AUTHOR = {G SANJANA, Rohit Saluja, Jawahar C V}, TITLE = {Transfer Learning for Scene Text Recognition in Indian Languages}, BOOKTITLE = {International Conference on Document Analysis and Recognition Workshops}. YEAR = {2021}}
Scene text recognition in low-resource Indian languages is challenging because of complexities like multiple scripts, fonts, text size, and orientations. In this work, we investigate the power of transfer learning for all the layers of deep scene text recognition networks from English to two common Indian languages. We perform experiments on the conventional CRNN model and STAR-Net to ensure generalisability. To study the effect of change in different scripts, we initially run our experiments on synthetic word images rendered using Unicode fonts. We show that the transfer of English models to simple synthetic datasets of Indian languages is not practical. Instead, we propose to apply transfer learning techniques among Indian languages due to similarity in their n-gram distributions and visual features like the vowels and conjunct characters. We then study the transfer learning among six Indian languages
Asking questions on handwritten document collections
MINESH MATHEW,Lluis Gomez,Dimosthenis Karatzas,Jawahar C V
International Journal on Document Analysis and Recognition, IJDAR, 2021
@inproceedings{bib_Aski_2021, AUTHOR = {MINESH MATHEW, Lluis Gomez, Dimosthenis Karatzas, Jawahar C V}, TITLE = {Asking questions on handwritten document collections}, BOOKTITLE = {International Journal on Document Analysis and Recognition}. YEAR = {2021}}
This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations where the answer is a short text, we aim to locate a document snippet where the answer lies. The proposed approach works without recognizing the text in the documents. We argue that the recognition-free approach is suitable for handwritten documents and historical collections where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers act as a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network which can project both textual words and word images into a common sub-space. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate results of the proposed approach on two new datasets: (i) HW-SQuAD: a synthetic, handwritten document image counterpart of the SQuAD1.0 dataset and (ii) BenthamQA: a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach which uses text recognized from the images using an OCR. Datasets presented in this work are available to download at docvqa.org.
iiit-indic-hw-words: A Dataset for Indic Handwritten Text Recognition
Santhoshini Reddy Gongidi,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2021
@inproceedings{bib_iiit_2021, AUTHOR = {Santhoshini Reddy Gongidi, Jawahar C V}, TITLE = {iiit-indic-hw-words: A Dataset for Indic Handwritten Text Recognition}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2021}}
Handwritten text recognition (htr) for Indian languages is not yet a well-studied problem. This is primarily due to the unavailability of large annotated datasets in the associated scripts. Existing datasets are small in size. They also use small lexicons. Such datasets are not sufficient to build robust solutions to htr using modern machine learning techniques. In this work, we introduce a large-scale handwritten dataset for Indic scripts containing 868K handwritten instances written by 135 writers in 8 widely-used scripts. A comprehensive dataset of ten Indic scripts is derived by combining the newly introduced dataset with the earlier datasets developed for Devanagari (iiit-hw-dev) and Telugu (iiit-hw-telugu), referred to as the iiit-indic-hw-words. We further establish a high baseline for text recognition in eight Indic scripts. Our recognition scheme follows the contemporary design principles from other recognition literature, and yields competitive results on English. iiit-indic-hw-words along with the recognizers are available publicly. We further (i) study the reasons for changes in htr performance across scripts and (ii) explore the utility of pre-training for Indic htrs. We hope our efforts will catalyze research and fuel applications related to handwritten document understanding in Indic scripts.
Towards Automatic Speech to Sign Language Generation
Parul Kapoor,Rudrabha Mukhopadhyay,Sindhu Balachandra Hegde,Vinay P Namboodiri,Jawahar C V
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021
@inproceedings{bib_Towa_2021, AUTHOR = {Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu Balachandra Hegde, Vinay P Namboodiri, Jawahar C V}, TITLE = {Towards Automatic Speech to Sign Language Generation}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2021}}
We aim to solve the highly challenging task of generating continuous sign language videos solely from speech segments for the first time. Recent efforts in this space have focused on generating such videos from human-annotated text transcripts without considering other modalities. However, replacing speech with sign language proves to be a practical solution while communicating with people suffering from hearing loss. Therefore, we eliminate the need for text as input and design techniques that work for more natural, continuous, freely uttered speech covering an extensive vocabulary. Since the current datasets are inadequate for generating sign language directly from speech, we collect and release the first Indian sign language dataset comprising speech-level annotations, text transcripts, and the corresponding sign-language videos. Next, we propose a multi-tasking transformer network trained to generate signer’s poses from speech segments. With speech-to-text as an auxiliary task and an additional cross-modal discriminator, our model learns to generate continuous sign pose sequences in an end-to-end manner. Extensive experiments and comparisons with other baselines demonstrate the effectiveness of our approach. We also conduct additional ablation studies to analyze the effect of different modules of our network. A demo video containing several results is attached to the supplementary material.
More Parameters? No Thanks!
Zeeshan Khan,Akella S V Sukruth Sai Kartheek,Vinay P Namboodiri,Jawahar C V
Technical Report, arXiv, 2021
@inproceedings{bib_More_2021, AUTHOR = {Zeeshan Khan, Akella S V Sukruth Sai Kartheek, Vinay P Namboodiri, Jawahar C V}, TITLE = {More Parameters? No Thanks!}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
This work studies the long-standing problems of model capacity and negative interference in multilingual neural machine translation (MNMT). We use network pruning techniques and observe that pruning 50-70% of the parameters from a trained MNMT model results only in a 0.29-1.98 drop in the BLEU score, suggesting that large redundancies exist even in MNMT models. These observations motivate us to use the redundant parameters and counter the interference problem efficiently. We propose a novel adaptation strategy, where we iteratively prune and retrain the redundant parameters of an MNMT to improve bilingual representations while retaining the multilinguality. Negative interference severely affects high resource languages, and our method alleviates it without any additional adapter modules. Hence, we call it a parameter-free adaptation strategy, paving the way for the efficient adaptation of MNMT. We demonstrate the effectiveness of our method on a 9-language MNMT trained on TED talks, and report an average improvement of +1.36 BLEU on high resource pairs. Code will be released.
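As a rough illustration of the redundancy argument above, the sketch below performs magnitude-based pruning of a single weight matrix in Python/NumPy. The 60% pruning ratio, the layer shape, and the standalone-matrix setting are illustrative assumptions; the paper itself prunes and retrains a full MNMT model rather than one layer in isolation.

# Minimal sketch (assumptions: NumPy only, one dense layer, 60% pruning ratio).
import numpy as np

def magnitude_prune(weights, ratio):
    # Zero out the `ratio` fraction of weights with the smallest magnitude.
    threshold = np.quantile(np.abs(weights), ratio)
    return weights * (np.abs(weights) >= threshold)

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))            # stand-in for one trained layer
W_pruned = magnitude_prune(W, ratio=0.6)
print(f"sparsity: {np.mean(W_pruned == 0):.2f}")   # roughly 0.60 of weights removed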
Looking Farther in Parametric Scene Parsing with Ground and Aerial Imagery
Raghava Modhugu,Harish Rithish Sethuram,Manmohan Chandraker,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2021
@inproceedings{bib_Look_2021, AUTHOR = {Raghava Modhugu, Harish Rithish Sethuram, Manmohan Chandraker, Jawahar C V}, TITLE = {Looking Farther in Parametric Scene Parsing with Ground and Aerial Imagery}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2021}}
Parametric models that represent layout in terms of scene attributes are an attractive avenue for road scene understanding in autonomous navigation. Prior works that rely only on ground imagery are limited by the narrow field of view of the camera, occlusions and perspective foreshortening. In this paper, we demonstrate the effectiveness of using aerial imagery as an additional modality to overcome the above challenges. We propose a novel architecture, Unified, that combines features from both aerial and ground imagery to infer scene attributes. We quantitatively evaluate on the KITTI dataset and show that our Unified model outperforms prior works. Since this dataset is limited to road scenes close to the vehicle, we supplement the publicly available Argoverse dataset with scene attribute annotations and evaluate on far-away scenes. We show both quantitatively and qualitatively the importance of aerial imagery in understanding road scenes, especially in regions farther away from the ego-vehicle. All code, models, and data, including scene attribute annotations on the Argoverse dataset along with collected and processed aerial imagery, are available.
Revisiting Low Resource Status of Indian Languages in Machine Translation
JERIN PHILIP,SIRIPRAGADA SHASHANK,Vinay P namboodiri,Jawahar C V
India Joint International Conference on Data Science & Management of Data, COMAD/CODS, 2021
@inproceedings{bib_Revi_2021, AUTHOR = {JERIN PHILIP, SIRIPRAGADA SHASHANK, Vinay P Namboodiri, Jawahar C V}, TITLE = {Revisiting Low Resource Status of Indian Languages in Machine Translation }, BOOKTITLE = {India Joint International Conference on Data Science & Management of Data}. YEAR = {2021}}
Indian language machine translation performance is hampered due to the lack of large scale multi-lingual sentence aligned corpora and robust benchmarks. Through this paper, we provide and analyse an automated framework to obtain such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module that is used to work with publicly available websites such as press releases by the government. The main contribution towards this effort is to obtain an incremental method that uses the above pipeline to iteratively improve the size of the corpus as well as improve each of the components of our system. Through our work, we also evaluate the design choices such as the choice of pivoting language and the effect of iterative incremental increase in corpus size. Our work in addition to providing an automated framework also results in generating a relatively larger corpus as compared to existing corpora that are available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks.
Audio-Visual Speech Super-Resolution
Rudrabha Mukhopadhyay,Sindhu B Hegde,Vinay P Namboodiri,Jawahar C V
British Machine Vision Conference, BMVC, 2021
@inproceedings{bib_Audi_2021, AUTHOR = {Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay P Namboodiri, Jawahar C V}, TITLE = {Audio-Visual Speech Super-Resolution}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2021}}
In this paper, we present an audio-visual model to perform speech super-resolution at large scale-factors (8× and 16×). Previous works attempted to solve this problem using only the audio modality as input and thus were limited to low scale-factors of 2× and 4×. In contrast, we propose to incorporate both visual and auditory signals to super-resolve speech of sampling rates as low as 1kHz. In such challenging situations, the visual features assist in learning the content and improve the quality of the generated speech. Further, we demonstrate the applicability of our approach to arbitrary speech signals where the visual stream is not accessible. Our “pseudo-visual network” precisely synthesizes the visual stream solely from the low-resolution speech input. Extensive experiments and the demo video illustrate our method’s remarkable results and benefits over state-of-the-art audio-only speech super-resolution approaches.
Translating Sign Language Videos to Talking Faces
Seshadri Mazumder,Rudrabha Mukhopadhyay,Vinay P namboodiri,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2021
@inproceedings{bib_Tran_2021, AUTHOR = {Seshadri Mazumder, Rudrabha Mukhopadhyay, Vinay P Namboodiri, Jawahar C V}, TITLE = {Translating Sign Language Videos to Talking Faces}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2021}}
Communication with the deaf community relies profoundly on the interpretation of sign languages performed by the signers. In light of the recent breakthroughs in sign language translations, we propose a pipeline that we term "Translating Sign Language Videos to Talking Faces". In this context, we improve the existing sign language translation systems by using POS tags to improve language modeling. We further extend the challenge to develop a system that can interpret a video from a signer to an avatar speaking in spoken languages. We focus on the translation systems that attempt to translate sign languages to text without glosses,
Exploring Genetic-histologic Relationships in Breast Cancer
Ruchi Chauhan,Vinod Palakkad Krishnanunni,Jawahar C V
IEEE International Symposium on Biomedical Imaging, ISBI, 2021
@inproceedings{bib_Expl_2021, AUTHOR = {Ruchi Chauhan, Vinod Palakkad Krishnanunni, Jawahar C V}, TITLE = {Exploring Genetic-histologic Relationships in Breast Cancer}, BOOKTITLE = {IEEE International Symposium on Biomedical Imaging}. YEAR = {2021}}
The advent of digital pathology presents opportunities for computer vision to provide fast, accurate, and objective solutions for histopathological images and to aid in knowledge discovery. This work uses deep learning to predict genomic biomarkers - TP53 mutation, PIK3CA mutation, ER status, PR status, HER2 status, and intrinsic subtypes - from breast cancer histopathology images. Furthermore, we attempt to understand the underlying morphology as to how these genomic biomarkers manifest in images. Since gene sequencing is expensive, not always available, or even feasible, predicting these biomarkers from images would help in diagnosis, prognosis, and effective treatment planning. We outperform the existing works with a minimum improvement of 0.02 and a maximum of 0.13 AUROC scores across all tasks. We also gain insights that can serve as hypotheses for further experimentation, including the presence of lymphocytes and karyorrhexis. Moreover, our fully automated workflow can be extended to other tasks across other cancer subtypes.
MMBERT: Multimodal BERT Pretraining for Improved Medical VQA
Yash Khare,Viraj Bagal,MINESH MATHEW,Adithi Devi,Deva Priyakumar U,Jawahar C V
IEEE International Symposium on Biomedical Imaging, ISBI, 2021
@inproceedings{bib_MMBE_2021, AUTHOR = {Yash Khare, Viraj Bagal, MINESH MATHEW, Adithi Devi, Deva Priyakumar U, Jawahar C V}, TITLE = {MMBERT: Multimodal BERT Pretraining for Improved Medical VQA}, BOOKTITLE = {IEEE International Symposium on Biomedical Imaging}. YEAR = {2021}}
Images in the medical domain are fundamentally different from the general domain images. Consequently, it is infeasible to directly employ general domain Visual Question Answering (VQA) models for the medical domain. Additionally, medical image annotation is a costly and time-consuming process. To overcome these limitations, we propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision, and Language tasks. Our method involves learning richer medical image and text semantic representations using Masked Vision-Language Modeling as the pretext task on a large medical image+caption dataset. The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images – VQA-Med 2019 and VQA-RAD, outperforming even the ensemble models of previous best solutions. Moreover, our solution provides attention maps which help in model interpretability.
DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks
Rishal Aggarwal, Akash Gupta,Vineeth Ravindra Chelur,Jawahar C V,Deva Priyakumar U
Journal of Chemical Information and Modeling, JCIM, 2021
@inproceedings{bib_Deep_2021, AUTHOR = {Rishal Aggarwal, Akash Gupta, Vineeth Ravindra Chelur, Jawahar C V, Deva Priyakumar U}, TITLE = {DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks}, BOOKTITLE = {Journal of Chemical Information and Modeling}. YEAR = {2021}}
A structure-based drug design pipeline involves the development of potential drug molecules or ligands that form stable complexes with a given receptor at its binding site. A prerequisite to this is finding druggable and functionally relevant binding sites on the 3D structure of the protein. Although several methods for detecting binding sites have been developed beforehand, a majority of them surprisingly fail to identify and rank binding sites accurately. The rapid adoption and success of deep learning algorithms in various sections of structural biology beckons the usage of such algorithms for accurate binding site detection. As a combination of geometry-based software and deep learning, we report a novel framework, DeepPocket, that utilises 3D convolutional neural networks for the rescoring of pockets identified by Fpocket and further segments these identified cavities on the protein surface. Apart from this, we also propose another dataset, SC6K, containing protein structures submitted in the Protein Data Bank (PDB) from 1st January, 2018 till 28th February, 2020 for ligand binding site (LBS) detection. DeepPocket’s results on various binding site datasets and SC6K highlight its better …
Multi-target Tracking with Sparse Group Features and Position Using Discrete-Continuous Optimization
Billy Peralta Márque,Alvaro Soto,Jawahar C V,S Shan
Asian Conference on Computer Vision, ACCV, 2021
@inproceedings{bib_Mult_2021, AUTHOR = {Billy Peralta Márque, Alvaro Soto, Jawahar C V, S Shan}, TITLE = {Multi-target Tracking with Sparse Group Features and Position Using Discrete-Continuous Optimization}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2021}}
Multi-target tracking of pedestrians is a challenging task due to uncertainty about targets, caused mainly by similarity between pedestrians, occlusion over a relatively long time and a cluttered background. A usual scheme for tackling multi-target tracking is to divide it into two sub-problems: data association and trajectory estimation. A reasonable approach is based on joint optimization of a discrete model for data association and a continuous model for trajectory estimation in a Markov Random Field framework. Nonetheless, usual solutions of the data association problem are based only on location information, while the visual information in the images is ignored. Visual features can be useful for associating detections with true targets more reliably, because the targets usually have discriminative features. In this work, we propose a combination of position and visual feature information in a discrete data association …
Towards Automatic Speech to Sign Language Generation
Parul Kapoor,Rudrabha Mukhopadhyay,Sindhu Balachandra Hegde,Vinay Namboodiri,Jawahar C V
Technical Report, arXiv, 2021
@inproceedings{bib_Towa_2021, AUTHOR = {Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu Balachandra Hegde, Vinay Namboodiri, Jawahar C V}, TITLE = {Towards Automatic Speech to Sign Language Generation}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
We aim to solve the highly challenging task of generating continuous sign language videos solely from speech segments for the first time. Recent efforts in this space have focused on generating such videos from human-annotated text transcripts without considering other modalities. However, replacing speech with sign language proves to be a practical solution while communicating with people suffering from hearing loss. Therefore, we eliminate the need of using text as input and design techniques that work for more natural, continuous, freely uttered speech covering an extensive vocabulary. Since the current datasets are inadequate for generating sign language directly from speech, we collect and release the first Indian sign language dataset comprising speech-level annotations, text transcripts, and the corresponding sign-language videos. Next, we propose a multi-tasking transformer network trained to generate signer's poses from speech segments. With speech-to-text as an auxiliary task and an additional cross-modal discriminator, our model learns to generate continuous sign pose sequences in an end-to-end manner. Extensive experiments and comparisons with other baselines demonstrate the effectiveness of our approach. We also conduct additional ablation studies to analyze the effect of different modules of our network. A demo video containing several results is attached to the supplementary material.
Context Aware Group Activity Recognition
Avijit Dasgupta,Jawahar C V,KARTEEK ALAHARI
International conference on Pattern Recognition, ICPR, 2021
@inproceedings{bib_Cont_2021, AUTHOR = {Avijit Dasgupta, Jawahar C V, KARTEEK ALAHARI}, TITLE = {Context Aware Group Activity Recognition}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2021}}
This paper addresses the task of group activity recognition in multi-person videos. Existing approaches decompose this task into feature learning and relational reasoning. Despite showing progress, these methods only rely on appearance features for people and overlook the available contextual information, which can play an important role in group activity understanding. In this work, we focus on the feature learning aspect and propose a two-stream architecture that not only considers person-level appearance features, but also makes use of contextual information present in videos for group activity recognition. In particular, we propose to use two types of contextual information beneficial for two different scenarios: pose context and scene context that provide crucial cues for group activity understanding. We combine appearance and contextual features to encode each person with an enriched representation. Finally, these combined features are used in relational reasoning for predicting group activities. We evaluate our method on two benchmarks, Volleyball and Collective Activity and show that joint modeling of contextual information with appearance features benefits in group activity understanding.
Data-Efficient Training Strategies for Neural TTS Systems
KR Prajwal,Jawahar C V
India Joint International Conference on Data Science & Management of Data, COMAD/CODS, 2021
@inproceedings{bib_Data_2021, AUTHOR = {KR Prajwal, Jawahar C V}, TITLE = {Data-Efficient Training Strategies for Neural TTS Systems}, BOOKTITLE = {India Joint International Conference on Data Science & Management of Data}. YEAR = {2021}}
India is a country with thousands of languages and dialects spoken across a billion-strong population. For multi-lingual content creation and accessibility, text-to-speech systems will play a crucial role. However, the current neural TTS systems are data-hungry and need about 20 hours of clean single-speaker speech data for each language and speaker. This is not scalable for the large number of Indian languages and dialects. In this work, we demonstrate three simple, yet effective pre-training strategies that allow us to train neural TTS systems with just about one-tenth of the data needs while also achieving better accuracy and naturalness. We show that such pre-trained neural TTS systems can be quickly adapted to different speakers across languages and genders with less than 2 hours of data, thus significantly reducing the effort for future expansions to the thousands of rare Indian languages. We specifically …
Visual Speech Enhancement Without A Real Visual Stream
Sindhu Balachandra Hegde, KR Prajwal,Rudrabha Mukhopadhyay,Vinay P Namboodiri,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2021
@inproceedings{bib_Visu_2021, AUTHOR = {Sindhu Balachandra Hegde, KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, Jawahar C V}, TITLE = {Visual Speech Enhancement Without A Real Visual Stream}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2021}}
In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But, these methods cannot be used for several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is comparable (< 3% difference) to the case of using real lips. This implies that we can exploit the advantages of using lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as human evaluations. Additional ablation studies and a demo video on our website containing qualitative comparisons and results clearly illustrate the effectiveness of our approach.
DocVQA: A dataset for vqa on document images
MINESH MATHEW,Dimosthenis Karatzas,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2021
@inproceedings{bib_DocV_2021, AUTHOR = {MINESH MATHEW, Dimosthenis Karatzas, Jawahar C V}, TITLE = {DocVQA: A dataset for vqa on document images}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2021}}
We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at docvqa.org.
Improving Word Recognition using Multiple Hypotheses and Deep Embeddings
Siddhant Bansal,PRAVEEN KRISHNAN,Jawahar C V
International conference on Pattern Recognition, ICPR, 2021
@inproceedings{bib_Impr_2021, AUTHOR = {Siddhant Bansal, PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {Improving Word Recognition using Multiple Hypotheses and Deep Embeddings}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2021}}
We propose a novel scheme for improving word recognition accuracy using word image embeddings. We use a trained text recognizer, which can predict multiple text hypotheses for a given word image. Our fusion scheme improves the recognition process by utilizing the word image and text embeddings obtained from a trained word image embedding network. We propose EmbedNet, which is trained using a triplet loss for learning a suitable embedding space where the embedding of the word image lies closer to the embedding of the corresponding text transcription. The updated embedding space thus helps in choosing the correct prediction with higher confidence. To further improve the accuracy, we propose a plug-and-play module called Confidence based Accuracy Booster (CAB). The CAB module takes in the confidence scores obtained from the text recognizer and Euclidean distances between the embeddings to generate an updated distance vector. The updated distance vector has lower distance values for the correct words and higher distance values for the incorrect words. We rigorously evaluate our proposed method systematically on a collection of books in the Hindi language. Our method achieves an absolute improvement of around 10 percent in terms of word recognition accuracy.
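To make the fusion idea concrete, here is a minimal, hypothetical sketch of reranking recognizer hypotheses by combining confidence scores with Euclidean distances in an embedding space. The equal weighting, the toy embeddings, and the function names are assumptions for illustration only; this is not the paper's EmbedNet or CAB implementation.

# Minimal sketch (assumptions: NumPy only, toy 3-D embeddings, simple weighted score).
import numpy as np

def rerank(confidences, image_emb, text_embs, alpha=0.5):
    # Score each hypothesis: high recognizer confidence and small embedding distance win.
    dists = np.linalg.norm(text_embs - image_emb, axis=1)
    dists = dists / (dists.max() + 1e-8)                 # normalise distances to [0, 1]
    scores = alpha * confidences + (1 - alpha) * (1 - dists)
    return int(np.argmax(scores))

confidences = np.array([0.80, 0.75, 0.60])               # recognizer confidence per hypothesis
image_emb = np.array([0.1, 0.9, 0.3])                    # toy word-image embedding
text_embs = np.array([[0.8, 0.1, 0.4],                   # toy embeddings of 3 text hypotheses
                      [0.1, 0.85, 0.3],
                      [0.5, 0.5, 0.5]])
print(rerank(confidences, image_emb, text_embs))         # prints 1: the close-and-confident hypothesis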
CDeC-Net: Composite Deformable Cascade Network for Table Detection in Document Images
Madhav Agarwal,Ajoy Mondal,Jawahar C V
International conference on Pattern Recognition, ICPR, 2021
@inproceedings{bib_CDeC_2021, AUTHOR = {Madhav Agarwal, Ajoy Mondal, Jawahar C V}, TITLE = {CDeC-Net: Composite Deformable Cascade Network for Table Detection in Document Images}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2021}}
Localizing page elements/objects such as tables, figures, equations, etc. is the primary step in extracting information from document images. We propose a novel end-to-end trainable deep network (CDeC-Net) for detecting tables present in the documents. The proposed network consists of a multistage extension of Mask R-CNN with a dual backbone having deformable convolution for detecting tables varying in scale with high detection accuracy at higher IoU threshold. We empirically evaluate CDeC-Net on all the publicly available benchmark datasets - ICDAR-2013, ICDAR-2017, ICDAR-2019, UNLV, Marmot, PubLayNet, and TableBank - with extensive experiments. Our solution has three important properties: (i) a single trained model CDeC-Net‡ performs well across all the popular benchmark datasets; (ii) we report excellent performances across multiple, including higher, thresholds of IoU; (iii) by following the same protocol of the recent papers for each of the benchmarks, we consistently demonstrate the superior quantitative performance. Our code and models will be publicly released for enabling the reproducibility of the results.
Document Visual Question Answering Challenge 2020
MINESH MATHEW,Rubn Tito,Dimosthenis Karatzas , R. Manmatha ,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2021
@inproceedings{bib_Docu_2021, AUTHOR = {MINESH MATHEW, Rubn Tito, Dimosthenis Karatzas , R. Manmatha , Jawahar C V}, TITLE = {Document Visual Question Answering Challenge 2020}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2021}}
This paper presents results of the Document Visual Question Answering Challenge organized as part of the "Text and Documents in the Deep Learning Era" workshop, in CVPR 2020. The challenge introduces a new problem - Visual Question Answering on document images. The challenge comprised two tasks. The first task concerns asking questions on a single document image. On the other hand, the second task is set as a retrieval task where the question is posed over a collection of images. For task 1, a new dataset is introduced comprising 50,000 question-answer(s) pairs defined over 12,767 document images. For task 2, another dataset has been created comprising 20 questions over 14,362 document images which share the same document template.
DGAZE: Driver Gaze Mapping on Road
ISHA DUA,THRUPTHI ANN JOHN,Riya Gupta,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2020
@inproceedings{bib_DGAZ_2020, AUTHOR = {ISHA DUA, THRUPTHI ANN JOHN, Riya Gupta, Jawahar C V}, TITLE = {DGAZE: Driver Gaze Mapping on Road}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2020}}
Driver gaze mapping is crucial to estimate driver attention and determine which objects the driver is focusing on while driving. We introduce DGAZE, the first large-scale driver gaze mapping dataset. Unlike previous works, our dataset does not require expensive wearable eye-gaze trackers and instead relies on mobile phone cameras for data collection. The data was collected in a lab setting designed to mimic real driving conditions and has point and object-level annotation. It consists of 227,178 road-driver image pairs collected from 20 drivers and contains 103 unique objects on the road belonging to 7 classes: cars, pedestrians, traffic signals, motorbikes, auto-rickshaws, buses and signboards. We also present I-DGAZE, a fused convolutional neural network for predicting driver gaze on the road, which was trained on the DGAZE dataset. Our architecture combines facial features such as face location and head pose along with the image of the left eye to get optimum results. Our model achieves an error of 186.89 pixels on the road view of resolution 1920×1080 pixels. We compare our model with state-of-the-art eye gaze works and present extensive ablation results.
Graph Representation Ensemble Learning
Palash Goyal,Sachin Raja,Di Huang,Sujit Rokka Chhetri,Arquimedes Canedo,Ajoy Mondal,Jaya Shree,Jawahar C V
IEEE International Conference on Advances in Social Networks Analysis and Mining, ASONAM, 2020
@inproceedings{bib_Grap_2020, AUTHOR = {Palash Goyal, Sachin Raja, Di Huang, Sujit Rokka Chhetri, Arquimedes Canedo, Ajoy Mondal, Jaya Shree, Jawahar C V}, TITLE = {Graph Representation Ensemble Learning}, BOOKTITLE = {IEEE International Conference on Advances in Social Networks Analysis and Mining}. YEAR = {2020}}
Representation learning on graphs has been gaining attention due to its wide applicability in predicting missing links and classifying and recommending nodes. Most embedding methods aim to preserve specific properties of the original graph in the low dimensional space. However, real-world graphs have a combination of several features that are difficult to characterize and capture by a single approach. In this work, we introduce the problem of graph representation ensemble learning and provide a first of its kind framework to aggregate multiple graph embedding methods efficiently. We provide analysis of our framework and analyze – theoretically and empirically – the dependence between state-of-the-art embedding methods. We test our models on the node classification task on four real-world graphs and show that proposed ensemble approaches can outperform the state-of-the-art methods by up to 20% on macro-F1. We further show that the strategy is even more beneficial for underrepresented classes with an improvement of up to 40%.
Dear Commissioner, please fix these: A scalable system for inspecting road infrastructure
Durga Nagendra Raghava Kumar M,B RANJITH REDDY,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2020
@inproceedings{bib_Dear_2020, AUTHOR = {Durga Nagendra Raghava Kumar M, B RANJITH REDDY, Jawahar C V}, TITLE = {Dear Commissioner, please fix these: A scalable system for inspecting road infrastructure}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2020}}
Inspecting and assessing the quality of traffic infrastructure (such as the state of signboards or road markings) is challenging for humans due to (i) the massive length of road networks in a country and (ii) the regular frequency at which inspection needs to be done. In this paper, we demonstrate a scalable system that uses computer vision for automatic inspection of road infrastructure from a simple video captured from a moving vehicle. We validated our method on 1500 km of roads captured in and around the city of Hyderabad, India. Qualitative and quantitative results demonstrate the feasibility, scalability and effectiveness of our solution.
Spatial Feedback Learning to Improve Semantic Segmentation in Hot Weather
Shyam Nandan Rai,Vineeth N Balasubramanian,Anbumani Subramanian,Jawahar C V
British Machine Vision Conference, BMVC, 2020
@inproceedings{bib_Spat_2020, AUTHOR = {Shyam Nandan Rai, Vineeth N Balasubramanian, Anbumani Subramanian, Jawahar C V}, TITLE = {Spatial Feedback Learning to Improve Semantic Segmentation in Hot Weather}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2020}}
High-temperature weather conditions induce geometrical distortions in images which can adversely affect the performance of a computer vision model performing downstream tasks such as semantic segmentation. The performance of such models has been shown to improve by adding a restoration network before a semantic segmentation network. The restoration network removes the geometrical distortions from the images and shows improved segmentation results. However, this approach suffers from a major architectural drawback: the restoration network does not learn directly from the errors of the segmentation network. In other words, the restoration network is not task aware. In this work, we propose a semantic feedback learning approach, which improves semantic segmentation by giving a feedback response to the restoration network. This response works as an attend-and-fix mechanism by focusing on those areas of an image where restoration needs improvement. We also propose two loss functions, Iterative Focal Loss (iFL) and Class-Balanced Iterative Focal Loss (CB-iFL), which are specifically designed to improve the performance of the feedback network. These losses focus more on those samples that are continuously misclassified over successive iterations. Our approach gives a gain of 17.41 mIoU over the standard segmentation model, including the additional gain of 1.9 mIoU with CB-iFL on the Cityscapes dataset.
Recurrent Image Annotation With Explicit Inter-Label Dependencies
Ayushi Dutta,Yashaswi Verma,Jawahar C V
European Conference on Computer Vision, ECCV, 2020
@inproceedings{bib_Recu_2020, AUTHOR = {Ayushi Dutta, Yashaswi Verma, Jawahar C V}, TITLE = {Recurrent Image Annotation With Explicit Inter-Label Dependencies}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2020}}
Inspired by the success of the CNN-RNN framework in the image captioning task, several works have explored this in multi-label image annotation with the hope that the RNN followed by a CNN would encode inter-label dependencies better than using a CNN alone. To do so, for each training sample, the earlier methods converted the ground-truth label-set into a sequence of labels based on their frequencies (e.g., rare-to-frequent) for training the RNN. However, since the ground-truth is an unordered set of labels, imposing a fixed and predefined sequence on them does not naturally align with this task. To address this, some of the recent papers have proposed techniques that are capable of training the RNN without feeding the ground-truth labels in a particular sequence/order. However, most of these techniques leave it to the RNN to implicitly choose one sequence for the ground-truth labels corresponding to each sample at the time of training, thus making it inherently biased. In this paper, we address this limitation and propose a novel approach in which the RNN is explicitly forced to learn multiple relevant inter-label dependencies, without the need of feeding the ground-truth in any particular order. Using thorough empirical comparisons, we demonstrate that our approach outperforms several state-of-the-art techniques on two popular datasets (MS-COCO and NUS-WIDE). Additionally, it provides a new perspective of looking at an unordered set of labels as equivalent to a collection of different permutations (sequences) of those labels, thus naturally aligning with the image annotation task. Our code is available.
Table Structure Recognition using Top-Down and Bottom-Up Cues
Sachin Raja,Ajoy Mondal,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2020
@inproceedings{bib_Tabl_2020, AUTHOR = {Sachin Raja, Ajoy Mondal, Jawahar C V}, TITLE = {Table Structure Recognition using Top-Down and Bottom-Up Cues}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2020}}
Tables are information-rich structured objects in document images. While significant work has been done in localizing tables as graphic objects in document images, only limited attempts exist on table structure recognition. Most existing literature on structure recognition depends on extraction of meta-features from the pdf document or on the optical character recognition (ocr) models to extract low-level layout features from the image. However, these methods fail to generalize well because of the absence of meta-features or errors made by the ocr when there is a significant variance in table layouts and text organization. In our work, we focus on tables that have complex structures, dense content, and varying layouts with no dependency on meta-features and/or ocr. We present an approach for table structure recognition that combines cell detection and interaction modules to localize the cells and predict their row and column associations with other detected cells. We incorporate structural constraints as additional differential components to the loss function for cell detection. We empirically validate our method on the publicly available real-world datasets - icdar-2013, icdar-2019 (ctdar) archival, unlv, scitsr, scitsr-comp, tablebank, and pubtabnet. Our attempt opens up a new direction for table structure recognition by combining top-down (table cells detection) and bottom-up (structure recognition) cues in visually understanding the tables.
Region Pooling with Adaptive Feature Fusion for End-to-End Person Recognition
Guntireddy Vijay Kumar,Anoop Namboodiri,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2020
@inproceedings{bib_Regi_2020, AUTHOR = {Guntireddy Vijay Kumar, Anoop Namboodiri, Jawahar C V}, TITLE = {Region Pooling with Adaptive Feature Fusion for End-to-End Person Recognition}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2020}}
Current approaches for person recognition train an ensemble of region specific convolutional neural networks for representation learning, and then adopt naive fusion strategies to combine their features or predictions during testing. In this paper, we propose a unified end-to-end architecture that generates a complete person representation based on pooling and aggregation of features from multiple body regions. Our network takes a person image and the predetermined locations of body regions as input, and generates common feature maps that are shared across all the regions. Multiple features corresponding to different regions are then pooled and combined with an aggregation block, where the adaptive weights required for aggregation are obtained through an attention mechanism. Evaluations on three person recognition datasets - PIPA, Soccer and Hannah - show that a single model trained end-to-end is computationally faster, requires fewer parameters and achieves improved performance over separately trained models.
Munich to Dubai: How far is it for Semantic Segmentation?
Shyam Nandan Rai,Vineeth N Balasubramanian,Anbumani Subramanian,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2020
@inproceedings{bib_Muni_2020, AUTHOR = {Shyam Nandan Rai, Vineeth N Balasubramanian, Anbumani Subramanian, Jawahar C V}, TITLE = {Munich to Dubai: How far is it for Semantic Segmentation?}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2020}}
Hot weather conditions in cities result in geometrical distortion, thereby adversely affecting the performance of semantic segmentation models. In this work, we study the problem of adapting a semantic segmentation model to such hot climate cities. This issue can be circumvented by collecting and annotating images in such weather conditions and training segmentation models on those images. But the task of semantically annotating images for every environment is painstaking and expensive. Hence, we propose a framework that improves the performance of semantic segmentation models without explicitly creating an annotated dataset for such adverse weather variations. Our framework consists of two parts, a restoration network to remove the geometrical distortions caused by hot weather and an adaptive segmentation network that is trained on an additional loss to adapt to the statistics of the ground-truth segmentation map. We train our framework on the Cityscapes dataset, which showed a total IoU gain of 12.707 over standard segmentation models. We also observe that the segmentation results obtained by our framework gave a significant improvement for small classes such as poles, person, and rider, which are essential and valuable for autonomous navigation based applications.
Few Shot Learning With No Labels
Aditya Bharti,NB Vineeth,Jawahar C V
Technical Report, arXiv, 2020
@inproceedings{bib_Few__2020, AUTHOR = {Aditya Bharti, NB Vineeth, Jawahar C V}, TITLE = {Few Shot Learning With No Labels}, BOOKTITLE = {Technical Report}. YEAR = {2020}}
Few-shot learners aim to recognize new categories given only a small number of training samples. The core challenge is to avoid overfitting to the limited data while ensuring good generalization to novel classes. Existing literature makes use of vast amounts of annotated data by simply shifting the label requirement from novel classes to base classes. Since data annotation is time-consuming and costly, reducing the label requirement even further is an important goal. To that end, our paper presents a more challenging few-shot setting where no label access is allowed during training or testing. By leveraging self-supervision for learning image representations and image similarity for classification at test time, we achieve competitive baselines while using zero labels, which is fewer than state-of-the-art approaches require. We hope that this work is a step towards developing few-shot learning methods which do not depend on annotated data at all. Our code will be publicly released.
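The label-free evaluation protocol described above reduces, at its simplest, to nearest-neighbour matching in a learned embedding space. The sketch below assumes the self-supervised embeddings are already computed and uses cosine similarity with toy vectors; it is an illustrative simplification, not the authors' exact pipeline.

# Minimal sketch (assumptions: NumPy only, precomputed embeddings, cosine-similarity matching).
import numpy as np

def most_similar(query, support):
    # Return the index of the support embedding most similar to the query.
    q = query / np.linalg.norm(query)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    return int(np.argmax(s @ q))

support_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # toy few-shot support set
query_emb = np.array([0.9, 0.1])                               # toy query embedding
print(most_similar(query_emb, support_embs))                   # prints 0: the closest support example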
Exploring Pair-Wise NMT for Indian Languages
Kartheek Akella,Sai Himal Allu,Sridhar Suresh Ragupathi,NAMAN SINGHAL,Zeeshan khan, Vinay P Namboodiri,Jawahar C V
Technical Report, arXiv, 2020
@inproceedings{bib_Expl_2020, AUTHOR = {Kartheek Akella, Sai Himal Allu, Sridhar Suresh Ragupathi, NAMAN SINGHAL, Zeeshan Khan, Vinay P Namboodiri, Jawahar C V}, TITLE = {Exploring Pair-Wise NMT for Indian Languages}, BOOKTITLE = {Technical Report}. YEAR = {2020}}
In this paper, we address the task of improving pair-wise machine translation for specific low resource Indian languages. Multilingual NMT models have demonstrated a reasonable amount of effectiveness on resource-poor languages. In this work, we show that the performance of these models can be significantly improved upon by using back-translation through a filtered back-translation process and subsequent fine-tuning on the limited pair-wise language corpora. The analysis in this paper suggests that this method can significantly improve a multilingual model's performance over its baseline, yielding state-of-the-art results for various Indian languages.
Bringing semantics into word image representation
Prasad Krishnan,Jawahar C V
Pattern Recognition, PR, 2020
@inproceedings{bib_Brin_2020, AUTHOR = {Prasad Krishnan, Jawahar C V}, TITLE = {Bringing semantics into word image representation}, BOOKTITLE = {Pattern Recognition}. YEAR = {2020}}
The shift from one-hot to distributed representation, popularly referred to as word embedding, has changed the landscape of natural language processing (nlp) and information retrieval (ir) communities. In the domain of document images, we have always appreciated the need for learning a holistic word image representation which is popularly used for the task of word spotting. The representations proposed for word spotting are different from word embedding in text since the latter captures the semantic aspects of the word, which is a crucial ingredient to numerous nlp and ir tasks. In this work, we attempt to encode the notion of semantics into word image representation by bringing the advancements from the textual domain. We propose two novel forms of representations where the first form is designed to be inflection invariant by focusing on the approximate linguistic root of the word, while the second form is built …
A lip sync expert is all you need for speech to lip generation in the wild
K R Prajwal,Rudrabha Mukhopadhyay,Vinay P Namboodiri,Jawahar C V
ACM international conference on Multimedia, ACMMM, 2020
@inproceedings{bib_A_li_2020, AUTHOR = {K R Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, Jawahar C V}, TITLE = {A lip sync expert is all you need for speech to lip generation in the wild}, BOOKTITLE = {ACM international conference on Multimedia}. YEAR = {2020}}
In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or videos of specific people seen during the training phase. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio. We identify key reasons pertaining to this and hence resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated by our Wav2Lip model is almost as good as real synced videos. We provide …
Weakly supervised instance segmentation by learning annotation consistent instances
ADITYA ARUN,Jawahar C V,M Pawan Kumar
European Conference on Computer Vision, ECCV, 2020
@inproceedings{bib_Weak_2020, AUTHOR = {ADITYA ARUN, Jawahar C V, M Pawan Kumar}, TITLE = {Weakly supervised instance segmentation by learning annotation consistent instances}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2020}}
Recent approaches for weakly supervised instance segmentation depend on two components: (i) a pseudo label generation model which provides instances that are consistent with a given annotation; and (ii) an instance segmentation model, which is trained in a supervised manner using the pseudo labels as ground-truth. Unlike previous approaches, we explicitly model the uncertainty in the pseudo label generation process using a conditional distribution. The samples drawn from our conditional distribution provide accurate pseudo labels due to the use of semantic class aware unary terms, boundary aware pairwise smoothness terms, and annotation aware higher order terms. Furthermore, we represent the instance segmentation model as an annotation agnostic prediction distribution. In contrast to previous methods, our representation allows us to define a joint probabilistic learning objective that minimizes the …
Revisiting Low Resource Status of Indian Languages in Machine Translation
JERIN PHILIP, Shashank Siripragada,Vinay P. Namboodiri,Jawahar C V
Technical Report, arXiv, 2020
@inproceedings{bib_Revi_2020, AUTHOR = {JERIN PHILIP, Shashank Siripragada, Vinay P. Namboodiri, Jawahar C V}, TITLE = {Revisiting Low Resource Status of Indian Languages in Machine Translation}, BOOKTITLE = {Technical Report}. YEAR = {2020}}
Indian language machine translation performance is hampered due to the lack of large scale multi-lingual sentence aligned corpora and robust benchmarks. Through this paper, we provide and analyse an automated framework to obtain such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module that is used to work with publicly available websites such as press releases by the government. The main contribution towards this effort is to obtain an incremental method that uses the above pipeline to iteratively improve the size of the corpus as well as improve each of the components of our system. Through our work, we also evaluate the design choices such as the choice of pivoting language and the effect of iterative incremental increase in corpus size. Our work in addition to providing an automated framework also results in generating a relatively larger corpus as compared to existing corpora that are available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks.
IIIT-AR-13K: a new dataset for graphical object detection in documents
Ajoy Mondal,Peter Lipps,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2020
@inproceedings{bib_IIIT_2020, AUTHOR = {Ajoy Mondal, Peter Lipps, Jawahar C V}, TITLE = {IIIT-AR-13K: a new dataset for graphical object detection in documents}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2020}}
We introduce a new dataset for graphical object detection in business documents, more specifically annual reports. This dataset, iiit-ar-13k, is created by manually annotating the bounding boxes of graphical or page objects in publicly available annual reports. This dataset contains a total of 13k annotated page images with objects in five different popular categories—table, figure, natural image, logo, and signature. It is the largest manually annotated dataset for graphical object detection. Annual reports created in multiple languages for several years from various companies bring high diversity into this dataset. We benchmark iiit-ar-13k dataset with two state of the art graphical object detection techniques using faster r-cnn [20] and mask r-cnn [11] and establish high baselines for further research. Our dataset is highly effective as training data for developing practical solutions for graphical object detection in both …
A Benchmark System for Indian Language Text Recognition
KRISHNA TULSYAN,Nimisha Srivastava,Ajoy Mondal,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2020
@inproceedings{bib_A_Be_2020, AUTHOR = {KRISHNA TULSYAN, Nimisha Srivastava, Ajoy Mondal, Jawahar C V}, TITLE = {A Benchmark System for Indian Language Text Recognition}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2020}}
The performance of various academic and commercial text recognition solutions for many languages world-wide has been satisfactory. Many projects now use the ocr as a reliable module. As of now, Indian languages are far away from this state, which is unfortunate. Beyond many challenges due to script and language, this space is adversely affected by the scattered nature of research, lack of systematic evaluation, and poor resource dissemination. In this work, we aim to design and implement a web-based system that could indirectly address some of these aspects that hinder the development of ocr for Indian languages. We hope that such an attempt will help in (i) providing and establishing a consolidated view of state-of-the-art performances for character and word recognition at one place, (ii) sharing resources and practices, and (iii) establishing standard benchmarks that clearly explain the capabilities and limitations …
Adapting OCR with Limited Supervision
Deepayan Das,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2020
@inproceedings{bib_Adap_2020, AUTHOR = {Deepayan Das, Jawahar C V}, TITLE = {Adapting OCR with Limited Supervision}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2020}}
Text recognition systems of today (aka OCRs) are mostly based on supervised learning of deep neural networks. Performance of these is limited by the type of data that is used for training. In the presence of diverse styles in the document images (e.g., fonts, print, writer, imaging process), creating a large amount of training data is impossible. In this paper, we explore the problem of adapting an existing OCR, already trained for a specific collection, to a new collection, with minimal supervision or human effort. We explore three popular strategies for this: (i) Fine Tuning, (ii) Self Training, and (iii) Fine Tuning + Self Training. We discuss details on how these popular approaches in Machine Learning can be adapted to the text recognition problem of our interest. We hope our empirical observations on two different languages will be of relevance to wider use cases in text recognition.
A Multi-Space Approach to Zero-Shot Object Detection
Aditya Anantharaman,Nehal Mamgain,Sowmya Kamath S,Vineeth N Balasubramanian,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2020
@inproceedings{bib_A_Mu_2020, AUTHOR = {Aditya Anantharaman, Nehal Mamgain, Sowmya Kamath S, Vineeth N Balasubramanian, Jawahar C V}, TITLE = {A Multi-Space Approach to Zero-Shot Object Detection}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2020}}
Object detection has been at the forefront for higher level vision tasks such as scene understanding and contextual reasoning. Therefore, solving object detection for a large number of visual categories is paramount. Zero-Shot Object Detection (ZSD) – where training data is not available for some of the target classes – provides semantic scalability to object detection and reduces dependence on large amounts of annotation, thus enabling a large number of applications in real-life scenarios. In this paper, we propose a novel multi-space approach to solve ZSD where we combine predictions obtained in two different search spaces. We learn the projection of visual features of proposals to the semantic embedding space and class labels in the semantic embedding space to visual space. We predict similarity scores in the individual spaces and combine them. We present promising results on two datasets, PASCAL VOC and MS COCO. We further discuss the problem of hubness and show that our approach alleviates hubness with a performance superior to previously proposed methods.
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
Prajwal K R,Rudrabha Mukhopadhyay,Vinay P. Namboodiri,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2020
@inproceedings{bib_Lear_2020, AUTHOR = {Prajwal K R, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, Jawahar C V}, TITLE = {Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2020}}
Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip-sequence-to-speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative metrics, qualitative metrics, and human evaluation shows that our method is four times more intelligible than previous works in this space.
Guest Editorial: Special Issue on ACCV 2018
Jawahar C V,Hongdong Li,Greg Mor,Konrad Schindler
International Journal of Computer Vision, IJCV, 2020
@inproceedings{bib_Gues_2020, AUTHOR = {Jawahar C V, Hongdong Li, Greg Mor, Konrad Schindler}, TITLE = {Guest Editorial: Special Issue on ACCV 2018}, BOOKTITLE = {International Journal of Computer Vision}. YEAR = {2020}}
A Multilingual Parallel Corpora Collection Effort for Indian Languages
S SHASHANK,JERIN PHILIP,Vinay P. Namboodiri,Jawahar C V
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_A_Mu_2020, AUTHOR = {S SHASHANK, JERIN PHILIP, Vinay P. Namboodiri, Jawahar C V}, TITLE = {A Multilingual Parallel Corpora Collection Effort for Indian Languages}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2020}}
We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, and Punjabi) and English, many of which are categorized as low resource. The corpora are compiled from online sources which have content shared across languages. The corpora presented significantly extend existing resources that are either not large enough or are restricted to a specific domain (such as health). We also provide a separate test corpus compiled from an independent online source that can be used for validating performance in the 10 Indian languages. Alongside, we report on the methods of constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks.
IndicSpeech: Text-to-Speech Corpus for Indian Languages
Nimisha Srivastava,Rudrabha Mukhopadhyay,Prajwal K R,Jawahar C V
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_Indi_2020, AUTHOR = {Nimisha Srivastava, Rudrabha Mukhopadhyay, Prajwal K R, Jawahar C V}, TITLE = {IndicSpeech: Text-to-Speech Corpus for Indian Languages}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2020}}
India is a country where several tens of languages are spoken by a population of over a billion. Text-to-speech systems for such languages will thus be extremely beneficial for widespread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a 24-hour text-to-speech corpus for 3 major Indian languages, namely Hindi, Malayalam, and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performance. The collected corpus, code, and trained models are made publicly available.
Machine Learning for Accurate Force Calculations in Molecular Dynamics Simulations
PUNYASLOK PATTNAIK,Shampa Raghunathan,K TARUN TEJA,Prabhakar Bhimalapuram,Jawahar C V,Deva Priyakumar U
Journal of Physical Chemistry A, PCA, 2020
@inproceedings{bib_Mach_2020, AUTHOR = {PUNYASLOK PATTNAIK, Shampa Raghunathan, K TARUN TEJA, Prabhakar Bhimalapuram, Jawahar C V, Deva Priyakumar U}, TITLE = {Machine Learning for Accurate Force Calculations in Molecular Dynamics Simulations}, BOOKTITLE = {Journal of Physical Chemistry A}. YEAR = {2020}}
The computationally expensive nature of ab initio molecular dynamics simulations severely limits their ability to simulate large system sizes and long time scales, both of which are necessary to imitate experimental conditions. In this work, we explore an approach to make use of the data obtained using quantum mechanical density functional theory (DFT) on small systems and use deep learning to subsequently simulate large systems, taking liquid argon as a test case. A suitable vector representation was chosen to represent the surrounding environment of each Ar atom, and a Δ-NetFF machine learning model was introduced, in which the neural network is trained to predict the difference between the resultant forces obtained by DFT and by classical force fields. Molecular dynamics simulations were then performed using forces from the neural network for various system sizes and time scales, depending on the properties we calculated. A comparison of properties obtained from the classical force field and the neural network model is presented alongside available experimental data to validate the proposed method.
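The delta-learning idea described here, learning a correction from cheap classical forces toward DFT-quality forces, can be sketched as follows. The per-atom descriptor, the stand-in data, and the scikit-learn regressor are placeholders, not the representation or network used in the study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy sketch of delta learning: train a regressor to predict the difference
# between reference (DFT) forces and classical force-field forces from a
# per-atom environment descriptor.

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(1000, 32))          # per-atom environment vectors (placeholder)
forces_dft = rng.normal(size=(1000, 3))            # reference forces (stand-in data)
forces_classical = forces_dft + rng.normal(scale=0.3, size=(1000, 3))

delta_model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
delta_model.fit(descriptors, forces_dft - forces_classical)

# At simulation time, correct the cheap classical forces with the learned delta.
corrected = forces_classical + delta_model.predict(descriptors)
```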
RoadText-1K: Text Detection & Recognition Dataset for Driving Videos
Sangeeth Reddy Battu,MINESH MATHEW,Lluis Gomez,Marcal Rusinol,Dimosthenis Karatzas,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2020
@inproceedings{bib_Road_2020, AUTHOR = {Sangeeth Reddy Battu, MINESH MATHEW, Lluis Gomez, Marcal Rusinol, Dimosthenis Karatzas, Jawahar C V}, TITLE = {RoadText-1K: Text Detection & Recognition Dataset for Driving Videos}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2020}}
Perceiving text is crucial to understand semantics of outdoor scenes and hence is a critical requirement to build intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images and are mostly compiled keeping text in mind. This paper introduces the new "RoadText-1K" dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text, with annotations for text bounding boxes and transcriptions in every frame. State-of-the-art methods for text detection, recognition and tracking are evaluated on the new dataset, and the results signify the challenges in unconstrained driving videos compared to existing datasets. This suggests that RoadText-1K is suited for research and development of reading systems robust enough to be incorporated into more complex downstream tasks like driver assistance and self-driving. The dataset can be found at http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtext-1k
DocVQA: A Dataset for VQA on Document Images
MINESH MATHEW,Dimosthenis Karatzas,R. Manmatha,Jawahar C V
Technical Report, arXiv, 2020
@inproceedings{bib_DocV_2020, AUTHOR = {MINESH MATHEW, Dimosthenis Karatzas, R. Manmatha, Jawahar C V}, TITLE = {DocVQA: A Dataset for VQA on Document Images}, BOOKTITLE = {Technical Report}. YEAR = {2020}}
We present a new dataset for Visual Question Answering on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. We provide a detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models particularly need to improve on questions where understanding the structure of the document is crucial.
Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval
Siddhant Bansal,PRAVEEN KRISHNAN,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2020
@inproceedings{bib_Fuse_2020, AUTHOR = {Siddhant Bansal, PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2020}}
Recognition and retrieval of textual content from large document collections have been a powerful use case for the document image analysis community. Often the word is the basic unit for recognition as well as retrieval. Systems that rely only on the text recogniser (OCR) output are not robust enough in many situations, especially when the word recognition rates are poor, as in the case of historic documents or digital libraries. An alternative has been word spotting based methods that retrieve/match words based on a holistic representation of the word. In this paper, we fuse the noisy output of a text recogniser with a deep embeddings representation derived from the entire word. We use average and max fusion for improving the ranked results in the case of retrieval. We validate our methods on a collection of Hindi documents. We improve the word recognition rate by 1.4% and retrieval mAP by 11.13%.
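The average and max fusion of recogniser-based and embedding-based similarities can be sketched as follows; the per-candidate score normalization and the example values are illustrative assumptions, not the exact scoring used in the paper.

```python
import numpy as np

# Sketch of score fusion for retrieval: combine a text-recogniser-based
# similarity (e.g. from edit distance on recognised strings) with a
# deep-embedding similarity, using average or max fusion to re-rank candidates.

def fuse_scores(ocr_scores, embedding_scores, mode="average"):
    """ocr_scores, embedding_scores: per-candidate similarities in [0, 1]."""
    ocr_scores = np.asarray(ocr_scores, dtype=float)
    embedding_scores = np.asarray(embedding_scores, dtype=float)
    if mode == "average":
        fused = (ocr_scores + embedding_scores) / 2.0
    elif mode == "max":
        fused = np.maximum(ocr_scores, embedding_scores)
    else:
        raise ValueError(f"unknown fusion mode: {mode}")
    return np.argsort(-fused)  # candidate indices, best first

ranking = fuse_scores([0.2, 0.9, 0.5], [0.8, 0.4, 0.6], mode="average")
print(ranking)  # [1 2 0] for the averaged scores above
```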
Human-Machine Collaboration for Face Recognition
Saurabh Ravindranath,Rahul Baburaj,Vineeth N. Balasubramanian,Nageswararao Namburu,Sujit P Gujar,Jawahar C V
India Joint International Conference on. Data Science & Management of Data, COMAD/CODS, 2020
@inproceedings{bib_Huma_2020, AUTHOR = {Saurabh Ravindranath, Rahul Baburaj, Vineeth N. Balasubramanian, Nageswararao Namburu, Sujit P Gujar, Jawahar C V}, TITLE = {Human-Machine Collaboration for Face Recognition}, BOOKTITLE = {India Joint International Conference on. Data Science & Management of Data}. YEAR = {2020}}
Despite advances in deep learning and facial recognition techniques, the problem of fault-intolerant facial recognition remains challenging. With the current state of progress in the field of automatic face recognition and the infeasibility of fully manual recognition, the situation calls for human-machine collaborative methods. We design a system that uses machine predictions for a given face to generate queries that are answered by human experts to provide the system with the information required to predict the identity of the face correctly. We use a Markov Decision Process, for which we devise an appropriate query structure and a reward structure to generate these queries in a budget- or accuracy-constrained setting. Finally, as we do not know the capabilities of the human experts involved, we model each human as a bandit and adopt a multi-armed bandit approach with consensus queries to efficiently estimate …
Indian Plant Recognition in the Wild
Muthireddy Vamsidhar,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics, NCVPRIPG, 2019
@inproceedings{bib_Indi_2019, AUTHOR = {Muthireddy Vamsidhar, Jawahar C V}, TITLE = {Indian Plant Recognition in the Wild}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics}. YEAR = {2019}}
Conservation efforts to protect biodiversity rely on an accurate identification process. In the case of plant identification, the traditional methods used are manual, time-consuming and require a degree of expertise to operate. As a result, there is an increasing interest today in an automated plant identification system. Such a system can help in aiding plant-related education, promoting ecotourism, and creating a digital heritage for plant species, among many other uses. We propose a solution using modern convolutional neural network architectures which achieves state-of-the-art performance for plant classification in the wild. An exhaustive set of experiments is performed to classify 112 species of plants from the challenging Indic-Leaf dataset. The best performing model gives a Top-1 precision of 90.08 and a Top-5 precision of 96.90.
Learning To Generate Atmospheric Turbulent Images
Shyam Nandan Rai,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics, NCVPRIPG, 2019
@inproceedings{bib_Lear_2019, AUTHOR = {Shyam Nandan Rai, Jawahar C V}, TITLE = {Learning To Generate Atmospheric Turbulent Images}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics}. YEAR = {2019}}
Modeling atmospheric turbulence is a challenging problem since the light rays arbitrarily bend before entering the camera. Such models are critical to extend computer vision solutions developed in the laboratory to real-world use cases. Simulating atmospheric turbulence by using statistical models or by computer graphics is often computationally expensive. To overcome this problem, we train a generative adversarial network which outputs an atmospheric turbulent image while utilizing fewer computational resources than traditional methods. We propose a novel loss function to efficiently learn the atmospheric turbulence at a finer level. Experiments show that by using the proposed loss function, our network outperforms the existing state-of-the-art image-to-image translation network in turbulent image generation. We also perform extensive ablation studies on the loss function to demonstrate the improvement in the perceptual quality of turbulent images.
CVIT’s Submissions to WAT-2019
JERIN PHILIP,SIRIPRAGADA SHASHANK,Upendra Kumar,Vinay P. Namboodiri,Jawahar C V
Workshop on Asian Translation, WAT, 2019
@inproceedings{bib_CVIT_2019, AUTHOR = {JERIN PHILIP, SIRIPRAGADA SHASHANK, Upendra Kumar, Vinay P. Namboodiri, Jawahar C V}, TITLE = {CVIT’s Submissions to WAT-2019}, BOOKTITLE = {Workshop on Asian Translation}. YEAR = {2019}}
This paper describes the Neural Machine Translation systems used by IIIT Hyderabad (CVIT-MT) for the translation tasks that are part of WAT-2019. We participated in tasks pertaining to Indian languages and submitted results for the English-Hindi, Hindi-English, English-Tamil and Tamil-English language pairs. We employ the Transformer architecture, experimenting with multilingual models and methods for low-resource languages.
Towards accurate handwritten word recognition for Hindi and Bangla
KARTIK DUTTA,PRAVEEN KRISHNAN,MINESH MATHEW,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2019
@inproceedings{bib_Towa_2019, AUTHOR = {KARTIK DUTTA, PRAVEEN KRISHNAN, MINESH MATHEW, Jawahar C V}, TITLE = {Towards accurate handwritten word recognition for Hindi and Bangla}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2019}}
Building accurate lexicon-free handwritten text recognizers for Indic languages is a challenging task, mostly due to the inherent complexities in Indic scripts in addition to the cursive nature of handwriting. In this work, we demonstrate an end-to-end trainable CNN-RNN hybrid architecture which takes inspiration from recent advances in using residual blocks for training convolutional layers, along with the inclusion of a spatial transformer layer to learn a model invariant to the geometric distortions present in handwriting. In this work, we focus on building state-of-the-art handwritten word recognizers for two popular Indic scripts, Devanagari and Bangla. To address the need for large-scale training data for such low-resource languages, we utilize synthetically rendered data for pre-training the network and later fine-tune it on the real data. We outperform the previous lexicon based, state-of-the-art methods on the test set …
CVIT’s Submissions to WAT-2019
JERIN PHILIP,SHASHANK S,Upendra Kumar,Vinay P. Namboodiri,Jawahar C V
Conference of the Association of Computational Linguistics, ACL, 2019
@inproceedings{bib_CVIT_2019, AUTHOR = {JERIN PHILIP, SHASHANK S, Upendra Kumar, Vinay P. Namboodiri, Jawahar C V}, TITLE = {CVIT’s Submissions to WAT-2019}, BOOKTITLE = {Conference of the Association of Computational Linguistics}. YEAR = {2019}}
This paper describes the Neural Machine Translation systems used by IIIT Hyderabad (CVIT-MT) for the translation tasks part of WAT-2019. We participated in tasks pertaining to Indian languages and submitted results for English-Hindi, Hindi-English, English-Tamil and Tamil-English language pairs. We employ Transformer architecture experimenting with multilingual models and methods for low-resource languages.
ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard
Xi Liu,Rui Zhang,Yongsheng Zhou, Qianyi Jian,Qi Song,Nan Li,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2019
@inproceedings{bib_ICDA_2019, AUTHOR = {Xi Liu, Rui Zhang, Yongsheng Zhou, Qianyi Jian, Qi Song, Nan Li, Jawahar C V}, TITLE = {ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2019}}
Chinese scene text reading is one of the most challenging problems in computer vision and has attracted great interest. Different from English text, Chinese has more than 6000 commonly used characters, and Chinese characters can be arranged in various layouts with numerous fonts. The Chinese signboards in street view are a good choice for Chinese scene text images since they have different backgrounds, fonts and layouts. We organized a competition called ICDAR2019-ReCTS, which mainly focuses on reading Chinese text on signboards. This report presents the final results of the competition. A large-scale dataset of 25,000 annotated signboard images, in which all the text lines and characters are annotated with locations and transcriptions, was released. Four tasks, namely character recognition, text line recognition, text line detection and end-to-end recognition, were set up. Besides, considering the Chinese text ambiguity issue, we proposed a multi ground truth (multi-GT) evaluation method to make evaluation fairer. The competition started on March 1, 2019 and ended on April 30, 2019. 262 submissions from 46 teams were received. Most of the participants come from universities, research institutes, and tech companies in China. There are also some participants from the United States, Australia, Singapore, and Korea. 21 teams submitted results for Task 1, 23 teams submitted results for Task 2, 24 teams submitted results for Task 3, and 13 teams submitted results for Task 4. The official website for the competition is http://rrc.cvc.uab.es/?ch=12.
Universal semi-supervised semantic segmentation
Tarun Kalluri,Girish Varma,Manmohan Chandraker,Jawahar C V
International Conference on Computer Vision, ICCV, 2019
@inproceedings{bib_Univ_2019, AUTHOR = {Tarun Kalluri, Girish Varma, Manmohan Chandraker, Jawahar C V}, TITLE = {Universal semi-supervised semantic segmentation}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2019}}
In recent years, the need for semantic segmentation has arisen across several different applications and environments. However, the expense and redundancy of annotation often limit the quantity of labels available for training in any domain, while deployment is easier if a single model works well across domains. In this paper, we pose the novel problem of universal semi-supervised semantic segmentation and propose a solution framework, to meet the dual needs of lower annotation and deployment costs. In contrast to counterpoints such as fine tuning, joint training or unsupervised domain adaptation, universal semi-supervised segmentation ensures that across all domains: (i) a single model is deployed, (ii) unlabeled data is used, (iii) performance is improved, (iv) only a few labels are needed and (v) label spaces may differ. To address this, we minimize supervised as well as within- and cross-domain unsupervised losses, introducing a novel feature alignment objective based on pixel-aware entropy regularization for the latter. We demonstrate quantitative advantages over other approaches on several combinations of segmentation datasets across different geographies (Germany, England, India) and environments (outdoors, indoors), as well as qualitative insights on the aligned representations.
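As an illustration of the entropy-regularization idea mentioned for the unlabeled data, the sketch below computes a per-pixel prediction-entropy penalty on segmentation logits. This is a generic formulation under assumed tensor shapes, not the paper's exact pixel-aware alignment objective.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: a pixel-wise entropy regularisation term on unlabeled
# images, encouraging low-entropy (confident) pixel predictions.

def pixel_entropy_loss(logits):
    # logits: (batch, num_classes, H, W) from the segmentation head
    probs = F.softmax(logits, dim=1)
    log_probs = F.log_softmax(logits, dim=1)
    entropy_map = -(probs * log_probs).sum(dim=1)   # (batch, H, W), per-pixel entropy
    return entropy_map.mean()

unlabeled_logits = torch.randn(2, 19, 64, 64)       # e.g. 19 Cityscapes-style classes
loss = pixel_entropy_loss(unlabeled_logits)         # added to the supervised loss with a weight
```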
Dissimilarity coefficient based weakly supervised object detection
ADITYA ARUN,Jawahar C V,M. Pawan Kumar
Computer Vision and Pattern Recognition, CVPR, 2019
@inproceedings{bib_Diss_2019, AUTHOR = {ADITYA ARUN, Jawahar C V, M. Pawan Kumar }, TITLE = {Dissimilarity coefficient based weakly supervised object detection}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2019}}
We consider the problem of weakly supervised object detection, where the training samples are annotated using only image-level labels that indicate the presence or absence of an object category. In order to model the uncertainty in the location of the objects, we employ a dissimilarity coefficient based probabilistic learning objective. The learning objective minimizes the difference between an annotation agnostic prediction distribution and an annotation aware conditional distribution. The main computational challenge is the complex nature of the conditional distribution, which consists of terms over hundreds or thousands of variables. The complexity of the conditional distribution rules out the possibility of explicitly modeling it. Instead, we exploit the fact that deep learning frameworks rely on stochastic optimization. This allows us to use a state-of-the-art discrete generative model that can provide annotation consistent samples from the conditional distribution. Extensive experiments on the PASCAL VOC 2007 and 2012 data sets demonstrate the efficacy of our proposed approach.
Improved road connectivity by joint learning of orientation and segmentation
Anil Kumar Batra,Suriya Singh,Guan Pang,Saikat Basu,Manohar Paluri,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2019
@inproceedings{bib_Impr_2019, AUTHOR = {Anil Kumar Batra, Suriya Singh, Guan Pang, Saikat Basu, Manohar Paluri, Jawahar C V}, TITLE = {Improved road connectivity by joint learning of orientation and segmentation}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2019}}
Road network extraction from satellite images often produces fragmented road segments, leading to road maps unfit for real applications. Pixel-wise classification fails to predict topologically correct and connected road masks due to the absence of connectivity supervision and the difficulty in enforcing topological constraints. In this paper, we propose a connectivity task called Orientation Learning, motivated by the human behavior of annotating roads by tracing them at a specific orientation. We also develop a stacked multi-branch convolutional module to effectively utilize the mutual information between the orientation learning and segmentation tasks. These contributions ensure that the model predicts topologically correct and connected road masks. We also propose a Connectivity Refinement approach to further enhance the estimated road networks. The refinement model is pre-trained to connect and refine the corrupted ground-truth masks and later fine-tuned to enhance the predicted road masks. We demonstrate the advantages of our approach on two diverse road extraction datasets, SpaceNet [30] and DeepGlobe [11]. Our approach improves over the state-of-the-art techniques by 9% and 7.5% in the road topology metric on SpaceNet and DeepGlobe, respectively.
Scene text visual question answering
Ali Furkan Biten,Ruben Tito,Andres Mafla,Lluis Gomez,Marçal Rusiñol,Ernest Valveny,Jawahar C V,Dimosthenis Karatzas
International Conference on Computer Vision, ICCV, 2019
@inproceedings{bib_Scen_2019, AUTHOR = {Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, Jawahar C V, Dimosthenis Karatzas}, TITLE = {Scene text visual question answering}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2019}}
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset, and set the scene for further research.
Region-based active learning for efficient labeling in semantic segmentation
Kasarla Tejaswi,NAGENDAR. G,Guruprasad M. Hegde,V. Balasubramanian,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2019
@inproceedings{bib_Regi_2019, AUTHOR = {Kasarla Tejaswi, NAGENDAR. G, Guruprasad M. Hegde, V. Balasubramanian, Jawahar C V}, TITLE = {Region-based active learning for efficient labeling in semantic segmentation}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2019}}
As vision-based autonomous systems, such as self-driving vehicles, become a reality, there is an increasing need for large annotated datasets for developing solutions to vision tasks. One important task that has seen significant interest in recent years is semantic segmentation. However, the cost of annotating every pixel for semantic segmentation is immense, and can be prohibitive in scaling to various settings and locations. In this paper, we propose a region-based active learning method for efficient labeling in semantic segmentation. Using the proposed active learning strategy, we show that we are able to judiciously select the regions for annotation such that we obtain 93.8% of the baseline performance (when all pixels are labeled) with labeling of only 10% of the total number of pixels. Further, we show that this approach can be used to transfer annotations from a model trained on a given dataset (Cityscapes) to a different dataset (Mapillary), thus highlighting its promise and potential.
Low-Cost Transfer Learning of Face Tasks
THRUPTHI ANN JOHN,ISHA DUA,Vineeth N Balasubramanian,Jawahar C V
Technical Report, arXiv, 2019
@inproceedings{bib_Low-_2019, AUTHOR = {THRUPTHI ANN JOHN, ISHA DUA, Vineeth N Balasubramanian, Jawahar C V}, TITLE = {Low-Cost Transfer Learning of Face Tasks}, BOOKTITLE = {Technical Report}. YEAR = {2019}}
Do we know what the different filters of a face network represent? Can we use this filter information to train other tasks without transfer learning? For instance, can age, head pose, emotion and other face related tasks be learned from a face recognition network without transfer learning? Understanding the role of these filters allows us to transfer knowledge across tasks and take advantage of large data sets in related tasks. Given a pretrained network, we can infer which tasks the network generalizes for and the best way to transfer the information to a new task. We demonstrate a computationally inexpensive algorithm to reuse the filters of a face network for a task it was not trained for. Our analysis shows that these attributes can be extracted with an accuracy comparable to what is obtained with transfer learning, but 10 times faster. We show that the information about other tasks is present in a relatively small number of filters. We use these insights to do task-specific pruning of a pretrained network. Our method gives significant compression ratios, with a 95% reduction in size and a 60% reduction in computation.
Spotting words in silent speech videos: a retrieval-based approach
Abhishek Jha,Vinay P. Namboodiri,Jawahar C V
Machine Vision and Applications, MVA, 2019
@inproceedings{bib_Spot_2019, AUTHOR = {Abhishek Jha, Vinay P. Namboodiri, Jawahar C V}, TITLE = {Spotting words in silent speech videos: a retrieval-based approach}, BOOKTITLE = {Machine Vision and Applications}. YEAR = {2019}}
Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to the limited vocabulary and high dependency on the model's recognition performance. Our contribution is twofold: (1) we develop a pipeline for recognition-free retrieval and show its performance against recognition-based retrieval on a large-scale dataset and another set of out-of-vocabulary words. (2) We introduce a query expansion technique using pseudo-relevant feedback and propose a novel re-ranking method based on maximizing the correlation between spatiotemporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision than the recognition-based method on the large-scale LRW dataset. We also demonstrate the application of the method by word spotting in a popular speech video ("The Great Dictator" by Charlie Chaplin), showing that word retrieval can perhaps be used to understand what was spoken in silent movies. Finally, we compare our model against ASR in a noisy environment and analyze the effect of the performance of the underlying lip-reader and input video quality on the proposed word spotting pipeline.
CVIT-MT Systems for WAT-2018
JERIN PHILIP,Vinay P. Namboodiri,Jawahar C V
Workshop on Asian Translation, WAT, 2019
@inproceedings{bib_CVIT_2019, AUTHOR = {JERIN PHILIP, Vinay P. Namboodiri, Jawahar C V}, TITLE = {CVIT-MT Systems for WAT-2018}, BOOKTITLE = {Workshop on Asian Translation}. YEAR = {2019}}
This document describes the machine translation system used in the submissions of IIIT Hyderabad (CVIT-MT) for the WAT-2018 English-Hindi translation task. Performance is evaluated on the associated corpus provided by the organizers. We experimented with convolutional sequence-to-sequence architectures. We also train with additional data obtained through back-translation.
A deep learning approach for robust corridor following from an arbitrary pose
Vishnu Sashank Dorbala,A.H. Abdul Hafez,Jawahar C V
Signal Processing and Communications Applications Conference, SIU, 2019
@inproceedings{bib_A_de_2019, AUTHOR = {Vishnu Sashank Dorbala, A.H. Abdul Hafez, Jawahar C V}, TITLE = {A deep learning approach for robust corridor following from an arbitrary pose}, BOOKTITLE = {Signal Processing and Communications Applications Conference}. YEAR = {2019}}
For an autonomous corridor following task where the environment is continuously changing, several forms of environmental noise prevent an automated feature extraction procedure from performing reliably. Moreover, in cases where pre-defined features are absent from the captured data, a well-defined control signal for performing the servoing task cannot be produced. In order to overcome these drawbacks, we present in this work a convolutional neural network (CNN) that directly estimates the required control signal from an image, encompassing feature extraction and control law computation in one single end-to-end framework. In particular, we study the task of autonomous corridor following using a CNN and present clear advantages in cases where a traditional method used for performing the same task fails to give a reliable outcome. We evaluate the performance of our method on this task on a Wheelchair Platform developed at our institute for this purpose.
Scene Text Visual Question Answering
Ali Furkan Biten,Ruben Tito ,Andres Mafla,Lluis Gomez,Marçal Rusiñol,MINESH MATHEW,Jawahar C V,Ernest Valveny,Dimosthenis Karatzas
Technical Report, arXiv, 2019
@inproceedings{bib_Scen_2019, AUTHOR = {Ali Furkan Biten, Ruben Tito , Andres Mafla, Lluis Gomez, Marçal Rusiñol, MINESH MATHEW, Jawahar C V, Ernest Valveny, Dimosthenis Karatzas}, TITLE = {Scene Text Visual Question Answering}, BOOKTITLE = {Technical Report}. YEAR = {2019}}
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset, and set the scene for further research.
Cross-language Speech Dependent Lip-synchronization
Abhishek Jha,Vikram Voleti,Vinay Namboodiri,Jawahar C V
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2019
@inproceedings{bib_Cros_2019, AUTHOR = {Abhishek Jha, Vikram Voleti, Vinay Namboodiri, Jawahar C V}, TITLE = {Cross-language Speech Dependent Lip-synchronization}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2019}}
Understanding videos of people speaking across international borders is hard as audiences from different demographics do not understand the language. Such speech videos are often supplemented with language subtitles. However, these hamper the viewing experience as the attention is shared. Simple audio dubbing in a different language makes the video appear unnatural due to unsynchronized lip motion. In this paper, we propose a system for automated cross-language lip synchronization for re-dubbed videos. Our model generates superior photorealistic lip-synchronization over the original video in comparison to the current re-dubbing method. With the help of a user-based study, we verify that our method is preferred over unsynchronized videos.
AutoRate: How attentive is the driver?
ISHA DUA,Akshay Uttama Nambi,Jawahar C V,Venkat Padmanabhan
International Conference on Automatic Face and Gesture Recognition, FG, 2019
@inproceedings{bib_Auto_2019, AUTHOR = {ISHA DUA, Akshay Uttama Nambi, Jawahar C V, Venkat Padmanabhan}, TITLE = {AutoRate: How attentive is the driver?}, BOOKTITLE = {International Conference on Automatic Face and Gesture Recognition}. YEAR = {2019}}
Driver inattention is one of the leading causes of vehicle crashes and incidents worldwide. Driver inattention includes driver fatigue leading to drowsiness and driver distraction, say due to use of a cellphone or rubbernecking, all of which leads to a lack of situational awareness. Hitherto, techniques presented to monitor driver attention evaluated factors such as fatigue and distraction independently. However, in order to develop a robust driver attention monitoring system, all the factors affecting the driver's attention need to be analyzed holistically. In this paper, we present AutoRate, a system that leverages the front camera of a windshield-mounted smartphone to monitor the driver's attention by combining several features. We derive a driver attention rating by fusing spatio-temporal features based on the driver state and behavior such as head pose, eye gaze, eye closure, yawns, use of cellphones, etc. We perform extensive evaluation of AutoRate on real-world driving data and also on data from controlled, static vehicle settings with 30 drivers in a large city. We compare AutoRate's automatically-generated rating with the scores given by 5 human annotators. Further, we compute the agreement between AutoRate's rating and the human annotator ratings using the kappa coefficient. AutoRate's automatically-generated rating has an overall agreement of 0.87 with the ratings provided by 5 human annotators on the static dataset.
Beyond supervised learning: A computer vision perspective
Anbumani Subramanian,Vineeth N Balasubramanian,Jawahar C V
Journal on Indian Institute of Science, IIS, 2019
@inproceedings{bib_Beyo_2019, AUTHOR = {Anbumani Subramanian, Vineeth N Balasubramanian, Jawahar C V}, TITLE = {Beyond supervised learning: A computer vision perspective}, BOOKTITLE = {Journal on Indian Institute of Science}. YEAR = {2019}}
Fully supervised deep learning-based methods have created a profound impact in various fields of computer science. Compared to classical methods, supervised deep learning-based techniques face scalability issues as they require huge amounts of labeled data and, more significantly, are unable to generalize to multiple domains and tasks. In recent years, a lot of research has been targeted towards addressing these issues within the deep learning community. Although there have been extensive surveys on learning paradigms such as semi-supervised and unsupervised learning, there are few timely reviews after the emergence of deep learning. In this paper, we provide an overview of the contemporary literature surrounding alternatives to fully supervised learning in the deep learning context. First, we summarize the relevant techniques that fall between the paradigm of supervised and unsupervised learning. Second, we take autonomous navigation as a running example to explain and compare different models. Finally, we highlight some shortcomings of current methods and suggest future directions.
ICDAR 2019 Competition on Scene Text Visual Question Answering
Ali Furkan Biten,Ruben Tito,Andres Mafla,Lluis Gomez,Marçal Rusiñol,MINESH MATHEW,Jawahar C V,Ernest Valveny,Dimosthenis Karatzas
International Conference on Document Analysis and Recognition, ICDAR, 2019
@inproceedings{bib_ICDA_2019, AUTHOR = {Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, MINESH MATHEW, Jawahar C V, Ernest Valveny, Dimosthenis Karatzas}, TITLE = {ICDAR 2019 Competition on Scene Text Visual Question Answering}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2019}}
This paper presents the final results of the ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question / answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, that require reading the text in a scene and understanding it in the context of the scene, to correctly answer a given question. A novel evaluation metric is presented, which elegantly assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone to consider towards a path of more robust and general models that can exploit scene text to achieve holistic image understanding.
Self-supervised visual representations for cross-modal retrieval
Yash Patel,Lluis Gomez,Marçal Rusiñol,Dimosthenis Karatzas,Jawahar C V
International Conference on Multimedia Retrieval, ICMR, 2019
@inproceedings{bib_Self_2019, AUTHOR = {Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, Jawahar C V}, TITLE = {Self-supervised visual representations for cross-modal retrieval}, BOOKTITLE = {International Conference on Multimedia Retrieval}. YEAR = {2019}}
Cross-modal retrieval methods have been significantly improved in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are limited to discrete sets of popular visual classes that may not be representative of the richer semantics found on large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists in training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration, and (2) the semantic context of its caption. Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like classification, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.
Pan-Renal Cell Carcinoma classification and survival prediction from histopathology images using deep learning
Vinod Palakkad Krishnanunni,Jawahar C V
NPG Nature Scientific Reports, NPG, 2019
@inproceedings{bib_Pan-_2019, AUTHOR = {Vinod Palakkad Krishnanunni, Jawahar C V}, TITLE = {Pan-Renal Cell Carcinoma classification and survival prediction from histopathology images using deep learning}, BOOKTITLE = {NPG Nature Scientific Reports}. YEAR = {2019}}
Histopathological images contain morphological markers of disease progression that have diagnostic and predictive values. In this study, we demonstrate how a deep learning framework can be used for automatic classification of Renal Cell Carcinoma (RCC) subtypes, and for identification of features that predict survival outcome from digital histopathological images. Convolutional neural networks (CNNs) trained on whole-slide images distinguish clear cell and chromophobe RCC from normal tissue with classification accuracies of 93.39% and 87.34%, respectively. Further, a CNN trained to distinguish clear cell, chromophobe and papillary RCC achieves a classification accuracy of 94.07%. Here, we introduced a novel support vector machine-based method that helped to break the multi-class classification task into multiple binary classification tasks, which not only improved the performance of the model but also helped to deal with data imbalance. Finally, we extracted the morphological features from high-probability tumor regions identified by the CNN to predict the patient survival outcome of the most common clear cell RCC. The generated risk index, based on both tumor shape and nuclei features, is significantly associated with patient survival outcome. These results highlight that deep learning can play a role in both cancer diagnosis and prognosis.
A baseline neural machine translation system for indian languages
JERIN PHILIP,Vinay P. Namboodiri,Jawahar C V
Technical Report, arXiv, 2019
@inproceedings{bib_A_ba_2019, AUTHOR = {JERIN PHILIP, Vinay P. Namboodiri, Jawahar C V}, TITLE = {A baseline neural machine translation system for indian languages}, BOOKTITLE = {Technical Report}. YEAR = {2019}}
We present a simple, yet effective, Neural Machine Translation system for Indian languages. We demonstrate the feasibility for multiple language pairs, and establish a strong baseline for further research.
A Cost Efficient Approach to Correct OCR Errors in Large Document Collections
Deepayan Das,JERIN PHILIP,MINESH MATHEW,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2019
@inproceedings{bib_A_Co_2019, AUTHOR = {Deepayan Das, JERIN PHILIP, MINESH MATHEW, Jawahar C V}, TITLE = {A Cost Efficient Approach to Correct OCR Errors in Large Document Collections}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2019}}
Word error rate of an OCR is often higher than its character error rate. This is especially true when OCRs are designed by recognizing characters. High word accuracies are critical for many practical applications like content creation and text-to-speech systems. In order to detect and correct the misrecognised words, it is common for an OCR to employ a post-processor module to improve the word accuracy. However, conventional approaches to post-processing, like looking up a dictionary or using a statistical language model (SLM), are still limited. In many such scenarios, it is often required to remove the outstanding errors manually. We observe that the traditional post-processing schemes look at error words sequentially since OCRs process documents one at a time. We propose a cost efficient model to address the error words in batches rather than correcting them individually. We exploit the fact that a collection of documents (e.g., a book), unlike a single document, has a structure leading to repetition of words. Such words, if efficiently grouped together and corrected together, can lead to a significant reduction in the effort. Error correction can be fully automatic or with a human in the loop. We compare the performance of our method with various baseline approaches including the case where all the errors are removed by a human. We demonstrate the efficacy of our solution empirically by reporting more than 70% reduction in the human effort with near perfect error correction. We validate our method on books in both English and Hindi.
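The batching idea described in this abstract, grouping repeated error words so that one correction fixes many occurrences, can be illustrated with a small sketch. The suspect-word detector and the correction oracle below are placeholders standing in for the actual OCR confidence model and the human or dictionary in the loop.

```python
from collections import defaultdict

# Minimal sketch of batch error correction: group identical suspect words
# across a whole book so a single decision fixes all of their occurrences.

def group_suspect_words(pages, is_suspect):
    """pages: list of lists of recognised words; returns word -> [(page, index), ...]."""
    groups = defaultdict(list)
    for page_no, words in enumerate(pages):
        for idx, word in enumerate(words):
            if is_suspect(word):
                groups[word].append((page_no, idx))
    return groups

def correct_in_batch(pages, groups, ask_correction):
    for word, locations in groups.items():
        fixed = ask_correction(word)        # one query corrects every occurrence
        for page_no, idx in locations:
            pages[page_no][idx] = fixed
    return pages
```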
Textual Description for Mathematical Equations
Ajoy Mondal,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2019
@inproceedings{bib_Text_2019, AUTHOR = {Ajoy Mondal, Jawahar C V}, TITLE = {Textual Description for Mathematical Equations}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2019}}
Reading of a mathematical expression or equation in document images is very challenging due to the large variability of mathematical symbols and expressions. In this paper, we pose reading of a mathematical equation as a task of generating a textual description which interprets the internal meaning of the equation. Inspired by the natural image captioning problem in computer vision, we present a mathematical equation description (MED) model, a novel end-to-end trainable deep neural network based approach that learns to generate a textual description for reading mathematical equation images. Our MED model consists of a convolution neural network as an encoder that extracts features of input mathematical equation images and a recurrent neural network with an attention mechanism which generates descriptions related to the input mathematical equation images. Due to the unavailability of mathematical equation image data sets with their textual descriptions, we generate two data sets for experimental purposes. To validate the effectiveness of our MED model, we conduct a real-world experiment to see whether students are able to write equations by only reading or listening to their textual descriptions. Experiments conclude that the students are able to write most of the equations correctly by reading their textual descriptions only.
Graphical Object Detection in Document Images
Ranajit Saha,Ajoy Mondal,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2019
@inproceedings{bib_Grap_2019, AUTHOR = {Ranajit Saha, Ajoy Mondal, Jawahar C V}, TITLE = {Graphical Object Detection in Document Images}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2019}}
Graphical elements, particularly tables and figures, contain a visual summary of the most valuable information contained in a document. Therefore, localization of such graphical objects in document images is the initial step to understand the content of such graphical objects or document images. In this paper, we present a novel end-to-end trainable deep learning based framework to localize graphical objects in document images, called Graphical Object Detection (GOD). Our framework is data-driven and does not require any heuristics or meta-data to locate graphical objects in document images. GOD explores the concepts of transfer learning and domain adaptation to handle the scarcity of labeled training images for the graphical object detection task in document images. Performance analysis carried out on various public benchmark data sets, ICDAR-2013, ICDAR-POD 2017 and UNLV, shows that our model yields promising results as compared to state-of-the-art techniques.
Towards Automated Evaluation of Handwritten Assessments
VIJAY BAPANAIAH ROWTULA,OOTA SUBBA REDDY,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2019
@inproceedings{bib_Towa_2019, AUTHOR = {VIJAY BAPANAIAH ROWTULA, OOTA SUBBA REDDY, Jawahar C V}, TITLE = {Towards Automated Evaluation of Handwritten Assessments}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2019}}
Automated evaluation of handwritten answers has been a challenging problem for scaling the education system for many years. Speeding up the evaluation remains the major bottleneck for enhancing the throughput of instructors. This paper describes an effective method for automatically evaluating short descriptive handwritten answers from digitized images. Our goal is to evaluate a student's handwritten answer by assigning an evaluation score that is comparable to the human-assigned scores. Existing works in this domain mainly focused on evaluating handwritten essays with handcrafted, non-semantic features. Our contribution is two-fold: 1) we model this problem as a self-supervised, feature-based classification problem, which can fine-tune itself for each question without any explicit supervision. 2) We introduce the usage of semantic analysis for auto-evaluation in handwritten text space, using a combination of Information Retrieval and Extraction (IRE) and Natural Language Processing (NLP) methods to derive a set of useful features. We tested our method on three datasets created from various domains, with the help of students of different age groups. Experiments show that our method performs comparably to that of human evaluators.
Icdar2019 competition on scanned receipt ocr and information extraction
Zheng Huang,Kai Chen,Jianhua He,Xiang Bai,Dimosthenis Karatzas,Shijian Lu,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2019
@inproceedings{bib_Icda_2019, AUTHOR = {Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, Jawahar C V}, TITLE = {Icdar2019 competition on scanned receipt ocr and information extraction}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2019}}
The ICDAR 2019 Challenge on “Scanned receipts OCR and key information extraction” (SROIE) covers important aspects related to the automated analysis of scanned receipts. The SROIE tasks play a key role in many document analysis systems and hold significant commercial potential. Although a lot of works have been published over the years on administrative document analysis, the community has advanced relatively slowly, as most datasets have been kept private. One of the key contributions of SROIE to the document analysis community is to offer a first, standardized dataset of 1000 whole scanned receipt images and annotations, as well as an evaluation procedure for such tasks. The Challenge is structured around three tasks, namely Scanned Receipt Text Localization (Task 1), Scanned Receipt OCR (Task 2) and Key Information Extraction from Scanned Receipts (Task 3). The competition opened on 10th February, 2019 and closed on 5th May, 2019. We received 29, 24 and 18 valid submissions for the three competition tasks, respectively. This report presents the competition datasets, defines the tasks and the evaluation protocols, offers detailed submission statistics, as well as an analysis of the submitted performance. While the tasks of text localization and recognition seem to be relatively easy to tackle, it is interesting to observe the variety of ideas and approaches proposed for the information extraction task. According to the submissions' performance we believe there is still margin for improving information extraction performance, although the current dataset would have to grow substantially in following editions. Given the success of the SROIE competition, evidenced by the wide interest generated and the healthy number of submissions from academia, research institutes and industry over different countries, we consider that the SROIE competition can evolve into a useful resource for the community, drawing further attention and promoting research and development efforts in this field.
DocFigure: a dataset for scientific document figure classification
JOBIN K V,Ajoy Mondal,Jawahar C V
International Conference on Document Analysis and Recognition Workshops, ICDAR-W, 2019
@inproceedings{bib_DocF_2019, AUTHOR = {JOBIN K V, Ajoy Mondal, Jawahar C V}, TITLE = {DocFigure: a dataset for scientific document figure classification}, BOOKTITLE = {International Conference on Document Analysis and Recognition Workshops}. YEAR = {2019}}
Document figure classification (DFC) is an important stage of a document figure understanding system. The design of a DFC system requires well defined figure categories and a dataset. To the best of the authors' knowledge, the existing datasets related to the classification of figures in document images are limited with respect to their size and categories [1]–[3]. In this paper, we introduce a scientific figure classification dataset, named DocFigure. The dataset consists of 33K annotated figures of 28 different categories present in document images which correspond to scientific articles published in the CVPR, ECCV, ICCV, etc. conferences over the last several years. Manual annotation of such a large number (33K) of figures is time consuming and cost ineffective. In this article, we design a web based annotation tool which can efficiently assign category labels to a large number of figures with minimum effort from human annotators. To benchmark our generated dataset on the classification task, we propose three baseline classification techniques using deep features, deep texture features and a combination of both. In our analysis, we found that the combination of both deep features and deep texture features is more effective for the document figure classification task than the individual features. The dataset and the code are publicly available at https://researchweb.iiit.ac.in/∼jobin.kv/projects
Semantic Segmentation Datasets for Resource Constrained Training
Ashutosh Mishra,Sudhir Kumar,Tarun Kalluri,Girish Varma,Anbumani Subramaian,Manmohan Chandraker,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2019
@inproceedings{bib_Sema_2019, AUTHOR = {Ashutosh Mishra, Sudhir Kumar, Tarun Kalluri, Girish Varma, Anbumani Subramaian, Manmohan Chandraker, Jawahar C V}, TITLE = {Semantic Segmentation Datasets for Resource Constrained Training}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2019}}
Several large scale datasets, coupled with advances in deep neural network architectures, have been greatly successful in pushing the boundaries of performance in semantic segmentation in recent years. However, the scale and magnitude of such datasets prohibits ubiquitous use and widespread adoption of such models, especially in settings with serious hardware and software resource constraints. Through this work, we propose two simple variants of the recently proposed IDD dataset, namely IDD-mini and IDD-lite, for scene understanding in unstructured environments. Our main objective is to enable research and benchmarking in training segmentation models. We believe that this will enable quick prototyping useful in applications like optimum parameter and architecture search, and encourage deployment on low resource hardware such as a Raspberry Pi. We show qualitatively and quantitatively that with only 1 hour of training on 4GB of GPU memory, we can achieve satisfactory semantic segmentation performance on the proposed datasets.
Towards Increased Accessibility of Meme Images with the Help of Rich Face Emotion Captions
Prajwal K R,Jawahar C V,Ponnurangam Kumaraguru
International Conference on Multimedia, IMM, 2019
@inproceedings{bib_Towa_2019, AUTHOR = {Prajwal K R, Jawahar C V, Ponnurangam Kumaraguru}, TITLE = {Towards Increased Accessibility of Meme Images with the Help of Rich Face Emotion Captions}, BOOKTITLE = {International Conference on Multimedia}. YEAR = {2019}}
In recent years, there has been an explosion in the number of memes being created and circulated in online social networks. Despite their rapidly increasing impact on how we communicate online, meme images are virtually inaccessible to visually impaired users. Existing automated assistive systems that were primarily devised for natural photos in social media overlook the specific fine-grained visual details in meme images. In this paper, we concentrate on describing one such prominent visual detail: the meme face emotion. We propose a novel automated method that enables visually impaired social media users to understand and appreciate meme face emotions with the help of rich textual captions. We first collect a challenging dataset of meme face emotion captions to support future research in face emotion understanding. We design a two-stage approach that significantly outperforms baseline approaches across all the standard captioning metrics and also generates richer discriminative captions. By validating our solution with the help of visually impaired social media users, we show that our emotion captions enable them to understand and appreciate one of the most popular classes of meme images encountered on the Internet for the first time. Code, data, and models are publicly available.
Generating 1 Minute Summaries of Day Long Egocentric Videos
ANUJ RATHORE,Pravin Nagar,Chetan Arora,Jawahar C V
International Conference on Multimedia, IMM, 2019
@inproceedings{bib_Gene_2019, AUTHOR = {ANUJ RATHORE, Pravin Nagar, Chetan Arora, Jawahar C V}, TITLE = {Generating 1 Minute Summaries of Day Long Egocentric Videos}, BOOKTITLE = {International Conference on Multimedia}. YEAR = {2019}}
The popularity of egocentric cameras and their always-on nature has led to the abundance of day-long first-person videos. Because of the extreme shake and highly redundant nature, these videos are difficult to watch from beginning to end and often require summarization tools for their efficient consumption. However, traditional summarization techniques developed for static surveillance videos, or highly curated sports videos and movies, are either not suitable or simply do not scale for such hours-long videos in the wild. On the other hand, specialized summarization techniques developed for egocentric videos limit their focus to important objects and people. In this paper, we present a novel unsupervised reinforcement learning technique to generate video summaries from day-long egocentric videos. Our approach can be adapted to generate summaries of various lengths, making it possible to view even 1-minute summaries of one's entire day. The technique can also be adapted to various rewards, such as distinctiveness and indicativeness of the summary. When using the facial saliency-based reward, we show that our approach generates summaries focusing on social interactions, similar to the current state-of-the-art (SOTA). Quantitative comparison on the benchmark Disney dataset shows that our method achieves significant improvement in Relaxed F-Score (RFS) (32.56 vs. 19.21) and BLEU score (12.12 vs. 10.64). Finally, we show that our technique can be applied for summarizing traditional, short, hand-held videos as well, where we improve the SOTA F-score on the benchmark SumMe and TVSum datasets from 41.4 to 45.6 and from 57.6 to 59.1, respectively.
Towards automatic face-to-face translation
Prajwal K R,Rudrabha Mukhopadhyay,JERIN PHILIP,Abhishek Jha,Vinay Namboodiri,Jawahar C V
International Conference on Multimedia, IMM, 2019
@inproceedings{bib_Towa_2019, AUTHOR = {Prajwal K R, Rudrabha Mukhopadhyay, JERIN PHILIP, Abhishek Jha, Vinay Namboodiri, Jawahar C V}, TITLE = {Towards automatic face-to-face translation}, BOOKTITLE = {International Conference on Multimedia}. YEAR = {2019}}
In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term as "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages. Code, models and demo video are made publicly available.
Technology interventions for road safety and beyond
Jawahar C V,Venkata N Padmanabhan
Communications of the ACM, CACM, 2019
@inproceedings{bib_Tech_2019, AUTHOR = {Jawahar C V, Venkata N Padmanabhan}, TITLE = {Technology interventions for road safety and beyond}, BOOKTITLE = {Communications of the ACM}. YEAR = {2019}}
No Abstract available
IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments
Girish Varma,Anbumani Subramanian,Manmohan Chandraker,Anoop Namboodiri,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2019
@inproceedings{bib_IDD:_2019, AUTHOR = {Girish Varma, Anbumani Subramanian, Manmohan Chandraker, Anoop Namboodiri, Jawahar C V}, TITLE = {IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2019}}
While several datasets for autonomous navigation have become available in recent years, they have tended to focus on structured driving environments. This usually corresponds to well-delineated infrastructure such as lanes, a small number of well-defined categories for traffic participants, low variation in object or background appearance and strong adherence to traffic rules. We propose IDD, a novel dataset for road scene understanding in unstructured environments where the above assumptions are largely not satisfied. It consists of 10,004 images, finely annotated with 34 classes collected from 182 drive sequences on Indian roads. The label set is expanded in comparison to popular benchmarks such as Cityscapes, to account for new classes. It also reflects label distributions of road scenes significantly different from existing datasets, with most classes displaying greater within-class diversity. Consistent with …
A Deep Learning Approach for Robust Corridor Following
Vishnu Sashank Dorbala,A.H. Abdul Hafez,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2019
@inproceedings{bib_A_De_2019, AUTHOR = {Vishnu Sashank Dorbala, A.H. Abdul Hafez, Jawahar C V}, TITLE = {A Deep Learning Approach for Robust Corridor Following}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2019}}
For an autonomous corridor following task where the environment is continuously changing, several forms of environmental noise prevent an automated feature extraction procedure from performing reliably. Moreover, in cases where pre-defined features are absent from the captured data, a well-defined control signal for performing the servoing task fails to get produced. In order to overcome these drawbacks, we present in this work the use of a convolutional neural network (CNN) to directly estimate the required control signal from an image, encompassing feature extraction and control law computation in one single end-to-end framework. In particular, we study the task of autonomous corridor following using a CNN and present clear advantages in cases where a traditional method used for performing the same task fails to give a reliable outcome. We evaluate the performance of our method on this task on a wheelchair platform developed at our institute for this purpose.
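As a rough illustration of the idea of regressing a control command directly from an image (a generic PyTorch sketch, not the authors' network, training data or control formulation), such a model could look like:

import torch
import torch.nn as nn

class ControlNet(nn.Module):
    def __init__(self):
        super().__init__()
        # small convolutional feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)  # one scalar control (e.g. angular velocity)

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.head(f)

model = ControlNet()
img = torch.randn(1, 3, 224, 224)              # a dummy corridor image
pred = model(img)                              # predicted control command
# regress towards a recorded/ground-truth command during training
loss = nn.functional.mse_loss(pred, torch.zeros_like(pred))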
Hwnet v2: An efficient word image representation for handwritten documents
PRAVEEN KRISHNAN,Jawahar C V
International Journal on Document Analysis and Recognition, IJDAR, 2019
@inproceedings{bib_Hwne_2019, AUTHOR = {PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {Hwnet v2: An efficient word image representation for handwritten documents}, BOOKTITLE = {International Journal on Document Analysis and Recognition}. YEAR = {2019}}
We present a framework for learning an efficient holistic representation for handwritten word images. The proposed method uses a deep convolutional neural network with traditional classification loss. The major strengths of our work lie in: (i) the efficient usage of synthetic data to pre-train a deep network, (ii) an adapted version of the ResNet-34 architecture with the region of interest pooling (referred to as HWNet v2) which learns discriminative features for variable sized word images, and (iii) a realistic augmentation of training data with multiple scales and distortions which mimics the natural process of handwriting. We further investigate the process of transfer learning to reduce the domain gap between synthetic and real domain, and also analyze the invariances learned at different layers of the network using visualization techniques proposed in the literature.
Evaluation and Visualization of Driver Inattention Rating From Facial Features
ISHA DUA,Akshay Uttama Nambi,Jawahar C V,Venkata N. Padmanabhan
IEEE Transactions on Biometrics, Behavior, and Identity Science, TBOIM, 2019
@inproceedings{bib_Eval_2019, AUTHOR = {ISHA DUA, Akshay Uttama Nambi, Jawahar C V, Venkata N. Padmanabhan}, TITLE = {Evaluation and Visualization of Driver Inattention Rating From Facial Features}, BOOKTITLE = {IEEE Transactions on Biometrics, Behavior, and Identity Science}. YEAR = {2019}}
In this paper, we present AUTORATE, a system that leverages the front camera of a windshield-mounted smartphone to monitor driver's attention by combining several features. We derive a driver attention rating by fusing spatio-temporal features based on the driver state and behavior such as head pose, eye gaze, eye closure, yawns, use of cellphones, etc. We perform extensive evaluation of AUTORATE on real-world driving data and also data from controlled, static vehicle settings with 30 drivers in a large city. We compare AUTORATE's automatically-generated rating with the scores given by 5 human annotators. We compute the agreement between AUTORATE's rating and the human annotator rating using the kappa coefficient. AUTORATE's automatically-generated rating has an overall agreement of 0.88 with the ratings provided by 5 human annotators. We also propose a soft attention mechanism in AUTORATE which improves its accuracy by 10%. We use temporal and spatial attention to visualize the key frame and the key action which justify the model's predicted rating. Further, we observe that personalization in AUTORATE can improve driver-specific results by a significant amount.
Learning optimal redistribution mechanisms through neural networks
P Manisha,Jawahar C V,Sujit P Gujar
International Conference on Autonomous Agents and Multiagent Systems, AAMAS, 2019
@inproceedings{bib_Lear_2019, AUTHOR = {P Manisha, Jawahar C V, Sujit P Gujar}, TITLE = {Learning optimal redistribution mechanisms through neural networks}, BOOKTITLE = {International Conference on Autonomous Agents and Multiagent Systems}. YEAR = {2019}}
We consider a setting where public resources are to be allocated among competing and strategic agents so as to maximize social welfare (the objects should be allocated to those who value them the most). This is called allocative efficiency (AE). We need the agents to report their valuations for obtaining these resources truthfully, which is referred to as dominant strategy incentive compatibility (DSIC). We use auction-based mechanisms to achieve AE and DSIC, yet budget balance cannot be ensured, due to the Green-Laffont Impossibility Theorem. That is, the net transfer of money cannot be zero. This problem has been addressed by designing a redistribution mechanism so as to ensure a minimum surplus of money as well as AE and DSIC. The objective could be to minimize surplus in expectation or in the worst case, and the objects could be homogeneous or heterogeneous. Designing redistribution mechanisms which perform well in expectation becomes analytically challenging for heterogeneous settings. In this paper, we take a completely different, data-driven approach. We train a neural network to determine an optimal redistribution mechanism based on given settings with both the objectives, optimal in expectation and optimal in the worst case. We also propose a loss function to train a neural network to optimize the worst case. We design neural networks with the underlying rebate functions being linear as well as nonlinear in terms of the bids of the agents. Our networks' performance matches the theoretical guarantees for the cases that have been solved. We observe that a neural network based redistribution mechanism for homogeneous …
Word Spotting and Recognition using Deep Embedding
PRAVEEN KRISHNAN,KARTIK DUTTA,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2018
@inproceedings{bib_Word_2018, AUTHOR = {PRAVEEN KRISHNAN, KARTIK DUTTA, Jawahar C V}, TITLE = {Word Spotting and Recognition using Deep Embedding}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2018}}
Deep convolutional features for word images and textual embedding schemes have shown great success in word spotting. In this work, we follow these motivations to propose an End2End embedding framework which jointly learns both the text and image embeddings using state-of-the-art deep convolutional architectures. The three major contributions of this work are: (i) an End2End embedding scheme to learn a common representation for word images and their labels, (ii) building a state-of-the-art word image descriptor and demonstrating its utility as off-the-shelf features for word spotting, and (iii) use of synthetic data as a complementary modality to further enhance word spotting and recognition. On the challenging IAM handwritten dataset, we report a mAP of 0.9509 for the query-by-string retrieval task. Under lexicon based word recognition, our proposed method reports a CER of 2.66 and a WER of 5.10.
Document Image Segmentation Using Deep Features
JOBIN K V,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2018
@inproceedings{bib_Docu_2018, AUTHOR = {JOBIN K V, Jawahar C V}, TITLE = {Document Image Segmentation Using Deep Features}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2018}}
This paper explores the effectiveness of deep features for document image segmentation. The document image segmentation problem is modelled as a pixel labeling task where each pixel in the document image is classified into one of the predefined labels such as text, comments, decorations and background. Our method first extracts deep features from superpixels of the document image. Then we learn an SVM classifier using these features, and segment the document image. Fisher vector encoded convolutional layer features (FV-CNN) and fully connected layer features (FC-CNN) are used in our study. Experiments validate that our method is effective and yields better results for segmenting document images in comparison to the popular approaches on benchmark handwritten datasets.
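A minimal sketch of this kind of superpixel-plus-classifier pipeline is shown below; the feature extractor feat_fn, the SLIC parameters and the linear SVM are placeholders, not the FV-CNN/FC-CNN features or classifier settings used in the paper.

import numpy as np
from skimage.segmentation import slic
from sklearn.svm import LinearSVC

def superpixel_features(image, segments, feat_fn):
    """Average a dense per-pixel feature map over each superpixel."""
    fmap = feat_fn(image)                      # assumed H x W x D feature map
    feats = []
    for sp in np.unique(segments):
        feats.append(fmap[segments == sp].mean(axis=0))
    return np.stack(feats)

def train(image, labels, feat_fn):
    """image: H x W x 3 document page, labels: one class id per superpixel."""
    segments = slic(image, n_segments=500, compactness=10)
    X = superpixel_features(image, segments, feat_fn)
    clf = LinearSVC().fit(X, labels)           # per-superpixel classification
    return segments, clf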
Offline handwriting recognition on devanagari using a new benchmark dataset
KARTIK DUTTA,PRAVEEN KRISHNAN,MINESH MATHEW,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2018
@inproceedings{bib_Offl_2018, AUTHOR = {KARTIK DUTTA, PRAVEEN KRISHNAN, MINESH MATHEW, Jawahar C V}, TITLE = {Offline handwriting recognition on devanagari using a new benchmark dataset}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2018}}
Handwriting recognition (HWR) in Indic scripts, like Devanagari, is very challenging due to the subtleties in the scripts, variations in rendering and the cursive nature of the handwriting. Lack of public handwriting datasets in Indic scripts has long stymied the development of offline handwritten word recognizers and made comparison across different methods a tedious task in the field. In this paper, we release a new handwritten word dataset for Devanagari, IIIT-HW-Dev, to alleviate some of these issues. We benchmark the IIIT-HW-Dev dataset using a CNN-RNN hybrid architecture. Furthermore, using this architecture, we empirically show that usage of synthetic data and cross lingual transfer learning helps alleviate the issue of lack of training data. We use this proposed pipeline on a public dataset, RoyDB, and achieve state of the art results.
Learning to round for discrete labeling problems
PRITISH MOHAPATRA,Jawahar C V,M. Pawan Kumar
International Conference on Artificial Intelligence and Statistics, AISTATS, 2018
@inproceedings{bib_Lear_2018, AUTHOR = {PRITISH MOHAPATRA, Jawahar C V, M. Pawan Kumar}, TITLE = {Learning to round for discrete labeling problems}, BOOKTITLE = {International Conference on Artificial Intelligence and Statistics}. YEAR = {2018}}
Discrete labeling problems are often solved by formulating them as an integer program, and relaxing the integrality constraint to a continuous domain. While the continuous relaxation is closely related to the original integer program, its optimal solution is often fractional. Thus, the success of a relaxation depends crucially on the availability of an accurate rounding procedure. The problem of identifying an accurate rounding procedure has mainly been tackled in the theoretical computer science community through mathematical analysis of the worst case. However, this approach is both onerous and ignores the distribution of the data encountered in practice. We present a novel interpretation of rounding procedures as sampling from a latent variable model, which opens the door to the use of powerful machine learning formulations in their design. Inspired by the recent success of deep latent variable models, we parameterize rounding procedures as a neural network, which lends itself to efficient optimization via back-propagation. By minimizing the expected value of the objective of the discrete labeling problem over training samples, we learn a rounding procedure that is more suited to the task at hand. Using both synthetic and real world data sets, we demonstrate that our approach can outperform the state-of-the-art hand-designed rounding procedures.
Word spotting in silent lip videos
Abhishek Jha,Vinay P. Namboodiri,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2018
@inproceedings{bib_Word_2018, AUTHOR = {Abhishek Jha, Vinay P. Namboodiri, Jawahar C V}, TITLE = {Word spotting in silent lip videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2018}}
Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to limited vocabulary and high dependency on the model's recognition performance. Our contribution is two-fold: 1) we develop a pipeline for recognition-free retrieval, and show its performance against recognition-based retrieval on a large-scale dataset and another set of out-of-vocabulary words. 2) We introduce a query expansion technique using pseudo-relevant feedback and propose a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision over the recognition-based method on the large-scale LRW dataset. Finally, we demonstrate the application of the method by word spotting in a popular speech video ("The Great Dictator" by Charlie Chaplin) where we show that the word retrieval can be used to understand what was spoken, perhaps, in silent movies.
Efficient optimization for rank-based loss functions
PRITISH MOHAPATRA,Michal Rol´ınek,Jawahar C V,Vladimir Kolmogorov, M. Pawan Kumar
Computer Vision and Pattern Recognition, CVPR, 2018
@inproceedings{bib_Effi_2018, AUTHOR = {PRITISH MOHAPATRA, Michal Rol´ınek, Jawahar C V, Vladimir Kolmogorov, M. Pawan Kumar}, TITLE = {Efficient optimization for rank-based loss functions}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2018}}
The accuracy of information retrieval systems is often measured using complex loss functions such as the average precision (AP) or the normalized discounted cumulative gain (NDCG). Given a set of positive and negative samples, the parameters of a retrieval system can be estimated by minimizing these loss functions. However, the non-differentiability and non-decomposability of these loss functions does not allow for simple gradient based optimization algorithms. This issue is generally circumvented by either optimizing a structured hinge-loss upper bound to the loss function or by using asymptotic methods like the direct-loss minimization framework. Yet, the high computational complexity of loss-augmented inference, which is necessary for both the frameworks, prohibits its use in large training data sets. To alleviate this deficiency, we present a novel quicksort flavored algorithm for a large class of non-decomposable loss functions. We provide a complete characterization of the loss functions that are amenable to our algorithm, and show that it includes both AP and NDCG based loss functions. Furthermore, we prove that no comparison based algorithm can improve upon the computational complexity of our approach asymptotically. We demonstrate the effectiveness of our approach in the context of optimizing the structured hinge loss upper bound of AP and NDCG loss for learning models for a variety of vision tasks. We show that our approach provides significantly better results than simpler decomposable loss functions, while requiring a comparable training time.
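To make the non-differentiability issue concrete, the toy NumPy snippet below computes average precision from scores and relevance labels; since the metric depends only on the induced ranking, small score perturbations that preserve the ranking leave it unchanged (zero gradient almost everywhere). This illustrates the problem setting only, not the quicksort-flavored algorithm proposed in the paper.

import numpy as np

def average_precision(scores, labels):
    """AP of a ranking; labels are 1 (relevant) / 0 (irrelevant)."""
    order = np.argsort(-scores)
    rel = labels[order]
    hits = np.cumsum(rel)
    prec_at_k = hits / (np.arange(len(rel)) + 1)
    return (prec_at_k * rel).sum() / max(rel.sum(), 1)

scores = np.array([0.9, 0.2, 0.75, 0.4])
labels = np.array([1, 0, 1, 0])
print(average_precision(scores, labels))          # 1.0
print(average_precision(scores + 0.01, labels))   # unchanged: same ranking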
Scaling Handwritten Student Assessments with a Document Image Workflow System
VIJAY BAPANAIAH ROWTULA,Varun Bhargavan,Mohan Kumar,Jawahar C V
Computer Vision and Pattern Recognition Conference workshops, CVPR-W, 2018
@inproceedings{bib_Scal_2018, AUTHOR = {VIJAY BAPANAIAH ROWTULA, Varun Bhargavan, Mohan Kumar, Jawahar C V}, TITLE = {Scaling Handwritten Student Assessments with a Document Image Workflow System}, BOOKTITLE = {Computer Vision and Pattern Recognition Conference workshops}. YEAR = {2018}}
With the increase in the number of students enrolled in the university system, regular assessment of student performance has become challenging. This is especially true in the case of summative assessments, where one expects the student to write down an answer on paper, rather than selecting a correct answer from multiple choices. In this paper, we present a document image workflow system that helps in scaling handwritten student assessments in a typical university setting. We argue that this improves efficiency since the book keeping time as well as physical paper movement is minimized. An electronic workflow can make anonymization easy, alleviating the fear of biases in many cases. Also, parallel and distributed assessment by multiple instructors is straightforward in an electronic workflow system. At the heart of our solution, we have (i) a distributed image capture module with a mobile phone, (ii) image processing algorithms that improve the quality and readability, and (iii) an image annotation module that processes the evaluations/feedback as a separate layer. Our system also acts as a platform for modern image analysis which can be adapted to the domain of student assessments. These include (i) handwriting recognition and word spotting [5], (ii) measures of document similarity [6], (iii) aesthetic analysis of handwriting [7], (iv) identity of the writer [4], etc. With the handwriting assessment workflow system, all these recent advances in computer vision can become practical and applicable in evaluating student assessments.
Efficient semantic segmentation using gradual grouping
V NIKITHA,A SAI KRISHANA SRIHARSHA,Girish Varma,Jawahar C V,Manu Mathew,Soyeb Nagori
Computer Vision and Pattern Recognition Conference workshops, CVPR-W, 2018
@inproceedings{bib_Effi_2018, AUTHOR = {V NIKITHA, A SAI KRISHANA SRIHARSHA, Girish Varma, Jawahar C V, Manu Mathew, Soyeb Nagori}, TITLE = {Efficient semantic segmentation using gradual grouping}, BOOKTITLE = {Computer Vision and Pattern Recognition Conference workshops}. YEAR = {2018}}
Deep CNNs for semantic segmentation have high memory and run time requirements. Various approaches have been proposed to make CNNs efficient, like grouped, shuffled, and depth-wise separable convolutions. We study the effectiveness of these techniques on a real-time semantic segmentation architecture like ERFNet for improving runtime by over 5X. We apply these techniques to CNN layers partially or fully and evaluate the testing accuracies on the Cityscapes dataset. We obtain accuracy vs. parameters/FLOPs trade-offs, giving accuracy scores for models that can run under specified runtime budgets. We further propose a novel training procedure which starts out with a dense convolution but gradually evolves towards a grouped convolution. We show that our proposed training method and efficient architecture design can improve accuracies by over 8% with depthwise separable convolutions applied on the encoder of ERFNet and attaching a lightweight decoder. This results in a model which has a 5X improvement in FLOPs while only suffering a 4% degradation in accuracy with respect to ERFNet.
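A hedged sketch of one of the efficiency substitutions discussed above: replacing a dense 3x3 convolution with a depthwise-separable one in PyTorch and comparing parameter counts (layer sizes are arbitrary examples, not those of ERFNet).

import torch.nn as nn

def dense_conv(c_in, c_out):
    return nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

def depthwise_separable_conv(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),  # depthwise 3x3
        nn.Conv2d(c_in, c_out, kernel_size=1),                         # pointwise 1x1
    )

def num_params(m):
    return sum(p.numel() for p in m.parameters())

print(num_params(dense_conv(128, 128)))                 # 147584
print(num_params(depthwise_separable_conv(128, 128)))   # 17792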
NeuroIoU: Learning a Surrogate Loss for Semantic Segmentation
NAGENDAR. G,DIGVIJAY SINGH,V. Balasubramanian,Jawahar C V
British Machine Vision Conference, BMVC, 2018
@inproceedings{bib_Neur_2018, AUTHOR = {NAGENDAR. G, DIGVIJAY SINGH, V. Balasubramanian, Jawahar C V}, TITLE = {NeuroIoU: Learning a Surrogate Loss for Semantic Segmentation}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2018}}
Semantic segmentation is a popular task in computer vision today, and deep neural network models have emerged as the popular solution to this problem in recent times. The typical loss function used to train neural networks for this task is the cross-entropy loss. However, the success of the learned models is measured using Intersection-OverUnion (IoU), which is inherently non-differentiable. This gap between performance measure and loss function results in a fall in performance, which has also been studied by few recent efforts. In this work, we propose a novel method to automatically learn a surrogate loss function that approximates the IoU loss and is better suited for good IoU performance. To the best of our knowledge, this is the first such work that attempts to learn a loss function for this purpose. The proposed loss can be directly applied over any network. We validated our method over different networks (FCN, SegNet, UNet) on the PASCAL VOC and Cityscapes datasets. Our results on this work show consistent improvement over baseline methods.
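For contrast with the learned surrogate proposed in this paper, the snippet below shows the common hand-crafted differentiable "soft IoU" baseline for binary segmentation; it is not the learned loss from the paper, only a reference point.

import torch

def soft_iou_loss(probs, target, eps=1e-6):
    """probs, target: tensors of shape (N, H, W) with values in [0, 1]."""
    inter = (probs * target).sum(dim=(1, 2))
    union = (probs + target - probs * target).sum(dim=(1, 2))
    iou = (inter + eps) / (union + eps)
    return 1.0 - iou.mean()

probs = torch.rand(2, 64, 64, requires_grad=True)
target = (torch.rand(2, 64, 64) > 0.5).float()
loss = soft_iou_loss(probs, target)
loss.backward()   # differentiable, unlike the raw IoU metric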
Efficient Query Specific DTW Distance for Document Retrieval with Unlimited Vocabulary
NAGENDAR. G,Viresh Ranjan,Gaurav Harit,Jawahar C V
Journal of Imaging, JIma, 2018
@inproceedings{bib_Effi_2018, AUTHOR = {NAGENDAR. G, Viresh Ranjan, Gaurav Harit, Jawahar C V}, TITLE = {Efficient Query Specific DTW Distance for Document Retrieval with Unlimited Vocabulary}, BOOKTITLE = {Journal of Imaging}. YEAR = {2018}}
In this paper, we improve the performance of the recently proposed Direct Query Classifier (DQC). DQC is a classifier based retrieval method and, in general, such methods have been shown to be superior to OCR-based solutions for performing retrieval in many practical document image datasets. In DQC, the classifiers are trained for a set of frequent queries and seamlessly extended for the rare and arbitrary queries. This extends the classifier based retrieval paradigm to an unlimited number of classes (words) present in a language. DQC requires indexing cut-portions (n-grams) of the word image, and the DTW distance has been used for indexing. However, DTW is computationally slow and therefore limits the performance of DQC. We introduce a query specific DTW distance, which enables effective computation of global principal alignments for novel queries. Since the proposed query specific DTW distance is a linear approximation of the DTW distance, it enhances the performance of DQC. Unlike previous approaches, the proposed query specific DTW distance uses both the class mean vectors and the query information for computing the global principal alignments for the query. Since the proposed method computes the global principal alignments using n-grams, it works well for both frequent and rare queries. We also use query expansion (QE) to further improve the performance of our query specific DTW. This also allows us to seamlessly adapt our solution to new fonts, styles and collections. We have demonstrated the utility of the proposed technique over 3 different datasets. The proposed query specific DTW performs well compared to the previous DTW approximations.
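For reference, a plain dynamic-programming DTW between two feature sequences is sketched below; its quadratic cost per comparison is what motivates linear approximations such as the query specific DTW described above (this is the classic DTW, not the proposed variant).

import numpy as np

def dtw(a, b):
    """a: (n, d), b: (m, d) sequences of feature vectors; returns DTW cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x = np.random.rand(40, 16)
y = np.random.rand(55, 16)
print(dtw(x, y))   # O(n*m) per comparison -- expensive for large indexes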
Towards structured analysis of broadcast badminton videos
ANURAG GHOSH,SURIYA SINGH,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2018
@inproceedings{bib_Towa_2018, AUTHOR = {ANURAG GHOSH, SURIYA SINGH, Jawahar C V}, TITLE = {Towards structured analysis of broadcast badminton videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2018}}
Sports video data is recorded for nearly every major tournament but remains archived and inaccessible to large scale data mining and analytics. It can only be viewed sequentially or manually tagged with higher-level labels, which is time consuming and prone to errors. In this work, we propose an end-to-end framework for automatic attribute tagging and analysis of sport videos. We use commonly available broadcast videos of matches and, unlike previous approaches, do not rely on special camera setups or additional sensors. Our focus is on badminton as the sport of interest. We propose a method to analyze a large corpus of badminton broadcast videos by segmenting the points played, tracking and recognizing the players in each point and annotating their respective badminton strokes. We evaluate the performance on 10 Olympic matches with 20 players and achieve 95.44% point segmentation accuracy, 97.38% player detection score (mAP@0.5), 97.98% player identification accuracy, and stroke segmentation edit scores of 80.48%. We further show that the automatically annotated videos alone could enable gameplay analysis and inference by computing understandable metrics such as a player's reaction time, speed, and footwork around the court, etc.
Unsupervised learning of face representations
Samyak Datta,Gaurav Sharma,Jawahar C V
International Conference on Automatic Face and Gesture Recognition, FG, 2018
@inproceedings{bib_Unsu_2018, AUTHOR = {Samyak Datta, Gaurav Sharma, Jawahar C V}, TITLE = {Unsupervised learning of face representations}, BOOKTITLE = {International Conference on Automatic Face and Gesture Recognition}. YEAR = {2018}}
We present an approach for unsupervised training of CNNs in order to learn discriminative face representations. We mine supervised training data by noting that multiple faces in the same video frame must belong to different persons and the same face tracked across multiple frames must belong to the same person. We obtain millions of face pairs from hundreds of videos without using any manual supervision. Although faces extracted from videos have a lower spatial resolution than those which are available as part of standard supervised face datasets such as LFW and CASIA-WebFace, the former represent a much more realistic setting, e.g. in surveillance scenarios where most of the faces detected are very small. We train our CNNs with the relatively low resolution faces extracted from the collected video frames, and achieve a higher verification accuracy on the benchmark LFW dataset compared to hand-crafted features such as LBPs, and even surpass the performance of state-of-the-art deep networks such as VGG-Face, when they are made to work with low resolution input images.
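A rough sketch of the pair-mining rule described above (faces in the same frame are negatives, faces along the same track are positives) together with a standard contrastive loss; the tracker output format, embedding network and margin are placeholders, not the paper's exact setup.

import itertools
import torch
import torch.nn.functional as F

def mine_pairs(frames):
    """frames: list of dicts mapping track_id -> face crop (one dict per frame)."""
    pos, neg = [], []
    for frame in frames:
        # different people appearing together in one frame -> negative pairs
        for (ta, fa), (tb, fb) in itertools.combinations(frame.items(), 2):
            if ta != tb:
                neg.append((fa, fb))
    tracks = {}
    for frame in frames:
        for tid, face in frame.items():
            tracks.setdefault(tid, []).append(face)
    for faces in tracks.values():
        # the same track across frames -> positive pairs
        pos.extend(itertools.combinations(faces, 2))
    return pos, neg

def contrastive_loss(za, zb, same, margin=1.0):
    """za, zb: (B, D) embeddings; same: (B,) boolean tensor."""
    d = F.pairwise_distance(za, zb)
    return torch.where(same, d.pow(2), (margin - d).clamp(min=0).pow(2)).mean()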
Cross-specificity: modelling data semantics for cross-modal matching and retrieval
YASHASWI VERMA,Abhishek Jha,Jawahar C V
International Journal of Multimedia Information Retrieval, IJMIR, 2018
@inproceedings{bib_Cros_2018, AUTHOR = {YASHASWI VERMA, Abhishek Jha, Jawahar C V}, TITLE = {Cross-specificity: modelling data semantics for cross-modal matching and retrieval}, BOOKTITLE = {International Journal of Multimedia Information Retrieval}. YEAR = {2018}}
While dealing with multi-modal data such as pairs of images and text, though individual samples may demonstrate inherent heterogeneity in their content, they are usually coupled with each other based on some higher-level concepts such as their categories. This shared information can be useful in measuring semantics of samples across modalities in a relative manner. In this paper, we investigate the problem of analyzing the degree of specificity in the semantic content of a sample in one modality with respect to semantically similar samples in another modality. Samples that have high similarity with semantically similar samples from another modality are considered to be specific, while others are considered to be relatively ambiguous. To model this property, we propose a novel notion of “cross-specificity”. We present two mechanisms to measure cross-specificity: one based on human judgment and other based on an automated approach. We analyze different aspects of cross-specificity, and demonstrate its utility in cross-modal retrieval task. Experiments show that though conceptually simple, it can benefit several existing cross-modal retrieval techniques, and provides significant boost in their performance.
TextTopicNet - Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces
Yash Patel,Lluis Gomez,Raul Gomez,Marçal Rusiñol,Dimosthenis Karatzas,Jawahar C V
Technical Report, arXiv, 2018
@inproceedings{bib_Text_2018, AUTHOR = {Yash Patel, Lluis Gomez, Raul Gomez, Marçal Rusiñol, Dimosthenis Karatzas, Jawahar C V}, TITLE = {TextTopicNet - Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces}, BOOKTITLE = {Technical Report}. YEAR = {2018}}
The immense success of deep learning based methods in computer vision heavily relies on large scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such datasets requires a tremendous amount of human effort and annotations are limited to popular set of classes. As an alternative, learning visual features by designing auxiliary tasks which make use of freely available self-supervision has become increasingly popular in the computer vision community. In this paper, we put forward an idea to take advantage of multi-modal context to provide self-supervision for the training of computer vision algorithms. We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is more probable to appear as an illustration. More specifically we use popular text embedding techniques to provide the self-supervision for the training of deep CNN.
Learning human poses from actions
ADITYA ARUN,Jawahar C V,M. Pawan Kumar
British Machine Vision Conference, BMVC, 2018
@inproceedings{bib_Lear_2018, AUTHOR = {ADITYA ARUN, Jawahar C V, M. Pawan Kumar}, TITLE = {Learning human poses from actions}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2018}}
We consider the task of learning to estimate human pose in still images. In order to avoid the high cost of full supervision, we propose to use a diverse data set, which consists of two types of annotations: (i) a small number of images are labeled using the expensive ground-truth pose; and (ii) other images are labeled using the inexpensive action label. As action information helps narrow down the pose of a human, we argue that this approach can help reduce the cost of training without significantly affecting the accuracy. To demonstrate this we design a probabilistic framework that employs two distributions: (i) a conditional distribution to model the uncertainty over the human pose given the image and the action; and (ii) a prediction distribution, which provides the pose of an image without using any action information. We jointly estimate the parameters of the two aforementioned distributions by minimizing their dissimilarity coefficient, as measured by a task-specific loss function. During both training and testing, we only require an efficient sampling strategy for both the aforementioned distributions. This allows us to use deep probabilistic networks that are capable of providing accurate pose estimates for previously unseen images. Using the MPII data set, we show that our approach outperforms baseline methods that either do not use the diverse annotations or rely on pointwise estimates of the pose.
Connecting visual experiences using max-flow network with application to visual localization
A.H. Abdul Hafez,Nakul Agarwal,Jawahar C V
Technical Report, arXiv, 2018
@inproceedings{bib_Conn_2018, AUTHOR = {A.H. Abdul Hafez, Nakul Agarwal, Jawahar C V}, TITLE = {Connecting visual experiences using max-flow network with application to visual localization}, BOOKTITLE = {Technical Report}. YEAR = {2018}}
We are motivated by the fact that multiple representations of the environment are required to stand for the changes in appearance with time and for changes that appear in a cyclic manner. These changes are, for example, from day to night time, and from day to day across seasons. In such situations, the robot visits the same routes multiple times and collects different appearances of them. Multiple visual experiences usually find robotic vision applications like visual localization, mapping, place recognition, and autonomous navigation. The novelty in this paper is an algorithm that connects multiple visual experiences via aligning multiple image sequences. This problem is solved by finding the maximum flow in a directed graph flow-network, whose vertices represent the matches between frames in the test and reference sequences. Edges of the graph represent the cost of these matches. The problem of finding the best match is reduced to finding the minimum-cut surface, which is solved as a maximum flow network problem. Application to visual localization is considered in this paper to show the effectiveness of the proposed multiple image sequence alignment method, without losing its generality. Experimental evaluations show that the precision of sequence matching is improved by considering multiple visual sequences for the same route, and that the method performs favorably against state-of-the-art single representation methods like SeqSLAM and ABLE-M.
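As a toy illustration of casting sequence matching as a flow problem, the snippet below builds a small directed graph over frame matches with networkx and runs a max-flow computation; the graph construction and capacities are simplified stand-ins, not the flow-network formulation used in the paper.

import networkx as nx

def build_match_graph(sim):
    """sim[i][j]: similarity between test frame i and reference frame j."""
    n, m = len(sim), len(sim[0])
    G = nx.DiGraph()
    for j in range(m):
        G.add_edge("s", (0, j), capacity=sim[0][j])
    for i in range(n - 1):
        for j in range(m):
            for k in range(j, m):          # only forward-in-time matches
                G.add_edge((i, j), (i + 1, k), capacity=sim[i + 1][k])
    for j in range(m):
        G.add_edge((n - 1, j), "t", capacity=1.0)
    return G

sim = [[0.9, 0.1, 0.2],
       [0.2, 0.8, 0.3],
       [0.1, 0.2, 0.7]]
G = build_match_graph(sim)
flow_value, flow_dict = nx.maximum_flow(G, "s", "t")
print(flow_value)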
Localizing and recognizing text in lecture videos
KARTIK DUTTA,MINESH MATHEW,PRAVEEN KRISHNAN,Jawahar C V
International Conference on Frontiers in Handwriting Recognition, ICFHR, 2018
@inproceedings{bib_Loca_2018, AUTHOR = {KARTIK DUTTA, MINESH MATHEW, PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {Localizing and recognizing text in lecture videos}, BOOKTITLE = {International Conference on Frontiers in Handwriting Recognition}. YEAR = {2018}}
Lecture videos are rich with textual information, and being able to understand the text is quite useful for larger video understanding/analysis applications. Though text recognition from images has been an active research area in computer vision, text in lecture videos has mostly been overlooked. In this work, we investigate the efficacy of state-of-the-art handwritten and scene text recognition methods on text in lecture videos. To this end, a new dataset - LectureVideoDB, compiled from frames from multiple lecture videos, is introduced. Our experiments show that the existing methods do not fare well on the new dataset. The results necessitate the need to improve the existing methods for robust performance on lecture videos.
Towards spotting and recognition of handwritten words in indic scripts
KARTIK DUTTA,PRAVEEN KRISHNAN,MINESH MATHEW,Jawahar C V
International Conference on Frontiers in Handwriting Recognition, ICFHR, 2018
@inproceedings{bib_Towa_2018, AUTHOR = {KARTIK DUTTA, PRAVEEN KRISHNAN, MINESH MATHEW, Jawahar C V}, TITLE = {Towards spotting and recognition of handwritten words in indic scripts}, BOOKTITLE = {International Conference on Frontiers in Handwriting Recognition}. YEAR = {2018}}
Handwriting recognition (HWR) in Indic scripts is a challenging problem due to the inherent subtleties in the scripts, the cursive nature of the handwriting and the similar shapes of the characters. Lack of publicly available handwriting datasets in Indic scripts has affected the development of handwritten word recognizers, and made direct comparisons across different methods an impossible task in the field. In this paper, we propose a framework for annotating handwritten word images at large scale with ease and speed. We also release a new handwritten word dataset for Telugu, which is collected and annotated using the proposed framework. We also benchmark major Indic scripts such as Devanagari, Bangla and Telugu for the tasks of word spotting and handwriting recognition using state-of-the-art deep neural architectures. Finally, we evaluate the proposed pipeline on RoyDB, a public dataset, and achieve significant reduction in error rates.
Improving CNN-RNN Hybrid Networks for Handwriting Recognition
KARTIK DUTTA,PRAVEEN KRISHNAN,MINESH MATHEW,Jawahar C V
International Conference on Frontiers in Handwriting Recognition, ICFHR, 2018
@inproceedings{bib_Impr_2018, AUTHOR = {KARTIK DUTTA, PRAVEEN KRISHNAN, MINESH MATHEW, Jawahar C V}, TITLE = {Improving CNN-RNN Hybrid Networks for Handwriting Recognition}, BOOKTITLE = {International Conference on Frontiers in Handwriting Recognition}. YEAR = {2018}}
The success of deep learning based models has centered around recent architectures and the availability of large scale annotated data. In this work, we explore these two factors systematically for improving handwriting recognition for scanned off-line document images. We propose a modified CNN-RNN hybrid architecture with a major focus on effective training using: (i) efficient initialization of the network using synthetic data for pretraining, (ii) image normalization for slant correction and (iii) domain specific data transformation and distortion for learning important invariances. We perform a detailed ablation study to analyze the contribution of individual modules and present state-of-the-art results for the task of unconstrained line and word recognition on popular datasets such as IAM, RIMES and GW.
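A compact sketch of a CNN-RNN hybrid trained with CTC, in the spirit of the architecture family discussed above; layer sizes, alphabet size and input dimensions are illustrative placeholders, not the paper's configuration.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_chars):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_chars + 1)   # +1 for the CTC blank symbol

    def forward(self, x):                          # x: (B, 1, 32, W) grayscale word image
        f = self.cnn(x)                            # (B, 128, 8, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)       # per-column features: (B, W/4, 1024)
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)        # (B, T, num_chars + 1)

model = CRNN(num_chars=80)
imgs = torch.randn(4, 1, 32, 128)
logp = model(imgs).permute(1, 0, 2)                # CTC expects (T, B, C)
targets = torch.randint(1, 81, (4, 10))            # dummy label sequences
loss = nn.CTCLoss()(logp, targets,
                    input_lengths=torch.full((4,), logp.size(0), dtype=torch.long),
                    target_lengths=torch.full((4,), 10, dtype=torch.long))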
Enhancing ocr accuracy with super resolution
Ankit Lat,Jawahar C V
International conference on Pattern Recognition, ICPR, 2018
@inproceedings{bib_Enha_2018, AUTHOR = {Ankit Lat, Jawahar C V}, TITLE = {Enhancing ocr accuracy with super resolution}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2018}}
Accuracy of OCR is often marred by the poor quality of the input document images. Generally this performance degradation is attributed to the resolution and quality of scanning. This calls for special efforts to improve the quality of document images before passing them to the OCR engine. One compelling option is to super-resolve these low resolution document images before passing them to the OCR engine. In this work we address this problem by super-resolving document images using a Generative Adversarial Network (GAN). We propose a super resolution based preprocessing step that can enhance the accuracies of OCRs (including the commercial ones). Our method is specially suited for printed document images. We validate the utility on a wide variety of document images (where fonts, styles, and languages vary) without any pre-processing step to adapt across situations. Our experiments show an improvement of up to 21% in OCR accuracy on test images scanned at low resolution. One immediate application of this can be in enhancing the recognition of historic documents which have been scanned at low resolutions.
Class2str: End to end latent hierarchy learning
Soham Saha,Girish Varma,Jawahar C V
International conference on Pattern Recognition, ICPR, 2018
@inproceedings{bib_Clas_2018, AUTHOR = {Soham Saha, Girish Varma, Jawahar C V}, TITLE = {Class2str: End to end latent hierarchy learning}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2018}}
Deep neural networks for image classification typically consist of a convolutional feature extractor followed by a fully connected classifier network. The predicted and the ground truth labels are represented as one-hot vectors. Such a representation assumes that all classes are equally dissimilar. However, classes have visual similarities and often form a hierarchy. Learning this latent hierarchy explicitly in the architecture could provide invaluable insights. We propose an alternate architecture to the classifier network called the Latent Hierarchy (LH) Classifier and an end-to-end learned Class2Str mapping which discovers a latent hierarchy of the classes. We show that for some of the best performing architectures on CIFAR and Imagenet datasets, the proposed replacement and training by the LH classifier recovers the accuracy, with a fraction of the number of parameters in the classifier part. Compared to the previous work of HD-CNN, which also learns a 2 level hierarchy, we are able to learn a hierarchy at an arbitrary number of levels as well as obtain an accuracy improvement on the Imagenet classification task over them. We also verify that many visually similar classes are grouped together under the learnt hierarchy.
Augment and adapt: A simple approach to image tampering detection
Yashas Annadani,Jawahar C V
International conference on Pattern Recognition, ICPR, 2018
@inproceedings{bib_Augm_2018, AUTHOR = {Yashas Annadani, Jawahar C V}, TITLE = {Augment and adapt: A simple approach to image tampering detection}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2018}}
Convolutional Neural Networks have been shown to be promising for image tampering detection in recent years. However, the number of tampered images available to train a network is still small. This is mainly due to the cumbersome effort involved in creating large numbers of tampered images. As a result, the potential offered by these networks is not completely exploited. In this work, we propose a simple method to address this problem by augmenting data using inpainting and compositing schemes. We consider different forms of inpainting, like simple inpainting and semantic inpainting, as well as compositing schemes like feathering, in order to augment the data. A domain adaptation technique is employed to reduce the domain shift between the augmented data and the data available using proprietary software. We demonstrate that this method of augmentation is effective in improving detection accuracies. We present experimental evaluation on two popular datasets for image tampering detection to demonstrate the effectiveness of the proposed approach.
Improving multiclass classification by deep networks using DAGSVM and Triplet Loss
Nakul Agarwal,Vineeth N Balasubramanian,Jawahar C V
Pattern Recognition Letters, PRLJ, 2018
@inproceedings{bib_Impr_2018, AUTHOR = {Nakul Agarwal, Vineeth N Balasubramanian, Jawahar C V}, TITLE = {Improving multiclass classification by deep networks using DAGSVM and Triplet Loss}, BOOKTITLE = {Pattern Recognition Letters}. YEAR = {2018}}
With recent advances in the field of computer vision and especially deep learning, many fully connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification and natural language processing. For classification tasks however, most of these deep learning models employ the softmax activation function for prediction and minimize cross-entropy loss. In contrast, we demonstrate a consistent advantage by replacing the softmax layer by a set of binary SVM classifiers organized in a tree or DAG (Directed Acyclic Graph) structure. The idea is to not treat the multiclass classification problem as a whole but to break it down into smaller binary problems where each classifier acts as an expert by focusing on differentiating between only two classes and thus improves the overall accuracy. Furthermore, by arranging the classifiers in a DAG structure, we later also show how it is possible to further improve the performance of the binary classifiers by learning more discriminative features through the same deep network. We validated the proposed methodology on two benchmark datasets, and the results corroborated our claim.
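A minimal triplet-loss sketch in PyTorch, included only to make the metric-learning objective mentioned above concrete; the DAG arrangement of binary SVMs is a separate, non-differentiable stage and is not reproduced here.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor towards the positive and push it away from the negative."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

emb = lambda x: F.normalize(x, dim=1)          # stand-in for a deep embedding network
a, p, n = (emb(torch.randn(8, 128)) for _ in range(3))
print(triplet_loss(a, p, n).item())
# PyTorch also ships nn.TripletMarginLoss, which implements the same objective.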
Self-Supervised Feature Learning for Semantic Segmentation of Overhead Imagery.
SURIYA SINGH,Anil Kumar Batra,Guan Pang,Lorenzo Torresani,Saikat Basu,Manohar Paluri,Jawahar C V
British Machine Vision Conference, BMVC, 2018
@inproceedings{bib_Self_2018, AUTHOR = {SURIYA SINGH, Anil Kumar Batra, Guan Pang, Lorenzo Torresani, Saikat Basu, Manohar Paluri, Jawahar C V}, TITLE = {Self-Supervised Feature Learning for Semantic Segmentation of Overhead Imagery.}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2018}}
Overhead imageries play a crucial role in many applications such as urban planning, crop yield forecasting, mapping, and policy making. Semantic segmentation could enable automatic, efficient, and large-scale understanding of overhead imageries for these applications. However, semantic segmentation of overhead imageries is a challenging task, primarily due to the large domain gap from existing research in ground imageries, unavailability of large-scale dataset with pixel-level annotations, and inherent complexity in the task. Readily available vast amount of unlabeled overhead imageries share more common structures and patterns compared to the ground imageries, therefore, its large-scale analysis could benefit from unsupervised feature learning techniques. In this work, we study various self-supervised feature learning techniques for semantic segmentation of overhead imageries. We choose image semantic inpainting as a self-supervised task [36] for our experiments due to its proximity to the semantic segmentation task. We (i) show that existing approaches are inefficient for semantic segmentation, (ii) propose architectural changes towards self-supervised learning for semantic segmentation, (iii) propose an adversarial training scheme for self-supervised learning by increasing the pretext task’s difficulty gradually and show that it leads to learning better features, and (iv) propose a unified approach for overhead scene parsing, road network extraction, and land cover estimation. Our approach improves over training from scratch by more than 10% and ImageNet pre-trained network by more than 5% mIOU.
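A sketch of how an inpainting pretext task can be set up on unlabeled tiles (mask a region, train the network to reconstruct it); the masking policy and hole size are placeholders, and the adversarial curriculum proposed in the paper is omitted.

import torch

def make_inpainting_batch(images, hole=32):
    """images: (B, C, H, W) unlabeled tiles. Returns (masked input, reconstruction target)."""
    B, C, H, W = images.shape
    masked = images.clone()
    for b in range(B):
        y = torch.randint(0, H - hole, (1,)).item()
        x = torch.randint(0, W - hole, (1,)).item()
        masked[b, :, y:y + hole, x:x + hole] = 0.0   # blank out a square region
    return masked, images

tiles = torch.rand(4, 3, 128, 128)
inp, target = make_inpainting_batch(tiles)
# training step (sketch): loss = reconstruction_criterion(generator(inp), target)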
City-scale road audit system using deep learning
YARRAM SUDHIR KUMAR REDDY,Girish Varma,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2018
@inproceedings{bib_City_2018, AUTHOR = {YARRAM SUDHIR KUMAR REDDY, Girish Varma, Jawahar C V}, TITLE = {City-scale road audit system using deep learning}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2018}}
Road networks in cities are massive and are a critical component of mobility. Fast response to defects, which can occur not only due to regular wear and tear but also because of extreme events like storms, is essential. Hence there is a need for an automated system that is quick, scalable and cost-effective for gathering information about defects. We propose a system for city-scale road audit, using some of the most recent developments in deep learning and semantic segmentation. For building and benchmarking the system, we curated a dataset which has the annotations required for road defects. However, many of the labels required for road audit have high ambiguity, which we overcome by proposing a label hierarchy. We also propose a multi-step deep learning model that segments the road, subdivides the road further into defects, tags the frame for each defect, and finally localizes the defects on a map gathered using GPS. We analyze and evaluate the models on image tagging as well as segmentation at different levels of the label hierarchy.
Investigating the generalizability of EEG-based cognitive load estimation across visualizations
PAREKH VIRAL MAHESH KUMAR,BILALPUR MANEESH,SHRAVAN KUMAR P,Stefan Winkler,Jawahar C V,Ramanathan Subramanian
International Conference on Multimodal Interaction, ICMI, 2018
@inproceedings{bib_Inve_2018, AUTHOR = {PAREKH VIRAL MAHESH KUMAR, BILALPUR MANEESH, SHRAVAN KUMAR P, Stefan Winkler, Jawahar C V, Ramanathan Subramanian}, TITLE = {Investigating the generalizability of EEG-based cognitive load estimation across visualizations}, BOOKTITLE = {International Conference on Multimodal Interaction}. YEAR = {2018}}
We examine if EEG-based cognitive load (CL) estimation is generalizable across the character, spatial pattern, bar graph and pie chart-based visualizations for the n-back task. CL is estimated via two recent approaches: (a) Deep convolutional neural network [2], and (b) Proximal support vector machines [15]. Experiments reveal that CL estimation suffers across visualizations motivating the need for effective machine learning techniques to benchmark visual interface usability for a given analytic task.
Improved visual relocalization by discovering anchor points
SOHAM SAHA,Girish Varma,Jawahar C V
British Machine Vision Conference, BMVC, 2018
@inproceedings{bib_Impr_2018, AUTHOR = {SOHAM SAHA, Girish Varma, Jawahar C V}, TITLE = {Improved visual relocalization by discovering anchor points}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2018}}
We address the visual relocalization problem of predicting the location and camera orientation or pose (6DOF) of the given input scene. We propose a method based on how humans determine their location using the visible landmarks. We define anchor points uniformly across the route map and propose a deep learning architecture which predicts the most relevant anchor point present in the scene as well as the relative offsets with respect to it. The relevant anchor point need not be the nearest anchor point to the ground truth location, as it might not be visible due to the pose. Hence we propose a multi-task loss function, which discovers the relevant anchor point without needing the ground truth for it. We validate the effectiveness of our approach by experimenting on Cambridge Landmarks (large-scale outdoor scenes) as well as 7 Scenes (indoor scenes) using various CNN feature extractors. Our method improves the median error in indoor as well as outdoor localization datasets compared to the previous best deep learning model known as PoseNet (with geometric re-projection loss) using the same feature extractor. We improve the median error in localization in the specific case of the Street scene by over 8 m.
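A minimal PyTorch sketch of the anchor-point idea follows, under assumptions of my own: an arbitrary feature dimension, 3-D offsets, and a soft weighting of per-anchor errors by the classifier's probabilities so that no ground-truth anchor label is needed. The paper's actual architecture and loss may differ.

    import torch
    import torch.nn as nn

    class AnchorPoseHead(nn.Module):
        # Toy multi-task head: anchor-point classification + relative offsets.
        def __init__(self, feat_dim, num_anchors):
            super().__init__()
            self.cls = nn.Linear(feat_dim, num_anchors)       # which anchor is "relevant"
            self.off = nn.Linear(feat_dim, num_anchors * 3)   # 3-D offset per anchor

        def forward(self, feat):
            return self.cls(feat), self.off(feat).view(-1, self.cls.out_features, 3)

    def multitask_loss(logits, offsets, anchor_xyz, gt_xyz):
        # Each anchor's offset error is weighted by its predicted probability,
        # so no ground-truth "relevant anchor" label is needed (one possible reading).
        probs = logits.softmax(dim=1)                              # (N, A)
        pred_xyz = anchor_xyz.unsqueeze(0) + offsets               # (N, A, 3)
        per_anchor_err = (pred_xyz - gt_xyz.unsqueeze(1)).norm(dim=2)
        return (probs * per_anchor_err).sum(dim=1).mean()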
Automatic image annotation: the quirks and what works
Ayushi Dutta,Yashaswi Verma,Jawahar C V
Multimedia Tools Application, MTA, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {Ayushi Dutta, Yashaswi Verma, Jawahar C V}, TITLE = {Automatic image annotation: the quirks and what works}, BOOKTITLE = {Multimedia Tools Application}. YEAR = {2018}}
Automatic image annotation is one of the fundamental problems in computer vision and machine learning. Given an image, here the goal is to predict a set of textual labels that describe the semantics of that image. During the last decade, a large number of image annotation techniques have been proposed that have been shown to achieve encouraging results on various annotation datasets. However, their scope has mostly remained restricted to quantitative results on the test data, thus ignoring various key aspects related to dataset properties and evaluation metrics that inherently affect the performance to a considerable extent. In this paper, first we evaluate ten state-of-the-art (both deep-learning based as well as non-deep-learning based) approaches for image annotation using the same baseline CNN features. Then we propose new quantitative measures to examine various issues/aspects in the image annotation domain, such as dataset specific biases, per-label versus per-image evaluation criteria, and the impact of changing the number and type of predicted labels. We believe the conclusions derived in this paper through thorough empirical analyses would be helpful in making systematic advancements in this domain.
Pos tagging and named entity recognition on handwritten documents
VIJAY BAPANAIAH ROWTULA,PRAVEEN KRISHNAN,Jawahar C V
International Conference on Natural Language Processing., ICON, 2018
@inproceedings{bib_Pos__2018, AUTHOR = {VIJAY BAPANAIAH ROWTULA, PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {Pos tagging and named entity recognition on handwritten documents}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2018}}
In a world of proliferating data, the ability to rapidly summarize text is growing in importance. Automatic summarization of text can be thought of as a sequence to sequence problem. Another area of natural language processing that solves a sequence to sequence problem is machine translation, which is rapidly evolving due to the development of attention-based encoder-decoder networks. This work applies these modern techniques to abstractive summarization. We perform analysis on various attention mechanisms for summarization with the goal of developing an approach and architecture aimed at improving the state of the art. In particular, we modify and optimize a translation model with self-attention for generating abstractive sentence summaries. The effectiveness of this base model along with attention variants is compared and analyzed in the context of standardized evaluation sets and test metrics. However, we show that these metrics are limited in their ability to effectively score abstractive summaries, and propose a new approach based on the intuition that an abstractive model requires an abstractive evaluation.
Semi-supervised annotation of faces in image collection
VIJAYA KUMAR R,Anoop Namboodiri,Jawahar C V
Signal,Image and Video Processing, SIViP, 2018
@inproceedings{bib_Semi_2018, AUTHOR = {VIJAYA KUMAR R, Anoop Namboodiri, Jawahar C V}, TITLE = {Semi-supervised annotation of faces in image collection}, BOOKTITLE = {Signal,Image and Video Processing}. YEAR = {2018}}
Automated top view registration of broadcast football videos
RAHUL ANAND SHARMA,BHARATH BHAT,Vineet Gandhi,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {RAHUL ANAND SHARMA, BHARATH BHAT, Vineet Gandhi, Jawahar C V}, TITLE = {Automated top view registration of broadcast football videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2018}}
In this paper, we propose a novel method to register football broadcast video frames on the static top view model of the playing surface. The proposed method is fully automatic in contrast to the current state of the art which requires manual initialization of point correspondences between the image and the static model. Automatic registration using existing approaches has been difficult due to the lack of sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the registration problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic dictionary generation allows us to exhaustively cover a wide variety of camera angles and positions and reduce this problem to a minimal per-frame edge map matching procedure. We show that the per-frame results can be improved in videos using an optimization framework for temporal camera stabilization. We demonstrate the efficacy of our approach by presenting extensive results on a dataset collected from matches of football World Cup 2014.
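The nearest-neighbour search over a synthetic dictionary can be sketched as below (Python with OpenCV and NumPy assumed). The dictionary of (edge map, homography) pairs, the edge-map resolution and the simple overlap score are illustrative assumptions, not the exact edge-map distance used in the paper.

    import numpy as np
    import cv2

    def build_query_edges(frame, size=(128, 72)):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)
        return cv2.resize(edges, size, interpolation=cv2.INTER_AREA) > 0

    def register_frame(frame, dictionary):
        # dictionary: list of (boolean edge map, 3x3 homography) pairs,
        # rendered offline from the static top-view model over many camera poses.
        q = build_query_edges(frame)
        best_h, best_score = None, -1.0
        for edges, H in dictionary:
            score = np.logical_and(q, edges).sum() / (np.logical_or(q, edges).sum() + 1e-6)
            if score > best_score:
                best_score, best_h = score, H
        return best_h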
Mobile Visual Search for Digital Heritage Applications
Rohit Girdhar,Jayaguru Panda,Jawahar C V
Preserving Indian Cultural Heritage, PICH, 2017
@inproceedings{bib_Mobi_2017, AUTHOR = {Rohit Girdhar, Jayaguru Panda, Jawahar C V}, TITLE = {Mobile Visual Search for Digital Heritage Applications}, BOOKTITLE = {Preserving Indian Cultural Heritage}. YEAR = {2017}}
Mobile image retrieval allows users to identify visual information about their environment by transmitting image queries to an online image database that has associated annotations (e.g. location, product information, etc.) with the images. However, this is reliant on a network connection to transmit the query and retrieve the information. This chapter examines mobile image retrieval for offline use when the data network connection is limited or not available. In this scenario, the entire visual search index must reside on the mobile device itself. More specifically, we are interested in ‘instance retrieval’, where the annotations associated with the images (e.g. building’s name, object information) are returned by the query and not the images themselves. Figure 1 shows an example use case where mobile camera photos are used to identify buildings and landmarks without the need for a network connection. While our targeted instance retrieval does not need to store the images, the entire visual index structure needs to fit on a mobile device, ideally within a small footprint (e.g. 16–32 MB). This small memory footprint serves two purposes. First, while mobile phones have up to 16–32 GB of storage, this is mainly in the form of slower flash memory that is an order of magnitude slower than RAM. Having the entire index within tens of MBs makes it possible for use in a resident application on the phone’s RAM. Second, this small size is in line with common practices for mobile applications; e.g. the iPhone average application size is currently 23 MB. Additionally, iPhone apps less than 50 MB can be downloaded using 3G/4G, anything larger
A Support Vector Approach for Cross-Modal Search of Images and Texts
YASHASWI VERMA,Jawahar C V
Computer Vision and Image Understanding, CVIU, 2017
@inproceedings{bib_A_Su_2017, AUTHOR = {YASHASWI VERMA, Jawahar C V}, TITLE = {A Support Vector Approach for Cross-Modal Search of Images and Texts}, BOOKTITLE = {Computer Vision and Image Understanding}. YEAR = {2017}}
Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given a query image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption on the specific form of text; i.e., it could be either a set of labels, phrases, or even captions. We pose both these tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent text-corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data). We propose a novel Structural SVM based unified framework for these two tasks, and show how it can be efficiently trained and tested. Using a variety of loss functions, extensive experiments are conducted on three popular datasets (two medium-scale datasets containing a few thousand samples, and one web-scale dataset containing one million samples). Experiments demonstrate that our framework gives promising results compared to competing baseline cross-modal search techniques, thus confirming its efficacy.
Human pose search using deep networks
Nataraj Jammalamadaka,Andrew Zisserman,Jawahar C V
Image and Vision Computing, IVC, 2017
@inproceedings{bib_Huma_2017, AUTHOR = {Nataraj Jammalamadaka, Andrew Zisserman, Jawahar C V}, TITLE = {Human pose search using deep networks}, BOOKTITLE = {Image and Vision Computing}. YEAR = {2017}}
Human pose as a query modality is an alternative and rich experience for image and video retrieval. It has interesting retrieval applications in domains such as sports and dance databases. In this work we propose two novel ways for representing the image of a person striking a pose, one looking for parts and the other looking at the whole image. These representations are then used for retrieval. Both the representations are obtained using deep learning methods. In the first method, we make the following contributions: (a) We introduce ‘deep poselets’ for pose-sensitive detection of various body parts, built on convolutional neural network (CNN) features. These deep poselets significantly outperform previous instantiations of Berkeley poselets [6], and (b) Using these detector responses, we construct a pose representation that is suitable for pose search, and show that pose retrieval performance is on par with the previous methods. In the second method, we make the following contributions: (a) We design an optimized neural network which maps the input image to a very low dimensional space where similar poses are close by and dissimilar poses are farther away, and (b) We show that the pose retrieval system using this low dimensional representation is on par with the deep poselet representation and with the previous methods. The previous works with which the above two methods are compared include bag of visual words [44], Berkeley poselets [6] and human pose estimation algorithms [52]. All the methods are quantitatively evaluated on a large dataset of images built from a number of standard benchmarks together with frames from
Unconstrained Scene Text and Video Text Recognition for Arabic Script
MOHIT JAIN,MINESH MATHEW,Jawahar C V
International Workshop on Arabic and derived Script Analysis and Recognition, ASAR, 2017
@inproceedings{bib_Unco_2017, AUTHOR = {MOHIT JAIN, MINESH MATHEW, Jawahar C V}, TITLE = {Unconstrained Scene Text and Video Text Recognition for Arabic Script}, BOOKTITLE = {International Workshop on Arabic and derived Script Analysis and Recognition}. YEAR = {2017}}
Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform previous state-of-the-art on two publicly available video text datasets - ALIF and ACTIV. For the scene text recognition task, we introduce a new Arabic scene text dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of large quantity of annotated data. We overcome this by synthesizing millions of Arabic text images from a large vocabulary of Arabic words and phrases. Our implementation is built on top of the model introduced here [37] which is proven quite effective for English scene text recognition. The model follows a segmentation-free, sequence to sequence transcription approach. The network transcribes a sequence of convolutional features from the input image to a sequence of target labels. This does away with the need for segmenting input image into constituent characters/glyphs, which is often difficult for Arabic script. Further, the ability of RNNs to model contextual dependencies yields superior recognition results.
Benchmarking Scene Text Recognition in Devanagari, Telugu and Malayalam
MINESH MATHEW,MOHIT JAIN,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2017
@inproceedings{bib_Benc_2017, AUTHOR = {MINESH MATHEW, MOHIT JAIN, Jawahar C V}, TITLE = {Benchmarking Scene Text Recognition in Devanagari, Telugu and Malayalam}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2017}}
Inspired by the success of Deep Learning based approaches to English scene text recognition, we pose and benchmark scene text recognition for three Indic scripts - Devanagari, Telugu and Malayalam. Synthetic word images rendered from Unicode fonts are used for training the recognition system. The performance is benchmarked on a new IIIT-ILST dataset comprising hundreds of real scene images containing text in the above mentioned scripts. We use a segmentation free, hybrid but end-to-end trainable CNN-RNN deep neural network for transcribing the word images to the corresponding texts. The cropped word images need not be segmented into the sub-word units and the error is calculated and backpropagated for the given word image at once. The network is trained using CTC loss, which is proven quite effective for sequence-to-sequence transcription tasks. The CNN layers in the network learn to extract robust feature representations from word images. The sequence of features learnt by the convolutional block is transcribed to a sequence of labels by the RNN + CTC block. The transcription is not bound by word length or a lexicon and is ideal for Indian languages which are highly inflectional.
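A minimal PyTorch sketch of CTC-based transcription training of this kind follows; the `crnn` model, its output shape (T, N, C) and the blank index 0 are assumptions for illustration, not the exact configuration used in the paper.

    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)   # class 0 reserved for the CTC blank

    def train_step(crnn, optimiser, images, targets, target_lengths):
        logits = crnn(images)                       # hypothetical model, output (T, N, C)
        log_probs = logits.log_softmax(dim=2)
        T, N, _ = log_probs.shape
        input_lengths = torch.full((N,), T, dtype=torch.long)
        loss = ctc(log_probs, targets, input_lengths, target_lengths)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        return loss.item()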
An Empirical Study of Effectiveness of Post-processing in Indic Scripts
VINITHA V S,MINESH MATHEW,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2017
@inproceedings{bib_An_E_2017, AUTHOR = {VINITHA V S, MINESH MATHEW, Jawahar C V}, TITLE = {An Empirical Study of Effectiveness of Post-processing in Indic Scripts}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2017}}
This paper explores the effectiveness of statistical language model (SLM) and dictionary based methods for detection and correction of errors in Indic OCR output. In SLM, we use unicode level ngrams for building the language model. We compare its performance with akshara level ngrams and find that akshara level ngrams perform better in detecting the errors when compared to unicode level ngrams. We experimentally analyze the performance of Indic OCR post-processing using the dictionary method, compare the performance with English and analyze the reasons for the under-performance in Indic scripts. We use four major Indian languages for our experiments, namely Hindi, Gurumukhi, Telugu and Malayalam.
Sequence-to-Sequence Learning for Human Pose Correction in Videos
SIRNAM SWETHA,Vineeth N Balasubramanian,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2017
@inproceedings{bib_Sequ_2017, AUTHOR = {SIRNAM SWETHA, Vineeth N Balasubramanian, Jawahar C V}, TITLE = {Sequence-to-Sequence Learning for Human Pose Correction in Videos}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2017}}
The power of ConvNets has been demonstrated in a wide variety of vision tasks including pose estimation. But they often produce absurdly erroneous predictions in videos due to unusual poses, challenging illumination, blur, self-occlusions etc. These erroneous predictions can be refined by leveraging previous and future predictions as a temporal smoothness constraint in the videos. In this paper, we present a generic approach for pose correction in videos using sequence learning that makes minimal assumptions on the sequence structure. The proposed model is generic, fast and surpasses the state-of-the-art on benchmark datasets. We use a generic pose estimator for initial pose estimates, which are further refined using our method. The proposed architecture uses a Long Short-Term Memory (LSTM) encoder-decoder model to encode the temporal context and refine the estimations. We show 3.7% gain over the baseline Yang & Ramanan (YR) [1] and 2.07% gain over Spatial Fusion Network (SFN) [2] on a new challenging YouTube Pose Subset dataset [3].
Automatic analysis of broadcast football videos using contextual priors
RAHUL ANAND SHARMA,Vineet Gandhi,Jawahar C V
Signal,Image and Video Processing, SIViP, 2017
@inproceedings{bib_Auto_2017, AUTHOR = {RAHUL ANAND SHARMA, Vineet Gandhi, Jawahar C V}, TITLE = {Automatic analysis of broadcast football videos using contextual priors}, BOOKTITLE = {Signal,Image and Video Processing}. YEAR = {2017}}
The presence of standard video editing practices in broadcast sports videos, like football, effectively means that such videos have stronger contextual priors than most generic videos. In this paper, we show that such information can be harnessed for automatic analysis of sports videos. Specifically, given an input video, we output per-frame information about camera angles and the events (goal, foul, etc.). Our main insight is that in the presence of temporal context (camera angles) for a video, the problem of event tagging (fouls, corners, goals, etc.) can be cast as per frame multi-class classification problem. We show that even with simple classifiers like linear SVM, we get significant improvement in the event tagging task when contextual information is included. We present extensive results for 10 matches from the recently concluded Football World Cup, to demonstrate the effectiveness of our approach.
Image annotation by propagating labels from semantic neighbourhoods
YASHASWI VERMA,Jawahar C V
International Journal of Computer Vision, IJCV, 2017
@inproceedings{bib_Imag_2017, AUTHOR = {YASHASWI VERMA, Jawahar C V}, TITLE = {Image annotation by propagating labels from semantic neighbourhoods}, BOOKTITLE = {International Journal of Computer Vision}. YEAR = {2017}}
Automatic image annotation aims at predicting a set of semantic labels for an image. Because of the large annotation vocabulary, there exist large variations in the number of images corresponding to different labels (“class-imbalance”). Additionally, due to the limitations of human annotation, several images are not annotated with all the relevant labels (“incomplete-labelling”). These two issues affect the performance of most of the existing image annotation models. In this work, we propose the 2-pass k-nearest neighbour (2PKNN) algorithm. It is a two-step variant of the classical k-nearest neighbour algorithm that tries to address these issues in the image annotation task. The first step of 2PKNN uses “image-to-label” similarities, while the second step uses “image-to-image” similarities, thus combining the benefits of both. We also propose a metric learning framework over 2PKNN. This is done in a large margin set …
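A simplified sketch of the two-pass neighbourhood idea is given below in plain NumPy; the neighbourhood size, kernel width and feature representation are chosen arbitrarily for illustration, and the published 2PKNN additionally learns a metric and combines multiple base distances.

    import numpy as np

    def two_pass_knn(query, feats, labels, num_labels, k1=5, sigma=1.0):
        # Pass 1: for every label, keep the k1 training images of that label
        #         closest to the query (a per-label semantic neighbourhood).
        # Pass 2: transfer labels from this pooled neighbourhood with weights
        #         that decay with image-to-image distance.
        # feats: (M, D) training features; labels: list of label sets per image.
        dists = np.linalg.norm(feats - query, axis=1)
        scores = np.zeros(num_labels)
        for lbl in range(num_labels):
            idx = [i for i, ls in enumerate(labels) if lbl in ls]
            if not idx:
                continue
            nearest = sorted(idx, key=lambda i: dists[i])[:k1]
            for i in nearest:
                scores[lbl] += np.exp(-dists[i] ** 2 / (2 * sigma ** 2))
        return np.argsort(-scores)   # labels ranked by predicted relevance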
Self-supervised learning of visual features through embedding images into text topic spaces
Lluis Gomez,YASH PATEL,Marçal Rusiñol,Dimosthenis Karatzas,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2017
@inproceedings{bib_Self_2017, AUTHOR = {Lluis Gomez, YASH PATEL, Marçal Rusiñol, Dimosthenis Karatzas, Jawahar C V}, TITLE = {Self-supervised learning of visual features through embedding images into text topic spaces}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2017}}
End-to-end training from scratch of current deep architectures for new computer vision problems would require Imagenet-scale datasets, and this is not always possible. In this paper we present a method that is able to take advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is more probable to appear as an illustration. For this we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state of the art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or natural-supervised approaches.
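The text-topic targets described above can be produced, for example, with an off-the-shelf LDA implementation; the sketch below (scikit-learn, with arbitrary vocabulary and topic counts) only illustrates how such soft targets could be obtained for training a CNN, and is not the paper's exact pipeline.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    def topic_targets(article_texts, n_topics=40):
        # Fit an LDA topic model on the article texts and return, for each
        # article, a soft topic distribution to be regressed by the CNN.
        counts = CountVectorizer(max_features=20000, stop_words="english").fit_transform(article_texts)
        lda = LatentDirichletAllocation(n_components=n_topics).fit(counts)
        return lda.transform(counts)   # shape (num_articles, n_topics)

    # The CNN is then trained to predict each illustration's topic distribution,
    # e.g. with a soft-label cross-entropy (one possible formulation).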
Pose-aware person recognition
VIJAYA KUMAR R,Anoop Namboodiri,Manohar Paluri,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2017
@inproceedings{bib_Pose_2017, AUTHOR = {VIJAYA KUMAR R, Anoop Namboodiri, Manohar Paluri, Jawahar C V}, TITLE = {Pose-aware person recognition}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2017}}
Person recognition methods that use multiple body regions have shown significant improvements over traditional face-based recognition. One of the primary challenges in full-body person recognition is the extreme variation in pose and view point. In this work,(i) we present an approach that tackles pose variations utilizing multiple models that are trained on specific poses, and combined using pose-aware weights during testing.(ii) For learning a person representation, we propose a network that jointly optimizes a single loss over multiple body regions.(iii) Finally, we introduce new benchmarks to evaluate person recognition in diverse scenarios and show significant improvements over previously proposed approaches on all the benchmarks including the photo album setting of PIPA.
An Interactive tour guide for a heritage site
SAHIL CHELARAMANI,Muthireddy Vamsidhar,Jawahar C V
International Conference on Computer Vision Workshops, ICCV-W, 2017
@inproceedings{bib_An_I_2017, AUTHOR = {SAHIL CHELARAMANI, Muthireddy Vamsidhar, Jawahar C V}, TITLE = {An Interactive tour guide for a heritage site}, BOOKTITLE = {International Conference on Computer Vision Workshops}. YEAR = {2017}}
Imagine taking a guided tour of a heritage site. Generally, tour guides have canned routes and stories about the monuments. As humans, we can inform the guide about topics which we are interested in, so as to ensure that we are presented stories which match our interests. Most digital storytelling approaches fail to take into account this aspect of a storyteller. In this work, we take on the task of interactive story generation, for a casually captured video-clip of a heritage site tour. We leverage user interaction to improve the relevance of the stories presented to the user. The stories generated vary from user to user, with the stories progressively becoming more aligned with the captured interests, as the number of interactions increase. We condition the stories on visual features from the video, along with the interests of the user. We additionally present a mechanism to generate questions to be posed to a user to gain additional insights into their interests.
Collaborative Contributions for Better Annotations.
PRIYAM BAKLIWAL, Guruprasad M. Hegde,Jawahar C V
International Conference on Computer Vision Theory and Applications, VISAPP, 2017
@inproceedings{bib_Coll_2017, AUTHOR = {PRIYAM BAKLIWAL, Guruprasad M. Hegde, Jawahar C V}, TITLE = {Collaborative Contributions for Better Annotations.}, BOOKTITLE = {International Conference on Computer Vision Theory and Applications}. YEAR = {2017}}
We propose an active learning based solution for efficient, scalable and accurate annotations of objects in video sequences. Recent computer vision solutions use machine learning, and their effectiveness relies on the availability of large amounts of accurately annotated data. In this paper, we focus on reducing the human annotation efforts with a simultaneous increase in tracking accuracy to get precise, tight bounding boxes around an object of interest. We use a novel combination of two different tracking algorithms to track an object in the whole video sequence. We propose a sampling strategy to sample the most informative frame which is given for human annotation. This newly annotated frame is used to update the previous annotations. Thus, by collaborative efforts of both the human and the system we obtain accurate annotations with minimal effort. Using the proposed method, user efforts can be reduced to half without compromising on the annotation accuracy. We have quantitatively and qualitatively validated the results on eight different datasets.
Unsupervised learning based approach for plagiarism detection in programming assignments
JITENDRA YASASWI BHARADWAJ KATTA,G SRI KAILASH,CHILUPURI ANIL,Venkata Suresh Reddy Purini,Jawahar C V
Proceedings of the 10th Innovations in Software Engineering Conference, PISEC, 2017
@inproceedings{bib_Unsu_2017, AUTHOR = {JITENDRA YASASWI BHARADWAJ KATTA, G SRI KAILASH, CHILUPURI ANIL, Venkata Suresh Reddy Purini, Jawahar C V}, TITLE = {Unsupervised learning based approach for plagiarism detection in programming assignments}, BOOKTITLE = {Proceedings of the 10th Innovations in Software Engineering Conference}. YEAR = {2017}}
In this work, we propose a novel hybrid approach for automatic plagiarism detection in programming assignments. Most of the well known plagiarism detectors either employ a text-based approach or use features based on the properties of the program at a syntactic level. However, both these approaches succumb to code obfuscation, which is a huge obstacle for automatic software plagiarism detection. Our proposed method uses static features extracted from the intermediate representation of a program in a compiler infrastructure such as gcc. We demonstrate the use of unsupervised learning techniques on the extracted feature representations and show that our system is robust to code obfuscation. We test our method on assignments from an introductory programming course. The preliminary results show that our system is better when compared to other popular tools like MOSS. For visualizing the local and global …
Automated Top View Registration of Broadcast Football Videos
RAHUL ANAND SHARMA,BHARATH BHAT,Vineet Gandhi,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2017
@inproceedings{bib_Auto_2017, AUTHOR = {RAHUL ANAND SHARMA, BHARATH BHAT, Vineet Gandhi, Jawahar C V}, TITLE = {Automated Top View Registration of Broadcast Football Videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2017}}
In this paper, we propose a novel method to register football broadcast video frames on the static top view model of the playing surface. The proposed method is fully automatic in contrast to the current state of the art which requires manual initialization of point correspondences between the image and the static model. Automatic registration using existing approaches has been difficult due to the lack of sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the registration problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic dictionary generation allows us to exhaustively cover a wide variety of camera angles and positions and reduce this problem to a minimal per-frame edge map matching procedure. We show that the per-frame results can …
Unsupervised refinement of color and stroke features for text binarization
ANAND MISHRA,Karteek Alahari,Jawahar C V
International Journal on Document Analysis and Recognition, IJDAR, 2017
@inproceedings{bib_Unsu_2017, AUTHOR = {ANAND MISHRA, Karteek Alahari, Jawahar C V}, TITLE = {Unsupervised refinement of color and stroke features for text binarization}, BOOKTITLE = {International Journal on Document Analysis and Recognition}. YEAR = {2017}}
Color and strokes are the salient features of text regions in an image. In this work, we use both these features as cues, and introduce a novel energy function to formulate the text binarization problem. The minimum of this energy function corresponds to the optimal binarization. We minimize the energy function with an iterative graph cut-based algorithm. Our model is robust to variations in foreground and background as we learn Gaussian mixture models for color and strokes in each iteration of the graph cut. We show results on word images from the challenging ICDAR 2003/2011, born-digital image and street view text datasets, as well as full scene images containing text from ICDAR 2013 datasets, and compare our performance with state-of-the-art methods. Our approach shows significant improvements in performance under a variety of performance measures commonly used to assess text binarization …
Unsupervised Learning of Deep Feature Representation for Clustering Egocentric Actions.
BHARAT LAL BHATNAGAR,SURIYA SINGH,Chetan Arora,Jawahar C V
International Joint Conference on Artificial Intelligence, IJCAI, 2017
@inproceedings{bib_Unsu_2017, AUTHOR = {BHARAT LAL BHATNAGAR, SURIYA SINGH, Chetan Arora, Jawahar C V}, TITLE = {Unsupervised Learning of Deep Feature Representation for Clustering Egocentric Actions.}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}. YEAR = {2017}}
Popularity of wearable cameras in life logging, law enforcement, assistive vision and other similar applications is leading to an explosion in the generation of egocentric video content. First person action recognition is an important aspect of automatic analysis of such videos. Annotating such videos is hard, not only because of obvious scalability constraints, but also because of privacy issues often associated with egocentric videos. This motivates the use of unsupervised methods for egocentric video analysis. In this work, we propose a robust and generic unsupervised approach for first person action clustering. Unlike the contemporary approaches, our technique is neither limited to any particular class of action nor requires priors such as pre-training, fine-tuning, etc. We learn time sequenced visual and flow features from an array of weak feature extractors based on convolutional and LSTM autoencoder networks. We demonstrate that clustering of such features leads to the discovery of semantically meaningful actions present in the video. We validate our approach on four disparate public egocentric actions datasets amounting to approximately 50 hours of videos. We show that our approach surpasses the supervised state of the art accuracies without using the action labels.
Compressing Deep Neural Networks for Recognizing Places
SOHAM SAHA,Girish Varma,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2017
@inproceedings{bib_Comp_2017, AUTHOR = {SOHAM SAHA, Girish Varma, Jawahar C V}, TITLE = {Compressing Deep Neural Networks for Recognizing Places}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2017}}
Visual place recognition on low memory devices such as mobile phones and robotics systems is a challenging problem. The state of the art models for this task use deep learning architectures having close to 100 million parameters, which take over 400 MB of memory. This makes these models infeasible to deploy on low memory devices and gives rise to the need to compress them. Hence we study the effectiveness of model compression techniques like trained quantization and pruning for reducing the number of parameters of one of the best performing image retrieval models, NetVLAD. We show that a compressed network can be created by starting with a model pre-trained for the task of visual place recognition and then fine-tuning it via trained pruning and quantization. The compressed model is able to produce the same mAP as the original uncompressed network. We achieve almost 50% parameter pruning with no loss in mAP and 70% pruning with close to 2% mAP reduction, while also performing 8-bit quantization. Furthermore, together with 5-bit quantization, we perform about 50% parameter reduction by pruning and get only about 3% reduction in mAP. The resulting compressed networks have sizes of around 30 MB and 65 MB, which makes them easily usable in memory constrained devices.
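Minimal illustrations of the two compression operations mentioned above, unstructured magnitude pruning and symmetric 8-bit quantization, are sketched below in PyTorch; these are simplified stand-ins, not the trained pruning/quantization procedure or thresholds used in the paper.

    import torch

    def magnitude_prune(weight, sparsity=0.5):
        # Zero out the smallest-magnitude fraction of weights (unstructured pruning).
        k = int(weight.numel() * sparsity)
        if k == 0:
            return weight
        threshold = weight.abs().flatten().kthvalue(k).values
        return weight * (weight.abs() > threshold).float()

    def quantize_8bit(weight):
        # Simple symmetric 8-bit quantization of a weight tensor.
        scale = weight.abs().max() / 127.0
        q = torch.clamp((weight / scale).round(), -127, 127)
        return q * scale   # de-quantized values used at inference

    # A compressed network is then obtained by pruning and quantizing each layer's
    # weights and fine-tuning the remaining non-zero parameters (sketch only).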
Unconstrained ocr for urdu using deep cnn-rnn hybrid networks
MOHIT JAIN,MINESH MATHEW,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2017
@inproceedings{bib_Unco_2017, AUTHOR = {MOHIT JAIN, MINESH MATHEW, Jawahar C V}, TITLE = {Unconstrained ocr for urdu using deep cnn-rnn hybrid networks}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2017}}
Building robust text recognition systems for languages with cursive scripts like Urdu has always been challenging. Intricacies of the script and the absence of ample annotated data further act as adversaries to this task. We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, typically known as Urdu OCR. The solution proposed is not bounded by any language specific lexicon, with the model following a segmentation-free, sequence-to-sequence transcription approach. The network transcribes a sequence of convolutional features from an input image to a sequence of target labels. This discards the need to segment the input image into its constituent characters/glyphs, which is often arduous for scripts like Urdu. Furthermore, past and future contexts modelled by bidirectional recurrent layers aid the transcription. We outperform …
Plagiarism detection in programming assignments using deep features
JITENDRA YASASWI BHARADWAJ KATTA,Venkata Suresh Reddy Purini,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2017
@inproceedings{bib_Plag_2017, AUTHOR = {JITENDRA YASASWI BHARADWAJ KATTA, Venkata Suresh Reddy Purini, Jawahar C V}, TITLE = {Plagiarism detection in programming assignments using deep features}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2017}}
This paper proposes a method for detecting plagiarism in source-codes using deep features. The embeddings for programs are obtained using a character-level Recurrent Neural Network (char-RNN), which is pre-trained on Linux Kernel source-code. Many popular plagiarism detection tools are based on n-gram techniques at the syntactic level. However, these approaches to plagiarism detection fail to capture long term dependencies (non-contiguous interaction) present in the source-code. In contrast, the proposed deep features capture non-contiguous interaction within n-grams. These are generic in nature and there is no need to fine-tune the char-RNN model again on program submissions from each individual problem-set. Our experiments show the effectiveness of deep features in the task of classifying assignment program submissions as copy, partial-copy and non-copy. Comparing our proposed features with …
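One simple way to use such program embeddings is sketched below in PyTorch: a character-level encoder (untrained here, whereas the paper pre-trains on Linux kernel code) and a cosine similarity between two submissions. The class and function names are illustrative, not the paper's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CharEncoder(nn.Module):
        # Embed a program as the mean of char-RNN hidden states.
        def __init__(self, vocab_size=128, emb=64, hidden=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb)
            self.rnn = nn.LSTM(emb, hidden, batch_first=True)

        def forward(self, char_ids):                  # (1, T) character codes
            out, _ = self.rnn(self.emb(char_ids))     # (1, T, hidden)
            return out.mean(dim=1).squeeze(0)         # deep feature for the program

    def similarity(encoder, src_a, src_b):
        to_ids = lambda s: torch.tensor([[min(ord(c), 127) for c in s]])
        fa, fb = encoder(to_ids(src_a)), encoder(to_ids(src_b))
        return F.cosine_similarity(fa, fb, dim=0).item()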
Improving small object detection
HARISH KRISHNA V,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2017
@inproceedings{bib_Impr_2017, AUTHOR = {HARISH KRISHNA V, Jawahar C V}, TITLE = {Improving small object detection}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2017}}
While the problem of detecting generic objects in natural scene images has been the subject of research for a long time, the problem of detection of small objects has been largely ignored. While generic object detectors perform well on medium and large sized objects, they perform poorly for the overall task of recognition of small objects. This is because of the low resolution and simple shape of most small objects. In this work, we suggest a simple yet effective upsampling-based technique that performs better than the current state-of-the-art for end-to-end small object detection. Like most recent methods, we generate proposals and then classify them. We suggest improvements to both these steps for the case of small objects.
An EEG-based image annotation system
PAREKH VIRAL MAHESH KUMAR,Ramanathan Subramanian,Dipanjan Roy,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2017
@inproceedings{bib_An_E_2017, AUTHOR = {PAREKH VIRAL MAHESH KUMAR, Ramanathan Subramanian, Dipanjan Roy, Jawahar C V}, TITLE = {An EEG-based image annotation system}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2017}}
We propose an EEG (Electroencephalogram)-based image annotation system. While humans can recognize objects in 20–200 ms, the need to manually label images results in a low annotation throughput. Our system employs brain signals captured via a consumer EEG device to achieve an annotation rate of up to 10 images per second. We exploit the P300 event-related potential (ERP) signature to identify target images during a rapid serial visual presentation (RSVP) task. We further perform unsupervised outlier removal to achieve an F1-score of 0.88 on the test set. The proposed system does not depend on category-specific EEG signatures enabling the annotation of any new image category without any model pre-training.
Smarttennistv: Automatic indexing of tennis videos
ANURAG GHOSH,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2017
@inproceedings{bib_Smar_2017, AUTHOR = {ANURAG GHOSH, Jawahar C V}, TITLE = {Smarttennistv: Automatic indexing of tennis videos}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2017}}
In this paper, we demonstrate a score based indexing approach for tennis videos. Given a broadcast tennis video (btv), we index all the video segments with their scores to create a navigable and searchable match. Our approach temporally segments the rallies in the video and then recognizes the scores from each of the segments, before refining the scores using the knowledge of the tennis scoring system. We finally build an interface to effortlessly retrieve and view the relevant video segments by also automatically tagging the segmented rallies with human accessible tags such as ‘fault’ and ‘deuce’. The efficiency of our approach is demonstrated on btv’s from two major tennis tournaments.
Face Fiducial Detection by Consensus of Exemplars
Mallikarjun B R,Visesh Chari,Jawahar C V,Akshay Asthana
Winter Conference on Applications of Computer Vision, WACV, 2016
@inproceedings{bib_Face_2016, AUTHOR = {Mallikarjun B R, Visesh Chari, Jawahar C V, Akshay Asthana}, TITLE = {Face Fiducial Detection by Consensus of Exemplars}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2016}}
Facial fiducial detection is a challenging problem for several reasons like varying pose, appearance, expression, partial occlusion and others. In the past, several approaches like mixture of trees [32], regression based methods [8], exemplar based methods [7] have been proposed to tackle this challenge. In this paper, we propose an exemplar based approach to select the best solution from among outputs of regression and mixture of trees based algorithms (which we call candidate algorithms). We show that by using a very simple SIFT and HOG based descriptor, it is possible to identify the most accurate fiducial outputs from a set of results produced by candidate algorithms on any given test image. Our approach manifests as two algorithms, one based on optimizing an objective function with quadratic terms and the other based on simple kNN. Both algorithms take as input fiducial locations produced by running state-of-the-art candidate algorithms on an input image, and output accurate fiducials using a set of automatically selected exemplar images with annotations. Our surprising result is that in this case, a simple algorithm like kNN is able to take advantage of the seemingly huge complementarity of these candidate algorithms, better than optimization based algorithms. We do extensive experiments on several datasets, and show that our approach outperforms state-of-the-art consistently. In some cases, we report as much as a 10% improvement in accuracy. We also extensively analyze each component of our approach, to illustrate its efficacy. An implementation and extended technical report of our approach is available www.sites.google.com/site/wacv2016facefiducialexemplars.
Error Detection in Indic OCRs
VINITHA V S,Jawahar C V
Document Analysis Systems, DAS, 2016
@inproceedings{bib_Erro_2016, AUTHOR = {VINITHA V S, Jawahar C V}, TITLE = {Error Detection in Indic OCRs}, BOOKTITLE = {Document Analysis Systems}. YEAR = {2016}}
A good post processing module is an indispensable part of an OCR pipeline. In this paper, we propose a novel method for error detection in Indian language OCR output. Our solution uses a recurrent neural network (RNN) for classification of a word as an error or not. We propose a generic error detection method and demonstrate its effectiveness on four popular Indian languages. We divide the words into their constituent aksharas and use their bigram and trigram level information to build a feature representation. In order to train the classifier on incorrect words, we use the mis-recognized words in the output of the OCR. In addition to RNN, we also explore the effectiveness of a generative model such as GMM for our task and demonstrate an improved performance by combining both the approaches. We tested our method on four popular Indian languages and report an average error detection performance above 80%.
Trajectory aligned features for first person action recognition
SURIYA SINGH,Chetan Arora,Jawahar C V
Pattern Recognition, PR, 2016
@inproceedings{bib_Traj_2016, AUTHOR = {SURIYA SINGH, Chetan Arora, Jawahar C V}, TITLE = {Trajectory aligned features for first person action recognition}, BOOKTITLE = {Pattern Recognition}. YEAR = {2016}}
Egocentric videos are characterized by their ability to have the first person view. With the popularity of Google Glass and GoPro, use of egocentric videos is on the rise. With the substantial increase in the number of egocentric videos, the value and utility of recognizing actions of the wearer in such videos has also thus increased. Unstructured movement of the camera due to natural head motion of the wearer causes sharp changes in the visual field of the egocentric camera causing many standard third person action recognition techniques to perform poorly on such videos. Objects present in the scene and hand gestures of the wearer are the most important cues for first person action recognition but are difficult to segment and recognize in an egocentric video. We propose a novel representation of the first person actions derived from feature trajectories. The features are simple to compute using standard point tracking and do not assume segmentation of hand/objects or recognizing object or hand pose unlike in many previous approaches. We train a bag of words classifier with the proposed features and report a performance improvement of more than 11% on publicly available datasets. Although not designed for the particular case, we show that our technique can also recognize wearer's actions when hands or objects are not visible.
Discriminative learning based visual servoing across object instances
HARIT PANDYA,K Madhava Krishna,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2016
@inproceedings{bib_Disc_2016, AUTHOR = {HARIT PANDYA, K Madhava Krishna, Jawahar C V}, TITLE = {Discriminative learning based visual servoing across object instances}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2016}}
Classical visual servoing approaches use visual features based on geometry of the object such as points, lines, region, etc. to attain the desired camera pose. However, geometrical features are not suited for visual servoing across different object instances due to large variations in appearance and shape. In this paper, we present a new framework for visual servoing across object instances. Our approach is based on a discriminative learning framework where the desired pose is estimated using previously seen examples. Specifically, we learn a binary classifier that separates the desired pose from all other poses for that object category. The classification error is then used to control the end-effector so that the desired pose is attained. We present controllers for linear, kernel and exemplar Support Vector Machine (SVM) and empirically discuss their performance in the visual servoing context. To address large intra-category variation in appearance, we propose a modified version of Histogram of Oriented Gradients (HOG) features for visual servoing. We show effective servoing across diverse instances over 3 object categories with zero terminal velocity and acceptable camera pose error at termination.
First person action recognition using deep learned descriptors
SURIYA SINGH,Chetan Arora,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2016
@inproceedings{bib_Firs_2016, AUTHOR = {SURIYA SINGH, Chetan Arora, Jawahar C V}, TITLE = {First person action recognition using deep learned descriptors}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2016}}
We focus on the problem of wearer's action recognition in first person aka egocentric videos. This problem is more challenging than third person activity recognition due to unavailability of wearer's pose and sharp movements in the videos caused by the natural head motion of the wearer. Carefully crafted features based on hands and objects cues for the problem have been shown to be successful for limited targeted datasets. We propose convolutional neural networks (CNNs) for end to end learning and classification of wearer's actions. The proposed network makes use of egocentric cues by capturing hand pose, head motion and saliency map. It is compact. It can also be trained from the relatively small number of labeled egocentric videos that are available. We show that the proposed network can generalize and give state of the art performance on various disparate egocentric action datasets.
Learning multiple experiences useful visual features for active maps localization in crowded environments
A. H. Abdul Hafez,Manpreet Arora,K Madhava Krishna,Jawahar C V
Advanced Robotics, AR, 2016
@inproceedings{bib_Lear_2016, AUTHOR = {A. H. Abdul Hafez, Manpreet Arora, K Madhava Krishna, Jawahar C V}, TITLE = {Learning multiple experiences useful visual features for active maps localization in crowded environments}, BOOKTITLE = {Advanced Robotics}. YEAR = {2016}}
Crowded urban environments are composed of different types of dynamic and static elements. Learning and classification of features is a major task in solving the localization problem in such environments. This work presents a gradual learning methodology to learn the useful features using multiple experiences. The usefulness of an observed element is evaluated by a scoring mechanism which uses two scores – reliability and distinctiveness. The visual features thus learned are used to partition the visual map into smaller regions. The robot is efficiently localized in such a partitioned environment using two-level localization. The concept of active map (AM) is proposed here, which is a map that represents one partition of the environment in which there is a high probability of the robot existing. High-level localization is used to track the mode of the AMs using discrete Bayes filter. Low-level localization uses a bag-of …
Fine-tuning human pose estimations in videos
DIGVIJAY SINGH,Vineeth Balasubramanian,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2016
@inproceedings{bib_Fine_2016, AUTHOR = {DIGVIJAY SINGH, Vineeth Balasubramanian, Jawahar C V}, TITLE = {Fine-tuning human pose estimations in videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2016}}
We propose a semi-supervised self-training method for fine-tuning human pose estimations in videos that provides accurate estimations even for complex sequences. We surpass state-of-the-art on most of the datasets used and also show a 2.33% gain over the baseline on our new dataset of unrestricted sports videos. The self-training model presented has two components: a static Pictorial Structure (PS) based model and a dynamic ensemble of exemplars. We present a pose quality criteria that is primarily used for batch selection and automatic parameter selection. The same criteria works as a low-level pose evaluator used in post-processing. We set a new challenge by introducing a full human body-parts annotated complex dataset, CVIT-SPORTS, which contains complex videos from the sports domain. The strength of our method is demonstrated by adapting to videos of complex activities such as cricket …
Efficient object annotation for surveillance and automotive applications
SIRNAM SWETHA,ANAND MISHRA,Guruprasad M Hegde,Jawahar C V
Winter Conference on Applications of Computer Vision Workshops, WACV-W, 2016
@inproceedings{bib_Effi_2016, AUTHOR = {SIRNAM SWETHA, ANAND MISHRA, Guruprasad M Hegde, Jawahar C V}, TITLE = {Efficient object annotation for surveillance and automotive applications}, BOOKTITLE = {Winter Conference on Applications of Computer Vision Workshops}. YEAR = {2016}}
Accurately annotated large video data is critical for the development of reliable surveillance and automotive related vision solutions. In this work, we propose an efficient and yet accurate annotation scheme for objects in videos (pedestrians in this case) with minimal supervision. We annotate objects with tight bounding boxes. We propagate the annotations across the frames with a self training based approach. An energy minimization scheme for the segmentation is the central component of our method. Unlike the popular grab cut like segmentation schemes, we demand minimal user intervention. Since our annotation is built on an accurate segmentation, our bounding boxes are tight. We validate the performance of our approach on multiple publicly available datasets.
Enhancing energy minimization framework for scene text recognition with top-down cues
ANAND MISHRA, Karteek Alahari,Jawahar C V
Computer Vision and Image Understanding, CVIU, 2016
@inproceedings{bib_Enha_2016, AUTHOR = {ANAND MISHRA, Karteek Alahari, Jawahar C V}, TITLE = {Enhancing energy minimization framework for scene text recognition with top-down cues}, BOOKTITLE = {Computer Vision and Image Understanding}. YEAR = {2016}}
Recognizing scene text is a challenging problem, even more so than the recognition of scanned documents. This problem has gained significant attention from the computer vision community in recent years, and several methods based on energy minimization frameworks and deep learning approaches have been proposed. In this work, we focus on the energy minimization framework and propose a model that exploits both bottom-up and top-down cues for recognizing cropped words extracted from street images. The bottom-up cues are derived from individual character detections from an image. We build a conditional random field model on these detections to jointly model the strength of the detections and the interactions between them. These interactions are top-down cues obtained from a lexicon-based prior, i.e., language statistics. The optimal word represented by the text image is obtained by minimizing the …
A simple and effective solution for script identification in the wild
AJEET KUMAR SINGH,ANAND MISHRA,Pranav Dabral,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2016
@inproceedings{bib_A_si_2016, AUTHOR = {AJEET KUMAR SINGH, ANAND MISHRA, Pranav Dabral, Jawahar C V}, TITLE = {A simple and effective solution for script identification in the wild}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2016}}
We present an approach for automatically identifying the script of the text localized in scene images. Our approach is inspired by the advancements in mid-level features. We represent the text images using mid-level features which are pooled from densely computed local features. Once text images are represented using the proposed mid-level feature representation, we use an off-the-shelf classifier to identify the script of the text image. Our approach is efficient and requires very little labeled data. We evaluate the performance of our method on the recently introduced CVSI dataset, demonstrating that the proposed approach can correctly identify the script of 96.70% of the text images. In addition, we also introduce and benchmark a more challenging Indian Language Scene Text (ILST) dataset for evaluating the performance of our method.
Multilingual ocr for indic scripts
MINESH MATHEW,AJEET KUMAR SINGH,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2016
@inproceedings{bib_Mult_2016, AUTHOR = {MINESH MATHEW, AJEET KUMAR SINGH, Jawahar C V}, TITLE = {Multilingual ocr for indic scripts}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2016}}
In the Indian scenario, a document analysis system has to support multiple languages at the same time. With emerging multilingualism in urban India, often bilingual, trilingual or even more languages need to be supported. This demands development of a multilingual OCR system which can work seamlessly across Indic scripts. In our approach the script is identified at word level, prior to the recognition of the word. An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose. We demonstrate the approach for 12 Indian languages and English. It is observed that, even with a similar architecture, performance on Indian languages is poorer compared to English. We investigate this further. Our approach is evaluated on a large corpus comprising thousands of pages. The Hindi OCR is compared with other popular OCRs for the …
Diverse yet efficient retrieval using locality sensitive hashing
P VIDYADHAR RAO,Prateek Jain,Jawahar C V
International Conference on Multimedia Retrieval, ICMR, 2016
@inproceedings{bib_Dive_2016, AUTHOR = {P VIDYADHAR RAO, Prateek Jain, Jawahar C V}, TITLE = {Diverse yet efficient retrieval using locality sensitive hashing}, BOOKTITLE = {International Conference on Multimedia Retrieval}. YEAR = {2016}}
Typical retrieval systems have three requirements: a) Accurate retrieval, i.e., the method should have high precision, b) Diverse retrieval, i.e., the obtained set of samples should be diverse, and c) Retrieval time should be small. However, most of the existing methods address only one or two of the above-mentioned requirements. In this work, we present a method based on randomized locality sensitive hashing which tries to address all of the above requirements simultaneously. While earlier hashing-based approaches considered approximate retrieval to be acceptable only for the sake of efficiency, we argue that one can further exploit approximate retrieval to provide impressive trade-offs between accuracy and diversity. We also extend our method to the problem of multi-label prediction, where the goal is to output a diverse and accurate set of labels for a given document in real-time. Finally, we present empirical …
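For intuition, a sketch of random-hyperplane locality sensitive hashing, the kind of randomized hashing the approach builds on; the diversity-aware candidate selection proposed in the paper is not reproduced here.

# Sketch under stated assumptions: random-hyperplane LSH for approximate retrieval.
import numpy as np

class RandomHyperplaneLSH:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.table = {}

    def _key(self, x):
        return tuple((self.planes @ x > 0).astype(int))

    def index(self, vectors):
        for i, v in enumerate(vectors):
            self.table.setdefault(self._key(v), []).append(i)

    def query(self, q):
        # Items sharing the query's bucket are the approximate neighbours; sampling
        # across buckets/tables is one way to trade accuracy for diversity.
        return self.table.get(self._key(q), [])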
Single and multiple view support order prediction in clutter for manipulation
SWAGATIKA PANDA,A. H. Abdul Hafez,Jawahar C V
Journal of Intelligent and Robotic Systems, JIRS, 2016
@inproceedings{bib_Sing_2016, AUTHOR = {SWAGATIKA PANDA, A. H. Abdul Hafez, Jawahar C V}, TITLE = {Single and multiple view support order prediction in clutter for manipulation}, BOOKTITLE = {Journal of Intelligent and Robotic Systems}. YEAR = {2016}}
Robotic manipulation of objects in clutter remains a challenging problem to date. The challenge is posed by the various levels of complexity involved in the interactions among objects. Understanding these semantic interactions among different objects is important for manipulation in complex settings. It can play a significant role in extending the scope of manipulation to cluttered environments involving generic objects and both direct and indirect physical contact. In our work, we aim at learning semantic interaction among objects of generic shapes and sizes lying in clutter involving physical contact. We infer three types of support relationships: “support from below”, “support from side”, and “containment”. Subsequently, the learned semantic interaction or support relationship is used to derive a sequence or order in which the objects surrounding the object of interest should be removed without causing damage to the …
Generating synthetic data for text recognition
PRAVEEN KRISHNAN,Jawahar C V
Technical Report, arXiv, 2016
@inproceedings{bib_Gene_2016, AUTHOR = {PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {Generating synthetic data for text recognition}, BOOKTITLE = {Technical Report}. YEAR = {2016}}
Generating synthetic images is an art which emulates the natural process of image generation as closely as possible. In this work, we exploit such a framework for data generation in the handwritten domain. We render synthetic data using open source fonts and incorporate data augmentation schemes. As part of this work, we release a 9M synthetic handwritten word image corpus which could be useful for training deep network architectures and advancing the performance in handwritten word spotting and recognition tasks.
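A small sketch, not the released pipeline, of rendering a synthetic word image from an open-source font with one simple augmentation; font_path is a placeholder for any handwriting-style TTF file on disk.

# Sketch: render a word image from a font and apply a mild rotation as augmentation.
from PIL import Image, ImageDraw, ImageFont
import random

def render_word(word, font_path, size=48, canvas=(256, 96)):
    font = ImageFont.truetype(font_path, size)
    img = Image.new("L", canvas, color=255)            # white background, grayscale
    ImageDraw.Draw(img).text((10, 20), word, font=font, fill=0)
    angle = random.uniform(-3, 3)                      # small random rotation
    return img.rotate(angle, fillcolor=255)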
A robust distance with correlated metric learning for multi-instance multi-label data
YASHASWI VERMA,Jawahar C V
International Conference on Multimedia, IMM, 2016
@inproceedings{bib_A_ro_2016, AUTHOR = {YASHASWI VERMA, Jawahar C V}, TITLE = {A robust distance with correlated metric learning for multi-instance multi-label data}, BOOKTITLE = {International Conference on Multimedia}. YEAR = {2016}}
In multi-instance data, every object is a bag that contains multiple elements or instances. Each bag may be assigned to one or more classes, such that it has at least one instance corresponding to every assigned class. However, since the annotations are at the bag level, there is no direct association between the instances within a bag and the assigned class labels, making the problem significantly challenging. While existing methods have mostly focused on Bag-to-Bag or Class-to-Bag distances, in this paper we address the multiple instance learning problem using a novel Bag-to-Class distance measure. This is based on two observations: (a) the existence of outliers is natural in multi-instance data, and (b) there may exist multiple instances within a bag that belong to a particular class. In order to address these, in the proposed distance measure (a) we employ the L1-distance, which brings robustness against outliers …
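One simple Bag-to-Class distance in the spirit described above, sketched under stated assumptions: each bag instance is matched to its nearest class instance under the L1 distance, and the per-instance distances are aggregated with a trimmed mean for robustness to outliers. The paper's exact measure, and its learned correlated metric, are not reproduced.

# Sketch of a robust Bag-to-Class distance (illustrative only).
import numpy as np

def bag_to_class_distance(bag, class_instances, trim=0.2):
    bag = np.asarray(bag, float)                # (m, d) instances in the bag
    C = np.asarray(class_instances, float)      # (n, d) training instances of the class
    d = np.abs(bag[:, None, :] - C[None, :, :]).sum(axis=2)   # pairwise L1 distances
    nearest = d.min(axis=1)                     # each instance's distance to the class
    k = max(1, int(len(nearest) * (1 - trim)))
    return np.sort(nearest)[:k].mean()          # trimmed mean downweights outlier instances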
Matching handwritten document images
PRAVEEN KRISHNAN,Jawahar C V
European Conference on Computer Vision, ECCV, 2016
@inproceedings{bib_Matc_2016, AUTHOR = {PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {Matching handwritten document images}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2016}}
We address the problem of predicting similarity between a pair of handwritten document images written by potentially different individuals. This has applications related to matching and mining in image collections containing handwritten content. A similarity score is computed by detecting patterns of text re-usages between document images irrespective of the minor variations in word morphology, word ordering, layout and paraphrasing of the content. Our method does not depend on an accurate segmentation of words and lines. We formulate the document matching problem as a structured comparison of the word distributions across two document images. To match two word images, we propose a convolutional neural network (cnn) based feature descriptor. Performance of this representation surpasses the state-of-the-art on handwritten word spotting. Finally, we demonstrate the applicability of our …
IIIT-CFW: a benchmark database of cartoon faces in the wild
Ashutosh Mishra,Shyam Nandan Rai,ANAND MISHRA,Jawahar C V
European Conference on Computer Vision, ECCV, 2016
@inproceedings{bib_IIIT_2016, AUTHOR = { Ashutosh Mishra, Shyam Nandan Rai, ANAND MISHRA, Jawahar C V}, TITLE = {IIIT-CFW: a benchmark database of cartoon faces in the wild}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2016}}
In this paper, we introduce the cartoon faces in the wild (IIIT-CFW) database and associated problems. This database contains 8,928 annotated images of cartoon faces of 100 public figures. It will be useful in conducting research on a spectrum of problems associated with cartoon understanding. Note that, to our knowledge, such realistic and large databases of cartoon faces are not available in the literature.
Partial linearization based optimization for multi-class SVM
PRITISH MOHAPATRA,Puneet Kumar Dokania,Jawahar C V,M. Pawan Kumar
European Conference on Computer Vision, ECCV, 2016
@inproceedings{bib_Part_2016, AUTHOR = {PRITISH MOHAPATRA, Puneet Kumar Dokania, Jawahar C V, M. Pawan Kumar}, TITLE = {Partial linearization based optimization for multi-class SVM}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2016}}
We propose a novel partial linearization based approach for optimizing the multi-class svm learning problem. Our method is an intuitive generalization of the Frank-Wolfe and the exponentiated gradient algorithms. In particular, it allows us to combine several of their desirable qualities into one approach: (i) the use of an expectation oracle (which provides the marginals over each output class) in order to estimate an informative descent direction, similar to exponentiated gradient; (ii) analytical computation of the optimal step-size in the descent direction that guarantees an increase in the dual objective, similar to Frank-Wolfe; and (iii) a block coordinate formulation similar to the one proposed for Frank-Wolfe, which allows us to solve large-scale problems. Using the challenging computer vision problems of action classification, object recognition and gesture recognition, we demonstrate the efficacy of our …
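As an illustration of property (ii), for a concave quadratic dual f(\alpha) = b^{\top}\alpha - \tfrac{1}{2}\alpha^{\top} Q \alpha the exact line search along a direction d has the closed form below; the paper's actual objective and descent direction (built from the expectation oracle's marginals) are more structured, so this is only a schematic.

\gamma^{*} \;=\; \min\!\Big(1,\; \max\!\Big(0,\; \frac{\nabla f(\alpha)^{\top} d}{d^{\top} Q\, d}\Big)\Big), \qquad \alpha \leftarrow \alpha + \gamma^{*} d,

which guarantees an increase in the dual objective at every iteration without any step-size tuning.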
Dynamic narratives for heritage tour
ANURAG GHOSH,YASH PATEL,MOHAK KUMAR SUKHWANI,Jawahar C V
European Conference on Computer Vision, ECCV, 2016
@inproceedings{bib_Dyna_2016, AUTHOR = {ANURAG GHOSH, YASH PATEL, MOHAK KUMAR SUKHWANI, Jawahar C V}, TITLE = {Dynamic narratives for heritage tour}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2016}}
We present a dynamic story generation approach for egocentric videos from heritage sites. Given a short video clip of a ‘heritage-tour’, our method selects a series of short descriptions from a collection of pre-curated text and creates a larger narrative. Unlike in the past, these narratives are not merely monotonic static versions produced by simple retrieval. We propose a method to generate dynamic narratives of the tour on the fly. The series of text messages selected is optimised over length, relevance, cohesion and information simultaneously. This results in ‘tour guide’ like narratives which are seasoned and adapted to the participant's selection of the tour path. We simultaneously use visual and GPS cues for precise localization on the heritage site, which is conceptually formulated as a graph. The efficacy of the approach is demonstrated on a heritage site, Golconda Fort, situated in Hyderabad …
Visual aesthetic analysis for handwritten document images
ANSHUMAN MAJUMDAR,PRAVEEN KRISHNAN,Jawahar C V
International Conference on Frontiers in Handwriting Recognition, ICFHR, 2016
@inproceedings{bib_Visu_2016, AUTHOR = {ANSHUMAN MAJUMDAR, PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {Visual aesthetic analysis for handwritten document images}, BOOKTITLE = {International Conference on Frontiers in Handwriting Recognition}. YEAR = {2016}}
We present an approach for analyzing the visual aesthetic property of a handwritten document page that matches human perception. We formulate the problem at two independent levels: (i) a coarse level, which deals with the overall layout and the space usage between lines, words and margins, and (ii) a fine level, which analyses the construction of each word and deals with the aesthetic properties of writing styles. We present our observations on multiple local and global features which can extract the aesthetic cues present in handwritten documents.
Deep feature embedding for accurate recognition and retrieval of handwritten text
PRAVEEN KRISHNAN,KARTIK DUTTA,Jawahar C V
International Conference on Frontiers in Handwriting Recognition, ICFHR, 2016
@inproceedings{bib_Deep_2016, AUTHOR = {PRAVEEN KRISHNAN, KARTIK DUTTA, Jawahar C V}, TITLE = {Deep feature embedding for accurate recognition and retrieval of handwritten text}, BOOKTITLE = {International Conference on Frontiers in Handwriting Recognition}. YEAR = {2016}}
We propose a deep convolutional feature representation that achieves superior performance for word spotting and recognition on handwritten images. We focus on (i) enhancing the discriminative ability of the convolutional features using a reduced feature representation that can scale to large datasets, and (ii) enabling query-by-string by learning a common subspace for image and text using the embedded attribute framework. We present our results on popular datasets such as the IAM corpus and historical document collections from the Bentham and George Washington pages. On the challenging IAM dataset, we achieve a state-of-the-art mAP of 91.58% on word spotting using textual queries and a mean word error rate of 6.69% for the word recognition task.
Align Me: A framework to generate Parallel Corpus Using OCRs and Bilingual Dictionaries
PRIYAM BAKLIWAL,Devadath V V,Jawahar C V
Workshop on South and Southeast Asian Natural Language Processing, SSANLP-W, 2016
@inproceedings{bib_Alig_2016, AUTHOR = {PRIYAM BAKLIWAL, Devadath V V, Jawahar C V}, TITLE = {Align Me: A framework to generate Parallel Corpus Using OCRs and Bilingual Dictionaries}, BOOKTITLE = {Workshop on South and Southeast Asian Natural Language Processing}. YEAR = {2016}}
Multilingual language processing tasks like statistical machine translation and cross language information retrieval rely mainly on the availability of accurate parallel corpora. Manual construction of such corpora can be extremely expensive and time consuming. In this paper we present a simple yet efficient method to generate a large amount of reasonably accurate parallel corpus with minimal user effort. We utilize the availability of a large number of English books and their corresponding translations in other languages to build the parallel corpus. Optical Character Recognition (OCR) systems are used to digitize such books. We propose a robust dictionary based parallel corpus generation system for alignment of multilingual text at different levels of granularity (sentences, paragraphs, etc.). We show the performance of our proposed method on a manually aligned dataset of 300 Hindi-English sentences and 100 English-Malayalam sentences.
Frame level annotations for tennis videos
Mohak Sukhwani,Jawahar C V
International conference on Pattern Recognition, ICPR, 2016
@inproceedings{bib_Fram_2016, AUTHOR = {Mohak Sukhwani, Jawahar C V}, TITLE = {Frame level annotations for tennis videos}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2016}}
Content based indexing is critical to the effective access of multimedia data. To this end, visual data is often annotated with textual content for bridging the semantic gap. In this paper, we present a method to generate frame level fine grained annotations for a given video clip. Access to frame level fine grained annotations leads to rich, dense and meaningful semantic associations between the text and the video. This in turn makes video retrieval systems more accurate. We demonstrate the use of probabilistic label consistent sparse coding and dictionary learning with a K-SVD algorithm to generate `fine grained' annotations for a class of videos: lawn tennis. The algorithm simultaneously learns a classifier and a dictionary to generate the frame level annotations for the tennis videos using available textual descriptions. The utility of the proposed algorithm is demonstrated on a publicly available tennis dataset …
Efficient Face Frontalization in Unconstrained Images
Mallikarjun B R,Visesh Chari,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2015
@inproceedings{bib_Effi_2015, AUTHOR = {Mallikarjun B R, Visesh Chari, Jawahar C V}, TITLE = {Efficient Face Frontalization in Unconstrained Images}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2015}}
Face frontalization is the process of synthesizing a frontal view of a face, given its non-frontal view. Frontalization is used in intelligent photo editing tools and also aids in improving the accuracy of face recognition systems. For example, in the case of photo editing, faces of persons in a group photo can be corrected to look into the camera if they are looking elsewhere. Similarly, even though recent methods in face recognition claim accuracy which surpasses that of humans in some cases, the performance of recognition systems degrades when profile views of faces are given as input. One way to address this issue is to synthesize frontal views of faces before recognition. We propose a simple and efficient method to address the face frontalization problem. Our method leverages the fact that faces in general have a definite structure and can be represented in a low dimensional subspace. We employ an exemplar based approach to find the transformation that relates the profile view to the frontal view, and use it to generate realistic frontalizations. Our method does not involve estimating a 3D model of the face, which is a common approach in previous work in this area. This leads to an efficient solution, since we avoid the complexity of adding one more dimension to the problem. Our method also retains the structural information of the individual as compared to a recent method [4], which assumes a generic 3D model for synthesis. We show impressive qualitative and quantitative results in comparison to the state-of-the-art in this field.
Path planning for visual servoing and navigation using convex optimization
Abdul Hafez Abdul Hafez, Anil K Nelakanti,Jawahar C V
International Journal of Robotics and Automation, IJRA, 2015
@inproceedings{bib_Path_2015, AUTHOR = {Abdul Hafez Abdul Hafez, Anil K Nelakanti, Jawahar C V}, TITLE = {Path planning for visual servoing and navigation using convex optimization}, BOOKTITLE = {International Journal of Robotics and Automation}. YEAR = {2015}}
Accurate localization by fusing images and GPS signals
KUMAR VISHAL,Jawahar C V,Visesh Chari
Computer Vision and Pattern Recognition Conference workshops, CVPR-W, 2015
@inproceedings{bib_Accu_2015, AUTHOR = {KUMAR VISHAL, Jawahar C V, Visesh Chari}, TITLE = {Accurate localization by fusing images and GPS signals}, BOOKTITLE = {Computer Vision and Pattern Recognition Conference workshops}. YEAR = {2015}}
Localization in 3D is an important problem with wide-ranging applications from autonomous navigation in robotics to location specific services on mobile devices. GPS sensors are a commercially viable option for localization, and are ubiquitous in their use, especially in portable devices. With the proliferation of mobile cameras, however, maturing localization algorithms based on computer vision are emerging as a viable alternative. Although both vision and GPS based localization algorithms have many limitations and inaccuracies, there are some interesting complementarities in their success/failure scenarios that justify an investigation into their joint utilization. Such investigations are further justified considering that many of the modern wearable and mobile computing devices come with sensors for both GPS and vision. In this work, we investigate approaches to reinforce GPS localization with vision algorithms and vice versa. Specifically, we show how noisy GPS signals can be rectified by vision based localization of images captured in the vicinity. Alternatively, we also show how GPS readouts might be used to disambiguate images that are visually similar looking but belong to different places. Finally, we empirically validate our solutions to show that fusing both these approaches can result in a more accurate and reliable localization of videos captured with a Contour action camera, over a 600 meter long path, over 10 different days.
Multi-label cross-modal retrieval
Viresh Ranjan,Nikhil Rasiwasia,Jawahar C V
International Conference on Computer Vision, ICCV, 2015
@inproceedings{bib_Mult_2015, AUTHOR = {Viresh Ranjan, Nikhil Rasiwasia, Jawahar C V}, TITLE = {Multi-label cross-modal retrieval}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2015}}
In this work, we address the problem of cross-modal retrieval in presence of multi-label annotations. In particular, we introduce multi-label Canonical Correlation Analysis (ml-CCA), an extension of CCA, for learning shared subspaces taking into account high level semantic information in the form of multi-label annotations. Unlike CCA, ml-CCA does not rely on explicit pairing between modalities, instead it uses the multi-label information to establish correspondences. This results in a discriminative subspace which is better suited for cross-modal retrieval tasks. We also present Fast ml-CCA, a computationally efficient version of ml-CCA, which is able to handle large scale datasets. We show the efficacy of our approach by conducting extensive cross-modal retrieval experiments on three standard benchmark datasets. The results show that the proposed approach achieves state of the art retrieval performance on the three datasets.
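For intuition, a sketch of classical two-view CCA with scikit-learn is given below, only to make the shared-subspace idea concrete; ml-CCA as proposed above replaces the strict image-text pairing with correspondences derived from the multi-label annotations, which this sketch does not implement.

# Sketch: learn a shared subspace with plain CCA and rank images for text queries.
import numpy as np
from sklearn.cross_decomposition import CCA

def learn_shared_subspace(image_feats, text_feats, dim=64):
    return CCA(n_components=dim).fit(image_feats, text_feats)

def cross_modal_similarity(cca, image_feats, text_feats):
    img_z, txt_z = cca.transform(image_feats, text_feats)      # project both views
    img_z /= np.linalg.norm(img_z, axis=1, keepdims=True)
    txt_z /= np.linalg.norm(txt_z, axis=1, keepdims=True)
    return txt_z @ img_z.T          # (n_text, n_images) cosine similarities for ranking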
Visual phrases for exemplar face detection
N VIJAY KUMAR,Anoop Namboodiri,Jawahar C V
International Conference on Computer Vision, ICCV, 2015
@inproceedings{bib_Visu_2015, AUTHOR = {N VIJAY KUMAR, Anoop Namboodiri, Jawahar C V}, TITLE = {Visual phrases for exemplar face detection}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2015}}
Recently, exemplar based approaches have been successfully applied for face detection in the wild. Contrary to traditional approaches that model face variations from a large and diverse set of training examples, exemplar-based approaches use a collection of discriminatively trained exemplars for detection. In this paradigm, each exemplar casts a vote using a retrieval framework and generalized Hough voting to locate the faces in the target image. The advantage of this approach is that, by having a large database that covers all possible variations, faces in challenging conditions can be detected without having to learn explicit models for different variations. Current schemes, however, make an assumption of independence between the visual words, ignoring their relations in the process. They also ignore the spatial consistency of the visual words. Consequently, every exemplar word contributes equally during voting regardless of its location. In this paper, we propose a novel approach that incorporates higher order information in the voting process. We discover visual phrases that contain semantically related visual words and exploit them for detection along with the visual words. For spatial consistency, we estimate the spatial distribution of visual words and phrases from the entire database and then weigh their occurrence in exemplars. This ensures that a visual word or a phrase in an exemplar makes a major contribution only if it occurs at its semantic location, thereby suppressing the noise significantly. We perform extensive experiments on the standard FDDB, AFW and G-album datasets and show significant improvement over previous exemplar …
Multi-label annotation of music
Hiba Ahsan,N VIJAY KUMAR,Jawahar C V
International Conference on Applied Pattern Recognition, ICAPR, 2015
@inproceedings{bib_Mult_2015, AUTHOR = {Hiba Ahsan, N VIJAY KUMAR, Jawahar C V}, TITLE = {Multi-label annotation of music}, BOOKTITLE = {International Conference on Applied Pattern Recognition}. YEAR = {2015}}
Automatic annotation of an audio or music piece with multiple labels helps in understanding the composition of the music. Such meta-level information can be very useful in applications such as music transcription, retrieval, organization and personalization. In this work, we formulate the problem of annotation as multi-label classification, which is considerably different from popular single-label (binary or multi-class) classification. We employ both nearest neighbour and max-margin (SVM) formulations for the automatic annotation. We consider K-NN and SVM adapted for multi-label classification using a one-vs-rest strategy, as well as direct multi-label classification formulations using ML-KNN and M3L. In the case of music, often the signatures of the labels (e.g. instruments and vocal signatures) are fused in the features. We therefore propose a simple feature augmentation technique based on non …
Domain adaptation by aligning locality preserving subspaces
VIRESH RANJAN,Gaurav Harit,Jawahar C V
International Conference on Applied Pattern Recognition, ICAPR, 2015
@inproceedings{bib_Doma_2015, AUTHOR = {VIRESH RANJAN, Gaurav Harit, Jawahar C V}, TITLE = {Domain adaptation by aligning locality preserving subspaces}, BOOKTITLE = {International Conference on Applied Pattern Recognition}. YEAR = {2015}}
The mismatch between the training data and the test data distributions is a challenging issue while designing many practical computer vision systems. In this paper, we propose a domain adaptation technique to tackle this issue. We are interested in a domain adaptation scenario where the source domain has a large amount of labeled examples and the target domain has a large amount of unlabeled examples. We align the source domain subspace with the target domain subspace in order to reduce the mismatch between the two distributions. We model the subspace using Locality Preserving Projections (LPP). Unlike previous subspace alignment approaches, we introduce a strategy to effectively utilize the training labels in order to learn discriminative subspaces. We validate our domain adaptation approach by testing it on two different domains, i.e. handwritten and printed digit images. We compare our approach with …
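A sketch of the underlying subspace-alignment step using PCA bases, included only to make the idea concrete; the paper instead models the subspaces with Locality Preserving Projections and uses the training labels to make them discriminative, which this sketch omits.

# Sketch: align a source PCA subspace with a target PCA subspace (illustrative only).
import numpy as np
from sklearn.decomposition import PCA

def aligned_features(Xs, Xt, dim=50):
    Xs = Xs - Xs.mean(axis=0)                          # center each domain
    Xt = Xt - Xt.mean(axis=0)
    Ps = PCA(n_components=dim).fit(Xs).components_.T   # (d, dim) source basis
    Pt = PCA(n_components=dim).fit(Xt).components_.T   # (d, dim) target basis
    M = Ps.T @ Pt                                      # alignment of source basis onto target basis
    return Xs @ Ps @ M, Xt @ Pt                        # comparable dim-dimensional features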
Document retrieval with unlimited vocabulary
VIRESH RANJAN,Gaurav Harit,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2015
@inproceedings{bib_Docu_2015, AUTHOR = {VIRESH RANJAN, Gaurav Harit, Jawahar C V}, TITLE = {Document retrieval with unlimited vocabulary}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2015}}
In this paper, we describe a classifier based retrieval scheme for efficiently and accurately retrieving relevant documents. We use SVM classifiers for word retrieval, and argue that classifier based solutions can be superior to OCR based solutions in many practical situations. We overcome the practical limitations of the classifier based solution in terms of limited vocabulary support and availability of training data. To this end, we design a one-shot learning scheme for dynamically synthesizing classifiers. Given a set of SVM classifiers, we appropriately join them to create novel classifiers. This extends the classifier based retrieval paradigm to an unlimited number of classes (words) present in a language. We validate our method on multiple datasets, and compare it with popular alternatives like OCR and word spotting. Even on a language like English, where OCRs have been fairly …
Fast approximate dynamic warping kernels
NAGENDAR. G,Jawahar C V
ACM IKDD Conference on Data Sciences, IKDD-CDS, 2015
@inproceedings{bib_Fast_2015, AUTHOR = {NAGENDAR. G, Jawahar C V}, TITLE = {Fast approximate dynamic warping kernels}, BOOKTITLE = {ACM IKDD Conference on Data Sciences}. YEAR = {2015}}
The dynamic time warping (DTW) distance is a popular similarity measure for comparing time series data. It has been successfully applied in many fields like speech recognition, data mining and information retrieval to automatically cope with time deformations and variations in the length of time dependent data. There have been attempts in the past to define kernels on the DTW distance. These kernels try to approximate the DTW distance; however, they have quadratic complexity and are computationally expensive for large time series. In this paper, we introduce the FastDTW kernel, which is a linear approximation of the DTW kernel and can be used with linear SVMs.
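For reference, the standard quadratic-time DTW distance that such kernels approximate can be sketched as follows; real deployments differ in the local cost and warping constraints used.

# Sketch: textbook O(len(a) * len(b)) DTW distance between two 1-D sequences.
import numpy as np

def dtw_distance(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])                       # local matching cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]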
Human pose search using deep poselets
NATARAJ J,Andrew Zisserman,Jawahar C V
International Conference on Automatic Face and Gesture Recognition, FG, 2015
@inproceedings{bib_Huma_2015, AUTHOR = {NATARAJ J, Andrew Zisserman, Jawahar C V}, TITLE = {Human pose search using deep poselets}, BOOKTITLE = {International Conference on Automatic Face and Gesture Recognition}. YEAR = {2015}}
Human pose as a query modality is an alternative and rich experience for image and video retrieval. We present a novel approach for the task of human pose retrieval, and make the following contributions: first, we introduce `deep poselets' for pose-sensitive detection of various body parts, that are built on convolutional neural network (CNN) features. These deep poselets significantly outperform previous instantiations of Berkeley poselets [2]. Second, using these detector responses, we construct a pose representation that is suitable for pose search, and show that pose retrieval performance exceeds previous methods by a factor of two. The compared methods include Bag of visual words [24], Berkeley poselets [2] and Human pose estimation algorithms [28]. All the methods are quantitatively evaluated on a large dataset of images built from a number of standard benchmarks together with frames from Hollywood …
Servoing across object instances: Visual servoing for object category
HARIT PANDYA,K Madhava Krishna,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2015
@inproceedings{bib_Serv_2015, AUTHOR = {HARIT PANDYA, K Madhava Krishna, Jawahar C V}, TITLE = {Servoing across object instances: Visual servoing for object category}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2015}}
Traditional visual servoing is able to navigate a robotic system between two views of the same object. However, it is not designed to servo between views of different objects. In this paper, we consider the novel problem of servoing any instance (exemplar) of an object category to a desired pose (view) and propose a strategy to accomplish the task. We use features that semantically encode the locations of object parts and define the servoing error as the difference between positions of corresponding parts in the image space. Our controller is based on a linear combination of 3D models, such that the resulting model interpolates between the given and desired instances. We conducted our experiments on five different object categories in a simulation framework and show that our approach achieves the desired pose with a smooth trajectory. Furthermore, we show the performance gain achieved by using a linear …
Online handwriting recognition using depth sensors
RAJAT AGGARWAL,SIRNAM SWETHA,Anoop Namboodiri,Jayanthi Sivaswamy,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2015
@inproceedings{bib_Onli_2015, AUTHOR = {RAJAT AGGARWAL, SIRNAM SWETHA, Anoop Namboodiri, Jayanthi Sivaswamy, Jawahar C V}, TITLE = {Online handwriting recognition using depth sensors}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2015}}
In this work, we propose an online handwriting solution, where the data is captured with the help of depth sensors. Users may write in the air, and our method recognizes it in real time using the proposed feature representation. Our method uses an efficient fingertip tracking approach and reduces the necessity of pen-up/pen-down switching. We validate our method on two depth sensors, Kinect and Leap Motion Controller. On a dataset collected from 20 users, we achieve a recognition accuracy of 97.59% for character recognition. We also demonstrate how this system can be extended for lexicon recognition with reliable performance. We have also prepared a dataset containing 1,560 characters and 400 words with the intention of providing a common benchmark for handwritten character recognition using depth sensors and related research.
Unsupervised feature learning for optical character recognition
DEVENDRA KR SAHU,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2015
@inproceedings{bib_Unsu_2015, AUTHOR = {DEVENDRA KR SAHU, Jawahar C V}, TITLE = {Unsupervised feature learning for optical character recognition}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2015}}
Most of the popular optical character recognition (OCR) architectures use a set of handcrafted features and a powerful classifier for isolated character classification. The success of these methods often depends on the suitability of these features for the language of interest. In recent years, whole word recognition based on Recurrent Neural Networks (RNN) has gained popularity. These methods use simple features such as raw pixel values or profiles. Their success depends on the learning capabilities of these networks to encode the script and language information. In this work, we investigate the possibility of learning an appropriate set of features for designing an OCR for a specific language. We learn the language specific features from the data with no supervision. This enables the seamless adaptation of the architecture across languages. In this work, we learn features using a stacked Restricted Boltzmann …
Efficient word image retrieval using fast DTW distance
NAGENDAR. G,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2015
@inproceedings{bib_Effi_2015, AUTHOR = {NAGENDAR. G, Jawahar C V}, TITLE = {Efficient word image retrieval using fast DTW distance}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2015}}
Dynamic time warping (DTW) is a popular distance measure used for recognition-free document image retrieval. However, it has quadratic complexity and hence is computationally expensive for large scale word image retrieval. In this paper, we use a fast approximation to the DTW distance, which makes word retrieval efficient. For a pair of sequences, computing their DTW distance requires finding the optimal alignment among all possible alignments, which is a computationally expensive operation. In this work, we learn a small set of global principal alignments from the training data and avoid the computation of alignments for query images. Thus, our proposed approximation is significantly faster than the DTW distance, giving a 40× speed-up. We approximate the DTW distance as a sum of multiple weighted Euclidean distances, which are known to be amenable to indexing and efficient retrieval. We …
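A sketch, under stated assumptions, of the key simplification: once a small set of global alignments is fixed, the warping cost reduces to a weighted sum of element-wise distances along those alignments, which is what makes indexing possible. The alignments and weights below are placeholders, not those learned in the paper.

# Sketch: warping distance with a fixed set of precomputed alignment paths.
import numpy as np

def fixed_alignment_distance(a, b, alignments, weights):
    # alignments: list of alignment paths, each a list of (i, j) index pairs
    # weights:    one non-negative weight per alignment path
    a, b = np.asarray(a, float), np.asarray(b, float)
    total = 0.0
    for path, w in zip(alignments, weights):
        total += w * sum(abs(a[i] - b[j]) for i, j in path)
    return total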
Can RNNs reliably separate script and language at word and line level?
AJEET KUMAR SINGH,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2015
@inproceedings{bib_Can__2015, AUTHOR = {AJEET KUMAR SINGH, Jawahar C V}, TITLE = {Can RNNs reliably separate script and language at word and line level?}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2015}}
In this work, we investigate the utility of Recurrent Neural Networks (RNNs) for script and language identification. Both these problems have been attempted in the past with representations computed from the distribution of connected components or characters (e.g. texture, n-grams). Often these features are computed from a larger segment (a paragraph or a page). We argue that one can predict the script or language very accurately with minimal evidence (e.g. given only a word or a line) with the help of a pre-trained RNN. We propose a simple and generic solution for the task of script and language identification which does not require any special tuning. Our method represents the word images as a sequence of feature vectors and employs RNNs for the identification. We verify the method on a large corpus of more than 15.03M words from 55K document images comprising 15 scripts and languages. We report an …
Exploring Locally Rigid Discriminative Patches for Learning Relative Attributes.
YASHASWI VERMA,Jawahar C V
British Machine Vision Conference, BMVC, 2015
@inproceedings{bib_Expl_2015, AUTHOR = {YASHASWI VERMA, Jawahar C V}, TITLE = {Exploring Locally Rigid Discriminative Patches for Learning Relative Attributes.}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2015}}
Relative attributes help in comparing two images based on their visual properties. These are of great interest as they have been shown to be useful in several vision related problems such as recognition, retrieval, and understanding image collections in general. In the recent past, quite a few techniques have been proposed for the relative attribute learning task that give reasonable performance. However, these have focused either on the algorithmic aspect or the representational aspect. In this work, we revisit these approaches and integrate their broader ideas to develop simple baselines. These not only take care of the algorithmic aspects, but also take a step towards analyzing a simple yet domain independent patch-based representation for this task. This representation can capture local shape in an image, as well as spatially rigid correspondences across regions in an image pair. The baselines are extensively evaluated on three challenging relative attribute datasets (OSR, LFW-10 and UT-Zap50K). Experiments demonstrate that they achieve promising results on the OSR and LFW-10 datasets, and perform better than the current state-of-the-art on the UT-Zap50K dataset. Moreover, they also provide some interesting insights about the problem, that could be helpful in developing the future techniques in this domain.
Semantic Classification of Boundaries of an RGBD Image.
Anoop Namboodiri,Jawahar C V,Srikumar Ramalingam
British Machine Vision Conference, BMVC, 2015
@inproceedings{bib_Sema_2015, AUTHOR = {Anoop Namboodiri, Jawahar C V, Srikumar Ramalingam}, TITLE = {Semantic Classification of Boundaries of an RGBD Image.}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2015}}
The problem of labeling the edges present in a single color image as convex, concave, and occluding entities is one of the fundamental problems in computer vision. It has been shown that this information can contribute to segmentation, reconstruction and recognition problems. Recently, it has been shown that this classification is not straightforward even using RGBD data. This makes us wonder whether this apparently simple cue has more information than a depth map. In this paper, we propose a novel algorithm using random forests for classifying edges into convex, concave and occluding entities. We release a dataset with more than 500 RGBD images with pixel-wise ground truth labels. Our method produces promising results and achieves an F-score of 0.84 on the dataset.
Diverse Yet Efficient Retrieval using Hash Functions
P VIDYADHAR RAO,PRATEEK JAIN,Jawahar C V
International Conference on Multimedia Retrieval, ICMR, 2015
@inproceedings{bib_Dive_2015, AUTHOR = {P VIDYADHAR RAO, PRATEEK JAIN, Jawahar C V}, TITLE = {Diverse Yet Efficient Retrieval using Hash Functions}, BOOKTITLE = {International Conference on Multimedia Retrieval}. YEAR = {2015}}
Typical retrieval systems have three requirements: a) Accurate retrieval, i.e., the method should have high precision, b) Diverse retrieval, i.e., the obtained set of points should be diverse, and c) Retrieval time should be small. However, most of the existing methods address only one or two of the above-mentioned requirements. In this work, we present a method based on randomized locality sensitive hashing which tries to address all of the above requirements simultaneously. While earlier hashing approaches considered approximate retrieval to be acceptable only for the sake of efficiency, we argue that one can further exploit approximate retrieval to provide impressive trade-offs between accuracy and diversity. We extend our method to the problem of multi-label prediction, where the goal is to output a diverse and accurate set of labels for a given document in real-time. Moreover, we introduce a new notion to simultaneously evaluate a method's performance for both the precision and diversity measures. Finally, we present empirical results on several different retrieval tasks and show that our method retrieves diverse and accurate images/labels while ensuring a speed-up over the existing diverse retrieval approaches.
A Probabilistic Approach for Image Retrieval Using Descriptive Textual Queries
YASHASWI VERMA,Jawahar C V
ACM international conference on Multimedia, ACMMM, 2015
@inproceedings{bib_A_Pr_2015, AUTHOR = {YASHASWI VERMA, Jawahar C V}, TITLE = {A Probabilistic Approach for Image Retrieval Using Descriptive Textual Queries}, BOOKTITLE = {ACM international conference on Multimedia}. YEAR = {2015}}
We address the problem of image retrieval using textual queries. In particular, we focus on descriptive queries that can be either in the form of simple captions (e.g., “a brown cat sleeping on a sofa”) or long descriptions with multiple sentences. We present a probabilistic approach that seamlessly integrates visual and textual information for the task. It relies on linguistically and syntactically motivated mid-level textual patterns (or phrases) that are automatically extracted from available descriptions. At the time of retrieval, the given query is decomposed into such phrases, and images are ranked based on their joint relevance with these phrases. Experiments on two popular datasets (UIUC Pascal Sentence and the IAPR-TC12 benchmark) demonstrate that our approach effectively retrieves semantically meaningful images and outperforms baseline methods.
Fine-grain annotation of cricket videos
RAHUL ANAND SHARMA,Pramod Sankar K.,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2015
@inproceedings{bib_Fine_2015, AUTHOR = {RAHUL ANAND SHARMA, Pramod Sankar K., Jawahar C V}, TITLE = {Fine-grain annotation of cricket videos}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2015}}
The recognition of human activities is one of the key problems in video understanding. Action recognition is challenging even for specific categories of videos, such as sports, that contain only a small set of actions. Interestingly, sports videos are accompanied by detailed commentaries available online, which could be used to perform action annotation in a weakly-supervised setting. For the specific case of cricket videos, we address the challenge of temporal segmentation and annotation of actions with semantic descriptions. Our solution consists of two stages. In the first stage, the video is segmented into "scenes" by utilizing the scene category information extracted from the text commentary. The second stage consists of classifying video-shots as well as the phrases in the textual description into various categories. The relevant phrases are then suitably mapped to the video-shots. The novel aspect of this work is the …
Tennisvid2text: Fine-grained descriptions for domain specific videos
MOHAK KUMAR SUKHWANI,Jawahar C V
Technical Report, arXiv, 2015
@inproceedings{bib_Tenn_2015, AUTHOR = {MOHAK KUMAR SUKHWANI, Jawahar C V}, TITLE = {Tennisvid2text: Fine-grained descriptions for domain specific videos}, BOOKTITLE = {Technical Report}. YEAR = {2015}}
Automatically describing videos has always been fascinating. In this work, we attempt to describe videos from a specific domain: broadcast videos of lawn tennis matches. Given a video shot from a tennis match, we intend to generate a textual commentary similar to what a human expert would write on a sports website. Unlike many recent works that focus on generating short captions, we are interested in generating semantically richer descriptions. This demands a detailed low-level analysis of the video content, especially the actions and interactions among subjects. We address this by limiting our domain to the game of lawn tennis. Rich descriptions are generated by leveraging a large corpus of human created descriptions harvested from the Internet. We evaluate our method on a newly created tennis video dataset. Extensive analysis demonstrates that our approach addresses both the semantic correctness and readability aspects involved in the task.
Learning metrics for diversity in instance retrieval
P VIDYADHAR RAO,AJITESH GUPTA,Visesh Chari,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2015
@inproceedings{bib_Lear_2015, AUTHOR = {P VIDYADHAR RAO, AJITESH GUPTA, Visesh Chari, Jawahar C V}, TITLE = {Learning metrics for diversity in instance retrieval}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2015}}
Instance retrieval (IR) is the problem of retrieving specific instances of a particular object, like a monument, from a collection of images. Currently, the most popular methods for IR use Bag of words (BoW) features for retrieval. However, a prominent problem for IR remains the tendency of BoW based methods to retrieve near-identical images as most relevant results. In this paper, we define diversity in IR as variation of physical properties among most relevant retrieved results for a query image. To achieve this, we propose both an ITML algorithm that re-fashions the BoW feature space into one that appreciates diversity better, and a measure to evaluate diversity in retrieval results for IR applications. Additionally, we also generate 200 hand-labeled images from the Paris dataset, for use in further research in this area. Experiments on the popular Paris dataset show that our method outperforms the standard BoW model …
Generic action recognition from egocentric videos
SURIYA SINGH,Chetan Arora,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2015
@inproceedings{bib_Gene_2015, AUTHOR = {SURIYA SINGH, Chetan Arora, Jawahar C V}, TITLE = {Generic action recognition from egocentric videos}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2015}}
Egocentric cameras are wearable cameras mounted on a person's head or shoulder. With their ability to provide a first person view, such cameras are spawning a new set of exciting applications in computer vision. Recognising the activity of the wearer from an egocentric video is an important but challenging problem. The task is made especially difficult by the unavailability of the wearer's pose as well as extreme camera shake due to motion of the wearer's head. Solutions suggested so far for the problem have either focussed on short term actions such as pour, stir etc. or long term activities such as walking, driving etc. The features used in the two styles are very different, and techniques developed for one style often fail miserably on the other kind. In this paper we propose a technique to identify whether a long term or a short term action is present in an egocentric video segment. This allows us to have a generic first-person action …
Active learning based image annotation
PRIYAM BAKLIWAL,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2015
@inproceedings{bib_Acti_2015, AUTHOR = {PRIYAM BAKLIWAL, Jawahar C V}, TITLE = {Active learning based image annotation}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2015}}
Automatic image annotation is the computer vision task of assigning a set of appropriate textual tags to a novel image. The aim is to eventually bridge the semantic gap between visual and textual representations with the help of these tags. This also has applications in designing scalable image retrieval systems and providing multilingual interfaces. Though a wide variety of powerful machine learning algorithms have been explored for the image annotation problem in the recent past, nearest neighbor techniques still yield superior results. A challenge ahead of present day annotation schemes is the lack of sufficient training data. In this paper, an active learning based image annotation model is proposed. We leverage the image-to-image and image-to-tag similarities to decide the best set of tags describing the semantics of an image. The advantages of the proposed model include: (a) it is able to output the …
Relative parts: Distinctive parts for learning relative attributes
RAMACHANDRUNI N SANDEEP,YASHASWI VERMA,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2014
@inproceedings{bib_Rela_2014, AUTHOR = {RAMACHANDRUNI N SANDEEP, YASHASWI VERMA, Jawahar C V}, TITLE = {Relative parts: Distinctive parts for learning relative attributes}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2014}}
The notion of relative attributes as introduced by Parikh and Grauman (ICCV, 2011) provides an appealing way of comparing two images based on their visual properties (or attributes) such as “smiling” for face images, “naturalness” for outdoor images, etc. For learning such attributes, a Ranking SVM based formulation was proposed that uses globally represented pairs of annotated images. In this paper, we extend this idea towards learning relative attributes using local parts that are shared across categories. First, instead of using a global representation, we introduce a part-based representation combining a pair of images that specifically compares corresponding parts. Then, with each part we associate a locally adaptive “significance-coefficient” that represents its discriminative ability with respect to a particular attribute. For each attribute, the significance-coefficients are learned simultaneously with a max-margin ranking model in an iterative manner. Compared to the baseline method, the new method is shown to achieve significant improvement in relative attribute prediction accuracy. Additionally, it is also shown to improve relative feedback based interactive image search.
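A minimal Ranking SVM baseline in the spirit of the formulation referenced above, sketched with the standard pairwise-difference trick; the part-based representation and the learned significance-coefficients of the paper are not reproduced here.

# Sketch: learn a linear ranker w so that w.x orders images by attribute strength.
import numpy as np
from sklearn.svm import LinearSVC

def train_relative_attribute(X, ordered_pairs):
    # ordered_pairs: list of (i, j) meaning image i shows MORE of the attribute than image j
    diffs, labels = [], []
    for i, j in ordered_pairs:
        diffs.append(X[i] - X[j]); labels.append(+1)
        diffs.append(X[j] - X[i]); labels.append(-1)
    clf = LinearSVC(fit_intercept=False).fit(np.array(diffs), np.array(labels))
    return clf.coef_.ravel()

def attribute_strength(w, x):
    return float(w @ x)          # larger value = stronger attribute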
Parsing world's skylines using shape-constrained mrfs
RASHMI VILAS TONGE,Subhransu Maji,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2014
@inproceedings{bib_Pars_2014, AUTHOR = {RASHMI VILAS TONGE, Subhransu Maji, Jawahar C V}, TITLE = {Parsing world's skylines using shape-constrained mrfs}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2014}}
We propose an approach for segmenting the individual buildings in typical skyline images. Our approach is based on a Markov Random Field (MRF) formulation that exploits the fact that such images contain overlapping objects of similar shapes exhibiting a “tiered” structure. Our contributions are the following: (1) A dataset of 120 high-resolution skyline images from twelve different cities with over 4,000 individually labeled buildings that allows us to quantitatively evaluate the performance of various segmentation methods, (2) An analysis of low-level features that are useful for segmentation of buildings, and (3) A shape-constrained MRF formulation that enforces shape priors over the regions. For simple shapes such as rectangles, our formulation is significantly faster to optimize than a standard MRF approach, while also being more accurate. We experimentally evaluate various MRF formulations and demonstrate the effectiveness of our approach in segmenting skyline images.
Optimizing average precision using weakly supervised data
ASEEM BEHL,Jawahar C V,M. Pawan Kumar
Computer Vision and Pattern Recognition, CVPR, 2014
@inproceedings{bib_Opti_2014, AUTHOR = {ASEEM BEHL, Jawahar C V, M. Pawan Kumar}, TITLE = {Optimizing average precision using weakly supervised data}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2014}}
The performance of binary classification tasks, such as action classification and object detection, is often measured in terms of the average precision (AP). Yet it is common practice in computer vision to employ the support vector machine (SVM) classifier, which optimizes a surrogate 0-1 loss. The popularity of SVM can be attributed to its empirical performance. Specifically, in fully supervised settings, SVM tends to provide similar accuracy to the AP-SVM classifier, which directly optimizes an AP-based loss. However, we hypothesize that in the significantly more challenging and practically useful setting of weakly supervised learning, it becomes crucial to optimize the right accuracy measure. In order to test this hypothesis, we propose a novel latent AP-SVM that minimizes a carefully designed upper bound on the AP-based loss function over weakly supervised samples. Using publicly available datasets, we demonstrate the advantage of our approach over standard loss-based binary classifiers on two challenging problems: action classification and character recognition.
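For concreteness, the average precision that the latent AP-SVM targets can be computed from classifier scores and binary labels as sketched below; the loss-augmented inference used during training is not reproduced here.

# Sketch: average precision (AP) of a ranking induced by classifier scores.
import numpy as np

def average_precision(scores, labels):
    order = np.argsort(-np.asarray(scores))            # rank samples by decreasing score
    labels = np.asarray(labels)[order]                  # binary relevance in ranked order
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    return float((precision_at_k * labels).sum() / max(labels.sum(), 1))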
Efficient optimization for average precision svm
PRITISH MOHAPATRA,Jawahar C V,M. Pawan Kumar
Neural Information Processing Systems, NeurIPS, 2014
@inproceedings{bib_Effi_2014, AUTHOR = {PRITISH MOHAPATRA, Jawahar C V, M. Pawan Kumar}, TITLE = {Efficient optimization for average precision svm}, BOOKTITLE = {Neural Information Processing Systems}. YEAR = {2014}}
The accuracy of information retrieval systems is often measured using average precision (AP). Given a set of positive (relevant) and negative (non-relevant) samples, the parameters of a retrieval system can be estimated using the AP-SVM framework, which minimizes a regularized convex upper bound on the empirical AP loss. However, the high computational complexity of loss-augmented inference, which is required for learning an AP-SVM, prohibits its use with large training datasets. To alleviate this deficiency, we propose three complementary approaches. The first approach guarantees an asymptotic decrease in the computational complexity of loss-augmented inference by exploiting the problem structure. The second approach takes advantage of the fact that we do not require a full ranking during loss-augmented inference. This helps us to avoid the expensive step of sorting the negative samples according to their individual scores. The third approach approximates the AP loss over all samples by the AP loss over difficult samples (for example, those that are incorrectly classified by a binary SVM), while ensuring the correct classification of the remaining samples. Using the PASCAL VOC action classification and object detection datasets, we show that our approaches provide significant speed-ups during training without degrading the test accuracy of AP-SVM.
Large scale document image retrieval by automatic word annotation
K. Pramod Sankar,R. Manmatha,Jawahar C V
International Journal on Document Analysis and Recognition, IJDAR, 2014
@inproceedings{bib_Larg_2014, AUTHOR = {K. Pramod Sankar, R. Manmatha, Jawahar C V}, TITLE = {Large scale document image retrieval by automatic word annotation}, BOOKTITLE = {International Journal on Document Analysis and Recognition}. YEAR = {2014}}
In this paper, we present a practical and scalable retrieval framework for large-scale document image collections, for an Indian language script that does not have a robust OCR. OCR-based methods face difficulties in character segmentation and recognition, especially for the complex Indian language scripts. We realize that character recognition is only an intermediate step toward actually labeling words. Hence, we re-pose the problem as one of directly performing word annotation. This new approach has better recognition performance, as well as easier segmentation requirements. However, the number of classes in word annotation is much larger than that for character recognition, making such a classification scheme expensive to train and test. To address this issue, we present a novel framework that replaces naive classification with a carefully designed mixture of indexing and classification schemes. This enables us to build a search system over a large collection of 1,000 books of Telugu, consisting of 120K document images or 36M individual words. This is the largest searchable document image collection for a script without an OCR that we are aware of. Our retrieval system performs well, with a mean average precision of 0.8.
Towards a robust ocr system for indic scripts
PRAVEEN KRISHNAN A,S NAVEEN KUMAR,AJEET KUMAR SINGH,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2014
@inproceedings{bib_Towa_2014, AUTHOR = {PRAVEEN KRISHNAN A, S NAVEEN KUMAR, AJEET KUMAR SINGH, Jawahar C V}, TITLE = {Towards a robust ocr system for indic scripts}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2014}}
The current Optical Character Recognition (OCR) systems for Indic scripts are not robust enough for recognizing arbitrary collections of printed documents. Reasons for this limitation include the lack of resources (e.g., not enough examples with natural variations, lack of documentation about the possible font/style variations) and an architecture which necessitates hard segmentation of word images followed by isolated symbol recognition. Variations among scripts, latent symbol to UNICODE conversion rules, non-standard fonts/styles and large degradations are some of the major reasons for the unavailability of robust solutions. In this paper, we propose a web based OCR system which (i) follows a unified architecture for seven Indian languages, (ii) is robust against popular degradations, (iii) follows a segmentation free approach, (iv) addresses the UNICODE re-ordering issues, and (v) can enable continuous learning with user inputs and feedback. Our system is designed to aid continuous learning while being usable, i.e., we capture the user inputs (say example images) for further improving the OCRs. We use the popular BLSTM based transcription scheme to achieve our target. This also enables incremental training and refinement in a seamless manner. We report superior accuracy rates in comparison with the available OCRs for the seven Indian languages.
Enhancing word image retrieval in presence of font variations
VIRESH RANJAN,Gaurav Harit,Jawahar C V
International conference on Pattern Recognition, ICPR, 2014
@inproceedings{bib_Enha_2014, AUTHOR = {VIRESH RANJAN, Gaurav Harit, Jawahar C V}, TITLE = {Enhancing word image retrieval in presence of font variations}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2014}}
This paper investigates the problem of cross-document image retrieval, i.e., the use of query images from one style (say a font) to perform retrieval from a collection which is in a different style (say a different set of books). We present two approaches to tackle this problem. We propose an effective style independent retrieval scheme using a nonlinear style-content separation model. We also propose a semi-supervised style transfer strategy to expand the query into multiple styles. We validate both these approaches on a collection of word images which vary in fonts/styles.
Efficient Evaluation of SVM Classifiers using Error Space Encoding
Nisarg Raval,RASHMI VILAS TONGE,Jawahar C V
International conference on Pattern Recognition, ICPR, 2014
@inproceedings{bib_Effi_2014, AUTHOR = {Nisarg Raval, RASHMI VILAS TONGE, Jawahar C V}, TITLE = {Efficient Evaluation of SVM Classifiers using Error Space Encoding}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2014}}
Many computer vision tasks require efficient evaluation of Support Vector Machine (SVM) classifiers on large image databases. Our goal is to efficiently evaluate SVM classifiers on a large number of images. We propose a novel Error Space Encoding (ESE) scheme for SVM evaluation which utilizes a large number of classifiers already evaluated on a similar dataset. We model this problem as an encoding of a novel classifier (query) in terms of the existing classifiers (query logs). With sufficiently large query logs, we show that ESE performs far better than other existing encoding schemes. With this method we are able to retrieve nearly 100% correct top-k images from a dataset of 1 million images spanning 1000 categories. We also demonstrate the application of our method to relevance feedback and query expansion, and show that it achieves the same accuracy 90 times faster than exhaustive SVM evaluations.
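A minimal sketch of the general idea, assuming the query classifier is encoded as a least-squares combination of logged classifiers whose scores on the database are already cached; the paper's actual encoding and the reported speed-ups are not reproduced here.

import numpy as np

rng = np.random.default_rng(0)
d, n_logged, n_images = 128, 200, 10000

W_log = rng.standard_normal((n_logged, d))   # classifiers already evaluated (query logs)
X = rng.standard_normal((n_images, d))       # image features
cached_scores = X @ W_log.T                  # their scores, computed once offline

w_query = rng.standard_normal(d)             # a new classifier to be evaluated

# encode the query classifier in terms of the logged classifiers (least squares)
alpha, *_ = np.linalg.lstsq(W_log.T, w_query, rcond=None)

# approximate its scores from the cached ones, without touching the features again
approx = cached_scores @ alpha
exact = X @ w_query
overlap = len(set(np.argsort(-approx)[:100]) & set(np.argsort(-exact)[:100]))
print("top-100 overlap:", overlap)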
Currency recognition on mobile phones
SURIYA SINGH,Shushman Choudhury,KUMAR VISHAL,Jawahar C V
International conference on Pattern Recognition, ICPR, 2014
@inproceedings{bib_Curr_2014, AUTHOR = {SURIYA SINGH, Shushman Choudhury, KUMAR VISHAL, Jawahar C V}, TITLE = {Currency recognition on mobile phones}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2014}}
In this paper, we present an application for recognizing currency bills using computer vision techniques that can run on a low-end smartphone. The application runs on the device without the need for any remote server. It is intended for robust, practical use by the visually impaired. Though we use the paper bills of the Indian National Rupee (₹) as a working example, our method is generic and scalable to multiple domains, including those beyond currency bills. Our solution uses a visual Bag of Words (BoW) based method for recognition. To enable robust recognition in a cluttered environment, we first segment the bill from the background using an algorithm based on iterative graph cuts. We formulate the recognition problem as an instance retrieval task. This is an example of fine-grained instance retrieval that can run on mobile devices. We evaluate the performance on a set of images captured in diverse natural environments, and report an accuracy of 96.7% on 2584 images.
Identifying ragas in indian music
Vijay Kumar.R,HARIT PANDYA,Jawahar C V
International conference on Pattern Recognition, ICPR, 2014
@inproceedings{bib_Iden_2014, AUTHOR = {Vijay Kumar.R, HARIT PANDYA, Jawahar C V}, TITLE = {Identifying ragas in indian music}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2014}}
In this work, we propose a method to identify the ragas of an Indian Carnatic music signal. This has several interesting applications in digital music indexing, recommendation and retrieval. However, this problem is hard due to (i) the absence of a fixed frequency for a note, (ii) the relative scale of notes, (iii) oscillations around a note, and (iv) improvisations. In this work, we attempt the raga classification problem in a non-linear SVM framework using a combination of two kernels that represent the similarities of a music signal using two different features: pitch-class profile and n-gram distribution of notes. This differs from the previous pitch-class profile based approaches where the temporal information of notes is ignored. We evaluated the proposed approach on our own raga dataset and the CompMusic dataset and show an improvement of 10.19% by combining the information from two features relevant to Indian Carnatic music.
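A weighted sum of two precomputed kernels is one standard way to realise such a combination in an SVM; the sketch below uses random stand-in features and an assumed mixing weight beta, so it only illustrates the mechanics rather than the authors' exact kernels.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
F_pcp = rng.random((60, 12))      # stand-in pitch-class profile features per recording
F_ngram = rng.random((60, 50))    # stand-in note n-gram histograms per recording
y = rng.integers(0, 4, 60)        # stand-in labels for 4 ragas

K_pcp = F_pcp @ F_pcp.T           # similarity from pitch-class profiles
K_ngram = F_ngram @ F_ngram.T     # similarity from n-gram distributions
beta = 0.5                        # assumed mixing weight between the two similarities
K = beta * K_pcp + (1.0 - beta) * K_ngram

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))            # training accuracy on the toy data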
Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval.
YASHASWI VERMA,Jawahar C V
British Machine Vision Conference, BMVC, 2014
@inproceedings{bib_Im2T_2014, AUTHOR = {YASHASWI VERMA, Jawahar C V}, TITLE = {Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval.}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2014}}
Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given an image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption on the specific form of text; i.e., it could be either a set of labels, phrases, or even captions. We pose both these tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data). We propose a novel Structural SVM based unified formulation for these two tasks. For both visual and textual data, two types of representations are investigated. These are based on: (1) unimodal probability distributions over topics learned using latent Dirichlet allocation, and (2) explicitly learned multi-modal correlations using canonical correlation analysis. Extensive experiments on three popular datasets (two medium and one web-scale) demonstrate that our framework gives promising results compared to existing models under various settings, thus confirming its efficacy for both the tasks.
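For the second representation (canonical correlation analysis), the following sketch shows how cross-modal retrieval can be carried out by ranking in the learned common space; it uses scikit-learn's CCA on random stand-in features and does not reproduce the Structural SVM formulation.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n, d_img, d_txt = 300, 100, 80
X_img = rng.standard_normal((n, d_img))   # stand-in visual descriptors
X_txt = rng.standard_normal((n, d_txt))   # stand-in textual (topic) descriptors

cca = CCA(n_components=20).fit(X_img, X_txt)
Z_img, Z_txt = cca.transform(X_img, X_txt)   # both modalities in the correlated space

def text2im(i, topk=5):
    # Text2Im: rank all images for the i-th text by cosine similarity in the CCA space
    z = Z_txt[i]
    sims = Z_img @ z / (np.linalg.norm(Z_img, axis=1) * np.linalg.norm(z) + 1e-12)
    return np.argsort(-sims)[:topk]

print(text2im(0))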
Learning to rank using high-order information
Puneet Kumar Dokani,ASEEM BEHL,Jawahar C V,M. Pawan Kumar
European Conference on Computer Vision, ECCV, 2014
@inproceedings{bib_Lear_2014, AUTHOR = {Puneet Kumar Dokani, ASEEM BEHL, Jawahar C V, M. Pawan Kumar}, TITLE = {Learning to rank using high-order information}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2014}}
The problem of ranking a set of visual samples according to their relevance to a query plays an important role in computer vision. The traditional approach for ranking is to train a binary classifier such as a support vector machine (svm). Binary classifiers suffer from two main deficiencies: (i) they do not optimize a ranking-based loss function, for example, the average precision (ap) loss; and (ii) they cannot incorporate high-order information such as the a priori correlation between the relevance of two visual samples (for example, two persons in the same image tend to perform the same action). We propose two novel learning formulations that allow us to incorporate high-order information for ranking. The first framework, called high-order binary svm (hob-svm), allows for a structured input. The parameters of hob-svm are learned by minimizing a convex upper bound on a surrogate 0-1 loss function. In order to obtain the ranking of the samples that form the structured input, hob-svm sorts the samples according to their max-marginals. The second framework, called high-order average precision svm (hoap-svm), also allows for a structured input and uses the same ranking criterion. However, in contrast to hob-svm, the parameters of hoap-svm are learned by minimizing a difference-of-convex upper bound on the ap loss. Using a standard, publicly available dataset for the challenging problem of action classification, we show that both hob-svm and hoap-svm outperform the baselines that ignore high-order information.
Learning partially shared dictionaries for domain adaptation
VIRESH RANJAN,Gaurav Harit,Jawahar C V
Asian Conference on Computer Vision, ACCV, 2014
@inproceedings{bib_Lear_2014, AUTHOR = {VIRESH RANJAN, Gaurav Harit, Jawahar C V}, TITLE = {Learning partially shared dictionaries for domain adaptation}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2014}}
Real world applicability of many computer vision solutions is constrained by the mismatch between the training and test domains. This mismatch might arise because of factors such as change in pose, lighting conditions, quality of imaging devices, intra-class variations inherent in object categories, etc. In this work, we present a dictionary learning based approach to tackle the problem of domain mismatch. In our approach, we jointly learn dictionaries for the source and the target domains. The dictionaries are partially shared, i.e. some elements are common across both the dictionaries. These shared elements can represent the information which is common across both the domains. The dictionaries also have some elements to represent the domain specific information. Using these dictionaries, we separate the domain specific information and the information which is common across the domains. We use the latter for training cross-domain classifiers, i.e., we build classifiers that work well on a new target domain while using labeled examples only in the source domain. We conduct cross-domain object recognition experiments on popular benchmark datasets and show improvement in results over the existing state-of-the-art domain adaptation approaches.
Scene text recognition and retrieval for large lexicons
UDIT ROY,ANAND MISHRA,Karteek Alahari,Jawahar C V
Asian Conference on Computer Vision, ACCV, 2014
@inproceedings{bib_Scen_2014, AUTHOR = {UDIT ROY, ANAND MISHRA, Karteek Alahari, Jawahar C V}, TITLE = {Scene text recognition and retrieval for large lexicons}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2014}}
In this paper we propose a framework for recognition and retrieval tasks in the context of scene text images. In contrast to many of the recent works, we focus on the case where an image-specific list of words, known as the small lexicon setting, is unavailable. We present a conditional random field model defined on potential character locations and the interactions between them. Observing that the interaction potentials computed in the large lexicon setting are less effective than in the case of a small lexicon, we propose an iterative method, which alternates between finding the most likely solution and refining the interaction potentials. We evaluate our method on public datasets and show that it improves over baseline and state-of-the-art approaches. For example, we obtain nearly 15% improvement in recognition accuracy and precision for our retrieval task over baseline methods on the IIIT-5K word dataset, with a large lexicon containing 0.5 million words.
Optimizing storage intensive vision applications to device capacity
ROHIT GIRDHAR,Jayaguru Panda,Jawahar C V
Asian Conference on Computer Vision, ACCV, 2014
@inproceedings{bib_Opti_2014, AUTHOR = {ROHIT GIRDHAR, Jayaguru Panda, Jawahar C V}, TITLE = {Optimizing storage intensive vision applications to device capacity}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2014}}
Computer vision applications today run on a wide range of mobile devices. Even though these devices are becoming more ubiquitous and general purpose, we continue to see a whole spectrum of processing and storage capabilities within this class. Moreover, even as the processing and storage capacity of devices are increasing, the complexity of vision solutions and the variety of use cases create greater demands on these resources. This requires appropriate adaptation of the mobile vision applications with minimal changes in the algorithm or implementation. In this work, we focus on optimizing the memory usage for storage intensive vision applications. In this paper, we propose a framework to configure memory requirements of vision applications. We start from a gold standard desktop application, and reduce its size for a given memory constraint. We formulate the storage optimization problem as a mixed integer programming (mip) based optimization to select the most relevant subset of data to be retained. For large datasets, we use a greedy approximate solution which is empirically comparable to the optimal mip solution. We demonstrate the method in two different use cases: (a) an instance retrieval task where an image of a query object is looked up for instant recognition/annotation, and (b) augmented reality where the computational requirement is minimized by rendering and storing precomputed views. In both cases, we show that our method allows a reduction in storage by almost 5× with no significant performance loss.
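The greedy approximation can be pictured with a simple value-per-kilobyte heuristic; the utility and size numbers below are hypothetical, and this is not the paper's MIP formulation.

import numpy as np

def greedy_select(utility, sizes_kb, budget_kb):
    # keep items with the best utility-per-KB until the storage budget is exhausted
    order = np.argsort(-(utility / sizes_kb))
    kept, used = [], 0.0
    for i in order:
        if used + sizes_kb[i] <= budget_kb:
            kept.append(i)
            used += sizes_kb[i]
    return np.array(kept), used

rng = np.random.default_rng(0)
utility = rng.random(1000)               # e.g. how often a descriptor matched past queries
sizes_kb = rng.uniform(0.5, 2.0, 1000)   # per-descriptor storage cost
kept, used = greedy_select(utility, sizes_kb, budget_kb=250)
print(len(kept), round(used, 1))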
Monocular vision based road marking recognition for driver assistance and safety
MOHAK KUMAR SUKHWANI,SURIYA SINGH,ANIRUDH GOYAL,ASEEM BEHL,PRITISH MOHAPATRA,Brijendra Kumar Bharti,Jawahar C V
International Conference on Vehicular Electronics and Safety, ICVES, 2014
@inproceedings{bib_Mono_2014, AUTHOR = {MOHAK KUMAR SUKHWANI, SURIYA SINGH, ANIRUDH GOYAL, ASEEM BEHL, PRITISH MOHAPATRA, Brijendra Kumar Bharti, Jawahar C V}, TITLE = {Monocular vision based road marking recognition for driver assistance and safety}, BOOKTITLE = {International Conference on Vehicular Electronics and Safety}. YEAR = {2014}}
In this paper, we present a solution to generate semantically richer descriptions and instructions for driver assistance and safety. Our solution builds upon a set of computer vision and machine learning modules. We start with low-level image processing and finally generate high-level descriptions. We do this by combining the results of the image pattern recognition module with the prior knowledge on traffic rules and the larger context present in the video sequence. For recognition of road markings, we use an SVM based classifier and a HOG based classifier. We test our method on real data captured in urban settings, and report impressive performance. Qualitative and quantitative performance of various modules are presented.
Estimating Floor Regions in Cluttered Indoor Scenes from First Person Camera View
SANCHIT AGGARWAL,Anoop Namboodiri,Jawahar C V
International conference on Pattern Recognition, ICPR, 2014
@inproceedings{bib_Esti_2014, AUTHOR = {SANCHIT AGGARWAL, Anoop Namboodiri, Jawahar C V}, TITLE = {Estimating Floor Regions in Cluttered Indoor Scenes from First Person Camera View}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2014}}
The ability to detect floor regions from an image enables a variety of applications such as indoor scene understanding, mobility assessment, robot navigation, path planning and surveillance. In this work, we propose a framework for estimating floor regions in cluttered indoor environments. The problem of floor detection and segmentation is challenging in situations where floor and non-floor regions have similar appearances. It is even harder to segment floor regions when clutter, specular reflections, shadows and textured floors are present within the scene. Our framework utilizes a generic classifier trained from appearance cues as well as floor density estimates, both trained from a variety of indoor images. The results of the classifier are then adapted to a specific test image where we integrate appearance, position and geometric cues in an iterative framework. A Markov Random Field framework is used to integrate the cues to segment floor regions. In contrast to previous settings that relied on optical flow, depth sensors or multiple images in a calibrated setup, our method can work on a single image. It is also more flexible as we avoid assumptions like a Manhattan world scene or restricting clutter only to wall-floor boundaries. Experimental results on the public MIT Scene dataset, as well as a more challenging dataset that we acquired, demonstrate the robustness and efficiency of our framework in the above-mentioned complex situations.
Face recognition in videos by label propagation
N VIJAY KUMAR,Anoop Namboodiri,Jawahar C V
International conference on Pattern Recognition, ICPR, 2014
@inproceedings{bib_Face_2014, AUTHOR = {N VIJAY KUMAR, Anoop Namboodiri, Jawahar C V}, TITLE = {Face recognition in videos by label propagation}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2014}}
We consider the problem of automatic identification of faces in videos such as movies, given a dictionary of known faces from a public or an alternate database. This has applications in video indexing, content based search, surveillance, and real time recognition on wearable computers. We propose a two stage approach for this problem. First, we recognize the faces in a video using a sparse representation framework based on l1-minimization and select a few key-frames based on a robust confidence measure. We then use transductive learning to propagate the labels from the key-frames to the remaining frames by incorporating constraints simultaneously in temporal and feature spaces. This is in contrast to some of the previous approaches where every test frame/track is identified independently, ignoring the correlation between the faces in video tracks. Having a few key-frames belonging to a few subjects for label propagation, rather than a large dictionary of actors, reduces the amount of confusion. We evaluate the performance of our algorithm on the Movie Trailer face dataset and five movie clips, and achieve a significant improvement in labeling accuracy compared to previous approaches.
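The propagation stage can be approximated with an off-the-shelf graph-based transductive learner; the sketch below runs scikit-learn's LabelSpreading on stand-in per-frame descriptors, with the scaled frame index appended as a crude temporal constraint (the paper handles temporal and feature constraints jointly, which this does not reproduce).

import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
n_frames, d = 200, 64
X = rng.standard_normal((n_frames, d))     # stand-in per-frame face descriptors

y = np.full(n_frames, -1)                  # -1 marks frames whose identity is unknown
y[[0, 5, 120]] = [0, 0, 1]                 # confident key-frames (e.g. from the sparse stage)

# append the scaled frame index so the kNN graph also prefers temporally close frames
t = np.arange(n_frames)[:, None] / n_frames
X_aug = np.hstack([X, 5.0 * t])

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X_aug, y)
print(model.transduction_[:10])            # propagated identity for the first 10 frames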
Providing Services on Demand By User Action Modeling on Smart Phones
KUMAR VISHAL,ROMIL BANSAL,Anoop Namboodiri,Jawahar C V
international joint conference on pervasive and ubiquitous computing, Ubicomp, 2014
@inproceedings{bib_Prov_2014, AUTHOR = {KUMAR VISHAL, ROMIL BANSAL, Anoop Namboodiri, Jawahar C V}, TITLE = {Providing Services on Demand By User Action Modeling on Smart Phones}, BOOKTITLE = {international joint conference on pervasive and ubiquitous computing}. YEAR = {2014}}
Reactionless visual servoing of a dual-arm space robot
A. H. Abdul Hafez,V V ANURAG,Suril v shah,K Madhava Krishna,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2014
@inproceedings{bib_Reac_2014, AUTHOR = {A. H. Abdul Hafez, V V ANURAG, Suril V Shah, K Madhava Krishna, Jawahar C V}, TITLE = {Reactionless visual servoing of a dual-arm space robot}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2014}}
This paper presents a novel visual servoing controller for a satellite mounted dual-arm space robot. The controller is designed to complete the task of servoing the robot's end-effectors to the desired pose, while regulating the orientation of the base satellite. A task redundancy approach is utilized to coordinate the servoing process and the attitude of the base satellite. The visual task is defined as the primary task, while regulating the attitude of the base satellite to zero is defined as the secondary task. The secondary task is formulated as an optimization problem in such a way that it does not affect the primary task, and simultaneously minimizes its cost function. A set of numerical experiments is carried out on a dual-arm space robot showing the efficacy of the proposed control methodology.
Indian Movie Face Database: A Benchmark for Face Recognition Under Wide Variations
Shankar Setty,Moula Husain,Parisa Beham,Jyothi Gudavalli,Menaka Kandasamy,Radhesyam Vaddi,Vidyagouri Hemadri,J C Karure,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2013
@inproceedings{bib_Indi_2013, AUTHOR = {Shankar Setty, Moula Husain, Parisa Beham, Jyothi Gudavalli, Menaka Kandasamy, Radhesyam Vaddi, Vidyagouri Hemadri, J C Karure, Jawahar C V}, TITLE = {Indian Movie Face Database: A Benchmark for Face Recognition Under Wide Variations}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2013}}
Recognizing human faces in the wild is emerging as a critically important, and technically challenging computer vision problem. With a few notable exceptions, most previous works in the last several decades have focused on recognizing faces captured in a laboratory setting. However, with the introduction of databases such as LFW and Pubfigs, the face recognition community is gradually shifting its focus to much more challenging unconstrained settings. Since its introduction, the LFW verification benchmark has been getting a lot of attention, with various researchers contributing towards state-of-the-art results. To further boost unconstrained face recognition research, we introduce a more challenging Indian Movie Face Database (IMFDB) that has much more variability compared to LFW and Pubfigs. The database consists of 34512 faces of 100 known actors collected from approximately 103 Indian movies. Unlike LFW and Pubfigs, which used face detectors to automatically detect the faces from the web collection, faces in IMFDB are detected manually from all the movies. Manual selection of faces from movies resulted in a high degree of variability (in scale, pose, expression, illumination, age, occlusion, makeup) which one could see in the natural world. IMFDB is the first face database that provides a detailed annotation in terms of age, pose, gender, expression and amount of occlusion for each face, which may help other face-related research.
Stable hybrid visual servo control by a weighted combination of image-based and position-based algorithms
A. H. Abdul Hafez,Enric Cervera,Jawahar C V
International Journal of Control and Automation, IJCA, 2013
@inproceedings{bib_Stab_2013, AUTHOR = {A. H. Abdul Hafez, Enric Cervera, Jawahar C V}, TITLE = {Stable hybrid visual servo control by a weighted combination of image-based and position-based algorithms}, BOOKTITLE = {International Journal of Control and Automation}. YEAR = {2013}}
In this paper, we present a novel stable hybrid vision-based robot control algorithm. This method utilizes probabilistic integration of image-based and position-based visual servo controllers to produce an improved vision-based robot control. A probabilistic framework is employed to derive the integration scheme. Appropriate probabilistic importance functions are defined for the two basic algorithms to characterize their suitability in the task. The integrated algorithm has superior performance both in image and Cartesian spaces. Experiments validate this claim.
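A toy sketch of such a weighted blend of the two controllers; the importance weights are plain numbers here, whereas in the paper they come from the probabilistic importance functions.

import numpy as np

def hybrid_velocity(v_ibvs, v_pbvs, p_ibvs, p_pbvs):
    # blend image-based and position-based camera velocity commands
    w = p_ibvs / (p_ibvs + p_pbvs + 1e-12)
    return w * v_ibvs + (1.0 - w) * v_pbvs

v_ibvs = np.array([0.02, 0.00, 0.05, 0.0, 0.0, 0.01])   # 6-DOF command from the image error
v_pbvs = np.array([0.03, 0.01, 0.04, 0.0, 0.0, 0.00])   # 6-DOF command from the pose error
print(hybrid_velocity(v_ibvs, v_pbvs, p_ibvs=0.7, p_pbvs=0.3))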
Learning multiple non-linear sub-spaces using k-rbms
SIDDHARTHA CHANDRA,Shailesh Kumar,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2013
@inproceedings{bib_Lear_2013, AUTHOR = {SIDDHARTHA CHANDRA, Shailesh Kumar, Jawahar C V}, TITLE = {Learning multiple non-linear sub-spaces using k-rbms}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2013}}
Understanding the nature of data is the key to building good representations. In domains such as natural images, the data comes from very complex distributions which are hard to capture. Feature learning intends to discover or best approximate these underlying distributions and use their knowledge to weed out irrelevant information, preserving most of the relevant information. Feature learning can thus be seen as a form of dimensionality reduction. In this paper, we describe a feature learning scheme for natural images. We hypothesize that image patches do not all come from the same distribution; they lie in multiple non-linear subspaces. We propose a framework that uses K Restricted Boltzmann Machines (K-RBMs) to learn multiple non-linear subspaces in the raw image space. Projections of the image patches into these subspaces give us features, which we use to build image representations. Our algorithm solves the coupled problem of finding the right non-linear subspaces in the input space and associating image patches with those subspaces in an iterative EM-like algorithm to minimize the overall reconstruction error. Extensive empirical results over several popular image classification datasets show that representations based on our framework outperform traditional feature representations such as the SIFT based Bag-of-Words (BoW) and convolutional deep belief networks.
Blocks that shout: Distinctive parts for scene classification
MAYANK JUNEJA,Andrea Vedaldi,Jawahar C V,Andrew Zisserman
Computer Vision and Pattern Recognition, CVPR, 2013
@inproceedings{bib_Bloc_2013, AUTHOR = {MAYANK JUNEJA, Andrea Vedaldi, Jawahar C V, Andrew Zisserman}, TITLE = {Blocks that shout: Distinctive parts for scene classification}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2013}}
The automatic discovery of distinctive parts for an object or scene class is challenging since it requires simultaneously learning the part appearance and identifying the part occurrences in images. In this paper, we propose a simple, efficient, and effective method to do so. We address this problem by learning parts incrementally, starting from a single part occurrence with an Exemplar SVM. In this manner, additional part instances are discovered and aligned reliably before being considered as training examples. We also propose entropy-rank curves as a means of evaluating the distinctiveness of parts shareable between categories and use them to select useful parts out of a set of candidates. We apply the new representation to the task of scene categorisation on the MIT Scene 67 benchmark. We show that our method can learn parts which are significantly more informative and for a fraction of the cost, compared to previous part-learning methods such as Singh et al. [28]. We also show that a well constructed bag of words or Fisher vector model can substantially outperform the previous state-of-the-art classification performance on this data.
Generating image descriptions using semantic similarities in the output space
YASHASWI VERMA,ANKUSH GUPTA,PRASHANTH REDDY MANNEM,Jawahar C V
Computer Vision and Pattern Recognition Conference workshops, CVPR-W, 2013
@inproceedings{bib_Gene_2013, AUTHOR = {YASHASWI VERMA, ANKUSH GUPTA, PRASHANTH REDDY MANNEM, Jawahar C V}, TITLE = {Generating image descriptions using semantic similarities in the output space}, BOOKTITLE = {Computer Vision and Pattern Recognition Conference workshops}. YEAR = {2013}}
Automatically generating meaningful descriptions for images has recently emerged as an important area of research. In this direction, a nearest-neighbour based generative phrase prediction model (PPM) proposed by (Gupta et al. 2012) was shown to achieve state-of-the-art results on the PASCAL sentence dataset, thanks to the simultaneous use of three different sources of information (i.e. visual clues, corpus statistics and available descriptions). However, they do not utilize semantic similarities among the phrases that might be helpful in relating semantically similar phrases during phrase relevance prediction. In this paper, we extend their model by considering inter-phrase semantic similarities. To compute the similarity between two phrases, we consider similarities among their constituent words determined using WordNet. We also re-formulate their objective function for parameter learning by penalizing each pair of phrases unevenly, in a manner similar to that in structured predictions. Various automatic and human evaluations are performed to demonstrate the advantage of our “semantic phrase prediction model” (SPPM) over PPM.
Efficient Category Mining by Leveraging Instance Retrieval
ABHINAV GOEL,MAYANK JUNEJA,Jawahar C V
Computer Vision and Pattern Recognition Conference workshops, CVPR-W, 2013
@inproceedings{bib_Effi_2013, AUTHOR = {ABHINAV GOEL, MAYANK JUNEJA, Jawahar C V}, TITLE = {Efficient Category Mining by Leveraging Instance Retrieval}, BOOKTITLE = {Computer Vision and Pattern Recognition Conference workshops}. YEAR = {2013}}
We focus on the problem of mining object categories from large datasets like Google Street View images. Mining object categories in these unannotated datasets is an important and useful step to extract meaningful information. Often the location and spatial extent of an object in an image is unknown. Mining objects in such a setting is hard. Recent methods model this problem as learning a separate classifier for each category. This is computationally expensive since a large number of classifiers are required to be trained and evaluated before one can mine a concise set of meaningful objects. On the other hand, fast and efficient solutions have been proposed for the retrieval of instances (same object) from large databases. We borrow, from the instance retrieval pipeline, its strengths and adapt it to speed up category mining. For this, we explore objects which are “near-instances”. We mine several near-instance object categories from images. Using an instance retrieval based solution, we are able to mine certain categories of near-instance objects much faster than an Exemplar SVM based solution.
Decomposing bag of words histograms
ANKIT GANDHI,Karteek Alahari,Jawahar C V
International Conference on Computer Vision, ICCV, 2013
@inproceedings{bib_Deco_2013, AUTHOR = {ANKIT GANDHI, Karteek Alahari, Jawahar C V}, TITLE = {Decomposing bag of words histograms}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2013}}
We aim to decompose a global histogram representation of an image into histograms of its associated objects and regions. This task is formulated as an optimization problem, given a set of linear classifiers, which can effectively discriminate the object categories present in the image. Our decomposition bypasses harder problems associated with accurately localizing and segmenting objects. We evaluate our method on a wide variety of composite histograms, and also compare it with MRF-based solutions. In addition to merely measuring the accuracy of decomposition, we also show the utility of the estimated object and background histograms for the task of image classification on the PASCAL VOC 2007 dataset.
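A simplified stand-in for the optimization: decompose the global histogram as a non-negative combination of per-category reference histograms via non-negative least squares (the paper's formulation uses the discriminative linear classifiers directly, which is not modeled here).

import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
vocab, n_classes = 500, 6

P = rng.random((vocab, n_classes))        # hypothetical per-category reference histograms
P /= P.sum(axis=0, keepdims=True)

# a composite image histogram: a mixture of categories 1 and 4 plus a little noise
h = 0.7 * P[:, 1] + 0.3 * P[:, 4] + 0.01 * rng.random(vocab)

weights, _ = nnls(P, h)                   # non-negative decomposition of the global histogram
per_class = P * weights                   # each column approximates one object's histogram
print(np.round(weights / weights.sum(), 2))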
Image retrieval using textual cues
ANAND MISHRA,Karteek Alahari,Jawahar C V
International Conference on Computer Vision, ICCV, 2013
@inproceedings{bib_Imag_2013, AUTHOR = {ANAND MISHRA, Karteek Alahari, Jawahar C V}, TITLE = {Image retrieval using textual cues}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2013}}
We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, which we introduce.
Offline mobile instance retrieval with a small memory footprint
Jayaguru Panda,Michael S. Brown,Jawahar C V
International Conference on Computer Vision, ICCV, 2013
@inproceedings{bib_Offl_2013, AUTHOR = {Jayaguru Panda, Michael S. Brown, Jawahar C V}, TITLE = {Offline mobile instance retrieval with a small memory footprint}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2013}}
Existing mobile image instance retrieval applications assume a network-based usage where image features are sent to a server to query an online visual database. In this scenario, there are no restrictions on the size of the visual database. This paper, however, examines how to perform this same task offline, where the entire visual index must reside on the mobile device itself within a small memory footprint. Such solutions have applications in location recognition and product recognition. Mobile instance retrieval requires a significant reduction in the visual index size. To achieve this, we describe a set of strategies that can reduce the visual index by up to 60-80× compared to a standard instance retrieval implementation found on desktops or servers. While our proposed reduction steps affect the overall mean Average Precision (mAP), they are able to maintain a good precision for the top K results (P@K). We argue that for such offline applications, maintaining a good P@K is sufficient. The effectiveness of this approach is demonstrated on several standard databases. A working application designed for a remote historical site is also presented. This application is able to reduce a 50,000-image index structure to 25 MB while providing a precision of 97% for P@10 and 100% for P@1.
Error detection in highly inflectional languages
NAVEEN T S,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2013
@inproceedings{bib_Erro_2013, AUTHOR = {NAVEEN T S, Jawahar C V}, TITLE = {Error detection in highly inflectional languages}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2013}}
Error detection in OCR output using dictionaries and statistical language models (SLMs) has become common practice for some time now, while designing post-processors. Multiple strategies have been used successfully in English to achieve this. However, this has not yet translated towards improving error detection performance in many inflectional languages, especially Indian languages. Challenges such as a large unique word list, lack of linguistic resources, lack of reliable language models, etc., are some of the reasons for this. In this paper, we investigate the major challenges in developing error detection techniques for highly inflectional Indian languages. We compare and contrast several attributes of English with inflectional languages such as Telugu and Malayalam. We make observations by analyzing statistics computed from popular corpora and relate these observations to the error detection schemes. We propose a method which can detect errors for Telugu and Malayalam, with an F-score comparable to some of the less inflectional languages like Hindi. Our method learns from the error patterns and SLMs.
Character n-gram spotting on handwritten documents using weakly-supervised segmentation
UDIT ROY,NAVEEN T S,Pramod Sankar K,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2013
@inproceedings{bib_Char_2013, AUTHOR = {UDIT ROY, NAVEEN T S, Pramod Sankar K, Jawahar C V}, TITLE = {Character n-gram spotting on handwritten documents using weakly-supervised segmentation}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2013}}
In this paper, we present a solution towards building a retrieval system over handwritten document images that i) is recognition-free, ii) allows text querying, iii) can retrieve at sub-word level, and iv) can search for out-of-vocabulary words. Unlike previous approaches that operate at either character or word levels, we use character n-gram images (CNG-img) as the retrieval primitive. CNG-img are sequences of character segments that are represented and matched in the image space. The word images are now treated as a bag-of-CNG-img, which can be indexed and matched in the feature space. This allows for recognition-free search (query-by-example), which can retrieve morphologically similar words that have matching sub-words. Further, to enable query-by-keyword, we build an automated scheme to generate labeled exemplars for characters and character n-grams from unconstrained handwritten documents. We pose this problem as one of weakly-supervised learning, where character/n-gram labeling is obtained automatically from the word labels. The resulting retrieval system can answer queries from an unlimited vocabulary. The approach is demonstrated on the George Washington collection; results show a major improvement in retrieval performance as compared to word-recognition and word-spotting methods.
Devanagari text recognition: A transcription based formulation
NAVEEN T S,Aman Neelappa,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2013
@inproceedings{bib_Deva_2013, AUTHOR = {NAVEEN T S, Aman Neelappa, Jawahar C V}, TITLE = {Devanagari text recognition: A transcription based formulation}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2013}}
Optical Character Recognition (OCR) problems are often formulated as an isolated character (symbol) classification task followed by a post-classification stage (which contains modules like Unicode generation, error correction, etc.) to generate the textual representation, for most of the Indian scripts. Such approaches are prone to failures due to (i) difficulties in designing a reliable word-to-symbol segmentation module that can robustly work in the presence of degraded (cut/fused) images and (ii) converting the outputs of the classifiers to a valid sequence of Unicodes. In this paper, we propose a formulation where the expectations on these two modules are minimized, and the harder recognition task is modelled as learning of an appropriate sequence-to-sequence translation scheme. We thus formulate recognition as a direct transcription problem. Given many examples of feature sequences and their corresponding Unicode representations, our objective is to learn a mapping which can convert a word directly into a Unicode sequence. This formulation has multiple practical advantages: (i) it reduces the number of classes significantly for the Indian scripts, (ii) it removes the need for a reliable word-to-symbol segmentation, (iii) it does not require strong annotation of symbols to design the classifiers, and (iv) it directly generates a valid sequence of Unicodes. We test our method on more than 6000 pages of printed Devanagari documents from multiple sources. Our method consistently outperforms other state-of-the-art implementations.
Document specific sparse coding for word retrieval
RAVI SHANKAR BADRY,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2013
@inproceedings{bib_Docu_2013, AUTHOR = {RAVI SHANKAR BADRY, Jawahar C V}, TITLE = {Document specific sparse coding for word retrieval}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2013}}
Bag of words (BoW) based retrieval is an efficient method to compare the visual similarity between two images. Recognition-free methods based on BoW have been shown to outperform OCR based methods. We further improve the performance by defining a document specific sparse coding scheme for representing visual words (interest points) in document images. Our method is motivated by the successful use of sparsity in signal representation by exploiting the neighbourhood properties. In addition to providing insights into the design of the coding scheme, we also verify the method on two datasets and compare with recent methods. We have also developed a text query based search solution, and report performance comparable to image based search.
Whole is greater than sum of parts: Recognizing scene text words
Vibhor Goel,ANAND MISHRA,Karteek Alahari,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2013
@inproceedings{bib_Whol_2013, AUTHOR = {Vibhor Goel, ANAND MISHRA, Karteek Alahari, Jawahar C V}, TITLE = {Whole is greater than sum of parts: Recognizing scene text words}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2013}}
Recognizing text in images taken in the wild is a challenging problem that has received great attention in recent years. Previous methods addressed this problem by first detecting individual characters, and then forming them into words. Such approaches often suffer from weak character detections, due to large intra-class variations, even more so than characters from scanned documents. We take a different view of the problem and present a holistic word recognition framework. In this, we first represent the scene text image and synthetic images generated from lexicon words using gradient-based features. We then recognize the text in the image by matching the scene and synthetic image features with our novel weighted Dynamic Time Warping (wDTW) approach. We perform experimental analysis on challenging public datasets, such as Street View Text and ICDAR 2003. Our proposed method significantly outperforms our earlier work in Mishra et al. (CVPR 2012), as well as many other recent works, such as Novikova et al. (ECCV 2012), Wang et al. (ICPR 2012), and Wang et al. (ICCV 2011).
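A small sketch of the matching step, assuming per-column gradient features and a plain per-frame weighting; the learned weighting of the actual wDTW is not modeled.

import numpy as np

def wdtw(a, b, weights=None):
    # weighted DTW between two feature sequences (rows = columns of the word image)
    n, m = len(a), len(b)
    w = np.ones(n) if weights is None else weights
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = w[i - 1] * np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# word recognition as nearest synthetic lexicon image under the (w)DTW distance
rng = np.random.default_rng(0)
scene = rng.random((40, 16))                              # stand-in scene-word features
lexicon = {w: rng.random((rng.integers(30, 50), 16)) for w in ["open", "hotel", "exit"]}
print(min(lexicon, key=lambda w: wdtw(scene, lexicon[w])))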
Bringing semantics in word image retrieval
PRAVEEN KRISHNAN,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2013
@inproceedings{bib_Brin_2013, AUTHOR = {PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {Bringing semantics in word image retrieval}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2013}}
Performance of the recognition-free approaches for document retrieval heavily depends on the exact or approximate matching of images (in some feature space) to retrieve documents containing the same word. However, the harder problem in information retrieval is to effectively bring semantics into the retrieval pipeline. This is further challenging when the matching is based on visual features. In this work, we investigate this problem, and suggest a solution by directly transferring the semantics from the textual domain. Our retrieval framework uses (i) language resources like WordNet and (ii) an annotated corpus of document images, to retrieve semantically relevant words from a large word image database. We demonstrate the method on two languages, English and Hindi, and quantitatively evaluate the performance on annotated word image databases of more than a million images.
Detection of cut-and-paste in document images
ANKIT GANDHI,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2013
@inproceedings{bib_Dete_2013, AUTHOR = {ANKIT GANDHI, Jawahar C V}, TITLE = {Detection of cut-and-paste in document images}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2013}}
Many documents are created by Cut-And-Paste (CAP) of existing documents. In this paper, we propose a novel technique to detect CAP in document images. This can help in detecting unethical CAP in document image collections. Our solution is recognition free, and scalable to large collections of documents. Our formulation is also independent of the imaging process (camera based or scanner based) and does not use any language specific information for matching across documents. We model the solution as finding a mixture of homographies, and design a linear programming (LP) based solution to compute the same. Our method is presently limited by the fact that we do not support detection of CAP in documents formed by editing of the textual content. Our experiments demonstrate that without loss of generality (i.e. without assuming the number of source documents), we can correctly detect and match the CAP content in a questioned document image by simultaneously comparing it with a large number of images in the database. We achieve a CAP detection accuracy of as high as 90%, even when the spatial extent of the CAP content in a document image is as small as 15% of the entire image area.
Parsing Clothes in Unrestricted Images.
NATARAJ J,MINOCHA AYUSH ARUN,DIGVIJAY SINGH,Jawahar C V
British Machine Vision Conference, BMVC, 2013
@inproceedings{bib_Pars_2013, AUTHOR = {NATARAJ J, MINOCHA AYUSH ARUN, DIGVIJAY SINGH, Jawahar C V}, TITLE = {Parsing Clothes in Unrestricted Images.}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2013}}
Parsing for clothes in images and videos is a critical step towards understanding the human appearance. In this work, we propose a method to segment clothes in settings where there is no restriction on number and type of clothes, pose of the person, viewing angle, occlusion and number of people. This is a challenging task as clothes, even of the same category, have large variations in color and texture. The presence of human joints is the best indicator for cloth types as most of the clothes are consistently worn around the joints. We incorporate the human joint prior by estimating the body joint distributions using the detectors and learning the cloth-joint co-occurrences of different cloth types with respect to body joints. The cloth-joint and cloth-cloth co-occurrences are used as a part of the conditional random field framework to segment the image into different clothing. Our results indicate that we have outperformed the recent attempt [16] on H3D [3], a fairly complex dataset.
Exploring SVM for Image Annotation in Presence of Confusing Labels.
YASHASWI VERMA,Jawahar C V
British Machine Vision Conference, BMVC, 2013
@inproceedings{bib_Expl_2013, AUTHOR = {YASHASWI VERMA, Jawahar C V}, TITLE = {Exploring SVM for Image Annotation in Presence of Confusing Labels.}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2013}}
We address the problem of automatic image annotation in large vocabulary datasets. In such datasets, for a given label, there could be several other labels that act as its confusing labels. Three possible factors for this are (i) incomplete-labeling (“cars” vs. “vehicle”), (ii) label-ambiguity (“flowers” vs. “blooms”), and (iii) structural-overlap (“lion” vs. “tiger”). While previous studies in this domain have mostly focused on nearest-neighbour based models, we show that even the conventional one-vs-rest SVM significantly outperforms several benchmark models. We also demonstrate that with a simple modification in the hinge-loss of SVM, it is possible to significantly improve its performance. In particular, we introduce a tolerance parameter in the hinge-loss. This makes the new model more tolerant against errors in the classification of samples tagged with confusing labels as compared to other samples. This tolerance parameter is automatically determined using visual similarity and dataset statistics. Experimental evaluations demonstrate that our method (referred to as SVM with Variable Tolerance or SVM-VT) shows promising results on the task of image annotation on three challenging datasets, and establishes a baseline for such models in this domain.
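One plausible reading of the modification, assuming the tolerance simply relaxes the required margin for samples tagged with confusing labels; in the paper the per-sample tolerance is derived from visual similarity and dataset statistics, which is not modeled here.

import numpy as np

def hinge_vt(score, y, tol):
    # hinge loss with a tolerance: tol = 0 recovers the usual hinge loss,
    # tol > 0 forgives small margin violations for samples with confusing labels
    return np.maximum(0.0, 1.0 - tol - y * score)

scores = np.array([0.4, -0.2, 0.9])
labels = np.array([1, 1, -1])
print(hinge_vt(scores, labels, tol=0.0))   # standard hinge
print(hinge_vt(scores, labels, tol=0.3))   # more tolerant near the margin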
Learning support order for manipulation in clutter
SWAGATIKA PANDA,A.H. Abdul Hazel,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2013
@inproceedings{bib_Lear_2013, AUTHOR = {SWAGATIKA PANDA, A.H. Abdul Hazel, Jawahar C V}, TITLE = {Learning support order for manipulation in clutter}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2013}}
Understanding the positional semantics of the environment plays an important role in manipulating an object in clutter. The interaction with surrounding objects in the environment must be considered in order to perform the task without causing the objects to fall or get damaged. In this paper, we learn the semantics in terms of support relationships among different objects in a cluttered environment by utilizing various photometric and geometric properties of the scene. To manipulate an object of interest, we use the inferred support relationship to derive a sequence in which its surrounding objects should be removed while causing minimal damage to the environment. We believe this work can push the boundary of robotic applications in grasping, object manipulation and picking-from-bin towards objects of generic shape and size, and scenarios with physical contact and overlap. We have created an RGBD dataset that consists of various objects used in day-to-day life present in clutter. We explore many different settings involving different kinds of object-object interaction. We successfully learn support relationships and predict support order in these settings.
Compacting Large and Loose Communities
V. CHANDRASHEKAR,Shailesh Kumar,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2013
@inproceedings{bib_Comp_2013, AUTHOR = {V. CHANDRASHEKAR, Shailesh Kumar, Jawahar C V}, TITLE = {Compacting Large and Loose Communities}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2013}}
Detecting compact overlapping communities in large networks is an important pattern recognition problem with applications in many domains. Most community detection algorithms trade off between community sizes, their compactness and the scalability of finding communities. Clique Percolation Method (CPM) [1] and Local Fitness Maximization (LFM) [2] are two prominent and commonly used overlapping community detection methods that scale to large networks. However, a significant number of communities found by them are large, noisy, and loose. In this paper, we propose a general algorithm that takes such large and loose communities generated by any method and refines them into compact communities in a systematic fashion. We define a new measure of community-ness based on eigenvector centrality, identify loose communities using this measure and propose an algorithm for partitioning such loose communities into compact communities. We refine the communities found by CPM and LFM using our method and show their effectiveness compared to the original communities in a recommendation engine task.
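A simplified sketch of a centrality-based community-ness check: run power iteration on the subgraph induced by a candidate community and flag members with low eigenvector centrality as candidates for trimming (undirected graph as a dense adjacency matrix; this is not the paper's exact partitioning algorithm).

import numpy as np

def member_centrality(adj, members, iters=100):
    # eigenvector centrality (power iteration) on the subgraph induced by `members`
    sub = adj[np.ix_(members, members)].astype(float)
    x = np.ones(len(members))
    for _ in range(iters):
        x = sub @ x
        x /= np.linalg.norm(x) + 1e-12
    return x

# toy graph: nodes 0-3 form a clique, node 4 hangs on loosely via node 3
adj = np.zeros((5, 5), dtype=int)
for i in range(4):
    for j in range(4):
        adj[i, j] = int(i != j)
adj[3, 4] = adj[4, 3] = 1
print(np.round(member_centrality(adj, [0, 1, 2, 3, 4]), 2))   # node 4 scores much lower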
Efficient and rich annotations for large photo collections
Jayaguru Panda,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2013
@inproceedings{bib_Effi_2013, AUTHOR = {Jayaguru Panda, Jawahar C V}, TITLE = {Efficient and rich annotations for large photo collections}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2013}}
Large unstructured photo collections from the Internet usually have distinguishable keyword tagging associated with the images. Photos from tourist and heritage sites can be described with detailed and part-wise annotations, resulting in improved automatic search and an enhanced photo browsing experience. Manually annotating a large community photo collection is a costly and redundant process as similar images share the same annotations. We demonstrate an interactive web-based annotation tool that allows multiple users to add, view, edit and suggest rich annotations for images in community photo collections. Since distinct annotations could be few, we have an easy and efficient batch annotation approach using an image similarity graph, pre-computed with instance retrieval and matching. This helps in seamlessly propagating annotations of the same objects or similar images across the entire dataset. We use a database of 20K images (Heritage-20K) taken from a world-famous heritage site to demonstrate and evaluate our annotation approach.
Learning semantic interaction among graspable objects
SWAGATIKA PANDA,A.H.Abdul Hafez,Jawahar C V
Conference on Pattern Recognition and Machine Intelligence, PReMI, 2013
@inproceedings{bib_Lear_2013, AUTHOR = {SWAGATIKA PANDA, A.H.Abdul Hafez, Jawahar C V}, TITLE = {Learning semantic interaction among graspable objects}, BOOKTITLE = {Conference on Pattern Recognition and Machine Intelligence}. YEAR = {2013}}
In this work, we aim at understanding semantic interaction among graspable objects in both direct and indirect physical contact for robotic manipulation tasks. Given an object of interest, its support relationship with other graspable objects is inferred hierarchically. The support relationship is used to predict the “support order”, or the order in which the surrounding objects need to be removed in order to manipulate the target object. We believe this can extend the scope of robotic manipulation tasks to typical clutter involving physical contact, overlap and objects of generic shapes and sizes. We have created an RGBD dataset consisting of various objects present in clutter using Kinect. We conducted our experiments and analysed the performance of our method on images from this dataset.
Semi-Supervised Clustering by Selecting Informative Constraints
VIDYADHAR RAO PANGA,Jawahar C V
Conference on Pattern Recognition and Machine Intelligence, PReMI, 2013
@inproceedings{bib_Semi_2013, AUTHOR = {VIDYADHAR RAO PANGA, Jawahar C V}, TITLE = {Semi-Supervised Clustering by Selecting Informative Constraints}, BOOKTITLE = {Conference on Pattern Recognition and Machine Intelligence}. YEAR = {2013}}
Traditional clustering algorithms use a predefined metric and no supervision in identifying the partition. Existing semi-supervised clustering approaches either learn a metric from randomly chosen constraints or actively select informative constraints using a generic distance measure like the Euclidean norm. We tackle the problem of identifying constraints that are informative for learning an appropriate metric for semi-supervised clustering. We propose an approach to simultaneously find appropriate constraints and learn a metric to boost the clustering performance. We evaluate the clustering quality of our approach using the learned metric on the MNIST handwritten digits, Caltech-256 and MSRC2 object image datasets. Our results on these datasets show significant improvements over baseline methods like MPCK-MEANS.
Image annotation in presence of noisy labels
V. CHANDRASHEKAR, Shailesh Kumar,Jawahar C V
Conference on Pattern Recognition and Machine Intelligence, PReMI, 2013
@inproceedings{bib_Imag_2013, AUTHOR = {V. CHANDRASHEKAR, Shailesh Kumar, Jawahar C V}, TITLE = {Image annotation in presence of noisy labels}, BOOKTITLE = {Conference on Pattern Recognition and Machine Intelligence}. YEAR = {2013}}
Labels associated with social images are a valuable source of information for image annotation, understanding and retrieval. These labels are often found to be noisy, mainly due to the collaborative tagging activities of users. Existing annotation methods have been developed and verified on noise-free labels of images. In this paper, we propose a novel and generic framework that exploits the collective knowledge embedded in noisy label co-occurrence pairs to derive robust annotations. We compare our method with a well-known image annotation algorithm and show its superiority in terms of annotation accuracy on the benchmark Corel5K and ESP datasets in the presence of noisy labels.
Near real-time face parsing
MINOCHA AYUSH ARUN,DIGVIJAY SINGH,NATARAJ. J,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2013
@inproceedings{bib_Near_2013, AUTHOR = {MINOCHA AYUSH ARUN, DIGVIJAY SINGH, NATARAJ. J, Jawahar C V}, TITLE = {Near real-time face parsing}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2013}}
Commercial applications like driver assistance programs in cars and smile detection software in cameras typically require reliable facial landmark points, such as the locations of eyes and lips, and the face pose, at near real-time rates. Current methods are often unreliable, very cumbersome or computationally intensive. In this work, we focus on implementing a reliable and real-time method which parses an image and detects faces, estimates their pose and locates landmark points on the face. Our method builds on the existing literature and works for both images and videos.
Sparse document image coding for restoration
Vijay Kumar,AMIT KUMAR BANSAL,GOUTAM HARI TULSIYAN,ANAND MISHRA,Anoop Namboodiri,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2013
@inproceedings{bib_Spar_2013, AUTHOR = {Vijay Kumar, AMIT KUMAR BANSAL, GOUTAM HARI TULSIYAN, ANAND MISHRA, Anoop Namboodiri, Jawahar C V}, TITLE = {Sparse document image coding for restoration}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2013}}
Sparse representation based image restoration techniques have been shown to be successful in solving various inverse problems such as denoising, inpainting, and super-resolution on natural images and videos. In this paper, we explore the use of sparse representation based methods specifically to restore degraded document images. While natural images form a very small subset of all possible images admitting the possibility of sparse representation, document images are significantly more restricted and are expected to be ideally suited for such a representation. However, the binary nature of textual document images makes dictionary learning and coding techniques unsuitable to be applied directly. We leverage the fact that different characters possess similar strokes, curves, and edges, and learn a dictionary that gives sparse decompositions for patches. Experimental results show significant improvement in image quality and OCR performance on documents collected from a variety of sources such as magazines and books. This method is, therefore, ideally suited for restoring highly degraded images in repositories such as digital libraries.
Sparse representation based face recognition with limited labeled samples
Vijay Kumar,Anoop Namboodiri,Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2013
@inproceedings{bib_Spar_2013, AUTHOR = {Vijay Kumar, Anoop Namboodiri, Jawahar C V}, TITLE = {Sparse representation based face recognition with limited labeled samples}, BOOKTITLE = {Asian Conference on Pattern Recognition}. YEAR = {2013}}
Sparse representations have emerged as a powerful approach for encoding images in a large class of machine recognition problems including face recognition. These methods rely on the use of an over-complete basis set for representing an image. This often assumes the availability of a large number of labeled training images, especially for high dimensional data. In many practical problems, the number of labeled training samples is very limited, leading to significant degradations in classification performance. To address the problem of lack of training samples, we propose a semi-supervised algorithm that labels the unlabeled samples through a multi-stage label propagation combined with sparse representation. In this representation, each image is decomposed as a linear combination of its nearest basis images, which has the advantage of both locality and sparsity. Extensive experiments on publicly available face databases show that the results are significantly better compared to state-of-the-art face recognition methods in the semi-supervised setting and are on par with fully supervised techniques.
Multibody vslam with relative scale solution for curvilinear motion reconstruction
RAHUL KUMAR NAMDEV,K Madhava Krishna,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2013
@inproceedings{bib_Mult_2013, AUTHOR = {RAHUL KUMAR NAMDEV, K Madhava Krishna, Jawahar C V}, TITLE = {Multibody vslam with relative scale solution for curvilinear motion reconstruction}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2013}}
A solution to the relative scale problem, where reconstructed moving objects and the stationary world are represented in a unified common scale, has proven equivalent to a conjecture. Motion reconstruction from a moving monocular camera is considered ill posed due to known problems of observability. We show, for the first time, several significant motion reconstructions of outdoor vehicles moving along non-holonomic curves and straight lines. The reconstructed motion is represented in the unified frame, which also depicts the estimated camera trajectory and the reconstructed stationary world. This is possible due to our Multibody VSLAM framework with a novel solution for relative scale proposed in the current paper. Two solutions that compute the relative scale are proposed. The solutions provide for a unified representation within four views of reconstruction of the moving object and are thus immediate. In one, the solution for the scale is that which satisfies the planarity constraint of the object motion. The assumption of planar object motion, while being generic enough, is subject to stringent degenerate situations that are more widespread. To circumvent such degeneracies, we assume the object motion to be locally circular or linear and find the relative scale solution for such object motions. Precise reconstruction is achieved on synthetic data. The fidelity of reconstruction is further vindicated with reconstructions of moving cars and vehicles in uncontrolled outdoor scenes.
Depth really Matters: Improving Visual Salient Region Detection with Depth.
KARTHIK. D,K Madhava Krishna,Deepu Rajan,Jawahar C V
British Machine Vision Conference, BMVC, 2013
@inproceedings{bib_Dept_2013, AUTHOR = {KARTHIK. D, K Madhava Krishna, Deepu Rajan, Jawahar C V}, TITLE = {Depth really Matters: Improving Visual Salient Region Detection with Depth.}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2013}}
Depth information has been shown to affect identification of visually salient regions in images. In this paper, we investigate the role of depth in saliency detection in the presence of (i) competing saliencies due to appearance, (ii) depth-induced blur and (iii) centre-bias. Having established through experiments that depth continues to be a significant contributor to saliency in the presence of these cues, we propose a 3D-saliency formulation that takes into account structural features of objects in an indoor setting to identify regions at salient depth levels. Computed 3D saliency is used in conjunction with 2D saliency models through non-linear regression using SVM to improve saliency maps. Experiments on benchmark datasets containing depth information show that the proposed fusion of 3D saliency with 2D saliency models results in an average improvement in ROC scores of about 9% over state-of-the-art 2D saliency models.
Visual localization in highly crowded urban environments
A. H. Abdul Hafez,MANPREET SINGH,K Madhava Krishna,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2013
@inproceedings{bib_Visu_2013, AUTHOR = {A. H. Abdul Hafez, MANPREET SINGH, K Madhava Krishna, Jawahar C V}, TITLE = {Visual localization in highly crowded urban environments}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2013}}
Visual localization in crowded dynamic environments requires information about static and dynamic objects. This paper presents a robust method that learns useful features from multiple runs in highly crowded urban environments. Useful features are identified as distinctive ones that are also reliable to extract in diverse imaging conditions. The relative importance of features is used to derive a weight for each feature. The popular bag-of-words model is used for image retrieval and localization, where the query image is the current view of the environment and the database contains the visual experience from previous runs. Based on reliability, features are augmented and eliminated over runs. This reduces the size of the representation and makes it more reliable in crowded scenes. We tested the proposed method on datasets collected from highly crowded Indian urban outdoor settings. Experiments have shown that with the help of a small subset (10%) of the detected features, we can reliably localize the camera. We achieve superior results in terms of localization accuracy even when more than 90% of the pixels are occluded or dynamic.
Robust Recognition of Degraded Documents Using Character N-Grams
Shrey Dutta,NAVEEN T S, Pramod Sankar K.,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2012
@inproceedings{bib_Robu_2012, AUTHOR = {Shrey Dutta, NAVEEN T S, Pramod Sankar K., Jawahar C V}, TITLE = {Robust Recognition of Degraded Documents Using Character N-Grams}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2012}}
In this paper we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCRs perform considerably well on good quality documents, but fail easily in the presence of degradations. Also, classical OCR approaches perform poorly over complex scripts such as those for Indian languages. We address these issues by proposing to recognize character n-gram images, which are basically groupings of consecutive character/component segments. Our approach is unique, since we use the character n-grams as a primitive for recognition rather than for post-processing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition in the case of heavily degraded documents.
Word image retrieval using bag of visual words
RAVI SHEKHAR,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2012
@inproceedings{bib_Word_2012, AUTHOR = {RAVI SHEKHAR, Jawahar C V}, TITLE = {Word image retrieval using bag of visual words}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2012}}
This paper presents a Bag of Visual Words (BoVW) based approach to retrieve similar word images from a large database, efficiently and accurately. We show that a text retrieval system can be adapted to build a word image retrieval solution. This helps in achieving scalability. We demonstrate the method on more than 1 Million word images with a sub-second retrieval time. We validate the method on four Indian languages, and report a mean average precision of more than 0.75. We represent the word images as histogram of visual words present in the image. Visual words are quantized representation of local regions, and for this work, SIFT descriptors at interest points are used as feature vectors. To address the lack of spatial structure in the BoVW representation, we re-rank the retrieved list. This significantly improves the performance.
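For readers unfamiliar with the retrieval pipeline sketched in the abstract above, the following is a minimal, hedged sketch of Bag-of-Visual-Words indexing and ranking. The random descriptor arrays, tiny vocabulary and cosine scoring are illustrative stand-ins only; the paper uses SIFT descriptors at interest points, a much larger vocabulary and a spatial re-ranking step.

```python
# A minimal Bag-of-Visual-Words retrieval sketch (not the paper's configuration).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-ins for SIFT descriptors of database word images.
database_descriptors = [rng.normal(size=(40, 128)) for _ in range(20)]
query_descriptors = rng.normal(size=(35, 128))

# 1. Build a small visual vocabulary by clustering all local descriptors.
vocab_size = 16
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(database_descriptors))

def bovw_histogram(descriptors):
    """Quantize local descriptors into visual words and build a normalized histogram."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

# 2. Represent every word image as a visual-word histogram.
db_hists = np.array([bovw_histogram(d) for d in database_descriptors])
query_hist = bovw_histogram(query_descriptors)

# 3. Rank database images by cosine similarity to the query histogram.
scores = db_hists @ query_hist
print("top-5 matches:", np.argsort(-scores)[:5])
```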
Video retrieval by mimicking poses
NATARAJ. J,Andrew Zisserman,Marcin Eichner,Vittorio Ferrari,Jawahar C V
International Conference on Multimedia Retrieval, ICMR, 2012
@inproceedings{bib_Vide_2012, AUTHOR = {NATARAJ. J, Andrew Zisserman, Marcin Eichner, Vittorio Ferrari, Jawahar C V}, TITLE = {Video retrieval by mimicking poses}, BOOKTITLE = {International Conference on Multimedia Retrieval}. YEAR = {2012}}
We describe a method for real time video retrieval where the task is to match the 2D human pose of a query. A user can form a query by (i) interactively controlling a stickman on a web based GUI, (ii) uploading an image of the desired pose, or (iii) using the Kinect and acting out the query himself. The method is scalable and is applied to a dataset of 18 films totaling more than three million frames. The real time performance is achieved by searching for approximate nearest neighbors to the query using a random forest of K-D trees. Apart from the query modalities, we introduce two other areas of novelty. First, we show that pose retrieval can proceed using a low dimensional representation. Second, we show that the precision of the results can be improved substantially by combining the outputs of independent human pose estimation algorithms. The performance of the system is assessed quantitatively over a range of pose queries.
Top-down and bottom-up cues for scene text recognition
ANAND MISHRA,Karteek Alahari,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2012
@inproceedings{bib_Top-_2012, AUTHOR = {ANAND MISHRA, Karteek Alahari, Jawahar C V}, TITLE = {Top-down and bottom-up cues for scene text recognition}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2012}}
Scene text recognition has gained significant attention from the computer vision community in recent years. Recognizing such text is a challenging problem, even more so than the recognition of scanned documents. In this work, we focus on the problem of recognizing text extracted from street images. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections from the image. We build a Conditional Random Field model on these detections to jointly model the strength of the detections and the interactions between them. We impose top-down cues obtained from a lexicon-based prior, i.e. language statistics, on the model. The optimal word represented by the text image is obtained by minimizing the energy function corresponding to the random field model. We show significant improvements in accuracies on two challenging public datasets, namely Street View Text (over 15%) and ICDAR 2003 (nearly 10%).
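To make the energy-minimisation idea above concrete, here is a toy, chain-structured version: candidate character detections give unary costs, a hand-made bigram prior gives pairwise costs, and a Viterbi pass recovers the lowest-energy word. The alphabet, scores and prior below are invented for illustration; the paper's model is a CRF over sliding-window character detections with lexicon-based priors.

```python
# Toy unary + pairwise energy minimisation over a chain of character slots.
import numpy as np

alphabet = list("aeort")
n_positions, n_chars = 4, len(alphabet)

rng = np.random.default_rng(7)
# Unary costs: negative "detector confidence" for each character at each slot.
unary = rng.random((n_positions, n_chars))
# Pairwise costs: small for plausible bigrams, large otherwise (toy prior).
pairwise = np.full((n_chars, n_chars), 2.0)
for a, b in [("r", "o"), ("o", "a"), ("a", "t"), ("t", "e")]:
    pairwise[alphabet.index(a), alphabet.index(b)] = 0.1

# Viterbi over positions: dp[c] = minimum energy of a labelling ending in c.
dp = unary[0].copy()
back = np.zeros((n_positions, n_chars), dtype=int)
for i in range(1, n_positions):
    totals = dp[:, None] + pairwise + unary[i][None, :]
    back[i] = totals.argmin(axis=0)
    dp = totals.min(axis=0)

# Trace back the minimum-energy character sequence.
seq = [int(dp.argmin())]
for i in range(n_positions - 1, 0, -1):
    seq.append(int(back[i, seq[-1]]))
print("recovered word:", "".join(alphabet[c] for c in reversed(seq)))
```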
Cats and dogs
PARKHI OMKAR MORESHWAR,Andrea Vedaldi,Andrew Zisserman,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2012
@inproceedings{bib_Cats_2012, AUTHOR = {PARKHI OMKAR MORESHWAR, Andrea Vedaldi, Andrew Zisserman, Jawahar C V}, TITLE = {Cats and dogs}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2012}}
We investigate the fine grained object categorization problem of determining the breed of animal from an image. To this end we introduce a new annotated dataset of pets covering 37 different breeds of cats and dogs. The visual problem is very challenging as these animals, particularly cats, are very deformable and there can be quite subtle differences between the breeds. We make a number of contributions: first, we introduce a model to classify a pet breed automatically from an image. The model combines shape, captured by a deformable part model detecting the pet face, and appearance, captured by a bag-of-words model that describes the pet fur. Fitting the model involves automatically segmenting the animal in the image. Second, we compare two classification approaches: a hierarchical one, in which a pet is first assigned to the cat or dog family and then to a breed, and a flat one, in which the breed is obtained directly. We also investigate a number of animal and image orientated spatial layouts. These models are very good: they beat all previously published results on the challenging ASIRRA test (cat vs dog discrimination). When applied to the task of discriminating the 37 different breeds of pets, the models obtain an average accuracy of about 59%, a very encouraging result considering the difficulty of the problem.
Choosing linguistics over vision to describe images
ANKUSH GUPTA,YASHASWI VERMA,Jawahar C V
American Association for Artificial Intelligence, AAAI, 2012
@inproceedings{bib_Choo_2012, AUTHOR = {ANKUSH GUPTA, YASHASWI VERMA, Jawahar C V}, TITLE = {Choosing linguistics over vision to describe images}, BOOKTITLE = {American Association for Artificial Intelligence}. YEAR = {2012}}
In this paper, we address the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions. Previous attempts for this task mostly rely on visual clues and corpus statistics, but do not take much advantage of the semantic information inherent in the available image descriptions. Here, we present a generic method which benefits from all these three sources (i.e. visual clues, corpus statistics and available descriptions) simultaneously, and is capable of constructing novel descriptions. Our approach works on syntactically and linguistically motivated phrases extracted from the human descriptions. Experimental evaluations demonstrate that our formulation mostly generates lucid and semantically correct descriptions, and significantly outperforms the previous methods on automatic evaluation metrics. One of the significant advantages of our approach is that we can generate multiple interesting descriptions for an image. Unlike any previous work, we also test the applicability of our method on a large dataset containing complex images with rich descriptions.
Scene text recognition using higher order language priors
ANAND MISHRA,Karteek Alahari,Jawahar C V
British Machine Vision Conference, BMVC, 2012
@inproceedings{bib_Scen_2012, AUTHOR = {ANAND MISHRA, Karteek Alahari, Jawahar C V}, TITLE = {Scene text recognition using higher order language priors}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2012}}
The problem of recognizing text in images taken in the wild has gained significant attention from the computer vision community in recent years. Contrary to recognition of printed documents, recognizing scene text is a challenging problem. We focus on the problem of recognizing text extracted from natural scene images and the web. Significant attempts have been made to address this problem in the recent past. However, many of these works benefit from the availability of strong context, which naturally limits their applicability. In this work we present a framework that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary. We show experimental results on publicly available datasets. Furthermore, we introduce a large challenging word dataset with five thousand words to evaluate various steps of our method exhaustively. The main contributions of this work are: (1) We present a framework, which incorporates higher order statistical language models to recognize words in an unconstrained manner (i.e. we overcome the need for restricted word lists, and instead use an English dictionary to compute the priors). (2) We achieve significant improvement (more than 20%) in word recognition accuracies without using a restricted word list. (3) We introduce a large word recognition dataset (at least 5 times larger than other public datasets) with character level annotation and benchmark it.
Has my algorithm succeeded? an evaluator for human pose estimators
NATARAJ. J,Andrew Zisserman,Marcin Eichner,Vittorio Ferrari,Jawahar C V
European Conference on Computer Vision, ECCV, 2012
@inproceedings{bib_Has__2012, AUTHOR = {NATARAJ. J, Andrew Zisserman, Marcin Eichner, Vittorio Ferrari, Jawahar C V}, TITLE = {Has my algorithm succeeded? an evaluator for human pose estimators}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2012}}
Most current vision algorithms deliver their output ‘as is’, without indicating whether it is correct or not. In this paper we propose evaluator algorithms that predict if a vision algorithm has succeeded. We illustrate this idea for the case of Human Pose Estimation (HPE). We describe the stages required to learn and test an evaluator, including the use of an annotated ground truth dataset for training and testing the evaluator (and we provide a new dataset for the HPE case), and the development of auxiliary features that have not been used by the (HPE) algorithm, but can be learnt by the evaluator to predict if the output is correct or not. Then an evaluator is built for each of four recently developed HPE algorithms using their publicly available implementations: Eichner and Ferrari [5], Sapp et al. [16], Andriluka et al. [2] and Yang and Ramanan [22]. We demonstrate that in each case the evaluator is able to predict if the algorithm has correctly estimated the pose or not.
Image annotation using metric learning in semantic neighbourhoods
YASHASWI VERMA,Jawahar C V
European Conference on Computer Vision, ECCV, 2012
@inproceedings{bib_Imag_2012, AUTHOR = {YASHASWI VERMA, Jawahar C V}, TITLE = {Image annotation using metric learning in semantic neighbourhoods}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2012}}
Automatic image annotation aims at predicting a set of textual labels for an image that describe its semantics. These are usually taken from an annotation vocabulary of few hundred labels. Because of the large vocabulary, there is a high variance in the number of images corresponding to different labels (“class-imbalance”). Additionally, due to the limitations of manual annotation, a significant number of available images are not annotated with all the relevant labels (“weak-labelling”). These two issues badly affect the performance of most of the existing image annotation models. In this work, we propose 2PKNN, a two-step variant of the classical K-nearest neighbour algorithm, that addresses these two issues in the image annotation task. The first step of 2PKNN uses “image-to-label” similarities, while the second step uses “image-to-image” similarities; thus combining the benefits of both. Since the performance of nearest-neighbour based methods greatly depends on how features are compared, we also propose a metric learning framework over 2PKNN that learns weights for multiple features as well as distances together. This is done in a large margin set-up by generalizing a well-known (single-label) classification metric learning algorithm for multi-label prediction. For scalability, we implement it by alternating between stochastic sub-gradient descent and projection steps. Extensive experiments demonstrate that, though conceptually simple, 2PKNN alone performs comparable to the current state-of-the-art on three challenging image annotation datasets, and shows significant improvements after metric learning.
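A rough sketch of the two-step nearest-neighbour idea above, with made-up features and labels; the metric-learning stage is omitted and a plain Euclidean distance stands in for the learned metric.

```python
# Two-pass KNN label transfer in the spirit of 2PKNN (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
n_train, n_labels, dim = 60, 5, 16
X_train = rng.normal(size=(n_train, dim))
# Each training image carries a random subset of labels (multi-label setting).
Y_train = rng.random((n_train, n_labels)) < 0.3
x_query = rng.normal(size=dim)

K1 = 3  # semantic neighbours kept per label in step one

# Step 1 ("image-to-label"): for every label, keep the K1 closest training
# images carrying that label. This balances the classes around the query.
dists = np.linalg.norm(X_train - x_query, axis=1)
neighbour_ids = set()
for label in range(n_labels):
    carriers = np.flatnonzero(Y_train[:, label])
    if carriers.size:
        nearest = carriers[np.argsort(dists[carriers])[:K1]]
        neighbour_ids.update(nearest.tolist())
neighbour_ids = np.array(sorted(neighbour_ids))

# Step 2 ("image-to-image"): weight the selected neighbours by their distance
# to the query and accumulate label scores.
weights = np.exp(-dists[neighbour_ids])
label_scores = weights @ Y_train[neighbour_ids].astype(float)
print("predicted labels:", np.argsort(-label_scores)[:3])
```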
Towards exhaustive pairwise matching in large image collections
KUMAR SRIJAN,Jawahar C V
European Conference on Computer Vision, ECCV, 2012
@inproceedings{bib_Towa_2012, AUTHOR = {KUMAR SRIJAN, Jawahar C V}, TITLE = {Towards exhaustive pairwise matching in large image collections}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2012}}
Exhaustive pairwise matching on large datasets presents serious practical challenges, and has mostly remained an unexplored domain. We make a step in this direction by demonstrating the feasibility of scalable indexing and fast retrieval of appearance and geometric information in images. We identify unification of database filtering and geometric verification steps as a key step for doing this. We devise a novel inverted indexing scheme, based on Bloom filters, to scalably index high order features extracted from pairs of nearby features. Unlike a conventional inverted index, we can adapt the size of the inverted index to maintain adequate sparsity of the posting lists. This ensures constant time query retrievals. We are thus able to implement an exhaustive pairwise matching scheme, with linear time complexity, using the ‘query each image in turn’ technique. We find the exhaustive nature of our approach to be very useful in mining small clusters of images, as demonstrated by a 73.2% recall on the UKBench dataset. In the Oxford Buildings dataset, we are able to discover all the query buildings. We also discover interesting overlapping images connecting distant images.
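The following toy index illustrates the Bloom-filter idea described above: each database image stores pair features (here, unordered pairs of visual-word ids standing in for high-order features of nearby points) in a small Bloom filter, and a query image is scored by how many of its pairs each filter claims to contain. Feature extraction and the actual posting-list design of the paper are not reproduced.

```python
# Toy Bloom-filter matching over pair features (illustrative sketch).
import hashlib
import numpy as np

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.bits = np.zeros(n_bits, dtype=bool)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def pair_features(word_ids):
    """Unordered pairs of visual-word ids, a stand-in for high-order features."""
    ids = [int(w) for w in word_ids]
    return {(a, b) for a in ids for b in ids if a < b}

rng = np.random.default_rng(2)
database = [rng.integers(0, 50, size=12) for _ in range(10)]
filters = []
for words in database:
    bf = BloomFilter()
    for pair in pair_features(words):
        bf.add(pair)
    filters.append(bf)

query_pairs = pair_features(rng.integers(0, 50, size=12))
scores = [sum(bf.might_contain(p) for p in query_pairs) for bf in filters]
print("best matching database image:", int(np.argmax(scores)))
```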
Learning hierarchical bag of words using naive bayes clustering
SIDDHARTHA CHANDRA,Shailesh Kumar,Jawahar C V
Asian Conference on Computer Vision, ACCV, 2012
@inproceedings{bib_Lear_2012, AUTHOR = {SIDDHARTHA CHANDRA, Shailesh Kumar, Jawahar C V}, TITLE = {Learning hierarchical bag of words using naive bayes clustering}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2012}}
Image analysis tasks such as classification, clustering, detection, and retrieval are only as good as the feature representation of the images they use. Much research in computer vision is focused on finding better or semantically richer image representations. Bag of visual Words (BoW) is a representation that has emerged as an effective one for a variety of computer vision tasks. BoW methods traditionally use low level features. We have devised a strategy to use these low level features to create “higher level” features by making use of the spatial context in images. In this paper, we propose a novel hierarchical feature learning framework that uses a Naive Bayes Clustering algorithm to convert a 2-D symbolic image at one level to a 2-D symbolic image at the next level with richer features. On two popular datasets, Pascal VOC 2007 and Caltech 101, we empirically show that classification accuracy obtained from the hierarchical features computed using our approach is significantly higher than the traditional SIFT based BoW representation of images even though our image representations are more compact.
Image retrieval using eigen queries
RAVAL NISARG JAGDISHBHAI,RASHMI VILAS TONGE,Jawahar C V
Asian Conference on Computer Vision, ACCV, 2012
@inproceedings{bib_Imag_2012, AUTHOR = {RAVAL NISARG JAGDISHBHAI, RASHMI VILAS TONGE, Jawahar C V}, TITLE = {Image retrieval using eigen queries}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2012}}
Category based image search, where the goal is to retrieve images of a specific category from a large database, is becoming increasingly popular. In such a setting, the query is often a classifier. However, the complexity of the classifiers (often SVMs) used for this purpose hinders the use of such a solution in practice. The problem becomes paramount when the database is huge and/or the dimensionality of the feature representation is also very large. In this paper, we address this issue by proposing a novel method which decomposes the query classifier into a set of known eigen queries. We use their precomputed results (or scores) for computing the ranked list corresponding to novel queries. We also propose an approximate algorithm which accesses only a fraction of the data to perform fast retrieval. Experiments on various datasets show that our method reports high accuracy and efficiency. Apart from retrieval, the proposed method can also be used to discover interesting new concepts from the given dataset.
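A small sketch of the eigen-query idea described above: the weight vector of a new query classifier is projected onto a fixed basis of "eigen queries" whose database scores were precomputed, so ranking a novel query only needs a cheap combination of stored score columns. The data is synthetic, and building the basis from the principal directions of past classifiers is one plausible interpretation rather than the paper's exact construction.

```python
# Approximate ranking of a novel linear query classifier via an eigen-query basis.
import numpy as np

rng = np.random.default_rng(5)
dim, n_images, n_eigen = 64, 500, 10

X = rng.normal(size=(n_images, dim))        # database feature vectors
past_queries = rng.normal(size=(100, dim))  # weight vectors of earlier query classifiers

# "Eigen queries": principal directions of the past classifier weights.
_, _, vt = np.linalg.svd(past_queries - past_queries.mean(axis=0), full_matrices=False)
eigen_queries = vt[:n_eigen]                # (n_eigen, dim)
precomputed_scores = X @ eigen_queries.T    # stored offline, (n_images, n_eigen)

# At query time, decompose the novel classifier and combine stored scores.
w_new = rng.normal(size=dim)
coeffs = eigen_queries @ w_new
approx_scores = precomputed_scores @ coeffs
exact_scores = X @ w_new

top_approx = set(np.argsort(-approx_scores)[:10].tolist())
top_exact = set(np.argsort(-exact_scores)[:10].tolist())
print("top-10 overlap between approximate and exact ranking:", len(top_approx & top_exact))
```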
Action recognition using canonical correlation kernels
NAGENDAR. G,B. SAI GANESH,TANDARPALLY MAHESH GOUD,Jawahar C V
Asian Conference on Computer Vision, ACCV, 2012
@inproceedings{bib_Acti_2012, AUTHOR = {NAGENDAR. G, B. SAI GANESH, TANDARPALLY MAHESH GOUD, Jawahar C V}, TITLE = {Action recognition using canonical correlation kernels}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2012}}
In this paper, we propose the canonical correlation kernel (CCK), that seamlessly integrates the advantages of lower dimensional representation of videos with a discriminative classifier like SVM. In the process of defining the kernel, we learn a low-dimensional (linear as well as nonlinear) representation of the video data, which is originally represented as a tensor. We densely compute features at single (or two) frame level, and avoid any explicit tracking. Tensor representation provides the holistic view of the video data, which is the starting point of computing the CCK. Our kernel is defined in terms of the principal angles between the lower dimensional representations of the tensor, and captures the similarity of two videos in an efficient manner. We test our approach on four public data sets and demonstrate consistent superior results over the state of the art methods, including those that use canonical correlations.
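As a minimal illustration of the principal-angle computation that underlies such a kernel, the sketch below compares two videos, each reduced to a frames-by-features matrix, through the cosines of the principal angles between their dominant subspaces. The tensor factorisation, nonlinear variant and SVM training used in the paper are not reproduced.

```python
# Principal-angle ("canonical correlation") similarity between two toy videos.
import numpy as np

def subspace_basis(video_matrix, rank=5):
    """Orthonormal basis of the dominant column subspace of a video matrix."""
    u, _, _ = np.linalg.svd(video_matrix, full_matrices=False)
    return u[:, :rank]

def cck_similarity(video_a, video_b, rank=5):
    """Sum of squared cosines of the principal angles between the two subspaces."""
    qa = subspace_basis(video_a, rank)
    qb = subspace_basis(video_b, rank)
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)  # cos of principal angles
    return float(np.sum(cosines ** 2))

rng = np.random.default_rng(3)
video1 = rng.normal(size=(64, 30))                    # 64-dim frame features, 30 frames
video2 = video1 + 0.1 * rng.normal(size=(64, 30))     # a perturbed copy
video3 = rng.normal(size=(64, 30))                    # an unrelated video
print(cck_similarity(video1, video2), cck_similarity(video1, video3))
```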
Partial Least Squares kernel for computing similarities between video sequences
SIDDHARTHA CHANDRA,Jawahar C V
International conference on Pattern Recognition, ICPR, 2012
@inproceedings{bib_Part_2012, AUTHOR = {SIDDHARTHA CHANDRA, Jawahar C V}, TITLE = {Partial Least Squares kernel for computing similarities between video sequences}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2012}}
Computing similarities between data samples is a fundamental step in most Pattern Recognition (PR) tasks. Better similarity measures lead to more accurate prediction of labels. Computing similarities between video sequences has been a challenging problem for the PR community for long because videos have both spatial and temporal context which are hard to capture. We describe a novel approach that employs Partial Least Squares (PLS) regression to derive a measure of similarity between two tensors (videos). We demonstrate the use of this tensor similarity measure along with SVM classifiers to solve the tasks of hand gesture recognition and action classification. We show that our methods significantly outperform the state of the art approaches on two popular datasets: Cambridge hand gesture dataset and UCF sports action dataset. Our method requires no parameter tuning.
Recognition of printed Devanagari text using BLSTM Neural Network
NAVEEN T S,Jawahar C V
International conference on Pattern Recognition, ICPR, 2012
@inproceedings{bib_Reco_2012, AUTHOR = {NAVEEN T S, Jawahar C V}, TITLE = {Recognition of printed Devanagari text using BLSTM Neural Network}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2012}}
In this paper, we propose a recognition scheme for the Indian script of Devanagari. Recognition accuracy for the Devanagari script is not yet comparable to its Roman counterparts. This is mainly due to the complexity of the script, writing style, etc. Our solution uses a Recurrent Neural Network known as Bidirectional Long Short-Term Memory (BLSTM). Our approach does not require word to character segmentation, which is one of the most common reasons for a high word error rate. We report a reduction of more than 20% in word error rate and over 9% reduction in character error rate while comparing with the best available OCR system.
Logical itemset mining
Shailesh Kumar,V. CHANDRASHEKAR,Jawahar C V
International Conference on Data Mining Workshops, ICDM-W, 2012
@inproceedings{bib_Logi_2012, AUTHOR = {Shailesh Kumar, V. CHANDRASHEKAR, Jawahar C V}, TITLE = {Logical itemset mining}, BOOKTITLE = {International Conference on Data Mining Workshops}. YEAR = {2012}}
Frequent Itemset Mining (FISM) attempts to find large and frequent itemsets in bag-of-items data such as retail market baskets. Such data has two properties that are not naturally addressed by FISM: (i) a market basket might contain items from more than one customer intent (mixture property) and (ii) only a subset of items related to a customer intent are present in most market baskets (projection property). We propose a simple and robust framework called LOGICAL ITEMSET MINING (LISM) that treats each market basket as a mixture-of, projections-of, latent customer intents. LISM attempts to discover logical itemsets from such bag-of-items data. Each logical itemset can be interpreted as a latent customer intent in retail or semantic concept in text tagsets. While the mixture and projection properties are easy to appreciate in retail domain, they are present in almost all types of bag-of-items data. Through experiments on two large datasets, we demonstrate the quality, novelty, and actionability of logical itemsets discovered by the simple, scalable, and aggressively noise-robust LISM framework. We conclude that while FISM discovers a large number of noisy, observed, and frequent itemsets, LISM discovers a small number of high quality, latent logical itemsets.
A non-local MRF model for heritage architectural image completion
DEEPAN GUPTA,VAIDEHI CHHAJER,ANAND MISHRA,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2012
@inproceedings{bib_A_no_2012, AUTHOR = {DEEPAN GUPTA, VAIDEHI CHHAJER, ANAND MISHRA, Jawahar C V}, TITLE = {A non-local MRF model for heritage architectural image completion}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2012}}
MRF models have shown state-of-the-art performance for many computer vision tasks. In this work, we propose a non-local MRF model for image completion problem. The goal of image completion is to fill user specified “target” region with patches of “source” regions in a way that is visually plausible to an observer. We represent the patches in the target region of the image as random variables in an MRF, and introduce a novel energy function on these variables. Each variable takes a label from a label set which is a collection of patches of the source region. The quality of the image completion is determined by the value of the energy function. The non-locality in the MRF is achieved through long range pairwise potentials. These long range pairwise potentials are defined to capture the inherent repeating patterns present in heritage architectural images. We minimize this energy function using Belief Propagation to obtain globally optimal image completion. We have tested our method on a wide variety of images and shown superior performance over previously published results for this task.
Sparse discriminative Fisher vectors in visual classification
VINAY GARG,SIDDHARTHA CHANDRA,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2012
@inproceedings{bib_Spar_2012, AUTHOR = {VINAY GARG, SIDDHARTHA CHANDRA, Jawahar C V}, TITLE = {Sparse discriminative Fisher vectors in visual classification}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2012}}
Constructing global image representations from local feature descriptors is a common step in most visual classification tasks. Traditionally, the Bag of Features (BoF) representations involving hard vector quantization have been used ubiquitously for such tasks. Recent works have demonstrated superior performance of soft assignments over hard assignments. Fisher vector representations have been shown to outperform other global representations on most benchmark datasets. Fisher vectors (i) use soft assignments, and (ii) reduce information loss due to quantization by capturing the deviations from the mean. However, the Fisher vector representations are huge and the representation size increases linearly with the vocabulary size. Recent findings report that the classification performance of Fisher vectors is proportional to the vocabulary size. Computational and storage requirements, however, discourage the use of arbitrarily large vocabularies. Also, Fisher vectors are not inherently discriminative. In this paper, we devise a novel strategy to compute sparse Fisher representations. This allows us to increase the vocabulary size with little computation and storage overhead and still attain the performance of a larger vocabulary. Further, we describe an approach to encode class-discriminative information in the Fisher vectors. We evaluate our method on four popular datasets. Empirical results show that our representations consistently outperform the traditional Fisher vector representations and are comparable to the state-of-the-art approaches.
Automatic localization and correction of line segmentation errors
ANAND MISHRA,NAVEEN T S,VIRESH RANJAN,Jawahar C V
Document Analysis and Recognition, DAR, 2012
@inproceedings{bib_Auto_2012, AUTHOR = {ANAND MISHRA, NAVEEN T S, VIRESH RANJAN, Jawahar C V}, TITLE = {Automatic localization and correction of line segmentation errors}, BOOKTITLE = {Document Analysis and Recognition}. YEAR = {2012}}
Text line segmentation is a basic step in any OCR system. Its failure deteriorates the performance of OCR engines. This is especially true for the Indian languages due to the nature of the scripts. Many segmentation algorithms have been proposed in the literature. Often these algorithms fail to adapt dynamically to a given page and thus tend to yield poor segmentation for some specific regions or some specific pages. In this work we design a text line segmentation post-processor which automatically localizes and corrects segmentation errors. The proposed segmentation post-processor, which works in a “learning by examples” framework, is not only independent of the segmentation algorithm but also robust to the diversity of scanned pages. We show over 5% improvement in text line segmentation on a large dataset of scanned pages in multiple Indian languages.
Are Buildings Only Instances?
ABHINAV GOEL,MAYANK JUNEJA,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2012
@inproceedings{bib_Are__2012, AUTHOR = {ABHINAV GOEL, MAYANK JUNEJA, Jawahar C V}, TITLE = {Are Buildings Only Instances?}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2012}}
Instance retrieval has emerged as a promising research area with buildings as the popular test subject. Given a query image or region, the objective is to find images in the database containing the same object or scene. There has been a recent surge of efforts in finding instances of the same building in challenging datasets such as the Oxford 5k dataset [19], the Oxford 100k dataset and the Paris dataset [20]. We ascend one level higher and pose the question: Are buildings only instances? Buildings located in the same geographical region or constructed in a certain time period in history often follow a specific method of construction. These architectural styles are characterized by certain features which distinguish them from other styles of architecture. We explore, beyond the idea of buildings as instances, the possibility that buildings can be categorized based on architectural style. Certain characteristic features distinguish an architectural style from others. We perform experiments to evaluate how characteristic information obtained from low-level feature configurations can help in classification of buildings into architectural style categories. Encouraged by our observations, we mine characteristic features with semantic utility for different architectural styles from our dataset of European monuments. These mined features are of various scales, and provide an insight into what makes a particular architectural style category distinct. The utility of the mined characteristics is verified from Wikipedia.
Content level access to digital library of india pages
PRAVEEN KRISHNAN,RAVI SHEKHAR,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2012
@inproceedings{bib_Cont_2012, AUTHOR = {PRAVEEN KRISHNAN, RAVI SHEKHAR, Jawahar C V}, TITLE = {Content level access to digital library of india pages}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2012}}
In this paper, we propose a framework for content level access to the scanned pages of the Digital Library of India (DLI). The current Optical Character Recognition (OCR) systems are not robust and reliable enough for generating accurate text from DLI pages. We propose a search scheme which fuses noisy OCR output and holistic visual features for content level access to the DLI pages. Visual content is captured using a Bag of Visual Words (BoVW) approach. We show that our fusion scheme improves over the individual methods in terms of mean Average Precision (mAP) and mean precision at 10 (mPrec@10). We exploit the fact that OCR has a high precision while BoVW has a high recall. We use a modified edit distance to improve the order of results ranked by BoVW. Experiments are carried out on large datasets of DLI pages in Hindi and Telugu languages. We validate our method on more than 10,000 pages and 4 million words, and report a mAP of around 0.8 and mPrec@10 of more than 0.9. We show improvements over BoVW by introducing query expansion. We also demonstrate a textual query interface for the search system.
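A bare-bones sketch of the fusion idea above: a visual (BoVW) index returns a high-recall candidate list, and a lexical score against the noisy OCR strings re-orders it. The page names, scores and weighting below are invented, and a plain Levenshtein distance stands in for the paper's modified edit distance.

```python
# Re-ranking visual retrieval candidates with an edit distance to noisy OCR text.
def edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

# Candidate pages from the visual index, with similarity scores and the noisy
# OCR transcription of the matched word (all hypothetical).
candidates = [("page_12", 0.82, "restoraton"),
              ("page_07", 0.80, "restoration"),
              ("page_31", 0.78, "reformation")]
query = "restoration"

def fused_score(visual_score, ocr_text, alpha=0.5):
    lexical = 1.0 - edit_distance(query, ocr_text) / max(len(query), len(ocr_text))
    return alpha * visual_score + (1 - alpha) * lexical

reranked = sorted(candidates, key=lambda c: fused_score(c[1], c[2]), reverse=True)
print([name for name, _, _ in reranked])
```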
Neti Neti: in search of deity
YASHASWI VERMA,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2012
@inproceedings{bib_Neti_2012, AUTHOR = {YASHASWI VERMA, Jawahar C V}, TITLE = {Neti Neti: in search of deity}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2012}}
A wide category of objects and scenes can be effectively searched and classified using the modern descriptors and classifiers. With the performance on many popular categories becoming satisfactory, we explore the issues associated with much harder recognition problems. We address the problem of searching specific images in Indian stone-carvings and sculptures in an unsupervised set-up. For this, we introduce a new dataset of 524 images containing sculptures and carvings of eight different Indian deities and three other subjects popular in the Indian scenario. We perform a thorough analysis to investigate various challenges associated with this task. A new image representation is proposed using a sequence of discriminative patches mined in an unsupervised manner. For each image, these patches are identified based on their ability to distinguish the given image from the image most dissimilar to it. Then a rejection-based re-ranking scheme is formulated based on both similarity as well as dissimilarity between two images. This new scheme is experimentally compared with two baselines using state-of-the-art descriptors on the proposed dataset. Empirical evaluations demonstrate that our proposed method of image representation and rejection cascade improves the retrieval performance on this hard problem as compared to the baseline descriptors.
Heritage app: annotating images on mobile phones
Jayaguru Panda,Shashank Sharma,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2012
@inproceedings{bib_Heri_2012, AUTHOR = {Jayaguru Panda, Shashank Sharma, Jawahar C V}, TITLE = {Heritage app: annotating images on mobile phones}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2012}}
In this paper, we demonstrate a computer vision application on mobile phones. One can take a picture at a heritage site/monument and obtain associated annotations on a mid-end mobile phone instantly. This does not require any communication of images or features with a remote server, and all the necessary computations take place on the phone itself. We demonstrate the app on two Indian heritage sites: Golkonda Fort and Hampi Temples. Underlying our application, we have a Bag of visual Words (BoW) image retrieval system, and an annotated database of images. In the process of developing this mobile app, we extend the performance, scope and applicability of computer vision techniques: (i) we do a BoW-based image retrieval on mobile phones from a database of 10K images within 50 MB of storage and 10 MB of RAM; (ii) we introduce a vocabulary pruning method for reducing the vocabulary size; (iii) we design a simple method of database pruning that helps in reducing the size of the inverted index by removing semantically similar images. In (ii) and (iii), we demonstrate how memory (RAM) and computational speed can be optimized without any loss in performance.
Motion segmentation of multiple objects from a freely moving monocular camera
RAHUL KUMAR NAMDEV,ABHIJIT KUNDU,K Madhava Krishna,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2012
@inproceedings{bib_Moti_2012, AUTHOR = {RAHUL KUMAR NAMDEV, ABHIJIT KUNDU, K Madhava Krishna, Jawahar C V}, TITLE = {Motion segmentation of multiple objects from a freely moving monocular camera}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2012}}
Motion segmentation is an inevitable component for mobile robotic systems such as the case with robots performing SLAM and collision avoidance in dynamic worlds. This paper proposes an incremental motion segmentation system that efficiently segments multiple moving objects and simultaneously builds the map of the environment using visual SLAM modules. Multiple cues based on optical flow and two-view geometry are integrated to achieve this segmentation. A dense optical flow algorithm is used for dense tracking of features. Motion potentials based on geometry are computed for each of these dense tracks. These geometric potentials along with the optical flow potentials are used to form a graph-like structure. A graph based segmentation algorithm then clusters together nodes of similar potentials to form the eventual motion segments. Experimental results of high quality segmentation on different publicly available datasets demonstrate the effectiveness of our method.
AXES at TRECVid 2011
Kevin McGuinness,Robin Aly,Shu Chen,Mathieu Frappier,Martijn Kleppe,Hyowon Lee,Roeland Ordelman,Relja Arandjelović,Jawahar C V
Text Retrieval Conference Video Retrieval Evaluation, TRECVID, 2011
@inproceedings{bib_AXES_2011, AUTHOR = {Kevin McGuinness, Robin Aly, Shu Chen, Mathieu Frappier, Martijn Kleppe, Hyowon Lee, Roeland Ordelman, Relja Arandjelović, Jawahar C V}, TITLE = {AXES at TRECVid 2011}, BOOKTITLE = {Text Retrieval Conference Video Retrieval Evaluation}. YEAR = {2011}}
The AXES project participated in the interactive known-item search task (KIS) and the interactive instance search task (INS) for TRECVid 2011. We used the same system architecture and a nearly identical user interface for both the KIS and INS tasks. Both systems made use of text search on ASR, visual concept detectors, and visual similarity search. The user experiments were carried out with media professionals and media students at the Netherlands Institute for Sound and Vision, with media professionals performing the KIS task and media students participating in the INS task. This paper describes the results and findings of our experiments.
Video Scene Segmentation with a Semantic Similarity.
NIRAJ KUMAR,PIYUSH RAI,LAKSHMI CHANDRIKA PULLA,Jawahar C V
Indian International Conference on Artificial Intelligence, IICAI, 2011
@inproceedings{bib_Vide_2011, AUTHOR = {NIRAJ KUMAR, PIYUSH RAI, LAKSHMI CHANDRIKA PULLA, Jawahar C V}, TITLE = {Video Scene Segmentation with a Semantic Similarity.}, BOOKTITLE = {Indian International Conference on Artificial Intelligence}. YEAR = {2011}}
Video scene segmentation is an important problem in computer vision as it helps in efficient storage, indexing and retrieval of videos. A significant amount of work has been done in this area in the form of shot segmentation techniques, and they often give reasonably good results. However, shots are not of much importance for the semantic analysis of videos. For semantic and meaningful analysis of videos (e.g. movies), the scene is more important since it captures one complete unit of action. People have tried different approaches to scene segmentation, but almost all of them use color, texture, etc. to compute scene boundaries. In this paper, we propose a new algorithm based on a Bag of Words (BoW) representation which computes semantic similarity between shots using a Bipartite Graph Model (BGM). Based on semantic similarity, we detect the scene boundaries in the movie. We have tested our algorithm on multiple Hollywood movies, and the proposed method is found to give good results.
On multifont character classification in Telugu
KOMATIREDDY VENKAT RASAGNA REDDY,K. J. JINESH,Jawahar C V
International Conference on Information systems for Indian Languages, ICISIL, 2011
@inproceedings{bib_On_m_2011, AUTHOR = {KOMATIREDDY VENKAT RASAGNA REDDY, K. J. JINESH, Jawahar C V}, TITLE = {On multifont character classification in Telugu}, BOOKTITLE = {International Conference on Information systems for Indian Languages}. YEAR = {2011}}
A major requirement in the design of robust OCRs is the invariance of the feature extraction scheme with the popular fonts used in print. Many statistical and structural features have been tried for character classification in the past. In this paper, motivated by recent successes in the object category recognition literature, we use a spatial extension of the histogram of oriented gradients (HOG) for character classification. Our experiments are conducted on 1453950 Telugu character samples in 359 classes and 15 fonts. On this data set, we obtain an accuracy of 96-98% with an SVM classifier.
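The pipeline the abstract describes, oriented-gradient histograms fed to an SVM, can be sketched compactly. The example below uses two synthetic "glyph" classes and a simplified cell-wise orientation histogram in place of a full HOG implementation with block normalization; it is illustrative only and not the paper's 359-class, 15-font setup.

```python
# Simplified HOG-like features plus a linear SVM on toy glyph images.
import numpy as np
from sklearn.svm import LinearSVC

def simple_hog(img, cell=8, bins=9):
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    feats = []
    h, w = img.shape
    for y in range(0, h, cell):
        for x in range(0, w, cell):
            m = mag[y:y + cell, x:x + cell].ravel()
            a = ang[y:y + cell, x:x + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-12))
    return np.concatenate(feats)

rng = np.random.default_rng(6)

def make_glyph(cls):
    """Two toy 'character classes': a vertical and a horizontal stroke."""
    img = rng.normal(0, 0.05, size=(32, 32))
    if cls == 0:
        img[4:28, 14:18] = 1.0
    else:
        img[14:18, 4:28] = 1.0
    return img

X = np.array([simple_hog(make_glyph(c % 2)) for c in range(200)])
y = np.array([c % 2 for c in range(200)])
clf = LinearSVC(C=1.0).fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```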
Experiences of integration and performance testing of multilingual OCR for printed Indian scripts
Deepak Arya,Tushar Patnaik ,Santanu Chaudhury ,Jawahar C V,B.B.Chaudhuri ,A.G.Ramakrishna ,Chakravorty Bhagvati ,G. S. Lehal
Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data, MOCR_AND, 2011
@inproceedings{bib_Expe_2011, AUTHOR = {Deepak Arya, Tushar Patnaik , Santanu Chaudhury , Jawahar C V, B.B.Chaudhuri , A.G.Ramakrishna , Chakravorty Bhagvati , G. S. Lehal }, TITLE = {Experiences of integration and performance testing of multilingual OCR for printed Indian scripts}, BOOKTITLE = {Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data}. YEAR = {2011}}
This paper presents an integration and testing scheme for managing a large multilingual OCR project. The project is an attempt to implement an integrated platform for OCR of different Indian languages. Software engineering, workflow management and testing processes are discussed in this paper. The OCR has now been experimentally deployed for some specific applications and is currently being enhanced for handling the space and time constraints, achieving higher recognition accuracies and adding new functionalities.
Automatic localization of page segmentation errors
Dheeraj Mundhra,ANAND MISHRA,Jawahar C V
Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data, MOCR_AND, 2011
@inproceedings{bib_Auto_2011, AUTHOR = {Dheeraj Mundhra, ANAND MISHRA, Jawahar C V}, TITLE = {Automatic localization of page segmentation errors}, BOOKTITLE = {Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data}. YEAR = {2011}}
Page segmentation is a basic step in any character recognition system. Its failure is one of the major causes of deteriorating overall accuracy in current Indian language OCR engines. Many segmentation algorithms have been proposed in the literature. Often these algorithms fail to adapt dynamically to a given page and thus tend to yield poor segmentation for some specific regions or some specific pages. Given the ground truth, locating page segmentation errors is a straightforward problem and merely useful for comparing segmentation algorithms. In this work, we locate segmentation errors without directly using the ground truth. Such automatic localization of page segmentation errors can be considered a major step towards improving page segmentation. In this work, we focus on localizing line level segmentation errors. We perform experiments on more than 18000 scanned pages of 109 books belonging to four prominent south Indian languages.
An MRF model for binarization of natural scene text
ANAND MISHRA,Karteek Alahari,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2011
@inproceedings{bib_An_M_2011, AUTHOR = {ANAND MISHRA, Karteek Alahari, Jawahar C V}, TITLE = {An MRF model for binarization of natural scene text}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2011}}
Inspired by the success of MRF models for solving object segmentation problems, we formulate the binarization problem in this framework. We represent the pixels in a document image as random variables in an MRF, and introduce a new energy (or cost) function on these variables. Each variable takes a foreground or background label, and the quality of the binarization (or labelling) is determined by the value of the energy function. We minimize the energy function, i.e. find the optimal binarization, using an iterative graph cut scheme. Our model is robust to variations in foreground and background colours as we use a Gaussian Mixture Model in the energy function. In addition, our algorithm is efficient to compute, and adapts to a variety of document images. We show results on word images from the challenging ICDAR 2003 dataset, and compare our performance with previously reported methods. Our approach shows significant improvement in pixel level accuracy as well as OCR accuracy.
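To illustrate the colour-model part of the formulation above, the sketch below models foreground and background with Gaussian mixtures, takes negative log-likelihoods as per-pixel unary costs, and runs a few ICM sweeps with a Potts-style smoothness term. The synthetic image, the ICM optimizer (in place of the paper's iterative graph cuts) and all parameters are illustrative assumptions.

```python
# GMM colour models + a crude iterative labelling, standing in for graph-cut binarization.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
h, w = 32, 32
# Synthetic "word image": dark strokes on a bright, noisy background.
image = rng.normal(loc=0.8, scale=0.1, size=(h, w, 3))
image[10:22, 4:28] = rng.normal(loc=0.2, scale=0.1, size=(12, 24, 3))
pixels = image.reshape(-1, 3)

# Initial labelling from a crude intensity threshold (1 = foreground/text).
labels = (pixels.mean(axis=1) < 0.5).astype(int)

for _ in range(3):  # alternate between colour-model fitting and relabelling
    fg = GaussianMixture(2, random_state=0).fit(pixels[labels == 1])
    bg = GaussianMixture(2, random_state=0).fit(pixels[labels == 0])
    unary = np.stack([-bg.score_samples(pixels), -fg.score_samples(pixels)], axis=1)

    # One ICM sweep: each pixel takes the label with the lowest unary cost plus
    # a Potts penalty for disagreeing with its 4-neighbours.
    lab_img = labels.reshape(h, w)
    new = lab_img.copy()
    lam = 2.0
    for y in range(h):
        for x in range(w):
            nbrs = [lab_img[yy, xx] for yy, xx in
                    ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= yy < h and 0 <= xx < w]
            costs = [unary[y * w + x, l] + lam * sum(n != l for n in nbrs)
                     for l in (0, 1)]
            new[y, x] = int(np.argmin(costs))
    labels = new.reshape(-1)

print("foreground pixels:", int(labels.sum()))
```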
Character n-gram spotting in document images
SUDHA PRAVEEN MAREMANDA,Pramod Sankar Kompalli,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2011
@inproceedings{bib_Char_2011, AUTHOR = {SUDHA PRAVEEN MAREMANDA, Pramod Sankar Kompalli, Jawahar C V}, TITLE = {Character n-gram spotting in document images}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2011}}
In this paper, we present a novel approach to search and retrieve from document image collections, without explicit recognition. Existing recognition-free approaches such as word-spotting cannot scale to arbitrarily large vocabulary and document image collections. In this paper we put forth a framework that overcomes three issues of word-spotting: i) retrieving word images not labeled during indexing, ii) allowing for query and retrieval of morphological variations of words and iii) scaling the retrieval to large collections. We propose a character n-gram spotting framework, where word images are considered as a bag of visual n-grams. The character n-grams are represented in a visual-feature space and indexed for quick retrieval. In the retrieval phase, the query word is expanded to its constituent n-grams, which are used to query the previously built index. A ranking mechanism is proposed that combines the retrieval results from the multiple lists corresponding to each n-gram. The approach is demonstrated on a sizeable collection of English and Malayalam books. With a mean AP of 0.64, the performance of the retrieval system was found to be very promising.
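A highly simplified, text-only sketch of the spotting idea above: the "visual" index is replaced by an exact index from character n-grams to the word images containing them, and retrieval fuses the per-n-gram lists by summing scores. The real system indexes n-gram images in a feature space; the collection below is invented.

```python
# Character n-gram indexing and rank fusion (text-only illustration).
from collections import defaultdict

def char_ngrams(word, n_values=(2, 3)):
    grams = set()
    for n in n_values:
        grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

# Hypothetical labels of word images in the indexed collection.
collection = ["segmentation", "recognition", "representation", "retrieval"]

index = defaultdict(set)
for doc_id, word in enumerate(collection):
    for gram in char_ngrams(word):
        index[gram].add(doc_id)

def search(query):
    scores = defaultdict(int)
    for gram in char_ngrams(query):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1          # fuse the per-n-gram result lists
    return sorted(scores, key=scores.get, reverse=True)

print([collection[i] for i in search("segment")])
```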
BLSTM neural network based word retrieval for Hindi documents
RAMAN JAIN,Volkmar Frinken,Jawahar C V, R. Manmatha
International Conference on Document Analysis and Recognition, ICDAR, 2011
@inproceedings{bib_BLST_2011, AUTHOR = {RAMAN JAIN, Volkmar Frinken, Jawahar C V, R. Manmatha}, TITLE = {BLSTM neural network based word retrieval for Hindi documents}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2011}}
Retrieval from Hindi document image collections is a challenging task. This is partly due to the complexity of the script, which has more than 800 unique ligatures. In addition, segmentation and recognition of individual characters often becomes difficult due to the writing style as well as degradations in the print. For these reasons, robust OCRs are non-existent for Hindi. Therefore, Hindi document repositories are not amenable to indexing and retrieval. In this paper, we propose a scheme for retrieving relevant Hindi documents in response to a query word. This approach uses BLSTM neural networks. Designed to take contextual information into account, these networks can handle word images that cannot be robustly segmented into individual characters. By zoning the Hindi words, we simplify the problem and obtain high retrieval rates. Our simplification suits the retrieval problem, while it does not apply to recognition. Our scalable retrieval scheme avoids explicit recognition of characters. An experimental evaluation on a dataset of word images gathered from two complete books demonstrates good accuracy even in the presence of printing variations and degradations. The performance is compared with baseline methods.
The truth about cats and dogs
PARKHI OMKAR MORESHWAR,Andrea Vedaldi,Jawahar C V,Andrew Zisserman
International Conference on Computer Vision, ICCV, 2011
@inproceedings{bib_The__2011, AUTHOR = {PARKHI OMKAR MORESHWAR, Andrea Vedaldi, Jawahar C V, Andrew Zisserman}, TITLE = {The truth about cats and dogs}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2011}}
Template-based object detectors such as the deformable parts model of Felzenszwalb et al. [11] achieve state-of-the-art performance for a variety of object categories, but are still outperformed by simpler bag-of-words models for highly flexible objects such as cats and dogs. In these cases we propose to use the template-based model to detect a distinctive part for the class, followed by detecting the rest of the object via segmentation on image specific information learnt from that part. This approach is motivated by two observations: (i) many object classes contain distinctive parts that can be detected very reliably by template-based detectors, whilst the entire object cannot; (ii) many classes (e.g. animals) have fairly homogeneous coloring and texture that can be used to segment the object once a sample is provided in an image. We show quantitatively that our method substantially outperforms whole-body template-based detectors for these highly deformable object categories, and indeed achieves accuracy comparable to the state-of-the-art on the PASCAL VOC competition, which includes other models such as bag-of-words.
Interpolation based tracking for fast object detection in videos
RAHUL JAIN,Pramod Sankar Kompalli,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2011
@inproceedings{bib_Inte_2011, AUTHOR = {RAHUL JAIN, Pramod Sankar Kompalli, Jawahar C V}, TITLE = {Interpolation based tracking for fast object detection in videos}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2011}}
Detecting objects in images and videos is very challenging due to i) large intra-class variety and ii) pose/scale variations. It is hard to build strong recognition engines for generic object categories, while applying them to large video collections is computationally infeasible (due to the explosion of frames to test). In this paper, we present a detection-by-interpolation framework, where object-tracking is achieved by interpolating between candidate object detections in a subset of the video frames. Given the location of an object in two frames of a video-shot, our algorithm tries to identify the locations of the object in the intermediate frames. We evaluate two tracking solutions based on greedy and dynamic programming approaches, and observe that a hybrid method gives a significant performance boost as well as a speedup in detection. On 6 hours of HD quality video, we were able to cut down the detection time from 10000 hours to 1500 hours, while simultaneously improving the detection accuracy from 54% (of [1]) to 68%. As a result of this work, we build a dataset of 100,000 car images, spanning a wide range of viewpoints, scales and makes; about 100 times larger than existing collections.
Segmentation of degraded malayalam words: methods and evaluation
Devendra Sachan, Shrey Dutta,NAVEEN T S,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2011
@inproceedings{bib_Segm_2011, AUTHOR = {Devendra Sachan, Shrey Dutta, NAVEEN T S, Jawahar C V}, TITLE = {Segmentation of degraded malayalam words: methods and evaluation}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2011}}
In most Optical Character Recognition (OCR) software, a substantial percentage of errors is caused by the incorrect segmentation of degraded words. This is especially true for recognizing old books, newspapers and historical manuscripts. In this paper, we propose multiple segmentation methods which address the problem of cuts and merges in degraded words. We have created an annotated dataset of 1034 word images with pixel level ground truth for quantitative evaluation of the methods. We compare the methods with a baseline implementation based on connected component analysis. We report substantial improvement in accuracy both at character and at word level.
Bag of visual words: A soft clustering based exposition
VINAY GARG,V SREEKANTH,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2011
@inproceedings{bib_Bag__2011, AUTHOR = {VINAY GARG, V SREEKANTH, Jawahar C V}, TITLE = {Bag of visual words: A soft clustering based exposition}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2011}}
In this paper, we explain the bag of words representation from a soft computing perspective. The traditional bag of words representation describes an image as a bag of discrete visual codewords, where a histogram of the number of occurrences of these codewords is used for image classification tasks. The drawback of this approach is that every visual feature in an image is assigned to a single codeword, which leads to the loss of information regarding the other relevant codewords that can represent the same feature. In this paper, we show how fuzzy and possibilistic codeword assignment improves the classification performance on the Scene-15 dataset.
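The contrast between hard and soft codeword assignment can be seen in a short sketch. The Gaussian weighting below is one common fuzzy-membership choice and is an assumption made here for illustration; the paper studies fuzzy and possibilistic memberships more generally.

```python
import numpy as np

def hard_bow(descriptors, codebook):
    """Standard BoW: each descriptor votes only for its single nearest codeword."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(axis=1), minlength=len(codebook))
    return hist / hist.sum()

def soft_bow(descriptors, codebook, sigma=1.0):
    """Soft assignment: each descriptor spreads its mass over all codewords,
    weighted by a Gaussian of its distance (one common fuzzy-membership choice)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True) + 1e-12   # memberships sum to 1 per descriptor
    hist = w.sum(axis=0)
    return hist / hist.sum()
```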
Whose Album is this?
ABHINAV GOEL,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2011
@inproceedings{bib_Whos_2011, AUTHOR = {ABHINAV GOEL, Jawahar C V}, TITLE = {Whose Album is this?}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2011}}
We present a method to identify the owner of a photo album taken off a social networking site. We consider this as a problem of prominent person mining. We introduce a new notion of prominent persons, and propose a greedy solution based on an eigenface representation. We mine prominent persons in a subset of dimensions in the eigenface space. We present excellent results on multiple datasets downloaded from the Internet.
Privacy preserving outlier detection using locality sensitive hashing
RAVAL NISARG JAGDISHBHAI,P. MADHUCHAND RUSHI,PIYUSH BANSAL,Srinathan Kannan,Jawahar C V
International Conference on Data Mining Workshops, ICDM-W, 2011
@inproceedings{bib_Priv_2011, AUTHOR = {RAVAL NISARG JAGDISHBHAI, P. MADHUCHAND RUSHI, PIYUSH BANSAL, Srinathan Kannan, Jawahar C V}, TITLE = {Privacy preserving outlier detection using locality sensitive hashing}, BOOKTITLE = {International Conference on Data Mining Workshops}. YEAR = {2011}}
In this paper, we give approximate algorithms for privacy preserving distance based outlier detection for both horizontal and vertical distributions, which scale well to large datasets of high dimensionality in comparison with the existing techniques. In order to achieve efficient private algorithms, we introduce an approximate outlier detection scheme for the centralized setting which is based on the idea of Locality Sensitive Hashing. We also give theoretical and empirical bounds on the level of approximation of the proposed algorithms.
LSH based outlier detection and its application in distributed setting
P. MADHUCHAND RUSHI,RAVAL NISARG JAGDISHBHAI,PIYUSH BANSAL,Srinathan Kannan,Jawahar C V
International Conference on Information and Knowledge Management, CIKM, 2011
@inproceedings{bib_LSH__2011, AUTHOR = {P. MADHUCHAND RUSHI, RAVAL NISARG JAGDISHBHAI, PIYUSH BANSAL, Srinathan Kannan, Jawahar C V}, TITLE = {LSH based outlier detection and its application in distributed setting}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2011}}
In this paper, we give an approximate algorithm for distance based outlier detection using Locality Sensitive Hashing (LSH) technique. We propose an algorithm for the centralized case wherein the entire dataset is locally available for processing. However, in case of very large datasets collected from various input sources, often the data is distributed across the network. Accordingly, we show that our algorithm can be effectively extended to a constant round protocol with low communication costs, in a distributed setting with horizontal partitioning.
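A toy sketch of the centralized idea: hash points with random-projection LSH into buckets across several tables, and score each point by the average occupancy of the buckets it falls into, so that points landing in sparse buckets (few near neighbours) surface as likely outliers. The hashing parameters and the occupancy-based score are illustrative assumptions rather than the papers' exact algorithms, and the distributed, privacy-preserving extensions are not shown.

```python
import numpy as np

def lsh_outlier_scores(points, num_tables=8, num_bits=12, bucket_width=1.0, seed=0):
    """Toy LSH-based outlier score: average occupancy of the buckets a point falls
    into across several random-projection hash tables. Lower occupancy suggests the
    point has few near neighbours, i.e. is a likely distance-based outlier."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    occupancy = np.zeros(n)
    for _ in range(num_tables):
        proj = rng.normal(size=(d, num_bits))
        offsets = rng.uniform(0, bucket_width, size=num_bits)
        keys = np.floor((points @ proj + offsets) / bucket_width).astype(int)
        key_tuples = [tuple(k) for k in keys]
        table = {}
        for k in key_tuples:
            table[k] = table.get(k, 0) + 1
        occupancy += np.array([table[k] for k in key_tuples])
    return occupancy / num_tables    # smaller value => more isolated point
```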
Large scale visual localization in urban environments
SUPREETH ACHAR,Jawahar C V,K Madhava Krishna
International Conference on Robotics and Automation, ICRA, 2011
@inproceedings{bib_Larg_2011, AUTHOR = {SUPREETH ACHAR, Jawahar C V, K Madhava Krishna}, TITLE = {Large scale visual localization in urban environments}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2011}}
This paper introduces a vision based localization method for large scale urban environments. The method is based upon Bag-of-Words image retrieval techniques and handles problems that arise in urban environments due to repetitive scene structure and the presence of dynamic objects like vehicles. The localization system was verified experimentally along a 5 km long path in an urban environment.
Realtime multibody visual SLAM with a smoothly moving monocular camera
ABHIJIT KUNDU,K Madhava Krishna,Jawahar C V
International Conference on Computer Vision, ICCV, 2011
@inproceedings{bib_Real_2011, AUTHOR = {ABHIJIT KUNDU, K Madhava Krishna, Jawahar C V}, TITLE = {Realtime multibody visual SLAM with a smoothly moving monocular camera}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2011}}
This paper presents a realtime, incremental multibody visual SLAM system that allows choosing between full 3D reconstruction or simply tracking of the moving objects. Motion reconstruction of dynamic points or objects from a monocular camera is considered very hard due to well known problems of observability. We attempt to solve the problem with a Bearing only Tracking (BOT) and by integrating multiple cues to avoid observability issues. The BOT is accomplished through a particle filter, and by integrating multiple cues from the reconstruction pipeline. With the help of these cues, many real world scenarios which are considered unobservable with a monocular camera are solved to reasonable accuracy. This enables building of a unified dynamic 3D map of scenes involving multiple moving objects. Tracking and reconstruction are preceded by motion segmentation and detection, which makes use of efficient geometric constraints to avoid difficult degenerate motions, where objects move in the epipolar plane. Results reported on multiple challenging real world image sequences verify the efficacy of the proposed framework.
Design and Evaluation of Omnifont Tamil OCR
Tushar Patnaik,Shalu Gupta,Jawahar C V,Santanu Choudhury,A G Ramakrishnan
@inproceedings{bib_Desi_2010, AUTHOR = {Tushar Patnaik, Shalu Gupta, Jawahar C V, Santanu Choudhury, A G Ramakrishnan}, TITLE = {Design and Evaluation of Omnifont Tamil OCR}, BOOKTITLE = {Tamil Internet}. YEAR = {2010}}
IISc Bangalore has developed a recognition engine for Tamil printed text, which has been tested on 1000 document images of pages scanned from books printed between 1950 and 2000. IIIT Hyderabad has developed an XML-based annotated database for storing the 5000 images of scanned pages and the corresponding typed text in Unicode. CDAC, Noida has developed an efficient evaluation tool, which compares the OCR output text to the reference typed text (ground truth) and flashes the substitution, deletion and insertion errors in different colours on the screen, so that the design team can quickly identify the issues with the OCR and take corrective steps for improving the performance. IIT Delhi has proposed and developed a novel scheme for segmenting only the text regions from document images containing pictures. The OCR uses the Karhunen-Loève transform (KLT) for features and a support vector machine (SVM) classifier with an RBF kernel in a discriminative directed acyclic graph (DDAG) configuration. It assumes an uncompressed input image of the document page, scanned at a minimum of 300 dpi with 256 gray levels (not binary or two-level). The Tamil OCR currently gives over 94% recognition accuracy at the Unicode level, evaluated on over 1000 printed pages, some of them also containing old Tamil letters.
Image-based walkthroughs from incremental and partial scene reconstructions.
KUMAR SRIJAN,SYED AHSAN ISHTIAQUE,Sudipta N. Sinha,Jawahar C V
British Machine Vision Conference, BMVC, 2010
@inproceedings{bib_Imag_2010, AUTHOR = {KUMAR SRIJAN, SYED AHSAN ISHTIAQUE, Sudipta N. Sinha, Jawahar C V}, TITLE = {Image-based walkthroughs from incremental and partial scene reconstructions.}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2010}}
We present a scalable and incremental approach for creating interactive image-based walkthroughs from a dynamically growing collection of photographs of a scene. Prior approaches, such as [16], perform a global scene reconstruction as they require the knowledge of all the camera poses. These are recovered via batch processing involving pairwise image matching and structure from motion (SfM), on collections of photographs. Both steps can become computational bottlenecks for large image collections. Instead of computing a global reconstruction and all the camera poses, our system utilizes several partial reconstructions, each of which is computed from only a small subset of overlapping images. These subsets are efficiently determined using a Bag of Words-based matching technique. Our framework easily allows an incoming stream of new photographs to be incrementally inserted into an existing reconstruction. We demonstrate that an image-based rendering framework based on only partial scene reconstructions can be used to navigate large collections containing thousands of images without sacrificing the navigation experience. As our system is designed for incremental construction from a stream of photographs, it is well suited for processing the ever-growing photo collections.
Tripartite graph models for multi modal image retrieval
LAKSHMI CHANDRIKA PULLA,Jawahar C V
British Machine Vision Conference, BMVC, 2010
@inproceedings{bib_Trip_2010, AUTHOR = {LAKSHMI CHANDRIKA PULLA, Jawahar C V}, TITLE = {Tripartite graph models for multi modal image retrieval}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2010}}
Most of the traditional image retrieval methods use either low level visual features or embedded text for representation and indexing. In recent years, there has been significant interest in combining these two different modalities for effective retrieval. In this paper, we propose a tripartite graph based representation of the multi-modal data for image retrieval tasks. Our representation is ideally suited for dynamically changing or evolving datasets, where repeated semantic indexing is practically impossible. We employ a graph partitioning algorithm for retrieving semantically relevant images from the database of images represented using the tripartite graph. Being “just in time semantic indexing”, our method is computationally light and less resource intensive. Experimental results show that the data structure used is scalable. We also show that the performance of our method is comparable with other multi-modal approaches, with significantly lower computational and resource requirements.
Nearest neighbor based collection OCR
Pramod Sankar Kompalli,Jawahar C V,R. Manmatha
International Workshop on Document Analysis Systems, DAS, 2010
@inproceedings{bib_Near_2010, AUTHOR = {Pramod Sankar Kompalli, Jawahar C V, R. Manmatha}, TITLE = {Nearest neighbor based collection OCR}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2010}}
Conventional optical character recognition (OCR) systems operate on individual characters and words, and do not normally exploit document or collection context. We describe a Collection OCR which takes advantage of the fact that multiple examples of the same word (often in the same font) may occur in a document or collection. The idea here is that an OCR or a reCAPTCHA like process generates a partial set of recognized words. In the second stage, a nearest neighbor algorithm compares the remaining word-images to those already recognized and propagates labels from the nearest neighbors. It is shown that by using an approximate fast nearest neighbor algorithm based on Hierarchical K-Means (HKM), we can do this accurately and efficiently. It is also shown that profile based features perform much better than SIFT and Pyramid Histogram of Gradient (PHOG) features. We believe that this is because profile features are more robust to word degradations (common in our documents). This approach is applied to a collection of Telugu books - a language for which no commercial OCR exists. We show from a selection of 33 Telugu books that starting with OCR labels for only 30% of the collection we can recognize the remaining 70% of the words in the collection with 70% accuracy using this approach. Since the approach makes no language specific assumptions, it should be applicable to a large number of languages. In particular we are interested in its applicability to Indic languages and scripts.
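A hedged sketch of the second-stage label propagation: unrecognized word images take the majority label of their nearest already-recognized neighbours in feature space. For clarity this uses scikit-learn's exact NearestNeighbors index instead of the approximate Hierarchical K-Means search the paper relies on for speed; the feature choice (e.g. profile features) is left to the caller.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def propagate_labels(features, labels, k=3):
    """features: (N, D) array of word-image features (e.g. profile features);
    labels: list of length N with a string for recognised words, None otherwise.
    Unlabelled word images receive the majority label of their k nearest labelled neighbours."""
    labelled = [i for i, lab in enumerate(labels) if lab is not None]
    unlabelled = [i for i, lab in enumerate(labels) if lab is None]
    index = NearestNeighbors(n_neighbors=k).fit(features[labelled])
    _, nbrs = index.kneighbors(features[unlabelled])
    out = list(labels)
    for row, i in zip(nbrs, unlabelled):
        votes = [labels[labelled[j]] for j in row]
        out[i] = max(set(votes), key=votes.count)   # majority vote among the neighbours
    return out
```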
A post-processing scheme for malayalam using statistical sub-character language models
Karthika Mohan,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2010
@inproceedings{bib_A_po_2010, AUTHOR = {Karthika Mohan, Jawahar C V}, TITLE = {A post-processing scheme for malayalam using statistical sub-character language models}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2010}}
Most of the Indian scripts do not have any robust commercial OCRs. Many of the laboratory prototypes report reasonable results at recognition/classification stage. However, word level accuracies are still poor. It is well known that word accuracy decreases as the number of characters in a word increase. For Malayalam, the average number of characters in a word is almost twice that of English. Moreover, the number of words required to cover 80% of the Malayalam language is more than forty times that of other Indian languages such as Hindi. Hence a direct dictionary based post-processing scheme is not suitable for Malayalam. In this paper, we propose a post-processing scheme which uses statistical language models at the sub-character level to boost word level recognition results. We use a multi-stage graph representation and formulate the recognition task as an optimization problem. Edges of the graph encode the language information and nodes represent the visual similarities. An optimal path from source node to destination node represents the recognized text. We validate our method on more than 10,000 words from a Malayalam corpus.
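The optimization over the multi-stage graph amounts to a Viterbi-style dynamic program, sketched below under some assumed interfaces: each position contributes classifier (visual) log-scores for candidate sub-character symbols, edges carry sub-character bigram language-model scores, and the best-scoring path is the recognized text. The `bigram_logprob` callback and the weight `alpha` are illustrative, not the paper's exact formulation.

```python
def best_path(candidates, bigram_logprob, alpha=1.0):
    """candidates: list over positions; each position is a list of
    (symbol, visual_logprob) pairs produced by the classifier.
    bigram_logprob(prev, cur): sub-character language-model score for an edge."""
    # best[sym] = (total score of best path ending in sym, that path)
    best = {sym: (score, [sym]) for sym, score in candidates[0]}
    for stage in candidates[1:]:
        new_best = {}
        for sym, vis in stage:
            prev_sym, (prev_score, prev_path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + alpha * bigram_logprob(kv[0], sym))
            score = prev_score + alpha * bigram_logprob(prev_sym, sym) + vis
            new_best[sym] = (score, prev_path + [sym])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]   # decoded symbol sequence
```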
Towards more effective distance functions for word image matching
RAMAN JAIN,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2010
@inproceedings{bib_Towa_2010, AUTHOR = {RAMAN JAIN, Jawahar C V}, TITLE = {Towards more effective distance functions for word image matching}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2010}}
Matching word images has many applications in document recognition and retrieval systems. Dynamic Time Warping (DTW) is popularly used to estimate the similarity between word images. Word images are represented as sequences of feature vectors, and the cost associated with dynamic programming based alignment is considered as the dissimilarity between them. However, such approaches are computationally costly when compared to fixed length matching schemes. In this paper, we explore systematic methods for identifying appropriate distance metrics for a given database or language. This is achieved by learning query specific distance functions which can be computed online efficiently. We show that a weighted Euclidean distance can outperform DTW for matching word images. This class of distance functions is also ideal for scalability and large scale matching. Our results are validated with mean Average Precision (mAP) on a fully annotated data set of 160K word images. We then show that the learnt distance functions can even be extended to a new database to obtain accurate retrieval.
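A minimal sketch of the fixed-length alternative to DTW: with word images reduced to fixed-length feature vectors, matching becomes a weighted Euclidean distance that costs O(D) per comparison and is easy to index. How the per-dimension weights `w` are learnt (query-specific, from annotated data) is not shown here and is the substance of the paper.

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Fixed-length word-image features x, y and per-dimension weights w.
    Unlike DTW, this needs no alignment and is trivially indexable."""
    d = x - y
    return np.sqrt(np.sum(w * d * d))

def rank_collection(query_feat, collection_feats, w):
    """Rank all word images in the collection by weighted distance to the query."""
    dists = np.sqrt(((collection_feats - query_feat) ** 2 * w).sum(axis=1))
    return np.argsort(dists)    # indices of word images, nearest first
```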
Multi modal semantic indexing for image retrieval
LAKSHMI CHANDRIKA PULLA,Jawahar C V
International Conference on Image and Video Retrieval, CIVR, 2010
@inproceedings{bib_Mult_2010, AUTHOR = {LAKSHMI CHANDRIKA PULLA, Jawahar C V}, TITLE = {Multi modal semantic indexing for image retrieval}, BOOKTITLE = {International Conference on Image and Video Retrieval}. YEAR = {2010}}
Popular image retrieval schemes generally rely only on a single mode (either low level visual features or embedded text) for searching in multimedia databases. Many popular image collections (e.g. those emerging over the Internet) have associated tags, often for human consumption. A natural extension is to combine information from multiple modes for enhancing effectiveness in retrieval. In this paper, we propose two techniques: Multi-modal Latent Semantic Indexing (MMLSI) and Multi-Modal Probabilistic Latent Semantic Analysis (MMpLSA). These methods are obtained by directly extending their traditional single mode counterparts. Both these methods incorporate visual features and tags by generating simultaneous semantic contexts. The experimental results demonstrate an improved accuracy over other single and multi-modal methods.
Generalized RBF feature maps for Efficient Detection.
V SREEKANTH,Andrea Vedaldi,Andrew Zisserman,Jawahar C V
British Machine Vision Conference, BMVC, 2010
@inproceedings{bib_Gene_2010, AUTHOR = {V SREEKANTH, Andrea Vedaldi, Andrew Zisserman, Jawahar C V}, TITLE = {Generalized RBF feature maps for Efficient Detection.}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2010}}
Kernel methods yield state-of-the-art performance in certain applications such as image classification and object detection. However, large scale problems require machine learning techniques of at most linear complexity and these are usually limited to linear kernels. This unfortunately rules out gold-standard kernels such as the generalized RBF kernels (e.g. the exponential-χ² kernel). Recently, Maji and Berg [13] and Vedaldi and Zisserman [20] proposed explicit feature maps to approximate the additive kernels (intersection, χ², etc.) by linear ones, thus enabling the use of fast machine learning techniques in a non-linear context. An analogous technique was proposed by Rahimi and Recht [14] for the translation invariant RBF kernels. In this paper, we complete the construction and combine the two techniques to obtain explicit feature maps for the generalized RBF kernels. Furthermore, we investigate a learning method using ℓ1 regularization to encourage sparsity in the final vector representation, and thus reduce its dimension. We evaluate this technique on the VOC 2007 detection challenge, showing when it can improve on fast additive kernels, and the trade-offs in complexity and accuracy.
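Of the two ingredients combined in the paper, the translation-invariant RBF part can be illustrated with the random Fourier feature map of Rahimi and Recht, sketched below for a Gaussian RBF kernel. The full generalized-RBF construction composes such a map with an explicit additive-kernel (e.g. χ²) feature map, which this sketch omits; the parameters are illustrative.

```python
import numpy as np

def random_fourier_features(X, num_features=256, gamma=0.5, seed=0):
    """Approximate the Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) with an
    explicit map z(x) such that z(x) . z(y) ~= k(x, y) (Rahimi & Recht)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the kernel's spectral density, phases uniform in [0, 2*pi).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```

A linear SVM trained on these explicit features then stands in for a (much slower) kernel SVM with the corresponding RBF kernel.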
Fast and spatially-smooth terrain classification using monocular camera
CHETAN JAKKOJU,K Madhava Krishna,Jawahar C V
International conference on Pattern Recognition, ICPR, 2010
@inproceedings{bib_Fast_2010, AUTHOR = {CHETAN JAKKOJU, K Madhava Krishna, Jawahar C V}, TITLE = {Fast and spatially-smooth terrain classification using monocular camera}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2010}}
In this paper, we present a monocular camera based terrain classification scheme. The uniqueness of the proposed scheme is that it inherently incorporates spatial smoothness while segmenting an image, without requiring post-processing smoothing methods. The algorithm is extremely fast because it is built on top of a Random Forest classifier. The baseline algorithm uses color, texture and their combination with classifiers such as SVM and Random Forests. We present comparisons across features and classifiers. We further enhance the algorithm through a label transfer method. The efficacy of the proposed solution can be seen as we reach low error rates on both our dataset and other publicly available datasets.
Efficient semantic indexing for image retrieval
LAKSHMI CHANDRIKA PULLA,PATHAPATI SUMAN KARTHIK,Jawahar C V
International conference on Pattern Recognition, ICPR, 2010
@inproceedings{bib_Effi_2010, AUTHOR = {LAKSHMI CHANDRIKA PULLA, PATHAPATI SUMAN KARTHIK, Jawahar C V}, TITLE = {Efficient semantic indexing for image retrieval}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2010}}
Semantic analysis of a document collection can be viewed as an unsupervised clustering of the constituent words and documents around hidden or latent concepts. This has been shown to improve the performance of visual bag of words in image retrieval. However, the enhancement in performance depends heavily on the right choice of the number of semantic concepts. Most of the semantic indexing schemes are also computationally costly. In this paper, we employ a bipartite graph model (BGM) for image retrieval. BGM is a scalable data structure that aids semantic indexing in an efficient manner. It can also be incrementally updated. BGM uses tf-idf values for building a semantic bipartite graph. We also introduce a graph partitioning algorithm that works on the BGM to retrieve semantically relevant images from a database. We demonstrate the properties as well as the performance of our semantic indexing scheme through a series of experiments. We also compare our methods with incremental pLSA.
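A small sketch of the data structure: the bipartite graph links images to quantised visual words with tf-idf edge weights, and can be rebuilt or extended image-by-image as the collection grows. The dictionary-of-dictionaries layout is an illustrative choice; the graph partitioning used for retrieval is not reproduced here.

```python
import math
from collections import Counter, defaultdict

def build_bipartite_graph(image_words):
    """image_words: dict image_id -> list of quantised visual-word ids.
    Returns tf-idf weighted edges of the image / visual-word bipartite graph."""
    n_images = len(image_words)
    doc_freq = Counter()
    for words in image_words.values():
        doc_freq.update(set(words))              # in how many images each visual word appears
    edges = defaultdict(dict)                    # edges[image_id][word_id] = tf-idf weight
    for image_id, words in image_words.items():
        counts = Counter(words)
        for word_id, tf in counts.items():
            idf = math.log(n_images / doc_freq[word_id])
            edges[image_id][word_id] = (tf / len(words)) * idf
    return edges
```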
Multiple plane tracking using unscented kalman filter
Visesh Chari,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2010
@inproceedings{bib_Mult_2010, AUTHOR = {Visesh Chari, Jawahar C V}, TITLE = {Multiple plane tracking using unscented kalman filter}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2010}}
An important prerequisite for many tasks like visual servoing and visual SLAM is tracking the underlying features. The use of planar features for these purposes has gained importance recently. Complementing current planar tracking works in the robotics literature, which use multiple features, we formulate the tracking problem using multiple planes. Inspired by the maturity in understanding of geometric quantities like the homography in computer vision, we develop a system based on the Unscented Kalman Filter (UKF) that localizes the camera and estimates the plane parameters of a scene, using homographies as measurements. Homographies are estimated using tracked feature points. We show that this framework provides significant robustness and stability to the system under significant changes of illumination, occlusion, etc. Finally, we also propose a convex optimization based solution for the initialization of this system, which is capable of producing globally optimal estimates, and is a useful algorithm in its own right. Several synthetic and real results are presented to demonstrate the efficacy of our approach.
Realtime motion segmentation based multibody visual SLAM
ABHIJIT KUNDU,K Madhava Krishna,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2010
@inproceedings{bib_Real_2010, AUTHOR = {ABHIJIT KUNDU, K Madhava Krishna, Jawahar C V}, TITLE = {Realtime motion segmentation based multibody visual SLAM}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2010}}
In this paper, we present a practical vision based Simultaneous Localization and Mapping (SLAM) system for a highly dynamic environment. We adopt a multibody Structure from Motion (SfM) approach, which is the generalization of classical SfM to dynamic scenes with multiple rigidly moving objects. The proposed framework of multibody visual SLAM allows choosing between full 3D reconstruction or simply tracking of the moving objects, which adds flexibility to the system for scenes containing non-rigid objects or objects having insufficient features for reconstruction. The solution demands a motion segmentation framework that can segment feature points belonging to different motions and maintain the segmentation with time. We propose a realtime incremental motion segmentation algorithm for this purpose. The motion segmentation is robust and is capable of segmenting difficult degenerate motions, where a moving object is followed by a moving camera in the same direction. This robustness is attributed to the use of efficient geometric constraints and a probability framework which propagates the uncertainty in the system. The motion segmentation module is tightly coupled with feature tracking and visual SLAM through various feedback paths between these modules. The integrated system can simultaneously perform realtime visual SLAM and tracking of multiple moving objects using only a single monocular camera.
Characteristic pattern discovery in videos
Mihir Jain,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2010
@inproceedings{bib_Char_2010, AUTHOR = {Mihir Jain, Jawahar C V}, TITLE = {Characteristic pattern discovery in videos}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2010}}
In this paper, we present an approach to discover characteristic patterns in videos. We characterize the videos based on frequently occurring patterns like scenes, characters, sequence of frames in an unsupervised setting. With our approach, we are able to detect the representative scenes and characters of movies. We also present a method for detecting video stop-words in broadcast news videos based on the frequency of occurrence of sequence of frames. These are analogous to stop-words in text classification and search. We employ two different video mining schemes; both aimed at detecting frequent and representative patterns. For one of our mining approaches, we use an efficient frequent pattern mining algorithm over a quantized feature space. Our second approach uses a Random Forest to first represent video data as sequences, and then mine the frequent patterns. We validate the proposed approaches on broadcast news videos and our database of 81 Oscar winning movies.
An indexing approach for speeding-up image classification
RAHUL JAIN,SUDHA PRAVEEN MAREMANDA,Pramod Sankar Kompalli,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2010
@inproceedings{bib_An_i_2010, AUTHOR = {RAHUL JAIN, SUDHA PRAVEEN MAREMANDA, Pramod Sankar Kompalli, Jawahar C V}, TITLE = {An indexing approach for speeding-up image classification}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2010}}
One of the most common computer vision tasks is that of recognizing the category of objects present in a given image. Previous work has mostly focused on building accurate classifiers based on carefully selected features. Classification is often carried out on individual test images, while many practical situations, such as web-scale image indexing, demand the simultaneous classification of a large collection of images. This is especially true for real-world datasets that already contain numerous un-indexed images and videos. In this paper, we work towards developing a computationally efficient approach to object recognition that is inspired by retrieval schemes. We perform an offline indexing of the features from the collection, so that the classifier only needs to work on a small subset of the entire feature set. Over a set of 2 million features extracted from 7000 images, classification against 5 object categories using a standard SVM would require more than 260 hours. Over the same test case, the classification time using our indexing based approach is reduced to less than 13 hours. The compromise on accuracy is less than 7% for the 20X speedup achieved.
Blind authentication: a secure crypto-biometric verification protocol
MANEESH UPMANYU,Anoop Namboodiri,Srinathan Kannan,Jawahar C V
IEEE Transactions on Information Forensics and Security, TIFS, 2010
@inproceedings{bib_Blin_2010, AUTHOR = {MANEESH UPMANYU, Anoop Namboodiri, Srinathan Kannan, Jawahar C V}, TITLE = {Blind authentication: a secure crypto-biometric verification protocol}, BOOKTITLE = {IEEE Transactions on Information Forensics and Security}. YEAR = {2010}}
Concerns on widespread use of biometric authentication systems are primarily centered around template security, revocability, and privacy. The use of cryptographic primitives to bolster the authentication process can alleviate some of these concerns as shown by biometric cryptosystems. In this paper, we propose a provably secure and blind biometric authentication protocol, which addresses the concerns of user’s privacy, template protection, and trust issues. The protocol is blind in the sense that it reveals only the identity, and no additional information about the user or the biometric to the authenticating server or vice-versa. As the protocol is based on asymmetric encryption of the biometric data, it captures the advantages of biometric authentication as well as the security of public key cryptography. The authentication protocol can run over public networks and provide nonrepudiable identity verification. The encryption also provides template protection, the ability to revoke enrolled templates, and alleviates the concerns on privacy in widespread use of biometrics. The proposed approach makes no restrictive assumptions on the biometric data and is hence applicable to multiple biometrics. Such a protocol has significant advantages over existing biometric cryptosystems, which use a biometric to secure a secret key, which in turn is used for authentication. We analyze the security of the protocol under various attack scenarios. Experimental results on four biometric datasets (face, iris, hand geometry, and fingerprint) show that carrying out the authentication in the encrypted domain does not affect the accuracy, while the encryption key acts as an additional layer of security.
Efficient privacy preserving k-means clustering
MANEESH UPMANYU,Anoop Namboodiri,Srinathan Kannan,Jawahar C V
Pacific Asia Workshop on Intelligence and Security Informatics., PAISI, 2010
@inproceedings{bib_Effi_2010, AUTHOR = {MANEESH UPMANYU, Anoop Namboodiri, Srinathan Kannan, Jawahar C V}, TITLE = {Efficient privacy preserving k-means clustering}, BOOKTITLE = {Pacific Asia Workshop on Intelligence and Security Informatics.}. YEAR = {2010}}
This paper introduces an efficient privacy-preserving protocol for distributed K-means clustering over arbitrarily partitioned data, shared among N parties. Clustering is one of the fundamental algorithms used in the field of data mining. Advances in data acquisition methodologies have resulted in the collection and storage of vast quantities of users’ personal data. For mutual benefit, organizations tend to share their data for analytical purposes, thus raising privacy concerns for the users. Over the years, numerous attempts have been made to introduce privacy and security at the expense of massive additional communication costs. The approaches suggested in the literature make use of cryptographic protocols such as Secure Multiparty Computation (SMC) and/or homomorphic encryption schemes like Paillier’s encryption. Methods using such schemes have high communication overheads and, in practice, are found to be slower by a factor of more than 10^6. In light of the practical limitations posed by privacy using the traditional approaches, we explore a paradigm shift to side-step the expensive protocols of SMC. In this work, we use the paradigm of secret sharing, which allows the data to be divided into multiple shares and processed separately at different servers. Using the paradigm of secret sharing allows us to design a provably-secure, cloud computing based solution which has negligible communication overhead compared to SMC and is hence over a million times faster than similar SMC based protocols.
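The secret-sharing paradigm the paper builds on can be illustrated with a toy additive scheme: a value is split into random shares that individually reveal nothing but can be summed server-side, which is the kind of aggregation a k-means update needs. The modulus, the three-server setting, and the integer encoding are illustrative assumptions; the paper's actual protocol and its security analysis are not reproduced here.

```python
import secrets

PRIME = 2**61 - 1   # large prime modulus; illustrative choice only

def share(value, n_servers=3):
    """Split an integer into n additive shares modulo PRIME; any n-1 shares look
    uniformly random, but all n shares sum back to the value."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Shares can be added server-side without reconstruction, e.g. to accumulate
# per-cluster sums for a k-means centroid update:
a, b = 12, 30
summed = [x + y for x, y in zip(share(a), share(b))]
assert reconstruct(summed) == a + b
```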
An adaptive outdoor terrain classification methodology using monocular camera
CHETAN JAKKOJU,K Madhava Krishna,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2010
@inproceedings{bib_An_a_2010, AUTHOR = {CHETAN JAKKOJU, K Madhava Krishna, Jawahar C V}, TITLE = {An adaptive outdoor terrain classification methodology using monocular camera}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2010}}
An adaptive partition based Random Forests classifier for outdoor terrain classification is presented in this paper. The classifier is a combination of two underlying classifiers: one is a random forest learnt over a bootstrapped or offline dataset, and the second is another random forest that adapts to changes on the fly. Posterior probabilities of both the static and changing/online classifiers are fused to assign the eventual label for the online image data. The online classifier learns at frequent intervals of time through a sparse and stable set of tracked patches, which makes it lightweight and real-time friendly. The learning, which is actuated at frequent intervals during the sojourn, significantly improves the performance of the classifier vis-a-vis a scheme that only uses the classifier learnt offline or at bootstrap. The method is well suited and finds immediate applications for outdoor autonomous driving, where the classifier needs to be updated frequently based on what shows up recently on the terrain, without deviating largely from what was learnt at bootstrapping. The role of the partition based classifier in enhancing the performance of a regular multi class classifier such as random forests and multi class SVMs is also summarized in this paper.
Realtime Moving Object Detection from a Freely Moving Monocular Camera
ABHIJIT KUNDU,Jawahar C V,K Madhava Krishna
International Conference on Robotics and Biomimetics, ROBIO, 2010
@inproceedings{bib_Real_2010, AUTHOR = {ABHIJIT KUNDU, Jawahar C V, K Madhava Krishna}, TITLE = {Realtime Moving Object Detection from a Freely Moving Monocular Camera}, BOOKTITLE = {International Conference on Robotics and Biomimetics}. YEAR = {2010}}
Detection of moving objects is a key component in mobile robotic perception and understanding of the environment. In this paper, we describe a realtime independent motion detection algorithm for this purpose. The method is robust and is capable of detecting difficult degenerate motions, where a moving object is followed by a moving camera in the same direction. This robustness is attributed to the use of efficient geometric constraints and a probability framework which propagates the uncertainty in the system. The proposed independent motion detection framework integrates seamlessly with existing visual SLAM solutions. The system consists of multiple modules which are tightly coupled so that one module benefits from another. The integrated system can simultaneously detect multiple moving objects in realtime from a freely moving monocular camera.
Oxford-IIIT TRECVID 2009 – Notebook Paper
V SREEKANTH,Mihir Jain,PARKHI OMKAR MORESHWAR,Jawahar C V
Text Retrieval Conference Video Retrieval Evaluation, TRECVID, 2009
@inproceedings{bib_Oxfo_2009, AUTHOR = {V SREEKANTH, Mihir Jain, PARKHI OMKAR MORESHWAR, Jawahar C V}, TITLE = {Oxford-IIIT TRECVID 2009 – Notebook Paper}, BOOKTITLE = {Text Retrieval Conference Video Retrieval Evaluation}. YEAR = {2009}}
1. Oxford-IIIT combined: a spatial pyramid intersection kernel SVM image classifier, a sliding-window random forest object detector, a sliding-window intersection kernel SVM object detector, and a discriminative constellation model facial feature extractor. For each of the twenty features, methods were ranked based on their performance on a validation set and associated to successive runs by decreasing performance. For training, TRECVID annotations were manually corrected and augmented with object bounding boxes, and additional training data was used for under-represented features such as Airplane flying. 2. The different methods yielded significantly different performance depending on the feature, as expected by their design. 3. The image classifier worked better for scene-level features such as Cityscape, Classroom and Doorway, while the object detectors worked better for Boat or ship, Bus and Person riding a bicycle, and the face feature extractor worked well for Female face closeup. 4. Three conclusions can be drawn: (i) different features are addressed better by specialised methods, and (ii) removal of noise from TRECVID annotations and (iii) additional data for under-represented features significantly improve performance.
Empirical evaluation of character classification schemes
NEEBA N.V.,Jawahar C V
International Conference on Applied Pattern Recognition, ICAPR, 2009
@inproceedings{bib_Empi_2009, AUTHOR = {NEEBA N.V., Jawahar C V}, TITLE = {Empirical evaluation of character classification schemes}, BOOKTITLE = {International Conference on Applied Pattern Recognition}. YEAR = {2009}}
In this paper, we empirically study the performance of a set of pattern classification schemes for character classification problems. We argue that with a rich feature space, this class of problems can be solved with reasonable success using a set of statistical feature extraction schemes. Experimental validation is done on a data set (of more than 500,000 characters) collected and annotated from books printed primarily in Malayalam. The scope of this study includes (a) applicability of a spectrum of classifiers and features, (b) scalability of classifiers, (c) sensitivity of features to degradation, (d) generalization across fonts and (e) applicability across scripts.
Incremental on-line semantic indexing for image retrieval in dynamic databases
PATHAPATI SUMAN KARTHIK,LAKSHMI CHANDRIKA PULLA,Jawahar C V
Computer Vision and Pattern Recognition Conference workshops, CVPR-W, 2009
@inproceedings{bib_Incr_2009, AUTHOR = {PATHAPATI SUMAN KARTHIK, LAKSHMI CHANDRIKA PULLA, Jawahar C V}, TITLE = {Incremental on-line semantic indexing for image retrieval in dynamic databases}, BOOKTITLE = {Computer Vision and Pattern Recognition Conference workshops}. YEAR = {2009}}
Contemporary approaches to semantic indexing for bag of words image retrieval do not adapt well when the image or video collections get modified dynamically. In this paper, we propose an on-line incremental semantic indexing scheme for image retrieval in dynamic image collections. Our main contributions are in the form of a method and a data structure that tackle representation of the term-document matrix and on-line semantic indexing when the database changes. We introduce a bipartite graph model (BGM), which is a scalable data structure that aids on-line semantic indexing. It can also be incrementally updated. BGM uses tf-idf values for building a semantic bipartite graph. We also introduce a cash flow algorithm that works on the BGM to retrieve semantically relevant images from the database. We examine the properties of both the BGM and the cash flow algorithm through a series of experiments. Finally, we demonstrate how they can be effectively implemented to build large scale image retrieval systems in an incremental manner.
Example based video filters
Mihir Jain,V SREEKANTH,LAKSHMI CHANDRIKA PULLA,Jawahar C V
International Conference on Image and Video Retrieval, CIVR, 2009
@inproceedings{bib_Exam_2009, AUTHOR = {Mihir Jain, V SREEKANTH, LAKSHMI CHANDRIKA PULLA, Jawahar C V}, TITLE = {Example based video filters}, BOOKTITLE = {International Conference on Image and Video Retrieval}. YEAR = {2009}}
Many of the successful multimedia retrieval systems focus on developing efficient and effective video retrieval solutions with the help of appropriate index structures. In these systems, the query is an example video and the retrieved results are similar video clips which are available apriori in the database. In this paper, we address a complementary problem of filtering a video stream based on a set of given examples. By filtering, we mean to detect, accept or reject the part of a video stream matching any of the given examples. This requires matching of example videos with the on-line video stream. Since the concepts of interest could be complex, we avoid explicit learning of a representation from the example videos to characterize the visual event present in the examples. We model the problem as simultaneous on-line spotting of multiple examples in a video stream. We employ a vocabulary trie for the filtering purpose and demonstrate the applicability of the technique in a variety of situations.
Managing multilingual OCR project using XML
Gaurav Harit,K. J. JINESH,Ritu Garg,Jawahar C V,Santanu Chaudhury
International Workshop on Multilingual OCR, MOCR-W, 2009
@inproceedings{bib_Mana_2009, AUTHOR = {Gaurav Harit, K. J. JINESH, Ritu Garg, Jawahar C V, Santanu Chaudhury}, TITLE = {Managing multilingual OCR project using XML}, BOOKTITLE = {International Workshop on Multilingual OCR}. YEAR = {2009}}
This paper presents an XML-based scheme for managing a large multilingual OCR project. In particular we describe how a new XML based tagging scheme has been exploited to achieve the objectives of the project. Managing a large multi-lingual OCR project involving multiple research groups, developing script specific and script independent technologies in a collaborative fashion is a challenging problem. In this paper, we present some of the software and data management strategies designed for the project aimed at developing OCR for 11 scripts of Indian origin for which mature OCR technology was not available.
Robust recognition of documents by fusing results of word clusters
KOMATIREDDY VENKAT RASAGNA REDDY,ANAND KUMAR,Jawahar C V,R. Manmatha
International Conference on Document Analysis and Recognition, ICDAR, 2009
@inproceedings{bib_Robu_2009, AUTHOR = {KOMATIREDDY VENKAT RASAGNA REDDY, ANAND KUMAR, Jawahar C V, R. Manmatha}, TITLE = {Robust recognition of documents by fusing results of word clusters}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2009}}
The word error rate of any optical character recognition system (OCR) is usually substantially below its component or character error rate. This is especially true of Indic languages in which a word consists of many components. Current OCRs recognize each character or word separately and do not take advantage of document level constraints. We propose a document level OCR which incorporates information from the entire document to reduce word error rates. Word images are first clustered using a locality sensitive hashing technique. Individual words are then recognized using a (regular) OCR. The OCR outputs of word images in a cluster are then corrected probabilistically by comparing with the OCR outputs of other members of the same cluster. The approach may be applied to improve the accuracy of any OCR run on documents in any language. In particular, we demonstrate it for Telugu, where the use of language models for post-processing is not promising. We show a relative improvement of 28% for long words and 12% for all words which appear at least twice in the corpus.
Subtitle-free Movie to Script Alignment
Pramod Sankar Kompalli,Jawahar C V,Andrew Zisserman
British Machine Vision Conference, BMVC, 2009
@inproceedings{bib_Subt_2009, AUTHOR = {Pramod Sankar Kompalli, Jawahar C V, Andrew Zisserman}, TITLE = {Subtitle-free Movie to Script Alignment}, BOOKTITLE = {British Machine Vision Conference}. YEAR = {2009}}
A standard solution for aligning scripts to movies is to use dynamic time warping with the subtitles (Everingham et al., BMVC 2006). We investigate the problem of aligning scripts to TV video/movies in cases where subtitles are not available, e.g. in the case of silent films or for film passages which are non-verbal. To this end we identify a number of “modes of alignment” and train classifiers for each of these. The modes include visual features, such as locations and face recognition, and audio features such as speech. In each case the feature gives some alignment information, but is too noisy when used independently. We show that combining the different features into a single cost function and optimizing this using dynamic programming, leads to a performance superior to each of the individual features. The method is assessed on episodes from the situation comedy Seinfeld, and on Charlie Chaplin and Indian movies.
Planar scene modeling from quasiconvex subproblems
Visesh Chari,ANIL KUMAR NELAKANTI,CHETAN JAKKOJU,Jawahar C V
Asian Conference on Computer Vision, ACCV, 2009
@inproceedings{bib_Plan_2009, AUTHOR = {Visesh Chari, ANIL KUMAR NELAKANTI, CHETAN JAKKOJU, Jawahar C V}, TITLE = {Planar scene modeling from quasiconvex subproblems}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2009}}
In this paper, we propose a convex optimization based approach for piecewise planar reconstruction. We show that the task of reconstructing a piecewise planar environment can be set in an L∞ based homographic framework that iteratively computes scene plane and camera pose parameters. Instead of image points, the algorithm optimizes over inter-image homographies. The resultant objective functions are minimized using Second Order Cone Programming algorithms. Apart from showing the convergence of the algorithm, we also empirically verify its robustness to error in initialization through various experiments on synthetic and real data. We intend this algorithm to be in between initialization approaches like decomposition methods and iterative non-linear minimization methods like Bundle Adjustment.
A Bayesian Approach to Hybrid Image Retrieval
PRADHEE TANDON,Jawahar C V
Conference on Pattern Recognition and Machine Intelligence, PReMI, 2009
@inproceedings{bib_A_Ba_2009, AUTHOR = {PRADHEE TANDON, Jawahar C V}, TITLE = {A Bayesian Approach to Hybrid Image Retrieval}, BOOKTITLE = {Conference on Pattern Recognition and Machine Intelligence}. YEAR = {2009}}
Content based image retrieval (CBIR) has been well studied in the computer vision and multimedia community. Content free image retrieval (CFIR) methods, and their complementary characteristics to CBIR, have not received enough attention in the literature. The performance of CBIR is constrained by the semantic gap between the feature representations and user expectations, while CFIR suffers from sparse logs and cold starts. We fuse both of them in a Bayesian framework to design a hybrid image retrieval system by overcoming their shortcomings. We validate our ideas and report experimental results, both qualitatively and quantitatively. We use our indexing scheme to efficiently represent both features and logs, thereby enabling scalability to millions of images.
Efficient Biometric Verification in Encrypted Domain
MANEESH UPMANYU,Anoop Namboodiri,Srinathan Kannan,Jawahar C V
International conference on Biometrics, IJCB, 2009
@inproceedings{bib_Effi_2009, AUTHOR = {MANEESH UPMANYU, Anoop Namboodiri, Srinathan Kannan, Jawahar C V}, TITLE = {Efficient Biometric Verification in Encrypted Domain}, BOOKTITLE = {International conference on Biometrics}. YEAR = {2009}}
Biometric authentication over public networks leads to a variety of privacy issues that need to be addressed before it can become popular. The primary concerns are that the biometrics might reveal more information than the identity itself, as well as provide the ability to track users over an extended period of time. In this paper, we propose an authentication protocol that alleviates these concerns. The protocol takes care of user privacy, template protection and trust issues in biometric authentication systems. The protocol uses asymmetric encryption, and captures the advantages of biometric authentication. The protocol provides non-repudiable identity verification, while not revealing any additional information about the user to the server or vice versa. We show that the protocol is secure under various attacks. Experimental results indicate that the overall method is efficient enough to be used in practical scenarios.
Contextual restoration of severely degraded document images
JYOTIRMOY BANERJEE,Anoop Namboodiri,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2009
@inproceedings{bib_Cont_2009, AUTHOR = {JYOTIRMOY BANERJEE, Anoop Namboodiri, Jawahar C V}, TITLE = {Contextual restoration of severely degraded document images}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2009}}
We propose an approach to restore severely degraded document images using a probabilistic context model. Unlike traditional approaches that use previously learned prior models to restore an image, we are able to learn the text model from the degraded document itself, making the approach independent of script, font, style, etc. We model the contextual relationship using an MRF. The ability to work with larger patch sizes allows us to deal with severe degradations including cuts, blobs, merges and vandalized documents. Our approach can also integrate document restoration and super-resolution into a single framework, thus directly generating high quality images from degraded documents. Experimental results show significant improvement in image quality on document images collected from various sources including magazines and books, and comprehensively demonstrate the robustness and adaptability of the approach. It works well with document collections such as books, even with severe degradations, and hence is ideally suited for repositories such as digital libraries.
Retrieval of online handwriting by synthesis and matching
Jawahar C V,A BALA SUBRAMANIAN,MILLION MESHESHA,Anoop Namboodiri
Pattern Recognition, PR, 2009
@inproceedings{bib_Retr_2009, AUTHOR = {Jawahar C V, A BALA SUBRAMANIAN, MILLION MESHESHA, Anoop Namboodiri}, TITLE = {Retrieval of online handwriting by synthesis and matching}, BOOKTITLE = {Pattern Recognition}. YEAR = {2009}}
Search and retrieval is gaining importance in the ink domain due to the increase in the availability of online handwritten data. However, the problem is challenging due to variations in handwriting between various writers, digitizers and writing conditions. In this paper, we propose a retrieval mechanism for online handwriting, which can handle different writing styles, specifically for Indian languages. The proposed approach provides a keyboard-based search interface that enables to search handwritten data from any platform, in addition to pen-based and example-based queries. One of the major advantages of this framework is that information retrieval techniques such as ranking relevance, detecting stopwords and controlling word forms are extended to work with search and retrieval in the ink domain. The framework also allows cross-lingual document retrieval across Indian languages.
Efficient privacy preserving video surveillance
MANEESH UPMANYU,Anoop Namboodiri,Srinathan Kannan,Jawahar C V
International Conference on Computer Vision, ICCV, 2009
@inproceedings{bib_Effi_2009, AUTHOR = {MANEESH UPMANYU, Anoop Namboodiri, Srinathan Kannan, Jawahar C V}, TITLE = {Efficient privacy preserving video surveillance}, BOOKTITLE = {International Conference on Computer Vision}. YEAR = {2009}}
Widespread use of surveillance cameras in offices and other business establishments poses a significant threat to the privacy of the employees and visitors. The challenge of introducing privacy and security in such a practical surveillance system has been stifled by the enormous computational and communication overhead required by the solutions. In this paper, we propose an efficient framework to carry out privacy preserving surveillance. We split each frame into a set of random images. Each image by itself does not convey any meaningful information about the original frame, while collectively, they retain all the information. Our solution is derived from a secret sharing scheme based on the Chinese Remainder Theorem, suitably adapted to image data. Our method enables distributed secure processing and storage, while retaining the ability to reconstruct the original data in case of a legal requirement. The system installed in an office like environment can effectively detect and track people, or solve similar surveillance tasks. Our proposed paradigm is highly efficient compared to Secure Multiparty Computation, making privacy preserving surveillance practical.
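A toy illustration of the Chinese Remainder Theorem splitting behind the scheme: a pixel value is stored as residues modulo pairwise-coprime moduli and reconstructed by CRT. The moduli here are arbitrary illustrative choices, and a single residue modulo a small number does leak partial information, so this should be read purely as a CRT refresher rather than the paper's adapted, secure construction.

```python
from math import prod

MODULI = (7, 11, 13)   # pairwise coprime, product 1001 > 255; illustrative choice only

def split_pixel(value):
    """Store a pixel (0-255) as one residue per modulus, one residue per 'share' image."""
    return [value % m for m in MODULI]

def reconstruct_pixel(residues):
    """Chinese Remainder Theorem: recover the unique value below prod(MODULI)."""
    M = prod(MODULI)
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # pow(Mi, -1, m) is the modular inverse (Python 3.8+)
    return x % M

assert reconstruct_pixel(split_pixel(200)) == 200
```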
Efficient graph-based image matching for recognition and retrieval
PRAVEEN DASIGI,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2008
@inproceedings{bib_Effi_2008, AUTHOR = {PRAVEEN DASIGI, Jawahar C V}, TITLE = {Efficient graph-based image matching for recognition and retrieval}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2008}}
Graphs can be used for effective representation of images for recognition and retrieval purposes. The problem is often to find a proper structure that can efficiently describe an image and can be matched at reasonably low computational expense. The standard solutions to the graph matching problem are computationally expensive since the search space involves all permutations of the node sets. We compare two graphical representations, called Nearest-Neighbor Graphs and Collocation Trees, for the goodness of fit and the computational expense involved in matching. Various schemes to index the graphical structures are also discussed.
Oxford/IIIT TRECVID 2008-Notebook paper.
James Philbin,Manuel Marin-Jimenez, Siddharth Srinivasan,Andrew Zisserman,Mihir Jain,V SREEKANTH,Pramod Sankar Kompalli,Jawahar C V
Text Retrieval Conference Video Retrieval Evaluation, TRECVID, 2008
@inproceedings{bib_Oxfo_2008, AUTHOR = {James Philbin, Manuel Marin-Jimenez, Siddharth Srinivasan, Andrew Zisserman, Mihir Jain, V SREEKANTH, Pramod Sankar Kompalli, Jawahar C V}, TITLE = {Oxford/IIIT TRECVID 2008-Notebook paper.}, BOOKTITLE = {Text Retrieval Conference Video Retrieval Evaluation}. YEAR = {2008}}
The Oxford/IIIT team participated in the high-level feature extraction and interactive search tasks. A vision only approach was used for both tasks, with no use of the text or audio information. For the high-level feature extraction task, we used two different approaches, both based on a combination of visual features. One used a SVM classifier using a linear combination of kernels, the other used a random forest classifier. For both methods, we trained all high-level features using publicly available annotations [3]. The advantage of the random forest classifier is the speed of training and testing. In addition, for the people feature, we took a more targeted approach. We used a real-time face detector and an upper body detector, in both cases running on every frame. Our best performing submission, C OXVGG 1 1, which used a rank fusion of our random forest and SVM approach, achieved an mAP of 0.101 and was above the median for all but one feature. In the interactive search task, our team came third overall with an mAP of 0.158. The system used was identical to last year with the only change being a source of accurate upper body detections.
VIDEO FRAME ALIGNMENT IN MULTIPLE VIEWS
SUJIT KUTHIRUMMAL,Jawahar C V,Narayanan P J
International Conference on Image Processing, ICIP, 2008
@inproceedings{bib_VIDE_2008, AUTHOR = {SUJIT KUTHIRUMMAL, Jawahar C V, Narayanan P J}, TITLE = {VIDEO FRAME ALIGNMENT IN MULTIPLE VIEWS}, BOOKTITLE = {International Conference on Image Processing}. YEAR = {2008}}
Many events are captured using multiple cameras today. Frames of each video stream have to be synchronized and aligned to a common time axis before processing them. Synchronization of the video streams necessarily needs a hardware-based solution that is applied while capturing. The alignment problem between the frames of multiple videos can be posed as a search using traditional measures for image similarity. Multiview relations and constraints recently developed in Computer Vision can provide more elegant solutions to this problem. In this paper, we provide two solutions for the video frame alignment problem using two-view and three-view constraints. We present solutions to this problem for the case when the videos are taken using affine cameras and for general projective cameras. Excellent experimental results are achieved by our algorithms.
Retrieval from Image Datasets with Repetitive Structures
PRAVEEN DASIGI,Jawahar C V
National Conference on Communications, NCC, 2008
@inproceedings{bib_Retr_2008, AUTHOR = {PRAVEEN DASIGI, Jawahar C V}, TITLE = {Retrieval from Image Datasets with Repetitive Structures}, BOOKTITLE = {National Conference on Communications}. YEAR = {2008}}
This work aims to enhance matching and retrieval performance over image datasets in which similar spatial structures occur very frequently. Instead of treating images as bags of features, we try to encode the spatial relationships in the representation. This helps resolve the ambiguity when two classes of images have similar sets of features, although in different spatial arrangements. To demonstrate this, a sizeable dataset of license plate images is used. We propose a method that uses graphs to encode the spatial relationships among features. The problem of image matching thus reduces to finding the maximum similarity between labelled graphs. It is shown that the precision of the retrieved results increases with this matching scheme, since most of the false matches are eliminated.
Feature Selection for Hand-Geometry based Person Authentication
Vandana ROy,Jawahar C V
IEEE Transactions on Image Processing, TIP, 2008
@inproceedings{bib_Feat_2008, AUTHOR = {Vandana ROy, Jawahar C V}, TITLE = {Feature Selection for Hand-Geometry based Person Authentication}, BOOKTITLE = {IEEE Transactions on Image Processing}. YEAR = {2008}}
Biometric traits such as fingerprints, hand geometry, face and voice provide a reliable alternative for identity verification and are gaining commercial acceptance and high user acceptability. Hand-geometry-based verification has proven to be the most suitable and acceptable biometric trait for medium- and low-security applications. Geometric measurements of the human hand have been used for identity authentication in a number of commercial systems. However, not much research has been done on selecting the optimal discriminating features for hand-geometry-based authentication systems. In this paper, we argue that the biometric verification problem is best posed as a single-class problem. We propose to apply Biased Discriminant Analysis and Nonparametric Discriminant Analysis in order to transform the features into a new space where the samples are well separated.
Hybrid visual servoing by boosting IBVS and PBVS
A.H. Abdul Hafez,Enric Cervera,Jawahar C V
International Conference on Information and Communication Technologies and Applications, ICTTA, 2008
@inproceedings{bib_Hybr_2008, AUTHOR = {A.H. Abdul Hafez, Enric Cervera, Jawahar C V}, TITLE = {Hybrid visual servoing by boosting IBVS and PBVS}, BOOKTITLE = {International Conference on Information and Communication Technologies and Applications}. YEAR = {2008}}
In this paper, we present a novel boosted robot vision control algorithm. This method utilizes on-line boosting to produce a strong vision-based robot controller starting from two weak algorithms. These weak methods are the image-based and position-based visual servoing algorithms. The notion of weak and strong algorithms has been presented in the context of robot vision control. Appropriate error functions are defined for the weak algorithms to evaluate their suitability in the task. The integrated algorithm has superior performance both in image and Cartesian spaces. Experiments validate this claim.
Real Time L∞-based Solution to Multi-view Problems with Application to Visual Servoing
A. H. Abdul Hafez,Jawahar C V
International Conference on Information and Communication Technologies and Applications, ICTTA, 2008
@inproceedings{bib_Real_2008, AUTHOR = {A. H. Abdul Hafez, Jawahar C V}, TITLE = {Real Time L∞-based Solution to Multi-view Problems with Application to Visual Servoing}, BOOKTITLE = {International Conference on Information and Communication Technologies and Applications}. YEAR = {2008}}
In this paper we present a novel real-time algorithm to sequentially solve a class of multi-view geometry problems. The triangulation problem is considered as a case study. The problem concerns the estimation of 3D point coordinates given its images and the matrices of the cameras used in the imaging process. The algorithm has direct application to real-time systems such as virtual reality, visual SLAM, and visual servoing. The application to visual servoing is considered in detail. Experiments have been carried out for the general triangulation problem as well as the application to visual servoing.
Autonomous image-based exploration for mobile robot navigation
SANTOSH KUMAR D,SUPREETH ACHAR,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2008
@inproceedings{bib_Auto_2008, AUTHOR = {SANTOSH KUMAR D, SUPREETH ACHAR, Jawahar C V}, TITLE = {Autonomous image-based exploration for mobile robot navigation}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2008}}
Image-based navigation paradigms have recently emerged as an interesting alternative to conventional model-based methods in mobile robotics. In this paper, we augment the existing image-based navigation approaches by presenting a novel image-based exploration algorithm. The algorithm enables a mobile robot equipped only with a monocular pan-tilt camera to autonomously explore a typical indoor environment. The algorithm infers frontier information directly from the images and displaces the robot towards regions that are informative for navigation. The frontiers are detected using a geometric context-based segmentation scheme that exploits the natural scene structure in indoor environments. In the process, a topological graph of the workspace is built in terms of images, which can subsequently be utilised for the tasks of localisation, path planning and navigation. Experimental results on a mobile robot in unmodified laboratory and corridor environments demonstrate the validity of the approach.
Visual servoing based on gaussian mixture models
A.H. Abdul Hafez,SUPREETH ACHAR,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2008
@inproceedings{bib_Visu_2008, AUTHOR = {A.H. Abdul Hafez, SUPREETH ACHAR, Jawahar C V}, TITLE = {Visual servoing based on gaussian mixture models}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2008}}
In this paper we present a novel approach to robust visual servoing. This method removes the feature tracking step from a typical visual servoing algorithm. We do not need correspondences of the features for deriving the control signal. This is achieved by modeling the image features as a Mixture of Gaussians in the current as well as desired images. Using Lyapunov theory, a control signal is derived to minimize a distance function between the two Gaussian mixtures. The distance function is given in a closed form, and its gradient can be efficiently computed and used to control the system. For simplicity, we first consider the 2D motion case. Then, the general case is presented by introducing the depth distribution of the features to control the six degrees of freedom. Experiments are conducted within a simulation framework to validate our proposed method.
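As a rough illustration of the correspondence-free cost such a controller descends, the following sketch computes the closed-form L2 distance between two Gaussian mixtures, assuming isotropic, equal-weight 2D components; the variance and function names are placeholders rather than the paper's exact formulation.

```python
# L2 distance between two equal-weight, isotropic 2D Gaussian mixtures
# (illustrative sketch; a servo loop would descend the gradient of this cost).
import numpy as np

def gauss_inner(mu_a, mu_b, var):
    """Integral of N(x; mu_a, var*I) * N(x; mu_b, var*I) over the plane."""
    d2 = np.sum((mu_a - mu_b) ** 2)
    return np.exp(-d2 / (4 * var)) / (4 * np.pi * var)

def gmm_l2_distance(mus_p, mus_q, var=1.0):
    """||p - q||^2 = <p,p> - 2<p,q> + <q,q> for equal-weight mixtures."""
    wp, wq = 1.0 / len(mus_p), 1.0 / len(mus_q)
    pp = sum(gauss_inner(a, b, var) for a in mus_p for b in mus_p) * wp * wp
    qq = sum(gauss_inner(a, b, var) for a in mus_q for b in mus_q) * wq * wq
    pq = sum(gauss_inner(a, b, var) for a in mus_p for b in mus_q) * wp * wq
    return pp - 2 * pq + qq

current = [np.array([10.0, 12.0]), np.array([30.0, 40.0])]
desired = [np.array([11.0, 12.5]), np.array([29.0, 41.0])]
print(gmm_l2_distance(current, desired))
```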
Video Completion as Noise Removal
Visesh Chari,Narayanan P J,Jawahar C V
National Conference on Communications, NCC, 2008
@inproceedings{bib_Vide_2008, AUTHOR = {Visesh Chari, Narayanan P J, Jawahar C V}, TITLE = {Video Completion as Noise Removal}, BOOKTITLE = {National Conference on Communications}. YEAR = {2008}}
Video completion algorithms have concentrated on obtaining visually consistent solutions to fill in the missing portions, without any emphasis on the physical correctness of the video. Resulting solutions thus use texture or image-structure based cues and are limited in the situations they can handle. In this paper we take a model-based signal processing approach to video completion [1]. Completion of the video is then defined as satisfying the given model by detecting and removing the error (selected parts of the video to be replaced). Given a probabilistic model, video completion then becomes an unsupervised learning algorithm with the input video giving a “noisy” version. Dense completion is the automatic inferencing of the “noise-less” or “true” video from the input. This approach finds a solution that satisfies visual coherence and is applicable to a wide variety of scenarios. We demonstrate the efficacy of our approach and its wide applicability using two scenarios.
Restoration of Document Images using Bayesian Inference
JYOTIRMOY BANERJEE,Jawahar C V
National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, NCVPRIPG, 2008
@inproceedings{bib_Rest_2008, AUTHOR = {JYOTIRMOY BANERJEE, Jawahar C V}, TITLE = {Restoration of Document Images using Bayesian Inference}, BOOKTITLE = {National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics}. YEAR = {2008}}
Restoration of documents has critical applications in document understanding as well as in digital libraries (for example as in book readers). This paper presents a method for restoration of document images, using a Maximum a Posteriori formulation. The advantage of our method is that the prior need not be learned from the training images. The extraction of a single high-quality enhanced text image from a set of degraded images can benefit from a strong prior knowledge, typical of text images. The restoration process should allow for discontinuities but at the same time discourage oscillations. These properties were represented in a total variation based prior model. Results indicate that our method is appropriate for document image restoration, where resolution enhancement is an added gain.
Super-resolution of text images using edge-directed tangent field
JYOTIRMOY BANERJEE,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2008
@inproceedings{bib_Supe_2008, AUTHOR = {JYOTIRMOY BANERJEE, Jawahar C V}, TITLE = {Super-resolution of text images using edge-directed tangent field}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2008}}
This paper presents an edge-directed super-resolution algorithm for document images without using any training set. This technique creates an image with smooth regions in both the foreground and the background, while allowing sharp discontinuities across and smoothness along the edges. Our method preserves sharp corners in text images by using the local edge direction, which is computed by first evaluating the gradient field and then taking its tangent. Super-resolution of document images is characterized by bimodality, smoothness along the edges, and subsampling consistency. These characteristics are enforced in a Markov Random Field (MRF) framework by defining an appropriate energy function. In our method, subsampling the super-resolution image returns the original low-resolution one, ensuring the consistency of the result. The super-resolution image is generated by iteratively minimizing this energy function. Experimental results on a variety of input images demonstrate the effectiveness of our method for document image super-resolution.
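A brief sketch of the tangent-field computation described above, assuming Sobel derivatives and SciPy availability; the paper's exact filtering and normalization may differ.

```python
# Edge-directed tangent field: the tangent at each pixel is the image gradient
# rotated by 90 degrees (illustrative sketch using Sobel derivatives).
import numpy as np
from scipy.ndimage import sobel

def tangent_field(img: np.ndarray):
    gx = sobel(img, axis=1)          # horizontal derivative
    gy = sobel(img, axis=0)          # vertical derivative
    tx, ty = -gy, gx                 # rotate the gradient by 90 degrees
    mag = np.hypot(tx, ty) + 1e-12   # avoid division by zero in flat regions
    return tx / mag, ty / mag        # unit tangent components
```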
Matching word images for content-based retrieval from printed document images
MILLION MESHESHA,Jawahar C V
International Journal on Document Analysis and Recognition, IJDAR, 2008
@inproceedings{bib_Matc_2008, AUTHOR = {MILLION MESHESHA, Jawahar C V}, TITLE = {Matching word images for content-based retrieval from printed document images}, BOOKTITLE = {International Journal on Document Analysis and Recognition}. YEAR = {2008}}
As large quantities of document images are archived by digital libraries, there is a need for efficient search strategies to make them available according to users' information needs. In this paper, we propose an effective word image matching scheme that achieves high performance in the presence of script variability, printing variation, degradation and word-form variants. A novel partial matching algorithm is designed for morphological matching of word-form variants in a language. We formulate a feature extraction scheme that extracts local features by scanning vertical strips of the word image and combines them automatically based on their discriminatory potential. We present a detailed performance analysis of the proposed approach on English, Amharic and Hindi documents.
Recognition of books by verification and retraining
NEEBA N.V.,Jawahar C V
International conference on Pattern Recognition, ICPR, 2008
@inproceedings{bib_Reco_2008, AUTHOR = {NEEBA N.V., Jawahar C V}, TITLE = {Recognition of books by verification and retraining}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2008}}
The problem of character recognition in a book should be formulated significantly differently from that of a single page or word. An ideal approach to designing such a recognizer is to adapt the classifier to the font and style of the collection. In this paper, we propose an adaptation framework to recognize characters in a book using a learning scheme. In the proposed system, the post-processor verifies the output of the recognition module, which is further used for learning and thus improves the performance over iterations. Experiments are conducted on about 500,000 annotated symbols from five books in Malayalam (an Indian language). We achieve an average improvement of 14% in classification accuracy.
Adaptation and learning for image based navigation
SUPREETH ACHAR,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2008
@inproceedings{bib_Adap_2008, AUTHOR = {SUPREETH ACHAR, Jawahar C V}, TITLE = {Adaptation and learning for image based navigation}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2008}}
Image-based methods are a new approach for solving problems in mobile robotics. Instead of building a metric (3D) model of the environment, these methods work directly in the sensor (image) space. The environment is represented as a topological graph in which each node contains an image taken at some pose in the workspace, and edges connect poses between which a simple path exists. This type of representation is highly scalable and is also well suited to handle the data association problems that affect metric-model-based methods. In this paper, we present an efficient, adaptive method for qualitative localization using content-based image retrieval techniques. In addition, we demonstrate an algorithm which can convert this topological graph into a metric model of the environment by incorporating information about loop closures.
Document image segmentation as a spectral partitioning problem
PRAVEEN DASIGI,RAMAN JAIN,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2008
@inproceedings{bib_Docu_2008, AUTHOR = {PRAVEEN DASIGI, RAMAN JAIN, Jawahar C V}, TITLE = {Document image segmentation as a spectral partitioning problem}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2008}}
State-of-the-art document segmentation algorithms employ ad hoc solutions that use some document properties and iteratively segment the document image. These solutions need to be adapted frequently and sometimes fail to perform well for complex scripts. This calls for a generalized solution that achieves a one-shot segmentation that is globally optimal. This paper describes one such solution based on the optimization problem of spectral partitioning, which makes the segmentation decision based on the spectral properties of the pairwise similarity matrix. The solution described in the paper is shown to be general, global and closed form. The claims have been demonstrated on 142 page images from a Telugu book, set in both poetry and prose layouts. This class of scripts has proved challenging for the existing state-of-the-art algorithms, whereas the proposed solution achieves significant results.
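A rough sketch of the spectral-partitioning idea, assuming a precomputed pairwise similarity matrix between document components; the bipartition below uses the sign of the Fiedler vector of the normalized Laplacian, which is one standard spectral relaxation and not necessarily the paper's exact formulation.

```python
# One-shot spectral bipartition from a pairwise similarity matrix W
# (illustrative sketch; W between document components is assumed given).
import numpy as np

def spectral_bipartition(W: np.ndarray) -> np.ndarray:
    """Return a boolean label per component from the sign of the Fiedler vector."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    fiedler = eigvecs[:, 1]                                 # second-smallest eigenvector
    return fiedler >= 0

# Example: two loosely connected groups of components.
W = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])
print(spectral_bipartition(W))   # e.g. [ True  True False False ]
```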
Attention-based super resolution from videos
DILEEP REDDY VAKA,Narayanan P J,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2008
@inproceedings{bib_Atte_2008, AUTHOR = {DILEEP REDDY VAKA, Narayanan P J, Jawahar C V}, TITLE = {Attention-based super resolution from videos}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2008}}
A video from a moving camera produces different numbers of observations of different scene areas. We can construct an attention map of the scene by bringing the frames to a common reference and counting the number of frames that observed each scene point. Different representations can be constructed from this. The base of the attention map gives the scene mosaic. Super-resolved images of parts of the scene can be obtained using a subset of observations or video frames. We can combine mosaicing with super-resolution by using all observations, but the magnification factor will vary across the scene based on the attention received. The height of the attention map indicates the amount of super-resolution for that scene point. In this paper, we modify the traditional super-resolution framework to generate a varying-resolution image for panning cameras. The varying-resolution image uses all useful data available in a video. We introduce the concept of attention-based super-resolution and give the modified framework for it. We also show its applicability on a few indoor and outdoor videos.
Frequency Domain Visual Servoing using Planar Contours
Visesh Chari,AVINASH SHARMA,Anoop Namboodiri,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2008
@inproceedings{bib_Freq_2008, AUTHOR = {Visesh Chari, AVINASH SHARMA, Anoop Namboodiri, Jawahar C V}, TITLE = {Frequency Domain Visual Servoing using Planar Contours}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2008}}
Fourier domain methods have had a long association with geometric vision. In this paper, we introduce Fourier domain methods into the field of visual servoing for the first time. We show how different properties of Fourier transforms may be used to address specific issues in traditional visual servoing methods, giving rise to algorithms that are more flexible. Specifically, we demonstrate how Fourier analysis may be used to obtain straight camera paths in the Cartesian space, do path following and correspondenceless visual servoing. Most importantly, by introducing Fourier techniques, we set a framework into which robust Fourier based geometry processing algorithms may be incorporated to address the various issues in servoing.
Efficient Implementation of SVM for Large Class Problems
P.ILAYARAJA,NEEBA N.V.,Jawahar C V
International conference on Pattern Recognition, ICPR, 2008
@inproceedings{bib_Effi_2008, AUTHOR = {P.ILAYARAJA, NEEBA N.V., Jawahar C V}, TITLE = {Efficient Implementation of SVM for Large Class Problems}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2008}}
Multiclass classification is an important problem in pattern recognition. Hierarchical SVM classifiers such as DAG-SVM and BHC-SVM are popular in solving multiclass problems. However, a bottleneck with these approaches is the number of component classifiers, and the associated time and space requirements. In this paper, we describe a simple, yet effective method for efficiently storing support vectors that exploits the redundancies in them across the classifiers to obtain significant reduction in storage and computational requirements. We also present our extension to an algebraic exact simplification method for simplifying hierarchical classifier solutions.
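One plausible reading of the support-vector sharing idea is sketched below, assuming that redundancy takes the form of identical vectors reused across component SVMs: each classifier then stores only indices and coefficients into a shared pool. This is an illustration of the storage trick only, not the authors' implementation or their algebraic simplification.

```python
# Deduplicate support vectors across component classifiers of a hierarchical SVM
# (illustrative sketch: exact-match deduplication into one shared pool).
import numpy as np

def share_support_vectors(classifiers):
    """classifiers: list of (support_vectors, dual_coefs) pairs, one per node."""
    pool, index_of, compact = [], {}, []
    for svs, coefs in classifiers:
        idx = []
        for v in svs:
            key = v.tobytes()                 # hashable fingerprint of the vector
            if key not in index_of:
                index_of[key] = len(pool)
                pool.append(v)
            idx.append(index_of[key])
        compact.append((np.array(idx), coefs))  # each node keeps indices + coefficients
    return np.array(pool), compact
```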
FISH:A Practical System for Fast Interactive Image Search in Huge Databases
PRADHEE TANDON,PIYUSH NIGAM,Vikram Pudi,Jawahar C V
International Conference on Image and Video Retrieval, CIVR, 2008
@inproceedings{bib_FISH_2008, AUTHOR = {PRADHEE TANDON, PIYUSH NIGAM, Vikram Pudi, Jawahar C V}, TITLE = {FISH:A Practical System for Fast Interactive Image Search in Huge Databases}, BOOKTITLE = {International Conference on Image and Video Retrieval}. YEAR = {2008}}
The problem of search and retrieval of images using relevance feedback has attracted tremendous attention in recent years from the research community. A real-world-deployable interactive image retrieval system must (1) be accurate, (2) require minimal user-interaction, (3) be efficient, (4) be scalable to large collections (millions) of images, and (5) support multi-user sessions. For good accuracy, we need effective methods for learning the relevance of image features based on user feedback, both within a user-session and across sessions. Efficiency and scalability require a good index structure for retrieving results. The index structure must allow for the relevance of image features to continually change with fresh queries and user-feedback. The state-of-the-art methods available today each address only a subset of these issues. In this paper, we build a complete system FISH - Fast Image Search in Huge databases. In FISH, we integrate selected techniques available in the literature, while adding a few of our own. We perform extensive experiments on real datasets to demonstrate the accuracy, efficiency and scalability of FISH. Our results show that the system can easily scale to millions of images while maintaining interactive response time.
Fast and secure real-time video encryption
C NARSIMHA RAJU,UMA DEVI GANUGULA,Srinathan Kannan,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2008
@inproceedings{bib_Fast_2008, AUTHOR = {C NARSIMHA RAJU, UMA DEVI GANUGULA, Srinathan Kannan, Jawahar C V}, TITLE = {Fast and secure real-time video encryption}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2008}}
Digital content transmission has grown rapidly in the past few years. Security and privacy issues of the transmitted data have become an important concern in multimedia technology. In this paper, we propose a computationally efficient and secure video encryption algorithm. This makes secure video encryption feasible for real-time applications without any extra dedicated hardware. We achieve computational efficiency by exploiting the frequently occurring patterns in the DCT coefficients of the video data. The computational complexity of the encryption is made proportional to the influence of the DCT coefficients on the visual content. On average, our algorithm takes only 8.32 ms of encryption time per frame.
A real-time video encryption exploiting the distribution of the DCT coefficients
C NARSIMHA RAJU,Srinathan Kannan,Jawahar C V
IEEE Region 10 Conference, TENCON, 2008
@inproceedings{bib_A_re_2008, AUTHOR = {C NARSIMHA RAJU, Srinathan Kannan, Jawahar C V}, TITLE = {A real-time video encryption exploiting the distribution of the DCT coefficients}, BOOKTITLE = {IEEE Region 10 Conference}. YEAR = {2008}}
Most of the video encryption algorithms in the literature significantly increase the video size, independent of the encryption scheme employed. This is because these algorithms encrypt DCT coefficients without considering their characteristics or their relationship to the visual content, which adversely affects the transmission throughput. We propose a fast video encryption algorithm that exploits the statistics of the DCT coefficients in video data sets. Our algorithm reduces the video size significantly without any compromise in security. The proposed algorithm performs encryption followed by permutation of the DCT coefficients. On average, the increase in video size is restricted to 23.41% of the original.
A novel video encryption technique based on secret sharing
C NARSIMHA RAJU,UMA DEVI GANUGULA,Srinathan Kannan,Jawahar C V
International Conference on Image Processing, ICIP, 2008
@inproceedings{bib_A_no_2008, AUTHOR = {C NARSIMHA RAJU, UMA DEVI GANUGULA, Srinathan Kannan, Jawahar C V}, TITLE = {A novel video encryption technique based on secret sharing}, BOOKTITLE = {International Conference on Image Processing}. YEAR = {2008}}
The rapid growth of the Internet and digitized content has made video distribution easy, and hence the need for video data protection is on the rise. In this paper, we propose a secure and computationally feasible video encryption algorithm based on the method of Secret Sharing. In an MPEG video, the strength of the DC coefficient is distributed among the AC coefficients based on Shamir's Secret Sharing (SSS) scheme. The proposed algorithm guarantees security, speed and error tolerance with only a small increase in video size.
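For readers unfamiliar with the primitive, here is a minimal sketch of Shamir (k, n) secret sharing over a prime field, of the kind that could be applied to a DC coefficient as the abstract describes; the prime, parameters and helper names are illustrative assumptions, not the paper's implementation.

```python
# Shamir (k, n) secret sharing of an integer secret (illustrative sketch).
import random

PRIME = 2_147_483_647  # a prime larger than any coefficient magnitude we expect

def make_shares(secret: int, k: int, n: int):
    """Evaluate a random degree-(k-1) polynomial with constant term `secret`."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME)
            for x in range(1, n + 1)]

def reconstruct(shares) -> int:
    """Lagrange interpolation at x = 0 recovers the secret from any k shares."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num = den = 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % PRIME
                den = den * (xj - xm) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = make_shares(secret=1234, k=3, n=5)
assert reconstruct(shares[:3]) == 1234   # any 3 of the 5 shares suffice
```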
Private content based image retrieval
SHASHANK JAGARLAMUDI,Kowshik P,Srinathan Kannan,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2008
@inproceedings{bib_Priv_2008, AUTHOR = {SHASHANK JAGARLAMUDI, Kowshik P, Srinathan Kannan, Jawahar C V}, TITLE = {Private content based image retrieval}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2008}}
For content-level access, the database very often needs the query in the form of a sample image. However, the image may contain private information and hence the user does not wish to reveal the image to the database. Private content based image retrieval (PCBIR) deals with retrieving similar images from an image database without revealing the content of the query image - not even to the database server. We propose algorithms for PCBIR when the database is indexed using a hierarchical index structure or a hash-based indexing scheme. Experiments are conducted on real datasets with popular features and state-of-the-art data structures. It is observed that the specialty and subjectivity of image retrieval (unlike SQL queries to a relational database) enable computationally efficient yet private solutions.
On Segmentation of Documents in Complex Scripts
K S SESH KUMAR,SUKESH KUMAR,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2008
@inproceedings{bib_On_S_2008, AUTHOR = {K S SESH KUMAR, SUKESH KUMAR, Jawahar C V}, TITLE = {On Segmentation of Documents in Complex Scripts}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2008}}
Document image segmentation algorithms primarily aim at separating text and graphics in presence of complex layouts. However, for many non-Latin scripts, segmentation becomes a challenge due to the characteristics of the script. In this paper, we empirically demonstrate that successful algorithms for Latin scripts may not be very effective for Indic and complex scripts. We explain this based on the differences in the spatial distribution of symbols in the scripts. We argue that the visual information used for segmentation needs to be enhanced with other information like script models for accurate results.
Robust image registration with illumination, blur and noise variations for super-resolution
HIMANSHU ARORA,Anoop Namboodiri,Jawahar C V
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2008
@inproceedings{bib_Robu_2008, AUTHOR = {HIMANSHU ARORA, Anoop Namboodiri, Jawahar C V}, TITLE = {Robust image registration with illumination, blur and noise variations for super-resolution}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2008}}
Super-resolution reconstruction algorithms assume the availability of exact registration and blur parameters. Inaccurate estimation of these parameters adversely affects the quality of the reconstructed image. However, traditional approaches for image registration are either sensitive to image degradations such as variations in blur, illumination and noise, or are limited in the class of image transformations that can be estimated. We propose an accurate registration algorithm that uses local phase information, which is robust to the above degradations. We derive the theoretical error rate of the estimates in the presence of non-ideal band-pass behavior of the filter and show that the error converges to zero over iterations. We also show the invariance of local phase to a class of blur kernels. Experimental results on images taken under varying conditions clearly demonstrate the robustness of our approach.
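As a simplified stand-in for the phase-based idea, the sketch below estimates a translation with global phase correlation, which is likewise insensitive to smooth illumination changes; the paper itself works with local phase from band-pass filters and handles a richer class of transformations.

```python
# Global phase correlation for integer translation estimation (illustrative sketch).
import numpy as np

def phase_correlation_shift(img_a: np.ndarray, img_b: np.ndarray):
    """Estimate the integer (dy, dx) shift between two equally sized grayscale images."""
    Fa, Fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    cross = Fa * np.conj(Fb)
    cross /= np.maximum(np.abs(cross), 1e-12)        # keep only the phase
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = img_a.shape
    # Wrap peak coordinates to signed shifts.
    return (dy - h if dy > h // 2 else dy, dx - w if dx > w // 2 else dx)
```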
On-line convex optimization based solution for mapping in VSLAM
A.H. Abdul Hafez,Shivudu Bhuvanagiri,K Madhava Krishna,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2008
@inproceedings{bib_On-l_2008, AUTHOR = {A.H. Abdul Hafez, Shivudu Bhuvanagiri, K Madhava Krishna, Jawahar C V}, TITLE = {On-line convex optimization based solution for mapping in VSLAM}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2008}}
This paper presents a novel real-time algorithm to sequentially solve the triangulation problem. The problem addressed is estimation of 3D point coordinates given its images and the matrices of the respective cameras used in the imaging process. The algorithm has direct application to real time systems like visual SLAM. This article demonstrates the application of the proposed algorithm to the mapping problem in visual SLAM. Experiments have been carried out for the general triangulation problem as well as the application to visual SLAM. Results show that the application of the proposed method to mapping in visual SLAM outperforms the state of the art mapping methods.
Machine vision analysis of the energy efficiency of intermodal freight trains
Y-C Lai,C P L Barkan,J Drapa,N Ahuja, J M Hart,Narayanan P J,Jawahar C V, A Kumar, L R Milhon
Journal of Railway and Rapid Transit, JRRT, 2007
@inproceedings{bib_Mach_2007, AUTHOR = {Y-C Lai, C P L Barkan, J Drapa, N Ahuja, J M Hart, Narayanan P J, Jawahar C V, A Kumar, L R Milhon}, TITLE = {Machine vision analysis of the energy efficiency of intermodal freight trains}, BOOKTITLE = {Journal of Railway and Rapid Transit}. YEAR = {2007}}
Intermodal (IM) trains are typically the fastest freight trains operated in North America. The aerodynamic characteristics of many of these trains are often relatively poor, resulting in high fuel consumption. However, considerable variation in fuel efficiency is possible depending on how the loads are placed on railcars in the train. Consequently, substantial potential fuel savings are possible if more attention is paid to the loading configurations of trains. A wayside machine vision (MV) system was developed to automatically scan passing IM trains and assess their aerodynamic efficiency. MV algorithms are used to analyse these images and to detect and measure gaps between loads. In order to make use of the data, a scoring system was developed based on two attributes - the aerodynamic coefficient and slot efficiency. The aerodynamic coefficient is calculated using the Aerodynamic Subroutine of the Train Energy Model. Slot efficiency represents the difference between the actual and ideal loading configuration given the particular set of railcars in the train. This system can provide IM terminal managers with feedback on loading performance for trains and be integrated into the software support systems used for loading assignment.
Optimizing image and camera trajectories in robot vision control using on-line boosting
A. H. Abdul Hafez,Enric Cervera,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2007
@inproceedings{bib_Opti_2007, AUTHOR = {A. H. Abdul Hafez, Enric Cervera, Jawahar C V}, TITLE = {Optimizing image and camera trajectories in robot vision control using on-line boosting}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2007}}
In this paper, we present a novel boosted robot vision control algorithm. The method utilizes on-line boosting to produce a strong vision-based robot control starting from two weak algorithms. The notion of weak and strong algorithms has been presented in the context of robot vision control, in presence of uncertainty in the measurement process. Appropriate probabilistic error functions are defined for the weak algorithm to evaluate their suitability in the task. An on-line boosting algorithm is employed to derive a final strong algorithm starting from two weak algorithms. This strong one has superior performance both in image and Cartesian spaces. Experiments justify this claim.
Kernel approach to autoregressive modeling
RANJEETH KUMAR D,Jawahar C V
National Conference on Communications, NCC, 2007
@inproceedings{bib_Kern_2007, AUTHOR = {RANJEETH KUMAR D, Jawahar C V}, TITLE = {Kernel approach to autoregressive modeling}, BOOKTITLE = {National Conference on Communications}. YEAR = {2007}}
A kernel-based approach for nonlinear modeling of time series data is proposed in this paper. Autoregressive modeling is achieved in a feature space defined by a kernel function using a linear algorithm. The method extends the advantages of conventional autoregressive models to the characterization of nonlinear signals through the intelligent use of kernel functions. Experiments with synthetic signals demonstrate that this method is a promising alternative to nonlinear modeling schemes.
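A small sketch of the general idea, with kernel ridge regression standing in for the linear-in-feature-space autoregressive estimator; the RBF kernel, lag order and regularization are illustrative assumptions rather than the paper's choices.

```python
# Kernelized autoregression: fit y_t = f(s_{t-1}, ..., s_{t-p}) in an RBF feature
# space via kernel ridge regression (illustrative sketch).
import numpy as np

def rbf(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ar_fit(series, p=3, lam=1e-3, gamma=1.0):
    """Build lag vectors and solve the regularized kernel system for dual weights."""
    X = np.array([series[t - p:t][::-1] for t in range(p, len(series))])
    y = np.array(series[p:])
    K = rbf(X, X, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y)
    return X, alpha

def kernel_ar_predict(X_train, alpha, recent, gamma=1.0):
    """Predict the next sample from the last p observed values."""
    x = np.array(recent[::-1])[None, :]
    return (rbf(x, X_train, gamma) @ alpha).item()

series = list(np.sin(np.linspace(0, 10, 200)))
X_train, alpha = kernel_ar_fit(series, p=3)
next_val = kernel_ar_predict(X_train, alpha, series[-3:])
```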
Indigenous scripts of African languages
MILLION MESHESHA,Jawahar C V
Indilinga African Journal of Indigenous Knowledge Systems, IAJIKS, 2007
@inproceedings{bib_Indi_2007, AUTHOR = {MILLION MESHESHA, Jawahar C V}, TITLE = {Indigenous scripts of African languages}, BOOKTITLE = {Indilinga African Journal of Indigenous Knowledge Systems}. YEAR = {2007}}
In Africa there are a number of languages, some of which have their own indigenous scripts. In this paper we present a script analysis for the indigenous scripts of African languages, with particular emphasis on the Amharic language. Amharic is the official and working language of Ethiopia, and has its own writing system. This is the first attempt to analyze the scripts of an African language to ease document analysis and understanding. We believe researchers will continue exploring African indigenous languages and their scripts so that they become part of the evolving information technology for local development. We also highlight problems related to the scripts that have a bearing on Amharic document analysis and understanding. In particular, the availability of a large number of characters and the similarity among characters make the task of document understanding research much tougher than for most Latin-based scripts.
Class-Specific Kernel Selection for Verification Problems
RANJEETH KUMAR D,Jawahar C V
Advances in Pattern Recognition, APR, 2007
@inproceedings{bib_Clas_2007, AUTHOR = {RANJEETH KUMAR D, Jawahar C V}, TITLE = {Class-Specific Kernel Selection for Verification Problems}, BOOKTITLE = {Advances in Pattern Recognition}. YEAR = {2007}}
The single-class verification framework is gaining increasing attention for problems involving authentication and retrieval. In this paper, nonlinear features are extracted using the kernel trick. The class of interest is modeled by using all the available samples rather than a single representative sample. Kernel selection is used to enhance the class specific feature set. A tunable objective function is used to select the kernel which enables the adjustment of the false acceptance and false rejection rates. The errors caused due to the presence of highly similar classes are reduced by using a two-stage hierarchical authentication framework. The performance of the resulting verification system is demonstrated on the hand-geometry based authentication problem with encouraging results.
A system for information retrieval applications on broadcast news videos
TARUN JAIN,SAI RAM KUNALA,RAVI KISHORE KANDALA,Jawahar C V
International Symposium on Data, Information and Knowledge Spectrum, ISDIKS, 2007
@inproceedings{bib_A_sy_2007, AUTHOR = {TARUN JAIN, SAI RAM KUNALA, RAVI KISHORE KANDALA, Jawahar C V}, TITLE = {A system for information retrieval applications on broadcast news videos}, BOOKTITLE = {International Symposium on Data, Information and Knowledge Spectrum}. YEAR = {2007}}
We present a system that is specifically designed for information retrieval applications on broadcast news videos. The system is directly useful to an end user for easy access to the news stories of interest. It also acts as a platform for convenient deployment and experimentation of various video analysis and indexing techniques on real data, and on a large scale. The system is built upon a four-layer architecture with software design choices that make the system highly scalable, extensible and modular. The system has been in use for 20 months and has processed around 70 TB of broadcast news data to date. Extensive performance analysis of the system was done by deploying 14 state-of-the-art desktop systems, and results of the same are reported here. This system holds immense potential for emerging video information retrieval applications.
A vision system for monitoring intermodal freight trains
Avinash Kumar,Narendra Ahuja,John M Hart,UDAY KUMAR VISESH,Narayanan P J,Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2007
@inproceedings{bib_A_vi_2007, AUTHOR = {Avinash Kumar, Narendra Ahuja, John M Hart, UDAY KUMAR VISESH, Narayanan P J, Jawahar C V}, TITLE = {A vision system for monitoring intermodal freight trains}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}. YEAR = {2007}}
We describe the design and implementation of a vision-based Intermodal Train Monitoring System (ITMS) for extracting various features, such as the length of gaps in an intermodal (IM) train, which can later be used for higher-level inferences. An intermodal train is a freight train consisting of two basic types of loads - containers and trailers. Our system first captures the video of an IM train, and applies image processing and machine learning techniques developed in this work to identify the various types of loads as containers and trailers. The whole process relies on a sequence of the following tasks - robust background subtraction in each frame of the video, estimation of train velocity, creation of a mosaic of the whole train from the video, and classification of train loads into containers and trailers. Finally, the length of gaps between the loads of the IM train is estimated and used to analyze the aerodynamic efficiency of the loading pattern of the train, which is a critical aspect of freight trains. This paper focuses on the machine vision aspect of the whole system.
Enhanced video mosaicing using camera motion properties
PARIKH PULKIT TRUSHANT KUMAR,Jawahar C V
IEEE Workshop on Motion and Video Computing, WMVC, 2007
@inproceedings{bib_Enha_2007, AUTHOR = {PARIKH PULKIT TRUSHANT KUMAR, Jawahar C V}, TITLE = {Enhanced video mosaicing using camera motion properties}, BOOKTITLE = {IEEE Workshop on Motion and Video Computing}. YEAR = {2007}}
We propose a video mosaicing scheme which exploits the motion information implicitly available in the video. The information about the camera motion is propagated to the homographies used for mosaicing. While some of the recent approaches make use of the information stemming from non-overlapping pairs of frames, the smoothness of the camera motion has gone largely unexploited. We present a technique which exploits this useful cue for refining homographies. Moreover, a generic framework which exploits the camera motion model to relate homographies in a video is also proposed. The analysis and results of the proposed algorithms demonstrate significant promise in terms of accuracy and robustness.
Modeling Time-Varying Population for Biometric Authentication
P VANDANA,Jawahar C V
International Conference on Computer Theory and Applications, ICCTA, 2007
@inproceedings{bib_Mode_2007, AUTHOR = {P VANDANA, Jawahar C V}, TITLE = {Modeling Time-Varying Population for Biometric Authentication}, BOOKTITLE = {International Conference on Computer Theory and Applications}. YEAR = {2007}}
Population size plays a major role in determining the performance of any biometric authentication system, particularly when such systems are used for civilian applications. In this paper, we propose to improve the performance of a biometric authentication system by modeling the variation in the population participating in the process. We show that this technique is helpful when the number of users enrolled into the system is very large as compared to the number of users which actually participate in the process. We show results on aperiodically and periodically varying population using Markov models.
Visual servoing by optimization of a 2D/3D hybrid objective function
A. H. Abdul Hafez,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2007
@inproceedings{bib_Visu_2007, AUTHOR = {A. H. Abdul Hafez, Jawahar C V}, TITLE = {Visual servoing by optimization of a 2D/3D hybrid objective function}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2007}}
In this paper, we present a new hybrid visual servoing algorithm for the robot arm positioning task. Hybrid methods in visual servoing partially combine the 2D and 3D visual information to improve the performance of traditional image-based and position-based visual servoing. Our algorithm is superior to the state-of-the-art hybrid methods. The objective function has been designed to include the full 2D and 3D information available either from the CAD model or from the partial reconstruction process by decomposing the homography matrix between two views. Here, each of the 2D and 3D error functions is used to control the six degrees of freedom. We call this method 5D visual servoing. The positioning task has been formulated as a minimization problem. Gradient descent as a first-order approximation and Gauss-Newton as a second-order approximation are considered in this paper. Simulation results show that these two methods provide an efficient solution to the camera retreat and feature visibility problems. The camera trajectory in the Cartesian space is also shown to be satisfactory.
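A generic sketch of one damped Gauss-Newton update on a stacked error vector over the 6-DOF pose, using a finite-difference Jacobian; the error function, damping and step structure are illustrative assumptions, not the paper's exact 2D/3D objective.

```python
# One damped Gauss-Newton step on a stacked error vector e(q), q being the
# 6-DOF camera pose parameters (illustrative sketch with a numerical Jacobian).
import numpy as np

def gauss_newton_step(error_fn, q, eps=1e-6, damping=1e-8):
    """error_fn maps a float pose vector q to a stacked residual vector e(q)."""
    e = error_fn(q)
    J = np.zeros((len(e), len(q)))
    for i in range(len(q)):                      # finite-difference Jacobian column
        dq = np.zeros_like(q)
        dq[i] = eps
        J[:, i] = (error_fn(q + dq) - e) / eps
    H = J.T @ J + damping * np.eye(len(q))       # damped normal equations
    return q - np.linalg.solve(H, J.T @ e)       # updated pose estimate
```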
Visual servoing in non-rigid environments: A space-time approach
SANTOSH KUMAR D,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2007
@inproceedings{bib_Visu_2007, AUTHOR = {SANTOSH KUMAR D, Jawahar C V}, TITLE = {Visual servoing in non-rigid environments: A space-time approach}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2007}}
Most robotic vision algorithms are proposed by envisaging robots operating in structured environments where the world is assumed to be rigid. These algorithms fail to provide optimum behavior when the robot has to be controlled with respect to active non-rigid targets. This paper presents a new framework for visual servoing that accomplishes the robot positioning task even in non-rigid environments. We introduce a space-time representation scheme for modeling the deformations of a non-rigid object and propose a model-free hybrid approach that exploits the two-view geometry induced by the space-time features to perform the servoing task. Our formulation can address a variety of non-rigid motions and can tackle large camera displacements without being affected by the degeneracies in the task space. Experimental results validate our approach and demonstrate the robust and stable behavior.
Combining texture and edge planar trackers based on a local quality metric
A. H. Abdul Hafez,UDAY KUMAR VISESH,Jawahar C V
International Conference on Robotics and Automation, ICRA, 2007
@inproceedings{bib_Comb_2007, AUTHOR = {A. H. Abdul Hafez, UDAY KUMAR VISESH, Jawahar C V}, TITLE = {Combining texture and edge planar trackers based on a local quality metric}, BOOKTITLE = {International Conference on Robotics and Automation}. YEAR = {2007}}
A new probabilistic tracking framework for integrating information available from various visual cues is presented in this paper. The framework allows selection of “good” features for each cue, along with factors of their “goodness”, to select the best form of combination. Two particle filter based trackers, which use edge and texture features, run independently. The output of the master tracker is computed through democratic integration using the “goodness” weights. The final output is used as a prior for both trackers in the next iteration. Finally, particle filters are used to deal with non-Gaussian errors in feature extraction and prior computation. Results are shown for planar object tracking.
Probabilistic reverse annotation for large scale image retrieval
Pramod Sankar Kompalli,Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2007
@inproceedings{bib_Prob_2007, AUTHOR = {Pramod Sankar Kompalli, Jawahar C V}, TITLE = {Probabilistic reverse annotation for large scale image retrieval}, BOOKTITLE = {Computer Vision and Pattern Recognition}. YEAR = {2007}}
Automatic annotation is an elegant alternative to explicit recognition in images. In annotation, the image is matched with keyword models, and the most relevant keywords are assigned to the image. Using existing techniques, the annotation time for large collections is very high, while the annotation performance degrades with an increase in the number of keywords. Towards the goal of large-scale annotation, we present an approach called “Reverse Annotation”. Unlike traditional annotation, where keywords are identified for a given image, in Reverse Annotation the relevant images are identified for each keyword. With this seemingly simple shift in perspective, the annotation time is reduced significantly. To be able to rank relevant images, the approach is extended to Probabilistic Reverse Annotation. Our framework is applicable to a wide variety of multimedia documents, and scalable to large collections. Here, we demonstrate the framework over a large collection of 75,000 document images, containing 21 million word segments, annotated by 35,000 keywords. Our image retrieval system matches the response time of text-based search engines.
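Schematically, reverse annotation builds the inverted index keyword-by-keyword rather than image-by-image; the sketch below assumes a toy relevance-scoring function and only illustrates that control flow, not the probabilistic ranking of the paper.

```python
# Reverse annotation control flow: for each keyword, score and keep its most
# relevant images, building a keyword -> image-id inverted index (illustrative sketch).
import heapq

def reverse_annotate(keywords, images, score, top_k=100):
    """score(keyword, image) -> relevance; returns keyword -> ranked image ids."""
    index = {}
    for kw in keywords:
        scored = ((score(kw, img), i) for i, img in enumerate(images))
        index[kw] = [i for _, i in heapq.nlargest(top_k, scored)]
    return index
```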
Optical character recognition of amharic documents
MILLION MESHESHA,Jawahar C V
African Journal of Information and Communication Technology, AJICT, 2007
@inproceedings{bib_Opti_2007, AUTHOR = {MILLION MESHESHA, Jawahar C V}, TITLE = {Optical character recognition of amharic documents}, BOOKTITLE = {African Journal of Information and Communication Technology}. YEAR = {2007}}
In Africa around 2,500 languages are spoken. Some of these languages have their own indigenous scripts. Accordingly, there is a bulk of printed documents available in libraries, information centers, museums and offices. Digitization of these documents makes it possible to harness already available language technologies for local information needs and development. This paper presents an Optical Character Recognition (OCR) system for converting digitized documents in local languages. An extensive literature survey reveals that this is the first attempt to report the challenges in the recognition of indigenous African scripts and a possible solution for the Amharic script. Research in the recognition of African indigenous scripts faces major challenges due to (i) the use of a large number of characters in the writing and (ii) the existence of a large set of visually similar characters. In this paper, we employ a novel feature extraction scheme using principal component and linear discriminant analysis, followed by a decision directed acyclic graph based support vector machine classifier. Recognition results are presented on real-life degraded documents such as books, magazines and newspapers to demonstrate the performance of the recognizer.
Content-level annotation of large collection of printed document images
ANAND KUMAR,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2007
@inproceedings{bib_Cont_2007, AUTHOR = {ANAND KUMAR, Jawahar C V}, TITLE = {Content-level annotation of large collection of printed document images}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2007}}
A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creation of annotated corpora is a tedious task. It is laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotation of large collection of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collection of documents, scanned in three different resolutions, at character level. We employ an XML representation for storage of the annotation information. APIs are provided for access at content level for easy use in training and evaluation of OCRs and other document understanding tasks.
Path planning approach to visual servoing with feature visibility constraints: A convex optimization based solution
A.H. Abdul Hafez,ANIL KUMAR NELAKANTI,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2007
@inproceedings{bib_Path_2007, AUTHOR = {A.H. Abdul Hafez, ANIL KUMAR NELAKANTI, Jawahar C V}, TITLE = {Path planning approach to visual servoing with feature visibility constraints: A convex optimization based solution}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2007}}
This paper explores the possibility of using convex optimization to address a class of problems in visual servoing. This work is motivated by the recent success of convex optimization methods in solving geometric inference problems in computer vision. We formulate the visual servoing problem with feature visibility constraints as a convex optimization of a function of the camera position, i.e., the translation of the camera. First, the path is planned using a potential field method that produces an unconstrained, straight-line path from the initial to the desired camera position. The problem is then converted to a constrained convex optimization problem by introducing the visibility constraints into the minimization problem. The objective of the minimization process is to find, for each camera position, the closest alternative position from which all features are visible. This algorithm ensures that the solution is optimal. The formulation allows the introduction of more constraints, such as the joint limits of the arm, into the visual servoing process. The results have been illustrated in a simulation framework.
Efficient search in document image collections
ANAND KUMAR,Jawahar C V,R. Manmatha
Asian Conference on Computer Vision, ACCV, 2007
@inproceedings{bib_Effi_2007, AUTHOR = {ANAND KUMAR, Jawahar C V, R. Manmatha}, TITLE = {Efficient search in document image collections}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2007}}
This paper presents an efficient indexing and retrieval scheme for searching in document image databases. In many non-European languages, optical character recognizers are not very accurate. Word spotting - word image matching - may instead be used to retrieve word images in response to a word image query. The approaches used for word spotting so far, dynamic time warping and/or nearest neighbor search, tend to be slow. Here, indexing is done using locality sensitive hashing (LSH) - a technique which computes multiple hashes - using word image features computed at the word level. Efficiency and scalability are achieved by content-sensitive hashing implemented through approximate nearest neighbor computation. We demonstrate that the technique achieves high precision and recall (in the 90% range), using a large image corpus consisting of seven books by Kalidasa (a well-known Indian poet of antiquity) in the Telugu language. The accuracy is comparable to using dynamic time warping and nearest neighbor search while the speed is orders of magnitude better - 20,000 word images can be searched in milliseconds.
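To give a flavour of how such an index is built and queried, here is a minimal sketch of random-hyperplane LSH over word-image descriptors; the dimensionality, number of bits and number of tables are arbitrary assumptions, and the candidates returned by a query would still be re-ranked with an exact distance.

```python
# Random-hyperplane LSH index over word-image feature vectors (illustrative sketch).
import numpy as np
from collections import defaultdict

class LSHIndex:
    def __init__(self, dim: int, n_bits: int = 16, n_tables: int = 4, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One random projection matrix per hash table.
        self.planes = [rng.normal(size=(n_bits, dim)) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _key(self, planes, v):
        # The bucket key is the sign pattern of the projections.
        return ((planes @ v) > 0).tobytes()

    def add(self, idx: int, v: np.ndarray) -> None:
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, v)].append(idx)

    def query(self, v: np.ndarray):
        # Union of all buckets the query falls into; re-rank these candidates exactly.
        hits = set()
        for planes, table in zip(self.planes, self.tables):
            hits.update(table.get(self._key(planes, v), []))
        return hits

# Usage: index 10,000 hypothetical 64-d word descriptors, then query one.
rng = np.random.default_rng(1)
feats = rng.normal(size=(10_000, 64))
index = LSHIndex(dim=64)
for i, f in enumerate(feats):
    index.add(i, f)
candidates = index.query(feats[42])
```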
Self adaptable recognizer for document image collections
MILLION MESHESHA,Jawahar C V
Conference on Pattern Recognition and Machine Intelligence, PReMI, 2007
@inproceedings{bib_Self_2007, AUTHOR = {MILLION MESHESHA, Jawahar C V}, TITLE = {Self adaptable recognizer for document image collections}, BOOKTITLE = {Conference on Pattern Recognition and Machine Intelligence}. YEAR = {2007}}
This paper presents an architecture that enables the recognizer to learn incrementally and, thereby adapt to document image collections for performance improvement. We argue that the recognition scheme for a book could be considerably different from that designed for isolated pages. We employ learning procedures to capture the relevant information available online, and feed it back to update the knowledge of the system. Experimental results show the effectiveness of our design for improving the performance on-the-fly.
Efficient search with changing similarity measures on large multimedia datasets
NATARAJ J,Vikram Pudi,Jawahar C V
International Conference on MultiMedia Modeling, MMM, 2007
@inproceedings{bib_Effi_2007, AUTHOR = {NATARAJ J, Vikram Pudi, Jawahar C V}, TITLE = {Efficient search with changing similarity measures on large multimedia datasets}, BOOKTITLE = {International Conference on MultiMedia Modeling}. YEAR = {2007}}
In this paper, we consider the problem of finding the k most similar objects given a query object, in large multimedia datasets. We focus on scenarios where the similarity measure itself is not fixed, but is continuously being refined with user feedback. Conventional database techniques for efficient similarity search are not effective in this environment as they take a specific similarity/distance measure as input and build index structures tuned for that measure. Our approach works effectively in this environment as validated by the experimental study where we evaluate it over a wide range of datasets. The experiments show it to be efficient and scalable. In fact, on all our datasets, the response times were within a few seconds, making our approach suitable for interactive applications.
Support Vector Machine based Hierarchical Classifiers for Large Class Problems
CH.TEJO KRISHNA,Anoop Namboodiri,Jawahar C V
International Conference on Applied Pattern Recognition, ICAPR, 2007
@inproceedings{bib_Supp_2007, AUTHOR = {CH.TEJO KRISHNA, Anoop Namboodiri, Jawahar C V}, TITLE = {Support Vector Machine based Hierarchical Classifiers for Large Class Problems}, BOOKTITLE = {International Conference on Applied Pattern Recognition}. YEAR = {2007}}
One of the prime challenges in designing a classifier for large-class problems such as Indian language OCRs is the presence of a large set of similar looking characters. The nature of the character set introduces problems with the accuracy and efficiency of the classifier. Hierarchical classifiers such as Binary Hierarchical Decision Trees (BHDTs) using SVMs as component classifiers have been effectively used to tackle such large-class classification problems. The accuracy and efficiency of a BHDT classifier depend on: i) the accuracy of the component classifiers, ii) the separability of the clusters at each node in the hierarchical classifier, and iii) the balance of the BHDT. We propose methods to tackle each of the above problems in the case of binary character images. We present a new distance measure which is intuitively suitable when Support Vector Machines are used as component classifiers. We also propose a novel method for balancing the BHDT to improve its efficiency while maintaining accuracy. Finally, we propose a method to generate overlapping partitions to improve the accuracy of BHDTs. Comparison with other classifier combination techniques such as 1-vs-1, 1-vs-Rest and Decision Directed Acyclic Graphs shows that the proposed approach is highly efficient, while being comparable with the more expensive techniques in terms of accuracy. The experiments focus on the problem of Indian language OCR, although the framework is usable for other problems as well.
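The sketch below shows the basic BHDT construction with SVM nodes: classes are recursively split into two groups (here by clustering class means with k-means, a generic heuristic standing in for the paper's distance measure and balancing method), and a binary SVM is trained at each internal node; the data is synthetic and only illustrative:

```python
# Minimal binary hierarchical decision tree (BHDT) with SVM nodes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

class Node:
    def __init__(self, classes):
        self.classes = list(classes)
        self.svm = self.left = self.right = None

def build(X, y, classes):
    node = Node(classes)
    if len(classes) == 1:
        return node                                  # leaf: a single class remains
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(means)
    left = [c for c, s in zip(classes, split) if s == 0]
    right = [c for c, s in zip(classes, split) if s == 1]
    if not left or not right:                        # guard against a degenerate split
        left, right = classes[:1], classes[1:]
    go_left = np.isin(y, left)
    node.svm = LinearSVC().fit(X, go_left)           # binary SVM decides left vs right
    node.left = build(X[go_left], y[go_left], left)
    node.right = build(X[~go_left], y[~go_left], right)
    return node

def classify(node, x):
    while node.svm is not None:
        node = node.left if node.svm.predict(x[None])[0] else node.right
    return node.classes[0]

# Toy usage with random "character" features for 10 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) + np.repeat(np.arange(10), 50)[:, None]
y = np.repeat(np.arange(10), 50)
tree = build(X, y, list(range(10)))
print("predicted class:", classify(tree, X[0]))
```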
Accurate image registration from local phase information
HIMANSHU ARORA,Anoop Namboodiri,Jawahar C V
National Conference on Communications, NCC, 2007
@inproceedings{bib_Accu_2007, AUTHOR = {HIMANSHU ARORA, Anoop Namboodiri, Jawahar C V}, TITLE = {Accurate image registration from local phase information}, BOOKTITLE = {National Conference on Communications}. YEAR = {2007}}
Accurate registration of images is essential for many computer vision algorithms for medical image analysis, super-resolution, and image mosaicing. Performance of traditional correspondence-based approaches is restricted by the reliability of the feature detector. Popular frequency domain approaches use the magnitude of global frequencies for registration, and are limited in the class of transformations that can be estimated. We propose the use of local phase information for accurate image registration as it is robust to noise and illumination conditions and the estimates are obtained at sub-pixel accuracy without any correspondence computation. We form an overdetermined system of equations from the phase differences to estimate the parameters of image registration. We demonstrate the effectiveness of the approach for affine transformation under Gaussian white noise and varying illumination conditions.
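A translation-only simplification of the idea: for a pure (circular) shift, the phase difference between the Fourier transforms of the two images is linear in frequency, so the shift can be recovered at sub-pixel accuracy by least squares over an overdetermined set of phase-difference equations. Only low frequencies are used here to sidestep phase wrapping; this is a sketch, not the affine formulation of the paper:

```python
# Recover a translation from Fourier phase differences by least squares.
import numpy as np

def estimate_shift(im1, im2, n_freq=3):
    F1, F2 = np.fft.fft2(im1), np.fft.fft2(im2)
    H, W = im1.shape
    rows, rhs = [], []
    for u in range(-n_freq, n_freq + 1):
        for v in range(-n_freq, n_freq + 1):
            if u == 0 and v == 0:
                continue
            dphi = np.angle(F2[u % H, v % W] * np.conj(F1[u % H, v % W]))
            rows.append([u / H, v / W])        # one phase-difference equation per frequency
            rhs.append(-dphi / (2 * np.pi))
    A, b = np.array(rows), np.array(rhs)
    (dy, dx), *_ = np.linalg.lstsq(A, b, rcond=None)
    return dy, dx

# Toy usage: shift an image circularly by (2, 3) pixels and recover the shift.
rng = np.random.default_rng(0)
img = rng.normal(size=(64, 64))
shifted = np.roll(img, shift=(2, 3), axis=(0, 1))
print(estimate_shift(img, shifted))            # approximately (2.0, 3.0)
```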
On using classical poetry structure for Indian language post-processing
Anoop Namboodiri,Narayanan P J,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2007
@inproceedings{bib_On_u_2007, AUTHOR = {Anoop Namboodiri, Narayanan P J, Jawahar C V}, TITLE = {On using classical poetry structure for Indian language post-processing}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2007}}
Post-processors are critical to the performance of language recognizers like OCRs, speech recognizers, etc. Dictionary-based post-processing commonly employs either an algorithmic approach or a statistical approach. Other linguistic features are not exploited for this purpose, and the language analysis is also largely limited to the prose form. This paper proposes a framework to use the rich metric and formal structure of classical poetic forms in Indian languages for post-processing a recognizer such as an OCR engine. We show that the structure present in the form of the vṛtta and prāsa can be efficiently used to disambiguate some cases that may be difficult for an OCR. The approach is efficient, complementary to other post-processing approaches, and can be used in conjunction with them.
Text driven temporal segmentation of cricket videos
Pramod Sankar Kompalli,SAURABH KUMAR PANDEY,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2006
@inproceedings{bib_Text_2006, AUTHOR = {Pramod Sankar Kompalli, SAURABH KUMAR PANDEY, Jawahar C V}, TITLE = {Text driven temporal segmentation of cricket videos}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2006}}
In this paper we address the problem of temporal segmentation of videos. We present a multi-modal approach where clues from different information sources are merged to perform the segmentation. Specifically, we segment videos based on textual descriptions or commentaries of the action in the video. Such parallel information is available for cricket videos, a class of videos where visual-feature-based (bottom-up) scene segmentation algorithms generally fail due to the lack of visual dissimilarity across space and time. With additional top-down information from the textual domain, these ambiguities can be resolved to a large extent. The video is segmented into meaningful entities or scenes using the scene-level descriptions provided by the commentary. These segments can then be automatically annotated with the respective descriptions. This allows for semantic access and retrieval of video segments, which is difficult to obtain from existing visual-feature-based approaches. We also present techniques for automatic highlight generation using our scheme.
Robust homography-based control for camera positioning in piecewise planar environments
SANTOSH KUMAR D,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2006
@inproceedings{bib_Robu_2006, AUTHOR = {SANTOSH KUMAR D, Jawahar C V}, TITLE = {Robust homography-based control for camera positioning in piecewise planar environments}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2006}}
This paper presents a vision-based control for positioning a camera with respect to an unknown piecewise planar object. We introduce a novel homography-based approach that integrates information from multiple homographies to reliably estimate the relative displacement of the camera. This approach is robust to image measurement errors and provides a stable estimate of the camera motion that is free from degeneracies in the task space. We also develop a new control formulation that meets the contradictory requirements of producing a decoupled camera trajectory and ensuring object visibility by utilizing only the homography relating the two views. Experimental results validate the efficiency and robustness of our approach and demonstrate its applicability.
Discriminative actions for recognising events
KARTEEK ALAHARI,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2006
@inproceedings{bib_Disc_2006, AUTHOR = {KARTEEK ALAHARI, Jawahar C V}, TITLE = {Discriminative actions for recognising events}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2006}}
This paper presents an approach to identify the importance of different parts of a video sequence from the recognition point of view. It builds on the observations that: (1) events consist of more fundamental (or atomic) units, and (2) a discriminant-based approach is more appropriate for the recognition task, when compared to the standard modelling techniques, such as PCA, HMM, etc. We introduce discriminative actions which describe the usefulness of the fundamental units in distinguishing between events. We first extract actions to capture the fine characteristics of individual parts in the events. These actions are modelled and their usefulness in discriminating between events is estimated as a score. The score highlights the important parts (or actions) of the event from the recognition aspect. Applicability of the approach on different classes of events is demonstrated along with a statistical analysis.
Enabling search over large collections of telugu document images–an automatic annotation based approach
Pramod Sankar Kompalli,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2006
@inproceedings{bib_Enab_2006, AUTHOR = {Pramod Sankar Kompalli, Jawahar C V}, TITLE = {Enabling search over large collections of telugu document images–an automatic annotation based approach}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2006}}
For the first time, search is enabled over a massive collection of 21 million word images from digitized document images. This work advances the state of the art on multiple fronts: i) Indian language document images are made searchable by textual queries, ii) interactive content-level access is provided to document images for search and retrieval, iii) a novel recognition-free approach that does not require an OCR is adapted and validated, iv) a suite of image processing and pattern classification algorithms is proposed to efficiently automate the process, and v) the scalability of the solution is demonstrated over a large collection of 500 digitised books consisting of 75,000 pages. Character recognition based approaches yield poor results for developing search engines for Indian language document images, due to the complexity of the script and the poor quality of the documents. Recognition-free approaches, based on word spotting, are not directly scalable to large collections, due to the computational complexity of matching images in the feature space. For example, if it takes 1 ms to match two images, retrieving documents for a single query from a collection as large as ours would require close to a day. In this paper we propose a novel automatic annotation based approach to provide a textual description of document images. With a one-time, offline computational effort, we are able to build a text-based retrieval system over the annotated images. This system has an interactive response time of about 0.01 seconds. However, we pay the price in the form of massive offline computation, which is performed on a cluster of 35 computers for about a month. Our procedure is highly automatic, requiring minimal human intervention.
Computing eigen space from limited number of views for recognition
PARESH KUMAR JAIN,KARTIK RAO POLEPALLI,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2006
@inproceedings{bib_Comp_2006, AUTHOR = {PARESH KUMAR JAIN, KARTIK RAO POLEPALLI, Jawahar C V}, TITLE = {Computing eigen space from limited number of views for recognition}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2006}}
This paper presents a novel approach to construct an eigen space representation from a limited number of views, equivalent to the one obtained from a large number of images captured from multiple viewpoints. The procedure implicitly incorporates a novel view synthesis algorithm in the eigen space construction process. Inherent information in an appearance representation is enhanced using geometric computations. We experimentally verify the performance for orthographic, affine and projective camera models. Recognition results on the COIL and SOIL image databases are promising.
Textual search in graphics stream of PDF
A.Balasubramanian ,Jawahar C V
International Conference on Digital Libraries, ICDLi, 2006
@inproceedings{bib_Text_2006, AUTHOR = {A.Balasubramanian , Jawahar C V}, TITLE = {Textual search in graphics stream of PDF}, BOOKTITLE = {International Conference on Digital Libraries}. YEAR = {2006}}
Digitized books and manuscripts in digital libraries are often stored as images or graphics. They are not searchable at the content level due to the lack of OCRs or poor quality of the scanned images. Portable Document Format (PDF) has emerged as the most popular document representation schema for wider access across platforms. When there is no textual (UNICODE, ASCII) representation available, scanned images are stored in the graphics stream of PDF. In this paper, we propose a novel solution to search the textual data in graphics stream of the PDF files at content level. The proposed solution is demonstrated by enhancing an open source PDF viewer (Xpdf). Indian language support is also provided. Users can type a word in Roman (ITRANS), view it in a font, and search in textual and graphics stream of PDF documents simultaneously.
Dynamic events as mixtures of spatial and temporal features
KARTEEK ALAHARI,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2006
@inproceedings{bib_Dyna_2006, AUTHOR = {KARTEEK ALAHARI, Jawahar C V}, TITLE = {Dynamic events as mixtures of spatial and temporal features}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2006}}
Dynamic events comprise spatiotemporal atomic units. In this paper we model them using a mixture model. Events are represented using a framework based on the Mixture of Factor Analyzers (MFA) model. Note that our framework is generic and is applicable to any mixture modelling scheme. The MFA, used to demonstrate the novelty of our approach, clusters events into spatially coherent mixtures in a low-dimensional space. Based on the observations that (i) events comprise varying degrees of spatial and temporal characteristics, and (ii) the number of mixtures determines the composition of these features, we propose a method that incorporates models with varying numbers of mixtures. For a given event, the relative importance of each model component is estimated, thereby choosing the appropriate feature composition. The capabilities of the proposed framework are demonstrated with an application: recognition of events such as hand gestures and activities.
Task specific factors for video characterization
RANJEETH KUMAR D,S.MANIKANDAN,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2006
@inproceedings{bib_Task_2006, AUTHOR = {RANJEETH KUMAR D, S.MANIKANDAN, Jawahar C V}, TITLE = {Task specific factors for video characterization}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2006}}
Factorization methods are used extensively in computer vision for a wide variety of tasks. Existing factorization techniques extract factors that meet requirements such as compact representation, interpretability, efficiency, dimensionality reduction, etc. However, when the extracted factors lack interpretability and are large in number, identifying the factors that cause the data to exhibit certain properties of interest is useful for solving a variety of problems. Identification of such factors, or factor selection, has interesting applications in data synthesis and recognition. In this paper, simple and efficient methods are proposed for identifying factors of interest from a pool of factors obtained by decomposing videos, represented as tensors, into their constituent low-rank factors. The method is used to select factors that enable appearance-based facial expression transfer and facial expression recognition. Experimental results demonstrate that the factor selection facilitates efficient solutions to these problems with promising results.
Retrieval from document image collections
A. Balasubramanian,MILLION MESHESHA,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2006
@inproceedings{bib_Retr_2006, AUTHOR = {A. Balasubramanian, MILLION MESHESHA, Jawahar C V}, TITLE = {Retrieval from document image collections}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2006}}
This paper presents a system for retrieval of relevant documents from large document image collections. We achieve effective search and retrieval from a large collection of printed document images by matching image features at the word level. For representing the words, profile-based and shape-based features are employed. A novel DTW-based partial matching scheme is employed to take care of morphologically variant words. This is useful for grouping together similar words during the indexing process. The system supports cross-lingual search using OM-Trans transliteration and a dictionary-based approach. System-level issues for retrieval (e.g., scalability, effective delivery) are also addressed in this paper.
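For reference, a plain DTW matcher between two word-image profile sequences is sketched below; the column-wise ink-count profile and the plain (non-partial) DTW are generic stand-ins for the features and the partial matching scheme described in the paper:

```python
# Standard dynamic time warping between two word-image profile sequences.
import numpy as np

def dtw(a, b):
    """Return the length-normalised DTW alignment cost between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def vertical_profile(word_img, threshold=128):
    """Toy word-image feature: number of ink pixels in each column."""
    return (word_img < threshold).sum(axis=0).astype(float)

# Toy usage: two binary 'word images' whose profiles differ only by stretching.
img1 = np.where(np.random.default_rng(0).random((32, 80)) > 0.5, 255, 0)
img2 = np.repeat(img1, 2, axis=1)[:, :120]     # horizontally stretched variant
print("DTW cost:", dtw(vertical_profile(img1), vertical_profile(img2)))
```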
Digitizing a million books: Challenges for document analysis
Pramod Sankar Kompalli,Vamshi Ambati,P LAKSHMI AASRITHA,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2006
@inproceedings{bib_Digi_2006, AUTHOR = {Pramod Sankar Kompalli, Vamshi Ambati, P LAKSHMI AASRITHA, Jawahar C V}, TITLE = {Digitizing a million books: Challenges for document analysis}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2006}}
This paper describes the challenges for the document image analysis community in building large digital libraries with diverse document categories. The challenges are identified from the experience of the ongoing activities toward digitizing and archiving one million books. A smooth workflow has been established for archiving a large quantity of books with the help of efficient image processing algorithms. However, much more research is needed to address the challenges arising from the diversity of the content in digital libraries.
A semi-automatic adaptive OCR for digital libraries
SACHIN KUMAR RAWAT,K S SESH KUMAR,MILLION MESHESHA,INDRANEEL DEB SIKDAR,A. Balasubramanian,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2006
@inproceedings{bib_A_se_2006, AUTHOR = {SACHIN KUMAR RAWAT, K S SESH KUMAR, MILLION MESHESHA, INDRANEEL DEB SIKDAR, A. Balasubramanian, Jawahar C V}, TITLE = {A semi-automatic adaptive OCR for digital libraries}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2006}}
This paper presents a novel approach for designing a semi-automatic adaptive OCR for large document image collections in digital libraries. We describe an interactive system for continuous improvement of OCR results, and implement a semi-automatic, adaptive system. The applicability of our design to the recognition of Indian languages is demonstrated. Recognition errors are used to retrain the OCR so that it adapts and learns to improve its accuracy. Limited human intervention is allowed for evaluating the output of the system and taking corrective actions during the recognition process.
Homography estimation from planar contours
PARESH KUMAR JAIN,Jawahar C V
International Symposium on 3D Data Processing Visualization and Transmission, 3DPVT, 2006
@inproceedings{bib_Homo_2006, AUTHOR = {PARESH KUMAR JAIN, Jawahar C V}, TITLE = {Homography estimation from planar contours}, BOOKTITLE = {International Symposium on 3D Data Processing Visualization and Transmission}. YEAR = {2006}}
Homography estimation is an important step in many computer vision algorithms. Most existing algorithms estimate the homography from point or line correspondences which are difficult to reliably obtain in many real-life situations. In this paper we propose a technique based on correspondences of contours. Homography estimation is carried out in Fourier domain. Starting from an affine estimate, the proposed algorithm computes the projective homography in an iterative manner. This technique does not require explicit point to point correspondences; in fact such point correspondences are a by-product of the proposed algorithm. Experimental results and applications validate the use of our technique.
Target model estimation using particle filters for visual servoing
A. H. Abdul Hafez,Jawahar C V
International conference on Pattern Recognition, ICPR, 2006
@inproceedings{bib_Targ_2006, AUTHOR = {A. H. Abdul Hafez, Jawahar C V}, TITLE = {Target model estimation using particle filters for visual servoing}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2006}}
In this paper, we present a novel method for model estimation for visual servoing. The method employs a particle filter algorithm to estimate the depth of the image features online. A Gaussian probabilistic model is employed to model the object points in the current camera frame. A set of 3D samples drawn from the model is projected into the image space in the next frame. The 3D sample that maximizes the likelihood is the most probable real-world 3D point. The variance of the depth density function converges to a very small value within a few iterations. Results show an accurate estimate of the depth/model and a high level of stability in the visual servoing process.
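The following toy sketch illustrates the particle-filter idea in isolation: each particle is a depth hypothesis for a tracked feature, propagated through a known camera translation and weighted by how well it predicts the observed image position. The pinhole model, camera motion, and noise levels are made-up values for the illustration, not those of the paper:

```python
# Toy particle filter over the unknown depth of one tracked image feature.
import numpy as np

rng = np.random.default_rng(0)
f = 500.0                              # focal length in pixels
true_Z = 2.0                           # ground-truth depth (unknown to the filter)
u0 = np.array([40.0, -25.0])           # feature observed in the first frame (pixels)
X = np.r_[u0 * true_Z / f, true_Z]     # corresponding 3-D point in the first camera frame

particles = rng.uniform(0.5, 5.0, size=2000)          # depth hypotheses

for step in range(10):
    t = np.array([0.01, 0.0, 0.02]) * (step + 1)      # known camera translation
    obs = f * (X[:2] - t[:2]) / (X[2] - t[2]) + rng.normal(0, 0.5, 2)
    # Predict the observation for every depth hypothesis.
    P = np.c_[u0 * particles[:, None] / f, particles]           # 3-D point per particle
    pred = f * (P[:, :2] - t[:2]) / (P[:, 2:3] - t[2])
    weights = np.exp(-0.5 * ((pred - obs) ** 2).sum(axis=1) / 0.5 ** 2)
    weights /= weights.sum()
    # Resample and add a little jitter to keep the particle set diverse.
    particles = rng.choice(particles, size=len(particles), p=weights)
    particles += rng.normal(0, 0.01, size=len(particles))

print("estimated depth:", particles.mean(), "true depth:", true_Z)
```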
Visual servoing in presence of non-rigid motion
SANTOSH KUMAR D,Jawahar C V
International conference on Pattern Recognition, ICPR, 2006
@inproceedings{bib_Visu_2006, AUTHOR = {SANTOSH KUMAR D, Jawahar C V}, TITLE = {Visual servoing in presence of non-rigid motion}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2006}}
Most robotic vision algorithms have been proposed by envisaging robots operating in industrial environments, where the world is assumed to be static and rigid. These algorithms cannot be used in environments where the assumption of a rigid world does not hold. In this paper, we study the problem of visual servoing in the presence of non-rigid objects and analyze the design of servoing strategies needed to perform optimally even in unconventional environments. We also propose a servoing algorithm that is robust to non-rigidity. The algorithm extracts invariant features of the non-rigid object and uses these features in the servoing process. We validate the technique with experiments and demonstrate its applicability.
Efficient region based indexing and retrieval for images with elastic bucket tries
PATHAPATI SUMAN KARTHIK,Jawahar C V
International conference on Pattern Recognition, ICPR, 2006
@inproceedings{bib_Effi_2006, AUTHOR = {PATHAPATI SUMAN KARTHIK, Jawahar C V}, TITLE = {Efficient region based indexing and retrieval for images with elastic bucket tries}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2006}}
Retrieval and indexing in multimedia databases have long been active topics in both the Information Retrieval and computer vision communities. In this paper we propose a novel region-based indexing and retrieval scheme for images. First, we present our virtual textual description, using which images are converted to text documents containing keywords. We then show how these documents can be indexed and retrieved using modified elastic bucket tries, and show that our approach is an order of magnitude better than standard spatial indexing approaches. We also describe the operations required for dealing with complex features like relevance feedback. Finally, we analyze the method comparatively and validate our approach.
Learning mixtures of offline and online features for handwritten stroke recognition
KARTEEK ALAHARI,SATYA LAHARI PUTREVU,Jawahar C V
International conference on Pattern Recognition, ICPR, 2006
@inproceedings{bib_Lear_2006, AUTHOR = {KARTEEK ALAHARI, SATYA LAHARI PUTREVU, Jawahar C V}, TITLE = {Learning mixtures of offline and online features for handwritten stroke recognition}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2006}}
In this paper we propose a novel scheme to combine offline and online features of handwritten strokes. The state-of-the-art methods in handwritten stroke recognition have used a pre-determined combination of these features, which is not optimal in all situations. The proposed model addresses this issue by learning mixtures of offline and online characteristics from a set of exemplars. Each stroke is represented as a probabilistic sequence of substrokes with varying compositions of these features. The model adapts to any stroke and chooses the feature composition that best characterizes it. The superiority of the method is demonstrated on handwritten numeral and character strokes.
Integration framework for improved visual servoing in image and cartesian spaces
A. H. Abdul Hafez,Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2006
@inproceedings{bib_Inte_2006, AUTHOR = {A. H. Abdul Hafez, Jawahar C V}, TITLE = {Integration framework for improved visual servoing in image and cartesian spaces}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}. YEAR = {2006}}
In this paper, we present a new integration method for improving the performance of visual servoing. The method integrates both image-based visual servoing (IBVS) and position-based visual servoing (PBVS) to satisfy the requirements of the visual servoing process. We define a probabilistic integration rule for the IBVS and PBVS controllers. Density functions that determine the probability of each controller are defined to satisfy the above constraints. We prove that this integration method provides global stability and avoids local minima. The new integration method is validated on positioning tasks and compared with other switching methods.
Synthesis of online handwriting in indian languages
Jawahar C V,A.Balasubramanian
International Workshop on Frontiers in Handwriting Recognition, IWFHR, 2006
@inproceedings{bib_Synt_2006, AUTHOR = {Jawahar C V, A.Balasubramanian}, TITLE = {Synthesis of online handwriting in indian languages}, BOOKTITLE = {International Workshop on Frontiers in Handwriting Recognition}. YEAR = {2006}}
Synthesis of handwriting has a variety of applications including generation of personalized documents, study of writing styles, automatic generation of data for training recognizers, and matching of handwritten data for retrieval. Most of the existing algorithms for handwriting synthesis deal with English, where the spatial layout of the components is relatively simple, while the cursiveness of the script introduces many challenges. In this paper, we present a synthesis model for generating handwritten data for Indian languages, where the layout of characters is complex while the script is fundamentally non-cursive. The algorithm learns from annotated data and improves its representation with feedback.
The digital library of india project: Process, policies and architecture
Vamshi Ambati, N.Balakrishnan,Raj Reddy,P LAKSHMI AASRITHA,Jawahar C V
International Conference on Digital Libraries, ICDLi, 2006
@inproceedings{bib_The__2006, AUTHOR = {Vamshi Ambati, N.Balakrishnan, Raj Reddy, P LAKSHMI AASRITHA, Jawahar C V}, TITLE = {The digital library of india project: Process, policies and architecture}, BOOKTITLE = {International Conference on Digital Libraries}. YEAR = {2006}}
In this paper we share the experience gained from establishing a process and a supporting architecture for the Digital Library of India (DLI) project. The DLI project was started with a vision of digitizing books and making them available online, in a searchable and browseable form. The digitization of the books takes place at geographically distributed locations. This raises many issues related to policy and collaboration. We discuss these problems in detail and present the process and workflow that is established to solve them. We also share the architecture of the project that supports the smooth implementation of the process. The architecture of the DLI project has been arrived at after considering factors like high performance, scalability, availability and economy.
Analysis of relevance feedback in content based image retrieval
PATHAPATI SUMAN KARTHIK,Jawahar C V
International Conference on Control, Automation, Robotics and Vision, ICARCV, 2006
@inproceedings{bib_Anal_2006, AUTHOR = {PATHAPATI SUMAN KARTHIK, Jawahar C V}, TITLE = {Analysis of relevance feedback in content based image retrieval}, BOOKTITLE = {International Conference on Control, Automation, Robotics and Vision}. YEAR = {2006}}
Relevance feedback in Content Based Image Retrieval (CBIR) has been an active field of research for quite some time. Many schemes and techniques for relevance feedback exist, with many assumptions and operating criteria. Yet there exist few ways of quantitatively measuring and comparing different relevance feedback algorithms. Such analysis is necessary if a CBIR system is to perform consistently. In this paper we propose an abstract model of a CBIR system in which the effects of different modules on the entire system are observed. Using this model we thoroughly analyse the performance of a set of basic relevance feedback algorithms. Besides using standard measures like precision and recall, we also suggest two new measures to gauge the performance of any contemporary CBIR system.
Probabilistic integration of 2D and 3D cues for visual servoing
A. H. Abdul Hafez,Jawahar C V
International Conference on Control, Automation, Robotics and Vision, ICARCV, 2006
@inproceedings{bib_Prob_2006, AUTHOR = {A. H. Abdul Hafez, Jawahar C V}, TITLE = {Probabilistic integration of 2D and 3D cues for visual servoing}, BOOKTITLE = {International Conference on Control, Automation, Robotics and Vision}. YEAR = {2006}}
In this paper we present a new integration method for improving the performance of visual servoing. The method integrates image-based visual servoing (IBVS) and position-based visual servoing (PBVS) approaches to satisfy the widely varying requirements of the visual servoing process. We define an integration rule for the IBVS and PBVS controllers. Density functions that determine the weighting factor of each controller are defined to satisfy the above constraints. We prove that this integration method provides global stability and avoids local minima. The new integration method is validated on positioning tasks and compared with other switching methods.
Improvement to the minimization of hybrid error functions for pose alignment
A. H. Abdul Hafez,Jawahar C V
International Conference on Control, Automation, Robotics and Vision, ICARCV, 2006
@inproceedings{bib_Impr_2006, AUTHOR = {A. H. Abdul Hafez, Jawahar C V}, TITLE = {Improvement to the minimization of hybrid error functions for pose alignment}, BOOKTITLE = {International Conference on Control, Automation, Robotics and Vision}. YEAR = {2006}}
Many problems in computer vision, such as pose recovery and structure estimation, are formulated as a minimization process. These problems vary in whether they use image measurements directly or use them to extract 3D cues in the minimization process. Hybrid methods have the advantage of combining 2D and 3D visual information to improve performance over the above two approaches. In this paper, we present a new formulation for minimizing a class of hybrid error functions. This is done by using 2D information from the image space and 3D information from the Cartesian space in one error function. Applications to visual servoing and image alignment problems are presented. The positioning task of a robot arm is formulated as a minimization problem. Gradient descent as a first-order approximation and Gauss-Newton as a second-order approximation are considered in this paper. Simulation results show, in comparison with the 2½D hybrid method, that these two methods provide an efficient solution to the feature visibility problem and the camera trajectory in Cartesian space.
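A generic Gauss-Newton iteration with a numerically differentiated Jacobian is sketched below, of the kind used for the second-order minimization mentioned above; the stacked residual (2-D point alignment under a planar rotation and translation) is only a stand-in for the paper's hybrid 2D/3D error function:

```python
# Generic Gauss-Newton least-squares minimization with a numerical Jacobian.
import numpy as np

def gauss_newton(residual, p0, n_iter=20, eps=1e-6):
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        r = residual(p)
        J = np.empty((len(r), len(p)))
        for k in range(len(p)):                       # forward-difference Jacobian
            dp = np.zeros_like(p)
            dp[k] = eps
            J[:, k] = (residual(p + dp) - r) / eps
        step = np.linalg.lstsq(J, -r, rcond=None)[0]  # Gauss-Newton update
        p = p + step
        if np.linalg.norm(step) < 1e-10:
            break
    return p

# Toy problem: recover a planar rotation + translation aligning two point sets.
rng = np.random.default_rng(0)
src = rng.normal(size=(10, 2))
theta, t = 0.3, np.array([0.5, -0.2])
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
dst = src @ R.T + t

def residual(p):
    c, s = np.cos(p[2]), np.sin(p[2])
    Rp = np.array([[c, -s], [s, c]])
    return (src @ Rp.T + p[:2] - dst).ravel()

print(gauss_newton(residual, p0=[0.0, 0.0, 0.0]))     # approx. [0.5, -0.2, 0.3]
```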
Learning segmentation of documents with complex scripts
K S SESH KUMAR,Anoop Namboodiri,Jawahar C V
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2006
@inproceedings{bib_Lear_2006, AUTHOR = {K S SESH KUMAR, Anoop Namboodiri, Jawahar C V}, TITLE = {Learning segmentation of documents with complex scripts}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2006}}
Most of the state-of-the-art segmentation algorithms are designed to handle complex document layouts and backgrounds, while assuming a simple script structure such as in the Roman script. They perform poorly when used with Indian languages, where the components are not strictly collinear. In this paper, we propose a document segmentation algorithm that can handle the complexity of Indian scripts in large document image collections. Segmentation is posed as a graph cut problem that incorporates a priori information from the script structure in the objective function of the cut. We show that this information can be learned automatically and adapted within a collection of documents (a book) and across collections to achieve accurate segmentation. We show results on Indian language documents in the Telugu script. The approach is also applicable to other languages with complex scripts such as Bangla, Kannada, Malayalam, and Urdu.
Model-based annotation of online handwritten datasets
ANAND KUMAR,A BALA SUBRAMANIAN,Anoop Namboodiri,Jawahar C V
International Conference on Frontiers in Handwriting Recognition, ICFHR, 2006
@inproceedings{bib_Mode_2006, AUTHOR = {ANAND KUMAR, A BALA SUBRAMANIAN, Anoop Namboodiri, Jawahar C V}, TITLE = {Model-based annotation of online handwritten datasets}, BOOKTITLE = {International Conference on Frontiers in Handwriting Recognition}. YEAR = {2006}}
Annotated datasets of handwriting are a prerequisite for attempting a variety of problems such as building recognizers, developing writer identification algorithms, etc. However, the annotation of large datasets is a tedious and expensive process, especially at the character or stroke level. In this paper we propose a novel, automated method for annotation at the character level, given a parallel corpus of online handwritten data and the corresponding text. The method employs a model-based handwriting synthesis unit to map the two corpora to the same space, and the annotation is propagated to the word level and then to the individual characters using elastic matching. The initial results of annotation are used to improve the handwriting synthesis model for the user under consideration, which in turn refines the annotation. The method can handle errors in the handwriting such as spurious and missing strokes or characters. The output is stored in the UPX InkML format.
Quality management in digital libraries
Vamshi Ambati,Pramod Sankar Kompalli,P LAKSHMI AASRITHA,Jawahar C V
International Conference on Digital Libraries, ICDLi, 2005
@inproceedings{bib_Qual_2005, AUTHOR = {Vamshi Ambati, Pramod Sankar Kompalli, P LAKSHMI AASRITHA, Jawahar C V}, TITLE = {Quality management in digital libraries}, BOOKTITLE = {International Conference on Digital Libraries}. YEAR = {2005}}
Digital Libraries have received wide attention in recent years, allowing access to digital information from anywhere across the world. They have become widely accepted, and even preferred, information sources in areas such as education and science. When the full potential of digital libraries is realized, any citizen will for the first time be able to access all human knowledge immediately from any location. For a successful and useful Digital Library, assurance of the quality of the product of the digitization process is essential. In this paper we discuss the major quality concerns for data from the digitization process, and how we assure better quality before the data are web-enabled for the end user. We also discuss a Quality Management Framework that we applied for improving both the quality of the digital output and the efficiency of the processes.
Recognizing Human Activities from Constituent Actions
SANTOSH RAVI KIRAN S,KARTEEK ALAHARI,Jawahar C V
National Conference on Communications, NCC, 2005
@inproceedings{bib_Reco_2005, AUTHOR = {SANTOSH RAVI KIRAN S, KARTEEK ALAHARI, Jawahar C V}, TITLE = {Recognizing Human Activities from Constituent Actions}, BOOKTITLE = {National Conference on Communications}. YEAR = {2005}}
Many human activities, such as jumping and squatting, have a correlated spatiotemporal structure. They are composed of homogeneous units. These units, which we refer to as actions, are often common to more than one activity. Therefore, it is essential to have a representation which can capture these activities effectively. To develop this, we model the frames of activities as a mixture model of actions and employ a probabilistic approach to learn their low-dimensional representation. We present recognition results on seven activities performed by various individuals. The results demonstrate the versatility and the ability of the model to capture the ensemble of human activities.
Robust visual servoing based on novel view prediction
A.H. Abdul Hafez,PIYUSH JANAWADKAR,Jawahar C V
International Journal of Robotics Research, IJRR, 2005
@inproceedings{bib_Robu_2005, AUTHOR = {A.H. Abdul Hafez, PIYUSH JANAWADKAR, Jawahar C V}, TITLE = {Robust visual servoing based on novel view prediction}, BOOKTITLE = {International Journal of Robotics Research}. YEAR = {2005}}
In this paper we propose a novel technique for robust visual servoing in the presence of a large proportion of outliers in the image measurements. The method employs robust statistical techniques and novel view prediction to improve performance. We identify a set of points from the initial and reference images and compute the essential matrix relating them. The selected points are predicted from the initial and reference images to the current frame using the essential matrices. A function of the difference between the observed and predicted image point measurements is used to identify outliers. The technique is validated with many experiments and compared with other robust methods in a simulation framework.
Perspective correction methods for camera based document analysis
L JAGANNATHAN,Jawahar C V
International Workshop on Document Analysis Systems, DAS, 2005
@inproceedings{bib_Pers_2005, AUTHOR = {L JAGANNATHAN, Jawahar C V}, TITLE = {Perspective correction methods for camera based document analysis}, BOOKTITLE = {International Workshop on Document Analysis Systems}. YEAR = {2005}}
In this paper we describe a spectrum of algorithms for the rectification of document images for camera-based analysis and recognition. Clues such as document boundaries, page layout information, the organisation of text and graphics components, and a priori knowledge of the script or of selected symbols are effectively used to remove the perspective effect and compute the frontal view needed by a typical document image analysis algorithm. Appropriate results from projective geometry for planar surfaces are exploited in these situations.
Recognition of printed Amharic documents
MILLION MESHESHA,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2005
@inproceedings{bib_Reco_2005, AUTHOR = {MILLION MESHESHA, Jawahar C V}, TITLE = {Recognition of printed Amharic documents}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2005}}
In Africa, there are a number of languages with their own indigenous scripts. This paper presents an OCR for Amharic scripts. Amharic is the official and working language of Ethiopia. This is possibly the first attempt towards the development of an OCR system for Amharic. Research in the recognition of Amharic script faces major challenges due to (i) the use of more than 300 characters in writing and (ii) existence of a large set of visually similar characters. In this paper, we propose a two-stage feature extraction scheme using PCA and LDA, followed by a decision DAG classifier with SVMs as the nodes. Recognition results are presented to demonstrate the performance on the various printing variations (fonts, styles and sizes) and real-life degraded documents such as books, magazines and newspapers.
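A minimal scikit-learn sketch of the two-stage PCA followed by LDA feature extraction with an SVM classifier; the built-in digits data and the library's one-vs-one SVC are placeholders for the Amharic character images and the decision DAG of SVMs used in the paper:

```python
# Two-stage PCA -> LDA feature extraction followed by an SVM classifier.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)        # placeholder for character images
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(
    PCA(n_components=40),                             # stage 1: unsupervised compression
    LinearDiscriminantAnalysis(n_components=9),       # stage 2: discriminative projection
    SVC(kernel="rbf", C=10, gamma="scale"),           # SVM classifier (one-vs-one)
)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```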
Discriminant substrokes for online handwriting recognition
KARTEEK ALAHARI,SATYA LAHARI PUTREVU,Jawahar C V
International Conference on Document Analysis and Recognition, ICDAR, 2005
@inproceedings{bib_Disc_2005, AUTHOR = {KARTEEK ALAHARI, SATYA LAHARI PUTREVU, Jawahar C V}, TITLE = {Discriminant substrokes for online handwriting recognition}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2005}}
A discriminant-based framework for the automatic recognition of online handwriting data is presented in this paper. We identify the substrokes that are more useful in discriminating between two online strokes. A similarity/dissimilarity score is computed based on the discriminatory potential of various parts of the stroke for the classification task. The discriminatory potential is then converted to the relative importance of the substroke. Experimental verification on online data such as numerals and characters supports our claims. We achieve an average reduction in the classification error rate on many test sets of similar character pairs.
Crosslingual access of textual information using camera phones
L JAGANNATHAN,Jawahar C V
International Conference on Cognition and Recognition, ICCR, 2005
@inproceedings{bib_Cros_2005, AUTHOR = {L JAGANNATHAN, Jawahar C V}, TITLE = {Crosslingual access of textual information using camera phones}, BOOKTITLE = {International Conference on Cognition and Recognition}. YEAR = {2005}}
In this paper we describe a prototype system that recognizes text in images captured by camera phones and provides access to the content in a different language. Two Indian languages, Hindi and Tamil, are used to demonstrate such a system. The prototype is built using off-the-shelf components and algorithms developed in house. The acquired image is first transferred to a server, which corrects the perspective distortions, detects, recognizes, and then translates the text (word). The translated text, along with any additional information, is sent back to the camera phone in a suitable form. We also describe the Hindi and Tamil OCRs that we use for character recognition, and propose methods to make the recognizer efficient in storage and computation.
Video retrieval based on textual queries
Jawahar C V,CHENNUPATI BALAKRISHNA,BALMANOHAR PALURI,NATARAJ. J
International Conference on Advanced Computing and Communications, AdCom, 2005
@inproceedings{bib_Vide_2005, AUTHOR = {Jawahar C V, CHENNUPATI BALAKRISHNA, BALMANOHAR PALURI, NATARAJ. J}, TITLE = {Video retrieval based on textual queries}, BOOKTITLE = {International Conference on Advanced Computing and Communications}. YEAR = {2005}}
With the advancement of multimedia, digital video creation has become very common. There is enormous information present in video, necessitating search techniques for specific content. In this paper, we present an approach that enables search based on the textual information present in the video. Regions of textual information are identified within the frames of the video, and the video is then annotated with the textual content present in these regions. Traditionally, OCRs are used to extract the text within the video; however, the choice of OCR brings in many constraints on the languages and fonts that can be handled. We therefore propose an approach that enables matching at the image level, thereby avoiding an OCR. Videos containing the query string are retrieved from a video database and sorted based on relevance. Results are shown on video collections in English, Hindi and Telugu.
Design of hierarchical classifier with hybrid architectures
M N S S K PAVAN KUMAR,Jawahar C V
Conference on Pattern Recognition and Machine Intelligence, PReMI, 2005
@inproceedings{bib_Desi_2005, AUTHOR = {M N S S K PAVAN KUMAR, Jawahar C V}, TITLE = {Design of hierarchical classifier with hybrid architectures}, BOOKTITLE = {Conference on Pattern Recognition and Machine Intelligence}. YEAR = {2005}}
Performance of hierarchical classifiers depends on two aspects – the performance of the individual classifiers, and the design of the architecture. In this paper, we present a scheme for designing hybrid hierarchical classifiers under user specified constraints on time and space.
Learning to segment document images
K S SESH KUMAR,Anoop Namboodiri,Jawahar C V
Conference on Pattern Recognition and Machine Intelligence, PReMI, 2005
@inproceedings{bib_Lear_2005, AUTHOR = {K S SESH KUMAR, Anoop Namboodiri, Jawahar C V}, TITLE = {Learning to segment document images}, BOOKTITLE = {Conference on Pattern Recognition and Machine Intelligence}. YEAR = {2005}}
A hierarchical framework for document segmentation is proposed as an optimization problem. Unlike traditional document segmentation algorithms, the model incorporates the dependencies between the various levels of the hierarchy. This framework is applied to learn the parameters of the document segmentation algorithm using optimization methods such as gradient descent and Q-learning. The novelty of our approach lies in learning the segmentation parameters in the absence of ground truth.
Searching in Document Images.
Jawahar C V,MILLION MESHESHA,A. Balasubramanian
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2004
@inproceedings{bib_Sear_2004, AUTHOR = {Jawahar C V, MILLION MESHESHA, A. Balasubramanian}, TITLE = {Searching in Document Images.}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2004}}
Searching in scanned documents is an important problem in Digital Libraries. If OCRs are not available, the scanned images are inaccessible. In this paper, we demonstrate a searching procedure without an intermediate textual representation. We achieve effective retrieval from document databases by matching at the word level using image features. Word profiles, structural features and transform domain representations are employed for characterising the word images. A novel partial matching approach based on dynamic time warping (DTW) is proposed to take care of word form variations. With the new partial matching procedure, morphologically variant words become similar in image space. This is especially useful for grouping together similar words for indexing purposes. We extend our formulation to cross-lingual search with the help of transliteration.
Word-level access to document image datasets
Jawahar C V,A.Balasubramanian,MILLION MESHESHA
Indian Conference on Computer Vision, Graphics and Image Processing Workshops, ICVGIP-W, 2004
@inproceedings{bib_Word_2004, AUTHOR = {Jawahar C V, A.Balasubramanian, MILLION MESHESHA}, TITLE = {Word-level access to document image datasets}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing Workshops}. YEAR = {2004}}
This paper presents a novel approach for the retrieval of relevant documents from a large collection of printed document images, where search in the corresponding textual content is practically impossible due to the unavailability of robust OCRs. We achieve effective search by matching at the word level using image features. The importance of a document for a given query is also computed at the image level.
Fourier domain representation of planar curves for recognition in multiple views
SUJIT KUTHIRUMMAL,Jawahar C V,Narayanan P J
Pattern Recognition, PR, 2004
@inproceedings{bib_Four_2004, AUTHOR = {SUJIT KUTHIRUMMAL, Jawahar C V, Narayanan P J}, TITLE = {Fourier domain representation of planar curves for recognition in multiple views}, BOOKTITLE = {Pattern Recognition}. YEAR = {2004}}
Recognition of planar shapes is an important problem in computer vision and pattern recognition. The same planar object contour imaged from different cameras or from different viewpoints looks different, and their recognition is non-trivial. Traditional shape recognition deals with views of the shapes that differ only by simple rotations, translations, and scaling. However, shapes suffer more serious deformation between two general views, and hence recognition approaches designed to handle translations, rotations, and/or scaling prove insufficient. Many algebraic relations between matching primitives in multiple views have been identified recently. In this paper, we explore how shape properties and multiview relations can be combined to recognize planar shapes across multiple views. We propose novel recognition constraints that a planar shape boundary must satisfy in multiple views. The constraints are on the rank of a Fourier-domain measurement matrix computed from the points on the shape boundary. Our method can additionally compute the correspondence between the curve points after a match is established. We demonstrate the applications of these constraints experimentally on a number of synthetic and real images.
Constraints on coplanar moving points
SUJIT KUTHIRUMMAL,Jawahar C V,Narayanan P J
European Conference on Computer Vision, ECCV, 2004
@inproceedings{bib_Cons_2004, AUTHOR = {SUJIT KUTHIRUMMAL, Jawahar C V, Narayanan P J}, TITLE = {Constraints on coplanar moving points}, BOOKTITLE = {European Conference on Computer Vision}. YEAR = {2004}}
Configurations of dynamic points viewed by one or more cameras have not been studied much. In this paper, we present several view and time-independent constraints on different configurations of points moving on a plane. We show that 4 points with constant independent velocities or accelerations under affine projection can be characterized in a view independent manner using 2 views. Under perspective projection, 5 coplanar points under uniform linear velocity observed for 3 time instants in a single view have a view-independent characterization. The best known constraint for this case involves 6 points observed for 35 frames. Under uniform acceleration, 5 points in 5 time instants have a view-independent characterization. We also present constraints on a point undergoing arbitrary planar motion under affine projections in the Fourier domain. The constraints introduced in this paper involve fewer points or views than similar results reported in the literature and are simpler to compute in most cases. The constraints developed can be applied to many aspects of computer vision. Recognition constraints for several planar point configurations of moving points can result from them. We also show how time-alignment of views captured independently can follow from the constraints on moving point configurations.
Building blocks for autonomous navigation using contour correspondences
PAWAN KUMAR M,Jawahar C V,Narayanan P J
International Conference on Image Processing, ICIP, 2004
@inproceedings{bib_Buil_2004, AUTHOR = {PAWAN KUMAR M, Jawahar C V, Narayanan P J}, TITLE = {Building blocks for autonomous navigation using contour correspondences}, BOOKTITLE = {International Conference on Image Processing}. YEAR = {2004}}
We address a few problems in the navigation of automated vehicles using images captured by a mounted camera. Specifically, we look at the recognition of sign boards, the rectification of planar objects imaged by the camera, and the estimation of the position of a vehicle with respect to a fixed sign board. Our solutions are based on contour correspondence between a reference view and the current view. The mapping between corresponding points of a planar object in two different views is a matrix called the homography. A novel two-step linear algorithm for homography calculation from contour correspondence is developed first. Our algorithm requires the identification of an image contour as the projection of a known planar world contour and the selection of a known starting point. The homography between the reference view and the target view is applied to several real-life navigation applications, the results of which are presented in this paper.
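As background for the homography computation used throughout, the standard DLT estimate from point correspondences is sketched below; the paper's contribution is a two-step contour-based algorithm that avoids explicit correspondences, so this is only the generic building block:

```python
# Standard DLT estimation of a planar homography from point correspondences.
import numpy as np

def homography_dlt(src, dst):
    """src, dst: (N, 2) arrays of corresponding points, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)          # null-space vector gives the homography
    return H / H[2, 2]

# Toy usage: map four points through a known homography and recover it.
H_true = np.array([[1.1, 0.05, 3.0], [-0.02, 0.95, -1.0], [1e-4, 2e-4, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 80], [0, 80]], dtype=float)
pts = np.c_[src, np.ones(4)] @ H_true.T
dst = pts[:, :2] / pts[:, 2:]
print(np.round(homography_dlt(src, dst), 4))
```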
Representation and annotation of online handwritten data
Ajay S. Bhaskarabhatla,Sriganesh Madhvanath,M N S S K PAVAN KUMAR, A. Balasubramania,Jawahar C V
International Workshop on Frontiers in Handwriting Recognition, IWFHR, 2004
@inproceedings{bib_Repr_2004, AUTHOR = {Ajay S. Bhaskarabhatla, Sriganesh Madhvanath, M N S S K PAVAN KUMAR, A. Balasubramania, Jawahar C V}, TITLE = {Representation and annotation of online handwritten data}, BOOKTITLE = {International Workshop on Frontiers in Handwriting Recognition}. YEAR = {2004}}
Annotated datasets of handwriting are a prerequisite for the design and training of handwriting recognition algorithms. In this paper, we briefly describe an XML representation for the annotation of online handwriting data that uses the emerging Digital Ink Markup Language (InkML) standard from W3C for the representation of handwriting data. We then describe a tool based on the proposed representation that can be used for annotation of digital ink. Ease and speed of annotation are emphasized in the design of the tool. Together, the representation and the tool attempt to address the requirements for creating annotated datasets of handwritten data in different scripts around the world.
Geometric Structure Computation from Conics.
Pawan Kumar Mudigonda,Jawahar C V,Narayanan P J
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2004
@inproceedings{bib_Geom_2004, AUTHOR = {Pawan Kumar Mudigonda, Jawahar C V, Narayanan P J}, TITLE = {Geometric Structure Computation from Conics.}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2004}}
This paper presents several results on images of various configurations of conics. We extract information about the plane from single and multiple views of known and unknown conics, based on planar homography and conic correspondences. We show that a single conic section cannot provide sufficient information. Metric rectification of the plane can be performed from a single view if two conics can be identified to be images of circles without knowing their centers or radii. The homography between two views of a planar scene can be computed if two arbitrary conics are identified in them without knowing anything specific about them. The scene can be reconstructed from a single view if images of a pair of circles can be identified in two planes. Our results are simpler and require less information from the images than previously known results. The results presented here involve univariate polynomial equations of degree 4 or 8 and always have solutions. Applications to metric rectification, homography calculation, 3D reconstruction, and projective OCR are presented to demonstrate the usefulness of our scheme
Discrete contours in multiple views: approximation and recognition
PAWAN KUMAR M,SAURABH GOYAL,SUJIT KUTHIRUMMAL,Jawahar C V,Narayanan P J
Image Vision Computing, IVC, 2004
@inproceedings{bib_Disc_2004, AUTHOR = {PAWAN KUMAR M, SAURABH GOYAL, SUJIT KUTHIRUMMAL, Jawahar C V, Narayanan P J}, TITLE = {Discrete contours in multiple views: approximation and recognition}, BOOKTITLE = {Image Vision Computing}. YEAR = {2004}}
Recognition of discrete planar contours under similarity transformations has received a lot of attention but little work has been reported on recognizing them under more general transformations. Planar object boundaries undergo projective or affine transformations across multiple views. We present two methods to recognize discrete curves in this paper. The first method computes a piecewise parametric approximation of the discrete curve that is projectively invariant. A polygon approximation scheme and a piecewise conic approximation scheme are presented here. The second method computes an invariant sequence directly from the sequence of discrete points on the curve in a Fourier transform space. The sequence is shown to be identical up to a scale factor in all affine related views of the curve. We present the theory and demonstrate its applications to several problems including numeral recognition, aircraft recognition, and homography computation.
Planar homography from fourier domain representation
PAWAN KUMAR M,SUJIT KUTHIRUMMAL,Jawahar C V,Narayanan P J
International Conference on Signal Processing and Communications, SPCOM, 2004
@inproceedings{bib_Plan_2004, AUTHOR = {PAWAN KUMAR M, SUJIT KUTHIRUMMAL, Jawahar C V, Narayanan P J}, TITLE = {Planar homography from fourier domain representation}, BOOKTITLE = {International Conference on Signal Processing and Communications}. YEAR = {2004}}
Computing the transformation between two views of a planar scene is an important step in many computer vision applications. Spatial approaches to solve this problem need corresponding sets of primitives – points, lines, conics, etc. Identification of corresponding primitives in two images is non-trivial, limiting the applicability of such approaches. In this paper, we present a novel Fourier domain based approach that makes use of image intensities for computing the image-to-image transformation. Our approach transforms the images to the Fourier domain and then represents them in a coordinate system in which the affine transformation is reduced to an anisotropic scaling. The anisotropic scale factors can be computed using cross correlation methods, and working backwards from this, we compute the entire transformation. It does not require any correspondences thereby making it practically very useful. Applications to registration and recognition are discussed.
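The key property exploited by such Fourier-domain methods is the affine theorem of the Fourier transform, stated below as general background (the exact coordinate normalisation used in the paper may differ).

```latex
% If two images are related by an affine warp g(\mathbf{x}) = f(A\mathbf{x} + \mathbf{b}),
% then their Fourier transforms satisfy
G(\mathbf{u}) \;=\; \frac{1}{\lvert \det A \rvert}\,
  e^{\,j 2\pi\, \mathbf{u}^{\top} A^{-1}\mathbf{b}}\; F\!\left(A^{-\top}\mathbf{u}\right),
% so the magnitude spectra are related purely by the linear map A^{-\top},
% independent of the translation \mathbf{b}.
```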
Tools for developing OCRs for Indian scripts
M N S S K PAVAN KUMAR,SANTOSH RAVI KIRAN S,ABHISHEK NAYANI,Jawahar C V,Narayanan P J
Computer Vision and Pattern Recognition Conference workshops, CVPR-W, 2003
@inproceedings{bib_Tool_2003, AUTHOR = {M N S S K PAVAN KUMAR, SANTOSH RAVI KIRAN S, ABHISHEK NAYANI, Jawahar C V, Narayanan P J}, TITLE = {Tools for developing OCRs for Indian scripts}, BOOKTITLE = {Computer Vision and Pattern Recognition Conference workshops}. YEAR = {2003}}
Development of OCRs for Indian scripts is an active area of research today. Indian scripts present great challenges to an OCR designer due to the large number of letters in the alphabet, the sophisticated ways in which they combine, and the complicated graphemes they result in. The problem is compounded by the unstructured manner in which popular fonts are designed. There is a lot of common structure in the different Indian scripts. In this paper, we argue that a number of automatic and semi-automatic tools can ease the development of recognizers for new font styles and new scripts. We discuss briefly three such tools we developed and show how they have helped build new OCRs. An integrated approach to the design of OCRs for all Indian scripts has great benefits. We are building OCRs for all Indian languages following this approach as part of a system to provide tools to create content in them.
A bilingual OCR for Hindi-Telugu documents and its applications
Jawahar C V,M N S S K PAVAN KUMAR,SANTOSH RAVI KIRAN S
International Conference on Document Analysis and Recognition, ICDAR, 2003
@inproceedings{bib_A_bi_2003, AUTHOR = {Jawahar C V, M N S S K PAVAN KUMAR, SANTOSH RAVI KIRAN S}, TITLE = {A bilingual OCR for Hindi-Telugu documents and its applications}, BOOKTITLE = {International Conference on Document Analysis and Recognition}. YEAR = {2003}}
This paper describes the character recognition process from printed documents containing Hindi and Telugu text. Hindi and Telugu are among the most popular languages in India. The bilingual recognizer is based on Principal Component Analysis followed by support vector classification. This attains an overall accuracy of approximately 96.7%. Extensive experimentation is carried out on an independent test set of approximately 200000 characters.
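The recognition pipeline described above, PCA for dimensionality reduction followed by a support vector classifier, can be sketched with standard tools. The snippet below is an illustrative reconstruction, not the authors' implementation; the component count and SVM hyperparameters are arbitrary placeholders.

```python
# Illustrative PCA + SVM character classifier, assuming character images
# have already been segmented and normalised to a fixed size.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_char_recognizer(images, labels, n_components=64):
    """images: (N, H, W) array of binarised character glyphs."""
    X = images.reshape(len(images), -1).astype(np.float64)
    model = make_pipeline(
        PCA(n_components=n_components),   # project glyphs onto a compact subspace
        SVC(kernel="rbf", C=10.0),        # multi-class SVM over the class labels
    )
    model.fit(X, labels)
    return model

# Usage (hypothetical arrays):
# model = train_char_recognizer(train_imgs, train_lbls)
# preds = model.predict(test_imgs.reshape(len(test_imgs), -1))
```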
A rule-based approach to image retrieval
MEHTA DHAVAL MANHARLAL,EMANI SVNLS DIWAKAR,Jawahar C V
IEEE Region 10 Conference, TENCON, 2003
@inproceedings{bib_A_ru_2003, AUTHOR = {MEHTA DHAVAL MANHARLAL, EMANI SVNLS DIWAKAR, Jawahar C V}, TITLE = {A rule-based approach to image retrieval}, BOOKTITLE = {IEEE Region 10 Conference}. YEAR = {2003}}
In this paper, a rule-based approach is introduced for retrieving images from an image database. Compared to image-based queries, this approach allows the user to query in a more natural language. Performance is demonstrated with simple queries on a generic image database and sophisticated queries on a database of face images.
Multiview image compression using algebraic constraints
KAMISETTY SRI CHAITANYA,Jawahar C V
IEEE Region 10 Conference, TENCON, 2003
@inproceedings{bib_Mult_2003, AUTHOR = {KAMISETTY SRI CHAITANYA, Jawahar C V}, TITLE = {Multiview image compression using algebraic constraints}, BOOKTITLE = {IEEE Region 10 Conference}. YEAR = {2003}}
In this paper, we propose a novel method for compression of multiple view images using algebraic constraints. The redundancy present in multiview images is considerably different from that in isolated images or video streams. A three-view relationship based on a trilinear tensor is employed for view prediction and residual computation, exploiting the geometric redundancy arising from the common world structure.
A Novel Approach to Script Separation
Ranjith Kumar,Vamsi Chaitanya,Jawahar C V
International Conference on Advances in Pattern Recognition, CAPR, 2003
@inproceedings{bib_A_No_2003, AUTHOR = {Ranjith Kumar, Vamsi Chaitanya, Jawahar C V}, TITLE = {A Novel Approach to Script Separation}, BOOKTITLE = {International Conference on Advances in Pattern Recognition}. YEAR = {2003}}
This paper describes a new approach to script separation. A character-level script separation scheme is combined with the Viterbi algorithm to obtain the optimal sequence of scripts that could have generated the text. This method complements the popular approaches to script separation at the paragraph level using texture features or at the line level using structural features.
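One plausible realisation of combining the character-level scheme with a Viterbi pass is to treat each character's per-script classifier scores as emission log-probabilities and a fixed script-switch penalty as the transition model; the sketch below is a hypothetical illustration under those assumptions, not the paper's exact formulation.

```python
import numpy as np

def viterbi_script_sequence(emissions, switch_penalty=4.0):
    """emissions: (T, S) log-probabilities of each of S scripts for T characters.
    Returns the most likely script index per character, discouraging frequent switches."""
    T, S = emissions.shape
    # Transition log-scores: staying in the same script is free, switching pays a penalty.
    trans = -switch_penalty * (1.0 - np.eye(S))
    score = emissions[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans          # (previous script, next script)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```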
Towards fuzzy calibration
Jawahar C V,Narayanan P J
International Conference on Fuzzy Systems, FUZZ , 2002
@inproceedings{bib_Towa_2002, AUTHOR = {Jawahar C V, Narayanan P J}, TITLE = {Towards fuzzy calibration}, BOOKTITLE = {International Conference on Fuzzy Systems}. YEAR = {2002}}
Algebraic Constraints on Moving Points in Multiple Views.
SUJIT KUTHIRUMMAL,Jawahar C V,Narayanan P J
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2002
@inproceedings{bib_Alge_2002, AUTHOR = {SUJIT KUTHIRUMMAL, Jawahar C V, Narayanan P J}, TITLE = {Algebraic Constraints on Moving Points in Multiple Views.}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2002}}
Multiview analysis of scenes includes the study of scene-independent constraints satisfied by a configuration of cameras for all types of scenes, as well as the study of view-independent constraints satisfied by any camera on a configuration of points. In this paper, we derive new constraints involving configurations of points that move with constant velocity, with constant acceleration, and with unconstrained planar motion. We show how these constraints can be applied to problems like motion recognition, frame alignment, etc.
Polygonal Approximation of Closed Curves across Multiple Views.
PAWAN KUMAR M,SAURABH GOYAL,Jawahar C V,Narayanan P J
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2002
@inproceedings{bib_Poly_2002, AUTHOR = {PAWAN KUMAR M, SAURABH GOYAL, Jawahar C V, Narayanan P J}, TITLE = {Polygonal Approximation of Closed Curves across Multiple Views.}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2002}}
Polygon approximation is an important step in the recognition of planar shapes. Traditional polygonal approximation algorithms handle only images that are related by a similarity transformation. The transformation of a planar shape as the viewpoint changes with a perspective camera is a general projective one. In this paper, we present a novel method for polygonal approximation of closed curves that is invariant to projective transformation. The polygons generated by our algorithm from two images, related by a projective homography, are isomorphic. We also describe an application of this in the form of numeral recognition. We demonstrate the importance of this algorithm for real-life applications like number plate recognition, aircraft recognition and metric rectification.
A Multifeature Correspondence Algorithm Using Dynamic Programming
Jawahar C V,Narayanan P J
Asian Conference on Computer Vision, ACCV, 2002
@inproceedings{bib_A_Mu_2002, AUTHOR = {Jawahar C V, Narayanan P J}, TITLE = {A Multifeature Correspondence Algorithm Using Dynamic Programming}, BOOKTITLE = {Asian Conference on Computer Vision}. YEAR = {2002}}
Correspondence between pixels is an important problem in stereo vision. Several algorithms have been proposed in the literature to carry out this task, and almost all of them employ only gray values. We show here that the addition of primary or secondary evidence maps can improve the correspondence computation. However, no particular combination is guaranteed to provide proper results in a general situation; what one needs is a mechanism to select the evidence appropriate for a particular pair of images. We present an algorithm for stereo correspondence that can take advantage of different image features adaptively for matching. A match measure combining different individual measures computed from different features is used by our algorithm. The advantages of each feature can be combined in a single correspondence computation. We describe an unsupervised scheme to compute the relevance of each feature to a particular situation, given a set of possibly useful features. We present an implementation of the scheme using dynamic programming for pixel-to-pixel correspondence.
An adaptive multifeature correspondence algorithm for stereo using dynamic programming
Jawahar C V,Narayanan P J
Pattern Recognition Letters, PRLJ, 2002
@inproceedings{bib_An_a_2002, AUTHOR = {Jawahar C V, Narayanan P J}, TITLE = {An adaptive multifeature correspondence algorithm for stereo using dynamic programming}, BOOKTITLE = {Pattern Recognition Letters}. YEAR = {2002}}
We present an algorithm for stereo correspondence that can take advantage of different image features adaptively for matching. A match measure combining different match measures computed from different features is used by our algorithm. It is possible to compute correspondences using the gray value, multispectral components, derived features such as the edge strength, texture, etc., in a flexible manner using this algorithm. The advantages of each feature can be combined in a single correspondence computation. We describe a non-supervised scheme to compute the relevance of each feature to a particular situation, given a set of possibly useful features. We present an implementation of the scheme using dynamic programming for pixel-to-pixel correspondence. Results demonstrate the advantages of our scheme under different conditions.
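As a rough illustration of how several per-feature dissimilarities can be folded into the single match measure consumed by a dynamic-programming scanline matcher, the sketch below assumes per-pixel feature vectors and relevance weights are already available; the linear weighting and the occlusion penalty are stand-ins, not the paper's exact formulation.

```python
import numpy as np

def combined_cost(left_feats, right_feats, weights):
    """left_feats, right_feats: (W, F) per-pixel feature vectors for one scanline pair
    (e.g. gray value, edge strength, texture). weights: (F,) non-negative relevances.
    Returns a (W, W) matrix of weighted dissimilarities between pixel pairs."""
    diff = np.abs(left_feats[:, None, :] - right_feats[None, :, :])   # (W, W, F)
    return (diff * weights).sum(axis=-1)

def dp_scanline_match(cost, occlusion=1.0):
    """Classic dynamic-programming pixel-to-pixel matching of one scanline pair.
    Returns the optimal total cost; backtracking for the pixel assignments is omitted."""
    W = cost.shape[0]
    D = np.full((W + 1, W + 1), np.inf)
    D[0, :] = occlusion * np.arange(W + 1)
    D[:, 0] = occlusion * np.arange(W + 1)
    for i in range(1, W + 1):
        for j in range(1, W + 1):
            D[i, j] = min(D[i - 1, j - 1] + cost[i - 1, j - 1],   # match the two pixels
                          D[i - 1, j] + occlusion,                # left pixel occluded
                          D[i, j - 1] + occlusion)                # right pixel occluded
    return D[W, W]
```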
Generalised correlation for multi-feature correspondence
Jawahar C V,Narayanan P J
Pattern Recognition, PR, 2002
@inproceedings{bib_Gene_2002, AUTHOR = {Jawahar C V, Narayanan P J}, TITLE = {Generalised correlation for multi-feature correspondence}, BOOKTITLE = {Pattern Recognition}. YEAR = {2002}}
Computing correspondences between pairs of images is fundamental to all structure from motion algorithms. Correlation is a popular method to estimate similarity between patches of images. In the standard formulation, the correlation function uses only one feature, such as the gray level values of a small neighbourhood. Research has shown that different features, such as colour, edge strength, corners, and texture measures, work better under different conditions. We propose a framework of generalised correlation that can compute a real-valued similarity measure using a feature vector whose components can be dissimilar. The framework can combine the effects of different image features, such as multi-spectral features, edges, corners, texture measures, etc., into a single similarity measure in a flexible manner. Additionally, it can combine results of different window sizes used for correlation with proper weighting for each. Relative importances of the features can be estimated from the image itself for accurate correspondence. In this paper, we present the framework of generalised correlation, provide a few examples demonstrating its power, and discuss the implementation issues.
Planar shape recognition across multiple views
SUJIT KUTHIRUMMAL,Jawahar C V,Narayanan P J
International conference on Pattern Recognition, ICPR, 2002
@inproceedings{bib_Plan_2002, AUTHOR = {SUJIT KUTHIRUMMAL, Jawahar C V, Narayanan P J}, TITLE = {Planar shape recognition across multiple views}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2002}}
Multiview studies in computer vision have concentrated on the constraints satisfied by individual primitives such as points and lines. Not much attention has been paid to the properties of a collection of primitives in multiple views, which could be studied in the spatial domain or in an appropriate transform domain. We derive an algebraic constraint for planar shape recognition across multiple views based on the rank of a matrix of Fourier domain descriptor coefficients of the shape in different views. We also show how correspondence between points on the boundary can be computed for matching shapes, using the phase of the measure employed for recognition.
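A minimal sketch of why such a rank condition can arise, under the simplifying assumptions that the views are related by an affine map and that corresponding boundary points share the same contour parameter (the paper's actual derivation and normalisation may differ):

```latex
% Write the contour as a complex signal z(t) = x(t) + i\,y(t) with Fourier
% coefficients Z_k. An affine map of the plane in complex form is
z'(t) \;=\; a\,z(t) \;+\; b\,\overline{z(t)} \;+\; c,
% so for k \neq 0 the coefficients of the transformed contour satisfy
Z'_k \;=\; a\,Z_k \;+\; b\,\overline{Z_{-k}} .
% Every affine-related view therefore has a descriptor vector lying in the span
% of the two fixed vectors (Z_k)_k and (\overline{Z_{-k}})_k, so a matrix that
% stacks the descriptor vectors of several views has rank at most two.
```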
A multimedia-based City information system
KRANTHI KUMAR RAVI,Narayanan P J,Jawahar C V
IETE Technical Review, TR, 2001
@inproceedings{bib_A_mu_2001, AUTHOR = {KRANTHI KUMAR RAVI, Narayanan P J, Jawahar C V}, TITLE = {A multimedia-based City information system}, BOOKTITLE = {IETE Technical Review}. YEAR = {2001}}
Plenty of information is available today from different sources about any particular geographical area. However, none of it reliably serves the differing needs of tourists, city planners, administrators, and lay people. We have designed a multimedia-based city information system that attempts to solve this problem. In this paper we explain the design and implementation of our system. The paper discusses in detail the need for such a system, how it compares with similar existing systems, and the features expected in such a system. It also discusses our implementation of the system and the future work we want to pursue in this direction.
A flexible scheme for representation, matching, and retrieval of images
Jawahar C V,Narayanan P J,Subrata Rakshit
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2000
@inproceedings{bib_A_fl_2000, AUTHOR = {Jawahar C V, Narayanan P J, Subrata Rakshit}, TITLE = {A flexible scheme for representation, matching, and retrieval of images}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2000}}
Image databases index images using features extracted from them. The indexing scheme is decided a priori and is optimized for a specific querying criterion. This is not suitable for a generic database of images which may be queried by multiple users based on different criteria. In this paper, we present a flexible scheme which adapts itself to the user's preferences. Though the method uses a conservative set of features during indexing, including a large number and variety of fundamental features, the query processing time does not increase due to this redundancy. A boot-strapping mechanism allows the user to build up a "desired class" from a few samples. The feature selection computation scales linearly with the size of the desired class, rather than that of the entire database. This feature makes our algorithm viable for very large databases. We present an implementation of our scheme and some results from it.
Feature Integration and Selection for Pixel Correspondence
Jawahar C V,Narayanan P J
Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP, 2000
@inproceedings{bib_Feat_2000, AUTHOR = {Jawahar C V, Narayanan P J}, TITLE = {Feature Integration and Selection for Pixel Correspondence}, BOOKTITLE = {Indian Conference on Computer Vision, Graphics and Image Processing}. YEAR = {2000}}
Pixel correspondence is an important problem in stereo vision, motion, structure from motion, etc. Several procedures have been proposed in the literature for this problem, using a variety of image features to identify the corresponding features. Different features work well under different conditions. An algorithm that can seamlessly integrate multiple features in a flexible manner can combine the advantages of each. In this paper, we propose a framework to combine heterogeneous features, each with a different measure of importance, into a single correspondence computation. We also present an unsupervised procedure to select the optimal combination of features for a given pair of images by computing the relative importances of each feature. A unique aspect of our framework is that it is independent of the specific correspondence algorithm used. Optimal feature selection can be done using any correspondence mechanism that can be extended to use multiple features. We also present a few examples that demonstrate the effectiveness of the feature selection framework.
Analysis of fuzzy thresholding schemes
Jawahar C V,P.K. Biswas,A.K. Ray
Pattern Recognition, PR, 2000
@inproceedings{bib_Anal_2000, AUTHOR = {Jawahar C V, P.K. Biswas, A.K. Ray}, TITLE = {Analysis of fuzzy thresholding schemes}, BOOKTITLE = {Pattern Recognition}. YEAR = {2000}}
Fuzzy thresholding schemes preserve the structural details embedded in the original gray distribution. In this paper, various fuzzy thresholding schemes are analysed in detail. The thresholding scheme based on fuzzy clustering has been extended to a possibilistic framework. The characteristic differences in membership assignment among the fuzzy algorithms, and their correspondence with conventional hard thresholding schemes, have been investigated. A possible direction towards unifying a number of hard and fuzzy thresholding schemes has been presented.
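As a concrete instance of the clustering-based formulation analysed in this line of work, a two-class fuzzy c-means style thresholding on gray levels can be written down directly; the snippet below is an illustrative sketch (the fuzzifier, initialisation, and iteration count are arbitrary choices), not the paper's derivation.

```python
import numpy as np

def fuzzy_two_class_threshold(gray, m=2.0, iters=50):
    """Fuzzy c-means with two clusters on gray levels.
    Returns the memberships of the brighter class and the crossover threshold."""
    g = gray.astype(np.float64).ravel()
    c0, c1 = g.min(), g.max()                      # initial class prototypes
    for _ in range(iters):
        d0 = np.abs(g - c0) + 1e-9
        d1 = np.abs(g - c1) + 1e-9
        u1 = 1.0 / (1.0 + (d1 / d0) ** (2.0 / (m - 1.0)))   # membership in class 1
        u0 = 1.0 - u1
        c0 = (u0 ** m * g).sum() / (u0 ** m).sum()           # update prototypes
        c1 = (u1 ** m * g).sum() / (u1 ** m).sum()
    threshold = 0.5 * (c0 + c1)                    # gray level where memberships cross
    return u1.reshape(gray.shape), threshold
```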