Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman, Siddhant Bansal, Richard Newcombe, James M. Rehg, Michael Wray, Jianbo Shi, Robert Kuo, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Jawahar C V, Kumar Ashutosh, Vijay Baiyya
Technical Report, arXiv, 2023
@inproceedings{bib_Ego-_2023, AUTHOR = {Kristen Grauman, Siddhant Bansal, Richard Newcombe, James M. Rehg, Michael Wray, Jianbo Shi, Robert Kuo, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Jawahar C V, Kumar Ashutosh, Vijay Baiyya}, TITLE = {Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
We present Ego-Exo4D, a diverse, large-scale multimodal, multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel “expert commentary” done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.
DocVQA: A Dataset for VQA on Document Images
MINESH MATHEW, Dimosthenis Karatzas, R. Manmatha, Jawahar C V
Technical Report, arXiv, 2020
@inproceedings{bib_DocV_2020, AUTHOR = {MINESH MATHEW, Dimosthenis Karatzas, R. Manmatha, Jawahar C V}, TITLE = {DocVQA: A Dataset for VQA on Document Images}, BOOKTITLE = {Technical Report}, YEAR = {2020}}
We present a new dataset for Visual Question Answering on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. We provide a detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models particularly need to improve on questions where understanding the structure of the document is crucial.
RoadText-1K: Text Detection & Recognition Dataset for Driving Videos
Sangeeth Reddy Battu, MINESH MATHEW, Lluis Gomez, Marcal Rusinol, Dimosthenis Karatzas, Jawahar C V
International Conference on Robotics and Automation, ICRA, 2020
@inproceedings{bib_Road_2020, AUTHOR = {Sangeeth Reddy Battu, MINESH MATHEW, Lluis Gomez, Marcal Rusinol, Dimosthenis Karatzas, Jawahar C V}, TITLE = {RoadText-1K: Text Detection & Recognition Dataset for Driving Videos}, BOOKTITLE = {International Conference on Robotics and Automation}, YEAR = {2020}}
Perceiving text is crucial to understanding the semantics of outdoor scenes and hence is a critical requirement for building intelligent systems for driver assistance and self-driving. Most existing datasets for text detection and recognition comprise still images and are compiled with text in mind. This paper introduces RoadText-1K, a new dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. It comprises 1,000 video clips of driving, collected without any bias towards text, with annotations for text bounding boxes and transcriptions in every frame. State-of-the-art methods for text detection, recognition and tracking are evaluated on the new dataset, and the results highlight the challenges of unconstrained driving videos compared to existing datasets. This suggests that RoadText-1K is suited for research and development of reading systems robust enough to be incorporated into more complex downstream tasks like driver assistance and self-driving. The dataset can be found at http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtext-1k
Evaluation and Visualization of Driver Inattention Rating From Facial Features
ISHA DUA, Akshay Uttama Nambi, Jawahar C V, Venkata N. Padmanabhan
IEEE Transactions on Biometrics, Behavior, and Identity Science, TBIOM, 2019
@inproceedings{bib_Eval_2019, AUTHOR = {ISHA DUA, Akshay Uttama Nambi, Jawahar C V, Venkata N. Padmanabhan}, TITLE = {Evaluation and Visualization of Driver Inattention Rating From Facial Features}, BOOKTITLE = {IEEE Transactions on Biometrics, Behavior, and Identity Science}, YEAR = {2019}}
In this paper, we present AUTORATE, a system that leverages the front camera of a windshield-mounted smartphone to monitor driver attention by combining several features. We derive a driver attention rating by fusing spatio-temporal features based on the driver's state and behavior, such as head pose, eye gaze, eye closure, yawns, and use of cellphones. We perform an extensive evaluation of AUTORATE on real-world driving data as well as data from controlled, static-vehicle settings with 30 drivers in a large city. We compare AUTORATE's automatically generated rating with the scores given by 5 human annotators, and compute the agreement between them using the kappa coefficient. AUTORATE's automatically generated rating has an overall agreement of 0.88 with the ratings provided by the 5 human annotators. We also propose a soft attention mechanism in AUTORATE which improves its accuracy by 10%. We use temporal and spatial attention to visualize the key frame and the key action that justify the model's predicted rating. Further, we observe that personalization in AUTORATE can improve driver-specific results by a significant amount.
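As an aside on the agreement measure used above: a minimal sketch of Cohen's kappa between an automatic rating and one human annotator is shown below. The 1-3 rating scale and the example data are hypothetical illustrations, not values from the paper.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa between two raters over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label distribution.
    """
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical attention ratings on a 1-3 scale for 10 trips.
auto_rating = [3, 2, 3, 1, 2, 3, 3, 1, 2, 2]
human_rating = [3, 2, 3, 1, 2, 3, 2, 1, 2, 2]
print(f"kappa = {cohens_kappa(auto_rating, human_rating):.2f}")
```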
HWNet v2: An efficient word image representation for handwritten documents
PRAVEEN KRISHNAN, Jawahar C V
International Journal on Document Analysis and Recognition, IJDAR, 2019
@inproceedings{bib_Hwne_2019, AUTHOR = {PRAVEEN KRISHNAN, Jawahar C V}, TITLE = {HWNet v2: An efficient word image representation for handwritten documents}, BOOKTITLE = {International Journal on Document Analysis and Recognition}, YEAR = {2019}}
We present a framework for learning an efficient holistic representation for handwritten word images. The proposed method uses a deep convolutional neural network with a traditional classification loss. The major strengths of our work lie in: (i) the efficient use of synthetic data to pre-train a deep network, (ii) an adapted version of the ResNet-34 architecture with region-of-interest pooling (referred to as HWNet v2) which learns discriminative features for variable-sized word images, and (iii) a realistic augmentation of the training data with multiple scales and distortions which mimics the natural process of handwriting. We further investigate transfer learning to reduce the domain gap between the synthetic and real domains, and also analyze the invariances learned at different layers of the network using visualization techniques proposed in the literature.
IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments
Girish Varma, Anbumani Subramanian, Manmohan Chandraker, Anoop Namboodiri, Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2019
@inproceedings{bib_IDD:_2019, AUTHOR = {Girish Varma, Anbumani Subramanian, Manmohan Chandraker, Anoop Namboodiri, Jawahar C V}, TITLE = {IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}, YEAR = {2019}}
While several datasets for autonomous navigation have become available in recent years, they have tended to focus on structured driving environments. This usually corresponds to well-delineated infrastructure such as lanes, a small number of well-defined categories for traffic participants, low variation in object or background appearance, and strong adherence to traffic rules. We propose IDD, a novel dataset for road scene understanding in unstructured environments where the above assumptions are largely not satisfied. It consists of 10,004 images, finely annotated with 34 classes collected from 182 drive sequences on Indian roads. The label set is expanded in comparison to popular benchmarks such as Cityscapes to account for new classes. It also reflects label distributions of road scenes significantly different from existing datasets, with most classes displaying greater within-class diversity. Consistent with …
Generating 1 Minute Summaries of Day Long Egocentric Videos
ANUJ RATHORE, Pravin Nagar, Chetan Arora, Jawahar C V
International Conference on Multimedia, IMM, 2019
@inproceedings{bib_Gene_2019, AUTHOR = {ANUJ RATHORE, Pravin Nagar, Chetan Arora, Jawahar C V}, TITLE = {Generating 1 Minute Summaries of Day Long Egocentric Videos}, BOOKTITLE = {International Conference on Multimedia}, YEAR = {2019}}
The popularity of egocentric cameras and their always-on nature has led to an abundance of day-long first-person videos. Because of the extreme shake and highly redundant nature, these videos are difficult to watch from beginning to end and often require summarization tools for their efficient consumption. However, traditional summarization techniques developed for static surveillance videos, or for highly curated sports videos and movies, are either not suitable or simply do not scale to such hours-long videos in the wild. On the other hand, specialized summarization techniques developed for egocentric videos limit their focus to important objects and people. In this paper, we present a novel unsupervised reinforcement learning technique to generate video summaries from day-long egocentric videos. Our approach can be adapted to generate summaries of various lengths, making it possible to view even 1-minute summaries of one's entire day. The technique can also be adapted to various rewards, such as the distinctiveness and indicativeness of the summary. When using a facial-saliency-based reward, we show that our approach generates summaries focusing on social interactions, similar to the current state of the art (SOTA). Quantitative comparison on the benchmark Disney dataset shows that our method achieves significant improvement in Relaxed F-Score (RFS) (32.56 vs. 19.21) and BLEU score (12.12 vs. 10.64). Finally, we show that our technique can also be applied to summarizing traditional, short, hand-held videos, where we improve the SOTA F-score on the benchmark SumMe and TVSum datasets from 41.4 to 45.6 and from 57.6 to 59.1, respectively.
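The abstract does not define its rewards precisely; purely as an illustration of what reward signals such as distinctiveness and indicativeness of a summary can look like, here is a small sketch over hypothetical per-frame features (the function name, feature setup, and selection mask are assumptions, not the paper's formulation).

```python
import numpy as np

def summary_rewards(features, selected):
    """Toy distinctiveness / indicativeness rewards for a frame selection.

    features : (n_frames, d) array of per-frame feature vectors.
    selected : boolean mask marking frames kept in the summary.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    picked = feats[selected]
    n = len(picked)

    # Distinctiveness: selected frames should be mutually dissimilar.
    sim = picked @ picked.T
    distinct = (1.0 - (sim.sum() - n) / (n * (n - 1))) if n > 1 else 0.0

    # Indicativeness: every frame should be close to some selected frame.
    indicative = (feats @ picked.T).max(axis=1).mean()
    return distinct, indicative

# Hypothetical features for a 100-frame video; keep every 10th frame.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 64))
selected = np.zeros(100, dtype=bool)
selected[::10] = True
print(summary_rewards(features, selected))
```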
Pan-Renal Cell Carcinoma classification and survival prediction from histopathology images using deep learning
Vinod Palakkad Krishnanunni, Jawahar C V
NPG Nature Scientific Reports, NPG, 2019
@inproceedings{bib_Pan-_2019, AUTHOR = {Vinod Palakkad Krishnanunni, Jawahar C V}, TITLE = {Pan-Renal Cell Carcinoma classification and survival prediction from histopathology images using deep learning}, BOOKTITLE = {NPG Nature Scientific Reports}, YEAR = {2019}}
Histopathological images contain morphological markers of disease progression that have diagnostic and predictive value. In this study, we demonstrate how a deep learning framework can be used for automatic classification of Renal Cell Carcinoma (RCC) subtypes, and for identification of features that predict survival outcome from digital histopathological images. Convolutional neural networks (CNNs) trained on whole-slide images distinguish clear cell and chromophobe RCC from normal tissue with classification accuracies of 93.39% and 87.34%, respectively. Further, a CNN trained to distinguish clear cell, chromophobe and papillary RCC achieves a classification accuracy of 94.07%. We also introduce a support vector machine-based method that breaks the multi-class classification task into multiple binary classification tasks, which not only improves the performance of the model but also helps to deal with data imbalance. Finally, we extract morphological features from high-probability tumor regions identified by the CNN to predict patient survival outcome for the most common subtype, clear cell RCC. The generated risk index, based on both tumor shape and nuclei features, is significantly associated with patient survival outcome. These results highlight that deep learning can play a role in both cancer diagnosis and prognosis.
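The exact decomposition used in the paper is not spelled out in the abstract; the sketch below only illustrates the general idea of splitting a multi-class task into per-class binary SVMs with class weighting to counter imbalance, using hypothetical CNN features and labels.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Hypothetical CNN features for tissue patches and subtype labels
# (0 = clear cell, 1 = chromophobe, 2 = papillary), with class imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 128))
y = np.concatenate([np.zeros(400), np.ones(150), np.full(50, 2)]).astype(int)

# One binary SVM per class; 'balanced' weights counter the imbalance.
clf = OneVsRestClassifier(SVC(kernel="rbf", class_weight="balanced"))
clf.fit(X, y)
print(clf.predict(X[:5]))
```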
Beyond supervised learning: A computer vision perspective
Anbumani Subramanian, Vineeth N Balasubramanian, Jawahar C V
Journal of the Indian Institute of Science, IIS, 2019
@inproceedings{bib_Beyo_2019, AUTHOR = {Anbumani Subramanian, Vineeth N Balasubramanian, Jawahar C V}, TITLE = {Beyond supervised learning: A computer vision perspective}, BOOKTITLE = {Journal of the Indian Institute of Science}, YEAR = {2019}}
Fully supervised deep learning-based methods have created a profound impact in various fields of computer science. Compared to classical methods, supervised deep learning-based techniques face scalability issues as they require huge amounts of labeled data and, more significantly, are unable to generalize to multiple domains and tasks. In recent years, a lot of research has been targeted towards addressing these issues within the deep learning community. Although there have been extensive surveys on learning paradigms such as semi-supervised and unsupervised learning, there are few timely reviews after the emergence of deep learning. In this paper, we provide an overview of the contemporary literature surrounding alternatives to fully supervised learning in the deep learning context. First, we summarize the relevant techniques that fall between the paradigms of supervised and unsupervised learning. Second, we take autonomous navigation as a running example to explain and compare different models. Finally, we highlight some shortcomings of current methods and suggest future directions.
City-scale road audit system using deep learning
YARRAM SUDHIR KUMAR REDDY, Girish Varma, Jawahar C V
International Conference on Intelligent Robots and Systems, IROS, 2018
@inproceedings{bib_City_2018, AUTHOR = {YARRAM SUDHIR KUMAR REDDY, Girish Varma, Jawahar C V}, TITLE = {City-scale road audit system using deep learning}, BOOKTITLE = {International Conference on Intelligent Robots and Systems}, YEAR = {2018}}
Road networks in cities are massive and are a critical component of mobility. Fast response to defects, which can occur not only due to regular wear and tear but also because of extreme events like storms, is essential. Hence there is a need for an automated system that is quick, scalable and cost-effective for gathering information about defects. We propose a system for city-scale road audit using some of the most recent developments in deep learning and semantic segmentation. For building and benchmarking the system, we curated a dataset with the annotations required for road defects. However, many of the labels required for road audit have high ambiguity, which we overcome by proposing a label hierarchy. We also propose a multi-step deep learning model that segments the road, subdivides it further into defects, tags the frame for each defect, and finally localizes the defects on a map gathered using GPS. We analyze and evaluate the models on image tagging as well as segmentation at different levels of the label hierarchy.
Learning human poses from actions
ADITYA ARUN, Jawahar C V, M. Pawan Kumar
British Machine Vision Conference, BMVC, 2018
@inproceedings{bib_Lear_2018, AUTHOR = {ADITYA ARUN, Jawahar C V, M. Pawan Kumar}, TITLE = {Learning human poses from actions}, BOOKTITLE = {British Machine Vision Conference}, YEAR = {2018}}
We consider the task of learning to estimate human pose in still images. In order to avoid the high cost of full supervision, we propose to use a diverse data set, which consists of two types of annotations: (i) a small number of images are labeled using the expensive ground-truth pose; and (ii) other images are labeled using the inexpensive action label. As action information helps narrow down the pose of a human, we argue that this approach can help reduce the cost of training without significantly affecting the accuracy. To demonstrate this we design a probabilistic framework that employs two distributions: (i) a conditional distribution to model the uncertainty over the human pose given the image and the action; and (ii) a prediction distribution, which provides the pose of an image without using any action information. We jointly estimate the parameters of the two aforementioned distributions by minimizing their dissimilarity coefficient, as measured by a task-specific loss function. During both training and testing, we only require an efficient sampling strategy for both the aforementioned distributions. This allows us to use deep probabilistic networks that are capable of providing accurate pose estimates for previously unseen images. Using the MPII data set, we show that our approach outperforms baseline methods that either do not use the diverse annotations or rely on pointwise estimates of the pose.
Unsupervised learning of face representations
Samyak Datta, Gaurav Sharma, Jawahar C V
International Conference on Automatic Face and Gesture Recognition, FG, 2018
@inproceedings{bib_Unsu_2018, AUTHOR = {Samyak Datta, Gaurav Sharma, Jawahar C V}, TITLE = {Unsupervised learning of face representations}, BOOKTITLE = {International Conference on Automatic Face and Gesture Recognition}, YEAR = {2018}}
We present an approach for unsupervised training of CNNs in order to learn discriminative face representations. We mine supervised training data by noting that multiple faces in the same video frame must belong to different persons, and the same face tracked across multiple frames must belong to the same person. We obtain millions of face pairs from hundreds of videos without using any manual supervision. Although faces extracted from videos have a lower spatial resolution than those available as part of standard supervised face datasets such as LFW and CASIA-WebFace, the former represent a much more realistic setting, e.g., in surveillance scenarios where most of the detected faces are very small. We train our CNNs with the relatively low-resolution faces extracted from the collected video frames, and achieve a higher verification accuracy on the benchmark LFW dataset than hand-crafted features such as LBPs, and even surpass the performance of state-of-the-art deep networks such as VGG-Face when they are made to work with low-resolution input images.
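A minimal sketch of the pair-mining rule stated above (detections from the same face track form positive pairs; distinct faces co-occurring in a frame form negative pairs). The track data structure here is a hypothetical stand-in for a face tracker's output, not the paper's format.

```python
from itertools import combinations

def mine_face_pairs(tracks):
    """tracks: {track_id: {frame_id: face_crop_id}} from a face tracker.

    Positive pairs: two detections of the same track (same person).
    Negative pairs: two detections co-occurring in one frame (different people).
    """
    positives, negatives = [], []

    # Same track across frames -> same identity.
    for detections in tracks.values():
        positives.extend(combinations(detections.values(), 2))

    # Different faces in the same frame -> different identities.
    frames = {}
    for detections in tracks.values():
        for frame_id, face in detections.items():
            frames.setdefault(frame_id, []).append(face)
    for faces in frames.values():
        negatives.extend(combinations(faces, 2))

    return positives, negatives

# Hypothetical tracker output: two people visible over frames 0-2.
tracks = {0: {0: "a0", 1: "a1", 2: "a2"}, 1: {0: "b0", 1: "b1"}}
pos, neg = mine_face_pairs(tracks)
print(len(pos), len(neg))  # 4 positive pairs, 2 negative pairs
```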
NeuroIoU: Learning a Surrogate Loss for Semantic Segmentation
NAGENDAR. G, DIGVIJAY SINGH, V. Balasubramanian, Jawahar C V
British Machine Vision Conference, BMVC, 2018
@inproceedings{bib_Neur_2018, AUTHOR = {NAGENDAR. G, DIGVIJAY SINGH, V. Balasubramanian, Jawahar C V}, TITLE = {NeuroIoU: Learning a Surrogate Loss for Semantic Segmentation}, BOOKTITLE = {British Machine Vision Conference}, YEAR = {2018}}
Semantic segmentation is a popular task in computer vision today, and deep neural network models have emerged as the popular solution to this problem in recent times. The typical loss function used to train neural networks for this task is the cross-entropy loss. However, the success of the learned models is measured using Intersection-over-Union (IoU), which is inherently non-differentiable. This gap between the performance measure and the loss function results in a drop in performance, which has also been studied by a few recent efforts. In this work, we propose a novel method to automatically learn a surrogate loss function that approximates the IoU loss and is better suited for good IoU performance. To the best of our knowledge, this is the first work that attempts to learn a loss function for this purpose. The proposed loss can be directly applied over any network. We validate our method with different networks (FCN, SegNet, UNet) on the PASCAL VOC and Cityscapes datasets. Our results show consistent improvement over baseline methods.
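For context on the loss/metric gap discussed above: a common hand-crafted differentiable relaxation of the IoU loss looks like the sketch below. This is not the learned surrogate proposed in the paper, which replaces such fixed approximations with a learned one; the example prediction and mask are hypothetical.

```python
import numpy as np

def soft_iou_loss(probs, target, eps=1e-6):
    """Differentiable relaxation of 1 - IoU for binary segmentation.

    probs  : predicted foreground probabilities in [0, 1], shape (H, W).
    target : binary ground-truth mask, shape (H, W).
    """
    intersection = (probs * target).sum()
    union = (probs + target - probs * target).sum()
    return 1.0 - (intersection + eps) / (union + eps)

# Hypothetical prediction vs. ground truth on a 4x4 mask.
probs = np.array([[0.9, 0.8, 0.1, 0.0],
                  [0.7, 0.9, 0.2, 0.1],
                  [0.1, 0.2, 0.0, 0.0],
                  [0.0, 0.1, 0.0, 0.0]])
target = np.zeros((4, 4))
target[:2, :2] = 1.0
print(round(soft_iou_loss(probs, target), 3))
```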
Word spotting in silent lip videos
Abhishek Jha, Vinay P. Namboodiri, Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2018
@inproceedings{bib_Word_2018, AUTHOR = {Abhishek Jha, Vinay P. Namboodiri, Jawahar C V}, TITLE = {Word spotting in silent lip videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}, YEAR = {2018}}
Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to the limited vocabulary and high dependency on the model's recognition performance. Our contribution is two-fold: 1) we develop a pipeline for recognition-free retrieval, and show its performance against recognition-based retrieval on a large-scale dataset and another set of out-of-vocabulary words; 2) we introduce a query expansion technique using pseudo-relevant feedback and propose a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision than the recognition-based method on the large-scale LRW dataset. Finally, we demonstrate an application of the method by spotting words in a popular speech video ("The Great Dictator" by Charlie Chaplin), showing that word retrieval can be used to understand what was spoken, perhaps even in silent movies.
Automated Top View Registration of Broadcast Football Videos
RAHUL ANAND SHARMA, BHARATH BHAT, Vineet Gandhi, Jawahar C V
Winter Conference on Applications of Computer Vision, WACV, 2017
@inproceedings{bib_Auto_2017, AUTHOR = {RAHUL ANAND SHARMA, BHARATH BHAT, Vineet Gandhi, Jawahar C V}, TITLE = {Automated Top View Registration of Broadcast Football Videos}, BOOKTITLE = {Winter Conference on Applications of Computer Vision}, YEAR = {2017}}
In this paper, we propose a novel method to register football broadcast video frames on the static top view model of the playing surface. The proposed method is fully automatic in contrast to the current state of the art which requires manual initialization of point correspondences between the image and the static model. Automatic registration using existing approaches has been difficult due to the lack of sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the registration problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic dictionary generation allows us to exhaustively cover a wide variety of camera angles and positions and reduce this problem to a minimal per-frame edge map matching procedure. We show that the per-frame results can …
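The nearest-neighbour formulation described above can be sketched compactly: pre-render edge maps for candidate homographies of a top-view line model, then pick the dictionary entry that best matches the frame's edge map. The rendering and the simple overlap score below are illustrative assumptions, not the paper's exact features or matching procedure.

```python
import cv2
import numpy as np

def build_dictionary(field_lines, homographies, size=(180, 320)):
    """Pre-render a (homography, edge map) dictionary from a top-view line model."""
    h, w = size
    return [(H, cv2.warpPerspective(field_lines, H, (w, h))) for H in homographies]

def register_frame(edge_map, dictionary):
    """Return the homography whose rendered edge map best overlaps the frame's edges."""
    best_H, best_score = None, -1.0
    for H, rendered in dictionary:
        # Overlap score: count of pixels that are edges in both maps.
        score = float((edge_map > 0).astype(np.float32).ravel()
                      @ (rendered > 0).astype(np.float32).ravel())
        if score > best_score:
            best_H, best_score = H, score
    return best_H

# Hypothetical top-view line model and a few perturbed candidate homographies.
field_lines = np.zeros((180, 320), np.uint8)
cv2.rectangle(field_lines, (10, 10), (310, 170), 255, 1)
candidates = [np.eye(3) + 0.001 * np.random.default_rng(i).normal(size=(3, 3))
              for i in range(50)]
dictionary = build_dictionary(field_lines, candidates)
best = register_frame(dictionary[7][1], dictionary)  # should typically recover candidate 7
```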
Fine-grain annotation of cricket videos
RAHUL ANAND SHARMA, Pramod Sankar K., Jawahar C V
Asian Conference on Pattern Recognition, ACPR, 2015
@inproceedings{bib_Fine_2015, AUTHOR = {RAHUL ANAND SHARMA, Pramod Sankar K., Jawahar C V}, TITLE = {Fine-grain annotation of cricket videos}, BOOKTITLE = {Asian Conference on Pattern Recognition}, YEAR = {2015}}
The recognition of human activities is one of the key problems in video understanding. Action recognition is challenging even for specific categories of videos, such as sports, that contain only a small set of actions. Interestingly, sports videos are accompanied by detailed commentaries available online, which could be used to perform action annotation in a weakly-supervised setting. For the specific case of cricket videos, we address the challenge of temporal segmentation and annotation of actions with semantic descriptions. Our solution consists of two stages. In the first stage, the video is segmented into "scenes" by utilizing the scene category information extracted from the text commentary. The second stage consists of classifying video shots as well as the phrases in the textual description into various categories. The relevant phrases are then suitably mapped to the video shots. The novel aspect of this work is the …
Blocks that shout: Distinctive parts for scene classification
MAYANK JUNEJA, Andrea Vedaldi, Jawahar C V, Andrew Zisserman
Computer Vision and Pattern Recognition, CVPR, 2013
@inproceedings{bib_Bloc_2013, AUTHOR = {MAYANK JUNEJA, Andrea Vedaldi, Jawahar C V, Andrew Zisserman}, TITLE = {Blocks that shout: Distinctive parts for scene classification}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2013}}
The automatic discovery of distinctive parts for an object or scene class is challenging since it requires simultaneously learning the part appearance and identifying the part occurrences in images. In this paper, we propose a simple, efficient, and effective method to do so. We address this problem by learning parts incrementally, starting from a single part occurrence with an Exemplar SVM. In this manner, additional part instances are discovered and aligned reliably before being considered as training examples. We also propose entropy-rank curves as a means of evaluating the distinctiveness of parts shareable between categories and use them to select useful parts out of a set of candidates. We apply the new representation to the task of scene categorisation on the MIT Scene 67 benchmark. We show that our method can learn parts which are significantly more informative and for a fraction of the cost, compared to previous part-learning methods such as Singh et al. [28]. We also show that a well-constructed bag of words or Fisher vector model can substantially outperform the previous state-of-the-art classification performance on this data.
Cats and dogs
PARKHI OMKAR MORESHWAR, Andrea Vedaldi, Andrew Zisserman, Jawahar C V
Computer Vision and Pattern Recognition, CVPR, 2012
@inproceedings{bib_Cats_2012, AUTHOR = {PARKHI OMKAR MORESHWAR, Andrea Vedaldi, Andrew Zisserman, Jawahar C V}, TITLE = {Cats and dogs}, BOOKTITLE = {Computer Vision and Pattern Recognition}, YEAR = {2012}}
We investigate the fine grained object categorization problem of determining the breed of animal from an image. To this end we introduce a new annotated dataset of pets covering 37 different breeds of cats and dogs. The visual problem is very challenging as these animals, particularly cats, are very deformable and there can be quite subtle differences between the breeds. We make a number of contributions: first, we introduce a model to classify a pet breed automatically from an image. The model combines shape, captured by a deformable part model detecting the pet face, and appearance, captured by a bag-of-words model that describes the pet fur. Fitting the model involves automatically segmenting the animal in the image. Second, we compare two classification approaches: a hierarchical one, in which a pet is first assigned to the cat or dog family and then to a breed, and a flat one, in which the breed is obtained directly. We also investigate a number of animal and image orientated spatial layouts. These models are very good: they beat all previously published results on the challenging ASIRRA test (cat vs dog discrimination). When applied to the task of discriminating the 37 different breeds of pets, the models obtain an average accuracy of about 59%, a very encouraging result considering the difficulty of the problem.
The truth about cats and dogs
PARKHI OMKAR MORESHWAR, Andrea Vedaldi, Jawahar C V, Andrew Zisserman
International Conference on Computer Vision, ICCV, 2011
@inproceedings{bib_The__2011, AUTHOR = {PARKHI OMKAR MORESHWAR, Andrea Vedaldi, Jawahar C V, Andrew Zisserman}, TITLE = {The truth about cats and dogs}, BOOKTITLE = {International Conference on Computer Vision}, YEAR = {2011}}
Template-based object detectors such as the deformable parts model of Felzenszwalb et al. [11] achieve state-of-the-art performance for a variety of object categories, but are still outperformed by simpler bag-of-words models for highly flexible objects such as cats and dogs. In these cases we propose to use the template-based model to detect a distinctive part for the class, followed by detecting the rest of the object via segmentation on image specific information learnt from that part. This approach is motivated by two observations: (i) many object classes contain distinctive parts that can be detected very reliably by template-based detectors, whilst the entire object cannot; (ii) many classes (e.g. animals) have fairly homogeneous coloring and texture that can be used to segment the object once a sample is provided in an image. We show quantitatively that our method substantially outperforms whole-body template-based detectors for these highly deformable object categories, and indeed achieves accuracy comparable to the state-of-the-art on the PASCAL VOC competition, which includes other models such as bag-of-words.