Abstract
We address the problem of automatic image annotation in large-vocabulary datasets. In such datasets, a given label can have several other labels that are easily confused with it. Three possible factors for this are (i) incomplete-labeling (“cars” vs. “vehicle”), (ii) label-ambiguity (“flowers” vs. “blooms”), and (iii) structural-overlap (“lion” vs. “tiger”). While previous studies in this domain have mostly focused on nearest-neighbour based models, we show that even the conventional one-vs-rest SVM significantly outperforms several benchmark models. We also demonstrate that a simple modification of the SVM hinge loss significantly improves its performance. In particular, we introduce a tolerance parameter in the hinge loss. This makes the new model more tolerant of classification errors on samples tagged with confusing labels than of errors on other samples. The tolerance parameter is determined automatically from visual similarity and dataset statistics. Experimental evaluations demonstrate that our method (referred to as SVM with Variable Tolerance, or SVM-VT) shows promising results on the task of image annotation on three challenging datasets, and establishes a baseline for such models in this domain.
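To make the idea of a tolerance parameter concrete, the following is a minimal sketch, not the formulation used by SVM-VT. It assumes one plausible reading in which the tolerance relaxes the margin of the standard hinge loss for an individual sample. The function name `tolerant_hinge_loss`, the range [0, 1) for the tolerance, and the numeric values are all hypothetical and chosen only for illustration.

```python
def tolerant_hinge_loss(score, label, tolerance=0.0):
    """Hinge loss with a relaxed margin (illustrative assumption only).

    score:     real-valued classifier output f(x) for one sample
    label:     +1 or -1
    tolerance: hypothetical per-sample value in [0, 1);
               0 recovers the standard hinge loss max(0, 1 - y * f(x))
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    if not 0.0 <= tolerance < 1.0:
        raise ValueError("tolerance must lie in [0, 1)")
    margin = 1.0 - tolerance  # a larger tolerance demands a smaller margin
    return max(0.0, margin - label * score)


# A negative sample for the label "lion" that the classifier scores slightly positive.
standard = tolerant_hinge_loss(0.2, -1)      # treated as an ordinary sample
relaxed = tolerant_hinge_loss(0.2, -1, 0.5)  # treated as tagged with a confusing label, e.g. "tiger"
```

Under this assumed form, the same misclassified sample incurs a smaller penalty when it carries a confusing label, which is the qualitative behaviour described above. How the tolerance is actually derived from visual similarity and dataset statistics is not specified here and is left out of the sketch.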