Abstract
Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given a query image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption about the specific form of the text; it could be a set of labels, phrases, or even captions. We pose both tasks in a retrieval framework. For Im2Text, given a query image, the goal is to retrieve a ranked list of semantically relevant texts from an independent text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual metadata). We propose a novel unified framework based on Structural SVMs for these two tasks, and show how it can be trained and tested efficiently. Using a variety of loss functions, we conduct extensive experiments on three popular datasets (two medium-scale datasets containing a few thousand samples each, and one web-scale dataset containing one million samples). The results show that our framework compares favorably with competing baseline cross-modal search techniques, confirming its efficacy.

Keywords: Image search; Image description; Cross-media analysis
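The abstract does not specify the form of the compatibility function used for retrieval; as a minimal illustrative sketch only (not the paper's Structural SVM formulation), the snippet below shows how both directions of cross-modal retrieval can be posed as ranking candidates by a bilinear score between image and text features. The names `rank_texts`, `rank_images`, and the matrix `W` are hypothetical placeholders.

```python
import numpy as np

def rank_texts(image_feat, text_feats, W):
    """Im2Text direction: rank candidate texts for a query image by an
    assumed bilinear compatibility score s(x, t) = x^T W t."""
    scores = text_feats @ (W.T @ image_feat)   # one score per candidate text
    return np.argsort(-scores)                 # indices, most relevant first

def rank_images(text_feat, image_feats, W):
    """Text2Im direction: rank unannotated images for a query text."""
    scores = image_feats @ (W @ text_feat)
    return np.argsort(-scores)

# Toy usage with random features (d_img = 5, d_txt = 4, 10 candidates each).
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4))                # compatibility matrix; learned in practice
print(rank_texts(rng.standard_normal(5), rng.standard_normal((10, 4)), W))
print(rank_images(rng.standard_normal(4), rng.standard_normal((10, 5)), W))
```

In such a setup, training would fit `W` so that relevant texts (or images) score higher than irrelevant ones for each query, which is the kind of ranking objective a Structural SVM can optimize under different loss functions.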