Abstract
Detecting objects in images and videos is challenging due to i) large intra-class variation and ii) variations in pose and scale. Building strong recognition engines for generic object categories is hard, and applying them to large video collections is computationally infeasible due to the sheer number of frames to test. In this paper, we present a detection-by-interpolation framework, in which object tracking is achieved by interpolating between candidate object detections in a subset of the video frames. Given the location of an object in two frames of a video shot, our algorithm identifies the locations of the object in the intermediate frames. We evaluate two tracking solutions based on greedy and dynamic programming approaches, and observe that a hybrid method yields a significant boost in accuracy as well as a speedup in detection. On 6 hours of HD-quality video, we cut the detection time from 10,000 hours to 1,500 hours while simultaneously improving the detection accuracy from 54% (for [1]) to 68%. As a result of this work, we build a dataset of 100,000 car images spanning a wide range of viewpoints, scales, and makes; it is about 100 times larger than existing collections.
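As a rough illustration of the dynamic-programming variant mentioned above, the following Python sketch interpolates a track between two keyframe boxes by choosing, in each intermediate frame, the scored candidate detection that maximizes accumulated detector score minus a motion-smoothness penalty. This is a minimal sketch under stated assumptions, not the paper's implementation: the box format, scoring, and the `motion_penalty` weight are illustrative choices.

```python
# Hypothetical sketch of DP interpolation between two keyframe detections.
# Boxes are (x1, y1, x2, y2); candidates[t] is a list of (box, detector_score)
# pairs for intermediate frame t. All names and weights are assumptions.

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def jump_cost(a, b):
    # Euclidean distance between box centers, used as a motion penalty.
    (ax, ay), (bx, by) = center(a), center(b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def interpolate_track(start_box, end_box, candidates, motion_penalty=0.05):
    """Return one box per intermediate frame, maximizing total detector
    score minus a penalty on frame-to-frame displacement."""
    frames = [[(start_box, 0.0)]] + candidates + [[(end_box, 0.0)]]
    best = [[0.0] * len(frames[0])]  # best[t][i]: best score ending at candidate i
    back = []                        # back[t-1][i]: chosen predecessor in frame t-1
    for t in range(1, len(frames)):
        scores, pointers = [], []
        for box, det_score in frames[t]:
            options = [best[t - 1][j] - motion_penalty * jump_cost(pbox, box)
                       for j, (pbox, _) in enumerate(frames[t - 1])]
            j_best = max(range(len(options)), key=options.__getitem__)
            scores.append(options[j_best] + det_score)
            pointers.append(j_best)
        best.append(scores)
        back.append(pointers)
    # Trace back from the single end-frame candidate to recover the path.
    idx, chosen = 0, []
    for t in range(len(frames) - 1, 0, -1):
        chosen.append(idx)        # candidate index chosen at frame t
        idx = back[t - 1][idx]    # move to its predecessor in frame t - 1
    chosen.reverse()              # chosen[k] is the candidate index at frame k + 1
    # Drop the end-frame entry; report one box per intermediate frame.
    return [frames[t + 1][i][0] for t, i in enumerate(chosen[:-1])]
```

In this sketch, the greedy alternative would simply pick the best-scoring candidate frame by frame, while the DP above optimizes over the whole shot; a hybrid scheme, as the abstract suggests, could fall back to the cheaper greedy choice when candidates are unambiguous.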