Abstract
The Oxford/IIIT team participated in the high-level feature extraction and interactive search tasks. A vision-only approach was used for both tasks; no text or audio information was used. For the high-level feature extraction task, we used two different approaches, both based on a combination of visual features. One used an SVM classifier with a linear combination of kernels; the other used a random forest classifier. For both methods, we trained all high-level features using publicly available annotations [3]. The advantage of the random forest classifier is its speed of training and testing. In addition, for the people feature, we took a more targeted approach: we used a real-time face detector and an upper body detector, both run on every frame. Our best performing submission, C_OXVGG_1_1, which used a rank fusion of our random forest and SVM approaches, achieved an mAP of 0.101 and was above the median for all but one feature. In the interactive search task, our team came third overall with an mAP of 0.158. The system used was identical to last year's, the only change being a source of accurate upper body detections.
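As an illustration of the rank fusion step, the sketch below combines two scored shot lists by mean rank. The abstract does not state which fusion rule was used, so the mean-rank rule, the function names, and the shot identifiers here are assumptions made for the example only. Fusing ranks rather than raw scores sidesteps the fact that SVM decision values and random forest posteriors lie on different scales.

```python
from typing import Dict, List


def ranks_from_scores(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each shot id to its 1-based rank (1 = highest score)."""
    # Ties in score are broken by shot id so the ranking is deterministic.
    ordered = sorted(scores, key=lambda shot: (-scores[shot], shot))
    return {shot: position for position, shot in enumerate(ordered, start=1)}


def rank_fusion(score_lists: List[Dict[str, float]]) -> List[str]:
    """Fuse scored lists by mean rank (an assumed rule, not the paper's).

    Every list is assumed to score the same set of shots.
    Shots with a lower mean rank come first in the fused list.
    """
    per_list = [ranks_from_scores(scores) for scores in score_lists]
    shots = sorted(per_list[0])
    mean_rank = {
        shot: sum(ranks[shot] for ranks in per_list) / len(per_list)
        for shot in shots
    }
    return sorted(shots, key=lambda shot: (mean_rank[shot], shot))


# Hypothetical per-shot scores from the two classifiers for one feature.
forest_scores = {"shot_a": 0.9, "shot_b": 0.4, "shot_c": 0.7, "shot_d": 0.1}
svm_scores = {"shot_a": 0.2, "shot_b": -0.3, "shot_c": 1.5, "shot_d": 1.0}

fused = rank_fusion([forest_scores, svm_scores])
print(fused)
```

In this toy input, shot_c is ranked second by the forest and first by the SVM, so it leads the fused list even though neither raw score is directly comparable with the other.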