Abstract
Query-by-example spoken term detection (QbE-STD) is the task of finding the subsequence of a reference that matches a query, where both the query and the reference are in audio format. Dynamic time warping (DTW) based techniques have been explored to match two sequences of different lengths in an unsupervised manner. In this paper, a completely unsupervised approach based on Segmental DTW (SDTW), a variant of DTW, is considered for QbE-STD, where both reference and query utterances are represented as sequences of Gaussian posteriorgram vectors. SDTW with two different types of bands, the Sakoe-Chiba band and the Itakura parallelogram, is used to compare the Gaussian posteriorgrams of the query and the reference. The effect of varying the local constraints of the DTW algorithm on the performance of SDTW is also analyzed. Results obtained on the MediaEval 2012 dataset indicate that SDTW using a band with variable speaking rate, as in the Itakura parallelogram, performs better than SDTW using a band with fixed speaking rate, as in the Sakoe-Chiba band, across all variations in local constraints.
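
To make the distinction between the two bands concrete, the following is a minimal sketch of how each one restricts the cells a warping path may visit when a query of n frames is aligned against a reference segment of m frames. The function names, the radius parameter, and the default maximum slope of 2 are illustrative assumptions and are not taken from the paper; the segmental search over the reference and the local constraints are not shown.

```python
def sakoe_chiba_mask(n, m, radius):
    """Allow cells within `radius` frames of the length-normalized diagonal.

    The band has the same width at every query frame.
    """
    mask = [[False] * m for _ in range(n)]
    for i in range(n):
        # Project query frame i onto the reference axis.
        diag = i * (m - 1) / (n - 1) if n > 1 else 0.0
        for j in range(m):
            mask[i][j] = abs(j - diag) <= radius
    return mask


def itakura_mask(n, m, max_slope=2.0):
    """Allow cells reachable from both corners with slope in [1/max_slope, max_slope].

    The region is narrow near the endpoints and widest mid-sequence.
    """
    mask = [[False] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            ri, rj = n - 1 - i, m - 1 - j  # frames remaining to the end corner
            from_start = i / max_slope <= j <= i * max_slope
            to_end = ri / max_slope <= rj <= ri * max_slope
            mask[i][j] = from_start and to_end
    return mask


if __name__ == "__main__":
    for name, mask in (("Sakoe-Chiba", sakoe_chiba_mask(5, 5, 1)),
                       ("Itakura", itakura_mask(5, 5))):
        print(name)
        for row in mask:
            print("".join("#" if ok else "." for ok in row))
```

In this sketch the Sakoe-Chiba band bounds the absolute deviation from the diagonal, while the Itakura parallelogram bounds the slope of the path, so the amount of permitted stretching or compression grows toward the middle of the alignment.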