Abstract
Language identification (LID) systems that can model high-level information, such as phonotactics, have exhibited superior performance. State-of-the-art models use sequential models to capture this high-level information, but such models are sensitive to utterance length and do not generalize equally well across variable-length utterances. To capture this information effectively, a feature that can model long-term temporal context is required. This study aims to capture long-term temporal context by appending successive shifted delta cepstral (SDC) features. Deep neural networks have been explored for developing the LID systems. Experiments have been performed on the AP17-OLR database. LID systems developed by stacking SDC features show significant improvement over the system trained with plain SDC features. The proposed feature, combined with residual connections in the feed-forward networks, reduced the equal error rate from 21.04, 18.02, and 16.45 to 14.42, 11.14, and 10.11 on the 1-second, 3-second, and >3-second test utterances, respectively.