Abstract
The demand for high-quality speech data has been increasing as deep-learning approaches gain popularity in speech applications. Among these, automatic speech recognition (ASR) and text-to-speech (TTS) require large amounts of data containing speech and the corresponding text. For these applications, high-quality data is often obtained through manual validation, which ensures matching between speech and text. However, manual validation does not scale with this demand because of the cost and time involved. To cater to the demand for high-quality data, validating the data automatically could be useful. In this work, toward automatic data validation, a spoken English corpus named IIITH MM2 Speech-Text is created, containing matched and mismatched speech-text pairs recorded under read speech conditions from Indian speakers of different nativities. For its creation, we consider 100 unique stimuli selected from the TIMIT corpus to ensure phonetic richness, for which a joint entropy maximization approach is proposed. These stimuli are recorded from 50 speakers, resulting in matched and mismatched sets containing 5000 and 764 utterances, with total durations of 6 hours and 1 hour, respectively. The mismatched set contains speech from instances where speakers naturally made spoken errors while reading the reference text. It also contains two stimuli per utterance: one is the reference text, and the other is manually annotated text that reflects the erroneous speech. Thus, the reference and annotated texts are used for building models for speech-text mismatch detection and correction, respectively. To the best of our knowledge, no such corpora exist containing both matched and mismatched speech-text pairs. As a preliminary analysis for speech-text mismatch detection, a baseline using Wav2Vec-2.0 representations and dynamic time warping (DTW) achieves a detection F1-score of 0.87.
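The abstract mentions selecting phonetically rich stimuli via joint entropy maximization without detailing the objective. As a hedged illustration only, the following sketch greedily selects sentences that maximize the Shannon entropy of the phone distribution over the selected set; the greedy criterion, the `pool` structure, and the function names are assumptions, not the paper's method.

```python
# Hypothetical greedy sketch of entropy-maximizing stimulus selection.
# Assumption: the objective is the entropy of the phone distribution over
# the selected set; the paper's exact joint entropy criterion may differ.
import math
from collections import Counter

def phone_entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of a phone-frequency distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_stimuli(pool: dict[str, list[str]], k: int = 100) -> list[str]:
    """Greedily pick k sentence ids whose phone sequences maximize set entropy.

    pool maps a sentence id to its phone sequence (e.g., from TIMIT
    phonetic transcriptions).
    """
    selected: list[str] = []
    counts: Counter = Counter()
    remaining = dict(pool)
    for _ in range(min(k, len(pool))):
        best_id, best_gain = None, -1.0
        for sid, phones in remaining.items():
            # Entropy of the selection if this sentence were added.
            gain = phone_entropy(counts + Counter(phones))
            if gain > best_gain:
                best_id, best_gain = sid, gain
        selected.append(best_id)
        counts += Counter(remaining.pop(best_id))
    return selected
```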
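For the mismatch-detection baseline, the abstract names only Wav2Vec-2.0 representations and DTW. The sketch below is one plausible reading, not the paper's pipeline: extract frame-level Wav2Vec-2.0 features for the test utterance and for a reference rendering of the same text, align them with DTW under cosine distance, and flag a mismatch when the normalized alignment cost exceeds a threshold. The checkpoint name, the use of a reference recording, and the threshold value are all assumptions.

```python
# Hedged sketch of a Wav2Vec-2.0 + DTW speech-text mismatch detector.
# The reference rendering (ref_wav) and the threshold are assumptions;
# the abstract does not specify how the text side is represented.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL = "facebook/wav2vec2-base"  # assumed checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2Model.from_pretrained(MODEL).eval()

def w2v_features(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Frame-level Wav2Vec-2.0 representations, shape (T, 768)."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0].numpy()

def dtw_cost(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalized DTW cost between feature sequences (cosine distance)."""
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    dist = 1.0 - an @ bn.T          # pairwise cosine distances
    T, U = dist.shape
    acc = np.full((T + 1, U + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[T, U] / (T + U)

def is_mismatch(test_wav: np.ndarray, ref_wav: np.ndarray,
                threshold: float = 0.35) -> bool:
    """Flag a speech-text mismatch when alignment cost is high.

    threshold is hypothetical; in practice it would be tuned on the
    matched/mismatched development sets.
    """
    return dtw_cost(w2v_features(test_wav), w2v_features(ref_wav)) > threshold
```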