Abstract
The demand for high-quality parallel speech data has been increasing as deep learning-based Speech-to-Speech Machine Translation (SSMT) and automatic dubbing approaches gain popularity in speech applications. Traditional, well-established speech applications such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) rely heavily on large corpora of monolingual speech and the corresponding text. While there is a wealth of parallel text data available for both English and Indic languages, parallel speech data exists mainly for English and other European languages, and it often lacks natural prosody and semantic alignment between the languages. To support cross-lingual prosody transfer, end-to-end SSMT models, and high-quality dubbing from English to Hindi, this work presents an English-Hindi parallel bilingual speech-text corpus named IIIT-Speech Twins 1.0. The corpus contains twin-like English and Hindi speech-text pairs obtained from publicly available children's stories in both languages through manual and automatic processing. Starting with 8 stories in each language, totaling around 4 hours of audio, the final outcome was a 2-hour dataset, achieved through systematic segmentation, removal of non-speech background audio, and sentence-by-sentence alignment to ensure matching meaning across the two languages. In addition to accurate alignment and transcription, the dataset offers a rich source of natural prosody, expressions, and emotions, owing to the narrative diversity of the stories. It also provides significant speaker variability, with different characters voiced by different speakers, further enriching the IIIT-Speech Twins 1.0 corpus.