IIITH Kicks Off Largest Crowdsourcing Speech Project To Connect Voice with Vernacular

According to the Artificial Intelligence for Speech Recognition Market in India 2019 report, more than 50% of the Indian population uses devices embedded with AI-based speech recognition technology. However, in a multilingual country like ours, with 22 official languages and 12 different scripts, these voice-enabled devices are dominated by English-speaking assistants such as Siri, Alexa and the Google Assistant. While attempts have been made to enable conversations with Alexa in Hindi and with Siri in Indian-accented English, a large vernacular populace is still left out of the picture. Google’s Year in Search 2020 for India reveals that local language searches in tier 2 and 3 towns continue to grow: “And while Hindi is still the dominant language, Tamil, Marathi, and Bengali are quickly gaining prominence online.” In the private sector, apart from the big tech companies, a sizeable number of startups are now trying to bridge the gap and engage with multilingual consumers, enabling them to shop, pay bills, find relevant information and do much more in their preferred language.

Voice To Local

On the social front, under the Technology Development for Indian Languages (TDIL) initiative (informally known as ‘Bahu Bhashik’) of the Ministry of Electronics and Information Technology, the Government of India aims to overcome language barriers and enable the widespread proliferation of ICT in all Indian languages. This involves automatic speech recognition, speech-to-speech translation and speech-to-text translation. With its formidable expertise in language and speech processing, the International Institute of Information Technology Hyderabad has joined forces with the government to build the Automatic Speech Recognition (ASR) module for the translation of Indian languages. The project is headed by Prakash Yalla, Head, Technology Transfer Office, and Dr. Anil Kumar Vuppala, Associate Professor, Speech Processing Centre. The inspiration comes from Padma Bhushan Prof. Raj Reddy, chairman of the institute’s governing council, whose credo is ‘Technology in the Service of Society’. A recipient of the prestigious ACM Turing Award and the Legion of Honour, he is University Professor of CS & Robotics and Moza Bint Naseer Chair at the School of Computer Science at Carnegie Mellon University. Championing the cause of ‘Indian language Alexas’, he strongly believes that AI can help bridge the country’s multilingual divide.

Automatic Speech Recognition

Building AI-enabled automatic speech recognition systems, however, requires thousands of hours of speech data for each language, along with the transcribed text of that speech. Putting the scale of this gargantuan task in perspective, Dr. Anil Vuppala says, “In our lab, we have been working on speech recognition technology for the last 10 years and have collected data too. But it is of the order of 50-60 hours. Now we need 1000s of hours of data. The main challenge here is not limited to the audio or speech file alone. The important thing is fragmenting the speech files, and writing them down in the form of text. It’s a very laborious process.”

Seeking The Crowds

Given Prof. Raj Reddy’s vision of reaching out to the common man, conversational AI assumes importance. Hence, datasets containing speech in as natural a setting as possible are crucial. To gather them cost-effectively, the project is turning to crowdsourcing of speech data. As a pilot, the team is initially inviting volunteers to contribute Telugu speech data. “The idea here is to collect around 2000 hours of spoken Telugu over the course of a year. For this, we’re planning on liaisoning with academic institutions across Andhra Pradesh and Telangana and conducting Just-A-Minute and debate competitions. Another approach is via the existing Telugu Wikipedia community consisting of learned scholars and lovers of the language,” says Prakash. Apart from this, the team is also working with industry partners such as OzoneTel and Pactera Edge and leveraging their networks to get access to data.

Initial Steps

Remarking on the challenges facing an exercise of this scale, Dr. Vuppala says, “The volunteers should have relevant experience in transcription as well since it will reflect in the quality of the ASR we’re building.” The initial collection of Telugu speech data is expected to establish the protocols and systems needed to crowdsource data for all Indian languages. “If everything works, it’ll become a nation-wide data collection exercise, probably the largest ever and we’ll make it available to the general public free of cost,” says Prakash.

To learn more about the project or to contribute, contact Dr. Anil Vuppala at anil.vuppala@iiit.ac.in