@inproceedings{bib_Atte_2024, AUTHOR = {Sai Akarsh C, Narasinga Vamshi Raghu Simha, Anindita Mondal, Anil Kumar Vuppala}, TITLE = {Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation}, BOOKTITLE = {International Conference on Signal Processing and Communications}. YEAR = {2024}}
The language diversity in India's education sector poses a significant challenge, hindering inclusivity. Despite the democratization of knowledge through online educational content, the dominance of English as the internet's lingua franca limits accessibility, emphasizing the crucial need for translation into Indian languages. Although Speech-to-Speech Machine Translation (SSMT) technologies exist, their lack of intonation produces monotonous translations, leading to a loss of audience interest and disengagement from the content. To address this, our paper introduces a dataset with stress annotations in Indian English and a Text-to-Speech (TTS) architecture capable of incorporating stress into synthesized speech. This dataset is used for training a stress detection model, which is then used in the SSMT system for detecting stress in the source speech and transferring it into the target language speech. The TTS architecture is based on FastPitch and can modify the predicted variances for the stressed words it is given. We present an Indian English-to-Hindi SSMT system that can transfer stress, aiming to enhance the overall quality and engagement of educational content.
@inproceedings{bib_Stre_2024, AUTHOR = {Sai Akarsh C, Narasinga Vamshi Raghu Simha, Anil Kumar Vuppala}, TITLE = {Stress Transfer in Speech-to-Speech Machine Translation Application Demo}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2024}}
India's education sector faces a significant challenge due to its linguistic diversity, hindering inclusivity. The dominance of English on the internet underscores the critical need for translating educational content into Indian languages to enhance accessibility. Although Speech-to-Speech Machine Translation (SSMT) technologies exist, their deficiency in reproducing intonation results in monotonous translations, diminishing audience engagement and interest in the content. To address this issue, this paper demonstrates an SSMT application with a Text-to-Speech (TTS) architecture capable of incorporating stress into synthesized speech to give a more engaging experience. The SSMT pipeline also has components like a stress classifier that captures the stress in the source speech and allows it to be utilized during speech generation. The application takes in a speech file and gives a translated speech file with stress transferred from the source.
@inproceedings{bib_Open_2024, AUTHOR = {Kesavaraj V, Anil Kumar Vuppala}, TITLE = {Open Vocabulary Keyword Spotting through Transfer Learning from Speech Synthesis}, BOOKTITLE = {International Conference on Signal Processing and Communications}. YEAR = {2024}}
Identifying keywords in an open-vocabulary context is crucial for personalizing interactions with smart devices. Previous approaches to open vocabulary keyword spotting depend on a shared embedding space created by audio and text encoders. However, these approaches suffer from heterogeneous modality representations (i.e., audio-text mismatch). To address this issue, our proposed framework leverages knowledge acquired from a pre-trained text-to-speech (TTS) system. This knowledge transfer allows for the incorporation of awareness of audio projections into the text representations derived from the text encoder. The performance of the proposed approach is compared with various baseline methods across four different datasets. The robustness of our proposed model is evaluated by assessing its performance across different word lengths and in an Out-of-Vocabulary (OOV) scenario. Additionally, the effectiveness of transfer learning from the TTS system is investigated by analyzing its different intermediate representations. The experimental results indicate that, on the challenging LibriPhrase Hard dataset, the proposed approach outperformed the cross-modality correspondence detector (CMCD) method by a significant improvement of 8.22% in area under the curve (AUC) and 12.56% in equal error rate (EER).
@inproceedings{bib_User_2024, AUTHOR = {Kesavaraj V, Anuprabha, Anil Kumar Vuppala}, TITLE = {User-defined keyword spotting using shifted delta coefficients}, BOOKTITLE = {International conference on Pattern Recognition}. YEAR = {2024}}
Identifying user-defined keywords is crucial for personalizing interactions with smart devices. Previous approaches to user-defined keyword spotting (UDKWS) have relied on short-term spectral features such as mel frequency cepstral coefficients (MFCC) to detect the spoken keyword. However, these features may face challenges in accurately identifying closely related pronunciations in audio-text pairs, due to their limited capability in capturing the temporal dynamics of the speech signal. To address this challenge, we propose to use shifted delta coefficients (SDC), which help in capturing pronunciation variability (transitions between connecting phonemes) by incorporating long-term temporal information. The performance of the SDC feature is compared with various baseline features across four different datasets using a cross-attention based end-to-end system. Additionally, various configurations of SDC are explored to find a suitable temporal context for the UDKWS task. The experimental results reveal that the SDC feature outperforms the MFCC baseline feature, exhibiting an improvement of 8.32% in area under the curve (AUC) and 8.69% in terms of equal error rate (EER) on the challenging LibriPhrase Hard dataset. Moreover, the proposed approach demonstrated superior performance when compared to state-of-the-art UDKWS techniques.
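As an illustration of the feature described above, here is a minimal NumPy sketch of stacking shifted delta coefficients from frame-level cepstral features; the (d, P, k) configuration and the edge-padding strategy are assumptions for illustration, not necessarily the configuration used in the paper.

import numpy as np

def shifted_delta_coefficients(cepstra, d=1, P=3, k=7):
    """cepstra: (num_frames, N) base cepstral features, e.g. MFCCs.
    Returns (num_frames, N * k) stacked shifted deltas."""
    num_frames, N = cepstra.shape
    pad = d + (k - 1) * P                      # so every frame has valid neighbours
    padded = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")
    sdc = np.zeros((num_frames, N * k))
    for t in range(num_frames):
        blocks = []
        for i in range(k):
            centre = t + pad + i * P           # shifted centre frame of block i
            blocks.append(padded[centre + d] - padded[centre - d])   # delta at that centre
        sdc[t] = np.concatenate(blocks)
    return sdc

Each output frame concatenates k delta vectors computed d frames around centres spaced P frames apart, which is how the long-term temporal context discussed above is captured.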
Narasinga Vamshi Raghu Simha,Hina Fathima,Motepalli Kowshik Siva Sai,Sangeetha Mahesh,Sai Akarsh C,Purva Barche,Mirishkar Sai Ganesh,Ajish K Abraham,Anil Kumar Vuppala
@inproceedings{bib_Enha_2024, AUTHOR = {Narasinga Vamshi Raghu Simha, Hina Fathima, Motepalli Kowshik Siva Sai, Sangeetha Mahesh, Sai Akarsh C, Purva Barche, Mirishkar Sai Ganesh, Ajish K Abraham, Anil Kumar Vuppala}, TITLE = {Enhancing Stuttering Detection: A Syllable-Level Stutter Dataset}, BOOKTITLE = {International Conference on Signal Processing and Communications}. YEAR = {2024}}
Stuttering constitutes a pervasive speech disorder with a global impact on individuals. Detecting stuttering in speech would aid speech pathologists in monitoring fluency over time and enhance the quality of life for those with atypical speech patterns. Presently, the available public datasets tailored for stuttering identification are confined to the utterance level, with labels ascribed to individual utterances. However, in real-time situations, a considerable proportion of individuals grappling with stuttering manifest dysfluency at the syllable level. This limitation markedly constrains the effectiveness and applicability of dysfluency detection systems. In an effort to address this lacuna, this study introduces a novel dataset comprising audio recordings obtained directly from native Kannada-speaking individuals who stutter, annotated at the syllable level, and also provides detailed information on the dataset's creation, including patient selection criteria and clip labelling. Additionally, benchmark models trained on this distinctive dataset are presented as part of this research endeavor.
Motepalli Kowshik Siva Sai,Narasinga Vamshi Raghu Simha,Pathuri Venkata Sri Harsha,Hina Fathima Fazal Khan,Sangeetha Mahesh,Ajish Abraham,Anil Kumar Vuppala
@inproceedings{bib_Stut_2023, AUTHOR = {Motepalli Kowshik Siva Sai, Narasinga Vamshi Raghu Simha, Pathuri Venkata Sri Harsha, Hina Fathima Fazal Khan, Sangeetha Mahesh, Ajish Abraham, Anil Kumar Vuppala}, TITLE = {Stuttering Detection Application}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2023}}
Stuttering is a prevalent speech disorder that affects millions of people worldwide. In this Show and Tell presentation, we demonstrate a novel platform that takes speech samples in English and Kannada to detect and analyze stuttering in patients. The user-friendly interface includes demographic details and speech samples, generating comprehensive reports for different stuttering disfluencies. The platform has four different user types, providing full read-only access for admins and full write access for super admins. Our platform provides valuable assistance for speech-language pathologists to evaluate speech samples. The proposed platform supports both live and recorded speech samples and presents a flexible approach to stuttering detection and analysis. Our research demonstrates the potential of technology to improve speech-language pathology for stuttering. The F-score is used as the metric for evaluating the models on the stutter detection task.
@inproceedings{bib_How__2023, AUTHOR = {Shelly Jain, Aditya Yadavalli, Mirishkar Sai Ganesh, Anil Kumar Vuppala}, TITLE = {How do Phonological Properties Affect Bilingual Automatic Speech Recognition?}, BOOKTITLE = {IEEE Spoken Language Technology Workshop}. YEAR = {2023}}
Multilingual Automatic Speech Recognition (ASR) for Indian languages is an obvious technique for leveraging their similarities. We present a detailed analysis of how phonological similarities and differences between languages affect Time Delay Neural Network (TDNN) and End-to-End (E2E) ASR. To study this, we select genealogically similar pairs from five Indian languages and train bilingual acoustic models. We compare these against corresponding monolingual acoustic models and find similar phoneme distributions within speech to be the primary factor for improving model performance, with phoneme overlap being secondary. The influence of phonological properties on performance is visible in both cases. Word Error Rate (WER) of E2E decreased by a median of 2.35%, and up to 8.5% when the phonological similarity was greatest. WER of TDNN increased by 11.69% when the similarity was lowest. Thus, it is clear that the choice of supplementary language is important for model performance. Index Terms— Phonetics and Phonology, Analysis, Automatic Speech Recognition, Low-Resource Languages, Multilingual Systems.
Hardware Accelerator for Transformer based End-to-End Automatic Speech Recognition System
Durgavajjula Shaarada Yamini,Mirishkar Sai Ganesh,Anil Kumar Vuppala,Venkata Suresh Reddy Purini
@inproceedings{bib_Hard_2023, AUTHOR = {Durgavajjula Shaarada Yamini, Mirishkar Sai Ganesh, Anil Kumar Vuppala, Venkata Suresh Reddy Purini}, TITLE = {Hardware Accelerator for Transformer based End-to-End Automatic Speech Recognition System}, BOOKTITLE = {Reconfigurable Architectures Workshop }. YEAR = {2023}}
Hardware accelerators are being designed to offload compute-intensive tasks such as deep neural networks from the CPU to improve the overall performance of an application, specifically on the performance-per-watt metric. Encoder-decoder-based sequence-to-sequence models such as the Transformer model have demonstrated state-of-the-art results in end-to-end automatic speech recognition systems (ASRs). The Transformer model, being intensive on memory and computation, poses a challenge for an FPGA implementation. This paper proposes an end-to-end architecture to accelerate a Transformer for an ASR system. The host CPU orchestrates the computations from different encoder and decoder stages of the Transformer architecture on the designed hardware accelerator with no necessity for intervening FPGA reconfiguration. The communication latency is hidden by prefetching the weights of the next encoder/decoder block while the current block is being processed. The computation is split across both the Super Logic Regions (SLRs) of the FPGA, mitigating the inter-SLR communication. The proposed design presents an optimal latency, exploiting the available resources. The accelerator design is realized using high-level synthesis tools and evaluated on an Alveo U-50 FPGA card. The design demonstrates an average speed-up of 32× compared to an Intel Xeon E5-2640 CPU and 8.8× compared to an NVIDIA GeForce RTX 3080 Ti graphics card for a 32-bit floating point single precision model. Index Terms—Automatic speech recognition, Hardware accelerator, Transformer, FPGAs.
Enhancing Language Identification in Indian Context through Exploiting Learned Features with Wav2Vec2.0
Motepalli Kowshik Siva Sai,Shivang Gupta,Vamsi Narasinga,Ravi Kumar,Anil Kumar Vuppala
@inproceedings{bib_Enha_2023, AUTHOR = {Motepalli Kowshik Siva Sai, Shivang Gupta, Vamsi Narasinga, Ravi Kumar, Anil Kumar Vuppala}, TITLE = {Enhancing Language Identification in Indian Context through Exploiting Learned Features with Wav2Vec2.0}, BOOKTITLE = {International Conference on Speech and Computers}. YEAR = {2023}}
This work proposes the utilization of a self-supervised pre-trained network for developing a Language Identification (LID) system catering to low-resource Indian languages. The framework employed is Wav2vec2.0-XLSR-53, pre-trained on 53k hours of unlabeled speech data. The unsupervised training of the model enables it to learn the acoustic patterns specific to a language. Given that languages share phonetic space, multi-lingual pre-training is instrumental in learning cross-lingual information and building systems that cater to low-resource languages. The results showcase a relative improvement of 33.2% over the DNN-A (DNN with attention) model and 19.04% over Dense Resnets for the Language Identification task on the IIITH-ILSC database using the proposed features.
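For a concrete starting point, below is a minimal, hedged sketch of putting a classification head on the Wav2Vec2 XLSR-53 backbone with HuggingFace Transformers. The checkpoint name is the public XLSR-53 release, while the label set, feature-extractor settings and the absence of a fine-tuning loop are simplifying assumptions, not the paper's recipe on the IIITH-ILSC database.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

LANGS = ["hindi", "telugu", "tamil", "bengali"]   # hypothetical label set
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53", num_labels=len(LANGS)
)
extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True
)

def lid_logits(waveform_16k):
    """waveform_16k: 1-D float array of a 16 kHz utterance; returns (1, num_labels) logits."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return model(inputs.input_values).logits

In practice the classification head (and optionally the backbone) would be fine-tuned on labeled LID data before the logits are meaningful.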
IIITH-CSTD Corpus: Crowdsourced Strategies for the Collection of a Large-scale Telugu Speech Corpus
Mirishkar Sai Ganesh,VISHNU VIDYADHARA RAJU V,Meher Dinesh Naroju,Sudhamay Maity,Veera Prakash Yalla,Anil Kumar Vuppala
@inproceedings{bib_IIIT_2023, AUTHOR = {Mirishkar Sai Ganesh, VISHNU VIDYADHARA RAJU V, Meher Dinesh Naroju, Sudhamay Maity, Veera Prakash Yalla, Anil Kumar Vuppala}, TITLE = {IIITH-CSTD Corpus: Crowdsourced Strategies for the Collection of a Large-scale Telugu Speech Corpus}, BOOKTITLE = {ACM Trasactions on Asian and Low Resource Language Information Processing}. YEAR = {2023}}
Due to the lack of a large annotated speech corpus, many low-resource Indian languages struggle to utilize recent advancements in deep neural network architectures for Automatic Speech Recognition (ASR) tasks. Collecting large-scale databases is an expensive and time-consuming task. Current approaches lack extensive traditional expert-based data acquisition guidelines, as these are tedious and complex. In this work, we present the International Institute of Information Technology Hyderabad-Crowd Sourced Telugu Database (IIITH-CSTD), a Telugu corpus collected through crowdsourcing.
Stable Implementation Of Voice Activity Detector using Zero-Phase Zero Frequency Resonator On FPGA
Syed Abdul Jabbar,Purva Barche,GURUGUBELLI KRISHNA,Azeemuddin Syed,Anil Kumar Vuppala
IEEE International Conference and Expo on Real Time Communications, RTC, 2023
@inproceedings{bib_Stab_2023, AUTHOR = {Syed Abdul Jabbar, Purva Barche, GURUGUBELLI KRISHNA, Azeemuddin Syed, Anil Kumar Vuppala}, TITLE = {Stable Implementation Of Voice Activity Detector using Zero-Phase Zero Frequency Resonator On FPGA}, BOOKTITLE = {IEEE International Conference and Expo on Real Time Communications}. YEAR = {2023}}
Voice activity detection (VAD) is a signal processing technique used to determine whether a given speech signal contains voiced or unvoiced segments. VAD is used in various applications such as speech coding, voice-controlled systems, speech feature extraction, etc. For example, in Adaptive Multi-Rate (AMR) speech coding, VAD is used as an efficient way of coding different speech frames at different bit rates. In this paper, we implemented the Zero-Phase Zero Frequency Resonator (ZP-ZFR) as a VAD on hardware. ZP-ZFR is an Infinite Impulse Response (IIR) filter that offers the advantage of requiring a lower filter order, making it suitable for hardware implementation. The proposed system is evaluated on the TIMIT database and implemented on the Nexys Video Artix-7 FPGA board. The hardware design is carried out using Vivado 2021.1, a popular tool for FPGA development, and the Hardware Description Language (HDL) used for implementation is Verilog. The proposed system achieves good performance with low complexity, and its hardware implementation can be used in various applications. Index Terms—Voice Activity Detection (VAD), Zero-Phase Zero Frequency Resonator (ZP-ZFR), Zero Frequency Filter (ZFF), MATLAB, Field Programmable Gate Array (FPGA).
Enhancing Stutter Detection in Speech using Zero Time Windowing Cepstral Coefficients and Phase Information
Narasinga Vamshi Raghu Simha,Mirishkar Sai Ganesh,Anil Kumar Vuppala
International Conference on Speech and Computers, SPECOM, 2023
@inproceedings{bib_Enha_2023, AUTHOR = {Narasinga Vamshi Raghu Simha, Mirishkar Sai Ganesh, Anil Kumar Vuppala}, TITLE = {Enhancing Stutter Detection in Speech using Zero Time Windowing Cepstral Coefficients and Phase Information}, BOOKTITLE = {International Conference on Speech and Computers}. YEAR = {2023}}
Stuttering is a speech disorder that affects speech fluency and rhythm, with millions worldwide experiencing it. Early diagnosis and treatment can significantly improve speech fluency and the quality of life for individuals who stutter. Automatic detection of stuttering events can help diagnose, monitor, and develop effective interventions. Therefore, this paper proposes a feature space-based classifier for detecting stuttering events in speech. To achieve this, we have investigated Zero Time Windowing Cepstral Coefficients (ZTWCC) as a feature set for stutter detection using classifiers such as SVM, LSTM, and Bidirectional LSTM. We compared the performance of ZTWCC with standard handcrafted features, such as MFCC, CQCC, and SFFCC, on the SEP-28K dataset with and without including phase information. The results in both cases indicate that ZTWCC gives a higher F1 score than the baseline MFCC features.
Study of Indian English Pronunciation Variabilities Relative to Received Pronunciation
Priyanshi Pal,Shelly Jain,Chiranjeevi Yarra,Prasanta Kumar Ghosh,Anil Kumar Vuppala
International Conference on Speech and Computers, SPECOM, 2023
@inproceedings{bib_Stud_2023, AUTHOR = {Priyanshi Pal, Shelly Jain, Chiranjeevi Yarra, Prasanta Kumar Ghosh, Anil Kumar Vuppala}, TITLE = {Study of Indian English Pronunciation Variabilities Relative to Received Pronunciation}, BOOKTITLE = {International Conference on Speech and Computers}. YEAR = {2023}}
Analysis of Indian English (IE) pronunciation variabilities is useful in ASR and TTS modelling for the Indian context. Prior works characterised IE variabilities by reporting qualitative phonetic rules relative to Received Pronunciation (RP). However, such characterisations lack quantitative descriptors and data-driven analysis of diverse IE pronunciations, which could be due to the scarcity of phonetically labelled data. Furthermore, the versatility of IE stems from the influence of a large diversity of the speakers’ mother tongues and demographic region differences. To address these issues, we consider the corpus Indic TIMIT and manually obtain 13,632 phonetic transcriptions in addition to those already available with the corpus. By performing a data-driven analysis on 15,974 phonetic transcriptions of 80 speakers from diverse regions of India, we present a new set of phonetic rules and validate them against the existing phonetic rules to identify their relevance. Finally, we test the efficacy of Grapheme-to-Phoneme (G2P) conversion developed based on the obtained rules, considering Phoneme Error Rate (PER) as the metric for performance.
Enhancing Language Identification in Indian Context Through Exploiting Learned Features with Wav2Vec2.0
Shivang Gupta,Motepalli Kowshik Siva Sai,Narasinga Vamshi Raghu Simha,Ravi Kumar,Mirishkar Sai Ganesh,Anil Kumar Vuppala
International Conference on Speech and Computers, SPECOM, 2023
@inproceedings{bib_Enha_2023, AUTHOR = {Shivang Gupta, Motepalli Kowshik Siva Sai, Narasinga Vamshi Raghu Simha, Ravi Kumar, Mirishkar Sai Ganesh, Anil Kumar Vuppala}, TITLE = {Enhancing Language Identification in Indian Context Through Exploiting Learned Features with Wav2Vec2.0}, BOOKTITLE = {International Conference on Speech and Computers}. YEAR = {2023}}
This work proposes the utilization of a self-supervised pre-trained network for developing a Language Identification (LID) system catering to low-resource Indian languages. The framework employed is Wav2vec2.0-XLSR-53, pre-trained on 53k hours of unlabeled speech data. The unsupervised training of the model enables it to learn the acoustic patterns specific to a language. Given that languages share phonetic space, multi-lingual pre-training is instrumental in learning cross-lingual information and building systems that cater to low-resource languages. Further fine-tuning with a limited amount of labeled data significantly boosts the model’s accuracy. The results showcase a relative improvement of 33.2% over the DNN-A (DNN with attention) model and 19.04% over Dense Resnets for the Language Identification task on the IIITH-ILSC database using the proposed features (Shivang Gupta and Kowshik Siva Sai Motepalli share first authorship).
Automatic detection of Parkinsons disease Using Zero-Time Window based cepstral feature
MONICA PONNAM,Purva Barche,Anil Kumar Vuppala
India Council International Conference, INDICON, 2023
@inproceedings{bib_Auto_2023, AUTHOR = {MONICA PONNAM, Purva Barche, Anil Kumar Vuppala}, TITLE = {Automatic detection of Parkinsons disease Using Zero-Time Window based cepstral feature}, BOOKTITLE = {India Council International Conference}. YEAR = {2023}}
Parkinson’s disease (PD) is a neurological disorder that affects the nerves responsible for controlling movements. Automatic detection methods for PD detection from the speech signal are important as they offer benefits such as reliability, repeatability, and cost-effectiveness. In this paper, cepstral features derived from the zero-time windowing (ZTW) based time-frequency analysis method were explored for automatic detection of PD. This study used the PC-GITA database to evaluate the performance using support vector machine classifiers and neural networks. The performance of the system is also compared with conventional mel-frequency cepstral coefficient (MFCC) features. Cepstral features derived from the ZTW method showed better performance than the baseline feature, with the best classification accuracy of the PD detection system being 73.6%. Index Terms—Parkinson’s Disease, Zero-time windowing, Automatic detection methods using speech signal.
An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations
Shelly Jain,Priyanshi Pal,Anil Kumar Vuppala,Prasanta Kumar Ghosh,Chiranjeevi Yarra
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023
@inproceedings{bib_An_I_2023, AUTHOR = {Shelly Jain, Priyanshi Pal, Anil Kumar Vuppala, Prasanta Kumar Ghosh, Chiranjeevi Yarra}, TITLE = {An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2023}}
Speech systems are sensitive to accent variations. This is a challenge in India, which has numerous languages but few linguistic studies on pronunciation variation. The growing number of L2 English speakers reinforces the need to study accents and L1-L2 interactions. We investigate Indian English (IE) accents and report our observations on regional and shared features. Specifically, we observe phonemic variations and phonotactics in speakers’ native languages and apply this to their English pronunciations. We demonstrate the influence of 18 Indian languages on IE by comparing native language features with IE pronunciations obtained from literature studies and phonetically annotated speech. Hence, we validate Indian language influences on IE by justifying pronunciation rules from the perspective of Indian language phonology. We obtain a comprehensive description of generalised and region-specific IE characteristics, which facilitates accent adaptation of existing speech systems.
Novel feature representation using single frequency filtering and nonlinear energy operator for speech emotion recognition
Ramakrishna Thirumuru,Krishna Gurugubelli,Anil Kumar Vuppala
Digital Signal Processing, DSP, 2022
@inproceedings{bib_Nove_2022, AUTHOR = {Ramakrishna Thirumuru, Krishna Gurugubelli, Anil Kumar Vuppala}, TITLE = {Novel feature representation using single frequency filtering and nonlinear energy operator for speech emotion recognition}, BOOKTITLE = {Digital Signal Processing}. YEAR = {2022}}
In this paper, the intrinsic characteristics of speech modulations are estimated to propose the instant modulation spectral features for efficient emotion recognition. This feature representation is based on single frequency filtering (SFF) technique and higher order nonlinear energy operator. The speech signal is decomposed into frequency sub-bands using SFF, and associated nonlinear energies are estimated with higher order nonlinear energy operator. Then, the feature vector is realized using cepstral analysis. The high-resolution property of SFF technique is exploited to extract the amplitude envelope of the speech signal at a selected frequency with good time-frequency resolution. The fourth order nonlinear energy operator provides noise robustness in estimating the modulation components. The proposed feature set is tested for the emotion recognition task using the i-vector model with the probabilistic linear discriminant scoring scheme, support vector machine and random forest classifiers. The results demonstrate that the performance of this feature representation is better than the widely used spectral and prosody features, achieving detection accuracy of 85.75%, 59.88%, and 65.78% on three emotional databases, EMODB, FAUAIBO, and IEMOCAP, respectively. Further, the proposed features are found to be robust in the presence of additive white Gaussian and vehicular noises.
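To make the two building blocks concrete, here is a minimal sketch of computing a single frequency filtering (SFF) amplitude envelope at one frequency and applying a second-order Teager energy operator to it. The pole radius r, the band frequency, and the use of the second-order operator in place of the paper's fourth-order nonlinear energy operator are illustrative assumptions; the full filterbank and cepstral analysis are omitted.

import numpy as np
from scipy.signal import lfilter

def sff_envelope(x, fs, f_k, r=0.995):
    """Amplitude envelope of signal x at frequency f_k via single frequency filtering."""
    n = np.arange(len(x))
    omega_shift = np.pi - 2 * np.pi * f_k / fs        # shift f_k towards fs/2
    shifted = x * np.exp(1j * omega_shift * n)
    y = lfilter([1.0], [1.0, r], shifted)              # single pole close to z = -1
    return np.abs(y)                                    # envelope at f_k

def teager_energy(e):
    """Second-order Teager energy: psi[n] = e[n]^2 - e[n-1] * e[n+1]."""
    psi = np.zeros_like(e)
    psi[1:-1] = e[1:-1] ** 2 - e[:-2] * e[2:]
    return psi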
Decoding self-automated and motivated finger movements using novel single-frequency filtering method–An EEG study
Arhant Jain,Krishna Gurugubelli,Kavitha Vemuri,Anil Kumar Vuppala
Biomedical Signal Processing and Control, BSPC, 2022
@inproceedings{bib_Deco_2022, AUTHOR = {Arhant Jain, Krishna Gurugubelli, Kavitha Vemuri, Anil Kumar Vuppala}, TITLE = {Decoding self-automated and motivated finger movements using novel single-frequency filtering method–An EEG study}, BOOKTITLE = {Biomedical Signal Processing and Control}. YEAR = {2022}}
Electroencephalography (EEG) provides the temporal resolution required to map the neural activations for studying motor movements and control. This study aims to compare the power amplitude of electrodes covering the central and frontal regions of a 32-channel scalp EEG. The activations from a standard index finger-tapping and a game paradigm are analyzed. Twenty-five right-handed and five left-handed healthy subjects (range = 18–30 years; mean = 24.25 years; SD = 3.96 years) participated in this study. A novel single-frequency filter (SFF) bank was applied to identify the peak amplitude from the power spectral density plots. The results show that the gaming paradigm yields lower or comparable power amplitudes than the standard finger tapping. We observed that the right-hand finger tapping by the left-handed subjects shows lower between-subject dispersion in amplitude. Nonparametric Spearman correlation showed no association between game scores and power amplitude for the right-handed participants. However, for left-handers, both positive and negative associations were observed. This study demonstrates the efficacy of SFF for extracting power amplitudes with a better signal-to-noise ratio, which has implications in BCI and motor rehabilitation applications. The findings support the role of game paradigms for motor movement research and in understanding bilateral hemispheric activations in cognitive tasks.
An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations
Shelly Jain,Priyanshi Pal,Anil Kumar Vuppala,Prasanta Kumar Ghosh,Chiranjeevi Yarra
Technical Report, arXiv, 2022
@inproceedings{bib_An_I_2022, AUTHOR = {Shelly Jain, Priyanshi Pal, Anil Kumar Vuppala, Prasanta Kumar Ghosh, Chiranjeevi Yarra}, TITLE = {An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Speech systems are sensitive to accent variations. This is especially challenging in the Indian context, with an abundance of languages but a dearth of linguistic studies characterising pronunciation variations. The growing number of L2 English speakers in India reinforces the need to study accents and L1-L2 interactions. We investigate the accents of Indian English (IE) speakers and report in detail our observations, both specific and common to all regions. In particular, we observe the phonemic variations and phonotactics occurring in the speakers' native languages and apply this to their English pronunciations. We demonstrate the influence of 18 Indian languages on IE by comparing the native language pronunciations with IE pronunciations obtained jointly from existing literature studies and phonetically annotated speech of 80 speakers. Consequently, we are able to validate the intuitions of Indian language influences on IE pronunciations by justifying pronunciation rules from the perspective of Indian language phonology. We obtain a comprehensive description in terms of universal and region-specific characteristics of IE, which facilitates accent conversion and adaptation of existing ASR and TTS systems to different Indian accents.
Towards improving Disfluency Detection from Speech using Shifted Delta Cepstral Coefficients
Utkarsh Mehrotra,SPARSH,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
International Conference on Contemporary Computing, IC3, 2022
@inproceedings{bib_Towa_2022, AUTHOR = {Utkarsh Mehrotra, SPARSH, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Towards improving Disfluency Detection from Speech using Shifted Delta Cepstral Coefficients}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2022}}
In this paper, we investigate the use of shifted delta cepstral (SDC) coefficients for detecting disfluencies in two types of speech - stuttered speech and spontaneous lecture-mode speech. SDC features capture temporal variations in the speech signal effectively across several frames. The UCLASS stuttered speech dataset and IIITH-IED dataset are used here to develop frame-level automatic disfluency detection systems for four types of disfluencies and the effect of SDC features on the detection of each disfluency type is examined using MFCC and SFFCC cepstral representations. Overall, it is found that using MFCC+SDC features gives an absolute improvement of 2.98% and 6.02% for stuttered and spontaneous speech disfluency detection respectively over the static MFCC features, while SFFCC+SDC features give an absolute improvement of 4.62% for stutter disfluencies and 7.28% for spontaneous speech disfluencies over the static SFFCC features, showing the importance of considering temporal variations for disfluency detection.
Implementation of Zero-Phase Zero Frequency Resonator Algorithm on FPGA
Syed Abdul Jabbar,Purva Barche,GURUGUBELLI KRISHNA,Azeemuddin Syed,Anil Kumar Vuppala
International Conference on Contemporary Computing, IC3, 2022
@inproceedings{bib_Impl_2022, AUTHOR = {Syed Abdul Jabbar, Purva Barche, GURUGUBELLI KRISHNA, Azeemuddin Syed, Anil Kumar Vuppala}, TITLE = {Implementation of Zero-Phase Zero Frequency Resonator Algorithm on FPGA}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2022}}
Epoch location is an important parameter for the analysis of excitation source information from speech signals. Over the years, a lot of research has been done to find accurate locations of epochs. Epochs are the instants at which the vocal tract system is excited significantly, and their prominent locations can be found during the production of speech. However, due to the time-varying nature of the excitation source and the vocal tract system, it is difficult to find accurate epoch locations. Among the various epoch extraction methods, Zero Frequency Filtering (ZFF) is the simplest and most widely used method, owing to its highest identification rate and lowest false alarm rate among competing algorithms. ZFF uses an Infinite Impulse Response (IIR) filter followed by trend removal blocks; however, the filter used in the ZFF method is unstable, which makes it unsuitable for practical implementation. Many stable implementations of the ZFF algorithm have been proposed in the literature. Compared to other stable variants of ZFF, the Zero-Phase Zero Frequency Resonator (ZP-ZFR) has a simple design and gives the highest identification rate among epoch extraction algorithms, including ZFF. In this paper, the implementation of ZFF and ZP-ZFR on an FPGA board is presented using Verilog, a Hardware Description Language (HDL).
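As background for the algorithm being put on hardware, here is a minimal NumPy sketch of the classic zero frequency filtering idea: difference the signal, pass it through a cascade of two resonators with poles at zero frequency, remove the slowly varying trend, and read epochs at the negative-to-positive zero crossings. The window length and number of trend-removal passes are illustrative assumptions; the rapid growth of the resonator output is exactly the numerical instability that the ZP-ZFR formulation (a stable near-zero-frequency resonator applied forward and backward for zero phase) is designed to avoid.

import numpy as np

def zff_epochs(speech, fs, trend_win_ms=10.0):
    x = np.diff(speech, prepend=speech[0])              # remove low-frequency bias
    y = x.astype(float)
    for _ in range(2):                                  # cascade of two 0-Hz resonators
        out = np.zeros(len(y) + 2)
        for n in range(len(y)):
            out[n + 2] = y[n] + 2.0 * out[n + 1] - out[n]
        y = out[2:]                                     # output grows rapidly (the instability)
    w = int(fs * trend_win_ms / 1000)
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    zff_signal = y
    for _ in range(3):                                  # repeated local-mean (trend) removal
        zff_signal = zff_signal - np.convolve(zff_signal, kernel, mode="same")
    # Epochs: negative-to-positive zero crossings of the trend-removed signal.
    epochs = np.where((zff_signal[:-1] < 0) & (zff_signal[1:] >= 0))[0] + 1
    return zff_signal, epochs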
Investigation of Subword-Based Bilingual Automatic Speech Recognition for Indian Languages
Aditya Yadavalli,Shelly Jain,Mirishkar Sai Ganesh,Anil Kumar Vuppala
International Conference on Contemporary Computing, IC3, 2022
@inproceedings{bib_Inve_2022, AUTHOR = {Aditya Yadavalli, Shelly Jain, Mirishkar Sai Ganesh, Anil Kumar Vuppala}, TITLE = {Investigation of Subword-Based Bilingual Automatic Speech Recognition for Indian Languages}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2022}}
Modelling Indian languages is difficult as the number of word forms they present is high. In such cases, prior work has proposed using subwords instead of words as the basic units to model the language. This provides the potential to cover more word forms. Additionally, previous work revealed that all Indian languages have several phonetic similarities. Consequently, multilingual acoustic models have been proposed to counter the data scarcity issue most Indian language datasets have. However, these models use monolingual language models (LM). They do not exploit the common basic tokens that certain Indian languages have. This motivated us to implement a bilingual subword-based Automatic Speech Recognition (ASR) system for Hindi and Marathi. Further, we try a combination of a word-based monolingual LM with a bilingual acoustic model to examine the reason for degradation in word-based multilingual ASRs. Our experiments show that multilingual subword-based ASR models outperform their word-based counterparts by up to 9.77% and 26.35% relative Character Error Rate (CER) in the case of Hindi and Marathi, respectively.
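One illustrative way to derive shared subword units for the two languages is a byte-pair-encoding SentencePiece model trained on pooled text; the file names, vocabulary size and model type below are assumptions, and the paper does not prescribe this exact tooling.

import sentencepiece as spm

# Train one subword model on pooled Hindi + Marathi transcripts (one sentence per line).
spm.SentencePieceTrainer.train(
    input="hindi_marathi_transcripts.txt",   # hypothetical pooled text file
    model_prefix="hi_mr_bpe",
    vocab_size=2000,
    model_type="bpe",
    character_coverage=1.0,                  # keep all Devanagari characters
)

sp = spm.SentencePieceProcessor(model_file="hi_mr_bpe.model")
print(sp.encode("मी शाळेत जातो", out_type=str))   # Marathi sentence -> shared subwords

Because Hindi and Marathi share the Devanagari script and many morphemes, a single model of this kind yields common basic tokens usable by one bilingual acoustic model.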
Shifted Delta Cepstral Coefficients with RNN to Improve the Detection of Parkinson’s Disease from the Speech
ANSHUL LAHOTI,GURUGUBELLI KRISHNA,Juan Rafel Orozco Arroyave,Anil Kumar Vuppala
International Conference on Contemporary Computing, IC3, 2022
@inproceedings{bib_Shif_2022, AUTHOR = {ANSHUL LAHOTI, GURUGUBELLI KRISHNA, Juan Rafel Orozco Arroyave, Anil Kumar Vuppala}, TITLE = {Shifted Delta Cepstral Coefficients with RNN to Improve the Detection of Parkinson’s Disease from the Speech}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2022}}
Parkinson’s disease (PD) is a progressive neurodegenerative disorder of the central nervous system identified by abnormalities in motor and non-motor activities. PD affects respiration, laryngeal, articulation, resonance, and prosodic aspects of speech production. Detection of PD from speech is a non-invasive approach useful for automatic screening. Perceptual attributes of speech due to PD are manifested as temporal variations in speech. In this regard, the current work investigated the use of LSTM and BiLSTM networks with shifted delta cepstral (SDC) features to detect PD from speech. Further, in BiLSTM networks, a multi-head attention mechanism is introduced, assuming that each head captures distinct information to detect PD. SDC features obtained from MFCCs and SFFCCs are used for developing the PD detection system. The performance of the experiments is validated using the PC-GITA database. The experimental results revealed that BiLSTM networks give a relative improvement of 4-5% over the LSTM networks. The use of a multi-head attention mechanism further improved the detection accuracy of the PD detection system, showing that it can capture various discriminative features.
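A minimal PyTorch sketch of a BiLSTM with multi-head self-attention over SDC feature sequences, in the spirit of the model described above; the layer sizes, number of heads, feature dimensionality and mean-pooling strategy are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn

class BiLSTMAttentionPD(nn.Module):
    def __init__(self, feat_dim=91, hidden=128, heads=4, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                 # x: (batch, time, feat_dim) SDC frames
        h, _ = self.lstm(x)               # (batch, time, 2 * hidden)
        a, _ = self.attn(h, h, h)         # multi-head self-attention over time
        return self.out(a.mean(dim=1))    # average pool, then classify PD vs. healthy

# Example forward pass with random features:
# logits = BiLSTMAttentionPD()(torch.randn(8, 300, 91))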
Exploring High Spectro-Temporal Resolution for Alzheimer’s Dementia Detection
Nayan Anand Vats,Purva Barche,Mirishkar Sai Ganesh,Anil Kumar Vuppala
International Conference on Signal Processing and Communications, SPCOM, 2022
@inproceedings{bib_Expl_2022, AUTHOR = {Nayan Anand Vats, Purva Barche, Mirishkar Sai Ganesh, Anil Kumar Vuppala}, TITLE = {Exploring High Spectro-Temporal Resolution for Alzheimer’s Dementia Detection}, BOOKTITLE = {International Conference on Signal Processing and Communications}. YEAR = {2022}}
Alzheimer’s Dementia is a progressive neurological disorder characterized by cognitive impairment. It affects memory, thinking skills, language, and the ability to perform simple tasks. Detection of Alzheimer’s Dementia from the speech is considered a primitive task, as most speech cues are preserved in it. Studies in the literature focused mainly on the lexical features and few acoustic features for detecting Alzheimer’s disease. The present work explores the single frequency filtering cepstral coefficients (SFCC) for the automatic detection of Alzheimer’s disease. In contrast to STFTs, the proposed feature has better temporal and spectral resolution and captures the transient part more appropriately. This offers a very compact and efficient way to derive the formant structure in the speech signal. The experiments were conducted on the ADReSSo dataset, using the support vector machine classifier. The classification performance was compared with several baseline features like Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), linear prediction cepstral coefficient (LPCC), Mel frequency cepstral coefficients of LP-residual (MFCC-WR), ZFF signal (MFCC-ZF) and eGeMAPS (openSMILE). The experiments conducted on Alzheimer’s Dementia classification task show that the proposed feature performs better than conventional MFCCs. Among all the features, SFCC offers the best classification accuracy of 65.1% and 60.6% for dementia detection on cross-validation and test data, respectively. The combination of baseline features with SFCC features further improved the performance. Index Terms—Alzheimer’s disease, Cognitive impairment, Single frequency filtering cepstral coefficients.
Exploring the Effect of Dialect Mismatched Language Models in Telugu Automatic Speech Recognition
Aditya Yadavalli,Mirishkar Sai Ganesh,Anil Kumar Vuppala
NAACL Student Research Workshop, NAACL-SRW, 2022
@inproceedings{bib_Expl_2022, AUTHOR = {Aditya Yadavalli, Mirishkar Sai Ganesh, Anil Kumar Vuppala}, TITLE = {Exploring the Effect of Dialect Mismatched Language Models in Telugu Automatic Speech Recognition}, BOOKTITLE = {NAACL Student Research Workshop}. YEAR = {2022}}
Previous research has found that Acoustic Models (AM) of an Automatic Speech Recognition (ASR) system are susceptible to dialect variations within a language, thereby adversely affecting the ASR. To counter this, researchers have proposed to build a dialect-specific AM while keeping the Language Model (LM) constant for all the dialects. This study explores the effect of dialect-mismatched LMs by considering three different Telugu regional dialects: Telangana, Coastal Andhra, and Rayalaseema. We show that dialect variations that surface in the form of a different lexicon, grammar, and occasionally semantics can significantly degrade the performance of the LM under mismatched conditions. Therefore, this degradation has an adverse effect on the ASR even when a dialect-specific AM is used. We show a degradation of up to 13.13 perplexity points when the LM is used under mismatched conditions. Furthermore, we show a degradation of over 9% and over 15% in Character Error Rate (CER) and Word Error Rate (WER), respectively, in the ASR systems when using mismatched LMs over matched LMs.
Long-Term Average Spectral Feature-Based Parkinson’s Disease Detection from Speech
ANSHUL LAHOTI,GURUGUBELLI KRISHNA,Juan Rafel Orozco Arroyave,Anil Kumar Vuppala
Advanced Machine Intelligence and Signal Processing, AMISP, 2022
@inproceedings{bib_Long_2022, AUTHOR = {ANSHUL LAHOTI, GURUGUBELLI KRISHNA, Juan Rafel Orozco Arroyave, Anil Kumar Vuppala}, TITLE = {Long-Term Average Spectral Feature-Based Parkinson’s Disease Detection from Speech}, BOOKTITLE = {Advanced Machine Intelligence and Signal Processing}. YEAR = {2022}}
Parkinson’s disease (PD) is a neurological disorder in which damage to motor control affects speech production. Owing to the non-invasive nature of speech, detection of PD from speech is attractive. This study proposes the use of Long-Term Average Spectral (LTAS) features to detect PD from speech using different filterbank techniques and classifiers on the PC-GITA dataset. The fact that PD can be perceived from long-term temporal information has motivated us to explore the LTAS features for PD detection. The performance of LTAS features is compared to baseline features, namely statistical and openSMILE features (ComParE, GeMAPS, and eGeMAPS), and is validated using five different classifiers. The LTAS features obtained with auditory filterbanks (Constant Q Transform and Gammatone) showed better performance as compared to the statistical features (MFCC_Stat and LPCC_Stat) and openSMILE features (ComParE, GeMAPS, and eGeMAPS). The combination of the Gammatone-based LTAS feature with eGeMAPS further improved the detection accuracy of the PD detection system on the PC-GITA database.
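A minimal sketch of a long-term average spectrum style feature: the log-magnitude spectra of all frames of an utterance are averaged into a single vector. The paper applies auditory filterbanks (gammatone, constant-Q) and other variants on top; the plain STFT average below is only an illustration with assumed frame settings.

import numpy as np

def ltas(x, fs, frame_len=0.025, hop=0.010):
    """Return one long-term average log-magnitude spectrum vector per utterance."""
    n, h = int(frame_len * fs), int(hop * fs)
    win = np.hanning(n)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n, h)]
    spectra = np.abs(np.fft.rfft(frames, axis=1))            # (num_frames, n // 2 + 1)
    return np.mean(20 * np.log10(spectra + 1e-10), axis=0)   # average over frames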
Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition
Aditya Yadavalli,Mirishkar Sai Ganesh,Anil Kumar Vuppala
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022
@inproceedings{bib_Mult_2022, AUTHOR = {Aditya Yadavalli, Mirishkar Sai Ganesh, Anil Kumar Vuppala}, TITLE = {Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2022}}
Conventional Automatic Speech Recognition (ASR) systems are susceptible to dialect variations within a language, thereby adversely affecting the ASR. Therefore, the current practice is to use dialect-specific ASRs. However, dialect-specific information or data is hard to obtain, making it difficult to build dialect-specific ASRs. Furthermore, it is cumbersome to maintain multiple dialect-specific ASR systems for each language. We build a unified multi-dialect End-to-End ASR that removes the need for a dialect recognition block and the need to maintain multiple dialect-specific ASRs for three Telugu regional dialects: Telangana, Coastal Andhra, and Rayalaseema. We find that pooling the data and training a multi-dialect ASR benefits the low-resource dialect the most – an improvement of over 9.71% in relative Word Error Rate (WER). Subsequently, we experiment with multi-task ASRs where the primary task is to transcribe the audio and the secondary task is to predict the dialect. We do this by adding a Dialect ID to the output targets. Such a model outperforms naive multi-dialect ASRs by up to 8.24% in relative WER. Additionally, we test this model on a dialect recognition task and find that it outperforms strong baselines by 6.14% in accuracy. Index Terms: multi-dialect, speech recognition, sequence-to-sequence, dialect recognition
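A minimal sketch of the multi-task target construction described above, where the dialect label is added as an extra token in the output sequence so a single end-to-end model learns both to transcribe and to predict the dialect; the tag strings and their placement at the start of the transcript are hypothetical choices, not the paper's exact scheme.

# Hypothetical dialect tags prepended to each training transcript.
DIALECT_TAGS = {"telangana": "<TG>", "coastal_andhra": "<CA>", "rayalaseema": "<RS>"}

def make_multitask_target(transcript: str, dialect: str) -> str:
    """Return the ASR training target with the dialect ID as an extra output token."""
    return f"{DIALECT_TAGS[dialect]} {transcript}"

# Example: make_multitask_target("నమస్కారం", "telangana") -> "<TG> నమస్కారం"

At decoding time, the first emitted token then doubles as the dialect prediction, which is how the same model can be scored on a dialect recognition task.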
IE-CPS Lexicon: An Automatic Speech Recognition Oriented Indian English Pronunciation Dictionary
Shelly Jain,Aditya Yadavalli,Mirishkar Sai Ganesh,Chiranjeevi Yarra,Anil Kumar Vuppala
International Conference on Natural Language Processing., ICON, 2021
@inproceedings{bib_IE-C_2021, AUTHOR = {Shelly Jain, Aditya Yadavalli, Mirishkar Sai Ganesh, Chiranjeevi Yarra, Anil Kumar Vuppala}, TITLE = {IE-CPS Lexicon: An Automatic Speech Recognition Oriented Indian English Pronunciation Dictionary}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2021}}
Indian English (IE), on the surface, seems quite similar to standard English. However, closer observation shows that it has actually been influenced by the surrounding vernacular languages at several levels, from phonology to vocabulary and syntax. Due to this, automatic speech recognition (ASR) systems developed for American or British varieties of English result in poor performance on Indian English data. The most prominent feature of Indian English is the characteristic pronunciation of the speakers. The systems are unable to learn these acoustic variations while modelling and cannot parse the non-standard articulation of non-native speakers. For this purpose, we propose a new phone dictionary developed based on the Indian language Common Phone Set (CPS). The dictionary maps the phone set of American English to existing Indian phones based on perceptual similarity. This dictionary is named Indian English Common Phone Set (IE-CPS). Using this, we build an Indian English ASR system and compare its performance with an American English ASR system on speech data of both varieties of English. Our experiments on the IE-CPS show that it is quite effective at modelling the pronunciation of the average speaker of Indian English. ASR systems trained on Indian English data perform much better when modelled using IE-CPS, achieving a reduction in the word error rate (WER) of up to 3.95% when used in place of CMUdict. This shows the need for a different lexicon for Indian English.
Comparative Study of Different Epoch Extraction Methods for Speech Associated with Voice Disorders
Purva Barche,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2021
@inproceedings{bib_Comp_2021, AUTHOR = {Purva Barche, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Comparative Study of Different Epoch Extraction Methods for Speech Associated with Voice Disorders}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2021}}
Accurate detection of epoch locations is important in extracting features from the speech signal for automatic detection and assessment of voice disorders. Therefore, this study aimed to compare various algorithms for detecting epoch locations from speech associated with voice disorders. In this regard, nine state-of-the-art epoch extraction algorithms were considered, and their performance for different categories of voice disorders was evaluated on the SVD dataset. Experimental results indicate that most of the epoch extraction methods showed better performance for healthy speech; however, their performance was degraded for speech associated with voice disorders. Furthermore, the performance of epoch extraction methods was degraded for the speech of structural and neurogenic disorders compared to the speech of psychogenic and functional disorders. Among the different epoch extraction algorithms, zero-phase zero frequency filtering showed the best performance in terms of identification rate (90.37%) and identification accuracy (0.34 ms) for speech associated with voice disorders.
Reed: An Approach Towards Quickly Bootstrapping Multilingual Acoustic Models
Bipasha Sen,Aditya Agarwal,Mirishkar Sai Ganesh,Anil Kumar Vuppala
IEEE Spoken Language Technology Workshop, SLT-W, 2021
@inproceedings{bib_Reed_2021, AUTHOR = {Bipasha Sen, Aditya Agarwal, Mirishkar Sai Ganesh, Anil Kumar Vuppala}, TITLE = {Reed: An Approach Towards Quickly Bootstrapping Multilingual Acoustic Models}, BOOKTITLE = {IEEE Spoken Language Technology Workshop}. YEAR = {2021}}
A multilingual automatic speech recognition (ASR) system is a single entity capable of transcribing multiple languages sharing a common phone space. Performance of such a system is highly dependent on the compatibility of the languages. State-of-the-art speech recognition systems are built using sequential architectures based on recurrent neural networks (RNN), limiting the computational parallelization in training. This poses a significant challenge in terms of the time taken to bootstrap and validate the compatibility of multiple languages for building a robust multilingual system. Complex architectural choices based on self-attention networks are made to improve the parallelization, thereby reducing the training time. In this work, we propose Reed, a simple system based on 1D convolutions which uses very short context to improve the training time. To improve the performance of our system, we use raw time-domain speech signals directly as input. This enables the convolutional layers to learn feature representations rather than relying on handcrafted features such as MFCC. We report improvements in training and inference times by at least a factor of 4× and 7.4×, respectively, with comparable WERs against standard RNN-based baseline systems on SpeechOcean's multilingual low-resource dataset.
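A minimal PyTorch sketch of the general idea of short-context 1D convolutions operating directly on the raw waveform to produce per-frame token scores (e.g., for CTC training); the channel counts, kernel sizes, strides and output vocabulary are assumptions, and this is not the Reed architecture itself.

import torch
import torch.nn as nn

class RawWaveConvASR(nn.Module):
    def __init__(self, num_tokens=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),    # learn front-end filters
            nn.Conv1d(64, 128, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.head = nn.Conv1d(256, num_tokens, kernel_size=1)          # per-frame token scores

    def forward(self, wav):               # wav: (batch, num_samples) raw 1-D audio
        z = self.encoder(wav.unsqueeze(1))
        return self.head(z).transpose(1, 2)   # (batch, frames, num_tokens)

Because every layer is a convolution with a short receptive field, the whole forward pass parallelizes across time, which is the training-speed argument made above for avoiding recurrent layers.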
Towards a Database For Detection of Multiple Speech Disfluencies in Indian English
SPARSH,Utkarsh Mehrotra,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
National Conference on Communications, NCC, 2021
@inproceedings{bib_Towa_2021, AUTHOR = {SPARSH, utkarsh Mehrotra, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Towards a Database For Detection of Multiple Speech Disfluencies in Indian English}, BOOKTITLE = {National Conference on Communications}. YEAR = {2021}}
The detection and removal of disfluencies from speech is an important task since the presence of disfluencies can adversely affect the performance of speech-based applications such as Automatic Speech Recognition (ASR) systems and speech-to-speech translation systems. From the perspective of Indian languages, there is a lack of studies pertaining to speech disfluencies, their types and frequency of occurrence. Also, the resources available to perform such studies in an Indian context are limited. Through this paper, we attempt to address this issue by introducing the IIITH-Indian English Disfluency (IIITH-IED) Dataset. This dataset consists of 10 hours of lecture-mode speech in Indian English. Five types of disfluencies - filled pause, prolongation, word repetition, part-word repetition and phrase repetition - were identified in the speech signal and annotated in the corresponding transcription to prepare this dataset. The IIITH-IED dataset was then used to develop frame-level automatic disfluency detection systems. Two sets of features were extracted from the speech signal and then used to train classifiers for the task of disfluency detection. Amongst all the systems employed, Random Forest with MFCC features resulted in the highest average accuracy of 89.61% and F1-score of 0.89.
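A minimal sketch of the best-performing setup reported above, frame-level MFCC features with a random forest classifier; the feature configuration, label scheme and the commented-out training calls are illustrative assumptions, and the IIITH-IED annotations are not reproduced here.

import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def frame_mfcc(path, sr=16000, n_mfcc=13):
    """Return (num_frames, n_mfcc) MFCCs with 25 ms windows and a 10 ms hop."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr)).T

# X_train: stacked MFCC frames from all training files; y_train: per-frame labels
# (e.g. 0 = fluent, 1 = filled pause, 2 = prolongation, ...), assembled elsewhere.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
# clf.fit(X_train, y_train)
# frame_predictions = clf.predict(frame_mfcc("lecture_clip.wav"))   # hypothetical file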
Acoustic Features, BERT Model and Their Complementary Nature for Alzheimer's Dementia Detection
Nayan Anand Vats,Aditya Yadavalli,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
International Conference on Contemporary Computing, IC3, 2021
@inproceedings{bib_ACOU_2021, AUTHOR = {Nayan Anand Vats, Aditya Yadavalli, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {ACOUSTIC FEATURES, BERT Model AND THEIR COMPLEMENTARY NATURE FOR ALZHEIMER’S DEMENTIA DETECTION}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2021}}
Dementia is a chronic or progressive syndrome that usually affects the cognitive functioning of the subjects. Alzheimer’s, a neurodegenerative disorder, is the leading cause of dementia. One of the many symptoms of Alzheimer’s Dementia is the inability to speak and understand language clearly. The last decade has seen a surge in the research done on Alzheimer’s Dementia detection using linguistic and acoustic features. This paper takes up the Alzheimer’s Dementia classification task of the ADReSS INTERSPEECH-2020 challenge, "Alzheimer’s Dementia Recognition through Spontaneous Speech: The ADReSS Challenge". It uses eight different acoustic features to find the attributes in the human speech production system (vocal tract and excitation source) affected by Alzheimer’s Dementia. In this study, the Alzheimer’s Dementia classification is performed using five different machine learning models on the ADReSS INTERSPEECH-2020 challenge dataset. Since most of the studies in the previous literature have used linguistic features successfully for Alzheimer’s Dementia classification, the current study also demonstrates the performance of the BERT model for the dementia classification task. The maximum accuracy obtained by the acoustic features is 64.5%, and the BERT model provides a classification accuracy of 79.1% over the test dataset. Finally, the score-level fusion of the acoustic model with the BERT model shows an improvement of 6.1% classification accuracy over the BERT model, which indicates the complementary nature of acoustic features to linguistic features.
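A minimal sketch of the score-level fusion mentioned above: the acoustic classifier and the BERT text classifier each produce class posteriors, which are combined with a weighted average before the final decision; the fusion weight is a tunable assumption, not a value from the paper.

import numpy as np

def fuse_scores(p_acoustic, p_bert, alpha=0.3):
    """p_acoustic, p_bert: (num_samples, num_classes) posterior arrays from the two models.
    Returns the fused class decisions."""
    p_fused = alpha * p_acoustic + (1 - alpha) * p_bert   # weighted score-level fusion
    return np.argmax(p_fused, axis=1)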
Detecting Multiple Disfluencies from Speech using Pre-linguistic Automatic Syllabification with Acoustic and Prosody Features
Utkarsh Mehrotra,SPARSH,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), APSIPA, 2021
@inproceedings{bib_Dete_2021, AUTHOR = {Utkarsh Mehrotra, SPARSH, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Detecting Multiple Disfluencies from Speech using Pre-linguistic Automatic Syllabification with Acoustic and Prosody Features}, BOOKTITLE = {Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}. YEAR = {2021}}
In this paper, a new method to detect disfluencies directly from speech is explored. The method makes use of pre-linguistic automatic syllabification - the process of segmenting input speech signals into perceptually distinct syllable-like regions, to develop syllable-level disfluency detection systems. Statistical prosody features related to fundamental frequency, energy and duration are extracted from each syllable-like region and used to train a DNN classifier for automatic detection of speech disfluencies. Further, complementary information useful for the task of disfluency detection is added to the pipeline with the help of acoustic features. A BiLSTM feature extractor is used to get complex acoustic representation from the baseline MFCC features for each syllable-like region. This acoustic representation is concatenated with the prosody features and used in the proposed system for detecting multiple speech disfluencies. Experiments are conducted for four types of disfluencies in the UCLASS and the IIITH-IED datasets to test the proposed disfluency detection system. Overall, it is found that the proposed system gives a detection accuracy of 88.75% for the disfluencies in the UCLASS dataset, whereas for the IIITH-IED dataset, the accuracy obtained is 91.24%, showing the effectiveness of considering perceptually distinct syllable-like regions as representational units for detecting disfluencies.
Comparative Study of Filter Banks to Improve the Performance of Voice Disorder Assessment Systems using LTAS Features
Purva Barche,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), APSIPA, 2021
@inproceedings{bib_Comp_2021, AUTHOR = {Purva Barche, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Comparative Study of Filter Banks to Improve the Performance of Voice Disorder Assessment Systems using LTAS Features}, BOOKTITLE = {Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}. YEAR = {2021}}
Objective assessment of voice disorders is widely explored as an early diagnosis tool for the classification of voice disorders. Voice disorders affect the pitch, loudness and voice quality, which are perceived at the suprasegmental level in the speech signal. For the detection and assessment of voice disorders, this study explores the effectiveness of Long Term Average Spectral (LTAS) features using four state-of-the-art filter banks designed with critical-band, constant-Q, gammatone, and single-frequency filtering approaches. Moreover, the performance of the systems is compared with state-of-the-art statistical-average and openSMILE features. The voice disorder detection experiment was carried out on the SVD and HUPA databases, while only the SVD database was used for the assessment task. The assessment task was performed in a clinical manner, in which four binary classifiers were trained in our study. Voice disorder detection and assessment tasks were carried out using the support vector machine classifier. From the results, it was observed that constant-Q filter bank based LTAS features performed best among all LTAS features, with classification accuracies of 78% and 81.4% for the voice disorder detection task on the SVD and HUPA databases, respectively. Further, the combination of LTAS features with openSMILE features improved the performance (89.6% and 86.6% for the SVD and HUPA databases, respectively).
CSTD-Telugu Corpus: Crowd-Sourced Approach for Large-Scale Speech data collection
Ganesh S Mirishkar,VISHNU VIDYADHARA RAJU V,Meher Dinesh Naroju,Sudhamay Maity,Veera Prakash Yalla,Anil Kumar Vuppala
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), APSIPA, 2021
@inproceedings{bib_CSTD_2021, AUTHOR = {Ganesh S Mirishkar, VISHNU VIDYADHARA RAJU V, Meher Dinesh Naroju, Sudhamay Maity, Veera Prakash Yalla, Anil Kumar Vuppala}, TITLE = {CSTD-Telugu Corpus: Crowd-Sourced Approach for Large-Scale Speech data collection}, BOOKTITLE = {Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}. YEAR = {2021}}
Speech is a natural mode of communication. India is a densely populated and linguistically diverse country, and spoken language is the primary medium of interaction among its people; many Indian languages are also spoken by communities across the globe. The unavailability of large volumes of transcribed and annotated speech data is often a hurdle for building reliable automatic speech recognition (ASR) systems for Indian languages. Crowdsourcing strategies are effective for collaboratively collecting speech data resources. This paper describes the experience of large-scale speech data collection for the Telugu language through mobile and web-based applications. Using this crowd-contributed speech, the performance of a baseline ASR system is reported for clean speech. ASR performance under pink and white noise is also compared for various deep neural network (DNN) based acoustic models. Details of the frameworks used and the challenges faced during their implementation are also part of this paper. The framework adopted for collecting the speech data is rapid, cost-effective, and can be extended to all other Indian languages. Index Terms—ASR, Crowd-sourced, TDNN, GMM, SGMM
Outcomes of Speech to Speech Translation for Broadcast Speeches and Crowd Source Based Speech Data Collection Pilot Projects
Anil Kumar Vuppala,Veera Prakash Yalla,Mirishkar Sai Ganesh,VISHNU VIDYADHARA RAJU V
International Conference on Big Data Analytics, BDA, 2021
@inproceedings{bib_Outc_2021, AUTHOR = {Anil Kumar Vuppala, Veera Prakash Yalla, Mirishkar Sai Ganesh, VISHNU VIDYADHARA RAJU V}, TITLE = {Outcomes of Speech to Speech Translation for Broadcast Speeches and Crowd Source Based Speech Data Collection Pilot Projects}, BOOKTITLE = {International Conference on Big Data Analytics}. YEAR = {2021}}
Speech-to-Speech Machine Translation (SSMT) applications and services use a three-step process. Speech recognition is the first step, producing transcriptions. This is followed by text-to-text language translation and, finally, text-to-speech synthesis. As data availability and computing power improved, these individual steps evolved. However, despite significant progress, errors remain at the first stage due to speech recognition limitations, accents, and so on, and these errors propagate through the subsequent stages of the pipeline. This chapter presents a complete pipeline for transferring speaker intent in SSMT involving humans in the loop. Initially, the SSMT pipeline is discussed and analyzed for broadcast speeches and talks on a few sessions of Mann Ki Baat, where the source language is Hindi and the target languages are English and Telugu. To perform this task, industry-grade APIs from Google, Microsoft, CDAC, and IITM have been used for benchmarking. Challenges faced while building the pipeline are then discussed, and potential solutions are introduced. Finally, the chapter introduces a framework developed for crowd-sourced speech data collection.
Single Frequency Filter Bank Based Long-Term Average Spectra for Hypernasality Detection and Assessment in Cleft Lip and Palate Speech
GURUGUBELLI KRISHNA,Mohammad Hashim Javid,Anil Kumar Vuppala
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2020
@inproceedings{bib_Sing_2020, AUTHOR = {GURUGUBELLI KRISHNA, Mohammad Hashim Javid, Anil Kumar Vuppala}, TITLE = {Single Frequency Filter Bank Based Long-Term Average Spectra for Hypernasality Detection and Assessment in Cleft Lip and Palate Speech}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2020}}
Hypernasality is an abnormality in speech production observed in subjects with craniofacial anomalies like cleft lip and palate (CLP). Detection and assessment of hypernasality is a primary step in the clinical diagnosis of individuals with CLP. Existing methods explore short-term spectral information from speech to assess hypernasality. The present work examines long-term average spectral (LTAS) features obtained from speech to detect and assess hypernasality. This work proposes single frequency filter bank based long-term average spectral (SFFB-LTAS) features for hypernasality detection and assessment. The SFFB is used to extract long-term average spectra with good spectral resolution. The experiments are carried out using the NMCPC-CLP database, collected from 41 speakers with CLP and 32 speakers without CLP. The experimental results show that the SFFB-LTAS features performed better than state-of-the-art spectral and prosody features. The proposed systems for the detection and assessment of hypernasality have shown classification accuracies of 89% and 82.1%, respectively.
Analytic Phase Features for Dysarthric Speech Detection and Intelligibility Assessment
GURUGUBELLI KRISHNA,Anil Kumar Vuppala
Speech Communication, SpComm, 2020
@inproceedings{bib_Anal_2020, AUTHOR = {GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Analytic Phase Features for Dysarthric Speech Detection and Intelligibility Assessment}, BOOKTITLE = {Speech Communication}. YEAR = {2020}}
The objectives of dysarthria assessment are to discriminate dysarthric speech from normal speech, to estimate the severity of dysarthria in terms of dysarthric speech intelligibility, and to find the motor speech subsystem that causes defects in speech production. In this work, analytic phase features are investigated for the objective assessment of dysarthria. In this connection, the importance of the analytic phase in speech intelligibility is studied by employing phase modification schemes. The investigation proposes a novel approach to estimate instantaneous frequency components from the speech signal using the single frequency filtering technique. In this study, dysarthric speech detection and intelligibility assessment systems are developed using the UA-Speech database. The efficiency of the analytic phase features is compared with state-of-the-art spectral features. The proposed features outperformed the magnitude and group delay features and showed classification accuracies of 95.61% and 64.47% for the dysarthric speech detection and intelligibility assessment tasks, respectively. The fusion of evidence obtained from the analytic phase and magnitude spectral features revealed the complementary nature of the analytic phase features.
Study on the Effect of Emotional Speech on Language Identification
PRIYAM JAIN,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
National Conference on Communications, NCC, 2020
@inproceedings{bib_Stud_2020, AUTHOR = {PRIYAM JAIN, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Study on the Effect of Emotional Speech on Language Identification}, BOOKTITLE = {National Conference on Communications}. YEAR = {2020}}
Identifying language information from a speech utterance is referred to as spoken language identification. Language identification (LID) is essential in multilingual speech systems. The performance of LID systems has been studied under various adverse conditions such as background noise, telephone channels, and short utterances. In contrast to these studies, and for the first time in the literature, the present work investigates the impact of emotional speech on language identification. In this work, different emotional speech databases have been pooled to create the experimental setup. Additionally, state-of-the-art i-vector, time-delay neural network, long short-term memory, and deep neural network x-vector systems have been considered to build the LID systems. Performance of the LID systems has been evaluated for speech utterances of different emotions in terms of equal error rate and Cavg. The results of the study indicate that utterances of anger and happiness degrade the performance of LID systems more than neutral and sad utterances. Index Terms—Language Identification, i-vector, TDNN, LSTM, DNN x-vector
Duration of the rhotic approximant /ɹ/ in spastic dysarthria of different severity levels
GURUGUBELLI KRISHNA,Anil Kumar Vuppala,NP Narendra,Paavo Alku
Speech Communication, SpComm, 2020
@inproceedings{bib_Dura_2020, AUTHOR = {GURUGUBELLI KRISHNA, Anil Kumar Vuppala, NP Narendra, Paavo Alku}, TITLE = {Duration of the rhotic approximant /ɹ/ in spastic dysarthria of different severity levels}, BOOKTITLE = {Speech Communication}. YEAR = {2020}}
Dysarthria is a motor speech disorder leading to imprecise articulation of speech. Acoustic analysis capable of detecting and assessing articulation errors is useful in dysarthria diagnosis and therapy. Since speakers with dysarthria experience difficulty in producing rhotics due to complex articulatory gestures of these sounds, the hypothesis of the present study is that duration of the rhotic approximant /ɹ/ distinguishes dysarthric speech of different severity levels. Duration measurements were conducted using the third formant (F3) trajectories estimated from quasi-closed-phase (QCP) spectrograms. Results indicate that the severity level of spastic dysarthria has a significant effect on duration of /ɹ/. In addition, the phonetic context has a significant effect on duration of /ɹ/, the ɪ-r-ɛ context showing the largest difference in /ɹ/ duration between dysarthric speech of the highest severity levels and healthy speech. The results of this preliminary study can be used in the future to develop signal processing and machine learning methods to automatically predict the severity level of spastic dysarthria from speech signals.
Detection of Fricative Landmarks Using Spectral Weighting: A Temporal Approach
VYDANA HARI KRISHNA,Anil Kumar Vuppala
Circuits, Systems, and Signal Processing, CSSP, 2020
@inproceedings{bib_Dete_2020, AUTHOR = {VYDANA HARI KRISHNA, Anil Kumar Vuppala}, TITLE = {Detection of Fricative Landmarks Using Spectral Weighting: A Temporal Approach}, BOOKTITLE = {Circuits, Systems, and Signal Processing}. YEAR = {2020}}
Fricatives are characterized by two prime acoustic properties: a high-frequency spectral concentration and a noisy nature. Spectral-domain approaches for detecting fricatives employ a time–frequency representation to compute acoustic cues such as band energy ratio, spectral centroid, and dominant resonant frequency. The detection accuracy of these approaches depends on the efficiency of the employed time–frequency representation. An approach that does not require any time–frequency representation for detecting fricatives from speech is explored in this work. A time-domain operation is proposed which implicitly emphasizes the high-frequency spectral characteristics of fricatives. The proposed approach scales the spectrum of the speech signal using a scaling function k², where k is the discrete frequency. The spectral weighting function used in the proposed approach can be approximated as a cascaded temporal difference operation over the speech signal. The emphasized regions in the spectrally weighted speech signal are quantified to detect fricative regions. In contrast to the spectral-domain approaches, the predictability measure-based approach in the literature relies on capturing the noisy nature of fricatives. The proposed approach and the predictability measure-based approaches thus rely on two complementary properties for detecting fricatives, and a combination of these approaches is put forth in this work. The proposed approach has performed better than state-of-the-art fricative detectors. To study the significance of the proposed evidence, an early fusion between the proposed evidence and the feature-space maximum log-likelihood transform features is explored for developing speech recognition systems.
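To make the spectral-weighting idea concrete, the sketch below approximates the k² weighting by applying a first-order temporal difference twice (a cascaded difference) and then forms a simple frame-level frication measure; the energy-ratio measure and frame sizes are illustrative assumptions, not the paper's exact quantification.

```python
import numpy as np

def fricative_evidence(x, fs, frame_len=0.02, frame_shift=0.01):
    """Emphasize high-frequency (fricative-like) content with a cascaded temporal
    difference, which roughly approximates weighting the spectrum by k**2, and
    quantify it as a per-frame energy ratio."""
    x = x / (np.max(np.abs(x)) + 1e-12)
    emphasized = np.diff(x, n=2)                  # two cascaded difference operations
    n, s = int(frame_len * fs), int(frame_shift * fs)
    evidence = []
    for start in range(0, len(emphasized) - n, s):
        hi = emphasized[start:start + n]
        full = x[start:start + n]
        # the emphasized-to-total energy ratio is high in noisy, high-frequency regions
        evidence.append(np.sum(hi ** 2) / (np.sum(full ** 2) + 1e-12))
    return np.asarray(evidence)

# toy check: white noise (fricative-like) scores higher than a 120 Hz tone
fs = 16000
t = np.arange(fs) / fs
print(fricative_evidence(np.random.randn(fs), fs).mean(),
      fricative_evidence(np.sin(2 * np.pi * 120 * t), fs).mean())
```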
Toward Improving the Performance of Epoch Extraction from Telephonic Speech
GURUGUBELLI KRISHNA,Mohammad Hashim Javid,KNRK RAJU ALLURI,Anil Kumar Vuppala
Circuits, Systems, and Signal Processing, CSSP, 2020
@inproceedings{bib_Towa_2020, AUTHOR = {GURUGUBELLI KRISHNA, Mohammad Hashim Javid, KNRK RAJU ALLURI, Anil Kumar Vuppala}, TITLE = {Toward Improving the Performance of Epoch Extraction from Telephonic Speech}, BOOKTITLE = {Circuits, Systems, and Signal Processing}. YEAR = {2020}}
Epoch is an abrupt closure event within a glottal cycle at which significant excitation of the vocal-tract system happens during the production of voiced speech. The state-of-the-art zero frequency filtering technique is a simple and efficient method that is robust in extracting epochs from clean speech. However, this method has shown poor performance on telephonic-quality speech due to the presence of spurious zero crossings in the epoch evidence, which leads to a high false alarm rate. Recently, the zero-phase zero frequency resonator (ZP-ZFR) was proposed as an alternative to the zero frequency filter for a stable implementation of the zero frequency filtering technique. In this study, a higher-order ZP-ZFR is investigated to improve the performance of zero frequency filtering for epoch extraction from telephonic speech. The performance of the proposed ZP-ZFR method is quantitatively evaluated on telephonic speech simulated using six standard databases with simultaneous electroglottograph recordings as ground truth. Experimental results suggest that the performance of the proposed method is significantly better than state-of-the-art methods in terms of identification rate and false alarm rate.
Towards Automatic Assessment of Voice Disorders: A Clinical Approach
Purva Barche,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2020
@inproceedings{bib_Towa_2020, AUTHOR = {Purva Barche, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Towards Automatic Assessment of Voice Disorders: A Clinical Approach}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2020}}
Automatic detection and assessment of voice disorders is important in the diagnosis and treatment planning of voice disorders. This work proposes an approach for automatic detection and assessment of voice disorders from a clinical perspective. To accomplish this, a multi-level classification approach was explored, in which four binary classifiers were used for the assessment of voice disorders. The binary classifiers were trained using support vector machines with excitation source features, vocal-tract system features, and state-of-the-art OpenSMILE features. In this study, source features, namely glottal parameters obtained from the glottal flow waveform, perturbation measures obtained from epoch locations, and cepstral features obtained from the linear prediction residual and the zero frequency filtered signal, were explored. The present study used the Saarbruecken Voice Disorders database to evaluate the performance of the proposed approach. The OpenSMILE feature sets, namely ComParE and eGeMAPS, showed better performance, with classification accuracies of 82.8% and 76%, respectively, for voice disorder detection. The combination of excitation source features with the baseline feature sets further improved the performance of the detection and assessment systems, which highlights the complementary nature of the excitation source features.
Towards Emotion Independent Language Identification System
PRIYAM JAIN,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
International Conference on Contemporary Computing, IC3, 2019
@inproceedings{bib_Towa_2019, AUTHOR = {PRIYAM JAIN, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Towards Emotion Independent Language Identification System}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2019}}
India is a multilingual society with more than 1600 languages. Most of these languages have overlapping sets of phonemes, which makes developing a language identification (LID) framework difficult for Indian languages. In this paper, the above challenge is addressed using phonetic features. To model the temporal variations in phonetic features, an attention-based residual time-delay neural network (RES-TDNN) is proposed. This network effectively captures long-range temporal dependencies through the TDNN and the attention mechanism. The proposed network has been evaluated on the IIITH-ILSC database using phonetic and acoustic features. The database consists of 22 official Indian languages and Indian English. The attention-based RES-TDNN outperformed other state-of-the-art networks such as the deep neural network and the long short-term memory network, producing an equal error rate of 9.46%. Further, the fusion of shifted delta cepstral and phonetic features has improved the performance.
Attention based residual-time delay neural network for Indian language identification
Mandava Tirusha,Anil Kumar Vuppala
International Conference on Contemporary Computing, IC3, 2019
@inproceedings{bib_Atte_2019, AUTHOR = {Mandava Tirusha, Anil Kumar Vuppala}, TITLE = {Attention based residual-time delay neural network for Indian language identification}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2019}}
India is a multilingual society with more than 1600 languages. Most of these languages have overlapping sets of phonemes, which makes developing a language identification (LID) framework difficult for Indian languages. In this paper, the above challenge is addressed using phonetic features. To model the temporal variations in phonetic features, an attention-based residual time-delay neural network (RES-TDNN) is proposed. This network effectively captures long-range temporal dependencies through the TDNN and the attention mechanism. The proposed network has been evaluated on the IIITH-ILSC database using phonetic and acoustic features. The database consists of 22 official Indian languages and Indian English. The attention-based RES-TDNN outperformed other state-of-the-art networks such as the deep neural network and the long short-term memory network, producing an equal error rate of 9.46%. Further, the fusion of shifted delta cepstral and phonetic features has improved the performance.
An Investigation of LSTM-CTC based Joint Acoustic Model for Indian Language Identification
Mandava Tirusha,VUDDAGIRI RAVI KUMAR,VYDANA HARI KRISHNA,Anil Kumar Vuppala
Workshop on Automatic Speech Recognition and Understanding, ASRU, 2019
@inproceedings{bib_An_I_2019, AUTHOR = {Mandava Tirusha, VUDDAGIRI RAVI KUMAR, VYDANA HARI KRISHNA, Anil Kumar Vuppala}, TITLE = {An Investigation of LSTM-CTC based Joint Acoustic Model for Indian Language Identification}, BOOKTITLE = {Workshop on Automatic Speech Recognition and Understanding}. YEAR = {2019}}
In this paper, phonetic features derived from the joint acoustic model (JAM) of a multilingual end-to-end automatic speech recognition system are proposed for Indian language identification (LID). These features utilize contextual information learned by the JAM through the long short-term memory connectionist temporal classification (LSTM-CTC) framework; hence, they are referred to as CTC features. A multi-head self-attention network is trained using these features, which aggregates the frame-level features by selecting prominent frames through a parametrized attention layer. The proposed features have been tested on the IIITH-ILSC database, which consists of 22 official Indian languages and Indian English. Experimental results demonstrate that the CTC features outperformed the i-vector and phonetic temporal neural LID systems and produced an 8.70% equal error rate. The fusion of shifted delta cepstral and CTC feature-based LID systems at the model level and feature level further improved the performance.
Multi-Head Self-Attention Networks for Language Identification
VUDDAGIRI RAVI KUMAR,Mandava Tirusha,VYDANA HARI KRISHNA,Anil Kumar Vuppala
International Conference on Contemporary Computing, IC3, 2019
@inproceedings{bib_Mult_2019, AUTHOR = {VUDDAGIRI RAVI KUMAR, Mandava Tirusha, VYDANA HARI KRISHNA, Anil Kumar Vuppala}, TITLE = {Multi-Head Self-Attention Networks for Language Identification}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2019}}
Self-attention networks are popularly employed in sequence classification and sequence summarization tasks. State-of-the-art models use sequential models to capture high-level information, but these models are sensitive to the length of the utterance and do not generalize equally well over variable-length utterances. This work studies the efficiency of recent advancements in self-attentive networks for improving the performance of the LID system. In a self-attentive network, the variable-length input sequence is converted to a fixed-dimensional vector that represents the whole sequence. The weighted mean of the input sequence is taken as the utterance-level representation; along with the mean, a standard deviation is employed to represent the whole input sequence. Experiments are performed using the AP17-OLR database. Using the mean together with the standard deviation reduced the equal error rate (EER) by 8% relative. A multi-head attention mechanism is introduced in the self-attention network under the assumption that each head captures distinct information for discriminating languages. The use of multi-head self-attention further reduced the EER by 13% relative. The best performance is achieved with a multi-head self-attention network with residual connections. Shifted delta cepstral (SDC) and stacked SDC features are used for developing the LID systems.
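A small numpy sketch of the pooling step described above: per-head attention weights over frames, followed by the attention-weighted mean and standard deviation concatenated as the utterance vector. The scoring function and projection sizes below are assumptions, not the paper's exact architecture.

```python
import numpy as np

def multihead_attentive_stats_pool(frames, W, v):
    """frames: (T, D) frame-level features; W: (H, D, A) per-head projections;
    v: (H, A) per-head scoring vectors. Returns an (H * 2 * D,) utterance vector
    of concatenated attention-weighted means and standard deviations."""
    pooled = []
    for Wh, vh in zip(W, v):
        scores = np.tanh(frames @ Wh) @ vh           # (T,) frame scores for this head
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                         # softmax attention weights
        mean = alpha @ frames                        # weighted mean (D,)
        var = alpha @ (frames - mean) ** 2           # weighted variance (D,)
        pooled.append(np.concatenate([mean, np.sqrt(var + 1e-8)]))
    return np.concatenate(pooled)

# toy usage: 200 frames of 64-dim features, 4 heads, 16-dim attention space
T, D, H, A = 200, 64, 4, 16
rng = np.random.default_rng(0)
utt_vec = multihead_attentive_stats_pool(rng.standard_normal((T, D)),
                                         rng.standard_normal((H, D, A)),
                                         rng.standard_normal((H, A)))
print(utt_vec.shape)  # (512,)
```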
Stable implementation of zero frequency filtering of speech signals for efficient epoch extraction
GURUGUBELLI KRISHNA,Anil Kumar Vuppala
IEEE Signal Processing Letter, SPL, 2019
@inproceedings{bib_Stab_2019, AUTHOR = {GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Stable implementation of zero frequency filtering of speech signals for efficient epoch extraction}, BOOKTITLE = {IEEE Signal Processing Letter}. YEAR = {2019}}
Epochs are the abrupt glottal-closure events in vocal fold vibration during the production of voiced speech. Zero frequency filtering is a simple and effective technique used to estimate glottal closure instants accurately from the speech signal. However, the zero frequency filter is an unstable system; hence, it may not be suitable for practical implementation due to its requirement of high-precision computation. In this letter, a zero-phase zero frequency resonator is proposed as an alternative to the zero frequency filter. The proposed approach provides a stable, zero-phase response. The experimental results indicate that the proposed method outperformed state-of-the-art methods in terms of identification rate (99.17%) and provides comparable performance in terms of false alarm rate (0.41%) and identification accuracy (0.28 ms).
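The sketch below is a rough, assumption-laden illustration of the idea: a resonator with a double pole just inside the unit circle near 0 Hz (and therefore stable) is applied forward and backward with filtfilt to obtain a zero-phase response, the slowly varying trend is removed by local mean subtraction, and epoch candidates are taken at zero crossings. The pole radius, trend-removal window, crossing polarity, and number of cascaded stages are all illustrative choices, not the letter's exact formulation.

```python
import numpy as np
from scipy.signal import filtfilt

def zp_zfr_epochs(x, fs, pole_radius=0.999, window_ms=15):
    """Epoch candidates from a zero-phase zero frequency resonator (rough sketch).

    A resonator with a double pole just inside the unit circle near 0 Hz (stable,
    unlike the ideal zero frequency filter) is applied forward and backward with
    filtfilt for a zero-phase response; the slowly varying trend is removed with a
    local mean, and epoch candidates are taken at negative-to-positive zero crossings.
    """
    r = pole_radius
    a = [1.0, -2.0 * r, r * r]                        # stable resonator near 0 Hz
    xd = np.diff(x, prepend=x[0])                     # pre-difference the signal
    y = filtfilt([1.0], a, xd)                        # first resonator stage (zero phase)
    y = filtfilt([1.0], a, y)                         # second cascaded stage
    win = max(int(window_ms * 1e-3 * fs), 1)
    trend = np.convolve(y, np.ones(win) / win, mode='same')
    zff = y - trend                                   # trend-removed filtered signal
    return np.where((zff[:-1] < 0) & (zff[1:] >= 0))[0]   # sample indices of candidates

# toy usage: a 120 Hz impulse train at 16 kHz should give roughly one epoch per period
fs = 16000
x = np.zeros(fs // 2)
x[::fs // 120] = 1.0
print(len(zp_zfr_epochs(x, fs)))
```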
Perceptually enhanced single frequency filtering for dysarthric speech detection and intelligibility assessment
GURUGUBELLI KRISHNA,Anil Kumar Vuppala
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2019
@inproceedings{bib_Perc_2019, AUTHOR = {GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Perceptually enhanced single frequency filtering for dysarthric speech detection and intelligibility assessment}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2019}}
This paper proposes a new speech feature representation that improves the intelligibility assessment of dysarthric speech. The formulation of the feature set is motivated by human auditory perception and the high time-frequency resolution property of the single frequency filtering (SFF) technique. The proposed features are named perceptually enhanced single frequency cepstral coefficients (PE-SFCC). As part of the SFF implementation, the speech signal is passed through a single-pole complex bandpass filter bank to obtain a high-resolution time-frequency distribution. This distribution is then enhanced using a set of auditory perceptual operators. Lastly, traditional homomorphic analysis is carried out on the resulting signal to obtain the PE-SFCC feature vector. The performance of the proposed features in dysarthric speech detection and intelligibility assessment is reported on the UASPEECH database. The PE-SFCC features outperformed state-of-the-art features in dysarthric speech detection and intelligibility assessment.
Replay spoofing countermeasures using high spectro-temporal resolution features
KNRK RAJU ALLURI,Anil Kumar Vuppala
International Journal of Speech Technology, IJST, 2019
@inproceedings{bib_Repl_2019, AUTHOR = {KNRK RAJU ALLURI, Anil Kumar Vuppala}, TITLE = {Replay spoofing countermeasures using high spectro-temporal resolution features}, BOOKTITLE = {International Journal of Speech Technology}. YEAR = {2019}}
The ease of implementing replay attacks makes them a more severe threat to automatic speaker verification (ASV) technology than other spoofing attacks such as speech synthesis and voice conversion. Replay attacks refer to a fraudster gaining illegitimate access to an ASV system by playing back a speech sample collected from the genuine target speaker. The significant cues that can differentiate between genuine and replayed recordings are channel characteristics. To capture these characteristics, one needs to extract features from a spectrum with high spectral and temporal resolution. Zero time windowing (ZTW) analysis of speech is one such time-frequency analysis technique, yielding a spectrum with high spectral and temporal resolution at each sampling instant. In this study, new features are proposed by applying cepstral analysis to the ZTW spectrum. Experiments are performed on two publicly available replay attack databases, namely BTAS 2016 and ASVspoof 2017. The first set of experiments is conducted using Gaussian mixture models to evaluate the potential of the proposed features. The proposed system achieves a half total error rate of 0.75% on the BTAS 2016 evaluation set and an equal error rate of 14.75% on the ASVspoof 2017 evaluation set. A score-level fusion is performed using the proposed features with previously proposed single frequency filtering cepstral coefficients. The fused result outperformed the previously reported best results on these two datasets.
Application of emotion recognition and modification for emotional Telugu speech recognition
VISHNU VIDYADHARA RAJU V,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
Mobile Networks and Applications, MONET, 2019
@inproceedings{bib_Appl_2019, AUTHOR = {VISHNU VIDYADHARA RAJU V, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Application of emotion recognition and modification for emotional Telugu speech recognition}, BOOKTITLE = {Mobile Networks and Applications}. YEAR = {2019}}
The majority of automatic speech recognition (ASR) systems are trained with neutral speech, and their performance is affected by the presence of emotional content in the speech. The recognition of these emotions in human speech is considered a crucial aspect of human-machine interaction. Combined spectral and differenced prosody features are used for emotion recognition in the first stage. Emotion recognition is not performed solely for its own sake: based on the emotion recognized from the input speech, the corresponding adapted emotive ASR model is selected for evaluation in the second stage. This adapted emotive ASR model is built using the existing neutral speech and emotive speech generated synthetically through a prosody modification method. In this work, the importance of an emotion recognition block at the front end, along with emotive speech adaptation of the ASR models, is studied. Speech samples from the IIIT-H Telugu speech corpus were used for building the large-vocabulary ASR systems, and emotional speech samples from the IITKGP-SESC Telugu corpus were used for evaluation. The adapted emotive speech models yielded better performance than the existing neutral speech models.
Sound Privacy: A Conversational Speech Corpus for Quantifying the Experience of Privacy.
Pablo Perez Zarazaga,Sneha Das,Tom Backstrom,VISHNU VIDYADHARA RAJU V,Anil Kumar Vuppala
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019
@inproceedings{bib_Soun_2019, AUTHOR = {Pablo Perez Zarazaga, Sneha Das, Tom Backstrom, VISHNU VIDYADHARA RAJU V, Anil Kumar Vuppala}, TITLE = {Sound Privacy: A Conversational Speech Corpus for Quantifying the Experience of Privacy.}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2019}}
With the growing popularity of social networks, cloud services, and online applications, people are becoming concerned about the way companies store their data and the ways in which the data can be used. Privacy with voice-operated devices and services is of particular interest. To enable studies in privacy, this paper presents a database that quantifies the experience of privacy that users have in spoken communication. We focus on the effect of the acoustic environment on that perception of privacy. Speech signals are recorded in scenarios simulating real-life situations, where the acoustic environment has an effect on the experience of privacy. The acoustic data is complemented with measures of the speakers' experience of privacy, recorded using a questionnaire. The presented corpus enables studies of how acoustic environments affect people's experience of privacy, which, in turn, can be used to develop speech-operated applications that respect their right to privacy. Index Terms: Experience of privacy, speech interfaces, speech corpus, acoustic environment, right to privacy
IIIT-H Spoofing Countermeasures for Automatic Speaker Verification Spoofing and Countermeasures Challenge 2019.
KNRK RAJU ALLURI,Anil Kumar Vuppala
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019
@inproceedings{bib_IIIT_2019, AUTHOR = {KNRK RAJU ALLURI, Anil Kumar Vuppala}, TITLE = {IIIT-H Spoofing Countermeasures for Automatic Speaker Verification Spoofing and Countermeasures Challenge 2019.}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2019}}
The ASVspoof 2019 challenge focuses on countermeasures for all major spoofing attacks, namely speech synthesis (SS), voice conversion (VC), and replay spoofing attacks. This paper describes the IIIT-H spoofing countermeasures developed for the ASVspoof 2019 challenge. In this study, three instantaneous cepstral features, namely single frequency cepstral coefficients, zero time windowing cepstral coefficients, and instantaneous frequency cepstral coefficients, are used as front-end features. A Gaussian mixture model is used as the back-end classifier. The experimental results on the ASVspoof 2019 dataset reveal that the proposed instantaneous features are efficient in detecting VC and SS based attacks. In detecting replay attacks, the proposed features are comparable with the baseline systems. Further analysis is carried out using metadata to assess the impact of the proposed countermeasures on different synthetic speech generation algorithms and replay configurations.
Prosody modification for speech recognition in emotionally mismatched conditions
VISHNU VIDYADHARA RAJU V,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
International Journal of Speech Technology, IJST, 2018
@inproceedings{bib_Pros_2018, AUTHOR = {VISHNU VIDYADHARA RAJU V, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Prosody modification for speech recognition in emotionally mismatched conditions}, BOOKTITLE = {International Journal of Speech Technology}. YEAR = {2018}}
A degradation in the performance of automatic speech recognition (ASR) systems is observed under mismatched training and testing conditions. One of the reasons for this degradation is the presence of emotions in speech. The main objective of this work is to improve the performance of ASR in the presence of emotional conditions using prosody modification. The influence of different emotions on the prosody parameters is exploited in this work. Emotion conversion methods are employed to generate word-level, non-uniformly prosody-modified speech. Modification factors for prosodic components such as pitch, duration, and energy are used. The prosody modification is done in two ways. Firstly, emotion conversion is performed at the testing stage to generate neutral speech from the emotional speech. Secondly, the ASR system is trained with emotional speech generated from the neutral speech. In this work, the presence of emotions in speech is studied for Telugu ASR systems. A new database, the IIIT-H Telugu speech corpus, is collected to build a large-vocabulary neutral Telugu ASR system, and emotional speech samples from the IITKGP-SESC Telugu corpus are used for testing it. The emotions of anger, happiness, and compassion are considered during the evaluation. An improvement in the performance of the ASR systems is observed with the prosody-modified speech.
Combining evidences from excitation source and vocal tract system features for Indian language identification using deep neural networks
K V MOUNIKA,VUDDAGIRI RAVI KUMAR,Suryakanth Gangashetty,Anil Kumar Vuppala
International Journal of Speech Technology, IJST, 2018
@inproceedings{bib_Comb_2018, AUTHOR = {K V MOUNIKA, VUDDAGIRI RAVI KUMAR, Suryakanth Gangashetty, Anil Kumar Vuppala}, TITLE = {Combining evidences from excitation source and vocal tract system features for Indian language identification using deep neural networks}, BOOKTITLE = {International Journal of Speech Technology}. YEAR = {2018}}
In this paper, a combination of excitation source information and vocal tract system information is explored for the task of language identification (LID). The excitation source information is represented by features extracted from the linear prediction (LP) residual signal, called residual cepstral coefficients (RCC). The vocal tract system information is represented by the mel frequency cepstral coefficients (MFCC). In order to incorporate additional temporal information, shifted delta cepstra (SDC) are computed. LID systems are built using SDC over both MFCC and RCC features individually and evaluated based on their equal error rate (EER). Experiments have been performed on a dataset consisting of 13 Indian languages, with about 115 h for training and 30 h for testing, using a deep neural network (DNN), a DNN with attention (DNN-WA), and a state-of-the-art i-vector system. DNN-WA outperforms the baseline i-vector system. EERs of 9.93% and 6.25% are achieved using RCC and MFCC features, respectively. By combining evidence from both features using a late fusion mechanism, an EER of 5.76% is obtained. This result indicates the complementary nature of the excitation source information to the widely used vocal tract system information for the task of LID.
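The late fusion step can be pictured as a simple weighted combination of the per-language scores coming from the two feature streams; the per-utterance score normalization and the fusion weight in the sketch below are assumptions, not the paper's tuned values.

```python
import numpy as np

def late_fusion(scores_mfcc, scores_rcc, weight=0.6):
    """Score-level fusion of the MFCC-SDC and RCC-SDC LID systems.

    scores_*: arrays of shape (num_utts, num_languages) holding per-language scores.
    Each stream is z-normalized per utterance before a convex combination."""
    def znorm(s):
        return (s - s.mean(axis=1, keepdims=True)) / (s.std(axis=1, keepdims=True) + 1e-8)
    return weight * znorm(scores_mfcc) + (1.0 - weight) * znorm(scores_rcc)

# toy usage: 5 utterances, 13 languages
rng = np.random.default_rng(1)
fused = late_fusion(rng.random((5, 13)), rng.random((5, 13)))
print(fused.argmax(axis=1))   # fused language decisions
```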
Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System
VYDANA HARI KRISHNA,SIVANAND A,Anil Kumar Vuppala
Workshop on Spoken Language Technologies for Under-Resourced Languages, SLTU-W, 2018
@inproceedings{bib_Inco_2018, AUTHOR = {VYDANA HARI KRISHNA, SIVANAND A, Anil Kumar Vuppala}, TITLE = {Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System}, BOOKTITLE = {Workshop on Spoken Language Technologies for Under-Resourced Languages}. YEAR = {2018}}
Speaker normalization is one of the crucial aspects of an automatic speech recognition (ASR) system. It is employed to reduce the performance drop in ASR due to speaker variabilities. Traditional speaker normalization methods are mostly linear transforms over the input data estimated per speaker; such transforms are effective only when sufficient data is available. In practical scenarios, only a single utterance from the test speaker is accessible. The present study explores speaker normalization methods for end-to-end speech recognition systems that can be performed efficiently even when only a single utterance from an unseen speaker is available. In this work, it is hypothesized that by suitably providing information about the speaker's identity while training an end-to-end neural network, the capability to normalize speaker variability can be incorporated into the ASR system. The efficiency of these normalization methods depends on the representation used for unseen speakers. The identity of a training speaker is represented in two ways: (i) using a one-hot speaker code, and (ii) using a weighted combination of all training speakers' identities. The unseen speakers from the test set are represented using a weighted combination of training speaker representations. The two approaches reduced the word error rate (WER) by 0.6% and 1.3%, respectively, on the WSJ corpus.
Automatic detection of retroflex approximants in a continuous tamil speech
THIRUMURU RAMA KRISHNA,Anil Kumar Vuppala
Circuits, Systems, and Signal Processing, CSSP, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {THIRUMURU RAMA KRISHNA, Anil Kumar Vuppala}, TITLE = {Automatic detection of retroflex approximants in a continuous tamil speech}, BOOKTITLE = {Circuits, Systems, and Signal Processing}. YEAR = {2018}}
Phonetic feature extraction is used in many types of speech applications. In this paper, we propose an approach for the automatic detection of the retroflex approximant /ɻ/ in continuous Tamil speech using sonorant regions and the formant structure of speech. The retroflex approximant is one of the distinctive phonetic features of Tamil speech. It occurs in V_V and V_ contexts. In the proposed approach, the slopes of the formant trajectories and the spectral dynamics of F2 and F3 in the region of the retroflex approximant are considered as acoustic cues for detecting it in continuous speech. In this work, the numerator of the group delay function based formant extraction method is used to track formant trajectories. Based on experimental results, formants extracted using this approach provided better evidence for retroflex approximant detection than state-of-the-art formant extraction techniques.
Application of non-negative frequency-weighted energy operator for vowel region detection
THIRUMURU RAMA KRISHNA,Anil Kumar Vuppala
International Journal of Speech Technology, IJST, 2018
@inproceedings{bib_Appl_2018, AUTHOR = {THIRUMURU RAMA KRISHNA, Anil Kumar Vuppala}, TITLE = {Application of non-negative frequency-weighted energy operator for vowel region detection}, BOOKTITLE = {International Journal of Speech Technology}. YEAR = {2018}}
In this paper, a novel technique is proposed for vowel region detection from continuous speech using the envelope of the derivative of the speech signal, which acts as a non-negative, frequency-weighted energy operator. The proposed vowel region detection method is implemented as a two-stage algorithm. The first stage consists of speech signal analysis to detect vowel onset points (VOPs) and vowel end-points (VEPs) using an instantaneous energy contour obtained from the envelope of the derivative of the speech signal. The VOPs and VEPs are spotted using a peak-finding algorithm based on a first-order Gaussian differentiator. The next stage consists of the removal of spurious vowel regions and the correction of hypothesized VOP and VEP locations using combined cues obtained from the uniformity of epoch intervals and the strength of excitation of the speech signal. The performance of the proposed method for detecting vowel regions is evaluated using the TIMIT acoustic-phonetic speech corpus. The proposed approach resulted in a significantly higher detection rate and a lower false alarm rate compared to state-of-the-art methods in both clean and noisy environments.
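One way to picture the peak-finding step: convolve an energy contour with a first-order Gaussian differentiator and take the positive and negative peaks of the response as onset-like and end-like landmarks. The kernel width and threshold below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def gaussian_derivative_landmarks(energy, sigma=8):
    """Locate rise (onset-like) and fall (end-like) points of an energy contour by
    convolving it with a first-order Gaussian differentiator and picking peaks."""
    n = np.arange(-4 * sigma, 4 * sigma + 1)
    fog = -n * np.exp(-n ** 2 / (2.0 * sigma ** 2))        # first-order Gaussian derivative
    response = np.convolve(energy, fog, mode='same')        # rises -> positive peaks
    thr = 0.3 * np.abs(response).max()
    onsets = [i for i in range(1, len(response) - 1)
              if response[i] > thr and response[i - 1] <= response[i] >= response[i + 1]]
    ends = [i for i in range(1, len(response) - 1)
            if response[i] < -thr and response[i - 1] >= response[i] <= response[i + 1]]
    return onsets, ends

# toy usage: an energy contour with a single vowel-like bump
contour = np.concatenate([np.zeros(50), np.hanning(80), np.zeros(50)])
print(gaussian_derivative_landmarks(contour))
```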
Significance of Accuracy in Vowel Region Detection for Robust Language Identification
THIRUMURU RAMA KRISHNA,VUDDAGIRI RAVI KUMAR,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
International Conference on Signal Processing and Integrated Networks, SPIN, 2018
@inproceedings{bib_Sign_2018, AUTHOR = {THIRUMURU RAMA KRISHNA, VUDDAGIRI RAVI KUMAR, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Significance of Accuracy in Vowel Region Detection for Robust Language Identification}, BOOKTITLE = {International Conference on Signal Processing and Integrated Networks}. YEAR = {2018}}
In this study, a robust language identification (LID) system is investigated with a vowel region detection scheme incorporated at the front end. Background noise is known to severely degrade language identification performance. In this work, Mel-frequency cepstral features extracted from vowel regions, instead of from all voiced speech, are utilized for language identification system modeling. Experimentation is done using different vowel region detection algorithms on deep neural network (DNN) and deep neural network with attention (DNN-WA) based LID systems on the IIITH Indian language dataset. Experimental results show that an accurate vowel region detection mechanism at the front end of an LID system improves performance in noisy environments.
Differenced Prosody Features from Normal and Stressed Regions for Emotion Recognition
GURUGUBELLI KRISHNA,KNRK RAJU ALLURI,Anil Kumar Vuppala
International Conference on Signal Processing and Integrated Networks, SPIN, 2018
@inproceedings{bib_Diff_2018, AUTHOR = {GURUGUBELLI KRISHNA, KNRK RAJU ALLURI, Anil Kumar Vuppala}, TITLE = {Differenced Prosody Features from Normal and Stressed Regions for Emotion Recognition}, BOOKTITLE = {International Conference on Signal Processing and Integrated Networks}. YEAR = {2018}}
The physiological constraints of the human speech production mechanism confine the expression of a speaker's emotional state to short regions of an emotional utterance. In this work, these emotionally stressed regions are detected using the strength of excitation (SoE) and the fundamental frequency (F0) of the speech signal. The emotionally stressed regions for the different emotions are then classified using a linear-kernel support vector machine (SVM) with a two-stage binary decision logic. This classifier is modelled using differenced prosody features, extracted as the relative difference of the prosody components between normal and emotionally stressed regions. The experimentation is carried out using the Berlin emotional speech (EMO-DB) database. The proposed approach produced a better result compared to existing state-of-the-art emotion recognition systems, with an accuracy of 80.83% reported in this paper.
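A toy sketch of the differenced-prosody idea is given below: frames are marked as emotionally stressed by a simple threshold on F0 and SoE (a stand-in for the paper's detection logic), and features are formed as relative differences of prosody statistics between the stressed and remaining regions; the threshold rule and the choice of statistics are assumptions.

```python
import numpy as np

def differenced_prosody(f0, energy, soe, f0_factor=1.2, soe_factor=1.2):
    """Relative difference of prosody statistics between emotionally stressed and
    normal regions. Frames whose F0 and strength of excitation (SoE) exceed a
    multiple of their utterance means are treated as stressed (a stand-in rule)."""
    stressed = (f0 > f0_factor * f0.mean()) & (soe > soe_factor * soe.mean())
    normal = ~stressed
    feats = []
    for contour in (f0, energy, soe):
        s_mean = contour[stressed].mean() if stressed.any() else 0.0
        n_mean = contour[normal].mean() if normal.any() else 1e-8
        feats.append((s_mean - n_mean) / (abs(n_mean) + 1e-8))   # relative difference
    return np.asarray(feats)      # one relative-difference value each for F0, energy, SoE

# toy usage with random contours of 500 frames
rng = np.random.default_rng(2)
print(differenced_prosody(rng.uniform(100, 300, 500), rng.random(500), rng.random(500)))
```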
Improved vowel region detection from a continuous speech using post processing of vowel onset points and vowel end-points
THIRUMURU RAMA KRISHNA,Suryakanth Gangashetty,Anil Kumar Vuppala
Journal on Multimedia Tools Applications, JMTA, 2018
@inproceedings{bib_Impr_2018, AUTHOR = {THIRUMURU RAMA KRISHNA, Suryakanth Gangashetty, Anil Kumar Vuppala}, TITLE = {Improved vowel region detection from a continuous speech using post processing of vowel onset points and vowel end-points}, BOOKTITLE = {Journal on Multimedia Tools Applications}. YEAR = {2018}}
Vowels are produced with an open configuration of the vocal tract, without any audible friction. The acoustic signal is relatively loud, with varying strength of impulse-like excitation, and vowels possess significant energy in the low-frequency bands of the speech signal. Acoustic events such as the vowel onset point (VOP) and vowel end-point (VEP) can be used as landmarks to detect vowel regions in a speech signal. In this paper, a two-stage algorithm is proposed to detect precise vowel regions. In the first stage, the speech signal is processed using zero frequency filtering to emphasize the energy content in the low-frequency bands of speech. The zero frequency filtered signal predominantly contains the low-frequency content of the speech signal, as it is filtered around 0 Hz. This process is followed by the extraction of dominant spectral peaks from the magnitude spectrum around the glottal closure regions of the speech signal. The vowel onset points and vowel end-points are obtained by convolving the enhanced spectral contour of the zero frequency filtered signal with a first-order Gaussian differentiator. In the second stage, post-processing is carried out in the regions around the VOPs and VEPs to remove spurious vowel regions based on the uniformity of epoch intervals, and the positions of the VOPs and VEPs are corrected using the strength of excitation of the speech signal. The performance of the proposed vowel region detection method is compared with existing state-of-the-art methods on the TIMIT acoustic-phonetic speech corpus. The method yields a significant improvement in vowel region detection in both clean and noisy environments.
Curriculum learning based approach for noise robust language identification using DNN with attention
VUDDAGIRI RAVI KUMAR,VYDANA HARI KRISHNA,Anil Kumar Vuppala
Expert Systems with Applications, ESWA, 2018
@inproceedings{bib_Curr_2018, AUTHOR = {VUDDAGIRI RAVI KUMAR, VYDANA HARI KRISHNA, Anil Kumar Vuppala}, TITLE = {Curriculum learning based approach for noise robust language identification using DNN with attention}, BOOKTITLE = {Expert Systems with Applications}. YEAR = {2018}}
Automatic language identification (LID) in practical environments is gaining a lot of scientific attention due to rapid developments in multilingual speech processing applications. When an LID system is operated in noisy environments, a degradation in performance is observed, which can be attributed largely to the mismatch between the training and operating environments. This work is aimed at developing an LID system that can operate robustly in both clean and noisy environments. Traditionally, to reduce the mismatch between training and operating environments, noise is synthetically added to the training corpus, and the resulting models are termed multi-SNR models. In this work, various curriculum learning strategies are explored to train multi-SNR models, such that the trained models generalize better over varying background environments. I-vector, deep neural network (DNN), and DNN with attention (DNN-WA) architectures are used in this work for developing the LID systems. Experimental verification of the proposed approach is carried out using the IIIT-H Indian database and the AP17-OLR database. The performance of the LID systems is tested at different signal-to-noise ratio (SNR) levels using white and vehicular noises from the NOISEX dataset. In comparison to multi-SNR models, the LID systems trained with curriculum learning performed better in terms of equal error rate (EER) and generalization of EER across varying background environments. The degradation in the performance of LID systems due to environmental noise is effectively reduced by training multi-SNR models using curriculum learning.
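A minimal sketch of the easy-to-hard scheduling idea: the multi-SNR training pool is ordered from clean (high SNR) to noisy (low SNR), and each stage of training adds a harder slice of the pool. The specific staging rule below is one of several possible curricula and is an assumption, not the paper's exact strategy.

```python
import random

def curriculum_batches(pool, batch_size, num_stages=4):
    """Yield training batches from clean to noisy.

    pool: list of (utterance_id, snr_db) pairs covering clean and noise-added copies.
    Stage 1 uses only the highest-SNR slice of the pool; each later stage adds the
    next-harder slice, and the final stage covers the whole multi-SNR pool."""
    ordered = sorted(pool, key=lambda item: item[1], reverse=True)   # high SNR first
    for stage in range(1, num_stages + 1):
        available = ordered[: int(len(ordered) * stage / num_stages)]
        random.shuffle(available)                                    # shuffle within the stage
        for i in range(0, len(available), batch_size):
            yield stage, available[i:i + batch_size]

# toy usage: utterances replicated at 20, 10, 5 and 0 dB SNR
pool = [("utt%d" % i, snr) for i in range(8) for snr in (20, 10, 5, 0)]
for stage, batch in curriculum_batches(pool, batch_size=8):
    print(stage, [snr for _, snr in batch])
```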
Automatic Detection of Palatalized Consonants in Kashmiri
THIRUMURU RAMA KRISHNA,GURUGUBELLI KRISHNA,Anil Kumar Vuppala
Workshop on Spoken Language Technologies for Under-Resourced Languages, SLTU-W, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {THIRUMURU RAMA KRISHNA, GURUGUBELLI KRISHNA, Anil Kumar Vuppala}, TITLE = {Automatic Detection of Palatalized Consonants in Kashmiri}, BOOKTITLE = {Workshop on Spoken Language Technologies for Under-Resourced Languages}. YEAR = {2018}}
In this study, the acoustic-phonetic attributes of palatalization in Kashmiri speech are investigated. Palatalization is a unique phonetic feature of Kashmiri in the Indian context, and an automated approach is proposed to detect it from continuous Kashmiri speech. The i-matra vowel has the effect of palatalizing the consonant attached to it. Therefore, these consonants are investigated in synchrony with vowel regions, which are spotted using the instantaneous energy computed from the envelope of the derivative of the speech signal. The resonating characteristics of the vocal-tract system that reflect the formant dynamics are used to differentiate palatalized consonants from other consonants. In this regard, the Hilbert envelope of the numerator of the group-delay function, which provides good time-frequency resolution, is used to extract formants. Palatalization detection experiments were carried out in various vowel contexts using these acoustic cues, producing a promising detection accuracy of 92.46%.
IIITH-ILSC Speech Database for Indian Language Identification.
VUDDAGIRI RAVI KUMAR,GURUGUBELLI KRISHNA,PRIYAM JAIN,VYDANA HARI KRISHNA,Anil Kumar Vuppala
Workshop on Spoken Language Technologies for Under-Resourced Languages, SLTU-W, 2018
@inproceedings{bib_IIIT_2018, AUTHOR = {VUDDAGIRI RAVI KUMAR, GURUGUBELLI KRISHNA, PRIYAM JAIN, VYDANA HARI KRISHNA, Anil Kumar Vuppala}, TITLE = {IIITH-ILSC Speech Database for Indian Language Identification.}, BOOKTITLE = {Workshop on Spoken Language Technologies for Under-Resourced Languages}. YEAR = {2018}}
This work focuses on the development of a speech corpus comprising 23 Indian languages for developing language identification (LID) systems. A large amount of data is a prerequisite for developing state-of-the-art LID systems. With this motivation, the task of developing a multilingual speech corpus for Indian languages has been initiated. This paper describes the composition of the data and the performance of various LID systems developed using it. Mel frequency cepstral feature representation is used for language identification. Various state-of-the-art LID systems are developed using i-vector, deep neural network (DNN), and deep neural network with attention (DNN-WA) models. The performance of the LID systems in terms of equal error rate is 17.77%, 17.95%, and 15.18% for the i-vector, DNN, and DNN-WA models, respectively. The deep neural network with attention model shows better performance than the i-vector and DNN models.
Improved Language Identification Using Stacked SDC Features and Residual Neural Network.
VUDDAGIRI RAVI KUMAR,VYDANA HARI KRISHNA,Anil Kumar Vuppala
Workshop on Spoken Language Technologies for Under-Resourced Languages, SLTU-W, 2018
@inproceedings{bib_Impr_2018, AUTHOR = {VUDDAGIRI RAVI KUMAR, VYDANA HARI KRISHNA, Anil Kumar Vuppala}, TITLE = {Improved Language Identification Using Stacked SDC Features and Residual Neural Network.}, BOOKTITLE = {Workshop on Spoken Language Technologies for Under-Resourced Languages}. YEAR = {2018}}
Language identification (LID) systems that can model high-level information such as phonotactics have exhibited superior performance. State-of-the-art models use sequential models to capture this high-level information, but these models are sensitive to the length of the utterance and do not generalize equally well over variable-length utterances. To effectively capture this information, a feature that can model the long-term temporal context is required. This study aims to capture the long-term temporal context by appending successive shifted delta cepstral (SDC) features. Deep neural networks are explored for developing the LID systems. Experiments have been performed using the AP17-OLR database. LID systems developed by stacking SDC features show significant improvement compared to systems trained with plain SDC features. The proposed feature, with residual connections in the feed-forward networks, reduced the equal error rate from 21.04, 18.02, and 16.45 to 14.42, 11.14, and 10.11 on the 1-second, 3-second, and >3-second test utterances, respectively.
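To illustrate the feature construction: shifted delta cepstra (SDC) with the usual N-d-P-k parameterization concatenate k delta blocks taken at shifts of P frames, and stacking appends several successive SDC vectors to widen the temporal context. The parameter values in the sketch (7-1-3-7 and a stack of 3) are common choices used here for illustration, not necessarily the paper's configuration.

```python
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted delta cepstra: for frame t, concatenate the delta vectors
    c[t + i*P + d] - c[t + i*P - d] for i = 0 .. k-1.  cepstra: (T, N)."""
    T, N = cepstra.shape
    span = (k - 1) * P + d
    out = np.zeros((T, N * k))
    for t in range(d, T - span):
        blocks = [cepstra[t + i * P + d] - cepstra[t + i * P - d] for i in range(k)]
        out[t] = np.concatenate(blocks)
    return out

def stack_sdc(sdc_feats, context=3):
    """Append `context` successive SDC vectors to widen the temporal context."""
    T, D = sdc_feats.shape
    stacked = np.zeros((T, D * context))
    for t in range(T - context + 1):
        stacked[t] = sdc_feats[t:t + context].reshape(-1)
    return stacked

# toy usage: 200 frames of 7-dim cepstra -> 49-dim SDC -> 147-dim stacked SDC
cepstra = np.random.randn(200, 7)
print(stack_sdc(sdc(cepstra)).shape)   # (200, 147)
```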
An Exploration towards Joint Acoustic Modeling for Indian Languages: IIIT-H Submission for Low Resource Speech Recognition Challenge for Indian Languages, INTERSPEECH 2018.
VYDANA HARI KRISHNA,GURUGUBELLI KRISHNA,VISHNU VIDYADHARA RAJU V,Anil Kumar Vuppala
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2018
@inproceedings{bib_An_E_2018, AUTHOR = {VYDANA HARI KRISHNA, GURUGUBELLI KRISHNA, VISHNU VIDYADHARA RAJU V, Anil Kumar Vuppala}, TITLE = {An Exploration towards Joint Acoustic Modeling for Indian Languages: IIIT-H Submission for Low Resource Speech Recognition Challenge for Indian Languages, INTERSPEECH 2018.}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2018}}
India being a multilingual society, a multilingual automatic speech recognition (ASR) system is widely appreciated. Despite different orthographies, Indian languages share the same phonetic space. To exploit this property, a joint acoustic model has been trained for developing a multilingual ASR system using a common phone set. Three Indian languages, namely Telugu, Tamil, and Gujarati, are considered for the study. This work studies the amenability of two different acoustic modeling approaches for training a joint acoustic model using a common phone set: sub-space Gaussian mixture models (SGMM), and recurrent neural networks (RNN) trained with the connectionist temporal classification (CTC) objective function. From the experimental results, it can be observed that the joint acoustic models trained with RNN-CTC performed better than the SGMM system, even with 120 hours of data (approximately 40 hours per language). The joint acoustic model trained with RNN-CTC also performed better than the monolingual models, due to efficient data sharing across the languages. Conditioning the joint model on language identity had a minimal advantage. Sub-sampling the features by a factor of 2 while training the RNN-CTC models reduced the training time and performed better. Index Terms: Speech recognition, Joint acoustic model, low-resource, common phone set, Indian languages, RNN-CTC, SGMM
DNN-HMM Acoustic Modeling for Large Vocabulary Telugu Speech Recognition
VISHNU VIDYADHARA RAJU V,GURUGUBELLI KRISHNA,VYDANA HARI KRISHNA,Bhargav Pulugundla,Manish Srivastava,Anil Kumar Vuppala
International Conference on Mining Intelligence and Knowledge Exploration, MIKE, 2017
@inproceedings{bib_DNN-_2017, AUTHOR = {VISHNU VIDYADHARA RAJU V, GURUGUBELLI KRISHNA, VYDANA HARI KRISHNA, Bhargav Pulugundla, Manish Srivastava, Anil Kumar Vuppala}, TITLE = {DNN-HMM Acoustic Modeling for Large Vocabulary Telugu Speech Recognition}, BOOKTITLE = {International Conference on Mining Intelligence and Knowledge Exploration}. YEAR = {2017}}
The main focus of this paper is the development of a large-vocabulary Telugu speech database. Telugu is a low-resource language for which no standardized database exists for building automatic speech recognition (ASR) systems. The database consists of neutral speech samples collected from 100 speakers for building a Telugu ASR system and is named the IIIT-H Telugu speech corpus. The design of the speech and text corpus and the procedure followed for collecting the database are discussed in detail. Preliminary ASR results for the models built on this database are reported. The architectural choices of deep neural networks (DNNs) play a crucial role in improving the performance of ASR systems. ASR systems trained with hybrid DNNs (DNN-HMM) with more hidden layers have shown better performance than conventional GMMs (GMM-HMM). The Kaldi toolkit is used for building the acoustic models required for the ASR system.
Importance of non-uniform prosody modification for speech recognition in emotion conditions
VISHNU VIDYADHARA RAJU V,VYDANA HARI KRISHNA,Suryakanth Gangashetty,Anil Kumar Vuppala
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), APSIPA, 2017
@inproceedings{bib_Impo_2017, AUTHOR = {VISHNU VIDYADHARA RAJU V, VYDANA HARI KRISHNA, Suryakanth Gangashetty, Anil Kumar Vuppala}, TITLE = {Importance of non-uniform prosody modification for speech recognition in emotion conditions}, BOOKTITLE = {Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}. YEAR = {2017}}
A mismatch between training and operating environments causes performance degradation in automatic speech recognition (ASR) systems. One major reason for this mismatch is the presence of expressive (emotive) speech in operational environments. Emotions in speech mainly induce changes in the prosody parameters of pitch, duration, and energy. This work is aimed at improving the performance of speech recognition systems in the presence of emotive speech without disturbing the existing ASR system. The prosody modification of pitch, duration, and energy is achieved by tuning the modification factor values based on the relative differences between the neutral and emotional data sets. The neutral version of the emotive speech is generated using uniform and non-uniform prosody modification methods for speech recognition. The IITKGP-SESC corpus is used for building the ASR system, and the speech recognition system is evaluated for the emotions of anger, happiness, and compassion. An improvement in the performance of the ASR system is observed when the prosody-modified emotive utterance is used for speech recognition in place of the original emotive utterance. An average improvement of around 5% in accuracy is observed with the non-uniform prosody modification methods.
Sentiment analysis using relative prosody features
HARIKA ABBURI,KNRK RAJU ALLURI,Anil Kumar Vuppala,Manish Srivastava,Suryakanth Gangashetty
International Conference on Contemporary Computing, IC3, 2017
@inproceedings{bib_Sent_2017, AUTHOR = {HARIKA ABBURI, KNRK RAJU ALLURI, Anil Kumar Vuppala, Manish Srivastava, Suryakanth Gangashetty}, TITLE = {Sentiment analysis using relative prosody features}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2017}}
The recent growth in the use of digital media has led people to share their opinions about specific entities through audio. In this paper, an approach to detect the sentiment of online spoken reviews based on relative prosody features is presented. Most existing systems for audio-based sentiment analysis use conventional audio features, which are not problem-specific for extracting sentiment. In this work, relative prosody features are extracted from normal and stressed regions of the audio signal to detect sentiment, where stressed regions are identified using the strength of excitation. Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) classifiers are used to build the sentiment models, and the MOUD database is used for the study. Experimental results show that the sentiment detection rate improves with relative prosody features compared with prosody and Mel Frequency Cepstral Coefficient (MFCC) features, because the relative prosody features carry more sentiment-specific discrimination than plain prosody features.
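The following is a minimal sketch of the relative-prosody idea: prosody statistics from stressed regions are expressed relative to those from normal regions. The stressed-region mask would in practice come from the strength-of-excitation analysis mentioned in the abstract; here it is assumed to be given, and the exact feature set is an assumption.

```python
# Hedged sketch of relative prosody features: ratios and differences of
# pitch/energy statistics between stressed and normal regions of an utterance.
import numpy as np

def relative_prosody_features(f0, energy, stressed_mask):
    """f0, energy: per-frame contours; stressed_mask: boolean array of the same length."""
    normal_mask = ~stressed_mask
    feats = []
    for contour in (f0, energy):
        s = contour[stressed_mask & (contour > 0)]
        n = contour[normal_mask & (contour > 0)]
        # Ratio and difference of means act as simple relative features.
        feats += [s.mean() / (n.mean() + 1e-8),
                  s.mean() - n.mean(),
                  s.std() / (n.std() + 1e-8)]
    # Fraction of frames marked as stressed is a crude duration-like cue.
    feats.append(stressed_mask.mean())
    return np.array(feats)
```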
Significance of neural phonotactic models for large-scale spoken language identification
BRIJ MOHAN LAL SRIVASTAVA,VYDANA HARI KRISHNA,Anil Kumar Vuppala,Manish Srivastava
International Joint Conference on Neural Networks, IJCNN, 2017
@inproceedings{bib_Sign_2017, AUTHOR = {BRIJ MOHAN LAL SRIVASTAVA, VYDANA HARI KRISHNA, Anil Kumar Vuppala, Manish Srivastava}, TITLE = {Significance of neural phonotactic models for large-scale spoken language identification}, BOOKTITLE = {International Joint Conference on Neural Networks}. YEAR = {2017}}
Language identification (LID) is a vital front-end for spoken dialogue systems operating in diverse linguistic settings, reducing recognition and understanding errors. Existing LID systems that use low-level signal information for classification do not scale well due to the exponential growth of parameters as the number of classes increases, and they also suffer performance degradation due to the inherent variabilities of the speech signal. In the proposed approach, we model the language-specific phonotactic information in speech using a recurrent neural network to develop an LID system. The input speech signal is tokenized into phone sequences by a common language-independent phone recognizer with varying phonetic coverage, and we establish a causal relationship between phonetic coverage and LID performance. The phonotactics in the observed phone sequences are modeled using statistical and recurrent neural network language models to predict the language-specific symbol from a universal phonetic inventory. The proposed approach is robust, computationally lightweight and highly scalable. Experiments show that the convex combination of statistical and recurrent neural network language model (RNNLM) based phonotactic models significantly outperforms a strong Deep Neural Network (DNN) baseline, which itself surpasses an i-vector based approach for LID. The proposed approach outperforms the baseline models in terms of mean F1 score over 176 languages. Further, we provide information-theoretic evidence to analyze the mechanism of the proposed approach.
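A hedged sketch of the scoring step described above: per-language phone-sequence log-likelihoods from a statistical (n-gram) model and an RNNLM are combined with a convex weight and the best-scoring language is selected. The `log_prob` interface on the language-model objects is a hypothetical stand-in, not an actual library API.

```python
# Illustrative phonotactic LID scoring with a convex combination of two
# language models per language; the language models are assumed trained.
def score_language(phones, ngram_lm, rnn_lm, lam=0.5):
    """phones: list of phone tokens; *_lm: objects exposing log_prob(sequence)."""
    return lam * ngram_lm.log_prob(phones) + (1.0 - lam) * rnn_lm.log_prob(phones)

def identify_language(phones, lms_by_language, lam=0.5):
    """lms_by_language: dict mapping language name -> (ngram_lm, rnn_lm)."""
    scores = {lang: score_language(phones, ng, rnn, lam)
              for lang, (ng, rnn) in lms_by_language.items()}
    return max(scores, key=scores.get), scores
```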
SFF Anti-Spoofer: IIIT-H Submission for Automatic Speaker Verification Spoofing and Countermeasures Challenge 2017.
KNRK RAJU ALLURI,SIVANAND A,KADIRI SUDARSANA REDDY,Suryakanth Gangashetty,Anil Kumar Vuppala
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2017
@inproceedings{bib_SFF__2017, AUTHOR = {KNRK RAJU ALLURI, SIVANAND A, KADIRI SUDARSANA REDDY, Suryakanth Gangashetty, Anil Kumar Vuppala}, TITLE = {SFF Anti-Spoofer: IIIT-H Submission for Automatic Speaker Verification Spoofing and Countermeasures Challenge 2017.}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2017}}
The ASVspoof 2017 challenge concerns the detection of replayed speech as opposed to live human speech. The proposed system makes use of the fact that when speech signals are replayed, they pass through multiple channels, unlike original recordings. This channel information is typically embedded in low signal-to-noise-ratio regions, so a speech signal processing method with high spectro-temporal resolution is required to extract robust features from such regions. Single frequency filtering (SFF) is one such technique, which we propose to use for replay attack detection. While the SFF-based feature representation is used at the front end, Gaussian mixture models and bidirectional long short-term memory models are investigated as back-end classifiers. Experimental results on the ASVspoof 2017 dataset reveal that the SFF-based representation is very effective in detecting replay attacks. Score-level fusion of the back-end classifiers further improves system performance, indicating that the two classifiers capture complementary information.
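For readers unfamiliar with single frequency filtering, the sketch below computes an SFF envelope at one frequency: the signal is frequency-shifted so the frequency of interest maps to fs/2 and passed through a near-unity single-pole filter. The pole radius of 0.99 is an assumed value, and this is a simplified illustration rather than the authors' full front end.

```python
# Hedged SFF sketch: high-resolution amplitude envelope of a signal at one frequency.
import numpy as np

def sff_envelope(x, fs, f_k, r=0.99):
    """SFF amplitude envelope of signal x (sampling rate fs) at frequency f_k (Hz)."""
    n = np.arange(len(x))
    omega = np.pi - 2 * np.pi * f_k / fs      # shift f_k to fs/2
    shifted = x * np.exp(1j * omega * n)
    y = np.zeros(len(x), dtype=complex)
    prev = 0.0 + 0.0j
    for i in range(len(x)):                   # single-pole filter H(z) = 1 / (1 + r z^-1)
        y[i] = shifted[i] - r * prev
        prev = y[i]
    return np.abs(y)                          # envelope at frequency f_k
```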
Detection of Replay Attacks using Single Frequency Filtering Cepstral Coefficients
KNRK RAJU ALLURI,SIVANAND A,KADIRI SUDARSANA REDDY,Suryakanth Gangashetty,Anil Kumar Vuppala
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2017
@inproceedings{bib_Dete_2017, AUTHOR = {KNRK RAJU ALLURI, SIVANAND A, KADIRI SUDARSANA REDDY, Suryakanth Gangashetty, Anil Kumar Vuppala}, TITLE = {Detection of Replay Attacks using Single Frequency Filtering Cepstral Coefficients}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2017}}
Automatic speaker verification systems are vulnerable to spoofing attacks. Recently, various countermeasures have been developed for detecting high-technology attacks such as speech synthesis and voice conversion; however, there is still a wide gap in dealing with replay attacks. In this paper, we propose a new feature for replay attack detection based on single frequency filtering (SFF), which provides high temporal and spectral resolution at each instant. Single frequency filtering cepstral coefficients (SFFCC) with a Gaussian mixture model classifier are used for experiments on the standard BTAS-2016 corpus. The previously reported best result, based on constant Q cepstral coefficients (CQCC), achieved a half total error rate of 0.67% on this dataset. Our proposed method outperforms this state of the art with a half total error rate of 0.0002%.
Investigative Study of Various Activation functions for Speech Recognition
VYDANA HARI KRISHNA,Anil Kumar Vuppala
National Conference on Communications, NCC, 2017
@inproceedings{bib_Inve_2017, AUTHOR = {VYDANA HARI KRISHNA, Anil Kumar Vuppala}, TITLE = {Investigative Study of Various Activation functions for Speech Recognition}, BOOKTITLE = {National Conference on Communications}. YEAR = {2017}}
Significant developments in deep learning have come from the capability to train deeper networks, and the performance of speech recognition systems has been greatly improved by deep learning techniques. Many of these developments are associated with new activation functions and their corresponding initializations. The development of rectified linear units (ReLU) has revolutionized the use of supervised deep learning for speech recognition, and recently there has been considerable research interest in activation functions such as Leaky-ReLU (LReLU), Parametric-ReLU (PReLU), Exponential Linear Units (ELU) and Parametric-ELU (PELU). This work studies the influence of various activation functions on speech recognition. A hidden Markov model-deep neural network (HMM-DNN) based speech recognition system is used, where deep neural networks with different activation functions are employed to obtain the emission probabilities of the hidden Markov model. Two datasets, TIMIT and WSJ, are employed to study the behavior of the resulting systems with different dataset sizes. It is observed that ReLU networks perform best on the smaller dataset (TIMIT), while for datasets of sufficiently large size (WSJ) ELU networks are superior to the other networks.
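A small sketch of how such a comparison might be set up in PyTorch: the same feed-forward acoustic-model topology is instantiated with different activation functions (PELU is omitted since it has no built-in PyTorch module). Layer widths, depth and dimensions are illustrative, not the paper's configuration.

```python
# Building otherwise identical DNN acoustic models with swappable activations.
import torch.nn as nn

ACTIVATIONS = {
    "relu": nn.ReLU,
    "lrelu": lambda: nn.LeakyReLU(0.01),
    "prelu": nn.PReLU,
    "elu": nn.ELU,
}

def make_dnn(act="relu", feat_dim=440, hidden=1024, depth=5, num_senones=2000):
    layers, in_dim = [], feat_dim
    for _ in range(depth):
        layers += [nn.Linear(in_dim, hidden), ACTIVATIONS[act]()]
        in_dim = hidden
    layers.append(nn.Linear(hidden, num_senones))
    return nn.Sequential(*layers)

# One model per activation, ready to be trained and compared.
models = {name: make_dnn(name) for name in ACTIVATIONS}
```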
Residual neural networks for speech recognition
VYDANA HARI KRISHNA,Anil Kumar Vuppala
European Signal Processing Conference, EUSIPCO, 2017
@inproceedings{bib_Resi_2017, AUTHOR = {VYDANA HARI KRISHNA, Anil Kumar Vuppala}, TITLE = {Residual neural networks for speech recognition}, BOOKTITLE = {European Signal Processing Conference}. YEAR = {2017}}
Recent developments in deep learning methods have greatly influenced the performance of speech recognition systems. In a hidden Markov model-deep neural network (HMM-DNN) based speech recognition system, DNNs model senones (context-dependent states of the HMM), while the HMMs capture the temporal relations among senones. Significant improvements have been observed from using deeper networks, and developing methods to train deeper architectures has gained considerable scientific interest. Optimizing a deeper network is a more complex task than optimizing a shallower one, but residual networks have recently exhibited the capability to train very deep architectures without being prone to vanishing/exploding gradient problems. In this work, the effectiveness of residual networks is explored for speech recognition. Along with the depth of the residual network, the criticality of its width is also studied; it is observed that at higher depths the width of the network is also a crucial parameter for attaining significant improvements. A 14-hour subset of the WSJ corpus is used for training the speech recognition systems, and the residual networks converge with much greater ease even at depths much higher than those of the plain deep neural networks. Using residual networks, an absolute reduction of 0.4 in word error rate (an 8% relative reduction) is attained compared to the best performing deep neural network.
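A minimal residual block sketch, assuming fully connected layers of equal width: the identity shortcut is what eases optimization of very deep stacks. The width and activation choice are assumptions for illustration.

```python
# Fully connected residual block with an identity shortcut.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width=1024):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.fc1(x))
        h = self.fc2(h)
        return self.act(h + x)   # shortcut keeps gradients flowing in deep stacks

# Stacking many blocks remains trainable even at considerable depth.
deep_model = nn.Sequential(*[ResidualBlock(1024) for _ in range(20)])
```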
A study on the minimum dominating set problem approximation in parallel
MAHAK GAMBHIR,Anil Kumar Vuppala
International Conference on Contemporary Computing, IC3, 2017
@inproceedings{bib_A_st_2017, AUTHOR = {MAHAK GAMBHIR, Anil Kumar Vuppala}, TITLE = {A study on the minimum dominating set problem approximation in parallel}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2017}}
A dominating set of small size is useful in several settings, including wireless networks, document summarization, secure system design, and the like. In this paper, we start by studying three distributed algorithms that produce small dominating sets in a few rounds. We interpret these algorithms in the natural shared memory setting and experiment with them on a multi-core CPU (see the sketch below for the underlying problem). Based on the observations from these experiments, we propose variations to the three algorithms and show how the proposed variations offer interesting trade-offs between the size of the dominating set produced and the time taken.
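As a point of reference for the dominating-set problem, here is the classic sequential greedy heuristic; it is not one of the distributed or shared-memory algorithms studied in the paper, only an illustration of what a small dominating set is.

```python
# Greedy heuristic: repeatedly pick the vertex that covers the most uncovered vertices.
def greedy_dominating_set(adj):
    """adj: dict mapping vertex -> set of neighbours."""
    uncovered = set(adj)
    dom = set()
    while uncovered:
        v = max(adj, key=lambda u: len(({u} | adj[u]) & uncovered))
        dom.add(v)
        uncovered -= {v} | adj[v]
    return dom

# Example: the path 0-1-2-3-4 is dominated by {1, 3}.
print(greedy_dominating_set({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}))
```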
Starting Small Learning Strategies for Speech Recognition
VYDANA HARI KRISHNA,BRIJ MOHAN LAL SRIVASTAVA,Manish Srivastava,Anil Kumar Vuppala
India Council International Conference, INDICON, 2016
@inproceedings{bib_Star_2016, AUTHOR = {VYDANA HARI KRISHNA, BRIJ MOHAN LAL SRIVASTAVA, Manish Srivastava, Anil Kumar Vuppala}, TITLE = {Starting Small Learning Strategies for Speech Recognition}, BOOKTITLE = {India Council International Conference}. YEAR = {2016}}
Designing learning strategies has gained considerable scientific interest with the recent progress of deep learning. Curriculum learning is a strategy that trains a neural network model by presenting samples in a specific, meaningful order rather than randomly sampling training examples from the data distribution. In this work, we explore the starting-small paradigm of curriculum learning for speech recognition. The starting-small paradigm is realized as a two-step learning strategy: the training dataset is re-organized as a set of easily classifiable examples followed by the actual training dataset, and the model is trained on the re-organized data. We hypothesize that, by following the starting-small paradigm, learning is initialized in a better way and progresses to a better convergence. We propose to rank the difficulty of training examples based on the posterior probabilities obtained from a previously trained model. Apart from re-arranging the training corpus, the starting-small paradigm is also applied at the model level: we consider the broad manner-class classification objective as a smoother version of the phone-class classification objective, and a model initially trained for broad-class classification is later adapted for phone classification. TIMIT and a subset of the Wall Street Journal (WSJ) corpus are used to validate the experiments, and both learning strategies show consistently better performance across the two datasets compared to a baseline system trained by randomly sampling the dataset.
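A hedged sketch of the data-level "starting small" ordering: examples are ranked by the posterior probability an earlier model assigns to their true class, the easiest fraction is presented first, and the full shuffled set follows. The easy-fraction value is an assumption, not the paper's setting.

```python
# Curriculum ordering from an earlier model's posteriors: easiest examples first,
# then the actual (shuffled) training set.
import numpy as np

def curriculum_order(posteriors, labels, easy_fraction=0.3, seed=0):
    """posteriors: (N, C) array from an earlier model; labels: (N,) true classes."""
    confidence = posteriors[np.arange(len(labels)), labels]
    ranked = np.argsort(-confidence)            # most confident (easiest) first
    n_easy = int(easy_fraction * len(labels))
    rng = np.random.default_rng(seed)
    rest = rng.permutation(len(labels))         # then the full shuffled dataset
    return np.concatenate([ranked[:n_easy], rest])
```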
Analysis of source and system features for speaker recognition in emotional conditions
KNRK RAJU ALLURI,VISHNU VIDYADHARA RAJU V,Suryakanth Gangashetty,Anil Kumar Vuppala
IEEE Region 10 Conference, TENCON, 2016
@inproceedings{bib_Anal_2016, AUTHOR = {KNRK RAJU ALLURI, VISHNU VIDYADHARA RAJU V, Suryakanth Gangashetty, Anil Kumar Vuppala}, TITLE = {Analysis of source and system features for speaker recognition in emotional conditions}, BOOKTITLE = {IEEE Region 10 Conference}. YEAR = {2016}}
Source and system features are extensively used in building speaker recognition (SR) systems. In this paper, we investigate the influence of source and system features on the performance of SR systems in emotional conditions. Linear Prediction Residual Cepstral Coefficients (LPRCC), which correspond to source features, and Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC), which correspond to system features, are used to model the SR system. A maximum-likelihood classifier based on Gaussian mixture density functions is used, and experiments are carried out on three standard emotional speech databases (the Indian Institute of Technology Simulated Emotion speech corpus, IITKGP-SESC: Hindi and IITKGP-SESC: Telugu, and the German Emotional Speech Database, EMO-DB). The speaker models are trained with neutral utterances and tested with three types of emotional utterances (anger, happy and sad). Experimental results show performance degradation in the mismatched case (trained with one emotion and tested with another), and the average degradation in the SR task using source features is approximately 16% more than with system features. The performance degradation is nullified when the system is trained and tested with the same emotion.
A Study on Text-Independent Speaker Recognition Systems in Emotional Conditions Using Different Pattern Recognition Models
KNRK RAJU ALLURI,SIVANAND A,Rajendra Prasath,Suryakanth Gangashetty,Anil Kumar Vuppala
International Conference on Mining Intelligence and Knowledge Exploration, MIKE, 2016
@inproceedings{bib_A_St_2016, AUTHOR = {KNRK RAJU ALLURI, SIVANAND A, Rajendra Prasath, Suryakanth Gangashetty, Anil Kumar Vuppala}, TITLE = {A Study on Text-Independent Speaker Recognition Systems in Emotional Conditions Using Different Pattern Recognition Models}, BOOKTITLE = {International Conference on Mining Intelligence and Knowledge Exploration}. YEAR = {2016}}
The present study focuses on text-independent speaker recognition in emotional conditions. Both system and source features are considered to represent speaker-specific information. At the model level, Gaussian Mixture Models (GMMs), the Gaussian Mixture Model-Universal Background Model (GMM-UBM) and Deep Neural Networks (DNNs) are explored. The experiments are performed using three emotional databases: the German emotional speech database (EMO-DB), IITKGP-SESC: Hindi and IITKGP-SESC: Telugu. The emotions considered are neutral, anger, happy and sad. The results show that the performance of a speaker recognition system trained with clean speech degrades when tested with emotional data, irrespective of the feature or model used to build the system. The best results are obtained with score-level fusion of the system- and source-feature based systems when speakers are modeled with DNNs.
Application of prosody modification for speech recognition in different emotion conditions
VISHNU VIDYADHARA RAJU V,PAIDI GANGAMOHAN,Suryakanth Gangashetty,Anil Kumar Vuppala
IEEE Region 10 Conference, TENCON, 2016
@inproceedings{bib_Appl_2016, AUTHOR = {VISHNU VIDYADHARA RAJU V, PAIDI GANGAMOHAN, Suryakanth Gangashetty, Anil Kumar Vuppala}, TITLE = {Application of prosody modification for speech recognition in different emotion conditions}, BOOKTITLE = {IEEE Region 10 Conference}. YEAR = {2016}}
The main focus of this study is to analyze the performance of automatic speech recognition (ASR) in different emotional environments using prosody modification. Most ASR systems are trained on neutral speech, and their performance degrades when tested with emotional speech. In this paper, the various components of speech that contribute to emotion characteristics are studied. The prosody features of the source emotional utterances are modified according to the target neutral utterances using the Flexible Analysis Synthesis Tool (FAST), in which Dynamic Time Warping (DTW) is used to align the source emotional and target neutral utterances. Prosody components such as intonation, duration and excitation source are manipulated to incorporate the desired features into the source utterance. The modified source utterances are then used for testing the ASR system trained on neutral speech. Three emotions (compassion, happiness and anger) are considered for the analysis. Experimental results indicate an average improvement in speech recognition performance when prosody-modified speech is used.
Significance of automatic detection of vowel regions for automatic shout detection in continuous speech
Vinay Kumar Mittal,Anil Kumar Vuppala
International Symposium on Chinese Spoken Language Processing, ISCSLP, 2016
@inproceedings{bib_Sign_2016, AUTHOR = {Vinay Kumar Mittal, Anil Kumar Vuppala}, TITLE = {Significance of automatic detection of vowel regions for automatic shout detection in continuous speech}, BOOKTITLE = {International Symposium on Chinese Spoken Language Processing}. YEAR = {2016}}
Automatic detection of shout prosody in a continuous speech signal involves examining changes in its production characteristics. Our recent study of electroglottograph signals highlighted that significant changes occur in the glottal excitation source characteristics during the production of shouted speech, especially in vowel contexts. However, the differences between normal and shouted speech in production features derived over utterances or word segments may sometimes be masked by variations related to pauses or unvoiced regions. For a real-time system, these vowel regions also need to be found automatically. In this paper, changes in shout production features are examined in automatically detected vowel regions. Production of a vowel involves periodic impulse-like excitation and relatively high signal energy; hence, knowledge of epochs obtained using zero-frequency filtering, together with accurate vowel onset points, can be used for detecting these regions. Changes in two excitation source features, the instantaneous fundamental frequency and strength of excitation, and in a vocal tract filter feature, the dominant frequency, are examined for five steady vowel regions. Larger changes in these distinguishing features are observed in the automatically found vowel regions than in word segments. This approach can help improve systems for automatic detection of shout regions in continuous speech, and paralinguistic applications that involve detection of prosody or emotions.
Changes in shout features in automatically detected vowel regions
Vinay Kumar Mittal,Anil Kumar Vuppala
International Conference on Signal Processing and Communications, SPCOM, 2016
@inproceedings{bib_Chan_2016, AUTHOR = {Vinay Kumar Mittal, Anil Kumar Vuppala}, TITLE = {Changes in shout features in automatically detected vowel regions}, BOOKTITLE = {International Conference on Signal Processing and Communications}. YEAR = {2016}}
Shouted speech signals have been studied mostly over utterances or word segments. In production features derived over such segments, the differences between normal and shouted speech may sometimes be masked by variations due to pauses and unvoiced regions. Our recent study of electroglottograph signals has also highlighted the usefulness of examining changes in the glottal excitation source characteristics during the production of shouted speech. In this paper we examine changes in shout production features in vowel regions that are found automatically using the knowledge of accurate vowel onset points and epochs. Vowel regions are produced by periodic impulse-like excitation and contain relatively high signal energy. Hence the source features, the instantaneous fundamental frequency and strength of excitation, are examined along with a filter feature, the dominant frequency, for five different steady vowel regions. Changes in these features that distinguish normal from shouted speech are more prominent in the automatically found vowel regions than over entire word segments. This insight can help improve automatic shout detection and other paralinguistic applications where acoustic cues need to be found in vowel regions, e.g., detection of emotion categories.
Vowel-Based Non-uniform Prosody Modification for Emotion Conversion
VYDANA HARI KRISHNA,KADIRI SUDARSANA REDDY,Anil Kumar Vuppala
Circuits, Systems, and Signal Processing, CSSP, 2016
@inproceedings{bib_Vowe_2016, AUTHOR = {VYDANA HARI KRISHNA, KADIRI SUDARSANA REDDY, Anil Kumar Vuppala}, TITLE = {Vowel-Based Non-uniform Prosody Modification for Emotion Conversion}, BOOKTITLE = {Circuits, Systems, and Signal Processing}. YEAR = {2016}}
The objective of this work is to develop a rule-based emotion conversion method for better emotional perception. The performance of emotion conversion using the linear modification model is improved through vowel-based non-uniform prosody modification. In the present approach, features such as position and identity are integrated to address the non-uniformity in prosody generated by the emotional state of the speaker. We mainly concentrate on parameters such as the strength, duration and pitch contour of vowels at different parts of the sentence, and the influence of emotions on these parameters is exploited to convert speech from neutral to the target emotion. Non-uniform prosody modification factors for emotion conversion are based on the position of the vowel in the word and the position of the word in the sentence. The study is carried out using the Indian Institute of Technology Simulated Emotion speech corpus. The proposed algorithm is evaluated through a subjective listening test, from which it is observed that the proposed approach performs better than existing approaches.
A Language Model Based Approach Towards Large Scale and Light Weight Language Identification Systems
BRIJ MOHAN LAL SRIVASTAVA,VYDANA HARI KRISHNA,Anil Kumar Vuppala,Manish Srivastava
Technical Report, arXiv, 2015
@inproceedings{bib_A_LA_2015, AUTHOR = {BRIJ MOHAN LAL SRIVASTAVA, VYDANA HARI KRISHNA, Anil Kumar Vuppala, Manish Srivastava}, TITLE = {A Language Model Based Approach Towards Large Scale and Light Weight Language Identification Systems}, BOOKTITLE = {Technical Report}. YEAR = {2015}}
Multilingual spoken dialogue systems have gained prominence in the recent past, necessitating a front-end language identification (LID) system. Most existing LID systems rely on modeling language-discriminative information from low-level acoustic features. Due to the variabilities of speech (speaker and emotional variability, etc.), large-scale LID systems developed using low-level acoustic features suffer from performance degradation. In this approach, we model the higher-level language-discriminative phonotactic information for developing an LID system. The input speech signal is tokenized into phone sequences by a language-independent phone recognizer, and the language-discriminative phonotactic information in the obtained phone sequences is modeled using statistical and recurrent neural network based language modeling approaches. As this approach relies on higher-level phonotactic information, it is more robust to the variabilities of speech. The proposed approach is computationally lightweight, highly scalable, and can be used to complement existing LID systems.
Analysis of Constraints on Segmental DTW for the Task of Query-by-Example Spoken Term Detection
SRI HARSHA DUMPALA,KNRK RAJU ALLURI,Suryakanth Gangashetty,Anil Kumar Vuppala
India Council International Conference, INDICON, 2015
@inproceedings{bib_Anal_2015, AUTHOR = {SRI HARSHA DUMPALA, KNRK RAJU ALLURI, Suryakanth Gangashetty, Anil Kumar Vuppala}, TITLE = {Analysis of Constraints on Segmental DTW for the Task of Query-by-Example Spoken Term Detection}, BOOKTITLE = {India Council International Conference}. YEAR = {2015}}
Query-by-example spoken term detection (QbE-STD) refers to the task of determining the subsequence of a reference that matches a query, where both the query and the reference are in audio format. Dynamic time warping (DTW) based techniques are explored to match two sequences of different lengths in an unsupervised manner. In this paper, a completely unsupervised approach based on Segmental DTW (SDTW), a variant of DTW, is considered for QbE-STD, where both the reference and query utterances are represented as sequences of Gaussian posteriorgram vectors. SDTW with two different types of bands, the Sakoe-Chiba band and the Itakura parallelogram, is used to compare the Gaussian posteriorgrams of the query and the reference sequences. The effect of varying the local constraints of the DTW algorithm on the performance of SDTW is also analyzed. Results obtained on the MediaEval 2012 dataset indicate that SDTW using a band allowing variable speaking rate, as in the Itakura parallelogram, performs better than a band with fixed speaking rate, as in the Sakoe-Chiba band, across all variations in local constraints.
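An illustrative DTW-with-band matcher over posteriorgram frames, in the spirit of the comparison above; it is a simplified sketch using a Sakoe-Chiba-style band and does not reproduce the full segmental DTW segmentation. The frame distance is taken as the negative log of the posteriorgram inner product, a common choice in QbE-STD work.

```python
# Band-constrained DTW over Gaussian posteriorgrams (simplified sketch).
import numpy as np

def dtw_band_cost(query, ref, band=10):
    """query: (N, D), ref: (M, D) posteriorgrams; band: half-width of the band in frames."""
    N, M = len(query), len(ref)
    dist = -np.log(np.clip(query @ ref.T, 1e-10, None))   # frame-level distances
    D = np.full((N, M), np.inf)
    for i in range(N):
        lo = max(0, int(i * M / N) - band)
        hi = min(M, int(i * M / N) + band + 1)
        for j in range(lo, hi):
            if i == 0 and j == 0:
                best_prev = 0.0
            else:
                best_prev = min(
                    D[i - 1, j] if i > 0 else np.inf,
                    D[i, j - 1] if j > 0 else np.inf,
                    D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                )
            D[i, j] = dist[i, j] + best_prev
    return D[-1, -1] / (N + M)   # length-normalised alignment cost
```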
Detection of Emotionally Significant Regions of Speech For Emotion Recognition
VYDANA HARI KRISHNA,Peddakota Vikash,Tallam Vamsi,Kolla Pavan Kumar,Anil Kumar Vuppala
India Council International Conference, INDICON, 2015
@inproceedings{bib_Dete_2015, AUTHOR = {VYDANA HARI KRISHNA, Peddakota Vikash, Tallam Vamsi, Kolla Pavan Kumar, Anil Kumar Vuppala}, TITLE = {Detection of Emotionally Significant Regions of Speech For Emotion Recognition}, BOOKTITLE = {India Council International Conference}. YEAR = {2015}}
Emotions in human speech are short-lived: in an emotive utterance, the emotive gestures produced by the emotive state of the speaker persist only for a short duration. In this study, the regions of an utterance that are highly influenced by the emotive state of the speaker are detected and labeled as emotionally significant regions. Data from the detected emotionally significant regions is used for developing an emotion recognition system. Physiological constraints of the human speech production system are exploited for detecting the emotionally significant regions of an utterance, and spectral features from these regions are used to develop the emotion recognition system. A significant improvement in the performance of the emotion recognition system is observed with the present approach, with an average improvement of 11% owing to the use of data from emotionally significant regions. Gaussian mixture modelling (GMM) is employed to develop the emotion recognition system. Speech samples from the Berlin emotion speech database (EMO-DB) are used, and four basic emotions (anger, happy, neutral and fear) are considered for the study.
Improved Emotion Recognition using GMM-UBMs
VYDANA HARI KRISHNA,P. Phani Kumar,K. Sri Rama Krishna,Anil Kumar Vuppala
International Conference on Signal Processing and Communication Engineering Systems, SPACES, 2015
@inproceedings{bib_Impr_2015, AUTHOR = {VYDANA HARI KRISHNA, P. Phani Kumar , K. Sri Rama Krishna, Anil Kumar Vuppala}, TITLE = {Improved Emotion Recognition using GMM-UBMs}, BOOKTITLE = {International Conference on Signal Processing and Communication Engineering Systems}. YEAR = {2015}}
In the recent past, much scientific attention has been paid to recognizing the emotional state of a speaker from speech. Emotion recognition is a challenging task since human emotions are complex and subtle, and the emotive state in human speech does not persist for long, so it is important to study the presence of emotion-identifiable information in smaller segments of speech. This study examines the presence of emotion-specific information with respect to the position of the word in the utterance. Spectral features are employed to represent emotion-specific information, and they are extracted from smaller speech segments based on their position in the utterance. Due to the lack of adequate data in small speech segments to support conventional GMMs, Gaussian mixture modelling with a universal background model (GMM-UBM) is used for developing the emotion recognition system. Speech data from IITKGP-SESC is used, and four emotions (anger, fear, happy and neutral) are considered.
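A compact sketch of GMM-UBM modelling with mean-only MAP adaptation, assuming scikit-learn's GaussianMixture as a convenient UBM trainer; the relevance factor, mixture count and feature dimensionality are illustrative assumptions rather than the paper's settings.

```python
# UBM trained on pooled data; per-class models obtained by MAP-adapting the means.
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, class_feats, relevance=16.0):
    """ubm: fitted GaussianMixture; class_feats: (N, D) features of one class."""
    post = ubm.predict_proba(class_feats)            # (N, K) responsibilities
    n_k = post.sum(axis=0)                           # soft counts per component
    first = post.T @ class_feats                     # (K, D) first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * (first / np.maximum(n_k[:, None], 1e-8)) + (1.0 - alpha) * ubm.means_

# Usage: UBM from pooled data, then one adapted mean set per emotion class.
pooled = np.random.randn(5000, 13)                   # placeholder MFCC features
ubm = GaussianMixture(n_components=64, covariance_type="diag").fit(pooled)
anger_means = map_adapt_means(ubm, np.random.randn(300, 13))
```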
Significance of GMM-UBM based Modelling for Indian Language Identification
VUDDAGIRI RAVI KUMAR,VYDANA HARI KRISHNA,Anil Kumar Vuppala
Procedia Computer Science, PCS, 2015
@inproceedings{bib_Sign_2015, AUTHOR = {VUDDAGIRI RAVI KUMAR, VYDANA HARI KRISHNA, Anil Kumar Vuppala}, TITLE = {Significance of GMM-UBM based Modelling for Indian Language Identification}, BOOKTITLE = {Procedia Computer Science}. YEAR = {2015}}
Most of the Indian languages originated from Devanagari, the script of the Sanskrit language. In spite of the similarity in phoneme sets, every language has its own influence on the phonotactic constraints of speech in that language. A modelling technique capable of capturing the slightest variations imparted by the language is a prerequisite for developing a language identification (LID) system. Using the Gaussian mixture modelling technique with a large number of mixture components demands a large amount of training data for each language class, which is hard to collect and handle. In this work, the phonotactic variations imparted by different languages are modelled using Gaussian mixture modelling with a universal background model (GMM-UBM). In GMM-UBM based modelling, a certain amount of data from all the language classes is pooled to develop a universal background model (UBM), and the model is then adapted to each class. Spectral features (MFCC) are employed to represent the language-specific phonotactic information of speech in different languages. LID systems are developed using speech samples from IITKGP-MLILSC, and the performance of the proposed GMM-UBM based LID system is compared with a conventional GMM based LID system. An average improvement of 7–8% is observed due to the use of UBM-based modelling for developing the LID system.
Significance of Speech Enhancement and Sonorant Regions of Speech for Robust Language Identification
Anil Kumar Vuppala,K V MOUNIKA,VYDANA HARI KRISHNA
International Conference on Signal Processing, Informatics, Communication and Energy Systems, SPICES, 2015
@inproceedings{bib_Sign_2015, AUTHOR = {Anil Kumar Vuppala, K V MOUNIKA, VYDANA HARI KRISHNA}, TITLE = {Significance of Speech Enhancement and Sonorant Regions of Speech for Robust Language Identification}, BOOKTITLE = {International Conference on Signal Processing, Informatics, Communication and Energy Systems}. YEAR = {2015}}
A high degree of robustness is a prerequisite for operating speech and language processing systems in practical environments, since their performance is highly influenced by varying and mixed background environments. In this paper, we put forward a robust method for automatic language identification in various background environments. A combined temporal and spectral processing method is used as a preprocessing technique for enhancing the degraded speech, and language-discriminative information in high-sonority regions of speech is used for the task of language identification. Sonority regions are regions of speech with high signal energy, and they are less influenced by the background environment. The spectral energy of formants in the glottal closure regions is employed as an acoustic correlate for detecting the sonority regions of speech. The performance of the LID system is studied in various background environments: clean room, car factory, high-frequency, pink and white noise. The Indian Institute of Technology Kharagpur Multilingual Indian Language Speech Corpus (IITKGP-MLILSC) is used for building the language identification system, and noisy speech samples from the NOISEX database are employed. The performance of the proposed method is quite satisfactory compared to existing approaches.
Neutral to Anger Speech Conversion Using Non-Uniform Duration Modification
Anil Kumar Vuppala,KADIRI SUDARSANA REDDY
International Conference on Industrial and Information Systems, ICIIS, 2014
@inproceedings{bib_Neut_2014, AUTHOR = {Anil Kumar Vuppala, KADIRI SUDARSANA REDDY}, TITLE = {Neutral to Anger Speech Conversion Using Non-Uniform Duration Modification}, BOOKTITLE = {International Conference on Industrial and Information Systems}. YEAR = {2014}}
In this paper, non-uniform duration modification is exploited along with other prosody features for neutral-to-anger speech conversion. The non-uniform duration modification method modifies the durations of vowel and pause segments by different modification factors: vowel segments are modified by factors based on their identities, pause segments by uniform factors, and consonant and transition segments are left unmodified. These modification factors are derived from the analysis of neutral and anger speech. For this purpose, a well-known Indian database, the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC), is chosen for the analysis of emotions and the synthesis of emotions from neutral speech. The prosodic features used for emotion conversion are the pitch contour, intensity contour and duration contour. Subjective listening test results show that emotion is perceived more effectively with non-uniform duration modification than with uniform duration modification.
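A rough sketch of non-uniform duration modification, assuming segment boundaries and labels are available: vowel and pause segments are time-stretched by segment-specific factors while other segments are left unchanged. The librosa-based stretching is a stand-in for the epoch-based modification used in the paper, and the factor table is purely illustrative.

```python
# Non-uniform duration modification: per-segment time stretching and re-concatenation.
import numpy as np
import librosa

FACTORS = {"vowel": 0.8, "pause": 0.6, "other": 1.0}   # assumed duration factors

def nonuniform_duration_modify(y, sr, segments):
    """segments: list of (start_sec, end_sec, label) with label in FACTORS."""
    out = []
    for start, end, label in segments:
        seg = y[int(start * sr):int(end * sr)]
        factor = FACTORS.get(label, 1.0)
        if factor != 1.0:
            # rate = 1/factor makes the new duration equal to factor * original duration
            seg = librosa.effects.time_stretch(seg, rate=1.0 / factor)
        out.append(seg)
    return np.concatenate(out)
```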
Automatic detection of breathy voiced vowels in Gujarati speech
Anil Kumar Vuppala,Bhaskara Rao Peri
International Journal of Speech Technology, IJST, 2014
@inproceedings{bib_Auto_2014, AUTHOR = {Anil Kumar Vuppala, Bhaskara Rao Peri}, TITLE = {Automatic detection of breathy voiced vowels in Gujarati speech}, BOOKTITLE = {International Journal of Speech Technology}. YEAR = {2014}}
This paper proposes a method for the automatic detection of breathy voiced vowels in continuous Gujarati speech. As breathy voice is a phonetic feature predominantly present in Gujarati among Indian languages, it can be used for identifying the Gujarati language. The objective is to differentiate breathy voiced vowels from modal voiced vowels based on a loudness measure, which represents excitation source characteristics and is used for differentiating voice quality. In the proposed method, vowel regions in continuous speech are first determined using the knowledge of vowel onset points and epochs; the hypothesized vowel segments are then classified using the loudness measure. The performance of the proposed method is evaluated on Gujarati speech utterances containing around 47 breathy and 192 modal vowels spoken by 5 male and 5 female speakers. Classification of vowels into breathy or modal voice is achieved with an accuracy of around 94%.
Vowel Onset Point Detection for Noisy Speech using Spectral Energy at Formant Frequencies
Anil Kumar Vuppala,K. Sreenivasa Rao
International Journal of Speech Technology, IJST, 2014
@inproceedings{bib_Vowe_2014, AUTHOR = {Anil Kumar Vuppala, K. Sreenivasa Rao}, TITLE = {Vowel Onset Point Detection for Noisy Speech using Spectral Energy at Formant Frequencies}, BOOKTITLE = {International Journal of Speech Technology}. YEAR = {2014}}
Most of the existing vowel onset point (VOP) detection methods are developed for clean speech. In this paper, we propose methods for detecting VOPs in the presence of speech coding and background noise. The VOP detection method for coded speech is based on the spectral energy in the 500–2,500 Hz frequency band of the speech segments present in the glottal closure region. For noisy speech, the proposed VOP detection method exploits the spectral energy at the formant locations of the speech segments present in the glottal closure region. The proposed VOP detection methods are evaluated using objective measures and consonant-vowel (CV) unit recognition experiments.
Neural network based feature transformation for emotion independent speaker identification
Sreenivasa Rao Krothapalli,Jaynath Yadav,Sourjya Sarkar,Shashidhar G Koolagudi,Anil Kumar Vuppala
International Journal of Speech Technology, IJST, 2012
@inproceedings{bib_Neur_2012, AUTHOR = {Sreenivasa Rao Krothapalli, Jaynath Yadav, Sourjya Sarkar, Shashidhar G Koolagudi, Anil Kumar Vuppala}, TITLE = {Neural network based feature transformation for emotion independent speaker identification}, BOOKTITLE = {International Journal of Speech Technology}. YEAR = {2012}}
In this paper we propose a neural network based feature transformation framework for developing an emotion-independent speaker identification system. Most present speaker recognition systems may not perform well in emotional environments, yet in real life humans extensively express emotions during conversation to convey their messages effectively. Therefore, we propose a speaker recognition system robust to variations in the emotional moods of speakers. Neural network models are explored to transform the speaker-specific spectral features from any specific emotion to neutral. Eight emotions are considered: anger, sad, disgust, fear, happy, neutral, sarcastic and surprise. Emotional databases developed in Hindi, Telugu and German are used to analyze the effect of the proposed feature transformation on the performance of the speaker identification system. Spectral features are represented by mel-frequency cepstral coefficients, and speaker models are developed using Gaussian mixture models. The performance of the speaker identification system is analyzed with various feature mapping techniques. Results demonstrate that the proposed neural network based feature transformation improves speaker identification performance by 20%. Feature transformation at the syllable level shows better performance than at the sentence level.
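A hedged sketch of the feature-mapping idea: a small feed-forward network is trained to map MFCC vectors from emotional speech to time-aligned neutral-speech MFCCs. The network size, activation and training loop are assumptions for illustration, not the paper's exact configuration.

```python
# Feed-forward regression from emotional MFCC vectors to neutral MFCC vectors.
import torch
import torch.nn as nn

class EmotionToNeutralMapper(nn.Module):
    def __init__(self, dim=13, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)

def train_mapper(emotional, neutral, epochs=50, lr=1e-3):
    """emotional, neutral: aligned (N, dim) MFCC tensors (pairing assumed given)."""
    model = EmotionToNeutralMapper(dim=emotional.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(emotional), neutral)
        loss.backward()
        opt.step()
    return model
```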