IIITH

Detecting Multiple Disfluencies from Speech using Pre-linguistic Automatic Syllabification with Acoustic and Prosody Features

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), APSIPA, 2021

Core Rank : - Google Rank :-


@inproceedings{bib_Dete_2021, AUTHOR = {Mehrotra, Utkarsh and SPARSH, and KRISHNA, GURUGUBELLI and Vuppala, Anil Kumar}, TITLE = {Detecting Multiple Disfluencies from Speech using Pre-linguistic Automatic Syllabification with Acoustic and Prosody Features}, BOOKTITLE = {Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}, YEAR = {2021}}


Abstract

In this paper, a new method to detect disfluencies directly from speech is explored. The method uses pre-linguistic automatic syllabification, the process of segmenting input speech signals into perceptually distinct syllable-like regions, to develop syllable-level disfluency detection systems. Statistical prosody features related to fundamental frequency, energy and duration are extracted from each syllable-like region and used to train a DNN classifier for automatic detection of speech disfluencies. Further, complementary information useful for disfluency detection is added to the pipeline through acoustic features. A BiLSTM feature extractor is used to obtain a complex acoustic representation from the baseline MFCC features for each syllable-like region. This acoustic representation is concatenated with the prosody features and used in the proposed system for detecting multiple speech disfluencies. Experiments are conducted for four types of disfluencies in the UCLASS and the IIITH-IED datasets to test the proposed disfluency detection system. Overall, the proposed system achieves a detection accuracy of 88.75% on the UCLASS dataset and 91.24% on the IIITH-IED dataset, showing the effectiveness of considering perceptually distinct syllable-like regions as representational units for detecting disfluencies.

IE-CPS Lexicon: An Automatic Speech Recognition Oriented Indian English Pronunciation Dictionary

International Conference on Natural Language Processing, ICON, 2021

Core Rank : - Google Rank :5


@inproceedings{bib_IE-C_2021, AUTHOR = {Jain, Shelly and Yadavalli, Aditya and Ganesh, Mirishkar Sai and Yarra, Chiranjeevi and Vuppala, Anil Kumar}, TITLE = {IE-CPS Lexicon: An Automatic Speech Recognition Oriented Indian English Pronunciation Dictionary}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2021}}


Abstract

Indian English (IE), on the surface, seems quite similar to standard English. However, closer observation shows that it has been influenced by the surrounding vernacular languages at several levels, from phonology to vocabulary and syntax. Because of this, automatic speech recognition (ASR) systems developed for American or British varieties of English perform poorly on Indian English data. The most prominent feature of Indian English is the characteristic pronunciation of its speakers. The systems are unable to learn these acoustic variations while modelling and cannot parse the non-standard articulation of non-native speakers. To address this, we propose a new phone dictionary developed based on the Indian language Common Phone Set (CPS). The dictionary maps the phone set of American English to existing Indian phones based on perceptual similarity. This dictionary is named Indian English Common Phone Set (IE-CPS). Using this, we build an Indian English ASR system and compare its performance with an American English ASR system on speech data of both varieties of English. Our experiments on the IE-CPS show that it is quite effective at modelling the pronunciation of the average speaker of Indian English. ASR systems trained on Indian English data perform much better when modelled using IE-CPS, achieving a reduction in the word error rate (WER) of up to 3.95% when used in place of CMUdict. This shows the need for a different lexicon for Indian English.
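The abstract above reports its results as word error rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), the function name `word_error_rate` and the example sentences below are illustrative assumptions and are not taken from the paper:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper for illustration; whitespace tokenisation is assumed.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lexicon comparison of the kind described would score the same test utterances under each system and compare the resulting WER values.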

ACOUSTIC FEATURES, BERT Model AND THEIR COMPLEMENTARY NATURE FOR ALZHEIMER’S DEMENTIA DETECTION

International Conference on Contemporary Computing, IC3, 2021

Core Rank : - Google Rank :20


@inproceedings{bib_ACOU_2021, AUTHOR = {Vats, Nayan Anand and Yadavalli, Aditya and KRISHNA, GURUGUBELLI and Vuppala, Anil Kumar}, TITLE = {ACOUSTIC FEATURES, BERT Model AND THEIR COMPLEMENTARY NATURE FOR ALZHEIMER’S DEMENTIA DETECTION}, BOOKTITLE = {International Conference on Contemporary Computing}, YEAR = {2021}}


Abstract

Dementia is a syndrome, chronic or progressive, that usually affects the cognitive functioning of the subjects. Alzheimer’s, a neurodegenerative disorder, is the leading cause of dementia. One of the many symptoms of Alzheimer’s Dementia is the inability to speak and understand language clearly. The last decade has seen a surge in research on Alzheimer’s Dementia detection using linguistic and acoustic features. This paper takes up the Alzheimer’s Dementia classification task of the ADReSS INTERSPEECH-2020 challenge, "Alzheimer’s Dementia Recognition through Spontaneous Speech: The ADReSS Challenge". It uses eight different acoustic features to find the attributes of the human speech production system (vocal tract and excitation source) affected by Alzheimer’s Dementia. In this study, Alzheimer’s Dementia classification is performed with five different machine learning models on the ADReSS INTERSPEECH-2020 challenge dataset. Since most previous studies have used linguistic features successfully for Alzheimer’s Dementia classification, the current study also demonstrates the performance of the BERT model on the dementia classification task. The maximum accuracy obtained with the acoustic features is 64.5%, and the BERT model provides a classification accuracy of 79.1% on the test dataset. Finally, the score-level fusion of the acoustic model with the BERT model shows an improvement of 6.1% in classification accuracy over the BERT model, which indicates the complementary nature of acoustic features to linguistic features.
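The abstract above mentions score-level fusion of an acoustic model with a BERT model but does not state the fusion rule. The sketch below assumes one common choice, a weighted average of per-sample class probabilities; the names `fuse_scores` and `predict`, the weight, the threshold, and the example scores are all hypothetical and do not come from the paper:

```python
import numpy as np


def fuse_scores(acoustic_scores, text_scores, weight=0.5):
    """Weighted average of two models' positive-class probabilities.

    Assumed fusion rule for illustration; `weight` is applied to the
    acoustic model and (1 - weight) to the text model.
    """
    a = np.asarray(acoustic_scores, dtype=float)
    b = np.asarray(text_scores, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both models must score the same samples")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * a + (1.0 - weight) * b


def predict(scores, threshold=0.5):
    """Turn fused probabilities into binary labels (1 = positive class)."""
    return (np.asarray(scores, dtype=float) >= threshold).astype(int)


if __name__ == "__main__":
    acoustic = [0.9, 0.2, 0.6]   # made-up acoustic-model probabilities
    text = [0.4, 0.4, 0.7]       # made-up text-model probabilities
    fused = fuse_scores(acoustic, text, weight=0.5)
    print(fused, predict(fused))
```

In practice the weight would be tuned on held-out data, which is one way complementary models can outperform either model alone.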

Towards a Database For Detection of Multiple Speech Disfluencies in Indian English

National Conference on Communications, NCC, 2021

Core Rank : - Google Rank :16


@inproceedings{bib_Towa_2021, AUTHOR = {SPARSH, and Mehrotra, Utkarsh and KRISHNA, GURUGUBELLI and Vuppala, Anil Kumar}, TITLE = {Towards a Database For Detection of Multiple Speech Disfluencies in Indian English}, BOOKTITLE = {National Conference on Communications}, YEAR = {2021}}


Abstract

The detection and removal of disfluencies from speech is an important task since the presence of disfluencies can adversely affect the performance of speech-based applications such as Automatic Speech Recognition (ASR) systems and speech-to-speech translation systems. From the perspective of Indian languages, there is a lack of studies pertaining to speech disfluencies, their types and frequency of occurrence. Also, the resources available to perform such studies in an Indian context are limited. Through this paper, we attempt to address this issue by introducing the IIITH-Indian English Disfluency (IIITH-IED) Dataset. This dataset consists of 10 hours of lecture-mode speech in Indian English. Five types of disfluencies (filled pause, prolongation, word repetition, part-word repetition and phrase repetition) were identified in the speech signal and annotated in the corresponding transcription to prepare this dataset. The IIITH-IED dataset was then used to develop frame-level automatic disfluency detection systems. Two sets of features were extracted from the speech signal and then used to train classifiers for the task of disfluency detection. Amongst all the systems employed, Random Forest with MFCC features resulted in the highest average accuracy of 89.61% and F1-score of 0.89.

Reed: An Approach Towards Quickly Bootstrapping Multilingual Acoustic Models

IEEE Spoken Language Technology Workshop, SLT-W, 2021

Core Rank : - Google Rank :35


@inproceedings{bib_Reed_2021, AUTHOR = {Sen, Bipasha and Agarwal, Aditya and Ganesh, Mirishkar Sai and Vuppala, Anil Kumar}, TITLE = {Reed: An Approach Towards Quickly Bootstrapping Multilingual Acoustic Models}, BOOKTITLE = {IEEE Spoken Language Technology Workshop}, YEAR = {2021}}


Abstract

A multilingual automatic speech recognition (ASR) system is a single entity capable of transcribing multiple languages sharing a common phone space. The performance of such a system is highly dependent on the compatibility of the languages. State-of-the-art speech recognition systems are built using sequential architectures based on recurrent neural networks (RNNs), which limits computational parallelization in training. This poses a significant challenge in terms of the time taken to bootstrap and validate the compatibility of multiple languages for building a robust multilingual system. Complex architectural choices based on self-attention networks are made to improve parallelization, thereby reducing the training time. In this work, we propose Reed, a simple system based on 1D convolutions which uses very short context to improve the training time. To improve the performance of our system, we use raw time-domain speech signals directly as input. This enables the convolutional layers to learn feature representations rather than relying on handcrafted features such as MFCC. We report improvements in training and inference times by factors of at least 4× and 7.4×, respectively, with comparable WERs against standard RNN-based baseline systems on SpeechOcean's multilingual low-resource dataset.

Comparative Study of Different Epoch Extraction Methods for Speech Associated with Voice Disorders

International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2021

Core Rank : B Google Rank :129


@inproceedings{bib_Comp_2021, AUTHOR = {Barche, Purva and KRISHNA, GURUGUBELLI and Vuppala, Anil Kumar}, TITLE = {Comparative Study of Different Epoch Extraction Methods for Speech Associated with Voice Disorders}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}, YEAR = {2021}}


Abstract

Accurate detection of epoch locations is important in extracting features from the speech signal for automatic detection and assessment of voice disorders. Therefore, this study aimed to compare various algorithms for detecting epoch locations from speech associated with voice disorders. In this regard, nine state-of-the-art epoch extraction algorithms were considered, and their performance for different categories of voice disorders was evaluated on the SVD dataset. Experimental results indicate that most of the epoch extraction methods performed better for healthy speech, while their performance degraded for speech associated with voice disorders. Furthermore, the performance of epoch extraction methods was lower for the speech of structural and neurogenic disorders than for the speech of psychogenic and functional disorders. Among the different epoch extraction algorithms, zero phase-zero frequency filtering showed the best performance in terms of identification rate (90.37%) and identification accuracy (0.34 ms) for speech associated with voice disorders.

Towards Automatic Assessment of Voice Disorders: A Clinical Approach

Annual Conference of the International Speech Communication Association, INTERSPEECH, 2020

Core Rank : A Google Rank :111


@inproceedings{bib_Towa_2020, AUTHOR = {Barche, Purva and KRISHNA, GURUGUBELLI and Vuppala, Anil Kumar}, TITLE = {Towards Automatic Assessment of Voice Disorders: A Clinical Approach}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}, YEAR = {2020}}


Abstract

Automatic detection and assessment of voice disorders is important in the diagnosis and treatment planning of voice disorders. This work proposes an approach for automatic detection and assessment of voice disorders from a clinical perspective. To accomplish this, a multi-level classification approach was explored in which four binary classifiers were used for the assessment of voice disorders. The binary classifiers were trained using support vector machines with excitation source features, vocal-tract system features, and state-of-the-art OpenSMILE features. In this study, source features, namely glottal parameters obtained from the glottal flow waveform, perturbation measures obtained from epoch locations, and cepstral features obtained from the linear prediction residual and the zero frequency filtered signal, were explored. The present study used the Saarbruecken voice disorders database to evaluate the performance of the proposed approach. The OpenSMILE ComParE and eGeMAPS feature sets showed better performance, with classification accuracies of 82.8% and 76%, respectively, for voice disorder detection. The combination of excitation source features with the baseline feature sets further improved the performance of the detection and assessment systems, which highlights the complementary nature of excitation source features.

Toward Improving the Performance of Epoch Extraction from Telephonic Speech

Circuits, Systems, and Signal Processing, CSSP, 2020

Core Rank : - Google Rank :42


@article{bib_Towa_2020, AUTHOR = {KRISHNA, GURUGUBELLI and Javid, Mohammad Hashim and ALLURI, KNRK RAJU and Vuppala, Anil Kumar}, TITLE = {Toward Improving the Performance of Epoch Extraction from Telephonic Speech}, JOURNAL = {Circuits, Systems, and Signal Processing}, YEAR = {2020}}


Abstract

An epoch is an abrupt closure event within a glottal cycle at which significant excitation of the vocal-tract system happens during the production of voiced speech. The state-of-the-art zero frequency filtering technique is a simple and efficient method that shows robustness in extracting epochs from clean speech. However, this method has shown poor performance for telephonic quality speech, due to the presence of spurious zero crossings in the epoch evidence, which leads to a high false alarm rate. Recently, the zero-phase zero frequency resonator (ZP-ZFR), an alternative to the zero frequency filter, was proposed for stable implementation of the zero frequency filtering technique. In this study, a higher-order ZP-ZFR is investigated to improve the performance of zero frequency filtering for epoch extraction from telephonic speech. The performance of the proposed ZP-ZFR method is quantitatively evaluated on telephonic speech simulated using six standard databases having simultaneous electroglottograph recordings as ground truth. Experimental results suggest that the performance of the proposed method is significantly better than the state-of-the-art methods in terms of identification rate and false alarm rate.

Detection of Fricative Landmarks Using Spectral Weighting: A Temporal Approach

Circuits, Systems, and Signal Processing, CSSP, 2020

Core Rank : - Google Rank :42


@article{bib_Dete_2020, AUTHOR = {KRISHNA, VYDANA HARI and Vuppala, Anil Kumar}, TITLE = {Detection of Fricative Landmarks Using Spectral Weighting: A Temporal Approach}, JOURNAL = {Circuits, Systems, and Signal Processing}, YEAR = {2020}}


Abstract

Fricatives are characterized by two prime acoustic properties: a high-frequency spectral concentration and a noisy nature. Spectral domain approaches for detecting fricatives employ a time–frequency representation to compute acoustic cues such as band energy ratio, spectral centroid, and dominant resonant frequency. The detection accuracy of these approaches depends on the efficiency of the employed time–frequency representation. An approach that does not require any time–frequency representation for detecting fricatives from speech has been explored in this work. In this study, a time-domain operation is proposed which implicitly emphasizes the high-frequency spectral characteristics of fricatives. The proposed approach aims to scale the spectrum of the speech signal using a scaling function k^2, where k is the discrete frequency. The spectral weighting function used in the proposed approach can be approximated as a cascaded temporal difference operation over the speech signal. The emphasized regions in the spectrally weighted speech signal are quantified to detect fricative regions. In contrast to the spectral domain approaches, the predictability measure-based approach in the literature relies on capturing the noisy nature of fricatives. The proposed approach and the predictability measure-based approach rely on two complementary properties for detecting fricatives, and a combination of these approaches is put forth in this work. The proposed approach has performed better than the state-of-the-art fricative detectors. To study the significance of the proposed evidence, an early fusion between the proposed evidence and the feature-space maximum log-likelihood transform features is explored for developing speech recognition systems.
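The abstract above states that a frequency-squared spectral weighting can be approximated by a cascaded temporal difference. A minimal numerical sketch of that relationship follows: applying the first difference twice has magnitude response 4 sin^2(w/2), which is close to w^2 at low digital frequencies w and boosts high frequencies strongly. The function names, the 16 kHz sampling rate, and the test tones are illustrative assumptions and are not taken from the paper:

```python
import numpy as np


def cascaded_difference(x, order=2):
    """Apply the first-order temporal difference `order` times."""
    y = np.asarray(x, dtype=float)
    for _ in range(order):
        y = np.diff(y)  # y[n] = x[n+1] - x[n]
    return y


def rms_gain(freq_hz, fs=16000, n_samples=4096):
    """RMS gain of the second-order difference for a pure tone.

    Hypothetical helper: fs and n_samples are arbitrary illustrative values.
    """
    t = np.arange(n_samples) / fs
    x = np.sin(2.0 * np.pi * freq_hz * t)
    y = cascaded_difference(x, order=2)
    return np.sqrt(np.mean(y ** 2)) / np.sqrt(np.mean(x ** 2))


if __name__ == "__main__":
    fs = 16000
    for f in (500, 4000):
        w = 2.0 * np.pi * f / fs              # digital frequency in radians
        theory = 4.0 * np.sin(w / 2.0) ** 2   # exact second-difference gain
        print(f, rms_gain(f, fs), theory, w ** 2)
```

The high-frequency tone is amplified far more than the low-frequency one, which is the emphasis a fricative detector of this kind would exploit; how the emphasized regions are then quantified is not specified in the abstract and is not sketched here.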

Duration of the rhotic approximant /ɹ/ in spastic dysarthria of different severity levels

Speech Communication, SpComm, 2020

Core Rank : - Google Rank :33


@article{bib_Dura_2020, AUTHOR = {KRISHNA, GURUGUBELLI and Vuppala, Anil Kumar and Narendra, NP and Alku, Paavo}, TITLE = {Duration of the rhotic approximant /ɹ/ in spastic dysarthria of different severity levels}, JOURNAL = {Speech Communication}, YEAR = {2020}}


Abstract

Dysarthria is a motor speech disorder leading to imprecise articulation of speech. Acoustic analysis capable of detecting and assessing articulation errors is useful in dysarthria diagnosis and therapy. Since speakers with dysarthria experience difficulty in producing rhotics due to complex articulatory gestures of these sounds, the hypothesis of the present study is that duration of the rhotic approximant /ɹ/ distinguishes dysarthric speech of different severity levels. Duration measurements were conducted using the third formant (F3) trajectories estimated from quasi-closed-phase (QCP) spectrograms. Results indicate that the severity level of spastic dysarthria has a significant effect on duration of /ɹ/. In addition, the phonetic context has a significant effect on duration of /ɹ/, the ɪ-r-ɛ context showing the largest difference in /ɹ/ duration between dysarthric speech of the highest severity levels and healthy speech. The results of this preliminary study can be used in the future to develop signal processing and machine learning methods to automatically predict the severity level of spastic dysarthria from speech signals.