Abstract
Direct acoustic feature-based speaking rate estimation is useful in applications including pronunciation assessment, dysarthria detection and automatic speech recognition. Most of the existing works on speaking rate estimation have steps which are heuristically designed. In contrast to the existing works, in this work a data-driven approach with convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) network is proposed to jointly optimize all steps in speaking rate estimation through a single framework. Also, unlike existing deep learning-based methods for speaking rate estimation, the proposed approach estimates the speaking rate for an entire speech utterance in one go instead of considering segments of a fixed duration. We consider the traditional 19 sub-band energy (SBE) contours as the low-level features as the input of the proposed CNN-BLSTM network. The state-of-the-art direct acoustic feature-based speaking rate estimation techniques are developed based on 19 SBEs as well. Experiments are performed separately using three native English speech corpora (Switchboard, TIMIT and CTIMIT) and a non-native English speech corpus (ISLE). Among these, TIMIT and Switchboard are used for training the network. However, testing is carried out on all the four corpora as well as TIMIT and Switchboard with additive noise, namely white, car, high-frequency-channel, cockpit, and babble at 20, 10 and 0 dB signal-to-noise ratios. The proposed CNN-BLSTM approach outperforms the best of the existing techniques in clean as well as noisy conditions for all four corpora.