Abstract
Speech prominence refers to the relative emphasis placed on certain units in speech to express
communicative intent. Prominence is realized at two levels. The first is the syllable level,
where one syllable within a word is made more prominent than the rest; this is referred to as lexical or syllable-level prominence. Lexical prominence is generally produced correctly by native speakers and is commonly incorporated into speech synthesis through pronunciation lexicons, but non-native speakers frequently misplace it due to the influence of their native language, motivating the need for automatic
lexical stress detection. The second is the word level, where certain words within an utterance are emphasized more than others; this is commonly referred to as sentence or word-level prominence. Unlike lexical
prominence, automatic detection and natural generation of sentence prominence remain challenging
problems for both native and non-native speech. Despite their importance, existing prominence modeling approaches continue to face several limitations. First, most methods rely on handcrafted prosodic
features, such as pitch, energy, and duration, which may not fully capture prominence-related information. Second, prominence modeling at both syllable and word levels typically depends on accurate
boundary annotations, obtained either through costly manual labeling or error-prone forced alignment.
Third, training these systems requires expensive prominence labels, limiting scalability to low-resource
language backgrounds.
In this thesis, these limitations are addressed progressively, moving from supervised to unsupervised learning and from boundary-dependent to boundary-independent modeling across both linguistic
levels. At the syllable level, it is first demonstrated that self-supervised speech representations provide a substantially stronger foundation for prominence modeling than handcrafted features, with sequential modeling further improving performance by capturing inter-syllable dependencies. The dependence on boundary annotations is then removed through hierarchical temporal compression, introducing
boundary-independent frameworks that match and often exceed boundary-dependent counterparts. The
reliance on prominence labels is further addressed through Post-Net and Post-Net2.0, which leverage
pseudo-label generation and linguistic constraints to enable effective unsupervised prominence detection. A fully unsupervised and boundary-independent framework, SRAMA, is proposed, combining
Adaptive Local Monotonic Attention (AMA) with iterative deep clustering, approaching supervised
performance without requiring boundaries, transcriptions, or prominence labels. Finally, the scope is
extended to the word level through ProTTS, which integrates prominence discovery directly into a state-of-the-art
text-to-speech (TTS) system, FastSpeech 2, using word-level prominence inferred from prosodic predictions to guide
synthesis. This produces more natural and expressive speech and bridges the two central applications of
this thesis: automatic speech prominence detection and expressive speech synthesis.