Abstract
Background
Music Emotion Recognition (MER) systems primarily rely on audio features (Wang et al., 2021), with recent approaches incorporating lyrics to analyze sentiment and structure for improved accuracy (Agrawal et al., 2021). Musical content, particularly melody, has been found to convey perceived emotional expression more strongly than lyrics, although lyrics tend to enhance negative emotions more readily than positive ones (Ali & Peynircioğlu, 2006). These differential contributions raise important questions: Which component carries the greatest emotional information? Do listeners rely on music, vocals, or lyrics when assessing emotional content?
Aim
Here, we systematically analyze the contributions of music, vocals, and lyrics to the perception of emotion in music.
Methods
For this study, we used the DEAM (Soleymani et al., 2013) and PMEmo (Zhang et al., 2018) datasets, comprising 2,596 songs in total. The DEAM dataset includes 1,802 audio items annotated for both dynamic and static emotion using continuous valence and arousal (VA) ratings. Participants used a two-dimensional interface based on the Self-Assessment Manikin (SAM) to rate the emotional content continuously at a 2-Hz sampling rate. The PMEmo dataset contains 794 popular-music choruses sourced from international music charts (Billboard Hot 100, iTunes Top 100, and UK Top 40) and was annotated with a similar interface to capture continuous dynamic VA ratings at the same sampling rate.
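For illustration, the minimal Python sketch below shows how 2-Hz dynamic VA ratings of this kind could be collapsed into static per-song values by averaging within each song. The column names and data layout are assumptions for illustration, not the actual DEAM/PMEmo file formats.

# Minimal sketch: collapsing 2-Hz dynamic valence-arousal (VA) annotations into
# static per-song values. Column names ("song_id", "valence", "arousal") are
# illustrative, not the actual DEAM/PMEmo field names.
import pandas as pd

def static_va(dynamic_annotations: pd.DataFrame) -> pd.DataFrame:
    """Average the 2-Hz dynamic VA ratings within each song."""
    return (
        dynamic_annotations
        .groupby("song_id")[["valence", "arousal"]]
        .mean()
        .reset_index()
    )

# Toy example: two songs sampled at 2 Hz (0.5-s steps).
toy = pd.DataFrame({
    "song_id": [1, 1, 1, 2, 2, 2],
    "time_s":  [15.0, 15.5, 16.0, 15.0, 15.5, 16.0],
    "valence": [0.20, 0.30, 0.25, -0.40, -0.50, -0.45],
    "arousal": [0.60, 0.70, 0.65, -0.20, -0.30, -0.25],
})
print(static_va(toy))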
Songs were source-separated using Ultimate Vocal Remover (Takahashi & Mitsufuji, 2017) to isolate the musical and vocal components, while lyrics were transcribed using OpenAI’s Whisper model (Radford et al., 2022) and verified manually. Deep learning models trained on diverse song and speech datasets predicted VA values for the musical and vocal components, while lyrics were analyzed using models trained on general and lyrically annotated texts (Çano, 2017). Spearman's correlation was calculated between the predicted and human-annotated (HA) VA values. In addition, quadrant-based concurrency analyses (Q1: positive V, high A; Q2: negative V, high A; Q3: negative V, low A; Q4: positive V, low A) evaluated how well the predictions aligned with the human-identified emotional quadrants.
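The evaluation step can be sketched as follows. This is a minimal illustration assuming VA values centered at zero (so the sign separates positive/negative valence and high/low arousal); the variable names are hypothetical and this is not the exact analysis code used in the study.

# Minimal sketch of the evaluation: Spearman correlation between predicted and
# human-annotated (HA) VA values, plus overall quadrant concurrency.
import numpy as np
from scipy.stats import spearmanr

def quadrant(valence: np.ndarray, arousal: np.ndarray) -> np.ndarray:
    """Map VA pairs to quadrants: Q1 (+V, high A), Q2 (-V, high A),
    Q3 (-V, low A), Q4 (+V, low A)."""
    q = np.empty(len(valence), dtype=int)
    q[(valence >= 0) & (arousal >= 0)] = 1
    q[(valence < 0) & (arousal >= 0)] = 2
    q[(valence < 0) & (arousal < 0)] = 3
    q[(valence >= 0) & (arousal < 0)] = 4
    return q

def evaluate(pred_v, pred_a, ha_v, ha_a):
    rho_v, p_v = spearmanr(pred_v, ha_v)
    rho_a, p_a = spearmanr(pred_a, ha_a)
    concurrency = np.mean(quadrant(pred_v, pred_a) == quadrant(ha_v, ha_a))
    return {"valence_rho": rho_v, "valence_p": p_v,
            "arousal_rho": rho_a, "arousal_p": p_a,
            "quadrant_concurrency_pct": 100 * concurrency}

# Toy usage with synthetic VA values in [-1, 1] standing in for real data.
rng = np.random.default_rng(0)
ha_v, ha_a = rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200)
pred_v = ha_v + rng.normal(0, 0.3, 200)   # stand-in for model predictions
pred_a = ha_a + rng.normal(0, 0.3, 200)
print(evaluate(pred_v, pred_a, ha_v, ha_a))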
Results
The musical component exhibited the highest correlation with the HA ratings (valence: r = 0.70, arousal: r = 0.74), followed by vocals (valence: r = 0.54, arousal: r = 0.65; all p < .001), while lyrics contributed the least (valence: r = 0.11, arousal: r = 0.01). Overall quadrant concurrency was highest for the musical component (60.41%), followed by vocals (48.84%) and lyrics (31.45%). The highest quadrant-wise concurrency was observed in Q3 (musical: 90.53%, vocal: 93.53%, lyrics: 66.16%), followed by Q1 for the musical (64.51%) and vocal (39.10%) components, and Q4 for lyrics (42.50%).
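The quadrant-wise figures above can be read as conditional concurrency rates: within each HA quadrant, the share of songs whose predicted quadrant matches. A minimal sketch, again illustrative rather than the study's actual code:

# Per-quadrant concurrency: within each human-annotated (HA) quadrant, the
# share of songs whose predicted quadrant agrees. Quadrant labels (1-4) are
# assumed to follow the Q1-Q4 scheme defined in the Methods section.
import numpy as np

def per_quadrant_concurrency(pred_q: np.ndarray, ha_q: np.ndarray) -> dict:
    return {
        f"Q{q}": 100 * np.mean(pred_q[ha_q == q] == q)
        for q in (1, 2, 3, 4)
        if np.any(ha_q == q)
    }

# Toy example: HA vs. predicted quadrants for eight songs.
ha_q   = np.array([1, 1, 2, 2, 3, 3, 4, 4])
pred_q = np.array([1, 2, 2, 3, 3, 3, 4, 1])
print(per_quadrant_concurrency(pred_q, ha_q))  # {'Q1': 50.0, 'Q2': 50.0, 'Q3': 100.0, 'Q4': 50.0}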
Discussion
Our findings emphasize the dominant role of the musical component in shaping perceived emotional expression in Western tonal music, aligning with prior work highlighting melody’s emotional salience (Ali & Peynircioğlu, 2006). Furthermore, for music conveying negative valence and low arousal (Q3), the components showed the greatest congruence in signaling sadness and related emotions. These results suggest that MER systems could prioritize the musical and vocal components over lyrical content (Wang et al., 2021; Agrawal et al., 2021), as these appear to primarily drive the perception of emotional expression.
References
Agrawal, Y., Shanker, R. G. R., & Alluri, V. (2021). Transformer-based approach towards music emotion recognition from lyrics. In Advances in information retrieval (pp. 167–175). Springer. https://doi.org/10.1007/978-3-030-72240-1_12
Ali, S. O., & Peynircioğlu, Z. F. (2006). Songs and emotions: Are lyrics and melodies equal partners? Psychology of Music, 34(4), 511–534. https://doi.org/10.1177/0305735606067168
Çano, E. (2017). MoodyLyrics: A sentiment-annotated lyrics dataset. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2017) (pp. 118–124). Association for Computational Linguistics. https://doi.org/10.18653/v1/S17-1017
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv. https://doi.org/10.48550/arXiv.2212.04356
Soleymani, M., Caro, M., Schmidt, E., Sha, C.-Y., & Yang, Y.-H. (2013). 1000 songs for emotional analysis of music. In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia (pp. 1–6). Association for Computing Machinery.
Takahashi, N., & Mitsufuji, Y. (2017). Multi-scale multi-band DenseNets for audio source separation. arXiv. https://doi.org/10.48550/arXiv.1706.09588
Wang, S., Xu, C., Ding, A. S., & Tang, Z. (2021). A novel emotion-aware hybrid music recommendation method using deep neural network. Electronics, 10(15), 1769. https://doi.org/10.3390/electronics10151769
Zhang, K., Zhang, H., Li, S., Yang, C., & Sun, L. (2018). The PMEmo dataset for music emotion recognition. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval (pp. 135–142). Association for Computing Machinery. https://doi.org/10.1145/3206025.3206037