Abstract
Most previous work in music emotion recognition assumes a single or a few song-level labels for the whole song. While it is known that different emotions can vary in intensity within a song, annotated data for this setup is scarce and difficult to obtain. In this work, we propose a method to predict emotion dynamics in song lyrics without song-level supervision. We frame each song as a time series and employ a State Space Model (SSM), combining a sentence-level emotion predictor with an Expectation-Maximization (EM) procedure to generate the full emotion dynamics. Our experiments show that applying our method consistently improves the performance of sentence-level baselines without requiring any annotated songs, making it ideal for limited training data scenarios. Further analysis through case studies shows the benefits of our method while also indicating the limitations and pointing to future directions.
1 Introduction
Music and emotions are intimately connected, with almost all music pieces being created to express and induce emotions (Juslin and Laukka, 2004). As a key factor of how music conveys emotion, lyrics contain part of the semantic information that the melodies cannot express (Besson et al., 1998). Lyrics-based music emotion recognition has attracted increasing attention driven by the demand to process massive collections of songs automatically, which is an important task for streaming and media service providers (Kim et al., 2010; Malheiro et al., 2016; Agrawal et al., 2021).
Vanilla emotion recognition studies in Natural Language Processing (NLP) assume the text instance expresses a static and single emotion (Mohammad and Bravo-Márquez, 2017; Nozza et al., 2017; Mohammad et al., 2018). However, emotion is non-static and highly correlated with the contextual information, making the single-label assumption too simplistic in dynamic scenarios, not just in music (Schmidt and Kim, 2011) but also in other domains such as conversations (Poria et al., 2019b). Figure 1 shows an example of this dynamic behavior, where the intensities of three different emotions vary within a song. Accurate emotion recognition systems should ideally generate the full emotional dynamics for each song, as opposed to simply predicting a single label.
Figure 1: An illustration of the emotion dynamics of a song in the LyricsEmotions dataset of Mihalcea and Strapparava (2012). Note that the intensities of each emotion vary from verse to verse within the song.
A range of datasets and corpora for modeling dynamic emotion transitions has been developed in the literature (McKeown et al., 2011; Li et al., 2017; Hsu et al., 2018; Poria et al., 2019a; Firdaus et al., 2020), but most of them do not use song lyrics as the domain and assume discrete, categorical emotion labels (the presence or absence of an emotion). To the best of our knowledge, the dataset of Mihalcea and Strapparava (2012) is the only one that provides fine-grained emotion intensity annotations for song lyrics at the verse level. The lack of large-scale datasets for this task poses a challenge for traditional supervised methods. While previous work proposed methods for the related sequence-based emotion recognition task, they all assume some level of annotated data at training time, from full emotion dynamics (Kim et al., 2015) to coarse, discrete document-level labels (Täckström and McDonald, 2011b).
The data scarcity problem motivates our main research question: “Can we predict emotion dynamics in song lyrics without requiring annotated lyrics?” In this work, we claim that the answer is affirmative. To show this, we propose a method consisting of two major stages: (1) a sentence or verse-level regressor that leverages existing emotion lexicons, pre-trained language models and other sentence-level datasets, and (2) a State Space Model (SSM) that constructs a full song-level emotional dynamics given the initial verse-level scores. Intuitively, we treat each verse as a time step and the emotional intensity sequence as a latent time series that is inferred without any song-level supervision, directly addressing the limited data problem. To the best of our knowledge, this scenario was never addressed before either for song lyrics or other domains.
To summarize, our main contributions are:
We propose a hybrid approach for verse-level emotion intensity prediction that combines emotion lexicons with a pre-trained language model (BERT; Devlin et al., 2019), trained on available sentence-level data.
We show that by using SSMs to model song-level emotion dynamics, we can improve the performance of the verse-level approach without requiring any annotated lyrics.
We perform a qualitative analysis of our best models, highlighting their limitations and pointing to directions for future work.
2 Background and Related Work
Emotion Models.
Human emotion is a long-standing research field in psychology, with many studies aiming to define a taxonomy of emotions. In NLP, emotion analysis mainly relies on datasets annotated according to either the categorical or the dimensional model.
The categorical model assumes a fixed set of discrete emotions that can vary in intensity. Emotions can overlap but are assumed to be separate “entities” from each other, such as anger, joy, and surprise. Taxonomies using the categorical model include Ekman’s basic emotions (Ekman, 1993), Plutchik’s wheel of emotions (Plutchik, 1980), and the OCC model (Ortony et al., 1988). The dimensional models place emotions in a continuous space: The VAD (Valence, Arousal, and Dominance) taxonomy of Russell (1980) is the most commonly used in NLP. In this work, we focus on the Ekman taxonomy for purely experimental purposes, as it is the one used in the available data we employ. However, our approach is general and could be applied to other taxonomies.
Dynamic Emotion Analysis.
Emotion Recognition in Conversation (ERC, Poria et al., 2019b), which focuses on tracking dynamic shifts of emotions, is the task most similar to our work. Within a conversation, the emotional state of each utterance is influenced by the previous state of the party and the stimulation from other parties (Li et al., 2020; Ghosal et al., 2021). A similar assumption of real-time emotional change also holds in music: the affective state of the current lyrics verse is correlated with the state of the previous verse(s) as the song progresses.
Contextual information in the ERC task is generally captured by deep learning models, which can be roughly categorized into sequence-based, graph-based, and reinforcement learning-based methods. Sequence-based methods encode conversational context features using established architectures such as Recurrent Neural Networks (Poria et al., 2017; Hazarika et al., 2018a, b; Majumder et al., 2019; Hu et al., 2021) and Transformers (Zhong et al., 2019; Li et al., 2020), as well as more tailored methods such as the Hierarchical Memory Network (Jiao et al., 2020), the Emotion Interaction Network (Lu et al., 2020), and the Causal Aware Network (Zhao et al., 2022). Graph-based methods apply specific graphical structures to model dependencies in conversations (Ghosal et al., 2019; Zhang et al., 2019; Lian et al., 2020; Ishiwatari et al., 2020; Shen et al., 2021) using Graph Neural Networks (Kipf and Welling, 2017). Reinforcement Learning (RL)-based methods (Zhang et al., 2021; Huang et al., 2021) model the influence of the previous emotional state on the current utterance's emotion by exploiting the agent-environment nature of dialogue systems. In contrast to these methods, we capture contextual information with an SSM, mainly motivated by the need for a method that can be trained without supervision. Extending or combining an SSM with a deep learning model is theoretically possible but non-trivial, and care must be taken in a low-data situation such as ours.
The time-varying nature of music emotions has been investigated in music information retrieval (Caetano et al., 2012). To link the human emotions with the music acoustic signal, the emotion distributions were modeled as 2D Gaussian distributions in the Arousal-Valence (A-V) space, which were used to predict A-V responses through multi-label regression (Schmidt et al., 2010; Schmidt and Kim, 2010). Building on previous studies, Schmidt and Kim (2011) applied structured prediction methods to model complex emotion-space distributions as an A-V heatmap. These studies focus on the mapping between emotions and acoustic features/signals, while our work focuses on the lyrics component. Wu et al. (2014) developed a hierarchical Bayesian model that utilized both acoustic and textual features, but it was only applied to predict emotions as discrete labels (presence or absence) instead of fine-grained emotion intensities as in our work.
Combining Pre-trained Language Models with External Knowledge.
Pre-trained language models (LMs) such as BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and GPT (Brown et al., 2020) have achieved state-of-the-art performance in numerous NLP tasks. Considerable effort has been made to combine the context-sensitive features of LMs with knowledge from structured sources, including commonsense knowledge (Zhong et al., 2019; Ghosal et al., 2020), domain-specific knowledge (Ying et al., 2019), structured semantic information (Zhang et al., 2020), language-specific knowledge (Alghanmi et al., 2020; De Bruyne et al., 2021), and linguistic features (Koufakou et al., 2020; Mehta et al., 2020). This auxiliary knowledge is usually infused into the architecture by concatenating it with the Transformer-based representation before the prediction layer of the downstream task. Our method uses rule-based representations derived from a collection of affective lexicons to improve the performance of BERT by incorporating task-specific knowledge. The motivation is our hypothesis that lexicon-based information compensates for BERT's limited representation of semantic and world knowledge (Rogers et al., 2021), making the model more stable across domains.
State Space Models.
In NLP tasks such as Part-of-Speech (POS) tagging and Named Entity Recognition, contextual information is widely acknowledged to play an important role in prediction. This led to the adoption of structured prediction approaches such as Hidden Markov Model (HMM, Rabiner and Juang, 1986), Maximum Entropy Markov Model (MEMM, McCallum et al., 2000), and Conditional Random Field (CRF, Lafferty et al., 2001), which relate a set of observable variables to a set of latent variables (e.g., words and their POS tags). State Space Models are similar to HMMs but assume continuous variables. The Linear Gaussian SSM (LG-SSM) is a particular case of SSM in which the conditional probability distributions are Gaussian.
Other latent-variable models have been used to estimate the temporal dynamics of emotions and sentiments in product reviews (McDonald et al., 2007; Täckström and McDonald, 2011a, b) and blogs (Kim et al., 2015). McDonald et al. (2007) and Täckström and McDonald (2011a, b) combined document-level and sentence-level supervision as the observed variables to condition the latent sentence-level sentiment. Kim et al. (2015) introduced a continuous variable yt to solely determine the sentiment polarity zt, while zt is conditioned on both yt and zt−1 for each t in the LG-SSM.
3 Method
We propose a two-stage method to predict emotion dynamics without requiring annotated song lyrics. The first stage is a verse-level model that predicts initial scores for each verse, where we use a hybrid approach combining lexicons and sentence-level annotated data from a different domain (§ 3.1). The second stage contextualizes these scores in the entire song, incorporating them into an LG-SSM trained in an unsupervised way (§ 3.2).
Task Formalization.
Let e(x, y) denote the real-valued intensity of emotion y for a sentence or verse x, where y belongs to a set of c labels {y1, y2, …, yc}, each representing one of the basic emotions (c = 6 for the datasets we use). The source dataset is a set of M annotated sentences {(x1, E1), (x2, E2), …, (xM, EM)}, where xi is a sentence and Ei is its set of c emotion intensities. The target dataset is a set of sequences (i.e., songs), where each song Si consists of |Si| verses. In song Si, the j-th verse vj is likewise associated with c emotion intensities Ej. Because the label spaces of the source and target datasets are homogeneous, a model trained on the source dataset can be applied to the target dataset directly. The output of the verse-level model is the set of emotion intensity predictions for all N verses in the target dataset, where N is the total number of verses. Finally, these predictions, grouped into per-song sequences, are used as input to the song-level model, which produces refined emotion intensity sequences.
3.1 Verse-Level Model
Emotion lexicons provide information on associations between words and emotions (Ramachandran and de Melo, 2020), which are beneficial in recognizing textual emotions (Mohammad et al., 2018; Zhou et al., 2020). Given that we would like to acquire accurate initial predictions at the verse level, we opted for a hybrid methodology that combines learning-based and lexicon-based approaches to enhance feature representation.
Overview.
The verse-level model, which we call BERTLex, is illustrated in Figure 2. It consists of three phases: (1) the embedding phase, (2) the integration phase, and (3) the prediction phase. In the embedding phase, the input sequence is represented both as contextualized embeddings from BERT and as static word embeddings from lexicons. In the integration phase, the contextualized and static word embeddings are pooled separately and concatenated at the sentence level. The prediction phase encodes the integrated feature vector and performs verse-level emotion intensity regression, using the source dataset (NewsHeadlines) as the training/development set and the target dataset (LyricsEmotions) as the test set.
Embedding Phase.
The input sentence S is tokenized in two ways: one for the pre-trained language model and one for the lexicon-based word embedding. The two tokenized sequences are denoted Tcxt and Tlex, respectively. Tcxt is fed into the pre-trained language model to produce a sequence of contextualized word embeddings, one Dcxt-dimensional vector per token, where Dcxt is the contextual embedding dimension.
To capture task-specific information, a lexicon embedding layer encodes a sequence of emotion and sentiment word associations for Tlex, generating a sequence of lexicon-based embeddings, one Dlex-dimensional vector per word, where Dlex is the lexical embedding dimension. We first build the vocabulary from the text of the source and target datasets. For each word vi of Tlex in the vocabulary, we use d lexicons to generate a rule-based feature vector ℓi by concatenating the d per-lexicon feature vectors for vi, so that Dlex = |ℓi|. Additionally, we perform a degree-p polynomial expansion on each per-lexicon feature vector (the expanded sizes are listed in the PT column of Table 1).
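To make the lexicon features and the PT column of Table 1 concrete, below is a small sketch of per-lexicon polynomial expansion using scikit-learn. The use of PolynomialFeatures and the grouping of features by lexicon are our reading of the setup, not the authors' code; degree 3 and the raw per-lexicon sizes come from Table 1 and §4.2.

```python
# Sketch: expand each lexicon's feature block separately with a degree-3 polynomial
# and concatenate the results. With the per-lexicon sizes from Table 1
# (1, 2, 1, 1, 3, 3, 3, 3, 8 -> 4, 10, 4, 4, 20, 20, 20, 20, 165) this yields
# 25 raw features and 267 expanded features per word, matching the shapes in §4.2.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

LEXICON_SIZES = [1, 2, 1, 1, 3, 3, 3, 3, 8]  # raw feature count per lexicon (Table 1)

def expand_word_features(word_vec, degree=3):
    """word_vec: 1-D array of 25 raw lexicon features for a single word."""
    blocks, start = [], 0
    for size in LEXICON_SIZES:
        block = np.asarray(word_vec[start:start + size]).reshape(1, -1)
        poly = PolynomialFeatures(degree=degree, include_bias=True)
        blocks.append(poly.fit_transform(block))   # e.g., 8 raw features -> 165
        start += size
    return np.hstack(blocks).ravel()               # 267-dimensional expanded vector

# Example: a 25-dimensional raw vector expands to 267 dimensions.
print(expand_word_features(np.random.rand(25)).shape)  # (267,)
```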
Integration Phase.
Prediction Phase.
The prediction phase outputs the emotion intensity predictions using a single dropout layer (Srivastava et al., 2014) followed by a linear regression layer. During training, the mean squared error loss is computed and backpropagated to update the model parameters.
3.2 Song-Level Model
After obtaining the initial verse-level predictions, the next step incorporates them into a song-level model using an LG-SSM. Take one emotion as an example: we treat the predicted scores of this emotion within each song as an observed sequence. That is, we group the N verse-level predictions into one sequence per song. For the i-th song, the observed sequence y1:T is then used in an LG-SSM to infer the latent sequence z1:T that represents the song-level emotion dynamics, where T is the number of verses in the song.
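For reference, a standard LG-SSM of this form (with the control terms omitted, as stated in the Notes) can be written as below. The parameter names match those used in §4.3 (transition matrix A, transition covariance Q, observation matrix C, observation covariance R); treating them as scalars, with one model per emotion, is our reading of the setup.

```latex
z_t = A\, z_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q) \qquad \text{(transition)}
y_t = C\, z_t + v_t, \qquad\;\;\; v_t \sim \mathcal{N}(0, R) \qquad \text{(observation)}
```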
Standard applications of an LG-SSM assume a temporal ordering of the sequence: estimates of the latent state zt should depend only on the observations up to verse t (i.e., y1:t), which is the central assumption of the Kalman Filter algorithm. Given the sequence of observations, we recursively apply the Kalman Filter to calculate the mean and variance of the hidden states, following the computation steps displayed in Algorithm 1.
Since we have initial predictions for all verses in a song, we can assume that the observed emotion scores for the entire song are available a priori. In other words, we can also use the "future" observations (i.e., yt+1:T) to estimate the latent posteriors p(zt | y1:T). This is achieved with the Kalman smoothing algorithm, also known as the RTS smoother (Rauch et al., 1965), shown in Algorithm 2.
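Algorithms 1 and 2 are not reproduced here; the following is a minimal scalar sketch of the standard Kalman Filter and RTS Smoother recursions for this model, assuming known scalar parameters A, Q, C, R and an initial mean and variance. Variable names and the convention of treating the initial mean as the prior for the first verse are ours.

```python
# Minimal scalar Kalman Filter and RTS Smoother, assuming known parameters.
# All quantities are scalars because each emotion is modeled independently.
def kalman_filter(y, A, Q, C, R, m0, P0):
    means, variances = [], []
    m, P = m0, P0
    for t, y_t in enumerate(y):
        if t > 0:                        # predict step (skipped at t = 0: m0, P0 act as the prior)
            m, P = A * m, A * P * A + Q
        K = P * C / (C * P * C + R)      # Kalman gain
        m = m + K * (y_t - C * m)        # update step: correct with the observation
        P = (1 - K * C) * P
        means.append(m)
        variances.append(P)
    return means, variances

def rts_smoother(y, A, Q, C, R, m0, P0):
    means_f, vars_f = kalman_filter(y, A, Q, C, R, m0, P0)
    means_s, vars_s = means_f[:], vars_f[:]
    for t in range(len(y) - 2, -1, -1):  # backward pass over the filtered estimates
        P_pred = A * vars_f[t] * A + Q
        G = vars_f[t] * A / P_pred       # smoother gain
        means_s[t] = means_f[t] + G * (means_s[t + 1] - A * means_f[t])
        vars_s[t] = vars_f[t] + G * (vars_s[t + 1] - P_pred) * G
    return means_s, vars_s
```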
Both the Kalman Filter and the Kalman Smoother assume the model parameters are already known, so learning the SSM amounts to estimating the parameters Θ. If ground truth values for the complete z1:T were available, Θ could be learned by Maximum Likelihood Estimation (MLE). Since only the noisy observed sequences y1:T are present, the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) provides an iterative method for finding the MLE of Θ by successively maximizing the conditional expectation of the complete-data likelihood until convergence.
4 Experiments
Our experiments aim to evaluate the method proposed to predict the emotional dynamics of song lyrics without utilizing any annotated lyrics data. We introduce datasets, lexicon resources, and the evaluation metric used (§4.1), and discuss the implementation details and experiment settings of the verse-level model (§4.2) and the song-level model (§4.3).
4.1 Datasets and Evaluation
LyricsEmotions.
This corpus, developed by Mihalcea and Strapparava (2012), consists of 100 popular English songs with 4,975 verses in total; the number of verses per song ranges from 14 to 110. The dataset was constructed by extracting the parallel alignment of musical features and lyrics from MIDI tracks. The lyrics were annotated via Mechanical Turk at the verse level with real-valued intensity scores ranging from 0 to 10 for Ekman's (1993) six emotions: ANGER, DISGUST, FEAR, JOY, SADNESS, and SURPRISE. Given that our goal is to predict emotions without relying on annotated song-level dynamics, we use this dataset for evaluation purposes only.
NewsHeadlines.
This dataset (Strapparava and Mihalcea, 2007) contains 1,250 news headlines annotated with real-valued intensity scores for the same six Ekman emotions. Since it provides sentence-level annotations in a different domain, we use it as the source dataset for training the verse-level models.
Lexicons.
Following Goel et al. (2017) and Meisheri and Dey (2018), we use nine emotion and sentiment related lexicons to obtain the feature vectors from the text in NewsHeadlines and LyricsEmotions, summarized in Table 1.
Table 1: Lexicons used to build lexicon-based feature vectors. PT is the size of the feature vector after polynomial feature expansion.

| Lexicon | Scope | Size (PT) | Label | Reference |
| --- | --- | --- | --- | --- |
| NRC-Emo-Int | Emotion | 1 (4) | Numerical | Mohammad (2018) |
| SentiWordNet | Sentiment | 2 (10) | Numerical | Esuli and Sebastiani (2007) |
| NRC-Emo-Lex | Emotion | 1 (4) | Nominal | Mohammad and Turney (2013) |
| NRC-Hash-Emo | Emotion | 1 (4) | Numerical | Mohammad and Kiritchenko (2015) |
| Sentiment140 | Sentiment | 3 (20) | Numerical | Mohammad et al. (2013) |
| Emo-Aff-Neg | Sentiment | 3 (20) | Numerical | Zhu et al. (2014) |
| Hash-Aff-Neg | Sentiment | 3 (20) | Numerical | Mohammad et al. (2013) |
| Hash-Senti | Sentiment | 3 (20) | Numerical | Kiritchenko et al. (2014) |
| DepecheMood | Emotion | 8 (165) | Numerical | Staiano and Guerini (2014) |
Evaluation.
In line with Mihalcea and Strapparava (2012), we use the Pearson correlation coefficient (r) as the evaluation metric, measuring the correlation between the predictions and the ground truth emotion intensities. To assess statistical significance, we apply the Williams test (Williams, 1959) to the difference between the Pearson correlations of each pair of models.
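For reference, a small sketch of this evaluation is shown below: Pearson's r via SciPy, and Williams' test for the difference between two correlations that share the ground truth as a common variable. This is our own implementation of the standard formula, not the authors' code, and the two-sided p-value is our choice.

```python
# Sketch: Pearson correlation and Williams' (1959) test for two dependent correlations
# r(truth, pred_a) vs. r(truth, pred_b), with df = n - 3. Our own implementation.
import numpy as np
from scipy import stats

def williams_test(truth, pred_a, pred_b):
    n = len(truth)
    r12 = stats.pearsonr(truth, pred_a)[0]      # model A vs. ground truth
    r13 = stats.pearsonr(truth, pred_b)[0]      # model B vs. ground truth
    r23 = stats.pearsonr(pred_a, pred_b)[0]     # model A vs. model B
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    t = (r12 - r13) * np.sqrt(
        (n - 1) * (1 + r23)
        / (2 * det * (n - 1) / (n - 3) + r_bar**2 * (1 - r23) ** 3))
    p = 2 * stats.t.sf(abs(t), df=n - 3)        # two-sided p-value
    return t, p
```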
Regarding baselines, our method is unsupervised at the song level, and we are not aware of previous work that tackles this setting. We therefore use the verse-level model as our main baseline. We argue that this is a fair comparison, since the SSM-based model does not require any additional data.
4.2 Verse-level Experiments
Setup.
For the pre-trained model, we choose the BERTbase uncased model in English with all parameters frozen during training. All models are trained on an NVIDIA T4 Tensor Core GPU with CUDA (version 11.2).
BERTLex.
The sequence of token embeddings, including [CLS] and [SEP] at the output of the last layer of the BERTbase model, is fed into a Conv1D layer with 128 filters and a kernel size of 3, followed by a 1D global average pooling layer.
For every word in the established vocabulary, we concatenate the nine vector representations obtained from the lexicons in Table 1, in that order, to form a single feature vector. As a result, the whole word embedding matrix has shape (3309, 25), where 3309 is the vocabulary size and 25 is the number of lexicon-based features. To validate the benefit of polynomial features, we also perform a polynomial expansion of degree 3, extending the representations to shape (3309, 267). The static word embeddings are then fed into a Conv1D layer with 128 filters and a kernel size of 3, followed by a 1D global max-pooling layer.
The two pooled vectors are then concatenated through a Concatenate layer. The verse-level emotion intensities are predicted by a linear layer with a single neuron for regression.
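For concreteness, below is a minimal Keras sketch of this prediction head. It assumes pre-computed frozen BERT token embeddings and lexicon-based word embeddings as inputs; the layer sizes (Conv1D with 128 filters and kernel size 3, the pooling choices, the single-neuron output, the MSE loss) follow the description above, while the activations, dropout rate, sequence lengths, and optimizer are our own assumptions.

```python
# A minimal sketch of the BERTLex head; hyperparameters marked below are assumptions.
from tensorflow.keras import layers, Model

def build_bertlex_head(seq_len_cxt=64, seq_len_lex=64, d_cxt=768, d_lex=267):
    cxt_in = layers.Input(shape=(seq_len_cxt, d_cxt), name="bert_token_embeddings")
    lex_in = layers.Input(shape=(seq_len_lex, d_lex), name="lexicon_word_embeddings")

    # Contextualized branch: Conv1D (128 filters, kernel 3) + global average pooling.
    cxt = layers.Conv1D(128, 3, activation="relu")(cxt_in)   # activation is an assumption
    cxt = layers.GlobalAveragePooling1D()(cxt)

    # Lexicon branch: Conv1D (128 filters, kernel 3) + global max pooling.
    lex = layers.Conv1D(128, 3, activation="relu")(lex_in)
    lex = layers.GlobalMaxPooling1D()(lex)

    # Integration: concatenate the two pooled sentence vectors.
    merged = layers.Concatenate()([cxt, lex])
    merged = layers.Dropout(0.2)(merged)                     # dropout rate is an assumption

    # Prediction: single-neuron linear regression per emotion, trained with MSE.
    out = layers.Dense(1, activation="linear")(merged)

    model = Model([cxt_in, lex_in], out)
    model.compile(optimizer="adam", loss="mse")
    return model
```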
Training.
Instead of using the standard train/dev/test split of the NewsHeadlines dataset, we apply 10-fold cross-validation to tune the hyperparameters of BERT-based models. Empirically tuned hyperparameters are listed in Table 2 and are adopted in the subsequent experiments. After tuning, the final models using this set of hyperparameters are trained on the full NewsHeadlines data. We use an ensemble of five runs, taking the mean of the predictions as the final output.
4.3 Song-Level Experiments
We use the pykalman library (version 0.9.2), which implements the Kalman Filter, the Kalman Smoother, and the EM algorithm, to train the SSMs. We fix the initial state mean to the first observed value in the sequence (i.e., each song's first verse-level prediction) and the initial state covariance to 2. We then experiment with several combinations of the parameters used to initialize the Kalman Filter and Kalman Smoother: transition matrix A, transition covariance Q, observation matrix C, and observation covariance R. For parameter optimization, we experiment with n_iter ∈ {1, 3, 5, 7, 10} to control the number of EM iterations. Additionally, we apply 10-fold cross-validation when optimizing parameters via EM: each fold (containing 10 songs) is processed by a Kalman Filter or Kalman Smoother whose parameters are the optimal ones obtained from training on the other folds (90 songs).
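A minimal sketch of how this stage could be implemented with pykalman is given below, following the settings above (scalar state per emotion, initial state mean set to the first verse-level prediction, initial state covariance of 2). The function name and the choice of which parameters EM re-estimates (em_vars) are our own assumptions, not the authors' code.

```python
# Sketch: song-level refinement of verse-level scores with pykalman (assumptions noted above).
import numpy as np
from pykalman import KalmanFilter

def song_level_dynamics(verse_scores, A=1.0, Q=1.0, C=1.0, R=5.0, n_iter=5):
    """verse_scores: 1-D array of verse-level predictions for one emotion in one song."""
    obs = np.asarray(verse_scores, dtype=float).reshape(-1, 1)
    kf = KalmanFilter(
        transition_matrices=[[A]],
        transition_covariance=[[Q]],
        observation_matrices=[[C]],
        observation_covariance=[[R]],
        initial_state_mean=[obs[0, 0]],          # first verse-level prediction
        initial_state_covariance=[[2.0]],
        # Assumption: keep the initial state fixed and let EM update the rest.
        em_vars=["transition_matrices", "transition_covariance",
                 "observation_matrices", "observation_covariance"],
    )
    kf = kf.em(obs, n_iter=n_iter)               # Filter-EM / Smoother-EM variants
    filtered_means, _ = kf.filter(obs)           # Kalman Filter estimates
    smoothed_means, _ = kf.smooth(obs)           # Kalman (RTS) Smoother estimates
    return filtered_means.ravel(), smoothed_means.ravel()
```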
5 Results and Analysis
In this section, we first compare the results of our lexicon-based, learning-based and hybrid methods at the verse level (§ 5.1). We then provide the results of the song-level models and investigate the impact of the initial predictions from verse-level models, SSM parameters, and parameter optimization (§ 5.2). We additionally show the qualitative case analysis results to understand our model’s abilities and shortcomings (§ 5.3). Finally, we compare the results of supervised and unsupervised methods on LyricsEmotions (§ 5.4).
5.1 Results of Verse-level Models
Table 3 shows the results of the verse-level models on NewsHeadlines (average over 10-fold cross-validation) and LyricsEmotions (used as a test set).
Table 3: Pearson correlations between ground truth labels and predictions of the verse-level models on the NewsHeadlines (NH) and LyricsEmotions (LE) datasets. The subscript cv denotes the average of the 10-fold cross-validation experiments, and the subscript test denotes results when the dataset is used as the test set.

| Model | Dataset | ANG | DIS | FEA | JOY | SAD | SUR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Lexicon only | NHcv | 0.197 | 0.106 | 0.231 | 0.219 | 0.112 | 0.056 |
| | LEcv | 0.212 | 0.091 | 0.185 | 0.209 | 0.175 | 0.031 |
| BERT only | NHcv | 0.740 | 0.651 | 0.792 | 0.719 | 0.808 | 0.469 |
| | LEtest | 0.311 | 0.261 | 0.314 | 0.492 | 0.306 | 0.071 |
| BERTLex | NHcv | 0.865 | 0.828 | 0.840 | 0.858 | 0.906 | 0.771 |
| | LEtest | 0.340 | 0.287 | 0.336 | 0.472 | 0.338 | 0.066 |
| BERTLexpoly | NHcv | 0.838 | 0.788 | 0.833 | 0.840 | 0.885 | 0.742 |
| | LEtest | 0.345 | 0.268 | 0.350 | 0.503 | 0.350 | 0.089 |
The domain difference between news and lyrics is evident from the gap in performance of the BERT-based models on the two datasets. Overall, our BERTLex method outperforms the lexicon-only and BERT-only baselines and reaches the highest Pearson score on LyricsEmotions (0.503, BERTLexpoly for JOY).
Taking a closer look at the results on LyricsEmotions, we also observe the following:
The addition of lexicons for incorporating external knowledge consistently promotes the performance of BERT-based models.
BERTLex models that add polynomial feature expansion are better than those that do not, when using LyricsEmotions as a test set (except for DISGUST). However, in the cross-validation of NewsHeadlines, the models without polynomial features outperform those with.
5.2 Results of Song-level Models
Extensive experiments confirm that our song-level models using the Kalman Filter and Kalman Smoother improve the initial predictions of the verse-level models (see Table 4 and Table 5). The LG-SSMs with EM-optimized parameters always perform better than those without EM. Furthermore, the improvements of the strongest SSMs over their corresponding verse-level baselines are statistically significant at the 0.05 level (marked with *), except for SURPRISE.
Table 4: Pearson correlations between ground truth emotion intensities and the predictions of the BERTLex models and SSMs, respectively. The default pykalman parameters are used: A = 1, Q = 1, C = 1, R = 5, and n_iter = 10.

| Model | ANG | DIS | FEA | JOY | SAD | SUR |
| --- | --- | --- | --- | --- | --- | --- |
| BERTLex | 0.338 | 0.280 | 0.336 | 0.468 | 0.338 | 0.066 |
| Filter | 0.359* | 0.287* | 0.352* | 0.498* | 0.361* | 0.069 |
| Smoother | 0.362* | 0.282 | 0.352* | 0.501* | 0.366* | 0.064 |
| Filter-EM | 0.396* | 0.293* | 0.357* | 0.522* | 0.387* | 0.069 |
| Smoother-EM | 0.405* | 0.280 | 0.339 | 0.522* | 0.385* | 0.060 |
| BERTLexpoly | 0.315 | 0.261 | 0.350 | 0.503 | 0.347 | 0.083 |
| Filter | 0.334* | 0.267 | 0.367* | 0.538* | 0.374* | 0.088 |
| Smoother | 0.332* | 0.258 | 0.368* | 0.542* | 0.380* | 0.082 |
| Filter-EM | 0.358* | 0.270* | 0.371* | 0.568* | 0.405* | 0.087 |
| Smoother-EM | 0.356* | 0.251 | 0.355 | 0.570* | 0.405* | 0.079 |
Table 5: Pearson correlations between ground truth and SSMs with different values of the transition matrix A, based on the BERTLexpoly models (bottom half of Table 4). The other parameters are fixed: Q = 1, C = 1, R = 5, and n_iter = 5.

| Model | ANG | DIS | FEA | JOY | SAD | SUR |
| --- | --- | --- | --- | --- | --- | --- |
| A = 0.5 | | | | | | |
| Filter | 0.265 | 0.223 | 0.350 | 0.455 | 0.351 | 0.074 |
| Smoother | 0.277 | 0.223 | 0.351 | 0.471 | 0.362* | 0.071 |
| Filter-EM | 0.357* | 0.273* | 0.373* | 0.563* | 0.397* | 0.089 |
| Smoother-EM | 0.360* | 0.262 | 0.364* | 0.569* | 0.402* | 0.083 |
| A = 2 | | | | | | |
| Filter | 0.354* | 0.270 | 0.369* | 0.560* | 0.393* | 0.085 |
| Smoother | 0.075 | 0.055 | 0.160 | 0.205 | 0.174 | 0.003 |
| Filter-EM | 0.355* | 0.272* | 0.375* | 0.562* | 0.399* | 0.089 |
| Smoother-EM | 0.358* | 0.260 | 0.364* | 0.568* | 0.403* | 0.083 |
Theoretically, the Kalman Smoother should perform better than the Kalman Filter, since the former uses all observations in the sequence. In our experiments, however, the best-performing algorithm depends on the emotion. Furthermore, applying EM consistently improves over the SSMs that only use the initial parameter values, except for SURPRISE.
Combining the results in Table 3, Table 4, and Table 5, we observe that all models perform poorly when predicting the intensity of SURPRISE (r < 0.1). Poor results for SURPRISE are also reported by other work on LyricsEmotions and NewsHeadlines, as well as on other datasets annotated with the Ekman taxonomy. SURPRISE has significantly lower inter-annotator agreement than other emotions (Strapparava and Mihalcea, 2007; Schuff et al., 2017; Buechel and Hahn, 2017; Dang et al., 2021; Edmonds and Sedoc, 2021), which suggests that it is especially difficult to model and occurs less frequently (Mohammad et al., 2018; Bostan and Klinger, 2018). This may indicate underlying problems in the definition of SURPRISE as an emotion category (Schuff et al., 2017).
Impact of Verse-level Predictions.
The performance of the Kalman Filter, Kalman Smoother, and EM algorithm is tied to the initial scores predicted by the verse-level models. For each emotion, we compare results based on the mean predictions of the BERTLex models with and without polynomial expansion of the lexical features (Table 4). We observe that the higher the Pearson correlation between the ground truth and the verse-level predictions, the more accurate the estimates obtained from the LG-SSMs. The strongest SSM also differs across emotions and initial predictions.
Impact of Initial Parameters.
The results of the Kalman Filter and Kalman Smoother are sensitive to the initial model parameters. As displayed in Table 5, when we change only the value of the transition matrix A and fix the other parameters, running the Filter or Smoother alone can actually decrease performance. Fortunately, this degradation can be mitigated by optimizing the parameters with the EM algorithm.
Impact of Parameter Optimization.
For either the Kalman Filter or the Kalman Smoother, using EM to optimize the parameters increases Pearson's r in most cases. In our experiments, the number of iterations does not significantly influence the performance of the EM algorithm; 5 to 10 iterations usually produce the strongest results.
5.3 Qualitative Case Studies
Domain Discrepancy.
As shown in Section 5.2, the Pearson scores between the ground-truth labels and the estimates for SURPRISE are lower than 0.1, meaning that both our verse-level and song-level models underperform on this emotion. Upon closer inspection, we observe a great number of zeros in the ground-truth SURPRISE annotations of the target-domain dataset. For example, Figure 3 shows the emotion curves of If You Love Somebody Set Them Free by Sting, where all ground-truth SURPRISE labels are zero throughout the song. Statistically, 1,933 of the 4,975 (38.85%) SURPRISE ground truth labels in LyricsEmotions are zero, compared with only 148 of 1,250 (11.84%) in NewsHeadlines. Models trained on NewsHeadlines therefore do not anticipate such a widespread absence of SURPRISE when predicting on LyricsEmotions. This domain discrepancy clearly affects the performance of our method.
Figure 3: The SURPRISE emotion intensities of the ground truth (all zeros), the BERTLex model, and the SSM in an example song.
Characteristics of Kalman Filter and Kalman Smoother.
Our experiments indicate that the initial predictions of roughly 50 to 70 of the 100 songs are improved after modeling them with LG-SSMs. We identify two recurring patterns among the songs whose predictions are weakened by the LG-SSMs. In the first, the ground truth emotion dynamics fluctuate more sharply than the verse-level predictions, as in the first and second sub-figures of Figure 4. In the second, the opposite holds: the verse-level models produce an emotion intensity curve with more sudden changes than the ground truth (third sub-figure of Figure 4). The trend of the song-level estimates is similar to that of the verse-level predictions, but, due to the Gaussian assumption, the Kalman Filter and Kalman Smoother tend to flatten or smooth the verse-level curves. Applying LG-SSMs can therefore reduce errors in the second pattern, whereas in the first pattern the Kalman Filter and Kalman Smoother make the results worse, since smoother estimates are not desirable there.
Figure 4: Emotion dynamics of ANGER, DISGUST, and SURPRISE in Bad Romance by Lady Gaga. Pearson's r between the ground truth and the BERTLexpoly predictions and the Kalman Filter estimates is reported for each.
Using Text Solely.
The lyrics in LyricsEmotions are synchronized with acoustic features, so some verses with identical text are labeled with different emotion intensities. For instance, in Table 6, the verse "When it rain and rain, it rain and rain" repeats multiple times in the song Rain by Mika, and its ground truth SADNESS labels differ across verses due to the melody. However, the verse-level models necessarily produce the same prediction for these verses, since they contain the same text and the models do not consider the context of the whole song. Consequently, the emotion scores predicted by the LG-SSMs for these verses are also close to each other, as the song-level results are strongly tied to the initial BERTLex predictions.
Table 6: SADNESS scores of verses with the same lyrics ("When it rain and rain, it rain and rain") but different ground truth labels within the song.

| Verse ID | Truth | BERTLex | Smoother-EM |
| --- | --- | --- | --- |
| s55v15 | 4.33 | 8.65 | 1.68 |
| s55v31 | 7.66 | 8.65 | 1.68 |
| s55v32 | 7.33 | 8.65 | 1.63 |
5.4 Comparison with a Supervised Model
Our last experiment aims to understand the degree of difficulty in solving the task by training a supervised model, which serves as a performance upper bound. We keep the same 10-fold cross-validation splits, but now use the training folds to fine-tune a BERTLex model at the verse level.
We compare the results of the supervised model with the best results of our song-level models in Table 7, which shows a substantial performance gap for all emotions. In particular, the supervised model scores well on SURPRISE, the most challenging emotion to predict in our experiments. While our SSM models have the benefit of being readily applicable to new domains (such as songs in genres other than pop and languages other than English), this result suggests that practical systems could benefit from some level of annotation, especially for SURPRISE. More generally, it also motivates extending SSMs to a semi-supervised setting, which we leave for future work.
Table 7: Pearson correlations of the predictions from supervised BERTLex models (10-fold cross-validation) and the predictions of the best SSMs.

| Model | ANG | DIS | FEA | JOY | SAD | SUR |
| --- | --- | --- | --- | --- | --- | --- |
| BERTLex | 0.837 | 0.736 | 0.790 | 0.879 | 0.831 | 0.739 |
| SSM | 0.405 | 0.293 | 0.375 | 0.570 | 0.405 | 0.089 |
6 Conclusion and Future Work
This paper presents a two-stage BERTLex-SSM framework for sequence-level emotion intensity recognition, targeted especially at label-scarce scenarios. By combining contextualized embeddings with static lexicon-based word embeddings and then modeling the initial predicted intensity scores with a State Space Model, our method exploits context-sensitive features enriched with external knowledge and captures emotion dynamics over time. Experimental results show that the proposed BERTLex-SSM effectively predicts emotion intensities in lyrics without requiring any annotated lyrics data.
Our findings and analysis point to a range of directions for future work:
Domain Adaptation.
While our method can use any verse-level model, including a purely lexicon-based one, in practice we obtained the best results by leveraging annotated sentence-level datasets. This naturally leads to a domain discrepancy, in our case between the news and lyrics domains. Given that unlabeled song lyrics are relatively easy to obtain, one direction is to incorporate unsupervised domain adaptation techniques (Ramponi and Plank, 2020) to improve the performance of the verse-level model. Semi-supervised learning (similar to Täckström and McDonald, 2011b) is another promising direction, although such methods would need to be adapted to the continuous nature of the emotion labels.
SSM Extensions.
Although the Kalman Filter and Kalman Smoother can refine the estimates, the LG-SSM is a linear model, and its simplicity makes it difficult to handle the wide variations in emotion dynamics. We hypothesize that non-linear SSM extensions (Julier and Uhlmann, 1997; Ito and Xiong, 2000; Julier and Uhlmann, 2004) may be a better fit for modeling emotion dynamics.
Multimodal Grounding.
Since the LyricsEmotions dataset is annotated on parallel acoustic and textual features, relying on lyrics alone can cause inconsistencies in the model. Extending our method to a multi-modal setting would remedy this issue in cases where identical lyrics appear in different verses accompanied by different musical features. Taking song structure (e.g., Intro - Verse - Bridge - Chorus) into account could also advance the modeling of emotion dynamics, assuming the direction in which emotion intensities change correlates with where in the song a verse is located.
Acknowledgments
The authors would like to thank Rada Mihalcea for sharing the LyricsEmotions dataset with us and the anonymous reviewers and editors for their constructive and helpful comments.
Notes
According to Mihalcea and Strapparava (2012), a “verse” is defined as a sentence or a line of lyrics.
We omit control matrix B and control vector ut in the transition equation, assuming no external influence.
We experimented with a multi-task model that predicted all six emotions jointly, but preliminary results showed that building separate models for each emotion performed better.
We perform five runs with different random seeds, using the mean, median, maximum or minimum to pool the results. Here we show the result of the best pooling method, but in practice we did not see any significant difference compared to the mean pooling.