Modeling Emotion Dynamics in Song Lyrics with State Space Models

Most previous work in music emotion recognition assumes a single or a few song-level labels for the whole song. While it is known that different emotions can vary in intensity within a song, annotated data for this setup is scarce and difficult to obtain. In this work, we propose a method to predict emotion dynamics in song lyrics without song-level supervision. We frame each song as a time series and employ a State Space Model (SSM), combining a sentence-level emotion predictor with an Expectation-Maximization (EM) procedure to generate the full emotion dynamics. Our experiments show that applying our method consistently improves the performance of sentence-level baselines without requiring any annotated songs, making it ideal for limited training data scenarios. Further analysis through case studies shows the benefits of our method while also indicating the limitations and pointing to future directions.


Introduction
Music and emotions are intimately connected, with almost all music pieces being created to express and induce emotions (Juslin and Laukka, 2004). As a key channel through which music conveys emotion, lyrics contain part of the semantic information that the melodies cannot express (Besson et al., 1998). Lyrics-based music emotion recognition has attracted increasing attention, driven by the demand to process massive collections of music tracks automatically, which is an important task for streaming and media service providers (Kim et al., 2010; Malheiro et al., 2016; Agrawal et al., 2021).
Most emotion recognition studies in Natural Language Processing (NLP) assume the text instance expresses a static, single emotion (Mohammad and Bravo-Márquez, 2017; Nozza et al., 2017; Mohammad et al., 2018a). However, emotion is dynamic and highly dependent on context, which makes the single-label assumption too simplistic in dynamic scenarios, not just in music (Schmidt and Kim, 2011) but also in other domains such as conversations (Poria et al., 2019b). Figure 1 shows an example of this dynamic behaviour, where the intensities of three different emotions vary within a song. Accurate emotion recognition systems should ideally be able to generate the full emotional dynamics for each song, as opposed to simply predicting a single label.
A range of datasets and corpora for modelling dynamic emotion transitions has been developed in the literature (McKeown et al., 2011; Li et al., 2017; Hsu et al., 2018; Poria et al., 2019a; Firdaus et al., 2020), but most of them do not use song lyrics as the domain and assume discrete, categorical labels for emotions (either the presence or absence of an emotion). To the best of our knowledge, the dataset of Mihalcea and Strapparava (2012) is the only one that provides full fine-grained emotion intensity annotations for song lyrics at the verse level. The lack of large-scale datasets for this task poses a challenge for traditional supervised methods. While previous work proposed methods for the similar sequence-based emotion recognition task, they all assume some level of annotated data at training time, from full emotion dynamics (Kim et al., 2015) to coarse, discrete document-level labels (Täckström and McDonald, 2011b).
The data scarcity problem motivates our main research question: "Can we predict emotion dynamics in song lyrics without requiring annotated lyrics?". In this work, we claim that the answer is affirmative. To show this, we propose a method consisting of two major stages: (1) a sentence- or verse-level regressor that leverages existing emotion lexicons, pre-trained language models and other sentence-level datasets, and (2) a State Space Model (SSM) that constructs the full song-level emotional dynamics given the initial verse-level scores. Intuitively, we treat each verse as a time step and the emotional intensity sequence as a latent time series that is inferred without any song-level supervision, directly addressing the limited-data problem. To the best of our knowledge, this scenario has not been addressed before, either for song lyrics or for other domains.
To summarize, our main contributions are:
• We propose a hybrid approach for verse-level emotion intensity prediction that combines emotion lexicons with a pre-trained language model (BERT (Devlin et al., 2019) in this work), trained on available sentence-level data.
• We further show that using SSMs to model the song-level emotion dynamics improves the performance of the verse-level approach without requiring any annotated lyrics.
• We perform a qualitative analysis of our best models, highlighting their limitations and pointing to directions for future work.

Background and Related Work
Emotion Models. Human emotion is a long-standing research field in psychology, with many studies aiming to define a taxonomy for emotions. In NLP, emotion analysis mainly employs datasets annotated according to either a categorical or a dimensional model. The categorical model assumes a fixed set of discrete emotions which can vary in intensity. Emotions can overlap but are assumed to be separate "entities" from each other, such as anger, joy and surprise. Taxonomies using the categorical model include Ekman's basic emotions (Ekman, 1993) and Plutchik's wheel of emotions (Plutchik, 1980). Dimensional models place emotions in a continuous space: the VAD (Valence, Arousal and Dominance) taxonomy of Russell (1980) is the most commonly used in NLP. In this work, we focus on the Ekman taxonomy for purely experimental purposes, as it is the one used in the available data we employ. However, our approach is general and could be applied to other taxonomies.

Dynamic Emotion Analysis. Emotion Recognition in Conversation (ERC) (Poria et al., 2019b), which focuses on tracking dynamic shifts of emotions, is the task most similar to our work. Within a conversation, the emotional state of each utterance is influenced by the previous state of the speaker and the stimulation from other parties (Li et al., 2020; Ghosal et al., 2021). An analogous assumption of real-time dynamic emotional change holds in music: the affective state of the current lyrics verse is correlated with the state of the previous verse(s) as a song progresses.
Contextual information in the ERC task is generally captured by deep learning models, which can be roughly categorized into sequence-based and graph-based methods (Hu et al., 2021). Sequence-based methods encode conversational context features using established methods such as Recurrent Neural Networks (Poria et al., 2017; Hazarika et al., 2018a,b; Majumder et al., 2019; Hu et al., 2021) and Transformer-based architectures (Zhong et al., 2019; Li et al., 2020). They also include more advanced and tailored methods such as the Hierarchical Memory Network (Jiao et al., 2020), the Emotion Interaction Network (Lu et al., 2020) and the Causal Aware Network (Zhao et al., 2022). Graph-based methods apply specific graphical structures to model dependencies in conversations (Ghosal et al., 2019; Zhang et al., 2019; Lian et al., 2020; Ishiwatari et al., 2020; Shen et al., 2021), typically utilizing Graph Neural Networks (Kipf and Welling, 2017). In contrast to these methods, we capture contextual information using a State Space Model, mainly motivated by the need for a method that can be trained without supervision. Extending and/or combining an SSM with a deep learning model is theoretically possible but non-trivial, and care must be taken in a low-data situation such as ours.
The time-varying nature of music emotions has also been investigated in music information retrieval (Caetano et al., 2012). To link human emotions with the music acoustic signal, emotion distributions were modelled as 2-D Gaussian distributions in the Arousal-Valence (A-V) space and used to predict A-V responses through multi-label regression (Schmidt et al., 2010; Schmidt and Kim, 2010). Building on these studies, Schmidt and Kim (2011) applied structured prediction methods to model complex emotion-space distributions as an A-V heatmap. These studies focus on the mapping between emotions and acoustic features/signals, while our work focuses on the lyrics component. Wu et al. (2014) developed a hierarchical Bayesian model that utilized both acoustic and textual features, but it was only applied to predict emotions as discrete labels (presence or absence) rather than fine-grained emotion intensities as in our work.
Combining Pre-trained Language Models with External Knowledge. Pre-trained language models (LMs) including BERT (Devlin et al., 2019), XLNet (Yang et al., 2019) and GPT (Brown et al., 2020) have achieved state-of-the-art performance on numerous NLP tasks. Considerable effort has been devoted to combining the context-sensitive features of LMs with factual or commonsense knowledge from structured sources, including domain-specific knowledge (Ying et al., 2019), structured semantic information (Zhang et al., 2020), language-specific knowledge (Alghanmi et al., 2020; De Bruyne et al., 2021) and linguistic features (Koufakou et al., 2020; Mehta et al., 2020). This auxiliary knowledge is usually infused into the architecture by concatenating it with the Transformer-based representation before the prediction layer of the downstream task. Our method utilizes rule-based representations derived from a set of affective lexicons to improve the performance of BERT by incorporating task-specific knowledge. The motivation for our proposal is the hypothesis that lexicon-based information will compensate for BERT's lack of proper representations of semantic and world knowledge (Rogers et al., 2021), making the model more stable across domains.
State Space Models. In NLP tasks such as Part-of-Speech (POS) tagging and Named Entity Recognition, contextual information is widely acknowledged to play an important role in prediction. This has led to the adoption of structured prediction approaches such as the Hidden Markov Model (HMM) (Rabiner and Juang, 1986), the Maximum Entropy Markov Model (MEMM) (McCallum et al., 2000) and the Conditional Random Field (CRF) (Lafferty et al., 2001), which relate a set of observable variables to a set of latent variables (e.g., words and their POS tags). State Space Models (SSMs) are similar to HMMs but assume continuous variables. The Linear Gaussian SSM (LG-SSM) is a special case of SSM in which all conditional probability distributions are linear and Gaussian.
Following the notation of Murphy (2012, Chap. 18), we briefly introduce the LG-SSM that we employ in our work. LG-SSMs take a sequence of observed variables y_{1:T} as input, and the goal is to draw inferences about the corresponding hidden states z_{1:T}, where T is the length of the sequence. Their relationship at each step t is given by

z_t = A z_{t-1} + ε_t,    ε_t ∼ N(0, Q)
y_t = C z_t + δ_t,        δ_t ∼ N(0, R)

where Θ = (A, C, Q, R) are the model parameters, ε_t is the system noise and δ_t is the observation noise. These are referred to as the transition and observation equations, respectively. Given Θ and a sequence y_{1:T}, the goal is to obtain the posteriors p(z_t) for each step t. In an LG-SSM, this posterior is Gaussian and can be obtained in closed form by applying the celebrated Kalman Filter (Kalman, 1960).
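As a concrete illustration, the filtering recursion for a one-dimensional latent state (one emotion intensity tracked across verses) can be sketched as below. This is a generic scalar Kalman Filter written for clarity, not the exact implementation used in the experiments; all parameter values in the demo are invented.

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, mu0, V0):
    """Run a Kalman Filter over observations y (shape [T]) for a scalar
    latent state. Returns the filtered posterior means and variances."""
    T = len(y)
    mus, Vs = np.zeros(T), np.zeros(T)
    mu, V = mu0, V0
    for t in range(T):
        if t > 0:
            # Predict: propagate the previous posterior through the transition model
            mu = A * mu
            V = A * V * A + Q
        # Update: correct the prediction using the observation y[t]
        S = C * V * C + R          # innovation variance
        K = V * C / S              # Kalman gain
        mu = mu + K * (y[t] - C * mu)
        V = (1 - K * C) * V
        mus[t], Vs[t] = mu, V
    return mus, Vs

# Demo: noisy observations around a constant intensity of roughly 5
y = np.array([5.2, 4.9, 5.1, 5.0, 4.8])
mus, Vs = kalman_filter(y, A=1.0, C=1.0, Q=0.01, R=0.1, mu0=y[0], V0=1.0)
```

The gain K trades off the transition prior against each new observation: a small observation noise R pulls the posterior mean toward the data, while a small system noise Q keeps it close to the prediction.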
Other latent-variable models have been proposed to estimate temporal dynamics of emotions and sentiments in product reviews (McDonald et al., 2007; Täckström and McDonald, 2011a,b) and blogs (Kim et al., 2015). McDonald et al. (2007) and Täckström and McDonald (2011a,b) combined document-level and sentence-level supervision as observed variables conditioning the latent sentence-level sentiment. Kim et al. (2015) introduced a continuous variable y_t that solely determines the sentiment polarity z_t, while z_t is conditioned on both y_t and z_{t-1} for each t in the LG-SSM.

Method
We propose a two-stage method to predict emotion dynamics without requiring annotated song lyrics. The first stage is a verse-level model that predicts initial scores for each song verse, where we use a hybrid approach combining lexicons and sentence-level annotated data from a different domain (§3.1). The second stage contextualizes these scores in the entire song, incorporating them into an LG-SSM trained in an unsupervised way (§3.2).

Task Formalization. Let d_x^y denote the real-valued intensity of emotion y for a sentence/verse x, where x ∈ X and y ∈ Y. Here Y = {y_1, y_2, ..., y_c} is a set of c labels, each representing one of the basic emotions (c = 6 for the datasets we use). We assume a source dataset D_s of sentences annotated with emotion intensities, and a target dataset D_t = {S_1, ..., S_{|D_t|}}, where |D_t| is the number of sequences (i.e., songs). In song S_i, the j-th verse v_j is associated with c emotion intensities E_j = {d_{v_j}^{y_1}, d_{v_j}^{y_2}, ..., d_{v_j}^{y_c}}. Given the homogeneity of the label spaces of D_s and D_t, a model trained on D_s can be applied directly to predict on D_t. The output of the verse-level model is the emotion intensity predictions Ŷ ∈ R^{N×c}, where N is the total number of verses in D_t. Finally, we use Ŷ as the input sequences of the song-level model to produce refined emotion intensity sequences Ẑ (one length-T_i sequence of c intensities per song).

Verse-Level Model
Emotion lexicons provide information on associations between words and emotions (Ramachandran and de Melo, 2020), which have proven beneficial in recognising textual emotions (Mohammad et al., 2018b; Zhou et al., 2020). Given that we would like to acquire accurate initial predictions at the verse level, we opt for a hybrid methodology that combines learning-based and lexicon-based approaches to enhance the feature representation.
Overview. The verse-level model architecture, called BERTLex, is illustrated in Figure 2. BERTLex consists of three phases: (1) the embedding phase, (2) the integration phase, and (3) the prediction phase. In the embedding phase, the input sequence is represented both as contextualized embeddings from BERT and as static word embeddings from lexicons. In the integration phase, contextualized and static word embeddings are concatenated at the sentence level by applying pooling operations to the two embeddings separately. The prediction phase encodes the integrated sequence of feature vectors and performs verse-level emotion intensity regression, using D_s as the training/development set and D_t as the test set.
Embedding Phase. The input sentence S is tokenized in two ways: one for the pre-trained language model and the other for the lexicon-based word embedding. These two tokenized sequences are denoted T_cxt and T_lex, respectively. T_cxt is fed into the pre-trained BERT to produce a sequence of contextualized word embeddings E_cxt. To capture task-specific information, a lexicon embedding layer encodes a sequence of emotion and sentiment word associations for T_lex, generating a sequence of lexicon-based embeddings E_lex, where D_lex is the lexical embedding vector dimension. We first build the vocabulary V from the text of D_s and D_t. For each word v_i in V of T_lex, we use d lexicons to generate the rule-based feature vectors ℓ_i = {ℓ_i^1, ℓ_i^2, ..., ℓ_i^d}, where ℓ_i^j is the lexical feature vector for word v_i derived from the j-th lexicon and D_lex = |ℓ_i|. Additionally, we perform a degree-p polynomial expansion on each feature vector ℓ_i^j.
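The lexicon lookup and polynomial expansion can be sketched as follows. The word scores and the 4-dimensional layout of the toy lexicon are invented for illustration; they are not the actual lexicons of Table 1, and the expansion is a generic all-monomials version of the degree-p scheme described above.

```python
import numpy as np
from itertools import combinations_with_replacement

# Toy stand-in for one affective lexicon: word -> (valence, anger, joy, sadness).
toy_lexicon = {
    "rain": (0.3, 0.0, 0.1, 0.6),
    "love": (0.9, 0.0, 0.8, 0.1),
}

def lexicon_vector(word, lexicon, dim=4):
    """Rule-based feature vector for one word from one lexicon;
    out-of-vocabulary words map to a zero vector."""
    return np.array(lexicon.get(word, (0.0,) * dim))

def poly_expand(vec, degree=3):
    """Degree-p polynomial expansion: all monomials of degree 0..p
    over the entries of vec (the degree-0 term is the bias)."""
    feats = []
    for p in range(degree + 1):
        for idx in combinations_with_replacement(range(len(vec)), p):
            feats.append(np.prod(vec[list(idx)]) if idx else 1.0)
    return np.array(feats)

# 4 input features expand to C(4+3, 3) = 35 polynomial features at degree 3
expanded = poly_expand(lexicon_vector("rain", toy_lexicon), degree=3)
```

The expansion grows combinatorially with the input dimension, which is why applying it per lexicon (rather than to the full concatenated vector) keeps the feature count manageable.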
Integration Phase. As BERT uses the WordPiece tokenizer (Wu et al., 2016) to split some words into sequences of subwords, the contextualized embeddings cannot be directly concatenated with the different-sized static word embeddings. Inspired by Alghanmi et al. (2020), we combine the contextualized and static word embeddings at the sentence level by pooling the two embeddings E_cxt and E_lex separately. To perform initial feature extraction, the raw embeddings are passed through 1-D convolutions:

F_cxt = Conv1D_k(E_cxt; W_1, b_1)
F_lex = Conv1D_k(E_lex; W_2, b_2)

where W_1, b_1, W_2 and b_2 are trainable parameters and k is the kernel size. We then apply average pooling and max pooling to the two feature maps, respectively, yielding Ẽ_cxt and Ẽ_lex. Finally, the contextualized and lexicon-based representations are merged via a concatenation layer as Ẽ_cxt ⊕ Ẽ_lex.
Prediction Phase. The prediction phase outputs the emotion intensity predictions Ŷ = {ŷ_1, ŷ_2, ..., ŷ_N} using a single dropout (Srivastava et al., 2014) layer and a linear regression layer. During training, the mean squared error loss is computed and backpropagated to update the model parameters.
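The integration and prediction phases can be pictured with a small NumPy mock-up. The weights are random stand-ins rather than trained parameters, and the dimensions (BERT-base hidden size 768, 25 lexicon features, 128 filters, kernel size 3) follow the experimental settings reported later; this is an illustrative sketch, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, W, b):
    """Valid 1-D convolution over a sequence x of shape (L, D_in)
    with kernel W of shape (k, D_in, D_out); returns (L - k + 1, D_out)."""
    k, L = W.shape[0], x.shape[0]
    return np.stack([
        np.tensordot(x[t:t + k], W, axes=([0, 1], [0, 1])) + b
        for t in range(L - k + 1)
    ])

# Stand-ins for the two embedding sequences of one verse:
# contextualized (BERT) subword embeddings and lexicon-based word embeddings.
E_cxt = rng.normal(size=(12, 768))   # 12 subwords, BERT-base hidden size
E_lex = rng.normal(size=(8, 25))     # 8 words, 25 lexicon features

# Separate Conv1D feature extractors (128 filters, kernel size 3).
W1, b1 = rng.normal(size=(3, 768, 128)) * 0.01, np.zeros(128)
W2, b2 = rng.normal(size=(3, 25, 128)) * 0.01, np.zeros(128)

# Average-pool the contextualized features, max-pool the lexicon features,
# then concatenate at the sentence level.
feat_cxt = conv1d(E_cxt, W1, b1).mean(axis=0)    # (128,)
feat_lex = conv1d(E_lex, W2, b2).max(axis=0)     # (128,)
merged = np.concatenate([feat_cxt, feat_lex])    # (256,)

# Prediction phase: a single linear regression layer on the merged vector.
w_out, b_out = rng.normal(size=256) * 0.01, 0.0
intensity = merged @ w_out + b_out               # scalar emotion intensity
```

Pooling each stream separately sidesteps the length mismatch between the subword sequence and the word sequence, since both streams are reduced to fixed-size sentence vectors before concatenation.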

Song-Level Model
After obtaining the initial verse-level predictions, the next step incorporates them into a song-level model using an LG-SSM. Consider one emotion type as an example. We treat the predicted scores of this emotion for each song as an observed sequence ŷ_i. That is, we group the N predictions in Ŷ into |D_t| sequences {ŷ_1, ŷ_2, ..., ŷ_{|D_t|}}. For the i-th song, the observed sequence ŷ_i = y_{1:T} is then used in an LG-SSM to obtain the latent sequence ẑ_{1:T} that represents the song-level emotional dynamics, where T is the number of verses in the song.
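Turning the flat verse-level prediction matrix into per-song observation sequences is a simple split; a sketch, with the verse counts invented for illustration:

```python
import numpy as np

def group_by_song(Y_hat, verse_counts):
    """Split the flat verse-level predictions Y_hat (shape [N, c]) into one
    [T_i, c] observation sequence per song, where T_i is the number of
    verses in song i and sum(T_i) == N."""
    boundaries = np.cumsum(verse_counts)[:-1]
    return np.split(Y_hat, boundaries)

# Toy example: 3 songs with 2, 3 and 1 verses; c = 6 Ekman emotions.
Y_hat = np.arange(36, dtype=float).reshape(6, 6)
songs = group_by_song(Y_hat, [2, 3, 1])
```

Each column of a per-song block is then a univariate observation sequence for one emotion, which is what the LG-SSM consumes.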
Standard applications of the LG-SSM assume a temporal ordering of the sequence. This means that estimates of p(ẑ_t) should depend only on the observed values up to verse step t (i.e., y_{1:t}), which is the central assumption of the Kalman Filter algorithm. Given the sequence of observations, we recursively apply the Kalman Filter to calculate the mean and variance of the hidden states; the computation steps are displayed in Algorithm 1.
Since we have obtained initial predictions for all verses in a song, we can assume that observed emotion scores are available for the entire song a priori. In other words, we can include "future" data (i.e., y_{t+1:T}) to estimate the latent posteriors p(ẑ_t). This is achieved by the Kalman smoothing algorithm, also known as the RTS smoother (Rauch et al., 1965), shown in Algorithm 2.
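A scalar sketch of the filter-then-smooth procedure follows; it mirrors the structure of Algorithms 1 and 2 (forward filtering pass, then a backward RTS recursion that folds in the future observations) but is a generic reimplementation, not the paper's exact pseudocode.

```python
import numpy as np

def kalman_smooth(y, A, C, Q, R, mu0, V0):
    """Kalman Filter followed by an RTS smoothing pass (scalar case).
    Returns the smoothed means of the latent intensity sequence."""
    T = len(y)
    mu_f, V_f = np.zeros(T), np.zeros(T)   # filtered moments
    mu_p, V_p = np.zeros(T), np.zeros(T)   # one-step predicted moments
    mu, V = mu0, V0
    # Forward pass: standard Kalman Filter, storing predicted moments too.
    for t in range(T):
        if t > 0:
            mu, V = A * mu, A * V * A + Q
        mu_p[t], V_p[t] = mu, V
        K = V * C / (C * V * C + R)
        mu = mu + K * (y[t] - C * mu)
        V = (1 - K * C) * V
        mu_f[t], V_f[t] = mu, V
    # Backward pass: RTS recursion folds in the "future" data y[t+1:].
    mu_s, V_s = mu_f.copy(), V_f.copy()
    for t in range(T - 2, -1, -1):
        J = V_f[t] * A / V_p[t + 1]                      # smoother gain
        mu_s[t] = mu_f[t] + J * (mu_s[t + 1] - mu_p[t + 1])
        V_s[t] = V_f[t] + J * (V_s[t + 1] - V_p[t + 1]) * J
    return mu_s

# Demo: a flat sequence stays flat after smoothing.
ys = np.full(6, 2.0)
zs_const = kalman_smooth(ys, A=1.0, C=1.0, Q=0.01, R=0.1, mu0=2.0, V0=1.0)
```

Because the backward pass re-estimates each step using the whole sequence, the smoothed trajectory is typically less noisy than both the observations and the filtered estimates.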
As opposed to most other algorithms, the Kalman Filter and Kalman Smoother are applied with parameters that are already known. Hence, learning the SSM amounts to estimating the parameters Θ. If gold-standard values for the complete z_{1:T} were available, Θ could be learned via Maximum Likelihood Estimation (MLE). When only the noisy observed sequences y_{1:T} are present, the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) provides an iterative method for finding the MLE of Θ by successively maximizing the conditional expectation of the complete-data likelihood until convergence.

Experiments
Our experiments aim to evaluate the proposed method for predicting the emotional dynamics of song lyrics without utilizing any annotated lyrics data. We introduce the datasets, lexicon resources and evaluation metric used (§4.1), and discuss the implementation details and experimental settings of the verse-level model (§4.2) and the song-level model (§4.3).

Datasets and Evaluation
LyricsEmotions. This corpus was developed by Mihalcea and Strapparava (2012) and consists of 100 popular English songs with 4,975 verses in total. The number of verses per song varies from 14 to 110. The LYRICSEMOTIONS dataset was constructed by extracting the parallel alignment of musical features and lyrics from MIDI tracks. The lyrics were annotated at the verse level using Mechanical Turk, with real-valued intensity scores ranging from 0 to 10 for each of Ekman's six emotions (Ekman, 1993): ANGER, DISGUST, FEAR, JOY, SADNESS and SURPRISE. Given that our goal is to predict emotion dynamics without relying on song-level annotations, we use this dataset for evaluation purposes only.
NewsHeadlines. To train the verse-level model, we employ the NEWSHEADLINES dataset (Strapparava and Mihalcea, 2007), a collection of 1,250 news headlines. Each headline is annotated with six scores ranging from 0 to 100, one for each of Ekman's emotions, and one score ranging from -100 to 100 for valence.
Lexicons. Following Goel et al. (2017) and Meisheri and Dey (2018), we use nine emotion- and sentiment-related lexicons, summarized in Table 1, to obtain feature vectors from the text in NEWSHEADLINES and LYRICSEMOTIONS.
Evaluation. In line with Mihalcea and Strapparava (2012), we use the Pearson correlation coefficient (r) as the evaluation metric to measure the correlation between the predictions and the ground-truth emotion intensities. To assess statistical significance, we conduct the Williams test (Williams, 1959) on the differences between the Pearson correlations of each pair of models.
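The evaluation metric is straightforward to compute; a toy example with invented gold and predicted intensities for one emotion:

```python
import numpy as np
from scipy.stats import pearsonr

# Invented gold-standard and predicted intensities for one emotion.
gold = np.array([0.0, 2.5, 5.0, 7.5, 10.0, 4.0])
pred = np.array([0.5, 2.0, 4.5, 8.0, 9.0, 3.5])

# r near 1 indicates the predictions track the gold intensities closely;
# p_value tests the null hypothesis of no correlation.
r, p_value = pearsonr(gold, pred)
```

Pearson's r is scale-invariant, so a model can score well even if its predicted intensities are systematically offset from the gold range, as long as the relative ordering and spacing are preserved.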
For baselines: our method is unsupervised at the song level, and we are not aware of prior work tackling a similar setting. We therefore use the results of the verse-level model as our main baseline. We argue that this is a fair comparison, since the SSM-based model does not require additional data.

BERTLex. The sequence of embeddings for each token, including [CLS] and [SEP], at the output of the last layer of the BERT-base model is fed into a Conv1D layer with 128 filters and a kernel size of 3, followed by a 1-D global average pooling layer.
We concatenate nine vector representations for every word in the established vocabulary, using the lexicons in Table 1 in the same order, to form a single feature vector. As a result, the whole word embedding has shape (3309, 25), where 3309 is the vocabulary size and 25 is the number of lexicon-based features. To validate whether adding polynomial features yields better predictions, we also perform a polynomial feature expansion with degree 3, extending the shape of the vector representations to (3309, 267). The static word embeddings are then fed into a Conv1D layer with 128 filters and a kernel size of 3, followed by a global max-pooling layer.
The two pooled vectors are then concatenated through a Concatenate layer, as they have the same dimensionality. We generate the emotion intensity predictions using a Linear layer with a single neuron for regression.

Training. Instead of using the standard train/dev/test split of the NEWSHEADLINES dataset, we apply 10-fold cross-validation to tune the hyperparameters of the BERT-based models. The empirically tuned hyperparameters are listed in Table 2 and are adopted in the subsequent experiments unless otherwise specified. After tuning, the final models are trained on the full NEWSHEADLINES data with this set of hyperparameters. We use an ensemble of five runs, taking the mean of the predictions as the final output.

Song-Level Experiments
We use the pykalman library (version 0.9.2), which implements the Kalman Filter, the Kalman Smoother and the EM algorithm, to train the SSMs. We fix the initial state mean to the first observed value in the sequence (i.e., each song's first verse-level prediction) and the initial state covariance to 2. We then conduct experiments with several groups of parameters (transition matrices A, transition covariance Q, observation matrices C and observation covariance R) to initialise the Kalman Filter and Kalman Smoother. For parameter optimization, we experiment with n_iter = {1, 3, 5, 7, 10} to control the number of EM algorithm iterations. Additionally, we apply 10-fold cross-validation when choosing the optimal parameters via EM, meaning each song is processed by a Kalman Filter or Kalman Smoother defined by the optimal parameters obtained from training on the other 90 songs.

Results and Analysis
In this section, we report and discuss the results of the experiments. We first compare the results of our lexicon-based, learning-based and hybrid methods at the verse level (§5.1). We then provide the results of the song-level models and investigate the impact of initial predictions from verse-level models, SSM parameters, and parameter optimization (§5.2). We additionally present qualitative case studies to understand our model's abilities and shortcomings (§5.3).

Results of Verse-level Models
Table 3 shows the results of the verse-level models on the NEWSHEADLINES (average over 10-fold cross-validation) and LYRICSEMOTIONS (used as the test set) datasets. The domain difference between news and lyrics is significant, as we can observe from the different performances of the BERT-based models on the two datasets. Overall, our BERTLex method outperforms the lexicon-only and BERT-only baselines and exhibits the highest Pearson correlation of 0.503 (BERTLex poly for JOY) on LYRICSEMOTIONS.
Taking a closer look at the results on LYRICSEMOTIONS, we also observe the following:
• The addition of lexicons for incorporating external knowledge consistently improves the performance of BERT-based models.
• BERTLex models with polynomial feature expansion are better than those without, except for DISGUST.
• Our models are worst at predicting the emotion intensities of SURPRISE (r lower than 0.1), which is in line with similar work on other datasets annotated with the Ekman taxonomy.

Results of Song-level Models
Extensive experiments confirm that our song-level models utilizing the Kalman Filter and Kalman Smoother can improve the initial predictions from the verse-level models combining BERT and lexicons (see Table 4 and Table 5). The LG-SSMs with EM-optimized parameters always perform better than those without EM. Furthermore, the performance improvements of the strongest SSMs over their corresponding verse-level baselines are statistically significant at the 0.05 level (marked with *), except for SURPRISE.
In theory, the Kalman Smoother should perform better than the Kalman Filter, since the former utilizes all observations in the whole sequence. According to our experimental results, however, the best-performing algorithm depends on the emotion. On the other hand, running the EM algorithm consistently improves the results of SSMs over simply using the initial parameter values, except for SURPRISE.
Impact of initial predictions. The results of the Kalman Filter, Kalman Smoother and EM algorithm are tied to the initial scores predicted by the verse-level models. For the same emotion, we compare the results based on the mean predictions from the BERTLex models with and without polynomial expansion of the lexical features, respectively (shown in Table 4). We observe that the higher the Pearson correlation between the ground truth and the verse-level predictions, the more accurate the estimates obtained with LG-SSMs. The strongest SSMs also differ across emotion types and initial predictions, as denoted in boldface.
Impact of initial parameters. The results of the Kalman Filter and Kalman Smoother are sensitive to the initial model parameters. As displayed in Table 5, when we change only the value of the transition matrices A and fix the other parameters, the performance of the Kalman Filter and Kalman Smoother can degrade, even below the verse-level baseline. Fortunately, this degradation due to poor initial parameter values can be mitigated by optimizing the parameters with the EM algorithm.
Impact of parameter optimization. For either the Kalman Filter or the Kalman Smoother, using the EM algorithm to optimize the parameters increases Pearson's r in most cases. In our experiments, the number of iterations does not significantly influence the performance of the EM algorithm, and 5 to 10 iterations usually produce the strongest results.

Qualitative Case Studies
To obtain some insights into further improvement, we examine the errors that our models make. Broadly, ground-truth curves fall into two types: those with large verse-to-verse fluctuations and those with flat or smoothly varying emotional dynamics (see the third sub-figure in Figure 4). The trend of the song-level estimates is similar to that of the verse-level predictions. Due to the Gaussian assumption, the Kalman Filter and Kalman Smoother tend to flatten or smooth the verse-level prediction curves. This means that applying LG-SSMs can somewhat reduce errors for the second type of emotion dynamics curve. For the first type, however, the Kalman Filter and Kalman Smoother make the results worse, as smoother estimates are not desirable in that situation.
Using text solely. Lyrics in LYRICSEMOTIONS are synchronised with acoustic features, so some verses with identical text are labelled with different emotional intensities. For instance, in Table 6, the verse "When it rain and rain, it rain and rain" repeats multiple times in the song Rain by Mika, and its gold-standard SADNESS labels differ across verses as the emotion progresses with the music. However, the verse-level models can only produce identical predictions, since these verses share exactly the same text and the models do not consider the context of the whole song. Consequently, the emotion scores of these verses as predicted by the LG-SSMs remain close, as the song-level results are highly dependent on the initial predictions from BERTLex.

• While our method could apply any general verse-level model, including a purely lexicon-based one, in practice we obtained the best results by leveraging annotated sentence-level datasets. This naturally leads to a domain discrepancy: in our case, between the news and lyrics domains. Given that unlabelled song lyrics are relatively easy to obtain, one direction is to incorporate unsupervised domain adaptation techniques (Ramponi and Plank, 2020) to improve the performance of the verse-level model. Semi-supervised learning (similar to Täckström and McDonald (2011b)) is another promising direction, although such methods would need to be modified to handle the continuous nature of the emotion labels.
• Although the Kalman Filter and Kalman Smoother can refine the estimates, the simplicity of the LG-SSM makes it difficult to deal with the wide variations in emotion-space dynamics, given that it is a linear model. We hypothesize that non-linear SSM extensions (Julier and Uhlmann, 1997; Ito and Xiong, 2000; Julier and Uhlmann, 2004) might be a better fit for modelling emotion dynamics.
• As the LYRICSEMOTIONS dataset is annotated over parallel acoustic and textual features, using lyrics as the sole input can introduce inconsistencies into the model. Extending our method to a multi-modal setting would remedy this issue when identical lyrics appear in different verses accompanied by different musical features. Taking knowledge of song structure (e.g., Intro - Verse - Bridge - Chorus) into account also has the potential to improve the recognition of emotion dynamics, assuming the direction (up or down) in which emotion intensities change is correlated with where in the song a verse is located.

Figure 1 :
Figure 1: An illustration of the emotion dynamics of a song in the LYRICSEMOTIONS dataset of Mihalcea and Strapparava (2012). Note that the intensities of each emotion vary from verse to verse within the song.

Figure 2 :
Figure 2: BERTLex architecture used for the verse-level model.

Figure 3 :
Figure 3: The SURPRISE emotion intensities of ground truth (all zeros), BERTLex model and SSM in an example song.

Figure 4 :
Figure 4: Emotional dynamics of ANGER, DISGUST and SURPRISE in Bad Romance by Lady Gaga: Pearson's r between the ground truth and the predictions of BERTLex poly and the estimates of the Kalman Filter are reported, respectively.

Table 1 :
Lexicons used to build lexicon-based feature vectors: PT is the feature vector size after the polynomial feature expansion.

Table 2 :
Hyperparameter settings of BERT and CNN models.

Table 3 :
Pearson correlations between gold-standard labels and predictions of the verse-level models on the NEWSHEADLINES (NH) and LYRICSEMOTIONS (LE) datasets.

Table 5 :
Pearson correlations between gold-standard labels and SSMs with different values of transition matrices A, on the basis of BERTLex poly models (as listed in the bottom half of Table

Table 6 :
SADNESS scores of verses with the same lyrics, "When it rain and rain, it rain and rain", but different gold-standard labels in the song.

Conclusion and Future Work
This paper presents a two-stage BERTLex-SSM framework for sequence-based emotion intensity recognition. By combining contextualized embeddings with static word embeddings and then modelling the initial predicted intensity scores with a State Space Model, our method can exploit context-sensitive features together with external knowledge and capture dynamic emotional transitions. Experimental results show that the proposed BERTLex-SSM effectively predicts emotion intensities in lyrics without requiring annotated lyrics data. Our analysis in Section 5.3 points to a range of directions for future work.