Abstract
Majority voting and averaging are common approaches used to resolve annotator disagreements and derive single ground truth labels from multiple annotations. However, annotators may systematically disagree with one another, often reflecting their individual biases and values, especially in the case of subjective tasks such as detecting affect, aggression, and hate speech. Annotator disagreements may capture important nuances in such tasks that are often ignored while aggregating annotations to a single ground truth. In order to address this, we investigate the efficacy of multi-annotator models. In particular, our multi-task based approach treats predicting each annotator's judgements as a separate subtask, while sharing a common learned representation of the task. We show that this approach yields the same or better performance than aggregating labels in the data prior to training across seven different binary classification tasks. Our approach also provides a way to estimate uncertainty in predictions, which we demonstrate correlates better with annotation disagreements than traditional methods do. Being able to model uncertainty is especially useful in deployment scenarios where knowing when not to make a prediction is important.
1 Introduction
Obtaining multiple annotator judgements on the same data instances is a common practice in NLP in order to improve the quality of final labels (Snow et al., 2008; Nowak and Rüger, 2010). Disagreements between annotations are often resolved by majority voting, averaging (Sabou et al., 2014), or adjudication by an "expert" (Waseem and Hovy, 2016), to derive a singular ground truth or gold label that is later used for training supervised machine learning models. However, in many subjective tasks, there often exists no single "right" answer (Alm, 2011), and enforcing a single ground truth sacrifices the valuable nuances embedded in annotators' assessments of the stimuli and their disagreements (Aroyo and Welty, 2013; Cheplygina and Pluim, 2018).
Annotators’ sociodemographic factors, moral values, and lived experiences often influence their interpretations, especially in subjective tasks such as identifying political stances (Luo et al., 2020), sentiment (Díaz et al., 2018), and online abuse and hate speech (Cowan and Khatchadourian, 2003; Waseem, 2016; Patton et al., 2019). For instance, Waseem (2016) found that feminist and anti-racist activists systematically disagree with crowd workers on their hate speech annotations. Similarly, annotators’ political affiliation affects how they annotate the neutrality of political stances (Luo et al., 2020). An adverse effect of majority vote in such cases is limiting representation of minority perspectives in data (Prabhakaran et al., 2021), potentially reinforcing societal disparities and harms.
Another consequence of majority voting, when applied to subjective annotations, is that the resulting labels may not be internally consistent. For example, consider a scenario where a sentence in a hate-speech dataset is annotated by a set of annotators, the majority of whom consider a phrase in it to be offensive, yet another sentence with the same phrase is annotated by a different set of annotators, the majority of whom do not find the phrase to be offensive. Upon majority vote, the first sentence would be labeled as hate speech and the second sentence would not, despite containing similar content. Such inconsistencies in the majority label will add noise to the learning step, while the systematicity in the individual annotations is lost.
Finally, majority vote and similar aggregation approaches assume that an annotator's judgements about different instances are independent from one another. However, as outlined above, annotators' decisions are often correlated, reflecting their subjective biases. Prior work has investigated Bayesian methods to account for such systematic differences between annotators (Paun et al., 2018); however, these methods treat this as an alternative means to derive a single ground truth label, thereby masking the degree to which annotators disagreed.
Our proposed solution is simple: We introduce multi-annotator architectures to preserve and model the internal consistency in each annotator's labels as well as their systematic disagreements with other annotators. We show that the multi-task framework (Liu et al., 2019) provides an efficient way to implement a multi-annotator architecture that captures the differences between individual annotators' perspectives using the subset of data instances they labeled, while also benefiting from the shared underlying layers fine-tuned for the task using the entire dataset. Preserving different annotators' perspectives until the prediction step provides better flexibility for downstream applications. In particular, we demonstrate that it provides better estimates for uncertainty in predictions. This will improve decision making in practice—for instance, to determine when not to make a prediction or when to recommend a manual review.
Our contributions in this paper are three-fold: (1) We develop an efficient multi-annotator strategy that matches or outperforms baseline models on seven different subjective tasks by preserving annotators’ individual and collective perspectives throughout the training process. (2) We obtain an interpretable way to estimate model uncertainty that better correlates with annotator disagreements than traditional uncertainty estimates across all seven tasks. (3) We demonstrate that model uncertainty correlates with certain types of error, providing a useful signal to avoid erroneous predictions in real-world deployments.
2 Literature Review
Learning to recognize and interpret subjective language has a long history in NLP (Wiebe et al., 2004; Alm, 2011). While all human judgements embed some degree of subjectivity, it is commonly agreed that certain NLP tasks tend to be more subjective in nature. Examples of such relatively subjective tasks include sentiment analysis (Pang and Lee, 2004; Liu et al., 2010), affect modeling (Alm, 2008; Liu et al., 2003), emotion detection (Hirschberg et al., 2003; Mihalcea and Liu, 2006), and hate speech detection (Warner and Hirschberg, 2012). Alm (2011) argues that achieving a single real “ground truth” is not possible, nor essential, in subjective tasks, and calls for finding ways to model subjective interpretations of annotators, rather than seeking to reduce the variability in annotations. While which NLP tasks count as subjective may be contested, we focus on two tasks that are markedly subjective in nature.
2.1 Detecting Online Abuse
NLP-aided approaches to detecting abusive behavior online are an active research area (Schmidt and Wiegand, 2017; Mishra et al., 2019; Corazza et al., 2020). Researchers have developed typologies of online abuse (Waseem et al., 2017), constructed datasets annotated with different types of abusive language (Warner and Hirschberg, 2012; Price et al., 2020; Vidgen et al., 2021), and built NLP models to detect them efficiently (Davidson et al., 2017; Mozafari et al., 2019). Researchers have also expanded the focus to more subtle forms of abuse such as condescension and microaggressions (Breitfeller et al., 2019; Jurgens et al., 2019).
However, recent research has demonstrated that these models tend to reflect and propagate various societal biases, causing disparate harms to marginalized groups. For instance, toxicity prediction models were shown to exhibit biases around mentions of certain identity terms (Dixon et al., 2018), specific named entities (Prabhakaran et al., 2019), and disabilities (Hutchinson et al., 2020). Similarly, these models are shown to overestimate the prevalence of toxicity in African American Vernacular English (Sap et al., 2019; Davidson et al., 2019; Zhou et al., 2021). Most of these studies demonstrate association biases present in data; for instance, Hutchinson et al. (2020) show that discussions about mental illness are often associated with topics such as gun violence, homelessness, and drugs, which is likely the reason for the learned association of mental illness related terms with toxicity. While whether a piece of text is hateful also depends on its context (Prabhakaran et al., 2020), not much work has investigated the human annotator biases present in the training labels, and how they impact downstream predictions.
2.2 Detecting Emotions
Detecting emotions from language has been a significant area of research in NLP for the past two decades (Liscombe et al., 2003; Aman and Szpakowicz, 2007; Desmet and Hoste, 2013; Hirschberg and Manning, 2015; Poria et al., 2019). Annotated datasets used for training emotion detection models vary across domains, and use different taxonomies of emotions. While several datasets (Strapparava and Mihalcea, 2007; Buechel and Hahn, 2017) include a small set of labels representing the six Ekman emotions (anger, disgust, fear, joy, sadness, and surprise; Ekman, 1992), or bipolar dimensions of affect (arousal and valence; Russell, 2003), others such as Demszky et al. (2020) and Crowdflower (2016) include a wider range of emotion labels according to the Plutchik emotion wheel (Plutchik, 1980) or the complex semantic space of emotions (Cowen et al., 2019). Perceiving emotions is a subjective task affected by various contextual factors, such as time, speaker, mood, personality, and culture (Mower et al., 2009). Since aggregating annotations of emotion expressions loses such contextual nuances, some researchers provide a distributional representation of emotions (Fayek et al., 2016; Ando et al., 2018). Here, we use annotations for the six Ekman emotions present in the dataset released by Demszky et al. (2020) to demonstrate how our multi-annotator approach can capture emotions in a disaggregated fashion.
2.3 Annotation Disagreement
Researchers have studied different sources of annotator disagreements. Krippendorff (2011) argued that there are at least two types of disagreement in content coding: random variation, which comes as an unavoidable by-product of human coding, and systematic disagreement, which is influenced by features of the data or annotators. Dumitrache (2015) identifies different sources of disagreement as (a) the clarity of an annotation label (i.e., task descriptions), (b) the ambiguity of the text, and (c) differences among workers. Aroyo and Welty (2013) also studied inter-annotator disagreement in association with features of the input, showing that it reflects semantic ambiguity of the training instances. Textual features have been shown to predict annotators' disagreement in determining the meaning of ambiguous words (Alonso et al., 2015). Acknowledging inter-annotator disagreement as an indicator of annotator differences, Kairam and Heer (2016) clustered crowd-workers based on their annotation behaviors, and proposed a method for interpreting annotation disagreements and their sources.
For highly subjective tasks such as hate speech and emotion detection, annotation disagreements can be rooted in the differing subjectivities and value systems of annotators. In these cases, annotators build a subjective social reality as a basis for social judgements and behaviors (Greifeneder et al., 2017), which shapes how they label. For example, in interviews with annotators in an aggression labeling task, Patton et al. (2019) found that expert annotators from communities discussed in gang-related tweets drew on their lived experience to produce different label judgements compared with graduate student researchers. Such annotators whose lived experiences bring important perspectives to the task would be dramatically underrepresented on generic crowd work platforms and, by definition, would be outvoted in disagreements subject to majority vote. Majority vote also necessarily obfuscates differences among groups underrepresented in annotator pools, such as older adults who can exhibit views on aging distinct from crowd workers (Díaz, 2020), the majority of whom tend to be younger (Ross et al., 2010).
Some studies have proposed alternatives to majority voting when aggregating multiple annotations. In early work, Dawid and Skene (1979) used the EM algorithm to obtain maximum likelihood estimates of the "true" label to account for annotator errors. De Marneffe et al. (2012) used individual annotation distributions to predict areas of uncertainty in veridicality assessment. Hovy et al. (2013) proposed an approach based on an item-response model that uses posterior entropy to choose which annotators are trustworthy. Waterhouse (2013) developed a pointwise mutual information metric to quantify the amount of information in an annotator's judgement that can be used to estimate the "correct" label of an instance. Gordon et al. (2021) explore multiple annotators' judgements to disentangle stable opinions from noise by estimating intra-annotator consistency. All these approaches aim to obtain the "correct" label, accounting for erroneous or non-trustworthy annotators, whereas we focus on retaining the annotator disagreements through the modeling process.
A few studies have explored approaches for utilizing annotation disagreement during model training. Prabhakaran et al. (2012) explored applying a higher cost to errors made on unanimously annotated inputs, thereby decreasing the penalty for mis-labeling inputs with higher disagreement. Similarly, Plank et al. (2014) incorporated annotator disagreement into the loss function of a structured perceptron model for better predicting part-of-speech tags. Our work also utilizes annotator disagreements rather than resolving them at the data stage; however, we use a multi-task architecture with a shared representation to model annotator disagreements, rather than using them in the loss function. Cohn and Specia (2013) use a multi-task approach to model annotator differences in machine translation annotations. While they use a Gaussian Process approach, we use the multi-task approach on top of pre-trained language models (Liu et al., 2019). Chou and Lee (2019) proposed an approach that models individual annotators separately in an inner layer to improve the final prediction. In contrast, our method uses the multi-task architecture, and provides the additional ability to utilize multiple predictions during deployment, for instance, to measure uncertainty. Fornaciari et al. (2021) also leveraged annotator disagreement using a multi-task model that adds an auxiliary task to predict the soft label distribution over annotator labels, which improves performance even in less subjective tasks such as part-of-speech tagging. In contrast, our approach models individual annotators' labels as multiple tasks and derives their disagreement from the predictions.
2.4 Prediction Uncertainty
Model uncertainty denotes the confidence of model predictions, which has specific applications in non-deterministic machine learning tasks. For instance, interpreting model outputs and its confidence is critical in autonomous vehicle driving, where wrong predictions are costly or harmful (Schwab and Karlen, 2019). In subjective tasks, uncertainty embeds additional information that supports result interpretation (Ghandeharioun et al., 2019). For example, the level of uncertainty could help determine when and how moderators take part in a human-in-the-loop content moderation (Chandrasekharan et al., 2019; Liu, 2020).
The simplest approach for uncertainty estimation is through the prediction probability from a Softmax distribution (Hendrycks and Gimpel, 2017). However, as the input data gets farther from the training data, this probability estimation naturally yields extrapolations with unsupported high confidence (Gal and Ghahramani, 2016). Instead, Gal and Ghahramani (2016) proposed the Monte Carlo dropout approach to estimate uncertainty by iteratively applying dropout to all layers of the model and calculating the variance of the generated outputs. Such estimations based on the probability of a single ground truth label overlook the many factors that contribute to uncertainty (Kläs and Vollmer, 2018). In contrast, Passonneau and Carpenter (2014) demonstrate the benefits of measuring uncertainty for the ground truth label by fitting a probabilistic model to individual annotators' observed labels. Similarly, we demonstrate that calculating annotation disagreement by predicting a set of annotations for the input yields a better estimation of uncertainty than estimations based on the probability of the majority label.
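For reference, the Monte Carlo dropout estimate described above can be sketched as follows; this is an illustration, not the exact implementation used in our experiments, and assumes a model that returns per-class probabilities:

```python
import torch

def mc_dropout_uncertainty(model, input_ids, attention_mask, n_samples: int = 20):
    """Estimate uncertainty as the variance of the positive-class probability
    across repeated stochastic forward passes with dropout kept active."""
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([
            model(input_ids, attention_mask)[:, 1]  # assumed to be P(positive class)
            for _ in range(n_samples)
        ])                                          # shape: (n_samples, batch)
    model.eval()
    return probs.var(dim=0)                         # higher variance = less confident
```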
3 Methodology
We define the classification task on an annotated dataset D = (X, A, Y), in which X is a set of text instances, A is the set of annotators, and Y is the annotation matrix, in which each entry yij ∈ {0,1} represents the label assigned to xi ∈ X by aj ∈ A. In most annotated datasets Y includes many missing values, because each annotator only labels a subset of all instances. We use Yi,· to refer to the annotations present for item xi. Similarly, we use Y·,j to refer to the annotations made by annotator aj. The classification task aims to predict ȳi, the label assigned to xi based on the majority vote over Yi,·. We use majority vote, the most commonly used aggregation method; however, our proposed approach leaves open the choice of the aggregation method depending on deployment contexts.
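As a small, purely illustrative example of this setup (toy values, not drawn from either dataset), the annotation matrix Y can be stored with explicit missing entries and aggregated per instance:

```python
import numpy as np

# Toy annotation matrix Y: rows are instances x_i, columns are annotators a_j;
# np.nan marks a missing entry (a_j did not label x_i).
Y = np.array([
    [1.0, 0.0, np.nan, 1.0],    # x_0 labeled by a_0, a_1, a_3
    [np.nan, 1.0, 1.0, 0.0],    # x_1 labeled by a_1, a_2, a_3
    [0.0, np.nan, 0.0, np.nan], # x_2 labeled by a_0, a_2
])

def majority_vote(row: np.ndarray) -> int:
    """Aggregate the available annotations for one instance into the label ȳ_i."""
    labels = row[~np.isnan(row)]
    return int(labels.mean() >= 0.5)  # ties broken toward the positive class here

y_bar = np.array([majority_vote(row) for row in Y])
print(y_bar)  # [1 1 0]
```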
We consider three different multi-annotator architectures: ensemble, multi-label, and multi-task. Figure 1 shows the schematic differences between these three variations. All variations use Bidirectional Encoder Representations from Transformers (BERT-base; Devlin et al., 2019). For each instance xi, a generic representation hi ∈ ℝd is generated by the pre-trained BERT-base, and then fine-tuned along with other components of the classifier during training. The size of the representation vector, d, is defined by the BERT configuration and is set to 768 for the pre-trained BERT-base. While our experiments are all performed with BERT-base, our methods are not inherently restricted to BERT, and can be implemented with other pre-trained language models, for example, RoBERTa (Liu et al., 2019).
3.1 Baseline Model Using Majority Labels
The baseline model captures the most common approach: a single-task classifier trained to predict the aggregated label for each instance (i.e., the majority vote, in our case). It is built by adding a fully connected layer on top of the BERT-base output hi. The fully connected layer applies a linear transformation followed by a Softmax function to generate the probability of the majority label ȳi. Compared to the other models described in this section, the baseline model does not make use of the annotation matrix Y, as it directly predicts the aggregated label ȳi.
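A minimal sketch of such a baseline, assuming a PyTorch/HuggingFace setup (class and layer names are illustrative, not the exact implementation used in our experiments):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MajorityLabelBaseline(nn.Module):
    """Single-task classifier: BERT-base encoder + one fully connected layer
    predicting the aggregated (majority-vote) label."""
    def __init__(self, model_name: str = "bert-base-uncased", num_classes: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # produces h_i ∈ R^768
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).pooler_output
        logits = self.classifier(h)
        return torch.softmax(logits, dim=-1)  # probability of the majority label
```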
3.2 Ensemble Approach
An intuitive approach towards multi-annotator models might be to train an ensemble of models, each trained on a different annotator's labels. This approach is not always practical, as it may increase the training time prohibitively. The ensemble approach trains |A| single-task classifiers, each fitted to and predicting the annotations generated by one annotator. During training, the j-th classifier is independently fine-tuned to predict Y·,j, which includes all annotations provided by the j-th annotator. During test time, we aggregate the outputs of all |A| models by majority vote to predict ȳi.¹
3.3 Multi-label Approach
A more practical approach to multi-annotator modeling is to treat the problem as a multi-label problem in which each label dimension corresponds to an individual annotator's label. More specifically, the multi-label approach learns to predict |A| labels for each input using a multi-label classification framework. The model first adds a fully connected layer to transform each hi to an |A|-dimensional vector, and then applies a Sigmoid function to the j-th dimension to generate yij. Since Y includes many missing values, the classification loss is calculated based only on the available labels Yi,·. During test time, however, all |A| outputs are aggregated to predict ȳi.
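A sketch of this variant under the same assumptions as the baseline sketch, with the loss masked to the available annotations:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiLabelAnnotators(nn.Module):
    """Sketch of the multi-label variant: one Sigmoid output per annotator on top
    of a shared BERT encoder (illustrative, not the exact implementation we used)."""
    def __init__(self, num_annotators: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.heads = nn.Linear(self.encoder.config.hidden_size, num_annotators)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).pooler_output
        return torch.sigmoid(self.heads(h))      # shape: (batch, |A|)

def masked_bce_loss(probs, targets, mask):
    """Binary cross-entropy computed only where an annotation exists (mask == 1).
    probs, targets, mask: float tensors of shape (batch, |A|)."""
    loss = nn.functional.binary_cross_entropy(probs, targets, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```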
3.4 Multi-task Approach
The multi-task based multi-annotator approach learns each annotator's perspective (labels) as a separate classification task. All tasks share the encoder layers that generate the input representation hi, while each task has its own fully connected layer and Softmax activation. Compared with the multi-label approach, the multi-task model includes a fully connected layer explicitly fine-tuned for each annotator. Compared with the ensemble approach, however, the representation layers that generate hi are fine-tuned based on the outputs of all annotation tasks. The loss for each instance xi is the sum of the losses over its available labels. During test time, the model considers the outputs of all annotation tasks to predict the majority label ȳi.
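A sketch of the multi-task variant under the same assumptions: a shared encoder, one classification head per annotator, a loss summed over the available annotations, and majority voting over head predictions at test time.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskAnnotators(nn.Module):
    """Sketch of the multi-task variant: a shared BERT encoder with a separate
    fully connected layer + Softmax per annotator (illustrative names)."""
    def __init__(self, num_annotators: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size          # 768 for BERT-base
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 2) for _ in range(num_annotators)])

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).pooler_output
        # One (batch, 2) probability distribution per annotator head.
        return [torch.softmax(head(h), dim=-1) for head in self.heads]

def multitask_loss(per_head_probs, targets, mask):
    """Sum the cross-entropy over the annotator heads whose label is available.
    targets: (batch, |A|) long tensor; mask: (batch, |A|) float tensor (1 = labeled)."""
    total = 0.0
    for j, probs in enumerate(per_head_probs):
        nll = nn.functional.nll_loss(torch.log(probs + 1e-9),
                                     targets[:, j], reduction="none")
        total = total + (nll * mask[:, j]).sum()
    return total

def predict_majority(per_head_probs):
    """Aggregate all annotator heads' predicted labels by majority vote."""
    votes = torch.stack([p.argmax(dim=-1) for p in per_head_probs])  # (|A|, batch)
    return (votes.float().mean(dim=0) >= 0.5).long()
```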
4 Experiments
4.1 Data
For this study, we perform experiments on two datasets annotated for subjective tasks: the Gab Hate Corpus (GHC; Kennedy et al., 2020) and the GoEmotions dataset (Demszky et al., 2020). Both datasets capture per-annotator labels for instances along with the corresponding annotators' anonymous IDs, allowing us to model each annotator separately.
4.1.1 Gab Hate Corpus (GHC)
GHC (Kennedy et al., 2020) includes |X| = 27,665 social-media posts collected from a public corpus of Gab.com (Gaffney, 2018), each annotated for whether or not it contains hate speech. Kennedy et al. (2020) define hate speech as language that dehumanizes, attacks human dignity, derogates, incites violence, or supports hateful ideology, such as white supremacy. Each instance in GHC is annotated by at least three annotators from a set of |A| = 18 annotators. The number of annotations varies across instances, and in total, there are 86,529 annotations. The number of annotated instances per annotator also varies significantly.
4.1.2 GoEmotions
We use a subset of the GoEmotions dataset (Demszky et al., 2020), which contains Reddit posts annotated for 28 emotions, split across pre-defined train (|X|train = 43,410), test (|X|test = 5,427), and validation (|X|val = 5,426) subsets. Our experiments focus on the annotations for the six Ekman (1992) emotions: anger, disgust, fear, joy, sadness, and surprise. Each instance in GoEmotions is annotated by three to five annotators from a set of |A| = 82 annotators. The number of annotations varies across instances, and in total, there are 194,412 annotations. The number of annotated instances also varies significantly across annotators.
4.2 Experimental Setup
We implemented the classification models using the transformers (v3.1) library from HuggingFace (Wolf et al., 2020). The training steps employ the Adam optimizer (Kingma and Ba, 2015). Our experimental settings are configured similarly to Kennedy et al. (2020) and Demszky et al. (2020): GHC experiments are conducted with a learning rate of e−7 and are trained for three epochs, whereas experiments on GoEmotions apply early stopping with a learning rate of 5e−6. Since GHC does not have specific train and test subsets, we conducted 5 iterations of stratified 5-fold cross-validation for evaluation, changing only the random state for each iteration. GoEmotions experiments are performed as six different binary classification tasks, also repeated for 5 iterations, using the pre-defined train and test sets.
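For concreteness, the GHC evaluation loop can be sketched with scikit-learn as follows (toy labels; stratifying on the majority label and using the iteration index as the random state are illustrative assumptions):

```python
from sklearn.model_selection import StratifiedKFold

# Toy majority labels; in practice y_bar holds the per-instance majority vote for GHC.
y_bar = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
X_ids = list(range(len(y_bar)))

# 5 iterations of stratified 5-fold cross-validation, changing only the random state.
for iteration in range(5):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=iteration)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X_ids, y_bar)):
        # fine-tune on train_idx, evaluate on test_idx (training loop omitted)
        print(f"iteration {iteration}, fold {fold}: "
              f"{len(train_idx)} train / {len(test_idx)} test")
```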
4.3 Results on GHC
4.3.1 Prediction Results
Table 1 reports the average and standard deviation of the precision, recall, and F1-scores for various models, across the 5 iterations. The baseline model, which is trained using the majority vote as ground truth, is also tested against the majority vote labels. For the ensemble, multi-label, and multi-task models, we conduct two types of evaluation: First, we test how well the majority vote of predicted labels match the majority vote of annotations (columns 2-4 in Table 1); second, we report how well the individual predicted labels for each instance match the annotations (where available) by annotators (columns 5-7 in Table 1).
Table 1: Prediction results on GHC (mean ± SD over 5 iterations), evaluated against the majority vote of annotations (Maj.) and against individual annotators' labels (Ind.).

| Model | Precision (Maj.) | Recall (Maj.) | F1 (Maj.) | Precision (Ind.) | Recall (Ind.) | F1 (Ind.) |
|---|---|---|---|---|---|---|
| Baseline | 49.53 ± 3.8 | 68.78 ± 4.4 | 57.32 ± 1.2 | – | – | – |
| Ensemble | 63.98 ± 1.1 | 46.09 ± 1.9 | 53.54 ± 1.0 | 60.92 ± 0.7 | 60.97 ± 0.8 | 60.94 ± 0.3 |
| Multi-label | 66.02 ± 2.2 | 50.16 ± 2.0 | 56.94 ± 1.0 | 67.22 ± 1.4 | 55.33 ± 2.0 | 60.65 ± 0.7 |
| Multi-task | 59.03 ± 0.9 | 59.98 ± 0.6 | 59.49 ± 0.2 | 63.71 ± 1.3 | 62.76 ± 1.5 | 63.20 ± 0.3 |
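The two evaluation modes in Table 1 can be sketched as follows (toy values; at test time the predicted majority is taken over all annotator heads, since the model does not know which annotators labeled each instance):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy (|X|, |A|) matrices of per-annotator labels; np.nan marks missing annotations.
Y_true = np.array([[1, 0, np.nan], [np.nan, 1, 1], [0, 0, np.nan]])
Y_pred = np.array([[1, 1, 0], [0, 1, 1], [0, 1, 0]])  # one prediction per head

def majority(M):
    return (np.nanmean(M, axis=1) >= 0.5).astype(int)

# (1) Majority-vote evaluation: aggregate both sides, then score.
p, r, f1, _ = precision_recall_fscore_support(
    majority(Y_true), majority(Y_pred), average="binary")

# (2) Individual-label evaluation: score only where an annotation exists.
mask = ~np.isnan(Y_true)
p_i, r_i, f1_i, _ = precision_recall_fscore_support(
    Y_true[mask].astype(int), Y_pred[mask].astype(int), average="binary")

print(f1, f1_i)
```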
We observe that the ensemble model performs significantly worse (F1 = 53.54) than the baseline single-task model (F1 = 57.32) in predicting majority label. This is presumably due to the fact that each base model in the ensemble is trained using only the examples labeled by the corresponding annotator. Since the number of annotations varies significantly for different annotators (see Section 4.1), many base models end up with lower performance, resulting in lower overall performance.
Multi-label and multi-task models share most layers across different annotator heads. Thus, each annotator head benefits from the updates to the shared layers owing to all instances, regardless of whether they annotated it or not. The multi-label model performs slightly worse (F1 = 56.94) than the baseline model. In contrast, the multi-task model, which has a fully connected layer fine-tuned for each annotator, posted a significantly higher F-score (F1 = 59.49) than the baseline model. In other words, fine-tuning each annotator head separately and then taking the majority vote performs better than taking the majority vote first and then training on that noisier label.
Moreover, the baseline model yields higher performance variance across iterations: its standard deviations of precision, recall, and F1 exceed those of the other three methods. One possible explanation is that aggregating annotations based on majority votes discards information about each annotator and adds noise to the labels. In other words, modeling each annotator, and their presumable internal consistency, could lead to more stable prediction results. However, this hypothesis requires further investigation.
We now evaluate the individual predictions made by the multi-annotator models (prior to majority vote) on how well they match individual annotators' labels (Table 1). All three multi-annotator approaches obtain higher F1-scores than the baseline model does in predicting majority labels (note that these are different tasks, and not directly comparable). The multi-task model achieved the highest F1-score of 63.20. This result suggests that the multi-task model benefits from fine-tuning each annotator head separately (thereby avoiding inconsistencies due to majority votes) as well as learning from all instances in a shared fashion.
4.3.2 Modeling Uncertainty
Figure 2 shows the correlation of each method's uncertainty estimate with the annotation disagreement, calculated as the variance of the annotations on each instance. While traditional estimations such as Softmax and MC dropout have a moderate correlation with annotator disagreements, the uncertainty measured by our three multi-annotator methods shows significantly better correlations, with the ensemble method posting a slightly higher correlation than the other two methods. In other words, in addition to performing better at predicting majority votes, multi-annotator models also estimate model uncertainty better than traditional approaches.
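As a sketch of how this comparison can be computed, assuming both disagreement and model uncertainty are measured as the variance of binary labels (annotations and per-head predictions, respectively) and correlated with Pearson's r:

```python
import numpy as np
from scipy.stats import pearsonr

def label_variance(labels: np.ndarray) -> float:
    """Variance of binary labels; 0 under full agreement, 0.25 for an even split."""
    labels = labels[~np.isnan(labels)]
    return float(np.var(labels))

# Toy per-instance annotations and per-annotator-head predictions (hypothetical values).
annotations = [np.array([1, 1, 0]), np.array([0, 0, 0, 1]), np.array([1, 0, np.nan])]
head_preds  = [np.array([1, 1, 1]), np.array([0, 0, 0, 0]), np.array([1, 0, 0])]

disagreement = [label_variance(a) for a in annotations]
uncertainty  = [label_variance(p) for p in head_preds]
r, _ = pearsonr(disagreement, uncertainty)
print(round(r, 3))
```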
We further analyze the pair-wise correlations between the uncertainty estimates of the different approaches (Figure 3). As expected, the Softmax and MC dropout methods are highly correlated, and similarly, our methods show high correlation among themselves. It is also interesting to note that the uncertainty estimated by our methods correlates significantly with the traditional methods (i.e., between 0.6 and 0.7), except for the multi-task and MC Dropout methods, which have a lower correlation of 0.53.
The fact that the uncertainty scores of the multi-task and multi-label models are highly correlated with each other (0.86) suggests that they both identify textual features that cause disagreement. We verified this by training a separate model, using the same BERT-based setup with a Sigmoid activation, to directly predict the annotator disagreement. The uncertainty predicted by this model obtained a similar correlation with annotator disagreement (0.47) as the multi-task and multi-label models.
4.3.3 Computation Time
We now assess the computation cost associated with the different approaches. Table 2 shows the time it took to train a single cross-validation fold (i.e., 80% of the dataset). As expected, the ensemble approach takes the longest to train, as it requires training |A| different models (each with varying training set sizes), and the baseline takes the shortest time. Notably, the multi-label and multi-task models do not take significantly more time to train. In other words, while the multi-task model trains additional layers for annotators, it adds only a marginal computation cost over the baseline model.
4.4 Results on GoEmotions
In this section, we describe results obtained on the six binary classification tasks performed using the GoEmotions dataset. Since the multi-task approach obtained better performance overall on GHC, we report results only for the multi-task approach here. We start by assessing how well the multi-annotator model matches the single-task performance in predicting the majority label. Table 3 reports the average and standard deviation of F1-scores over 5 iterations of training and testing. Unlike GHC, where we used 5-fold cross-validation, for the GoEmotions dataset we use the pre-defined train, validation, and test splits; we verified that these splits are stratified with respect to annotators. As in the GHC experiments, while the baseline model is trained and tested on the majority vote, the multi-task model is trained on the available annotator-level annotations for each instance, and the predictions from all classifier heads are aggregated to get the final label during testing.
Table 3: F1-scores (mean ± SD over 5 iterations) on GoEmotions for the baseline and multi-task models, on the full dataset (|A| = 82) and on the subset of annotators with more than 1000 annotations (|A| = 53).

| Emotion | Baseline (Full) | Multi-task (Full) | Baseline (Subset) | Multi-task (Subset) |
|---|---|---|---|---|
| Anger | 40.38 ± 4.4 | 39.01 ± 6.4 | 41.95 ± 6.1 | 42.75 ± 4.4 |
| Disgust | 38.79 ± 3.9 | 38.31 ± 1.9 | 37.72 ± 2.0 | 35.77 ± 2.0 |
| Fear | 58.96 ± 5.0 | 54.97 ± 6.1 | 57.68 ± 3.7 | 58.58 ± 2.3 |
| Joy | 47.80 ± 2.2 | 49.53 ± 3.6 | 47.45 ± 3.1 | 46.26 ± 1.2 |
| Sadness | 49.22 ± 5.2 | 50.36 ± 3.2 | 47.55 ± 5.4 | 48.00 ± 3.4 |
| Surprise | 40.96 ± 2.9 | 38.97 ± 3.6 | 39.44 ± 5.7 | 40.22 ± 2.2 |
Results obtained on the full dataset are shown in the second and third columns of Table 3. While the multi-task model outperformed the baseline in predicting two emotions—joy and sadness—it underperformed the baseline for the other four emotions, although the ranges of F1-scores largely overlap. It is also observed that the standard deviations of the multi-task model F1-scores are significantly larger than what was observed for GHC.
On further inspection, we found that many annotators contributed very few annotations in the dataset. For instance, 29 annotators had fewer than 1000 annotations in the training set, six of them having fewer than 100. In addition, the label distribution is extremely skewed for all six emotions—ranging from 1.6% positive labels for fear on average across all annotators, to 4.0% positive labels on average for joy. Consequently, many annotator heads have too few positive instances to learn from; some had zero positive instances in the training set. This makes the corresponding learning tasks in the multi-task setting hard or even impossible on this dataset, and might explain the lower performance and higher variance in F1-scores.
In order to make a fairer comparison, we performed our experiments on a subset of the dataset that only includes the annotations by 53 annotators who had more than 1000 annotations. Results obtained on this subset are in the fourth and fifth columns of Table 3. Our multi-annotator model outperforms the baseline model on predicting the majority label in four of the six tasks (anger, fear, sadness, and surprise), while obtaining slightly lower results on disgust and joy. While F1-score ranges of baseline and multi-task models still largely overlap, the multi-task model fares significantly better when there are enough instances for each annotator head to learn from. The multi-task model also reported lower standard deviation in performance than the baseline model, suggesting better robustness in the learned model.
The main advantage of our multi-annotator model is the ability to capture multiple perspectives efficiently. In that respect, our model fared better at modeling annotator uncertainty across the board. As shown in Figure 4, our multi-annotator model obtained better correlation overall with annotator disagreement than the Softmax and MC dropout approaches across all six emotions, both in the full dataset and in the subset (joy in the full dataset being the only exception). This further demonstrates the strength of our approach, which does not come at any significant cost in performance or efficiency (training the multi-task model on the full dataset takes 6.1 minutes per epoch, compared to 5.2 minutes for the baseline model).
5 Analysis
In this section, we further analyze the multi-task model and its outputs, as it posted the overall best performance among the three approaches, considering the predictive performance, uncertainty modeling correlation, and time efficiency. We focus on the GHC model for this analysis.
5.1 Error Analysis
We first qualitatively analyze the mismatches between the multi-task and baseline models on their majority vote predictions. Among all GHC instances (|X| = 27,665), the multi-task and baseline models disagreed on 1,945 labels. Table 4 shows examples of such instances, along with the majority vote of annotations and the percentage of annotators who labeled them as hate speech (columns 2 and 3). Table 4 also provides the multi-task model's majority label and the percentage of prediction heads labeling them as hate speech (columns 4 and 5), and the baseline model's prediction (column 6).
Table 4: Examples of GHC instances on which the multi-task and baseline models disagree, with the majority vote of annotations and the percentage of annotators labeling the instance as hate ("Annotation"), the multi-task model's majority label and the percentage of prediction heads labeling it as hate ("Multi-task"), and the baseline model's label.

| Text | Annotation Label | Annotation %Hate | Multi-task Label | Multi-task %Hate | Baseline Label |
|---|---|---|---|---|---|
| Agreed. Traitors deserve to be shot. PERIOD. | Not-Hate | 0.33 | Not-Hate | 0.39 | Hate |
| Are the democrats guilty of treason? | Not-Hate | 0.25 | Not-Hate | 0.39 | Hate |
| I don't know, you tell me ya prick. | Not-Hate | 0.33 | Not-Hate | 0.39 | Hate |
| They should force the 'refugees' to wear this crap. | Hate | 0.67 | Not-Hate | 0.33 | Hate |
| At this point, nobody gives a shit! | Hate | 0.67 | Not-Hate | 0.44 | Hate |
| It's OK to be a Nazi!! | Hate | 1.00 | Not-Hate | 0.44 | Hate |
| Arise WHITE PEOPLE! We're back. | Not-Hate | 0.33 | Hate | 0.56 | Not-Hate |
| Shocking! A jew defending another jew! Who'd have thought? | Not-Hate | 0.33 | Hate | 0.56 | Not-Hate |
| I am beginning to think being straight is misconduct. | Not-Hate | 0.0 | Hate | 0.56 | Not-Hate |
| Armenia is a nation of mongrel bastards. | Hate | 1.0 | Hate | 0.78 | Not-Hate |
| Hope they both get AIDS. | Hate | 1.0 | Hate | 0.72 | Not-Hate |
| I am so NOT afraid of you gay boy. | Hate | 0.67 | Hate | 0.83 | Not-Hate |
The most common type of mismatch (57.94% of mismatches) occurs when an instance deemed non-hateful (by majority vote of annotations) is correctly labeled by the multi-task model but incorrectly labeled by the baseline (first set of rows in Table 4). In other words, these samples represent the baseline model’s false-positive predictions, most of which include specific tokens, such as slur words and social group tokens. The next most common type of model mismatch (22.31% of mismatches) occurred when an instance that was deemed hateful (by majority vote) is mislabeled by the multi-task model and labeled correctly by the baseline model. In general, these two types of mismatches correspond to the positive predictions of the baseline model. A possible explanation for the frequency of such mismatches is the high rate of positive predictions by the baseline model, which is also supported by the higher recall and lower precision scores of the baseline model (Table 1).
The other two types of mismatches occurred when the baseline and multi-task models respectively predicted non-hateful and hateful labels. When this mismatch is over an instance deemed non-hateful by the majority vote of annotations (12.19% of mismatches), the multi-task model is making a false-positive error, and we observe mentions of social group names in the text. A large number of such instances had a nearly even split (56%–44%) between labels across individual predictions (see Table 4), suggesting the model was unsure. The least common type of disagreement is over instances deemed hateful by both the majority vote of annotations and our multi-task model, but mis-classified by the baseline model (7.56% of mismatches).
5.2 Uncertainty vs. Error
Now, we investigate whether the uncertainty in predictions is correlated with whether the multi-task model was able to correctly predict the majority label. Note that the value of uncertainty, based on Equation 1, falls between 0 and 0.25. We observe that the mean value for uncertainty in correct predictions was 0.049 compared to 0.170 when the model was incorrect. Figure 5a shows the corresponding violin plots. While most incorrect predictions had high uncertainty, a small but significant number of errors were made with certainty.
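For reference, this range is consistent with measuring uncertainty as the variance of the binary per-head predictions: if a fraction p of heads predicts the positive class, then

```latex
\operatorname{Var}(\hat{y}_{i}) = p\,(1-p), \qquad 0 \le p\,(1-p) \le \tfrac{1}{4},
```

with the maximum (0.25) reached at p = 0.5, an even split among heads, and the minimum (0) under full agreement.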
Separating this analysis across true positives, false positives, false negatives, and true negatives presents a more informative picture. For instance, the model is almost always certain about true negatives (M(uncertainty) = 0.040). Similarly, the model is almost always uncertain about false positives (M(uncertainty) = 0.199), something we also observed in the error analysis presented in Section 5.1. On the other hand, both true positives and false negatives have a bi-modal distribution of uncertainty, with similar mean uncertainty values of 0.140 and 0.141, respectively. In sum, a negative prediction with high uncertainty is more likely to be a false negative in our case.
6 Discussion
We presented multi-annotator approaches that predict individual labels corresponding with each annotator of a subjective task, as an alternative to the more common practice of deriving (and predicting) a single “ground-truth” label, such as the majority vote or average of multiple annotations. We demonstrate that our method based on multi-task architecture obtains better performance for modeling each annotator (63.2 F1-score, micro-averaged across annotators in GHC), and even when aggregating annotators’ predictions, our approach matches or outperforms the baseline across seven tasks. Our study focuses on majority vote as the baseline aggregation approach to demonstrate how this commonly used approach loses meaningful information. Other aggregation strategies such as MACE (Hovy et al., 2013) and Bayesian methods (Paun et al., 2018) could be explored in future work as complementary approaches that can work with the multi-annotator framework.
6.1 Advantages of Multi-Annotator Modeling
One core advantage of our method, which can further be leveraged in practice, is its ability to provide multiple predictions for each instance. As demonstrated in Figures 2 and 4, the multiple predictions yield an uncertainty estimate that better matches the disagreement between annotators. The estimated uncertainty could be used to determine when not to make a prediction, or to route the example to a manual content moderation queue, since it may be an example that annotators would likely disagree on. One could also investigate how to learn an uncertainty threshold to make more informed predictions. For instance, based on our analysis in Section 5.2, a negative prediction with high uncertainty is very likely to be a false negative. One could use this knowledge in a deployment scenario and predict a positive label in case of a negative majority prediction with high uncertainty.
Predicting multiple annotations rather than a single ground truth is especially essential in subjective tasks. As Alm (2008) argues, in many subjective tasks the aim is not to find one accurate answer; instead, a model can produce the most acceptable answer based on responses from different judgements. Accordingly, our method contrasts with approaches for enhancing ground-truth generation prior to modeling. Our approach aims to preserve annotators' consistency in labeling by delaying the annotation aggregation until the final stage. As a final step, if required, application-driven approaches can be employed to find the most appropriate answer. For instance, an aggregation approach based on MACE (Hovy et al., 2013; Paun et al., 2018) could be applied to the predicted individual labels to find a final label that considers the trustworthiness of individual annotators.
Researchers have pointed out that in more objective tasks, such as commonsense knowledge or word sense disambiguation, training a model on the judgements of a specific set of annotators lacks generalizability to annotations generated by new annotators (Geva et al., 2019). However, in subjective tasks such as affect and online abuse detection, different annotator perspectives, and their contrasts, can be useful (Gordon et al., 2021).
Another advantage of having multiple prediction heads in a multi-task architecture is that we could adapt the same model to different value systems. For instance, in cases where annotators with different moral beliefs systematically produce different labels (Waseem, 2016; Díaz, 2020; Patton et al., 2019), one could use the multi-task approach to build a single global model whose predictions can be conditioned on different value systems. This is valuable for international media platforms that need to build and deploy global models attending to local cultures and values without retraining entirely separate models for each culture.
Multi-annotator modeling can also be applied in scenarios that may benefit from obtaining several perspectives for a single instance. For example, in detecting affect in language, a range of subjective human knowledge, interpretation, and experience can be modeled through a multi-annotator architecture. This approach would generate a range of affective states, either along affect categories, such as anger and happiness, or dimensions, such as arousal and pleasantness (Alm, 2011, 2008), which correspond to different subjective perceptions of the text. Another example is sarcasm detection, where an ambiguous sarcastic text is labeled differently according to annotators' thresholds for sarcasm (Rakov and Rosenberg, 2013). In a multi-annotator setting, the internal consistency of each annotator's threshold for sarcasm may be preserved in the training process.
6.2 Limitations and Challenges
Our approach is not without limitations. Our experiments were computationally viable because of the relatively small number of annotators in our annotator pools (18 for GHC and 82 for the GoEmotions dataset), which is not usually the case with large crowd-sourced datasets. For instance, the dataset by Díaz (2020) has over 1.4K individual annotators, and Jigsaw (2019) built a dataset with over 8K annotators. Fine-tuning that many separate annotator heads would be computationally expensive and may not be a viable option. However, clustering annotators based on their agreements and aggregating annotator labels into cluster labels could address this issue. In that scenario, the multi-task model would include a separate classifier head for each cluster of annotators. The number of clusters could be determined based on the available computational resources and properties of the data. This is an important direction for future work.
The proposed approach, along with other methods for incorporating individual annotators and their disagreements, is only viable when annotated datasets include annotator-level labels for each instance. However, most multiply annotated datasets contain only per-instance majority labels (Waseem and Hovy, 2016; Jigsaw, 2018) or aggregate percentages (Davidson et al., 2017; Jigsaw, 2019). Even in cases where the raw annotations were released, the multi-annotator model requires enough annotations from each annotator to model them effectively. We observed that dataset designers may not have envisioned such a use of annotator-level labels for downstream analysis. For instance, in the GoEmotions dataset, many annotators labeled fewer than 1000 instances, making annotator-level modeling difficult. Moreover, the high cost of gathering a large number of annotations per annotator on crowdsourcing platforms may limit data collection and call for post-hoc modeling solutions. One way to tackle this issue is by choosing a subset of top-performing annotator heads (during the validation step) for the final prediction. Future work should look into such post-processing steps that could further improve performance.
The main challenge in enabling further exploration of open questions in studying annotator disagreements, and efficient ways to model them, is the lack of annotator-level labels. This largely stems from the practice of considering crowd annotators as interchangeable, and not accounting for the differences in their perspectives. We recommend that data providers consider releasing individual annotation labels, when feasible, in an anonymized way and with appropriate consent. We also encourage researchers to design data collection efforts in a way that includes a sufficient number of annotations by each annotator, so that systematic differences in their annotation behaviors can be better understood and accounted for.
7 Conclusion
We present a multi-annotator approach that employs a separate classifier head for each annotator of a dataset, as an alternative to the practice of predicting the aggregated majority vote. We demonstrate that our method can efficiently obtain better performance in modeling each annotator as well as match the majority vote prediction performance. We present experiments across different subjective classification tasks, including hate speech detection and six different emotion detection tasks. The model uncertainty estimated from our multi-annotator models' predictions obtains a higher correlation with annotation disagreement than more traditional methods. We expect future work to investigate our multi-annotator approach as a means to detect and mitigate model biases. Moreover, monitoring the performance of annotator heads and model uncertainty in an active learning setting has the potential to capture a more diverse and comprehensive set of perspectives in data.
8 Ethical Considerations
Our paper discusses an approach for attending to individual annotators' judgements in training a supervised model. In doing so, our multi-annotator approach better preserves minority perspectives that are usually sidelined by majority votes. Our intended use case for this approach is in subjective NLP tasks, such as identifying affect, abusive language, or hate speech, where generating a single true answer does not capture the nuances. While our method likely preserves minority perspectives, a potential misuse of this technique is re-weighting individual annotators' labels during prediction. Such an alteration, aimed solely at improving majority label prediction performance, may adversely impact the representation of different perspectives in the model. In fact, such an optimization may marginalize under-represented perspectives even further than current majority vote–based approaches. For instance, identifying annotator heads that significantly disagree with the majority vote might put their perspectives at higher risk of being excluded. It is also important to consider the number of annotators in the annotator pool when applying this method, in order to protect the privacy and anonymity of annotators, since our approach attempts to model their personal subjective preferences and biases. This is especially critical in the case of sensitive tasks such as hate speech annotation, where associating individual annotators with such representations may be undesirable.
Acknowledgments
We thank Ben Hutchinson, Lucas Dixon, and Jeffrey Sorensen for valuable feedback on the manuscript. We also thank TACL Action Editor, Dirk Hovy, and the anonymous reviewers for their constructive feedback.
Notes
1. During prediction, multi-annotator models do not have access to the list of annotators who originally provided the labels for each instance. Therefore, the original majority vote is predicted as the majority vote among all annotators.