Investigating Reasons for Disagreement in Natural Language Inference

Abstract

We investigate how disagreement in natural language inference (NLI) annotation arises. We developed a taxonomy of disagreement sources with 10 categories spanning 3 high-level classes. We found that some disagreements are due to uncertainty in the sentence meaning, others to annotator biases and task artifacts, leading to different interpretations of the label distribution. We explore two modeling approaches for detecting items with potential disagreement: a 4-way classification with a "Complicated" label in addition to the three standard NLI labels, and a multilabel classification approach. We found that the multilabel classification is more expressive and gives better recall of the possible interpretations in the data.


Introduction
Natural language inference (NLI) is the task of identifying whether a hypothesis sentence is inferred, contradicted, or neither, by a premise. It is considered one of the most fundamental aspects of competent language understanding. In natural language processing, the NLI task is widely used to evaluate models' semantic representations (i.a., Wang et al., 2019a), and to facilitate downstream tasks, e.g., in natural language generation (NLG).
Large NLI datasets have been built by collecting inference judgments for premise-hypothesis pairs and aggregating the judgments by simple methods such as majority voting. However, it has been pointed out that NLI items do not all have a single ground truth and can exhibit systematic disagreement (i.a., Pavlick and Kwiatkowski, 2019; Nie et al., 2020). This questions the assumption of having a single ground truth for each item and the validity of measuring models' ability to produce such ground truth. For instance, in (1) from the MNLI dataset (Williams et al., 2018), 3 out of 5 annotators labeled the item as "Entailment" (the hypothesis is inferred from the premise), 0 labeled it as "Neutral" (the hypothesis cannot be inferred from the premise), and 2 as "Contradiction" (the hypothesis contradicts the premise).
(1) P: the only problem is it's not large enough it only holds about i think they squeezed when Ryan struck out his five thousandth player they they squeezed about forty thousand people in there.
H: it doesn't hold many people.
[E,N,C]: [3,0,2]

People have indeed different judgments on which number is required to count as holding many people. The premise and hypothesis do not resolve explicitly what is being talked about, possibly a stadium. Does 40,000 count as many for a stadium seating capacity? The premise states that it's not large enough and uses the term squeezing, leading some annotators to see the hypothesis it doesn't hold many people as being inferred from the premise. On the other hand, 40,000 people in a specific location is a large number, and some annotators therefore judge the hypothesis as contradictory to the premise. Such disagreement is not captured when taking only one of the three standard NLI labels as ground truth. Recent work (Zhang et al., 2021; Zhou et al., 2021) has thus explored approaches for building NLI models that predict the entire annotation distribution, instead of the majority vote category, in an attempt to move away from assuming a single ground truth per item. However, little is understood about where the disagreement stems from, and whether modeling the distribution is the best way to handle disagreement in annotation.
To investigate these questions, we created a taxonomy of different types of disagreement consisting of 10 categories, falling into 3 high-level classes based on the "Triangle of Reference" by Aroyo and Welty (2015). We manually annotated a subset of MNLI with the 10 categories. Our categorization shows that items leading to disagreement in annotation are highly heterogeneous. Moreover, the interpretation of the NLI label distribution differs across items. We thus explored alternative approaches for modeling disagreement items: a 4-way classification approach with an additional label (on top of the three NLI labels) capturing disagreement items, and a multilabel classification approach of predicting one or more of the three NLI labels. We found that the two models behave somewhat differently, with the multilabel model offering more interpretable outputs, and thus being more expressive. Our findings deepen our understanding of disagreement in a widely used NLI benchmark and contribute to the growing literature on disagreement in annotation. We hope they highlight directions to reduce disagreement when collecting annotations and to design models to handle the disagreement that persists. The annotations, the guidelines and the code are available at https://github.com/njjiang/NLI_disagreement_taxonomy.

Related work
Focusing on disagreement in annotation is not new: Aroyo and Welty (2015) argued for embracing annotation disagreement, viewing it as signal, and not as noise. Even for tasks with supposedly a unique correct answer, such as part-of-speech tagging, there are items for which the right analysis is debatable (Plank et al., 2014b): is social in social media a noun or an adjective? Plank et al. (2014a) showed that incorporating such disagreement signal into the loss functions of part-of-speech taggers improves performance. Previous work noted that disagreement in annotation exists in many semantic tasks: anaphora resolution (Poesio and Artstein, 2005; Versley, 2008; Poesio et al., 2019), coreference (Recasens et al., 2011), sentiment analysis (Kenyon-Dean et al., 2018), word sense disambiguation (Erk and McCarthy, 2009; Passonneau et al., 2012), among others. Aroyo and Welty (2015) introduced the "Triangle of Reference" framework to conceptualize the annotation process and explain annotation disagreement. Annotation differences can stem from the sentences to be annotated, the labels, or the annotators. Indeed, annotators, who interpret the sentences, produce labels in a way that is defined by the annotation guidelines. Underspecification in each of these three components can result in disagreement in the annotations.

Sources for disagreement
Disagreement can arise from (1) uncertainty in the sentence meaning, (2) underspecification of the guidelines, or (3) annotator behavior. We use the Triangle of Reference to organize our taxonomy.

Disagreement in NLI
de Marneffe et al. (2012) and Uma et al. (2021) showed that disagreement was systematic in the older NLI datasets. Pavlick and Kwiatkowski (2019) showed that real-valued NLI annotations are better modeled as coming from a mixture of Gaussians as opposed to a single Gaussian distribution. Nie et al. (2020) collected categorical NLI annotations and found disagreement to be widespread, corroborating Pavlick and Kwiatkowski (2019)'s findings. Kalouli et al. (2019) found that items involving entity/event coreference and "loose definitions" of inference (e.g., whether a hill covered by grass is the same as the side of a mountain) have lower inter-annotator agreement. However, there is not yet a systematic investigation of how disagreement in NLI arises.
Taxonomy in NLI There is a rich body of work on the taxonomy of reasoning types in NLI, identifying the kinds of inferences exhibited in NLI datasets (i.a., Sammons et al., 2010; LoBue and Yates, 2011; Williams et al., 2022). Our work differs in that we focus on the phenomena that lead to annotation disagreement, which are not necessarily reasoning types (e.g., our category Interrogative Hypothesis, [8] in Table 1). Since we focus on disagreement, we do not categorize different ways of arriving at the same NLI label (e.g., different kinds of high agreement contradiction, as in de Marneffe et al. (2008)). Pavlick and Kwiatkowski (2019) argued that NLI disagreement information should be propagated downstream. Current neural models should thus be evaluated against the full label distribution. Methods for approximating the full distribution have recently been developed for many tasks, using techniques for calibration and learning with soft labels (i.a., Lalor et al., 2017; Zhang et al., 2021; Fornaciari et al., 2021; Zhou et al., 2021; Uma et al., 2022).

Approaches to model disagreement
However, simply because distributions are the most straightforward form of disagreement information does not mean that they are the optimal representation for intrinsic evaluation or in downstream tasks. Calibration techniques are successful at post-editing the classifier's softmax distribution (Guo et al., 2017), but they convey spurious uncertainty for items that do not exhibit disagreement (Zhang et al., 2021).
Categorical decisions tend to be more interpretable and are necessary in downstream tasks. For example, NLI models are often used for automatic fact-checking (Thorne et al., 2018; Luken et al., 2018), where the categorical decision of whether a statement is disinformation determines whether it needs to be censored. Therefore, we explore here different approaches for providing categorical information for disagreement.
For sentiment analysis, Kenyon-Dean et al. (2018) used a classification approach with an additional "Complicated" class to capture items with disagreement. This approach has been applied to NLI and showed some success, obtaining 61.93% F1 on the fourth class "Disagreement" using a vanilla-BERT baseline (standard fine-tuning of BERT), and 66.5% F1 on the "Disagreement" class using the Artificial Annotator architecture. Here, we further test the 4-way classification approach for NLI.
Besides being heterogeneous, a "Complicated" or "Disagreement" class is not easily interpretable. We not only need to know whether there is disagreement, but also in what way: which labels the annotators disagree over. We therefore also take a multilabel classification approach (i.a., Passonneau et al., 2012; Oh et al., 2019; Ferracane et al., 2021), predicting one or more of the three NLI labels.
There is another line of research aiming to model the judgments of individual annotators, as opposed to the aggregated annotations representing the judgments of the population (Gordon et al., 2021; Davani et al., 2022). However, these approaches require the annotators' identities for each annotation, which are often not released with the data.

Disagreement taxonomy
To investigate where disagreement stems from, we conduct a qualitative analysis of parts of the MNLI dataset (Williams et al., 2018). We chose MNLI because it is diverse in genre and inference types, compared to datasets based on image captions which only describe visual scenes (e.g., SICK (Marelli et al., 2014), SNLI (Bowman et al., 2015)).

Data to analyze
The original MNLI dev sets (matched and mismatched sets, differing in genres)1 contain 5 annotations per item. The MNLI dev matched set contains 1,599 items for which exactly 3 annotators (out of the 5) agreed on the label. This subset was reannotated by Nie et al. (2020) with 100 annotations per item, called the ChaosNLI dataset. We randomly sampled 450 items from ChaosNLI. Figure 1 shows the annotations, with items organized by which label was the most frequent. While some items can be seen as having a unique ground truth label (depending on how many annotators need to agree on the same label; here we take 80%, following Jiang and de Marneffe (2019)), other items clearly lead to differing annotations.
We also sampled 60 items from the MNLI dev matched set in which at most 2 out of the 5 annotators agreed on the label, and there is thus no majority label. These items are coded with label "-" in Williams et al. (2018).

[Table 1 excerpt] [7] Temporal Reference
P: However, co-requesters cannot approve additional co-requesters or restrict the timing of the release of the product after it is issued.
H: They cannot restrict timing of the release of the product.
[E,N,C]: [2,0,3] (MNLI), [24,45,31] (ChaosNLI)

Disagreement categories
Our taxonomy of potential disagreement sources consists of 10 categories, shown in Table 1. The categories are organized into three high-level classes, corresponding to the three components of the annotation process in the "Triangle of Reference": (1) uncertainty in the sentence meaning, (2) underspecification of the guidelines, (3) annotator behavior.

Uncertainty in sentence meaning
Some textual phenomena leading to disagreement can be local to Lexical items, where the truth of the hypothesis depends on the meaning of a specific lexical item. That lexical item can have multiple meanings, or its meaning requires certain parameters that remain underspecified in the sentence at hand, as we saw with many in (1).2 Disagreement can also come from a pair of lexical items, where the lexical relationship between the items (e.g., hypernymy, synonymy) is loose, as in [1] in Table 1: do people infer advances in electronics from technological advances?
Other cases involve the holistic meaning of the sentences and interpreting them in different contexts. In some cases, the hypothesis is an Implicature of the premise, as in [2]. By definition, an implicature can be cancelled (Grice, 1975), which leads to a potential for differences in the annotations. Here, some of the most authentic papyrus (are sold in The Pharaonic Village) gives rise to the scalar implicature but not all of the most authentic papyrus, making the hypothesis false since it asserts that authentic papyrus is only sold in The Pharaonic Village. However, if the implicature is cancelled, some can also be interpreted as all (e.g., Some students came. In fact, all came.).

The hypothesis can also target what is being presupposed by the premise. Wh-questions, for instance, presuppose that the entity the question bears on exists. The question What changed? in [3] presupposes that something changed, hence the answer Nothing changed can be viewed as contradictory. However, the premise can also be viewed as not giving enough information to judge the truth of the hypothesis, which would lead to a Neutral label.

2 It is challenging to distinguish between multiple senses and implicit parameters. For instance, in the pair P: Then he sobered. - H: He was drunk., whether H can be inferred from P depends on the word sober: one could be sober from alcohol or from other drugs. Are these two meanings of the word, or is the substance an implicit parameter?
Probabilistic Enrichment items involve making probabilistic inferences from the premise: the inferred content is likely, but not definitely, true in some contexts. In [4], there is some likelihood that nodding to the speaker's assertion means that one agrees with it. If annotators make that inference, they see the hypothesis as Entailment. But, since the premise does not explicitly state the hypothesis, a Neutral label is also warranted. Some premises/hypotheses contain typos or are fragments, making it hard to grasp their exact meaning (as in [5]). We call these cases Imperfection, following Williams et al. (2022).

Underspecification in the guidelines
Some disagreements stem from the loose definition of the NLI task. Assuming coreference between the premise and the hypothesis has been noted as an important aspect of the NLI task (Mirkin et al., 2010) and necessary for obtaining high agreement in annotation (de Marneffe et al., 2008; Bowman et al., 2015; Kalouli et al., 2019). In [6], the hypothesis is a contradiction if we assume that Mughal Prime Minister is the same person in both the premise and the hypothesis. However, it could be the case that Nur Jahan's father and husband both served as Mughal Prime Minister but in different terms, making it Neutral.
While the NLI task assumes coreference between entities and events mentioned in the premise and hypothesis, which entity/event to take into consideration is not always clear. For example, in (2), the premise can be taken to talk about "desegregation being undone in Charlotte by magnet schools", in which case the hypothesis is inferred.
(2) P: Unfortunately, the magnet schools began the undoing of desegregation in Charlotte.
H: Desegregation was becoming disbanded in Charlotte thanks to the magnet schools.
[E,N,C]: [81,6,13]

The premise can also be taken to focus on the fact that "the desegregation being undone in Charlotte by magnet schools is unfortunate". In other words, two different "Questions Under Discussion" (Roberts, 2012) can be posited for the premise. Under that second interpretation, the hypothesis (in which the undoing of desegregation is positive, given the word thanks) contradicts the premise, where the desegregation undoing is unfortunate.
The truth of the hypothesis can also depend on the time at which the hypothesis is evaluated (Temporal Reference), but the NLI annotation guidelines do not specify how to handle such cases. There are two contextually-salient temporal referents in [7]: before or after the product release is issued. If the hypothesis refers to the time after the release is issued, it is true. From the perspective of before the release is issued, it is unclear whether the co-requesters can restrict timing or not.
Unlike assertions, questions do not have truth values (Groenendijk and Stokhof, 1984; Roberts, 2012). It is therefore theoretically ill-defined to ask whether an interrogative hypothesis is true or not given the premise (which is the question asked in Nie et al. (2020)'s annotation interface to build ChaosNLI). However, most of the interrogative hypotheses have interrogative premises (81.8% in the MNLI dev sets; all in our subset). Groenendijk and Stokhof (1984) define the notion of entailment between questions: an interrogative q1 entails another q2 iff every proposition that answers q1 answers q2 as well. Some annotators seem to latch onto this definition, as in (3). Still, there is no definition distinguishing neutral from contradictory pairs of questions.3 Annotators, perhaps to assign some meaning to the Neutral/Contradiction distinction, give judgments that seem to involve applying surface-level features for declarative sentences, choosing Neutral/Contradiction if the sentences involve substitution of unrelated words, as in (4).

Annotator behavior
By definition, disagreement arises when a proportion of annotators behave one way and another proportion another way. We identified two patterns of "systematic behavior" (while it is hard to say for certain what annotators have in mind, the patterns seem robust). When the hypothesis adds content that provides minimal information compared to the premise, but is otherwise entailed, annotators are more likely to judge it as Entailment, thus ignoring/accommodating the minimally added content. For instance, the hypothesis in [9] adds the information source (said in the report), which is not mentioned in the premise. From a strict semantic evaluation, the hypothesis is thus not inferred from the premise. Nonetheless, most people are happy to infer it. Such added content is often not at-issue, i.e., not the main point of the utterance (Potts, 2005; Simons et al., 2010), appearing as modifiers (McNally, 2016) or parentheticals, making it easier for people to ignore if not paying enough attention or not being attuned to such differences.4

These biases are potentially problematic for applications in NLG that use the NLI labels for evaluating paraphrases (modeled as bi-directional entailment, Sekine et al. (2007)), dialog coherence (Dziri et al., 2019), or semantic accuracy (Dušek and Kasner, 2020), or that use NLI as a pretraining task for learned metrics (Sellam et al., 2020). For instance, it would not be semantically accurate for a generated summary to hallucinate and include extraneous, even if not at-issue, content, such as said the report in [9], if not already given in the source text.
When the hypothesis has high lexical overlap with the premise (e.g., involves the same noun phrases), annotators tend to judge it as Entailment even if it is not strictly inferred from the premise. In [10], the hypothesis claims that the white townsfolk think it sounds convincing, whereas the premise only states that the white townsfolk make it sound convincing (and does not mention whose opinion it is). McCoy et al. (2019) pointed out that items in MNLI with high lexical overlap between the premise and the hypothesis often have the Entailment label, and that NLI models learn such shallow heuristics, ending up incorrectly predicting Entailment for items with high overlap. McCoy et al. (2019)'s finding might partially be attributed to such annotator behavior.

Taxonomy development and annotation
The taxonomy was developed by a single annotator, starting by examining the lowest and highest agreement examples in ChaosNLI to identify linguistic phenomena that are potential sources of disagreement in the NLI annotations. Some categories were merged because the distinction between them seemed murky (for instance, the distinction of multiple senses vs. implicit argument in the Lexical category). Event coreference often requires entity coreference, and the distinction between the two is not clear-cut. For the two sentences vendors crammed the streets with shrine offerings and vendors are lining the streets with torches and fires to refer to the same event, we need to assume that they talk about the same set of vendors. We thus only have one Coreference category.
There were two rounds of annotations. In Round 1, one annotator annotated 400 items from ChaosNLI and iteratively refined the taxonomy, while writing annotation guidelines. Another annotator was then trained. In Round 2, both annotators annotated 50 additional items from ChaosNLI and 60 items from MNLI where only 2 out of the 5 original annotations agreed. These 110 items serve to check that the taxonomy does not "overfit" the 400-item sample used while developing it.
Multi-category annotations More than one reason for disagreement may apply. We therefore adopt a multi-category annotation scheme: each item can have multiple categories. For example, in (5), both Implicature and Temporal Reference contribute to disagreement. The premise does not suggest that the park changed name, while the hypothesis does so with the implicature triggered by used to. Therefore, if we evaluate the truth of the hypothesis now, there can be disagreement between Neutral and Contradiction. If we evaluate the truth of the hypothesis in or before 1935, the hypothesis is entailed because the park was named after Corbett at some point. Also, given that the implicature is triggered by a specific lexical item (in contrast to non-conventional conversational implicatures), the category Lexical applies too.
(5) P: The park was established in 1935 and was given Corbett's name after India became independent.
H: The park used to be named after Corbett.
[E,N,C]: [36,34,30]

Inter-annotator agreement Since the annotation requires the understanding of various linguistic phenomena, only expert annotation is possible. The two annotators have graduate linguistic training. The Krippendorff's α with MASI distance (Passonneau, 2006) is 0.69. For the items annotated by both annotators, we then aggregated the two sets of annotations by taking their intersection. This resulted in 24 instances of categories deleted for annotator 1 (in 23 items) and 16 (in 16 items) for annotator 2. There were only 4 items (out of 110) with an empty intersection, which we reconciled.

Table 3: For each disagreement category, the percentage of items exhibiting convergence (at least 80/100 annotators agreed on the same NLI label), the total number of items in the category, and the mean/standard deviation of the majority vote count.
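The MASI distance used inside Krippendorff's α compares two annotation label sets by weighting Jaccard similarity with a monotonicity term (Passonneau, 2006). A minimal sketch of that distance follows; the category names in the example are illustrative, and the full α computation (e.g., NLTK's AnnotationTask) would plug this in as its distance function:

```python
def masi_distance(a: frozenset, b: frozenset) -> float:
    """MASI distance between two label sets: 1 - Jaccard * monotonicity.
    Monotonicity is 1 for identical sets, 2/3 when one set contains the
    other, 1/3 for overlapping sets, and 0 for disjoint sets."""
    if not a and not b:
        return 0.0
    jaccard = len(a & b) / len(a | b)
    if a == b:
        m = 1.0
    elif a <= b or b <= a:
        m = 2 / 3
    elif a & b:
        m = 1 / 3
    else:
        m = 0.0
    return 1 - jaccard * m

# Identical multi-category annotations are at distance 0;
# a subset annotation is penalized less than a disjoint one.
print(masi_distance(frozenset({"Lexical"}), frozenset({"Lexical"})))          # 0.0
print(masi_distance(frozenset({"Lexical"}), frozenset({"Coreference"})))      # 1.0
```

Partial-credit distances like this reward annotators whose category sets overlap, which matters for a multi-category scheme where plain agreement/disagreement would be too coarse.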

Findings and discussion
Through the construction of the taxonomy, we found that disagreement arises for many reasons. The NLI annotations do not always show the full picture in terms of the range and nature of the meanings the sentences carry, because (1) even if an item has multiple possible interpretations, the annotators may converge on one of them, and (2) there are at least two interpretations of the label distribution, arising from either a single probabilistic inference or multiple categorical inferences.

Annotators converge on one interpretation
NLI annotations for items exhibiting some of the factors that contribute to disagreement may actually show high agreement. Indeed, even when an item lends itself to uncertainty or multiple interpretations, a high proportion of annotators may converge on the same interpretation. For instance, in [1] (Table 1), 82 annotators (out of 100) take technological advancement to entail advancement in electronics, even though there are other kinds of technological advancement that are not electronics.
In [5], 90 annotators latch onto the fact that the hypothesis seems totally unrelated to the premise, agreeing on the Neutral label.
Table 3 shows the percentage of items in each taxonomy category for which at least 80 (out of 100) annotators agreed on the same NLI label (which we will refer to as "convergence"). Interestingly, "Accommodating minimally added content" has the largest amount of convergence (25.5%) and the highest mean majority vote (67.8). The majority-voted labels are Entailment (accommodating the content) or Neutral (considering that the content is not given by the premise). Whether accommodation takes place depends on the extent to which the added content is not at-issue and on the content itself. In [9] (Table 1), 97 annotators accommodated the extra content (said the report) in the hypothesis. In (9), however, the hypothesis also introduces new content, all year round, but only 7 annotators accommodate it. In (10), 32 annotators accommodate the added content American, thus more than in (9). The difference could be due to the fact that American modifies the subject, which makes it less at-issue than all year round modifying the entire matrix clause. In [9], said the report appears in a parenthetical at the end of the sentence, which is even less at-issue than modifiers. Identifying such gradience in disagreement is a very difficult task: simply identifying whether the hypothesis adds content is not enough; knowledge about the role of information structure seems necessary too.

Two interpretations of NLI label distributions
It should now be clear that, by modeling the majority vote, we are missing out on the full complexity of language understanding. Some argue that textual inference is probabilistic in nature (Glickman and Dagan, 2005), and probabilistic inferences therefore give rise to disagreement in categorical labels (i.a., Zhang et al., 2021; Zhou et al., 2021). Here, we found that disagreement in the categorical labels arises in at least two ways: (1) a single probabilistic inference, or (2) multiple, potentially categorical, inferences, which is often the case when there are multiple possible specifications of the contextual factors (e.g., coreference, temporal reference, implicit arguments of some lexical items). The two also differ in the kinds of uncertainty they exhibit: one is uncertainty about the state of the world; the other is uncertainty about how to interpret the sentences.
This distinction gives different interpretations of the aggregated label distribution. In [4] (Table 1), each annotator may have an underlying probabilistic judgment of how likely it is that Sir James thinks it's absurd, which is then reflected in the aggregated probability distribution. The probability associated with the Entailment label can be taken as the probabilistic belief (Kyburg, 1968) of an individual annotator for the truth of the hypothesis.
On the other hand, [7] involves both categorical and probabilistic inferences. Whether the hypothesis is entailed depends on whether it is evaluated before or after the product release is issued. If after, readers have a categorical judgment that the hypothesis is entailed. If before, readers have a probabilistic judgment, leading to the uncertainty between Neutral and Contradiction. Therefore, unlike [4], the probability associated with the Entailment label does not represent the judgment of an individual.
We could design experiments to collect empirical evidence for this distinction, such as collecting multilabel or sliding-bar annotations, or free-text explanations, to gain direct evidence of whether annotators have categorical/probabilistic judgments. Pursuing this line of research is left for future work.
Artificial task setup One of the reasons for the occurrence of disagreement may be the somewhat artificial setup of the NLI task. The premise and hypothesis are interpreted in isolation with no surrounding discourse. However, discourse context is needed to resolve much of the uncertainty in meaning pointed out here (e.g., coreference, temporal reference, and implicit arguments of lexical items). Investigating whether incorporating context into NLI annotations improves agreement is left for future work.

Modeling experiments
Now that we understand better how disagreement arises, we explore how to build models that provide disagreement information. As discussed in Section 2, a distribution gives the most fine-grained information but can be misleading to interpret, while categorical information is often needed in downstream applications. Therefore, we experiment with models that provide two kinds of categorical information for disagreement: an additional "Complicated" class for labeling low agreement items (Section 4.3), and a multilabel classification approach, where each item is associated with one or more of the three standard NLI labels (Section 4.4). As baseline, we take the MixUp approach of Zhang et al. (2021), which predicts a distribution over the three labels and uses a threshold to obtain multilabels/4-way labels.
These models can be useful in an annotation pipeline. One needs to collect multiple judgments for each item to cover the range of possible interpretations, but doing so may be prohibitively expensive at a large scale. The annotation budget could thus be prioritized by collecting annotations for items with potential for disagreement, as predicted by the model. Therefore, our goal is not necessarily to maximize accuracy. A model that can recall the possible interpretations is preferred to a model that misses them.

Training data
We saw that there is gradience in disagreement, but we start with clearly delineated data and only take items for which there is distinct (dis)agreement. We first focus on items from ChaosNLI since they have 100 annotations each, giving a clearer signal for (dis)agreement, discarding items where the majority vote is between 60 and 80 (given that it is unclear whether this counts as high or low agreement). However, this gives a highly class-imbalanced set in both schemes, as shown in the line for "Chaos" in Table 4, with fewer items in E/N/C than in the other classes.5 Therefore, we augment the set with data from the original MNLI dev (where items have 5 annotations). We use the following criteria to relabel the data with the 4-way scheme (E, N, C, and Complicated) and the multilabel scheme:
- Items receive a single E, N, or C label (in the 4-way and multilabel schemes) if the majority vote label has more than 80 votes (out of 100 annotations) for the ChaosNLI items or if all 5 annotations agree for the MNLI items.
We split the "Chaos+Orig" set into train/dev/test with sizes 2710/816/1956 respectively, stratified by labels.
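The relabeling of ChaosNLI items can be sketched as follows. The thresholds for the single-label case and the discard band come from the text above; the 20% presence threshold for the multilabel case is our assumption, mirroring the 0.2 threshold applied to model outputs at evaluation time:

```python
def relabel_chaos(counts):
    """counts: [E, N, C] votes out of 100 (ChaosNLI item).
    Returns (4-way label, multilabel), or None for discarded items.
    The 20-vote presence threshold is an assumption mirroring the 0.2
    threshold used on predicted distributions at evaluation."""
    labels = ["E", "N", "C"]
    top = max(counts)
    if top > 80:                       # clear single-label agreement
        single = labels[counts.index(top)]
        return single, [single]
    if top >= 60:                      # unclear: neither high nor low agreement
        return None
    present = [l for l, c in zip(labels, counts) if c >= 20]
    return "Complicated", present

print(relabel_chaos([81, 6, 13]))   # -> ('E', ['E'])
print(relabel_chaos([36, 34, 30]))  # -> ('Complicated', ['E', 'N', 'C'])
```

Items in the 60-80 band, like a [70, 20, 10] vote vector, are dropped rather than forced into either scheme.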

Baseline
We use Zhang et al. (2021)'s MixUp model as baseline for both the 4-way and multilabel schemes. The MixUp model has the same architecture as fine-tuning RoBERTa for classification. During training, each training example is a linear interpolation of two randomly chosen training items, for both the input encodings and the soft labels (the annotation distributions over E/N/C). We used Zhang et al.'s hyperparameters, with a learning rate of 1e-6 and an early stopping patience of 5 epochs. The model is trained on the data split described above by optimizing KL-divergence with soft labels.
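The interpolation step can be sketched as below. Here the inputs are plain feature vectors standing in for the RoBERTa input encodings, and the mixing weight is drawn from a Beta(α, α) distribution with an assumed α; Zhang et al. (2021)'s exact sampling scheme may differ:

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.4):
    """Linearly interpolate two training examples: input features and
    soft labels (annotation distributions over [E, N, C]).
    alpha=0.4 is an assumed hyperparameter, not taken from the paper."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

x, y = mixup_pair([0.0, 0.0], [0.8, 0.1, 0.1],
                  [1.0, 1.0], [0.2, 0.5, 0.3])
# A convex combination of two distributions is still a distribution.
assert abs(sum(y) - 1.0) < 1e-9
```

Because the mixed soft label remains a valid distribution, the KL-divergence loss mentioned above applies to mixed examples unchanged.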
To evaluate, we convert each predicted distribution to a multilabel, taking any label assigned a probability of at least 0.2 to be present (the same threshold we used for the data). The multilabel is then converted to a 4-way label: Complicated if more than one label is present; E, N, or C if it is the only label. Comparing the results from the MixUp model with the ones from our approach will tell whether optimizing for distributions (as done by the MixUp model) gives better predictions than training with categorical labels (as done by our approach), when evaluating with categorical labels.
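The conversion from a predicted distribution to a multilabel, and from there to a 4-way label, can be sketched as:

```python
def to_multilabel(dist, threshold=0.2):
    """dist: predicted probabilities over the three NLI labels.
    Any label with probability >= threshold is taken to be present."""
    labels = ["E", "N", "C"]
    return [l for l, p in zip(labels, dist) if p >= threshold]

def to_4way(multilabel):
    """Complicated if more than one label is present; E/N/C otherwise."""
    return multilabel[0] if len(multilabel) == 1 else "Complicated"

print(to_multilabel([0.55, 0.35, 0.10]))            # -> ['E', 'N']
print(to_4way(to_multilabel([0.55, 0.35, 0.10])))   # -> Complicated
print(to_4way(to_multilabel([0.90, 0.06, 0.04])))   # -> E
```

The same thresholding is what maps the MixUp model's distributional output into the two categorical schemes being compared.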

4-way classification
We fine-tuned RoBERTa (Liu et al., 2019) on the train/dev set using the standard methods for classification. We used an initial learning rate of 1e-5, decaying the learning rate by a factor of 0.8 if dev F1 does not improve for two epochs. We trained for up to 30 epochs, with early stopping if dev F1 does not improve for 10 epochs. We used jiant v1 (Wang et al., 2019b) for our experiments.
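The schedule just described (0.8x decay after two epochs without dev-F1 improvement, early stop after ten) can be sketched as a plain loop; the function and its replayed F1 inputs are hypothetical stand-ins for the actual training loop:

```python
def train_schedule(dev_f1_per_epoch, lr=1e-5, decay=0.8,
                   decay_patience=2, stop_patience=10, max_epochs=30):
    """Replay a sequence of dev F1 scores through the schedule:
    decay lr by 0.8x after 2 epochs without improvement,
    stop early after 10 epochs without improvement."""
    best, since_best, since_decay, epoch = -1.0, 0, 0, -1
    for epoch, f1 in enumerate(dev_f1_per_epoch[:max_epochs]):
        if f1 > best:
            best, since_best, since_decay = f1, 0, 0
        else:
            since_best += 1
            since_decay += 1
            if since_decay >= decay_patience:
                lr *= decay  # 0.8x decay
                since_decay = 0
            if since_best >= stop_patience:
                break  # early stopping
    return lr, best, epoch + 1

# F1 improves twice, then plateaus: 5 decays occur, training stops after 12 epochs
lr, best, n_epochs = train_schedule([0.5, 0.6] + [0.6] * 12)
```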

Results
Table 5 shows accuracy, macro F1, and F1 for each class. Each score is the average from three random initializations. The Complicated class is hard to model, due to its heterogeneity, as we saw in Section 3.2. This is also shown in the confusion matrix in Table 6.
Conversely, there are few errors among the three original NLI classes, which is partly due to the stringent threshold we used to identify items on which we take the majority vote.
100 annotations are better Since the Complicated label is the most confused, we investigate where the confusion comes from. We partition the test set by whether the label comes from the ChaosNLI 100 annotations or the original MNLI 5 annotations, and compare the Complicated F1 on each subset. We also examine predictions on items not used in our train and dev sets. We compare the model predictions with the annotation entropy, shown in Figure 2. Items predicted to be Complicated have significantly higher entropy than items predicted to be other labels (except for predicted Contradiction and Complicated items from ChaosNLI). This suggests that the model learned certain features associated with complicatedness.
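The annotation entropy behind Figure 2 can be computed directly from the vote counts. A minimal sketch, with the function name ours and natural-log entropy assumed (the base is not specified in the text):

```python
import math

def annotation_entropy(counts):
    """Shannon entropy of an item's annotation distribution,
    given [E, N, C] vote counts; higher entropy = more disagreement."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

annotation_entropy([100, 0, 0])   # 0.0: unanimous, no disagreement
annotation_entropy([34, 33, 33])  # near-maximal disagreement (~log 3)
```

The same formula applies to 5-annotation MNLI items; with fewer annotators the entropy estimate is simply coarser, which is why the 100-annotation ChaosNLI items give a clearer signal.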

Multilabel classification
As mentioned in Section 2, the rationale for using a multilabel classification approach is to gain insight into the way in which an item is complicated. Instead of choosing one of the three NLI labels (or four, including Complicated), the model can predict several of the three NLI labels at once.
Model architecture To perform multilabel classification with 3 labels, we make minimal changes to the standard method for fine-tuning RoBERTa for 3-way classification. We predict each E/N/C label independently, applying the sigmoid function to the 3 logits given by the MLP classifier on top of RoBERTa to obtain a probability for each label. We take a label to be present if its probability is greater than 0.5. The model is trained with a cross-entropy loss.
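A numpy stand-in for this head (illustrative only: in the actual model the three logits come from the MLP classifier on top of RoBERTa, and the loss here is the per-label cross-entropy that pairs with independent sigmoids; function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_predict(logits, threshold=0.5):
    """Each E/N/C label is predicted independently: unlike softmax,
    several labels can be present at once."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    return [label for label, p in zip("ENC", probs) if p > threshold]

def multilabel_loss(logits, targets):
    """Per-label cross-entropy against binary presence targets."""
    p = sigmoid(np.asarray(logits, dtype=float))
    t = np.asarray(targets, dtype=float)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).sum())

multilabel_predict([2.0, 1.0, -3.0])  # ['E', 'N']
```

With softmax, raising one label's probability necessarily lowers the others'; the independent sigmoids remove that competition, which is what lets the model output EN, NC, or even ENC.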
Training procedure We used an initial learning rate of 5e-6, decaying the learning rate by a factor of 0.8 if dev F1 does not improve for one epoch. We trained for up to 30 epochs, with early stopping if dev F1 does not improve for 10 epochs.
Results Table 7 gives the macro precision, recall, and F1, and the exact match accuracy partitioned by the number of gold labels (1/2/3 Labels Accuracy), for the test set and for its subsets. Our model has a higher F1 score than the baseline but a lower precision. The baseline model is more successful on items on which annotators agree (higher 1 Label Accuracy), while our model performs better on items with disagreement (higher 2/3 Labels Accuracy).
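The 1/2/3 Labels Accuracy can be sketched as exact-match accuracy bucketed by the number of gold labels (the helper name and toy data are ours):

```python
from collections import defaultdict

def exact_match_by_n_labels(gold, pred):
    """Exact-match accuracy, partitioned by the number of gold labels."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        n = len(g)
        totals[n] += 1
        hits[n] += (set(g) == set(p))
    return {n: hits[n] / totals[n] for n in totals}

gold = [{"E"}, {"E", "N"}, {"N", "C"}, {"E", "N", "C"}]
pred = [{"E"}, {"E", "N"}, {"N"},      {"E", "N", "C"}]
exact_match_by_n_labels(gold, pred)  # {1: 1.0, 2: 0.5, 3: 1.0}
```

Bucketing by gold-label count is what separates the two regimes the text contrasts: 1 Label Accuracy measures performance on agreement items, 2/3 Labels Accuracy on disagreement items.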
Comparing the two test set subsets, we see the same pattern as in the 4-way results: on disagreement items, our model performs better (higher 2/3 Labels Accuracy) on the Chaos subset than on the Orig subset.This corroborates the finding from the 4-way classification that 100 annotations give a better indication of complicatedness.
Multilabel model is more expressive The accuracy decreases from the 4-way classification setup, which is expected since the number of possible labels increases from 4 to 7 (all possible combinations of the 3 labels). However, the macro recall increases compared to the 4-way classification (83.35 vs. 67.78), possibly as a result of more expressivity in the model output and of not having one challenging and heterogeneous class. We also see this more concretely in the contingency table of the 4-way model vs. the multilabel model predictions (Table 8): when the multilabel model predicts more than one label, the 4-way model often predicts the Complicated class or one of the labels predicted by the multilabel model. In other words, the 4-way model may miss one or more labels while the multilabel model can identify all of them.
Takeaways Comparing with the MixUp baseline, which is trained with soft-labels, we see that training with categorical labels performs better at predicting categorical labels. Therefore, for downstream tasks where categorical information is needed, training with categorical labels is recommended. The multilabel model is more expressive, and as we will show in Section 5, it provides fine-grained information that gives a better understanding of what the model has learned. Our results suggest that the multilabel approach could potentially be used as an intrinsic evaluation of how well a model captures the judgments of the population.

Error Analysis
We analyze the model behavior with respect to the categories of disagreement sources. For each category, Figure 3 gives the percentage of ChaosNLI items annotated with at least that category and having converging NLI interpretations (>80 agree on the NLI label) vs. the percentage of items predicted to exhibit disagreement (Complicated by the 4-way model, or having more than one label by the multilabel model). Overall, a category with more agreement (higher majority vote) in the annotations tends to have fewer items predicted as exhibiting disagreement. This is expected given that an item with convergence does not have disagreement as its gold label, and the model performs well overall.
Comparing the two models, we see that all categories, except [7] Temporal Reference, are farther to the right in the multilabel classification (bottom panel) whereas they are more spread out in the 4-way classification (top panel), meaning that the 4-way model predicts an agreement label (E/N/C) more often than the multilabel model. This suggests that the 4-way model is more strongly tied to the convergence statistics and fails to detect potential disagreement. It also aligns with the previous finding that the multilabel model has higher recall.
(4) P: But they persevered, she said, firm and optimistic in their search, until they were finally allowed by a packed restaurant to eat their dinner off the floor. H: Because all of the seats were stolen, they had to eat off the floor. [E,N,C]: [15,73,12], N / NC
Items in [3] Presupposition, [4] Probabilistic Enrichment, and [5] Imperfection are often predicted in both setups to exhibit disagreement (they are to the right in both plots of Figure 3). [6] Coreference, [2] Implicature, and [10] High Overlap also appear to the right, depending on the setup. Among those categories, [3] Presupposition, [2] Implicature, [5] Imperfection, and [10] High Overlap are associated with surface patterns, potentially making it easier for the models to learn that they often exhibit disagreement. We thus take a closer look at the following categories, across all items annotated: Probabilistic Enrichment, Coreference, and Accommodating Minimally Added Content (discussed in Section 3.4).
Probabilistic Enrichment The multilabel model predicts 68% of the items annotated with Probabilistic Enrichment to have more than one NLI label. In particular, 36% are predicted as EN and 27% as NC, corresponding to the common patterns in Probabilistic Enrichment where the enriched (not explicitly stated) inference leads to Entailment/Contradiction, while the Neutral label is warranted without enrichment. We found that the multilabel model often predicts labels when they are only slightly below the threshold of 20 votes that we used to count a label as present (items 1 and 2 in Table 9). Even though in those cases the model is "incorrect" when calculating the metrics, it shows that the model can retrieve subtle inferences. In item 1, 17 annotators chose Neutral, while 82 chose Entailment: the premise does not mention entering a church, but most annotators take that situation to be likely. The multilabel model, however, predicts both Entailment and Neutral, accounting for the possible interpretations.
Coreference For items annotated with Coreference, both models predict Entailment/Contradiction when the premise and hypothesis share the same argument structure or involve simple word substitutions (e.g. wax/clay in item 5 and San'doro/they in item 6, Table 9), which are features of unanimous Entailment/Contradiction. This suggests that such predictions are influenced by the unanimous items. The 4-way model tends to predict Complicated when items annotated with Coreference do not share any structure (as in items 7 and 8).

Accommodating Minimally Added Content
The multilabel model predicts 44% of the items involving minimally added content to have both Entailment and Neutral labels, and 76% of the items to have at least the Neutral label. This is consistent with the majority of these items showing disagreement over Entailment and Neutral, and with the sentences themselves exhibiting features of Neutral (added content) and surface features of Entailment (high lexical overlap), as in items 9 and 10. In item 9, the multilabel model recovers a Neutral inference (the premise does not mention current reports), even though only 9 annotators chose the Neutral label. This further illustrates that the multilabel model is better at recalling possible interpretations.

Conclusion
We examined why disagreement in NLI annotations occurs and found that it arises out of all three components of the annotation process. We experimented with modeling NLI disagreement as 4-way and multilabel classification, and showed that the multilabel model gives a better recall of the range of interpretations. We hope our findings will shed light on how to improve the NLI annotation process, e.g. ways to specify the guidelines to reduce disagreement or introduce contexts that resolve underspecification, ways to gather enough annotations to cover the possible interpretations, as well as ways to model NLI without the single ground truth assumption.

Figure 1 :
Figure 1: ChaosNLI annotations of the 450 items we sampled. Each column of stacked bars represents an item's annotations: the number of votes for each label, with top-down ordering of the labels. The horizontal lines indicate 80 votes.

"
Complicated" is most confused The model performs worse on the Complicated label as opposed to the other three NLI labels.This is consistent with (Kenyon-Dean et al., 2018)'s observation:

Figure 2 :
Figure 2: Boxplots of annotation entropy (left: from original MNLI 5 annotations; right: from ChaosNLI 100 annotations) by predicted label. Number of items shown in parentheses. Triangles indicate the means. P-values from the Mann-Whitney two-sided test.
P: from the town rode past, routed by their diminished numbers and the fury of the Kal and Thorn. H: Kal and Thorn were furious at the villagers. [E,N,C]: [50,41,9], N / EN

Figure 3 :
Figure 3: For each disagreement category, percentage of ChaosNLI items annotated with that category (number in parentheses) and having converging NLI annotations (>80 majority vote) vs. percentage predicted as Complicated in the 4-way setup or as having more than one label in the multilabel setup. The legend also gives the mean majority vote in each category, with standard deviation in parentheses.
had little success predicting that class with LSTM-based models (0.16 F1 for Complicated), because it is heterogeneous and there is likely little learning signal indicating complicatedness. Zhang and de Marneffe (2021) approached the NLI 4-way classification problem using the architecture of Artificial Annotator, an ensemble of multiple BERT models with different biases. They experimented on the NLI version of the CommitmentBank (de Marneffe et al., 2019).

Table 2 :
Frequency and percentage of each combination of categories in the taxonomy, in the two annotation rounds.
(8) P: She had the pathetic aggression of a wife or mother - to Bunt there was no difference. H: Bunt was raised motherless in an orphanage. [E,N,C]: [0,88,12]

Table 4 :
Number of items for each 4-way label and each multilabel combination in each dataset. The number of "Complicated" items is the sum of the numbers of items with more than one label in the multilabel setup.

Table 5 :
Left: Performance on the test set. Right: Performance on the two subsets of the test set, Chaos and Original MNLI. Darker color indicates higher performance. The macro F1 of our model is 68.59% (vs. 65.34% for MixUp), which is on par with previous work (Zhang and de Marneffe, 2021), but with room for improvement. Our model generally outperforms the baseline, suggesting that training with categorical labels is beneficial for predicting categorical labels.

Table 6 :
Confusion matrix of the 4-way classification predictions from the initialization with the highest macro F1.Darker color indicates higher numbers.

Table 8 :
Contingency matrix of the 4-way classification vs. the multilabel predictions, on the full MNLI dev sets (excluding items used in our train/dev sets).
There should be someone here who knew more of what was going on in this world than he did now.
P: Cruises are available from the Bhansi Ghat, which is near the City Palace. H: You can take cruises from Phoenix Arizona.
P: The key question may be not what Hillary knew but when she knew it. H: According to current reports, the question is not if, but when did Hillary know about it.

Table 9 :
Examples from the categorization with ChaosNLI annotations and 4-way/multilabel model predictions.