Abstract
Disagreement in natural language annotation has mostly been studied from a perspective of biases introduced by the annotators and the annotation frameworks. Here, we propose to analyze another source of bias—task design bias, which has a particularly strong impact on crowdsourced linguistic annotations where natural language is used to elicit the interpretation of lay annotators. For this purpose we look at implicit discourse relation annotation, a task that has repeatedly been shown to be difficult due to the relations’ ambiguity. We compare the annotations of 1,200 discourse relations obtained using two distinct annotation tasks and quantify the biases of both methods across four different domains. Both methods are natural language annotation tasks designed for crowdsourcing. We show that the task design can push annotators towards certain relations and that some discourse relation senses can be better elicited with one or the other annotation approach. We also conclude that this type of bias should be taken into account when training and testing models.
1 Introduction
Crowdsourcing has become a popular method for data collection. It not only allows researchers to collect large amounts of annotated data in a shorter amount of time, but also captures human inference in natural language, which should be the goal of benchmark NLP tasks (Manning, 2006). In order to obtain reliable annotations, the crowdsourced labels are traditionally aggregated to a single label per item, using simple majority voting or annotation models that reduce noise from the data based on the disagreement among the annotators (Hovy et al., 2013; Passonneau and Carpenter, 2014). However, there is increasing consensus that disagreement in annotation cannot be generally discarded as noise in a range of NLP tasks, such as natural language inference (De Marneffe et al., 2012; Pavlick and Kwiatkowski, 2019; Chen et al., 2020; Nie et al., 2020), word sense disambiguation (Jurgens, 2013), question answering (Min et al., 2020; Ferracane et al., 2021), anaphora resolution (Poesio and Artstein, 2005; Poesio et al., 2006), sentiment analysis (Díaz et al., 2018; Cowen et al., 2019), and stance classification (Waseem, 2016; Luo et al., 2020). Label distributions are proposed to replace categorical labels in order to represent the label ambiguity (Aroyo and Welty, 2013; Pavlick and Kwiatkowski, 2019; Uma et al., 2021; Dumitrache et al., 2021).
There are various reasons behind the ambiguity of linguistic annotations (Dumitrache, 2015; Jiang and de Marneffe, 2022). Aroyo and Welty (2013) summarize the sources of ambiguity into three categories: the text, the annotators, and the annotation scheme. In downstream NLP tasks, it would be helpful if models could detect possible alternative interpretations of ambiguous texts, or predict a distribution of interpretations by a population. In addition to the existing work on the disagreement due to annotators’ bias, the effect of annotation frameworks has also been studied, such as the discussion on whether entailment should include pragmatic inferences (Pavlick and Kwiatkowski, 2019), the effect of the granularity of the collected labels (Chung et al., 2019), or the system of labels that categorize the linguistic phenomenon (Demberg et al., 2019). In this work, we examine the effect of task design bias, which is independent of the annotation framework, on the quality of crowdsourced annotations. Specifically, we look at inter-sentential implicit discourse relation (DR) annotation, i.e., semantic or pragmatic relations between two adjacent sentences without a discourse connective to which the sense of the relation can be attributed. Figure 1 shows an example of an implicit relation that can be annotated as Conjunction or Result.
Implicit DR annotation is arguably the hardest task in discourse parsing. Discourse coherence is a feature of the mental representation that readers form of a text, rather than of the linguistic material itself (Sanders et al., 1992). Discourse annotation thus relies on annotators’ interpretation of a text. Further, relations can often be interpreted in various ways (Rohde et al., 2016), with multiple valid readings holding at the same time. These factors make discourse relation annotation, especially for implicit relations, a particularly difficult task. We collect 10 different annotations per DR, thereby focusing on distributional representations, which are more informative than categorical labels.
Since DR annotation labels are often abstract terms that are not easily understood by lay individuals, we focus on “natural language” task designs. Decomposing and simplifying an annotation task, where the DR labels can be obtained indirectly from the natural language annotations, has been shown to work well for crowdsourcing (Chang et al., 2016; Scholman and Demberg, 2017; Pyatkin et al., 2020). Crowdsourcing with natural language has become increasingly popular. This includes tasks such as NLI (Bowman et al., 2015), SRL (FitzGerald et al., 2018), and QA (Rajpurkar et al., 2018). This trend is further visible in modeling approaches that cast traditional structured prediction tasks into NL tasks, such as for co-reference (Aralikatte et al., 2021), discourse comprehension (Ko et al., 2021), or bridging anaphora (Hou, 2020; Elazar et al., 2022). It is therefore of interest to the broader research community to see how task design biases can arise, even when the tasks are more accessible to the lay public.
We examine two distinct natural language crowdsourcing discourse relation annotation tasks (Figure 1): Yung et al. (2019) derive relation labels from discourse connectives (DCs) that crowd workers insert; Pyatkin et al. (2020) derive labels from Question Answer (QA) pairs that crowd workers write. Both task designs employ natural language annotations instead of labels from a taxonomy. The two task designs, DC and QA, are used to annotate 1,200 implicit discourse relations in four different domains. This allows us to explore how the task design impacts the obtained annotations, as well as the biases that are inherent to each method. To do so, we showcase how various inter-annotator agreement metrics differ on annotations with distributional and aggregated labels.
We find that both methods have strengths and weaknesses in identifying certain types of relations. We further see that these biases are also affected by the domain. In a series of discourse relation classification experiments, we demonstrate the benefits of collecting annotations with mixed methodologies, we show that training with a soft loss with distributions as targets improves model performance, and we find that cross-task generalization is harder than cross-domain generalization.
The outline of the paper is as follows. We introduce the notion of task design bias and analyze its effect on crowdsourcing implicit DRs, using two different task designs (Sections 3–4). Next, we quantify strengths and weaknesses of each method using the obtained annotations, and suggest ways to reduce task bias (Section 5). Then we look at genre-specific task bias (Section 6). Lastly, we demonstrate the task bias effect on DR classification performance (Section 7).
2 Background
2.1 Annotation Biases
Annotation tends to be an inherently ambiguous task, often with multiple possible interpretations and without a single ground truth (Aroyo and Welty, 2013). An increasing amount of research has studied annotation disagreements and biases.
Prior studies have focused on how crowdworkers can be biased. Worker biases are subject to various factors, such as educational or cultural background, or other demographic characteristics. Prabhakaran et al. (2021) point out that for more subjective annotation tasks, the socio-demographic background of annotators contributes to multiple annotation perspectives and argue that label aggregation obfuscates such perspectives. Instead, soft labels are proposed, such as the ones provided by the CrowdTruth method (Dumitrache et al., 2018), which require multiple judgments to be collected per instance (Uma et al., 2021). Bowman and Dahl (2021) suggest that annotations that are subject to bias from methodological artifacts should not be included in benchmark datasets. In contrast, Basile et al. (2021) argue that all kinds of human disagreements should be predicted by NLU models and thus included in evaluation datasets.
In contrast to annotator bias, a limited amount of research is available on bias related to the formulation of the task. Jakobsen et al. (2022) show that argument annotations exhibit widely different levels of social group disparity depending on which guidelines the annotators followed. Similarly, Buechel and Hahn (2017a, b) study different design choices for crowdsourcing emotion annotations and show that the perspective that annotators are asked to take in the guidelines affects annotation quality and distribution. Jiang et al. (2017) study the effect of workflow for paraphrase collection and find that examples based on previous contributions prompt workers to produce more diverging paraphrases. Hube et al. (2019) show that biased subjective judgment annotations can be mitigated by asking workers to think about responses other workers might give and by making workers aware of their possible biases. Hence, the available research suggests that task design can affect the annotation output in various ways. Further research has studied the collection of multiple labels: Jurgens (2013) compares label selection with scale rating and finds that workers choose additional labels in a word sense labelling task. In contrast, Scholman and Demberg (2017) find that workers usually opt not to provide an additional DR label even when allowed. Chung et al. (2019) compare various label collection methods including single/multiple labelling, ranking, and probability assignment. We focus on the biases in DR annotation approaches using the same set of labels, but translated into different “natural language” for crowdsourcing.
2.2 DR Annotation
Various frameworks exist that can be used to annotate discourse relations, such as RST (Mann and Thompson, 1988) and SDRT (Asher, 1993). In this work, we focus on the annotation of implicit discourse relations, following the framework used to annotate the Penn Discourse Treebank 3.0 (PDTB, Webber et al., 2019). PDTB’s sense classification is structured as a three-level hierarchy, with four coarse-grained sense groups in the first level and more fine-grained senses for each of the next levels.1 The process is a combination of manual and automated annotation: An automated process identifies potential explicit connectives, and annotators then decide on whether the potential connective is indeed a true connective. If so, they specify one or more senses that hold between its arguments. If no connective or alternative lexicalization is present (i.e., for implicit relations), each annotator provides one or more connectives that together express the sense(s) they infer.
DR datasets, such as PDTB (Webber et al., 2019), RST-DT (Carlson and Marcu, 2001), and TED-MDB (Zeyrek et al., 2019), are commonly annotated by trained annotators, who are expected to be familiar with extensive guidelines written for a given task (Plank et al., 2014; Artstein and Poesio, 2008; Riezler, 2014). However, there have also been efforts to crowdsource discourse relation annotations (Kawahara et al., 2014; Kishimoto et al., 2018; Scholman and Demberg, 2017; Pyatkin et al., 2020). We investigate two crowdsourcing approaches that annotate inter-sentential implicit DRs and we deterministically map the NL-annotations to the PDTB3 label framework.
2.2.1 Crowdsourcing DRs with the DC Method
Yung et al. (2019) developed a crowdsourcing discourse relation annotation method using discourse connectives, referred to as the DC method. For every instance, participants first provide a connective that, in their view, best expresses the relation between the two arguments. Note that the connective chosen by the participant might be ambiguous. Therefore, participants disambiguate the relation in a second step, by selecting a connective from a list that is generated dynamically based on the connective provided in the first step. When the first-step insertion does not match any entry in the connective bank (from which the list of disambiguating connectives is generated), participants are presented with a default list of twelve connectives expressing a variety of relations. Based on the connectives chosen in the two steps, the inferred relation sense can be extracted. For example, the Conjunction reading in Figure 1 can be expressed by in addition, and the Result reading can be expressed by consequently.
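To make the two-step mapping concrete, the following Python sketch illustrates the logic; the connective bank entries, default list, and sense assignments shown here are illustrative assumptions on our part, not the actual resource used in the DC task.

```python
# Minimal sketch of the two-step DC annotation logic (illustrative entries only).
CONNECTIVE_BANK = {
    # free-text connective -> {disambiguating connective: PDTB 3.0 sense}
    "and": {"in addition": "conjunction", "consequently": "result"},
    "however": {"on the contrary": "contrast", "despite this": "arg2-as-denier"},
}
# Default list shown when the inserted connective is not in the bank (truncated here;
# in practice these default connectives also map to senses).
DEFAULT_LIST = ["in addition", "consequently", "on the contrary", "despite this"]

def step2_choices(step1_connective: str) -> list[str]:
    """Return the dynamically generated disambiguation list for step 2."""
    entry = CONNECTIVE_BANK.get(step1_connective.lower().strip())
    return list(entry) if entry else DEFAULT_LIST

def dc_sense(step1_connective: str, step2_connective: str) -> str | None:
    """Map the two connective choices to a relation sense (None if uncovered)."""
    entry = CONNECTIVE_BANK.get(step1_connective.lower().strip(), {})
    return entry.get(step2_connective)
```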
2.2.2 Crowdsourcing DRs by the QA Method
Pyatkin et al. (2020) proposed to crowdsource discourse relations using QA pairs. They collected a dataset of intra-sentential QA annotations which aim to represent discourse relations by including one of the propositions in the question and the other in the respective answer, with the question prefix (What is similar to..?, What is an example of..?) mapping to a relation sense. Their method was later extended to also work inter-sententially (Scholman et al., 2022). In this work we make use of the extended approach that relates two distinct sentences through a question and answer. The following QA pair, for example, connects the two sentences in Figure 1 with a result relation.
- (1)
Q: What is the result of Caesar being assassinated by a group of rebellious senators? (S1)
A: A new series of civil wars broke out [...] (S2)
The annotation process consists of the following steps: From two consecutive sentences, annotators are asked to choose a sentence that will be used to formulate a question. The other sentence functions as an answer to that question. Next they start building a question by choosing a question prefix and by completing the question with content from the chosen sentence.
Since it is possible to choose either of the two sentences as question/answer for a specific set of symmetric relations (e.g., What is the reason a new series of civil wars broke out?), we consider both possible formulations equivalent.
The set of possible question prefixes covers all PDTB 3.0 senses (excluding belief and speech-act relations). The direction of the relation sense, e.g., arg1-as-denier vs. arg2-as-denier, is determined by which of the two sentences is chosen for the question/answer. While Pyatkin et al. (2020) allowed crowdworkers to form multiple QA pairs per instance, i.e., annotate more than one discourse sense per relation, we decided to limit the task to one sense per relation per worker. We took this decision in order for the QA method to be more comparable to the DC method, which also allows the insertion of only a single connective.
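For illustration, the sketch below shows how a question prefix together with the choice of question sentence could be resolved to a directed sense; the prefix inventory is a small, hypothetical subset of the full mapping.

```python
# Illustrative subset of the prefix-to-sense mapping; the full inventory covers
# all PDTB 3.0 senses except belief and speech-act relations.
PREFIX_TO_SENSE = {
    # prefix: (sense if the question is built from S1, sense if built from S2)
    "what is the result of":  ("result", "reason"),
    "what is the reason":     ("reason", "result"),
    "what is an example of":  ("arg2-as-instance", "arg1-as-instance"),
    "what is similar to":     ("similarity", "similarity"),  # symmetric
}

def qa_sense(question_prefix: str, question_from_s1: bool) -> str:
    """Resolve a question prefix plus the question/answer choice to a directed sense."""
    sense_s1, sense_s2 = PREFIX_TO_SENSE[question_prefix.lower().rstrip(".? ")]
    return sense_s1 if question_from_s1 else sense_s2

# Figure 1: the question is built from S1 ("What is the result of Caesar being
# assassinated ...?") and answered by S2, which yields the sense "result".
```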
3 Method
3.1 Data
We annotated 1,200 inter-sentential discourse relations using both the DC and the QA task design.2 Of these 1,200 relations, 900 were taken from the DiscoGeM corpus and 300 from the PDTB 3.0.
DiscoGeM Relations
The 900 DiscoGeM instances included in the current study represent different domains: 296 instances come from the Europarl subset of DiscoGeM (written proceedings of prepared political speech from the Europarl corpus; Koehn, 2005), 304 from the literature subset (narrative text from five English books),3 and 300 from the Wikipedia subset (informative text taken from the summaries of 30 Wikipedia articles). These different genres enable a cross-genre comparison, which is necessary given that the prevalence of certain relation types can differ across genres (Rehbein et al., 2016; Scholman et al., 2022a; Webber, 2009).
These 900 relations were already labeled using the DC method in DiscoGeM; we additionally collect labels using the QA method for the current study. In addition to crowd-sourced labels using the DC and QA methods, the Wikipedia subset was also annotated by three trained annotators.4 Forty-seven percent of these Wikipedia instances were labeled with multiple senses by the expert annotators (i.e., were considered to be ambiguous or express multiple readings).
PDTB Relations
The PDTB relations were included for the purpose of comparing our annotations with traditional PDTB gold standard annotations. These instances (all inter-sentential) were selected to represent all relation classes: We randomly sampled at most 15 and at least 2 instances per class (for classes with fewer than 15 relation instances, we sampled all existing relations). The reference labels for the PDTB instances consist of the original PDTB labels annotated as part of the PDTB3 corpus. Only 8% of these consist of multiple senses.
3.2 Crowdworkers
Crowdworkers were recruited via Prolific using a selection approach (Scholman et al., 2022) that has been shown to result in a good trade-off between quality and time/monetary effort for DR annotation. Crowdworkers had to meet the following requirements: be native English speakers, reside in the UK, Ireland, the USA, or Canada, and have obtained at least an undergraduate degree.
Workers who fulfilled these conditions could participate in an initial recruitment task, for which they were asked to annotate a text with either the DC or QA method and were shown immediate feedback on their performance. Workers with an accuracy ≥ 0.5 on this task were qualified to participate in further tasks. We hence created a unique set of crowdworkers for each method. The DC annotations (collected as part of DiscoGeM) were provided by a final set of 199 selected crowdworkers; QA had a final set of 43 selected crowdworkers.5 Quality was monitored throughout the production data collection and qualifications were adjusted according to performance.
Every instance was annotated by 10 workers per method. This number was chosen based on parity with previous research. For example, Snow et al. (2008) show that a sample of 10 crowdsourced annotations per instance yields satisfactory accuracy for various linguistic annotation tasks. Scholman and Demberg (2017) found that assigning a new group of 10 annotators to annotate the same instances resulted in a near-perfect replication of the connective insertions in an earlier DC study.
Instances were annotated in batches of 20. For QA, one batch took about 20 minutes to complete, and for DC, 7 minutes. Workers were reimbursed about £2.50 and £1.88 per batch, respectively.
3.3 Inter-annotator Agreement
We evaluate the two DR annotation methods by the inter-annotator agreement (IAA) between the annotations collected by both methods and IAA with reference annotations collected from trained annotators.
Cohen’s kappa (Cohen, 1960) is a metric frequently used to measure IAA. For DR annotations, a Cohen’s kappa of .7 is considered to reflect good IAA (Spooren and Degand, 2010). However, prior research has shown that agreement on implicit relations is more difficult to reach than on explicit relations: Kishimoto et al. (2018) report an F1 of .51 on crowdsourced annotations of implicits using a tagset with 7 level-2 labels; Zikánová et al. (2019) report κ = .47 (58%) on expert annotations of implicits using a tagset with 23 level-2 labels; and Demberg et al. (2019) find that PDTB and RST-DT annotators agree on the relation sense on 37% of implicit relations. Cohen’s kappa is primarily used for comparison between single labels and the IAAs reported in these studies are also based on single aggregated labels.
However, we also want to compare the obtained 10 annotations per instance with our reference labels that also contain multiple labels. The comparison becomes less straightforward when there are multiple labels because the chance of agreement is inflated and partial agreement should be treated differently. We thus measure the IAA between multiple labels in terms of both full and partial agreement rates, as well as the multi-label kappa metric proposed by Marchal et al. (2022). This metric adjusts the multi-label agreements with bootstrapped expected agreement. We consider all the labels annotated by the crowdworkers in each instance, excluding minority labels with only one vote.6
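As an illustration (a simplified sketch, not the exact implementation behind our results), full and partial agreement between two multi-label annotations of an item can be computed as follows:

```python
from collections import Counter

def sub_labels(votes: list[str], min_votes: int = 2) -> set[str]:
    """Sub-labels of an item: senses chosen by more than one annotator."""
    return {label for label, c in Counter(votes).items() if c >= min_votes}

def full_agreement(a: set[str], b: set[str]) -> bool:
    return a == b            # identical sets of sub-labels

def partial_agreement(a: set[str], b: set[str]) -> bool:
    return bool(a & b)       # at least one sub-label in common

# Corpus-level rates are the means of these indicators over all items; the
# multi-label kappa additionally adjusts them with bootstrapped expected
# agreement (Marchal et al., 2022).
```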
In addition, we compare the distributions of the crowdsourced labels using the Jensen-Shannon divergence (JSD), following existing reports (Erk and McCarthy, 2009; Nie et al., 2020; Zhang et al., 2021). Similarly, minority labels with only one vote are excluded. Since vote distributions are not available for the reference labels, when comparing with them we compute the JSD over flattened distributions: the original distribution of the votes is replaced by an even distribution over the labels that were voted for by more than one annotator. We call this version JSD_flat.
As a third perspective on IAA we report agreement among annotators on an item annotated with QA/DC. Following previous work (Nie et al., 2020), we use entropy of the soft labels to quantify the uncertainty of the crowd annotation. Here labels with only one vote are also included as they contribute to the annotation uncertainty. When calculating the entropy, we use a logarithmic base of n = 29, where n is the number of possible labels. A lower entropy value suggests that the annotators agree with each other more and the annotated label is more certain. As discussed in Section 1, the source of disagreement in annotations could come from the items, the annotators, and the methodology. High entropy across multiple annotations of a specific item within the same annotation task suggests that the item is ambiguous.
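The distribution-level measures can be sketched as follows; this is a minimal illustration assuming per-item vote lists, and details such as whether the divergence or the distance is reported may differ in our actual evaluation scripts.

```python
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def vote_distribution(votes, inventory, min_votes=2):
    """Distribution over the sense inventory; single-vote labels are dropped."""
    c = Counter(votes)
    vec = np.array([c[s] if c[s] >= min_votes else 0 for s in inventory], dtype=float)
    total = vec.sum()
    return vec / total if total > 0 else vec

def flat_distribution(dist):
    """For JSD_flat: an even distribution over the senses that received votes."""
    mask = (dist > 0).astype(float)
    return mask / mask.sum()

def jsd(p, q):
    # scipy returns the Jensen-Shannon distance; squaring yields the divergence
    return jensenshannon(p, q, base=2.0) ** 2

def annotation_entropy(votes, n_labels=29):
    """Entropy with logarithmic base n_labels (single-vote labels are kept here)."""
    counts = np.array(list(Counter(votes).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p) / np.log(n_labels)).sum())
```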
4 Results
We first compare the IAA between the two crowdsourced annotations, then we discuss IAA between DC/QA and the reference annotations, and lastly we perform an analysis based on annotation uncertainty. Here, the “sub-labels” of an instance are all relation senses that received more than one vote, and its “label distribution” is the distribution of the votes over these sub-labels.
4.1 IAA Between the Methods
Table 1 shows that both methods yield more than two sub-labels per instance after excluding minority labels with only one vote. This supports the idea that multi-sense annotations better capture the fact that often more than one sense can hold implicitly between two discourse arguments.
Table 1: IAA between the QA and DC annotations per domain.

| | Europarl | Novel | Wiki. | PDTB | all |
| --- | --- | --- | --- | --- | --- |
| Item counts | 296 | 304 | 300 | 302 | 1202 |
| QA sub-labels/item | 2.13 | 2.21 | 2.26 | 2.45 | 2.21 |
| DC sub-labels/item | 2.37 | 2.00 | 2.09 | 2.21 | 2.17 |
| full/+partial agreement | .051/.841 | .092/.865 | .060/.920 | .050/.884 | .063/.878 |
| multi-label kappa | .813 | .842 | .903 | .868 | .857 |
| JSD | .505 | .492 | .482 | .510 | .497 |
Table 1 also presents the IAA between the labels crowdsourced with QA and DC per domain. The agreement between the two methods is good: The labels assigned by the two methods (or at least one of the sub-labels in case of a multi-label annotation) match for about 88% of the items. This suggests that both methods are valid, as they produce similar sets of labels.
The full agreement scores, however, are very low. This is expected, as the chance of matching on all sub-labels is much lower than in a single-label setting. The multi-label kappa (which takes chance agreement over multiple labels into account) and the JSD (which compares the distributions of the multiple labels) are hence more suitable. We note that the PDTB gold annotation that we use for evaluation does not assign multiple relations systematically and has a low rate of double labels. This explains why the PDTB subset has high partial agreement while its JSD is the worst.
4.2 IAA Between Crowdsourced and Reference Labels
Table 2 compares the labels crowdsourced by each method and the reference labels, which are available for the Wikipedia and PDTB subsets. It can be observed that both methods achieve higher full agreement with the reference labels than with each other in both domains. This indicates that the two methods are complementary, with each method better capturing different sense types. In particular, the QA method tends to show higher agreement with the reference for Wikipedia items, while the DC annotations show higher agreement with the reference for PDTB items. This can possibly be attributed to the development of the methodologies: The DC method was originally developed by testing on data from the PDTB in Yung et al. (2019), whereas the QA method was developed by testing on data from Wikipedia and Wikinews in Pyatkin et al. (2020).
Table 2: Agreement between the crowdsourced (QA/DC) labels and the reference labels for the Wikipedia and PDTB subsets.

| | Wiki. | PDTB |
| --- | --- | --- |
| Item counts | 300 | 302 |
| Ref. sub-labels/item | 1.54 | 1.08 |
| QA: sub-labels/item | 2.26 | 2.45 |
| QA: full/+partial agreement | .133/.887 | .070/.487 |
| QA: multi-label kappa | .857 | .449 |
| QA: JSD_flat | .468 | .643 |
| DC: sub-labels/item | 2.09 | 2.21 |
| DC: full/+partial agreement | .110/.853 | .103/.569 |
| DC: multi-label kappa | .817 | .524 |
| DC: JSD_flat | .483 | .606 |
4.3 Annotation Uncertainty
Table 3 compares the average entropy of the soft labels collected by both methods. It can be observed that the uncertainty among the labels chosen by the crowdworkers is similar across domains but always slightly lower for DC. We further look at the correlation between annotation uncertainty and cross-method agreement, and find that agreement between methods is substantially higher for those instances where within-method entropy was low. Similarly, we find that agreement between crowdsourced annotations and gold labels is highest for those relations where little entropy was found in the crowdsourced annotations.
Table 3: Average entropy of the soft labels per method and domain.

| | Europarl | Wikipedia | Novel | PDTB |
| --- | --- | --- | --- | --- |
| QA | 0.40 | 0.38 | 0.38 | 0.41 |
| DC | 0.37 | 0.34 | 0.35 | 0.36 |
Next, we check whether the item effect is similar across methods and domains. Figure 2 shows, for each method and for the Wikipedia/PDTB subsets, the correlation between an item’s annotation entropy and its agreement with the reference. It illustrates that the annotations of both methods diverge more from the reference as annotation uncertainty increases. While the effect of uncertainty is similar across methods on the Wikipedia subset, on the PDTB subset the quality of the QA annotations depends more strongly on uncertainty than that of the DC annotations. This means that method bias also exists at the level of annotation uncertainty and should be taken into account when, for example, entropy is used as a criterion to select reliable annotations.
5 Sources of the Method Bias
In this section, we analyze method bias in terms of the sense labels collected by each method. We also examine the potential limitations of the methods which could have contributed to the bias and demonstrate how we can utilize information on method bias to crowdsource more reliable labels. Lastly, we provide a cross-domain analysis.
Table 5 presents the confusion matrix of the labels collected by both methods for the most frequent level-2 relations. Figure 3 and Table 4 show the distribution of the true and false positives of the sub-labels. These results show that both methods are biased towards certain DRs. The source of these biases can be categorized into two types, which we will detail in the following subsections.
Table 4: False negative (FN) and false positive (FP) counts of the sub-labels per method.

| label | FN_QA | FN_DC | FP_QA | FP_DC |
| --- | --- | --- | --- | --- |
| conjunction | 43 | 46 | 203 | 167 |
| arg2-as-detail | 42 | 62 | 167 | 152 |
| precedence | 19 | 18 | 18 | 37 |
| arg2-as-denier | 38 | 20 | 15 | 47 |
| result | 10 | 5 | 110 | 187 |
| contrast | 8 | 17 | 84 | 39 |
| arg2-as-instance | 10 | 7 | 44 | 57 |
| reason | 12 | 17 | 54 | 37 |
| synchronous | 20 | 27 | 11 | 5 |
| arg2-as-subst | 21 | 13 | 1 | 0 |
| equivalence | 22 | 22 | 2 | 1 |
| succession | 17 | 15 | 24 | 3 |
| similarity | 7 | 8 | 15 | 12 |
| norel | 12 | 12 | 0 | 0 |
| arg1-as-detail | 9 | 8 | 39 | 13 |
| disjunction | 5 | 4 | 10 | 0 |
| arg1-as-denier | 3 | 3 | 33 | 31 |
| arg2-as-manner | 2 | 2 | 9 | 0 |
| arg2-as-excpt | 2 | 2 | 1 | 0 |
| arg2-as-goal | 1 | 1 | 5 | 0 |
| arg2-as-cond | 1 | 1 | 0 | 0 |
| arg2-as-negcond | 1 | 1 | 0 | 0 |
| arg1-as-goal | 1 | 1 | 3 | 0 |
5.1 Limitation of Natural Language for Annotation
Both QA and DC face limitations in representing DRs in natural language. For example, the QA method confuses workers when the question phrase contains a connective:7
- (2)
“Little tyke,” chortled Mr. Dursley as he left the house. He got into his car and backed out of number four’s drive. [QA: succession, precedence; DC: conjunction, precedence]
In the above example, the majority of the workers formed the question “After what he left the house?”, which was likely a confusion with “What did he do after he left the house?”. This could explain the frequent confusion between precedence and succession by QA, resulting in the frequent FPs of succession (Figure 3).8
For DC, rare relations which lack a frequently used connective are harder to annotate, for example:
- (3)
He had made an arrangement with one of the cockerels to call him in the mornings half an hour earlier than anyone else, and would put in some volunteer labour at whatever seemed to be most needed, before the regular day’s work began. His answer to every problem, every setback, was “I will work harder!” - which he had adopted as his personal motto. [QA: arg1-as-instance; DC: result]
It is difficult to use the DC method to annotate the arg1-as-instance relation due to the lack of typical, specific, and context-independent connective phrases that mark these rare relations, such as “this is an example of ...”. By contrast, the QA method allows workers to form a question-answer pair in the reverse direction, with S1 being the answer to a question built from S2, using the same question words, e.g., What is an example of the fact that his answer to every problem [...] was “I will work harder!”?. This allows workers to label rarer relation types that were not uncovered even by trained annotators.
Many common DCs are ambiguous, such as but and and, and can be hard to disambiguate. To address this, the DC method provides workers with unambiguous connectives in the second step. However, these unambiguous connectives are often relatively uncommon and come with different syntactic constraints, depending on whether they are coordinating or subordinating conjunctions or discourse adverbials. Hence, they do not fit in all contexts. Additionally, some of the unambiguous connectives sound very “heavy” and would not be used naturally in a given sentence. For example, however is often inserted in the first step, but it can mark multiple relations and is disambiguated in the second step by the choice among on the contrary for contrast, despite for arg1-as-denier, and despite this for arg2-as-denier. Despite this was chosen frequently since it can be applied to most contexts. This explains the DC method’s bias towards arg2-as-denier against contrast (Figure 3: most FPs of arg2-as-denier and most FNs of contrast come from DC).
While the QA method also requires workers to select from a set of question starts, which also contain infrequent expressions (such as Unless what..?), workers are allowed to edit the text to improve the wordings of the questions. This helps reduce the effect of bias towards more frequent question prefixes and makes crowdworkers doing the QA task more likely to choose infrequent relation senses than those doing the DC task.
5.2 Guideline Underspecification
Jiang and de Marneffe (2022) report that some disagreements in NLI tasks come from the loose definition of certain aspects of the task. We found that QA and DC likewise do not give clear enough instructions regarding argument spans. The DRs are annotated at the boundary of two consecutive sentences, but neither method restricts workers to annotating DRs that span exactly the two sentences.
More specifically, the QA method allows the crowdworkers to form questions by copying spans from one of the sentences. While this makes sure that the relation lies locally between two consecutive sentences, it also sometimes happens that workers highlight partial spans and annotate relations that span over parts of the sentences. For example:
- (4)
I agree with Mr Pirker, and it is probably the only thing I will agree with him on if we do vote on the Ludford report. It is going to be an interesting vote. [QA: arg2-as-detail, reason; DC: conjunction, result]
In Ex. (4), workers constructed the question “What provides more details on the vote on the Ludford report?”. This is similar to the instructions in the PDTB 2.0 and 3.0 annotation manuals, which specify that annotators should take minimal spans that do not have to cover the entire sentence. Other relations would be inferred when the argument span is expanded to the whole sentence, for example a result relation reflecting that there is little agreement, which will make the vote interesting.
Often, a sentence can be interpreted as an elaboration of certain entities in the previous sentence. This could explain why arg1/2-as-detail tends to be overlabelled by QA. Figure 3 shows that QA has more than twice as many FPs for arg2-as-detail as DC; the contrast is even bigger for arg1-as-detail. Yet it is not trivial to filter out questions that refer to only a part of the sentence, because in some cases the highlighted entity does represent the whole argument span.9 Clearer instructions in the guidelines are desirable.
Similarly, DC does not limit workers to annotating relations between the two sentences; consider:
- (5)
When two differently-doped regions exist in the same crystal, a semiconductor junction is created. The behavior of charge carriers, which include electrons, ions and electron holes, at these junctions is the basis of diodes, transistors and all modern electronics. [Ref: arg2-as-detail; QA: arg2-as-detail, conjunction; DC: conjunction, result]
In this example, many workers inserted as a result, which naturally marks the intra-sentential relation (... is created as a result). Many relations are thus potentially spuriously labelled as result, a sense that is frequent between larger chunks of text. Table 5 shows that the most frequent confusion is between DC’s cause and QA’s conjunction.10 Within the level-2 cause sense, the level-3 result relation turns out to be the main contributor to the observed bias. Figure 3 also shows that most FPs of result come from the DC method.
5.3 Aggregating DR Annotations Based on Method Bias
The qualitative analysis above provides insights into certain method biases observed in the label distributions, such as QA’s bias towards arg1/2-as-detail and succession and DC’s bias towards concession and result. Being aware of these biases would allow the methods to be combined: After first labelling all instances with the more cost-effective DC method, result relations, which we know tend to be overlabelled by DC, could be re-annotated using the QA method. We simulate this for our data and find that it would increase the partial agreement from 0.853 to 0.913 for Wikipedia and from 0.569 to 0.596 for PDTB.
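The simulation itself is simple. The sketch below is a hypothetical re-implementation assuming per-item sub-label sets for both methods and the reference; it is not the exact script used for the numbers above.

```python
def simulate_combined_annotation(dc_labels, qa_labels, reference):
    """dc_labels, qa_labels, reference: dicts mapping item id -> set of sub-labels.
    Items that DC labels as result are 're-annotated' with the QA labels; partial
    agreement with the reference is then recomputed."""
    agree = 0
    for item, ref in reference.items():
        labels = qa_labels[item] if "result" in dc_labels[item] else dc_labels[item]
        agree += bool(labels & ref)   # partial agreement: any shared sub-label
    return agree / len(reference)
```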
6 Analysis by Genre
For each of the four genres (Novel, Wikipedia, Europarl, and WSJ) we have ∼300 implicit DRs annotated by both DC and QA. Scholman et al. (2022a) showed, based on the DC method, that in DiscoGeM, conjunction is prevalent in the Wikipedia domain, precedence in the literature domain, and result in Europarl. The QA annotations replicate this finding, as displayed in Figure 4.
It appears more difficult to obtain agreement with the majority labels in Europarl than in other genres, which is reflected in the average entropy (see Table 3) of the distributions for each genre, where DC has the highest entropy in the Europarl domain and QA the second highest (after PDTB). Table 1 confirms these findings, showing that the agreement between the two methods is highest for Wikipedia and lowest for Europarl.
In the latter domain, the DC method results in more causal relations: 36% of the conjunctions labelled by QA are labelled as result in DC.11 Manual inspection of these DC annotations reveals that workers frequently chose considering this, but only in the Europarl subset. This connective phrase is typically used to mark a pragmatic result relation, where the result reading comes from the belief of the speaker (Ex. (4)). This type of relation is expected to be more frequent in speech and argumentative contexts and is labelled as result-belief in PDTB3. QA does not have a question prefix available that could capture result-belief senses. The result labels obtained by DC are therefore a better fit with the PDTB3 framework than QA’s conjunctions. Concession is generally more prevalent with the DC method, especially in Europarl, with 9% compared to 3% for QA. Contrast, on the other hand, seems to be favored by the QA method and is most frequent in Wikipedia, where it accounts for 6% of QA labels compared to 3% for DC. Figure 4 also highlights that for the QA approach, annotators tend to choose a wider variety of senses that are rarely if ever annotated by DC, such as purpose, condition, and manner.
We conclude that encyclopedic and literary texts are the most suitable to be annotated using either DC or QA, as they show higher inter-method agreement (and for Wikipedia also higher agreement with gold). Spoken-language and argumentative domains on the other hand are trickier to annotate as they contain more pragmatic readings of the relations.
7 Case Studies: Effect of Task Design on DR Classification Models
Analysis of the crowdsourced annotations reveals that the two methods have different biases and different correlations with domains and the style (and possibly function) of the language used in the domains. We now investigate the effect of task design bias on automatic prediction of implicit discourse relations. Specifically, we carry out two case studies to demonstrate the effect that task design and the resulting label distributions have on discourse parsing models.
Task and Setup
We formulate the task of predicting implicit discourse relations as follows. The input to the model consists of two sequences, S1 and S2, which represent the arguments of a discourse relation. The targets are PDTB 3.0 sense types (including level-3). The model architecture is similar to the model for implicit DR prediction by Shi and Demberg (2019). We experiment with two different losses and targets: a cross-entropy loss where the target is a single majority label, and a soft cross-entropy loss where the target is a probability distribution over the annotated labels. Using the 10 annotations per instance, we obtain label distributions for each relation, which we use as soft targets. Training with a soft loss has been shown to improve generalization in vision and NLP tasks (Peterson et al., 2019; Uma et al., 2020). As suggested in Uma et al. (2020), we normalize the sense distribution over the 30 possible labels12 with a softmax.
Assume one has a relation with the following annotations: 4 result, 3 conjunction, 2 succession, 1 arg1-as-detail. For the hard loss, the target would be the majority label: result. For the soft loss we normalize the counts (every label with no annotation has a count of 0) using a softmax, for a smoother distribution without zeros.
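The following PyTorch sketch illustrates how the hard and soft targets and the soft cross-entropy loss can be constructed; it is a simplified illustration of the setup described above, not our exact training code.

```python
import torch
import torch.nn.functional as F

def make_targets(vote_counts: dict[str, int], senses: list[str]):
    """Hard (majority index) and soft (softmax over counts) targets for one relation."""
    counts = torch.tensor([float(vote_counts.get(s, 0)) for s in senses])
    hard = counts.argmax()              # e.g., index of "result" in the example above
    soft = F.softmax(counts, dim=0)     # smooth distribution without zeros
    return hard, soft

def soft_cross_entropy(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a target distribution instead of a single label index."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```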
Data
In addition to the 1,200 instances analyzed in the current contribution, we use all annotations from DiscoGeM as training data. DiscoGeM, which was annotated with the DC method, adds 2,756 Novel relations, 2,504 Europarl relations, and 345 Wikipedia relations. We formulate different setups for the case studies.
7.1 Case 1: Incorporating Data from Different Task Designs
The purpose of this study is to see if a model trained on data crowdsourced with the DC/QA methods can generalize to traditionally annotated test sets. We thus test on the 300 Wikipedia relations annotated by experts (Wiki gold), all implicit relations from the test set of PDTB 3.0 (PDTB test), and the implicit relations of the English test set of TED-MDB (Zeyrek et al., 2020). For training data, we either use (1) all of the DiscoGeM annotations (Only DC); or (2) 1,200 QA annotations from all four domains, plus 5,605 DC annotations from the rest of DiscoGeM (Intersection, ∩); or (3) 1,200 annotations which combine the label counts (e.g., 20 counts instead of 10) of QA and DC, plus 5,605 DC annotations from the rest of DiscoGeM (Union, ∪). We hypothesize that this union will lead to improved results due to the annotation distribution coming from a bigger sample. When testing on Wiki gold, the corresponding subset of Wikipedia relations is removed from the training data. We randomly sampled 30 relations for dev.
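For clarity, the three training configurations for the doubly annotated items can be sketched as follows (a minimal illustration with assumed data structures):

```python
from collections import Counter

def training_counts(qa_votes, dc_votes, setup):
    """Per-item label counts for the doubly annotated relations under each setup."""
    if setup == "only_dc":
        return Counter(dc_votes)                      # 10 DC votes
    if setup == "intersection":                       # ∩: use the QA annotations
        return Counter(qa_votes)
    if setup == "union":                              # ∪: pooled counts (20 votes)
        return Counter(qa_votes) + Counter(dc_votes)
    raise ValueError(f"unknown setup: {setup}")
```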
Results
Table 6 shows how the model generalizes to traditionally annotated data. On the PDTB and Wikipedia test sets, the model with a soft loss generally performs better than the hard-loss model. TED-MDB, on the other hand, contains only a single label per relation, and training with a distributional loss is therefore less beneficial. Mixing DC and QA data (∩) improves performance only in the soft case on PDTB. Merging the label counts of the two methods (∪), on the other hand, leads to the best model performance on both PDTB and TED-MDB. On Wikipedia, the best performance is obtained when training on soft DC-only distributions. Looking at label-specific differences in performance, we observe that the improvement on the Wikipedia test set mainly comes from better precision and recall when predicting arg2-as-detail, while on PDTB, QA+DC∩ Soft is better at predicting conjunction.
Table 6: Performance of models trained on different data and loss configurations, tested on traditionally annotated test sets.

| | PDTB test | Wiki gold | TED-MDB |
| --- | --- | --- | --- |
| DC | 0.34† | 0.65† | 0.36 |
| DC Soft | 0.29*† | 0.70*† | 0.34* |
| QA+DC∩ | 0.34★ | 0.67 | 0.37 |
| QA+DC∩ Soft | 0.38*★ | 0.66* | 0.31* |
| QA+DC∪ | 0.35♠ | 0.49♠ | 0.36♠ |
| QA+DC∪ Soft | 0.41*♠ | 0.67*♠ | 0.43*♠ |
We conclude that training on data that comes from different task designs does not hurt performance, and even slightly improves performance when using majority vote labels. When training with a distribution, the union setup (∪) seems to work best.
7.2 Case 2: Cross-domain vs Cross-method
The purpose of this study is to investigate how cross-domain generalization is affected by method bias. In other words, we want to compare a cross-domain and cross-method setup with a cross-domain and same-method setup. We test on the domain-specific data from the 1,200 instances annotated by QA and DC, respectively, and train on various domain configurations from DiscoGeM (excluding dev and test), together with the extra 300 PDTB instances, annotated by DC.
Table 7 shows the different combinations of data sets we use in this study (columns) as well as the results of in- and cross-domain and in- and cross-method predictions (rows). Both a change in domain and a change in annotation task lead to lower performance. Interestingly, the results show that the task factor has a stronger effect on performance than the domain: When training on DC distributions, the QA test results are worse than the DC test results in all cases. This indicates that task bias is an important factor to consider when training models. Generally, except in the out-of-domain novel test case, training with a soft loss leads to the same or considerably better generalization accuracy than training with a hard loss. We thus confirm the findings of Peterson et al. (2019) and Uma et al. (2020) also for DR classification.
8 Discussion and Conclusion
DR annotation is a notoriously difficult task with low IAA. Annotations are not only subject to the interpretation of the coder (Spooren and Degand, 2010), but also to the framework (Demberg et al., 2019). The current study extends these findings by showing that the task design also crucially affects the output. We investigated the effect of two distinct crowdsourced DR annotation tasks on the obtained relation distributions. These two tasks are unique in that they use natural language to annotate. Even though these designs are more intuitive to lay individuals, we show that such natural language-based annotation designs also suffer from bias and leave room for varying interpretations (as do traditional annotation tasks).
The results show that both methods have unique biases, but also that both methods are valid, as similar sets of labels are produced. Further, the methods seem to be complementary: Both methods show higher agreement with the reference label than with each other. This indicates that the methods capture different sense types. The results further show that the textual domain can push each method towards different label distributions. Lastly, we simulated how aggregating annotations based on method bias improves agreement.
We suggest several modifications to both methods for future work. For QA, we recommend replacing question prefix options that start with a connective, such as “After what”. The revised options should ideally start with a Wh-question word, for example, “What happens after..”. This would make the questions sound more natural and help to prevent confusion with respect to level-3 sense distinctions. For DC, an improved interface that allows workers to highlight argument spans could serve as a check that the annotated relation indeed holds between the two consecutive sentences. Syntactic constraints that make it difficult to insert certain rare connectives could also be mitigated if workers are allowed to make minor edits to the texts.
Considering that both methods show benefits and possible downsides, it could be interesting to combine them for future crowdsourcing efforts. Given that obtaining DC annotations is cheaper and quicker, it could make sense to collect DC annotations on a larger scale and then use the QA method for a specific subset that shows high label entropy. Another option would be to merge both methods, by first letting the crowdworkers insert a connective and then using QA pairs for the second, connective-disambiguation step. Lastly, since we showed that often more than one relation sense can hold, it would make sense to allow annotators to write multiple QA pairs or insert multiple possible connectives for a given relation.
The DR classification experiments revealed that generalization across data from different task designs is hard, in the DC and QA case even harder than cross-domain generalization. Additionally, we found that merging data distributions coming from different task designs can help boost performance on data coming from a third source (traditional annotations). Lastly, we confirmed that soft modeling approaches using label distributions can improve discourse classification performance.
Task design bias has been identified as one source of annotation bias and acknowledged as an artifact of the dataset in other linguistic tasks as well (Pavlick and Kwiatkowski, 2019; Jiang and de Marneffe, 2022). Our findings show that the effect of this type of bias can be reduced by training with data collected by multiple methods. This could be the same for other NLP tasks, especially those cast in natural language, and comparing their task designs could be an interesting future research direction. We therefore encourage researchers to be more conscious about the biases crowdsourcing task design introduces.
Acknowledgments
This work was supported by the Deutsche Forschungsgemeinschaft, Funder ID: http://dx.doi.org/10.13039/501100001659, grant number: SFB1102: Information Density and Linguistic Encoding, by the European Research Council, ERC-StG grant no. 677352, and the Israel Science Foundation grant 2827/21, for which we are grateful. We also thank the TACL reviewers and action editors for their thoughtful comments.
Notes
We merge the belief and speech-act relation senses (which cannot be distinguished reliably by QA and DC) with their corresponding more general relation senses.
The annotations are available at https://github.com/merelscholman/DiscoGeM.
Animal Farm by George Orwell, Harry Potter and the Philosopher’s Stone by J. K. Rowling, The Hitchhikers Guide to the Galaxy by Douglas Adams, The Great Gatsby by F. Scott Fitzgerald, and The Hobbit by J. R. R. Tolkien.
Instances were labeled by two annotators and verified by a third; Cohen’s κ agreement between the first annotator and the reference label was .82 (88% agreement), and between the second and the reference label was .96 (97% agreement). See Scholman et al. (2022a) for additional details.
The larger set of selected workers in the DC method is because more data was annotated by DC workers as part of the creation of DiscoGeM.
We assumed there were 10 votes per item and removed labels with less than 20% of votes, even though in rare cases there could be 9 or 11 votes. On average, the removed labels represent 24.8% of the votes per item.
The examples are presented in the following format: italics = argument 1; bolded = argument 2; plain = contexts.
Similarly, the question “Despite what …?” is easily confused with “despite...”, which could explain the frequent FP of arg1-as-denier by the QA method.
Such as “a few final comments” in this example: Ladies and gentlemen, I would like to make a few final comments. This is not about the implementation of the habitats directive.
A chi-squared test confirms that the observed distribution is significantly different from what could be expected based on chance disagreement.
This appeared to be distributed over many annotators and is thus a true method bias.
precedence, arg2-as-detail, conjunction, result, arg1-as-detail, arg2-as-denier, contrast, arg1-as-denier, synchronous, reason, arg2-as-instance, arg2-as-cond, arg2-as-subst, similarity, disjunction, succession, arg1-as-goal, arg1-as-instance, arg2-as-goal, arg2-as-manner, arg1-as-manner, equivalence, arg2-as-excpt, arg1-as-excpt, arg1-as-cond, differentcon, norel, arg1-as-negcond, arg2-as-negcond, arg1-as-subst.