Abstract
Discourse parsing has been studied for decades. However, it still remains challenging to utilize discourse parsing for real-world applications because the parsing accuracy degrades significantly on out-of-domain text. In this paper, we report and discuss the effectiveness and limitations of bootstrapping methods for adapting modern BERT-based discourse dependency parsers to out-of-domain text without relying on additional human supervision. Specifically, we investigate self-training, co-training, tri-training, and asymmetric tri-training of graph-based and transition-based discourse dependency parsing models, as well as confidence measures and sample selection criteria in two adaptation scenarios: monologue adaptation between scientific disciplines and dialogue genre adaptation. We also release COVID-19 Discourse Dependency Treebank (COVID19-DTB), a new manually annotated resource for discourse dependency parsing of biomedical paper abstracts. The experimental results show that bootstrapping is significantly and consistently effective for unsupervised domain adaptation of discourse dependency parsing, but the low coverage of accurately predicted pseudo labels is a bottleneck for further improvement. We show that active learning can mitigate this limitation.
1 Introduction
Discourse parsing aims to uncover structural organization of text, which is useful in Natural Language Processing (NLP) applications such as document summarization (Louis et al., 2010; Hirao et al., 2013; Yoshida et al., 2014; Bhatia et al., 2015; Durrett et al., 2016; Xu et al., 2020), text categorization (Ji and Smith, 2017; Ferracane et al., 2017), question answering (Verberne et al., 2007; Jansen et al., 2014), and information extraction (Quirk and Poon, 2017). In particular, dependency-style representation of discourse structure has been studied intensively in recent years (Asher and Lascarides, 2003; Hirao et al., 2013; Li et al., 2014b; Morey et al., 2018; Hu et al., 2019; Shi and Huang, 2019). Figure 1 shows an example of discourse dependency structure, which is recorded in COVID-19 Discourse Dependency Treebank (COVID19-DTB), a new manually annotated resource for discourse dependency parsing of biomedical abstracts.
State-of-the-art discourse dependency parsers are generally trained on a manually annotated treebank, which is available in a limited number of domains, such as RST-DT (Carlson et al., 2001) for news articles, SciDTB (Yang and Li, 2018) for NLP abstracts, and STAC (Asher et al., 2016) and Molweni (Li et al., 2020) for multi-party dialogues. However, when the parser is applied directly to out-of-domain documents, the parsing accuracy degrades significantly due to the domain shift problem. In fact, we normally face this issue in the real world because human supervision is generally scarce and expensive to obtain in the domain of interest.
Unsupervised Domain Adaptation (UDA) aims to adapt a model trained on a source domain, where a limited amount of labeled data is available, to a target domain, where only unlabeled data is available. Bootstrapping (or pseudo labeling) has been shown to be effective for the UDA problem of syntactic parsing (Steedman et al., 2003b, a; Reichart and Rappoport, 2007; Søgaard and Rishøj, 2010; Weiss et al., 2015). In bootstrapping for syntactic parsing, a model is first trained on the labeled source sentences, then used to assign pseudo labels (i.e., parse trees) to unlabeled target sentences, and finally retrained on both the manually and automatically labeled sentences.
However, despite the significant progress achieved in discourse parsing so far (Li et al., 2014b; Ji and Eisenstein, 2014; Joty et al., 2015; Perret et al., 2016; Wang et al., 2017; Kobayashi et al., 2020; Koto et al., 2021), bootstrapping for the UDA problem of discourse parsing is still not well understood. Jiang et al. (2016) and Kobayashi et al. (2021) explored how to enrich the labeled dataset using bootstrapping methods; however, their studies are limited to the in-domain setup, where the labeled and unlabeled datasets are derived from the same domain. In contrast to these studies, we focus on the more realistic and challenging scenario, namely, out-of-domain discourse parsing, where the quality and diversity of the pseudo-labeled dataset become more crucial for performance enhancement.
In this paper, we perform a series of analyses of various bootstrapping methods in UDA of modern BERT-based discourse dependency parsers and report the effectiveness and limitations of these approaches. Figure 2 shows an overview of our bootstrapping system. Specifically, we investigate self-training (Yarowsky, 1995), co-training (Blum and Mitchell, 1998; Zhou and Goldman, 2004), tri-training (Zhou and Li, 2005), and asymmetric tri-training (Saito et al., 2017) of graph-based and transition-based discourse dependency parsing models, as well as confidence measures and sample selection criteria in two adaptation scenarios: monologue adaptation between scientific disciplines and dialogue genre adaptation. We show that bootstrapping improves out-of-domain discourse dependency parsing significantly and consistently across different adaptation setups.
Our analyses also reveal that bootstrapping has difficulty creating pseudo-labeled data that is both diverse and accurate, which currently limits further improvement, and that this coverage problem cannot be solved by simply increasing the number of unlabeled documents. We show that an active learning approach can be an effective solution to this limitation.1
The rest of this paper is organized as follows: Section 2 provides an overview of related studies. Section 3 clarifies the problem and describes the methodology: bootstrapping algorithms, discourse dependency parsing models, confidence measures, and sample selection criteria. Section 4 describes the details of COVID19-DTB. Section 5 describes the experimental setup. In Section 6 we report and discuss the experimental results and provide practical recommendations for out-of-domain discourse dependency parsing. Finally, Section 7 concludes the paper.
2 Related Work
Various discourse parsing models have been proposed in the past decades. For constituency-style discourse structure like RST (Mann and Thompson, 1988), the parsing models can be categorized into the chart-based approach (Joty et al., 2013; Joty et al., 2015; Li et al., 2014a, 2016a), which finds the globally optimal tree using an efficient algorithm like dynamic programming, or the transition-based (or sequential) approach (Marcu, 1999; Sagae, 2009; Hernault et al., 2010b; Feng and Hirst, 2014; Ji and Eisenstein, 2014; Wang et al., 2017; Kobayashi et al., 2020; Zhang et al., 2020; Koto et al., 2021), which builds a tree incrementally by performing a series of decisions. For dependency-style discourse structure like the RST variants (Hirao et al., 2013; Li et al., 2014b; Morey et al., 2018) or Segmented Discourse Representation Theory (Asher and Lascarides, 2003), the models can also be categorized into the graph-based approach (Li et al., 2014b; Yoshida et al., 2014; Afantenos et al., 2015; Perret et al., 2016) or the transition-based (sequential) approach (Muller et al., 2012; Hu et al., 2019; Shi and Huang, 2019). Recently, pre-trained transformer encoders such as BERT (Devlin et al., 2019) and SpanBERT (Joshi et al., 2019) have been shown to greatly improve discourse parsing accuracy (Guz and Carenini, 2020; Koto et al., 2021). In this paper, we are not aiming at developing novel parsing models. Instead, we aim to investigate the effectiveness and limitations of bootstrapping methods for adapting the modern BERT-based discourse parsers.
Manually annotated discourse treebanks are scarce, and their domains are limited. For example, the most popular discourse treebank, RST-DT (Carlson et al., 2001), contains only 385 labeled documents in total. To address the lack of large-scale labeled data, a number of semi-supervised, weakly supervised, and unsupervised techniques have been proposed in the discourse parsing literature. Hernault et al. (2010a) proposed a semi-supervised method that utilizes unlabeled documents to expand feature vectors in SVM classifiers in order to achieve better generalization for infrequent discourse relations. Liu and Lapata (2018) and Huber and Carenini (2019) proposed to exploit document-level class labels (e.g., sentiment) as distant supervision to induce discourse dependency structures from neural attention weights. Badene et al. (2019a, b) investigated a data programming paradigm (Ratner et al., 2016), which uses rule-based labeling functions to automatically annotate unlabeled documents and trains a generative model on the weakly supervised data. Kobayashi et al. (2019) and Nishida and Nakayama (2020) proposed fully unsupervised discourse constituency parsers, which can produce only tree skeletons and rely strongly on pre-trained word embeddings or human prior knowledge on document structure.
Technically most similar to our work, Jiang et al. (2016) and Kobayashi et al. (2021) proposed to enlarge the training dataset using a combination of multiple parsing models. Jiang et al. (2016) used co-training for enlarging the RST-DT training set with 2,000 Wall Street Journal articles, with a focus on improving classification accuracy on infrequent discourse relations. Kobayashi et al. (2021) proposed to exploit discourse subtrees that are agreed by two different models for enlarging the RST-DT training set. Interestingly, their proposed methods improved the classification accuracy especially for infrequent discourse relations.
These studies mainly assume the in-domain scenario and focus on enlarging the labeled set (e.g., RST-DT training set) using in-domain unlabeled documents, and the system evaluation is generally performed on the same domain as the original labeled set (e.g., RST-DT test set). In this paper, instead, we particularly focus on the UDA scenario, where the goal is to parse the target-domain documents accurately without relying on human supervision in the target domain. We believe this research direction is important for developing usable discourse parsers, because a target domain to which one would like to apply a discourse parser is normally different from the domains/genres of existing corpora, and manually annotated resources are rarely available in most domains/genres.
3 Method
3.1 Problem Formulation
The input is a document represented as a sequence of clause-level (in single-authored text) or utterance-level (in multi-party dialogues) spans called Elementary Discourse Units (EDUs).2 Our goal is to derive a discourse dependency structure, y, given the input EDUs, x = e0, e1, …, en, which is analogous to a syntactic dependency structure. A discourse dependency, (h,d,r), represents that the d-th EDU (called the dependent) relates directly to the h-th EDU (called the head) with the discourse relation r. Each EDU except for the root node, e0, has a single head.
In this paper, we assume that we have a limited number of labeled documents in the source domain, while a large collection of unlabeled documents is available in the target domain. In particular, we assume that the source and target domains have different data distributions lexically or rhetorically (e.g., vocabulary, document length, and discourse relation distributions), but the domains share the same annotation scheme (e.g., definition of discourse relation classes). Our task is to adapt a parsing model (or models) trained in the source domain to the target domain using the unlabeled target data.
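For concreteness, a labeled document and its dependency structure can be represented with a minimal data structure such as the following sketch; the class and field names are illustrative and are not part of the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DiscourseDocument:
    """A document x = e0, e1, ..., en, where e0 is the artificial root node."""
    edus: List[str]                    # EDU texts; index 0 is the root placeholder
    deps: List[Tuple[int, int, str]]   # discourse dependencies (head h, dependent d, relation r)

    def head_of(self, d: int) -> Tuple[int, str]:
        """Every EDU except the root has exactly one head."""
        for h, dep, r in self.deps:
            if dep == d:
                return h, r
        raise KeyError(f"EDU {d} has no head")

# A toy three-EDU abstract whose later EDUs elaborate on the first one.
doc = DiscourseDocument(
    edus=["<root>", "We propose a parser.", "It uses BERT.", "It is fast."],
    deps=[(0, 1, "Root"), (1, 2, "Elaboration"), (1, 3, "Elaboration")],
)
print(doc.head_of(2))   # (1, 'Elaboration')
```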
3.2 Bootstrapping
The aim of this paper is to investigate the effectiveness and limitations of various bootstrapping methods in UDA of modern BERT-based discourse dependency parsers. We show the overall flow of the bootstrapping methods in Figure 2. Initially, we have a small set of labeled documents, ℒs, in the source domain and a large collection of unlabeled documents in the target domain. The bootstrapping procedure then works as follows:
- (1) Train initial models on ℒs = {(xs, ys)}.
- (2) Parse unlabeled target documents using the current model f, i.e., ŷt = f(xt), to obtain a candidate pseudo-labeled set ℒt = {(xt, ŷt)}.3,4
- (3) Measure the confidence scores of the pseudo-labeled data and select a subset of ℒt that is expected to be reliable and useful.
- (4) Retrain the models on ℒs together with the selected subset for several epochs (set to 3 in this work).

Steps (2)-(4) are repeated for many rounds until a predefined stopping criterion is met; a code-level sketch of the loop follows.
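To make the procedure concrete, the following is a minimal sketch of the loop for the single-teacher case (ST with one model, CT with two); the `fit`, `parse`, and `select` interfaces are hypothetical placeholders, not the authors' implementation, and TT/AT additionally combine two teachers per student as described below.

```python
import random

def bootstrap(models, labeled_src, unlabeled_tgt, select, rounds=10,
              pool_size=5000, retrain_epochs=3):
    """Generic bootstrapping loop (Steps (1)-(4)).

    `models` are parser objects exposing hypothetical .fit(examples, epochs=None)
    and .parse(doc) methods; `select(candidates, teacher, student)` implements
    one of the criteria in Section 3.5.
    """
    # Step (1): train every model on the labeled source set L_s.
    for f in models:
        f.fit(labeled_src)

    for _ in range(rounds):
        # Sample a fresh pool of unlabeled target documents for this round.
        pool = random.sample(unlabeled_tgt, min(pool_size, len(unlabeled_tgt)))
        for student in models:
            # ST: the single model teaches itself; CT: the other model teaches it.
            teacher = next((f for f in models if f is not student), student)
            # Step (2): the teacher parses the pool into candidate pseudo labels.
            candidates = [(x, teacher.parse(x)) for x in pool]
            # Step (3): keep only the subset judged reliable and useful.
            pseudo = select(candidates, teacher, student)
            # Step (4): retrain the student on L_s plus the selected pseudo labels.
            student.fit(labeled_src + pseudo, epochs=retrain_epochs)
    return models
```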
Bootstrapping can be interpreted as a methodology where teachers generate pseudo supervision for students, and the students learn the task on it. Existing bootstrapping methods vary depending on how the teacher and student models are used. In this paper, we specifically explore the following bootstrapping methods: self-training (Yarowsky, 1995; McClosky et al., 2006; Reichart and Rappoport, 2007; Suzuki and Isozaki, 2008; Huang and Harper, 2009), co-training (Blum and Mitchell, 1998; Zhou and Goldman, 2004; Steedman et al., 2003b, a), tri-training (Zhou and Li, 2005; Weiss et al., 2015; Ruder and Plank, 2018), and asymmetric tri-training (Saito et al., 2017).
Self-Training
Self-Training (ST) starts with a single model f trained on ℒs. The overall procedure is the same as the one described above: the single model acts as both teacher and student for itself. Thus, it is difficult for the model to obtain novel knowledge (or supervision) that it has not yet learned, and its errors may be amplified by the retraining cycle.
Co-Training
Co-Training (CT) starts with two parsing models, f1 and f2, that are expected to have different inductive biases from each other. The two models are pre-trained on the same ℒs. In Step (2), each model independently parses the unlabeled documents, producing its own candidate pseudo-labeled set. In Step (3), each of the pseudo-labeled sets is filtered by a selection criterion. In Step (4), each model fi is retrained on ℒs together with the selected pseudo-labeled set produced by the other model fj (j ≠ i).
In CT, the two models teach each other. Thus, each model is the teacher and the student for the other model simultaneously. In contrast to ST, each model can obtain knowledge that it has not yet learned. CT can be viewed as enhancing the agreement between the models.
Tri-Training (TT)
Tri-Training (TT) consists of three different models, f1, f2, and f3, which are initially trained on the same ℒs. In contrast to CT, where a single teacher fi is used to generate pseudo labels for the student fj (j ≠ i), TT uses two teachers, fi and fj (j ≠ i), to generate a pseudo-labeled set for the remaining student fk (k ≠ i, j). We measure the confidence of the pair of teachers' parse trees, (ŷti, ŷtj), using the ratio of agreed dependencies (described in Subsection 3.4), based on which we determine whether or not to include the teachers' predictions in the pseudo-labeled set.
Asymmetric Tri-Training (AT)
Asymmetric Tri-training (AT) is an extension of TT for UDA. A special domain-specific model, f1, is used only for test inference; the other two models, f2 and f3, are used only to generate the pseudo labels. The domain-specific model f1 is retrained only on the selected pseudo-labeled target data, while f2 and f3 are retrained on ℒs together with the pseudo-labeled data.
3.3 Parsing Models
We employ three types of BERT-based discourse dependency parsers: (1) A graph-based arc-factored model (McDonald et al., 2005) with a biaffine attention mechanism (Dozat and Manning, 2017), (2) a transition-based shift-reduce model (Nivre, 2004; Chen and Manning, 2014; Kiperwasser and Goldberg, 2016), and (3) the backward variant of the shift-reduce model.
EDU Embedding
We compute EDU embeddings using a pre-trained Transformer encoder. This procedure is shared across the three parsing models, though the Transformer parameters are untied and fine-tuned separately for each. Specifically, we first break the input document into non-overlapping segments of 512 subtokens and encode each segment independently with the Transformer encoder. Lastly, we compute EDU-level span embeddings as the concatenation of the Transformer output states at the span endpoints (wi and wj) and at the span-level syntactic head word,5 wk, i.e., [wi; wj; wk].
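The EDU representation can be sketched as follows with the Hugging Face Transformers library; the 512-subtoken segmentation is simplified to a single truncated segment, the head-word indices are assumed to be precomputed externally (e.g., with spaCy), and the function is illustrative rather than the authors' implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def edu_embeddings(words, edu_spans, head_word_idx):
    """words: document tokens; edu_spans: inclusive (i, j) word spans per EDU;
    head_word_idx: syntactic head word index per EDU (precomputed externally).
    Returns one [w_i; w_j; w_k] vector per EDU (single segment, <=512 subtokens)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt",
                    truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]      # (num_subtokens, 768)
    # Map each word to the position of its first subtoken.
    word_ids = enc.word_ids(0)
    first_sub = {}
    for pos, w in enumerate(word_ids):
        if w is not None and w not in first_sub:
            first_sub[w] = pos
    embs = []
    for (i, j), k in zip(edu_spans, head_word_idx):
        w_i, w_j, w_k = hidden[first_sub[i]], hidden[first_sub[j]], hidden[first_sub[k]]
        embs.append(torch.cat([w_i, w_j, w_k], dim=-1))   # 3 * 768 dims
    return torch.stack(embs)
```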
Arc-Factored Model
Arc-Factored Model (A) is a graph-based dependency parser that finds the globally optimal dependency structure using dynamic programming. Specifically, we employ the biaffine attention model (Dozat and Manning, 2017) to compute dependency scores s(h,d) ∈ ℝ, and we decode the optimal structure y* with the Eisner algorithm so that the tree score Σ(h,d)∈y s(h,d) is maximized. We then predict the discourse relation class for each unlabeled dependency (h,d) ∈ y* using another biaffine attention layer and an MLP, namely, r* = argmaxr P(r | h, d). To reduce the computational time for inference, we employ the Hierarchical Eisner algorithm (Zhang et al., 2021), which decodes dependency trees from the sentence level to the paragraph level and then to the whole-text level.
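A minimal PyTorch sketch of the biaffine arc scorer s(h,d) is given below; the Eisner decoding step and the relation classifier are omitted, and the dimensions only loosely follow the settings reported later.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Scores s(h, d) for every head-dependent EDU pair (Dozat & Manning, 2017)."""

    def __init__(self, edu_dim, mlp_dim=100):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(edu_dim, mlp_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(edu_dim, mlp_dim), nn.ReLU())
        # Biaffine weight; the extra +1 row models a head-only bias term.
        self.W = nn.Parameter(torch.zeros(mlp_dim + 1, mlp_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, edus):                      # edus: (n+1, edu_dim), incl. root e0
        h = self.head_mlp(edus)                   # (n+1, mlp_dim)
        d = self.dep_mlp(edus)                    # (n+1, mlp_dim)
        h = torch.cat([h, torch.ones(h.size(0), 1)], dim=-1)
        return h @ self.W @ d.t()                 # scores[h, d] = s(h, d)

scorer = BiaffineArcScorer(edu_dim=3 * 768)
scores = scorer(torch.randn(6, 3 * 768))          # 5 EDUs plus the root
print(scores.shape)                               # torch.Size([6, 6])
```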
Shift-Reduce Model
Shift-Reduce Model (S) is a transition-based dependency parser that builds a dependency structure incrementally by executing a series of local actions. Specifically, we employ the arc-standard system proposed by Nivre (2004), which uses a buffer to store the input EDUs to be analyzed and a stack to store the in-progress subtrees, and defines the following action classes: SHIFT, RIGHT-ARC-l, and LEFT-ARC-l. We decode the dependency structure y* with a greedy search algorithm, i.e., at each decision step we take the action a* that is valid and most probable: a* = argmaxa P(a | σ), where σ denotes the current parsing configuration.
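The greedy arc-standard decoding loop can be sketched as follows; `classify_action` stands in for the model's action classifier and is a hypothetical interface, not the one used in the paper.

```python
def greedy_arc_standard(n_edus, classify_action):
    """Greedy arc-standard decoding over EDUs 1..n (0 is the artificial root).

    `classify_action(stack, buffer)` is a hypothetical scorer returning all
    (action, relation) candidates sorted by decreasing probability; the first
    valid candidate is executed at each step.
    """
    buffer = list(range(0, n_edus + 1))   # includes the artificial root EDU 0
    stack, arcs = [], []
    while buffer or len(stack) > 1:
        for action, rel in classify_action(stack, buffer):
            if action == "SHIFT" and buffer:
                stack.append(buffer.pop(0))
                break
            if action == "LEFT-ARC" and len(stack) >= 2 and stack[-2] != 0:
                # second-top EDU becomes a dependent of the top EDU
                arcs.append((stack[-1], stack.pop(-2), rel))
                break
            if action == "RIGHT-ARC" and len(stack) >= 2:
                # top EDU becomes a dependent of the second-top EDU
                arcs.append((stack[-2], stack.pop(-1), rel))
                break
        else:
            raise ValueError("classifier returned no valid action")
    return arcs   # list of (head, dependent, relation)

# Dummy classifier that always prefers SHIFT, then RIGHT-ARC with Elaboration.
dummy = lambda s, b: [("SHIFT", None), ("RIGHT-ARC", "Elaboration"), ("LEFT-ARC", "Elaboration")]
print(greedy_arc_standard(3, dummy))
# [(2, 3, 'Elaboration'), (1, 2, 'Elaboration'), (0, 1, 'Elaboration')] -- a chain tree
```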
Backward Shift-Reduce Model
We expect that different inductive biases can be introduced by processing the document from the back. As the third model option, we develop a backward variant of the Shift-Reduce Model (B), which processes the input sequence in the reverse order.
3.4 Confidence Measures
The key challenge in bootstrapping on out-of-domain data is how to assess the reliability (or usefulness) of the pseudo labels and how to select an error-free and high-coverage subset. We define confidence measures to assess the reliability of the pseudo-labeled data. In Section 3.5, we define selection criteria to filter out unreliable pseudo-labeled data based on their confidence scores.
Model-based Confidence
For the bootstrapping methods that use a single teacher to generate a pseudo-labeled set (i.e., ST, CT), we define the confidence of the teacher model based on the predictive probabilities of the decisions used to build a parse tree. A discourse dependency structure consists of a set (or series) of decisions; therefore, we use the average of the predictive probabilities over those decisions.6 How the model-based confidence measure C(x,y) is calculated depends on the parsing model (a code sketch follows the list):

- Arc-Factored Model: C(x,y) = 1/(2|y|) Σ(h,d,r)∈y [P(h | d) + P(r | h, d)], i.e., the average over the attachment and relation decisions of all dependencies (h,d,r) ∈ y.

- Shift-Reduce Model, Backward Model: C(x,y) = 1/|A(x,y)| Σ(a,σ)∈A(x,y) P(a | σ), where A(x,y) denotes the action and configuration sequence used to produce the parse tree y for x.
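A minimal sketch of the two confidence computations, assuming the per-decision probabilities have already been collected during decoding (the data layout is illustrative):

```python
from statistics import mean

def arc_factored_confidence(dep_probs):
    """dep_probs: one (P(h|d), P(r|h,d)) pair per predicted dependency."""
    return mean((p_head + p_rel) / 2.0 for p_head, p_rel in dep_probs)

def shift_reduce_confidence(action_probs):
    """action_probs: P(a|sigma) of the action taken at each decision step."""
    return mean(action_probs)

# Toy example: average decision probabilities of two short parses.
print(arc_factored_confidence([(0.9, 0.8), (0.95, 0.85)]))   # 0.875
print(shift_reduce_confidence([0.9, 0.7, 0.8]))              # ~0.8
```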
Agreement-based Confidence

For the bootstrapping methods that use a pair of teachers to generate a pseudo-labeled set (i.e., TT, AT), we define the confidence of a candidate based on the agreement between the two teachers' parse trees, measured as the ratio of dependencies on which the two trees agree.
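A sketch of the agreement ratio between two predicted trees, with dependencies represented as (head, dependent, relation) triples; whether relation labels are included in the match is an assumption here, not something specified above.

```python
def agreement_confidence(tree1, tree2, labeled=True):
    """Ratio of dependencies shared by two teachers' parses of the same document.

    tree1, tree2: lists of (head, dependent, relation) triples of equal length,
    since every EDU receives exactly one head.
    """
    if not labeled:
        tree1 = [(h, d) for h, d, _ in tree1]
        tree2 = [(h, d) for h, d, _ in tree2]
    agreed = set(tree1) & set(tree2)
    return len(agreed) / len(tree1)

t1 = [(0, 1, "Root"), (1, 2, "Elaboration"), (1, 3, "Cause-Result")]
t2 = [(0, 1, "Root"), (1, 2, "Elaboration"), (2, 3, "Cause-Result")]
print(agreement_confidence(t1, t2))   # 0.666... (two of three dependencies agree)
```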
3.5 Sample Selection Criteria
Inspired by Steedman et al. (2003a), we define two kinds of sample selection criteria, which focus on the reliability (i.e., accuracy) and the usefulness (i.e., training utility) of the data, respectively.
Rank-above-k
This is a reliability-oriented selection criterion. We keep only the top N × k samples with the highest confidence scores, where N is the number of candidate pseudo-labeled data and k ∈ [0.0, 1.0]. Specifically, we first rank the candidate pseudo-labeled data by their teacher-side confidence scores and then select the subset that satisfies R(x) ≤ N × k, where R(x) ∈ [1, N] denotes the rank of x.
Rank-diff-k
This is a utility-oriented selection criterion. In contrast to Rank-above-k, which relies only on the teacher-side confidence, this criterion utilizes both the teacher-side and the student-side confidence scores. It retains the pseudo-labeled data whose relative ranking on the teacher side is higher than that on the student side by a margin of at least k. Specifically, after ranking the candidates independently on each side, we compute the gap between the two relative rankings and select the subset that satisfies Rteacher(x) + k ≤ Rstudent(x).
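Both criteria can be sketched as follows; each candidate is assumed to carry a teacher-side and (for Rank-diff-k) a student-side confidence score, and the dictionary layout is illustrative.

```python
def rank_above_k(candidates, k):
    """Keep the top N*k candidates by teacher-side confidence (k in [0.0, 1.0])."""
    ranked = sorted(candidates, key=lambda c: c["teacher_conf"], reverse=True)
    return ranked[: int(len(ranked) * k)]

def rank_diff_k(candidates, k):
    """Keep candidates ranked at least k places higher by the teacher than by the student."""
    by_teacher = sorted(candidates, key=lambda c: c["teacher_conf"], reverse=True)
    by_student = sorted(candidates, key=lambda c: c["student_conf"], reverse=True)
    r_teacher = {id(c): rank for rank, c in enumerate(by_teacher, start=1)}
    r_student = {id(c): rank for rank, c in enumerate(by_student, start=1)}
    return [c for c in candidates if r_teacher[id(c)] + k <= r_student[id(c)]]

pool = [{"doc": i, "teacher_conf": tc, "student_conf": sc}
        for i, (tc, sc) in enumerate([(0.9, 0.4), (0.8, 0.9), (0.5, 0.2), (0.3, 0.7)])]
print([c["doc"] for c in rank_above_k(pool, 0.5)])   # [0, 1]: two most teacher-confident docs
print([c["doc"] for c in rank_diff_k(pool, 1)])      # [0, 2]: teacher rank better by >= 1 place
```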
4 COVID19-DTB
We release a new discourse dependency treebank for scholarly paper abstracts on COVID-19 and related coronaviruses like SARS and MERS in order to test unsupervised domain adaptation of discourse dependency parsing. We name our new treebank COVID-19 Discourse Dependency Treebank (COVID19-DTB).
4.1 Construction
We followed the RST-DT annotation guideline (Carlson and Marcu, 2001) for EDU segmentation. Based on SciDTB and the Penn Discourse Treebank (PDTB) (Prasad et al., 2008), we defined the 14 discourse relation classes shown in Table 1. We carefully analyzed the annotation data of SciDTB and found that some classes are hard to discriminate even for humans, which can lead to undesirable inconsistencies in the new dataset. Thus, we merged some classes; for example, Cause-Effect and Explain were merged into Cause-Result. Some classes were also renamed from SciDTB to fit the biomedical domain; for example, Evaluation was renamed to Findings.
Table 1: Discourse relation classes in COVID19-DTB and the corresponding SciDTB classes.

COVID19-DTB | SciDTB
---|---
Root | Root
Elaboration | Elaboration, Progression, Summary
Comparison | Contrast, Comparison
Cause-Result | Cause-Effect, Explain
Condition | Condition
Temporal | Temporal
Joint | Joint
Enablement | Enablement
Manner-Means | Manner-Means
Attribution | Attribution
Background | Background
Findings | Evaluation
Textual-Organization | –
Same-Unit | Same-Unit
First, we randomly sampled 300 abstracts from the September 2020 snapshot of the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020), which contains over 500,000 scholarly articles on COVID-19 and related coronaviruses such as SARS and MERS. The 300 abstracts were then segmented into EDUs manually by the authors. Next, we employed two professional annotators to provide gold discourse dependency structures for the 300 abstracts. The annotators were trained with a few examples and an annotation guideline, and then they annotated the 300 abstracts independently.7 We divided the results into development and test splits, each consisting of 150 examples.
4.2 Corpus Statistics
Table 2 and Figure 3 show the statistics and the discourse relation distribution of COVID19-DTB. We also show the statistics and the distribution of SciDTB for comparison. We mapped discourse relations in SciDTB to the corresponding classes in COVID19-DTB. We removed the Root relations in computing the proportions.
Table 2: Statistics of COVID19-DTB and SciDTB.

 | COVID19-DTB | SciDTB
---|---|---
Total number of documents | 300 | 1045 (unique: 798)
Total number of EDUs | 6005 | 15723
Avg number of EDUs / doc | 20.0 | 15.0
Avg dependency distance | 2.7 | 2.5
Max. dependency distance | 38 | 26
Avg Root position | 6.6 | 3.9
The average number of EDUs per document in each corpus was 20.0 and 15.0, respectively. Although the average dependency distances in the two corpora are almost the same (2.7 vs. 2.5), the maximum dependency distance of COVID19-DTB is significantly longer than that of SciDTB. Furthermore, the average position of Root’s direct dependent is located further back in COVID19-DTB (6.6 vs. 3.9). Although the overall discourse relation distributions look similar, the proportions of Elaboration and Same-Unit are larger in COVID19-DTB. These differences reflect the fact that biomedical abstracts tend to be longer, have more complex sentences with embedded clauses, and contain more detailed information, suggesting the difficulty of discourse parser adaptation across the two domains.
5 Experimental Setup
Datasets
We evaluated the bootstrapping methods in two UDA scenarios. The first setup was a monologue adaptation between scientific disciplines, NLP and biomedicine (especially COVID-19), which is a practically important scenario because there is still no text-level discourse treebank for biomedical documents. We used the training split of SciDTB (Yang and Li, 2018) as the labeled source dataset, which contains 742 manually annotated discourse dependency structures for abstracts from the ACL Anthology. We used the September 2020 snapshot of CORD-19 (Wang et al., 2020) as the unlabeled target dataset, which contains about 76,000 biomedical abstracts. We used the development and test splits of COVID19-DTB for validation and testing, respectively. The discourse relation labels in the SciDTB training set were mapped to the corresponding classes of COVID19-DTB. We mapped Textual-Organization relations in COVID19-DTB to Elaboration, because there is no corresponding class in SciDTB. We also mapped Temporal relations in the two datasets to Condition to reduce the significant class imbalance.
The second setup was an adaptation across dialogue genres, that is, from dialogues in a multi-party game to dialogues on the Ubuntu forum. We used the training split of STAC (Asher et al., 2016) as the labeled source dataset, which contains 887 manually labeled discourse dependency structures on multi-party dialogues in the game The Settlers of Catan. We used the Ubuntu Dialogue Corpus (Lowe et al., 2015) as the unlabeled target dataset, which contains dialogues extracted from the Ubuntu chat logs. We retained dialogues with 7-16 utterances and 2-9 speakers and removed dialogues containing long utterances (more than 20 words), which yielded approximately 70,000 dialogues. We used the development and test splits of Molweni (Li et al., 2020) for validation and testing; each split contains 500 manually labeled discourse dependency structures on multi-party dialogues derived from the Ubuntu Dialogue Corpus.
The unlabeled target documents in both setups were segmented into EDUs using a publicly available EDU segmentation tool (Wang et al., 2018).
Evaluation
We employed the traditional evaluation metrics from the dependency parsing literature, namely, the Labeled Attachment Score (LAS) and the Unlabeled Attachment Score (UAS). We also used Root Accuracy (RA), which indicates how well a system identifies the most representative EDU in the document (i.e., the dependent of the special root node).
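The three metrics can be computed from (head, relation) predictions per dependent as in the following sketch; micro-averaging over all dependents is assumed here and may differ in detail from the evaluation script actually used.

```python
def attachment_scores(gold_docs, pred_docs):
    """gold_docs/pred_docs: per document, a dict {dependent: (head, relation)}."""
    total = correct_u = correct_l = root_hits = 0
    for gold, pred in zip(gold_docs, pred_docs):
        for dep, (g_head, g_rel) in gold.items():
            p_head, p_rel = pred[dep]
            total += 1
            correct_u += (p_head == g_head)
            correct_l += (p_head == g_head and p_rel == g_rel)
        # Root Accuracy: does the predicted dependent of the root match the gold one?
        g_root = next(d for d, (h, _) in gold.items() if h == 0)
        p_root = next(d for d, (h, _) in pred.items() if h == 0)
        root_hits += (g_root == p_root)
    return {"UAS": correct_u / total,
            "LAS": correct_l / total,
            "RA": root_hits / len(gold_docs)}

gold = [{1: (0, "Root"), 2: (1, "Elaboration"), 3: (1, "Cause-Result")}]
pred = [{1: (0, "Root"), 2: (1, "Elaboration"), 3: (2, "Cause-Result")}]
print(attachment_scores(gold, pred))   # UAS = LAS = 2/3, RA = 1.0
```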
Implementation Details
As the pre-trained Transformer encoders, we used SciBERT (Beltagy et al., 2019) and SpanBERT (Joshi et al., 2019) in the first and second adaptation setups, respectively. The dimensionality of the MLPs in the arc-factored model and the shift-reduce models is 100 and 128, respectively. Following Joshi et al. (2019), we used the AdamW and Adam optimizers for the Transformer parameters (θbert) and the task-specific parameters (θtask), respectively. We first trained the base models on the labeled source dataset with the following hyper-parameters: batch size = 1, learning rate (LR) for θbert = 2e−5, LR for θtask = 1e−4, warmup steps = 2.4K. We then ran the bootstrapping methods with: batch size = 1, LR for θbert = 2e−6, LR for θtask = 1e−5, warmup steps = 7K. We trained all approaches for a maximum of 40 epochs and applied early stopping when the validation LAS did not increase for 10 epochs.
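A sketch of the two-optimizer setup in PyTorch is shown below; the parameter-group split by an `encoder.` name prefix and the plain linear-warmup schedule are assumptions, not the authors' exact training code.

```python
from torch.optim import Adam, AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizers(model, lr_bert=2e-5, lr_task=1e-4, warmup_steps=2400):
    """Split parameters into encoder (theta_bert) and task-specific (theta_task) groups."""
    bert_params = [p for n, p in model.named_parameters() if n.startswith("encoder.")]
    task_params = [p for n, p in model.named_parameters() if not n.startswith("encoder.")]
    opt_bert = AdamW(bert_params, lr=lr_bert)     # AdamW for the Transformer encoder
    opt_task = Adam(task_params, lr=lr_task)      # Adam for the task-specific layers
    warmup = lambda step: min(1.0, (step + 1) / warmup_steps)
    schedulers = [LambdaLR(opt_bert, warmup), LambdaLR(opt_task, warmup)]
    return (opt_bert, opt_task), schedulers

# After each training step on a batch (batch size = 1 in the paper):
#   for opt in optimizers: opt.step(); opt.zero_grad()
#   for sched in schedulers: sched.step()
```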
6 Results and Discussion
6.1 Effectiveness
We verified the effectiveness of the bootstrapping methods in the two UDA scenarios. As the baseline, we evaluated the source-only models, which were trained only on the labeled source dataset. Table 3 shows the results. The bootstrapping methods consistently improved performance regardless of the adaptation scenario. The best systems were CT (A←S) with Rank-above-0.6 and CT (S←A) with Rank-diff-100, which outperformed the source-only systems (e.g., Source-only (S)) by more than 4.4 LAS points in the monologue setup (NLP → COVID-19) and by more than 6.3 LAS points in the dialogue setup (Game → Ubuntu Forum), respectively. CT, TT, and AT tended to achieve higher accuracy than ST, particularly in the dialogue adaptation setup. These results indicate that bootstrapping is significantly and consistently effective for UDA of discourse dependency parsing in various adaptation scenarios, and that employing multiple models is particularly effective in reducing the tendency of ST to amplify its own errors.
Table 3: Results on the two adaptation setups (Abstracts: NLP → COVID-19; Dialogues: Game → Ubuntu Forum).

Method | Selection | Abstracts LAS | Abstracts UAS | Abstracts RA | Dialogues LAS | Dialogues UAS
---|---|---|---|---|---|---
Source-only (A) | − | 61.3 | 74.8 | 82.0 | 29.9 | 55.1
Source-only (S) | − | 61.8 | 74.5 | 78.0 | 33.2 | 66.1
Source-only (B) | − | 60.0 | 72.9 | 78.0 | 29.2 | 55.6
ST (A←A) | above-0.6 | 65.8 | 78.7 | 88.7 | 34.7 | 60.6
ST (S←S) | above-0.6 | 65.3 | 76.9 | 84.7 | 37.9 | 67.4
CT (A←S) | above-0.6 | 66.2 | 78.1 | 86.0 | 38.0 | 64.8
CT (S←A) | above-0.6 | 66.1 | 78.2 | 86.0 | 39.1 | 64.4
CT (A←S) | diff-100 | 66.0 | 78.3 | 88.0 | 38.5 | 66.5
CT (S←A) | diff-100 | 66.2 | 78.8 | 84.7 | 39.5 | 66.0
CT (S←B) | above-0.6 | 65.3 | 76.8 | 84.0 | 38.1 | 67.2
CT (B←S) | above-0.6 | 65.6 | 76.9 | 87.3 | 38.5 | 67.4
CT (S←B) | diff-100 | 65.5 | 76.8 | 86.0 | 39.1 | 67.5
CT (B←S) | diff-100 | 65.5 | 76.6 | 86.7 | 39.2 | 67.7
TT (A←S, B) | above-0.6 | 65.9 | 78.5 | 87.3 | 38.5 | 66.6
TT (S←A, B) | above-0.6 | 65.9 | 78.4 | 86.0 | 39.1 | 66.7
TT (A←S, B) | diff-100 | 65.4 | 77.4 | 86.7 | 38.6 | 66.8
TT (S←A, B) | diff-100 | 65.1 | 77.7 | 87.3 | 38.9 | 66.5
AT (A←S, B) | above-0.6 | 64.9 | 77.3 | 85.3 | 36.9 | 66.7
AT (S←A, B) | above-0.6 | 65.3 | 77.4 | 88.7 | 38.6 | 63.2
AT (A←S, B) | diff-100 | 65.3 | 77.6 | 84.7 | 36.9 | 65.7
AT (S←A, B) | diff-100 | 64.6 | 77.6 | 85.3 | 38.2 | 61.9
Next, we analyzed for what kind of documents the bootstrapping systems are particularly effective. We divided the COVID19-DTB test set into bins by the number of EDUs per document (n ≤ 15, 15 < n ≤ 30, n > 30), and for each bin we examined the percentage of examples improved by the bootstrapping systems over the source-only system. Figure 4 shows the results. When the document length was 10 or shorter, there was no improvement for most examples; however, when the length was longer than 30, the percentage jumped to around 80% with CT, TT, and AT. These results indicate that the longer (and presumably more complex) the documents in the target domain, the greater the benefit of bootstrapping.
We also investigated the importance of employing different types of parsing models in CT. The theoretical importance of employing models with different views (or inductive biases) in CT has been discussed previously (Blum and Mitchell, 1998; Abney, 2002; Zhou and Goldman, 2004). We trained base models with the same neural architecture but different initial parameters on the labeled source dataset and then retrained them using CT. The results in Table 4 show that the LAS of CT with different model types is consistently higher than that of CT with identical model types, suggesting empirically that employing different model types is effective in bootstrapping.
Table 4: CT with different vs. identical model types on the abstracts setup.

Method | Selection | LAS | UAS | RA
---|---|---|---|---
CT (A←S) | above-0.6 | 66.2 | 78.1 | 86.0
CT (S←A) | above-0.6 | 66.1 | 78.2 | 86.0
CT (A←A) | above-0.6 | 65.5 | 77.8 | 76.0
CT (S←S) | above-0.6 | 65.5 | 77.9 | 86.0
6.2 Analysis of Confidence Measures
One of the key challenges in bootstrapping for UDA is to assess the true reliability of pseudo-labeled data. Here, we analyzed the confidence measures.
Confidence Scores Correlate with Quality
Regardless of the selection criterion, data with high confidence scores tend to be selected. Figure 5(a) shows the relationship between the confidence scores and the parsing quality (LAS) in the target domain. Specifically, we calculated the confidence score of each example in the COVID19-DTB test set, sorted the examples in descending order of confidence, and evaluated the LAS of each top-k% subset. We confirmed that the confidence scores roughly correlate with parsing quality: candidates with higher confidence tended to be parsed more accurately than those with lower confidence. For example, when we restricted the test data to the top 10% with the highest confidence scores, the LAS of CT (A←S) with Rank-above-0.6 was 76.8%, much higher than its LAS on the full test set (66.2%).
Confident (Accurate) Pseudo Labels are Biased
Next, we examined what kind of documents receive higher confidence scores. Figure 5(b) shows the relationship between the confidence scores and the document length (i.e., the number of EDUs). We found a strong correlation between them: high-confidence documents are strongly biased toward shorter documents. This bias did not depend on the confidence measure (model-based vs. agreement-based), the sample selection criterion, or even the presence of bootstrapping. Based on these results, we conjecture that longer documents tend to receive lower-quality (lower-confidence) pseudo labels and are therefore less likely to be included in the selected pseudo-labeled set. This conjecture further implies that the current bottleneck of the bootstrapping systems is the low coverage of the selected pseudo-labeled set.
Low Coverage of Accurate Pseudo Labels
Given the above conjecture, it is natural to expect that the selected pseudo-labeled set contains too little accurate supervision for longer documents and that the parsing accuracy of the bootstrapping systems therefore drops especially for longer documents. Figure 6 shows the relationship between the parsing quality and the document length. The bootstrapping methods improved the overall performance over the source-only systems; however, regardless of the bootstrapping type, the performance dropped significantly for longer documents. These results confirm the shortage of accurate supervision for longer documents in the selected pseudo-labeled set.
The Problem is Not Sampling, but Prediction
To alleviate this low-coverage issue, we modified the confidence measures defined in Subsection 3.4 so that longer documents are selected more aggressively for the pseudo-labeled set: we simply omitted the averaging over decisions. However, as shown in Figure 6 (see the results with “w/o avg.”), the performance degradation for longer documents did not change. This indicates that the current bottleneck is not the low coverage of the selected pseudo-labeled subset, but rather the low coverage of accurate supervision in the candidate pseudo-labeled data pool, ℒt.
6.3 Analysis of Selection Criteria
Another important challenge in bootstrapping for UDA is to select an error-free and high-coverage subset from the candidate pseudo-labeled data pool. Here, we analyzed the selection criteria.
There Is a Reliability-Coverage Trade-off
Varying the parameter k of Rank-above-k and Rank-diff-k, we examined the final parsing quality and the average number of selected pseudo-labeled data (out of 5K candidates). We trained CT (A←S) and evaluated it on the COVID19-DTB test set. Figure 7 (lines with circle markers) shows the results. Rank-above-k achieved slightly higher performance than Rank-diff-k, whereas Rank-diff-k achieved its best performance with less pseudo-labeled data. More interestingly, we confirmed that, for both criteria, there is a trade-off between reliability (precision) and coverage (recall) in the selected pseudo-labeled set: when k was too strict (i.e., too small for Rank-above-k, or too large a margin for Rank-diff-k), the number of selected pseudo-labeled data was too small, resulting in lower LAS. When k was relaxed to some extent, the number of selected pseudo-labeled data increased and the LAS reached its peak. However, when k was relaxed further, the accuracy decreased from the peak due to contamination by noisy supervision.
Quantity Is Not the Only Issue
Next, we evaluated the sample selection criteria while controlling for the number of selected pseudo-labeled data. We selected the same number of pseudo-labeled data (set to 3K) across different values of k by adjusting the number of unlabeled samples (set to 5K in the previous experiments) appropriately. For example, to select 3K pseudo-labeled data with Rank-above-0.2, we sampled 15K unlabeled data at each bootstrapping round. Figure 7 (lines with rectangle markers) shows the results. In the region where k was strict, the LAS curves improved compared to the unadjusted setting because the number of selected pseudo-labeled data increased. Meanwhile, in the region where k was relaxed, the LAS curves decreased or stayed flat because the selection size decreased. More interestingly, even in this adjusted setting, the strictest k was not the best parameter. These results indicate that, although the number of selected pseudo-labeled data is important, it is not the only factor that determines the optimal k, and that it is still difficult to identify truly useful pseudo-labeled data with these sample selection criteria alone.
6.4 Increasing the Unlabeled Dataset Size
So far, we have demonstrated that the current major limitation of bootstrapping is the difficulty of generating a pseudo-labeled data pool ℒt that is both diverse and accurate. The most straightforward way to mitigate this low-coverage problem is to increase the number of unlabeled target documents. Figure 8 shows that increasing the amount of unlabeled data used in bootstrapping improved the parsing quality; however, the improvement saturated after 5K documents. This demonstrates that the low-coverage problem cannot be mitigated by simply adding more unlabeled documents. We suspect this is because increasing the diversity of unlabeled documents does not necessarily increase the diversity of accurately pseudo-labeled data.
6.5 Active Learning
A more direct and promising solution than increasing the unlabeled corpus size is to manually annotate a small number of documents that the bootstrapping system cannot analyze accurately. We tested the potential effectiveness of active learning (AL) (Settles, 2009). To emulate the AL process, we used the Molweni training set (9K dialogues) as the unlabeled target documents and leveraged its gold annotation. We first measured the confidence (or uncertainty) score of each unlabeled document using the source-only or co-training systems that had already been trained in the dialogue adaptation setup. We then sampled the 100 documents with the lowest confidence scores, because such data are unlikely to be selected in bootstrapping or parsed accurately. Finally, we fine-tuned each model on the 100 actively labeled examples. As a baseline, we also used random confidence (uncertainty) scores, whose results are averaged over five trials. Table 5 shows that, even though only 100 dialogues were annotated manually, AL improved the performance significantly, which was difficult to achieve by bootstrapping alone. Annotating highly uncertain data is more effective than annotating randomly sampled dialogues. We can also see that the combination of bootstrapping and AL achieves higher performance than the source-only model with AL, suggesting that bootstrapping and AL are complementary and that bootstrapping is useful for identifying potentially useful data in AL. The improvement could be further increased by alternating bootstrapping and AL, which is worth investigating in the future.
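The AL selection step, picking the 100 least confident unlabeled dialogues for manual annotation, can be sketched as follows; `confidence` wraps whichever measure from Section 3.4 is used, and the function names are illustrative.

```python
def select_for_annotation(unlabeled_docs, confidence, budget=100):
    """Return the `budget` documents with the lowest confidence scores.

    `confidence(doc)` is a callable implementing one of the measures in
    Section 3.4 (e.g., the model's average decision probability on its own parse).
    """
    scored = sorted(unlabeled_docs, key=confidence)   # ascending: least confident first
    return scored[:budget]

# Hypothetical usage: annotate the selected dialogues manually, then fine-tune
# each model on the 100 actively labeled examples.
# to_annotate = select_for_annotation(unlabeled_dialogues, model_confidence)
# model.fit(manually_annotate(to_annotate))
```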
Method . | Confidence . | LAS . |
---|---|---|
Source-only (S) | − | 33.2 |
CT (S A) w/ above-0.6 | − | 39.1 |
Source-only (S) + AL | random | 45.2 |
Source-only (S) + AL | model-based | 45.8 |
CT (S A) w/ above-0.6 + AL | random | 45.9 |
CT (S A) w/ above-0.6 + AL | model-based | 46.4 |
CT (S A) w/ above-0.6 + AL | agreement-based | 46.3 |
Method . | Confidence . | LAS . |
---|---|---|
Source-only (S) | − | 33.2 |
CT (S A) w/ above-0.6 | − | 39.1 |
Source-only (S) + AL | random | 45.2 |
Source-only (S) + AL | model-based | 45.8 |
CT (S A) w/ above-0.6 + AL | random | 45.9 |
CT (S A) w/ above-0.6 + AL | model-based | 46.4 |
CT (S A) w/ above-0.6 + AL | agreement-based | 46.3 |
6.6 Summary and Recommendations
Here, we summarize what we have learned from the experiments and push the findings a step further in order to provide practical guidelines for out-of-domain discourse dependency parsing.
Bootstrapping improves out-of-domain discourse dependency parsing significantly and consistently in various adaptation scenarios. In particular, we recommend co-training with the arc-factored and shift-reduce models because co-training tends to be more effective and more efficient in training than tri-training variants.
A labeled source dataset that is as close as possible to the target domain is preferable to suppress the domain-shift problem. The labeled source dataset should also follow the same annotation framework as the target domain (e.g., definitions of EDUs and discourse relation classes).
It is reasonable to use the models’ predictive probability as the confidence measure to filter out noisy pseudo labels, because the confidence scores correlate with the accuracy of the pseudo labels. In particular, we recommend the Rank-above-k criterion because, unlike Rank-diff-k, k is independent of the number of unlabeled data. However, since the accurately predicted pseudo labels are biased towards simpler documents, the parsing accuracy on more complex documents is difficult to improve even with bootstrapping.
The low-coverage problem of pseudo labels is not alleviated by increasing the number of unlabeled target documents. We recommend manually annotating a small amount of target documents using active learning and combining it with bootstrapping.
7 Conclusion
In this paper, we investigated the effectiveness and limitations of bootstrapping methods for unsupervised domain adaptation of BERT-based discourse dependency parsers. The results demonstrate that bootstrapping is significantly and consistently effective in various adaptation scenarios. However, regardless of the tuned confidence measures and sample selection criteria, the bootstrapping methods have difficulty generating pseudo labels that are both diverse and accurate, which is the current limiting factor for further improvement. This low-coverage problem cannot be mitigated by simply increasing the unlabeled corpus size. We confirmed that active learning can be an effective solution to this problem.
This study has a limitation: our experiments use only English documents. Although bootstrapping and discourse parsing models are language-independent at the algorithmic level, domain adaptation experiments require labeled datasets in both the source and target domains for training and evaluation. To investigate the universality and language dependence of the bootstrapping methods, it is necessary to develop discourse treebanks in a variety of languages.
In the future, we will expand the COVID19-DTB dataset with additional biomedical abstracts to facilitate the exploration and application of discourse parsing technologies to biomedical knowledge acquisition.
Acknowledgments
We would like to thank the action editor and three anonymous reviewers for their thoughtful and insightful comments, which we found very helpful in improving the paper. This work was supported by JSPS KAKENHI 21K17815. This work was also partly supported by JST, AIP Trilateral AI Research, grant number JPMJCR20G9, Japan.
Notes
Our code is available at https://github.com/norikinishida/discourse-parsing. The COVID19-DTB dataset is also available at https://github.com/norikinishida/biomedical-discourse-treebanks.
We refer to both single-authored texts and multi-party dialogues as documents.
For bootstrapping methods that employ multiple models (e.g., co-training), ℒt is created for each model f.
In our experiments, for every bootstrapping round we used 5K sampled documents instead of the whole unlabeled document set, because parsing all unlabeled documents at every bootstrapping round is computationally expensive and does not scale to large datasets. The 5K samples were refreshed at every bootstrapping round.
A span-level syntactic head word is a token whose parent in the syntactic dependency graph is ROOT or is not within the EDU's span. When there are multiple head words in an EDU, we choose the leftmost one. We used the spaCy en_core_web_sm model to obtain the syntactic dependency graph.
We also tested a model-based confidence measure based on the entropy of the predictive probabilities, in which we replaced the predictive probability of a decision (e.g., P(h*|d)) with the corresponding negative entropy, e.g., −H(P(·|d)) = Σh P(h|d) log P(h|d). Entropy has been used especially in the active learning literature to calculate data uncertainty (Li et al., 2016b; Kasai et al., 2019). However, the predictive probabilities consistently outperformed the entropy counterparts in our experiments. Thus, we adopted the predictive probabilities for the model-based confidence measure.
The inter-annotator agreement is thus not calculated in the current version of the dataset. Instead, we had several discussions with each annotator to maintain the annotation consistency at a satisfactory level.