Abstract
In this paper, we introduce an unsupervised discourse constituency parsing algorithm. We use Viterbi EM with a margin-based criterion to train a span-based discourse parser in an unsupervised manner. We also propose initialization methods for Viterbi training of discourse constituents based on our prior knowledge of text structures. Experimental results demonstrate that our unsupervised parser achieves comparable or even superior performance to fully supervised parsers. We also investigate discourse constituents that are learned by our method.
1 Introduction
Natural language text is generally coherent (Halliday and Hasan, 1976) and can be analyzed as discourse structures, which formally describe how a text is coherently organized. In a discourse structure, linguistic units (e.g., clauses, sentences, or larger textual spans) are connected semantically and pragmatically, and no unit is fully independent or isolated. Discourse parsing aims to uncover discourse structures automatically for a given text and has been proven useful in various NLP applications, such as document summarization (Marcu, 2000; Louis et al., 2010; Yoshida et al., 2014), sentiment analysis (Polanyi and Van den Berg, 2011; Bhatia et al., 2015), and automated essay scoring (Miltsakaki and Kukich, 2004).
Despite the promising progress achieved in recent decades (Carlson et al., 2001; Hernault et al., 2010; Ji and Eisenstein, 2014; Feng and Hirst, 2014; Li et al., 2014; Joty et al., 2015; Morey et al., 2017), discourse parsing remains a significant challenge. The difficulty stems in part from the shortage and low reliability of hand-annotated discourse structures. To develop parsers that generalize better, existing algorithms require larger amounts of training data. However, manually annotating discourse structures is expensive, time-consuming, and sometimes highly ambiguous (Marcu et al., 1999).
One possible solution to these problems is grammar induction (or unsupervised syntactic parsing) algorithms for discourse parsing. However, existing studies on unsupervised parsing mainly focus on sentence structures, such as phrase structures (Lari and Young, 1990; Klein and Manning, 2002; Golland et al., 2012; Jin et al., 2018) or dependency structures (Klein and Manning, 2004; Berg-Kirkpatrick et al., 2010; Naseem et al., 2010; Jiang et al., 2016), though text-level structural regularities can also exist beyond the scope of a single sentence. For instance, in order to convey information to readers as intended, a writer should arrange utterances in a coherent order.
We tackle these problems by introducing unsupervised discourse parsing, which induces discourse structures for given text without relying on human-annotated discourse structures. Based on Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), one of the most widely accepted theories of discourse structure, we assume that coherent text can be represented as a tree structure, such as the one in Figure 1. The leaf nodes correspond to non-overlapping clause-level text spans called elementary discourse units (EDUs). Consecutive text spans are combined with each other recursively in a bottom–up manner to form larger text spans (represented by internal nodes), up to the global document span. These text spans are called discourse constituents. The internal nodes are labeled with both nuclearity statuses (e.g., Nucleus-Satellite or NS) and rhetorical relations (e.g., ELABORATION, CONTRAST) that hold between the connected text spans.
In this paper, we focus especially on unsupervised induction of an unlabeled discourse constituent structure (i.e., a set of unlabeled discourse constituent spans) given a sequence of EDUs, which corresponds to the first tree-building step in conventional RST parsing. Such constituent structures provide hierarchical information about the input text, which is useful in downstream tasks (Louis et al., 2010). For instance, a constituent structure [X [Y Z]] indicates that text span Y is preferentially combined with Z (rather than X) to form a constituent span, and then the text span [Y Z] is connected with X. In other words, this structure implies that [X Y] is a distituent span and requires Z to become a constituent span. Our challenge is to find such discourse-level constituentness from EDU sequences.
The core hypothesis of this paper is that discourse tree structures and syntactic tree structures share the same (or similar) constituent properties at a meta level, and thus learning algorithms developed for grammar induction are transferable to unsupervised discourse constituency parsing with proper modifications. Indeed, RST structures can be formulated in a similar way to phrase structures in the Penn Treebank, though there are a few differences: The leaf nodes are not words but EDUs (e.g., clauses), and the internal nodes do not contain phrase labels but hold nuclearity statuses and rhetorical relations.
The expectation-maximization (EM) algorithm has been the dominant unsupervised learning algorithm for grammar induction (Klein and Manning, 2004). Based on our hypothesis and this fact, we develop a span-based discourse parser (in an unsupervised manner) by using Viterbi EM (or “hard” EM) (Neal and Hinton, 1998; Spitkovsky et al., 2010; DeNero and Klein, 2008; Choi and Cardie, 2007; Goldwater and Johnson, 2005) with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018).1 Unlike the classic EM algorithm using inside-outside re-estimation (Baker, 1979), Viterbi EM allows us to avoid explicitly counting discourse constituent patterns, which are generally too sparse to yield reliable scores for text spans.
The other technical contribution is to present effective initialization methods for Viterbi training of discourse constituents. We introduce initial-tree sampling methods based on our prior knowledge of document structures. We show that proper initialization is crucial in this task, as observed in grammar induction (Klein and Manning, 2004; Gimpel and Smith, 2012).
On the RST Discourse Treebank (RST-DT) (Carlson et al., 2001), we compared our parse trees with manually annotated ones. We observed that our method achieves a Micro F1 score of 68.6% (84.6%) in the (corrected) RST-PARSEVAL (Marcu, 2000; Morey et al., 2018), which is comparable with or even superior to fully supervised parsers. We also investigated the discourse constituents that can or cannot be learned well by our method.
2 Related Work
The earliest studies using EM in unsupervised parsing are Lari and Young (1990) and Carroll and Charniak (1992), which attempted to induce probabilistic context-free grammars (PCFGs) and probabilistic dependency grammars using the classic inside–outside algorithm (Baker, 1979). Klein and Manning (2001b, 2002) performed a weakened version of constituent tests (Radford, 1988) with the Constituent-Context Model (CCM), which, unlike a PCFG, describes whether a contiguous text span (such as DT JJ NN) is a constituent or a distituent. The CCM uses EM to learn constituenthood over part-of-speech (POS) tags and was the first to outperform the strong right-branching baseline in unsupervised constituency parsing. Klein and Manning (2004) proposed the Dependency Model with Valence (DMV), which is a head-automata model (Alshawi, 1996) for unsupervised dependency parsing over POS tags and also relies on EM. These two models have been extended in various studies for further improvements (Berg-Kirkpatrick et al., 2010; Naseem et al., 2010; Golland et al., 2012; Jiang et al., 2016).
In general, these methods use inside–outside (dynamic programming) re-estimation (Baker, 1979) in the E step. However, Spitkovsky et al. (2010) showed that Viterbi training (Brown et al., 1993), which uses only the best-scoring tree to count grammatical patterns, is not only computationally more efficient but also empirically more accurate on longer sentences. These properties make Viterbi training suitable for “document-level” grammar induction, where the document length (i.e., the number of EDUs) tends to be long.2 In addition, as explained later in Section 3, we combine Viterbi EM with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018); this allows us to avoid explicitly counting each possible discourse constituent pattern symbolically, since such patterns are generally too sparse and often appear only once.
Prior studies (Klein and Manning, 2004; Gimpel and Smith, 2012; Naseem et al., 2010) have shown that initialization or linguistic knowledge plays an important role in EM-based grammar induction. Gimpel and Smith (2012) demonstrated that a properly initialized DMV improves attachment accuracy by 20–40 points (i.e., 21.3% → 64.3%) compared with uniform initialization. Naseem et al. (2010) also found that controlling the learning process with prior (universal) linguistic knowledge improves the parsing performance of the DMV. These studies usually rely on insights into syntactic structures. In this paper, we explore discourse-level prior knowledge for effective initialization of the Viterbi training of discourse constituency parsers.
Our method also builds on recent work on RST parsing. In particular, one of the initialization methods in our EM training (Section 3.3 (i)) is inspired by the intra-sentential and multi-sentential approach used in RST parsing (Feng and Hirst, 2014; Joty et al., 2013, 2015). We also follow prior studies (Sagae, 2009; Ji and Eisenstein, 2014) and utilize syntactic information, i.e., dependency heads, which contributes to further performance gains in our method.
The work most similar to ours is Kobayashi et al. (2019), who proposed unsupervised RST parsing algorithms in parallel with our work. Their method builds an unlabeled discourse tree by using the CKY dynamic programming algorithm. The tree-merging (splitting) scores in CKY are defined as the similarity (dissimilarity) between adjacent text spans, calculated from distributed representations using pre-trained embeddings. However, similarity between adjacent elements is not always a good indicator of constituentness. Consider the tag sequences “VBD IN” and “IN NN”. The former is an example of a distituent sequence, whereas the latter is a constituent. “VBD”, “IN”, and “NN” may have similar distributed representations because these tags co-occur frequently in corpora. This implies that it is difficult to distinguish constituents from distituents using only similarity (dissimilarity) measures. In this paper, we aim to mitigate this issue by introducing parameterized models that learn discourse constituentness.
3 Methodology
In this section, we first describe the parsing model we develop. Next, we explain how to train the model in an unsupervised manner by using Viterbi EM. Finally, we present the initialization methods we use for further improvements.
3.1 Parsing Model
Our parsing model consists of a single scoring function s(i,j) that computes a constituent score of a contiguous text span xi:j = xi,…,xj, or simply (i,j). The higher the value of s(i,j), the more likely it is that xi:j is a discourse constituent.
We show our parsing model in Figure 2. Our implementation of s(i,j) can be decomposed into three modules: EDU-level feature extraction, span-level feature extraction, and span scoring. We discuss each of these in turn. Later, we also explain the decoding algorithm that we use to find the globally best-scoring tree.
Feature Extraction and Scoring
Prior work (Sagae, 2009; Ji and Eisenstein, 2014) has shown that syntactic cues can improve discourse parsing performance. We therefore extract syntactic features from each EDU. We apply a (syntactic) dependency parser to each sentence in the input text,3 and then choose a head word for each EDU. A head word is a token whose parent in the dependency graph is ROOT or lies outside the EDU.4 We also extract the POS tag and the dependency label of the head word, where the dependency label is the relation between the head word and its parent.
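The following is a minimal sketch of this head-word selection; the dependency-parse format (1-based head indices with 0 denoting ROOT) and the function name are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of head-word selection for an EDU: a head word is a
# token whose parent is ROOT or lies outside the EDU; if several qualify,
# the left-most one is chosen (cf. footnote 4).

def edu_head(tokens, heads, labels, edu_start, edu_end):
    """tokens/heads/labels describe one sentence; heads are 1-based indices
    with 0 meaning ROOT. The EDU covers tokens[edu_start:edu_end]."""
    for i in range(edu_start, edu_end):
        parent = heads[i]
        if parent == 0 or not (edu_start < parent <= edu_end):
            return tokens[i], labels[i]          # head word and its label
    return tokens[edu_start], labels[edu_start]  # fallback: first token

# Toy example: the EDU "Because it rained ," of
# "Because it rained, we stayed home."
tokens = ["Because", "it", "rained", ",", "we", "stayed", "home", "."]
heads  = [3, 3, 6, 3, 6, 0, 6, 6]
labels = ["mark", "nsubj", "advcl", "punct", "nsubj", "root", "advmod", "punct"]
print(edu_head(tokens, heads, labels, 0, 4))     # ('rained', 'advcl')
```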
We did not use any feature templates because we found that they did not improve parsing performance in our unsupervised setting, though we observed that template features roughly following Joty et al. (2015) improved performance in a supervised setting.
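Putting the modules above together, the following is a minimal sketch of how s(i,j) could be implemented. The span representation (endpoint differences of the forward/backward LSTM states) is an assumption loosely following the span-based parsers of Stern et al. (2017) and Gaddy et al. (2018), not the authors' released code; the dimensionalities follow Section 4.4.

```python
# Hypothetical PyTorch sketch of the span scorer s(i, j): each EDU is
# represented by embeddings of its head word, POS tag, and dependency label,
# a BiLSTM runs over the EDU sequence, and an MLP scores span features.
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    def __init__(self, n_words, n_pos, n_rels):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, 300)
        self.pos_emb = nn.Embedding(n_pos, 10)
        self.rel_emb = nn.Embedding(n_rels, 10)
        self.bilstm = nn.LSTM(320, 125, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(250, 100), nn.ReLU(),
                                 nn.Linear(100, 1))

    def encode(self, head_words, head_pos, head_rels):
        # Inputs are (1, n) index tensors; output is an (n, 250) matrix of
        # contextualized EDU representations.
        x = torch.cat([self.word_emb(head_words),
                       self.pos_emb(head_pos),
                       self.rel_emb(head_rels)], dim=-1)
        h, _ = self.bilstm(x)
        return h.squeeze(0)

    def score(self, h, i, j):
        # Span feature for EDUs i..j (0-based, inclusive): differences of
        # forward/backward states at the span endpoints (an assumption).
        fwd = h[j, :125] - (h[i - 1, :125] if i > 0 else 0)
        bwd = h[i, 125:] - (h[j + 1, 125:] if j + 1 < h.size(0) else 0)
        return self.mlp(torch.cat([fwd, bwd], dim=-1)).squeeze(-1)

# Toy usage with an arbitrary vocabulary.
scorer = SpanScorer(n_words=1000, n_pos=50, n_rels=50)
h = scorer.encode(torch.tensor([[1, 2, 3]]),
                  torch.tensor([[4, 5, 6]]),
                  torch.tensor([[7, 8, 9]]))
print(scorer.score(h, 0, 2))   # scalar s(0, 2)
```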
Decoding
To parse the full document, we fill a chart of best subtree scores C[i,j] in a bottom–up manner until we obtain C[0,n − 1], and then recursively trace the history of the selected split positions, k, resulting in a binary tree spanning the entire document.
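The following sketch illustrates this decoding, assuming the standard recurrence for span-based chart parsing, C[i,j] = s(i,j) + max_k (C[i,k] + C[k+1,j]); the function and variable names are illustrative, not the authors' code.

```python
# Hypothetical CKY-style decoding over EDU spans. Length-1 spans are not
# scored (cf. Section 4.2), so single EDUs contribute a score of zero.

def decode(n, score):
    """score(i, j) -> constituent score of EDUs i..j (0-based, inclusive).
    Returns the best tree as nested (i, j) spans and its total score."""
    C, back = {}, {}
    for i in range(n):
        C[i, i] = 0.0
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            k, best = max(((k, C[i, k] + C[k + 1, j]) for k in range(i, j)),
                          key=lambda x: x[1])
            C[i, j] = score(i, j) + best
            back[i, j] = k

    def build(i, j):
        if i == j:
            return (i, j)
        k = back[i, j]
        return ((i, j), build(i, k), build(k + 1, j))

    return build(0, n - 1), C[0, n - 1]

# Toy usage with an arbitrary scoring function that favors length-2 spans.
tree, total = decode(4, lambda i, j: float(j - i == 1))
print(tree)   # ((0, 3), ((0, 1), (0, 0), (1, 1)), ((2, 3), (2, 2), (3, 3)))
```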
3.2 Unsupervised Learning Using Viterbi EM
In this paper, we use Viterbi EM (Brown et al., 1993; Spitkovsky et al., 2010), a variant of the EM algorithm that can also be viewed as a form of self-training (McClosky et al., 2006a,b), to train the span-based discourse constituency parser (Section 3.1) in an unsupervised manner. Viterbi EM has properties suitable for discourse processing, as described later in this section.
Overall Procedure
We first automatically sample initial trees based on our prior knowledge of document structures (described later in Section 3.3) and then perform the M step on the sampled trees to initialize the model parameters. After the initialization step, we alternate the E step and the M step. For early stopping, we use a held-out development set of 30 documents with annotated trees, which are never used as supervision for estimating the parsing model.
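The loop structure can be sketched as follows. The initial-tree sampler, E step, M step, and development-set evaluation are passed in as placeholder callables (our own naming; their details are given in Sections 3.1-3.3), and the stop-on-first-drop rule is only one simple way to realize the early stopping mentioned above.

```python
# Hypothetical outline of the training procedure: initialize from sampled
# trees, then alternate E and M steps with early stopping on the dev set.

def train_viterbi_em(model, train_docs, dev_docs,
                     sample_initial_trees, e_step, m_step, evaluate,
                     max_epochs=20):
    trees = [sample_initial_trees(doc) for doc in train_docs]
    m_step(model, train_docs, trees)                 # parameter initialization
    best_score, best_state = evaluate(model, dev_docs), model.state_dict()
    for _ in range(max_epochs):
        trees = [e_step(model, doc) for doc in train_docs]   # best parses
        m_step(model, train_docs, trees)                     # re-estimation
        score = evaluate(model, dev_docs)                    # early stopping
        if score <= best_score:
            break
        best_score, best_state = score, model.state_dict()
    model.load_state_dict(best_state)
    return model
```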
E Step
Klein and Manning (2001b) and Spitkovsky et al. (2010) count the grammatical patterns used to derive the best-scoring syntactic trees found in the E step, which are then normalized and converted into probabilistic grammars in the next M step.
In contrast, “discourse” constituents are significantly sparse and tend to appear only once, which implies that it is almost meaningless to explicitly count discourse constituent patterns symbolically. We therefore directly use the best-scoring trees found in the E step to update the model parameters in the next M step.
M Step
In the M step, we re-estimate the next model as if it were supervised by the best parse trees found in the previous E step.
The highest-scoring negative tree T′ (i.e., the tree T ≠ T̂ that maximizes s(T) + Δ(T, T̂), where T̂ is the best tree found in the E step and Δ counts the constituent spans of T that do not appear in T̂) can be efficiently found by modifying the dynamic programming algorithm in Equation (17). In particular, we replace s(i,j) with s(i,j) + 1[(i,j) ∉ T̂].
Combining Viterbi training and the margin-based objective function allows us to (1) avoid explicitly counting discourse constituent patterns as symbolic variables and (2) directly use the scores of the trees found in the E step for re-estimation of the next model.
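To make the combination concrete, the following is a minimal sketch of the margin-based objective under the standard hinge formulation of Stern et al. (2017); the span-counting Δ and all names are our assumptions, and in practice the negative tree would come from the loss-augmented decoding described above and the span scores from the scorer of Section 3.1.

```python
# Hypothetical sketch of the margin-based M-step loss:
#   loss = max(0, s(T') + Delta(T', T_hat) - s(T_hat)).
import torch

def tree_score(span_scores, spans):
    """s(T) = sum of span scores over the constituent spans of T."""
    return sum(span_scores[sp] for sp in spans)

def margin_loss(span_scores, t_hat, t_neg):
    """Hinge loss with Delta = number of spans of T' not in T_hat."""
    delta = sum(1.0 for sp in t_neg if sp not in t_hat)
    margin = tree_score(span_scores, t_neg) + delta - tree_score(span_scores, t_hat)
    return torch.clamp(margin, min=0.0)

# Toy usage: scores for a 3-EDU document (length-1 spans excluded).
scores = {sp: torch.tensor(v, requires_grad=True)
          for sp, v in {(0, 1): 0.2, (1, 2): 0.5, (0, 2): 1.0}.items()}
t_hat = {(1, 2), (0, 2)}   # E-step tree [x0 [x1 x2]]
t_neg = {(0, 1), (0, 2)}   # loss-augmented competitor [[x0 x1] x2]
loss = margin_loss(scores, t_hat, t_neg)
loss.backward()            # gradients flow back into the span scorer
print(loss.item())         # ~0.7 = (0.2 + 1.0 + 1) - (0.5 + 1.0)
```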
3.3 Initialization in EM
In general, the EM algorithm tends to get stuck in local optima of the objective function (Charniak, 1993). Therefore, proper initialization is vital in order to avoid trivial solutions. This phenomenon has also been observed in EM-based grammar induction (Klein and Manning, 2004; Gimpel and Smith, 2012).
In this section, we introduce the initialization methods we use in Viterbi EM. More precisely, given an input document (i.e., a sequence of EDUs), we automatically build a discourse constituent structure based on our general prior knowledge of document structures. Below, we describe the four pieces of prior knowledge we use for the initial-tree sampling.
(i) Document Hierarchy
It is intuitively reasonable to consider that (elementary) discourse units belonging to the same textual chunk (e.g., a sentence or paragraph) tend to form a subtree before crossing the chunk boundaries. For example, we can assume that EDUs in the same sentence are preferentially connected with each other before being combined with EDUs in other sentences. Indeed, Joty et al. (2013, 2015) and Feng and Hirst (2014) observed that it is effective to combine intra-sentential and multi-sentential parsing to build a document-level tree.
First, we split an input document into sentence-level and paragraph-level segments by detecting sentence and paragraph boundaries, respectively. We obtain sentence segmentation by applying the Stanford CoreNLP (Manning et al., 2014) to the concatenation of EDUs. We also extract paragraph boundaries by detecting empty lines in the raw documents.6 We then build a discourse constituent structure incrementally from sentence-level subtrees to paragraph-level subtrees and then to the document-level tree in a bottom-up manner. Figure 3 shows this process.
(ii) Discourse Branching Tendency
The second piece of prior knowledge relates to information order in discourse and the branching tendencies of discourse trees. In general, an important text element tends to appear at an earlier position in the document, and the text that follows complements its message; this is reflected in the Right Frontier Constraint (Polanyi, 1985) in Segmented Discourse Representation Theory (Asher and Lascarides, 2003). This tendency can be assumed to hold recursively. Therefore, it is reasonable to consider that discourse structures tend to form right-heavy trees, as shown in Figure 4(a). Based on this assumption, we build right-branching trees for sentence-level, paragraph-level, and document-level discourse structures in the initial-tree sampling.
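A minimal sketch combining knowledge (i) and (ii) is given below: right-branching trees are built within each sentence, then over the sentence subtrees within each paragraph, then over the paragraph subtrees. Trees are represented as nested tuples, and the grouping format is an illustrative assumption.

```python
# Hypothetical initial-tree sampling with Document Hierarchy (i) and the
# right-branching tendency (ii).

def right_branching(nodes):
    """Builds (n0 (n1 (n2 ...))) over a list of nodes (EDUs or subtrees)."""
    if len(nodes) == 1:
        return nodes[0]
    return (nodes[0], right_branching(nodes[1:]))

def initial_tree(paragraphs):
    """paragraphs: list of paragraphs, each a list of sentences, each a list
    of EDU ids. Returns a document-level binary tree."""
    paragraph_trees = []
    for paragraph in paragraphs:
        sentence_trees = [right_branching(sent) for sent in paragraph]
        paragraph_trees.append(right_branching(sentence_trees))
    return right_branching(paragraph_trees)

# Toy document: two paragraphs, the first with two sentences.
doc = [[[0, 1, 2], [3, 4]], [[5, 6]]]
print(initial_tree(doc))   # (((0, (1, 2)), (3, 4)), (5, 6))
```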
(iii) Syntax-Aware Branching Tendency
As already discussed, this work assumes that discourse structures tend to form right-heavy trees. However, in our preliminary experiments, we found that this naive assumption produces about 44% erroneous trees for sentence-level structures with 3 EDUs. For sentences with 4 EDUs, the error rate increases to about 70%, which is a non-negligible number in the initialization step.
To resolve this problem, we introduce another, more fine-grained, piece of knowledge for sentence-level discourse structures. We expect that sentence-level trees are more strongly affected by syntactic cues (e.g., dependency graphs) than paragraph-level or document-level trees. More specifically, given an EDU sequence of one sentence, xi,⋯ ,xj, we focus on the position of the EDU xk whose head word is in a ROOT relation with its parent in the dependency graph. We assume that the sub-sequence from the ROOT EDU onward, xk:j, roughly corresponds to the predicate of the sentence, and the sub-sequence before the ROOT EDU, xi:k−1, corresponds to the subject. We build right-branching trees for each sub-sequence individually and finally bracket the two subtrees together. We illustrate the procedure in Figure 4(b)-(c).
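A minimal sketch of this syntax-aware splitting (with the right_branching helper repeated from the previous sketch for self-containment; input conventions are illustrative assumptions):

```python
# Hypothetical RB* construction: split the sentence-level EDU sequence at
# the ROOT EDU, build right-branching trees for the two halves, and bracket
# the halves together.

def right_branching(nodes):
    if len(nodes) == 1:
        return nodes[0]
    return (nodes[0], right_branching(nodes[1:]))

def rb_star(edus, root_index):
    """edus: EDU ids of one sentence; root_index: position of the EDU whose
    head word is in a ROOT relation. Returns a binary tree."""
    if root_index == 0:                          # no pre-ROOT part
        return right_branching(edus)
    left = right_branching(edus[:root_index])    # roughly the subject part
    right = right_branching(edus[root_index:])   # roughly the predicate part
    return (left, right)

print(rb_star([0, 1, 2, 3], root_index=2))       # ((0, 1), (2, 3))
```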
(iv) Locality Bias
Inspired by Smith and Eisner (2006), we introduce a structural locality bias as the last piece of prior knowledge. Such a locality bias has been observed to improve the accuracy of dependency grammar induction. We hypothesize that discourse constituents with shorter spans are preferable to those with longer ones.
4 Experiment Setup
4.1 Data
We use the RST Discourse Treebank (RST-DT) built by Carlson et al. (2001),7 which consists of 385 Wall Street Journal articles manually annotated with RST structures (Mann and Thompson, 1988). We use the predefined split of 347 training articles and 38 test articles. We also prepare a development set with 30 instances randomly sampled from the training set, which is used only for hyper-parameter tuning and early stopping.
We tokenized the documents using the Stanford CoreNLP tokenizer and converted them to lowercase. We also replaced digits with “7” (e.g., “12.34” → “77.77”) to reduce data sparsity, and replaced out-of-vocabulary tokens with the special symbol “〈UNK〉”.
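As a small illustration of this preprocessing (the vocabulary construction is assumed given, and "<UNK>" stands in for the 〈UNK〉 symbol):

```python
# Hypothetical preprocessing: lowercasing, digit replacement, and OOV mapping.
import re

def preprocess(tokens, vocab):
    out = []
    for tok in tokens:
        tok = re.sub(r"\d", "7", tok.lower())    # "12.34" -> "77.77"
        out.append(tok if tok in vocab else "<UNK>")
    return out

print(preprocess(["Sales", "rose", "12.34", "%", "frobnicate"],
                 vocab={"sales", "rose", "77.77", "%"}))
# ['sales', 'rose', '77.77', '%', '<UNK>']
```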
4.2 Metrics
Following existing studies in unsupervised syntactic parsing (Klein, 2005; Smith, 2006), we quantitatively evaluate unsupervised parsers by comparing their parse trees with the manually annotated ones. We use the standard (unlabeled) constituency metrics of PARSEVAL: Unlabeled Precision (UP), Unlabeled Recall (UR), and their Micro F1, which indicate how well the parser identifies linguistically reasonable structures.
The traditional evaluation procedure for RST parsing is RST-PARSEVAL (Marcu, 2000), which adapts PARSEVAL to the RST representation shown in Figure 5(a)-(b). However, Morey et al. (2018) showed that, as illustrated in Figure 5(c), traditional RST-PARSEVAL gives a higher-than-expected score because it counts pre-terminals (i.e., spans of length 1), which can never be incorrect under the unlabeled constituency metrics. We therefore follow Morey et al. (2018) and encode RST trees as shown in Figure 5(d)-(f). That is, we exclude spans of length 1 and include the root node. We also do not binarize the gold-standard trees.
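A minimal sketch of this evaluation is given below; spans are (i,j) EDU index pairs, the root span is kept, length-1 spans are dropped, and the exact I/O format is an illustrative assumption.

```python
# Hypothetical corrected unlabeled constituency evaluation, micro-averaged
# over documents.

def eval_spans(pred_trees, gold_trees):
    tp = n_pred = n_gold = 0
    for pred, gold in zip(pred_trees, gold_trees):
        pred = {(i, j) for (i, j) in pred if j > i}   # drop length-1 spans
        gold = {(i, j) for (i, j) in gold if j > i}
        tp += len(pred & gold)
        n_pred += len(pred)
        n_gold += len(gold)
    up, ur = tp / n_pred, tp / n_gold
    return up, ur, 2 * up * ur / (up + ur)            # UP, UR, Micro F1

# Toy usage: one 4-EDU document, root span (0, 3) included on both sides.
pred = [{(0, 3), (1, 3), (2, 3), (0, 0)}]             # (0, 0) is ignored
gold = [{(0, 3), (0, 1), (2, 3)}]
print(eval_spans(pred, gold))                         # (0.666..., 0.666..., 0.666...)
```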
4.3 Baselines
Right Branching (RB)
Given a sequence of elements (i.e., EDUs or subtrees), RB always chooses the left-most element as a left terminal node and treats the remaining elements as a right nonterminal (or terminal). This procedure is applied recursively to the remaining elements on the right, resulting in (x0 (x1 (x2 …))). As described in Section 3.3, we expect RB to partially capture the branching tendency of discourse information structure. RB was also used as a strong baseline for unsupervised syntactic constituency parsing in Klein and Manning (2001b).
Left Branching (LB)
In contrast to RB, LB always chooses the right-most element as the right terminal and transforms the remaining elements on the left into a subtree, resulting in (((… xn−3) xn−2) xn−1).
Adaptive Right Branching (RB*)
We augment RB by considering the syntax-aware branching tendency, described in Section 3.3(iii). That is, based on the position of the head EDU (with the ROOT relation), we split the sentence into two parts and then perform RB for each sub-sequence.
Random Bottom–Up (BU)
BU randomly selects two adjacent elements and brackets them. This operation is repeated in a bottom–up manner until we obtain a single binary tree spanning the whole sequence.
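The LB and BU baselines can be sketched as follows, complementing the right-branching construction sketched in Section 3.3; trees are nested tuples over EDU ids, and these are illustrative implementations, not the authors' code.

```python
# Hypothetical LB and random BU baselines.
import random

def left_branching(nodes):
    """Builds (((n0 n1) n2) ...) over a list of nodes."""
    tree = nodes[0]
    for node in nodes[1:]:
        tree = (tree, node)
    return tree

def random_bottom_up(nodes, rng=random):
    """Repeatedly brackets a random pair of adjacent elements."""
    nodes = list(nodes)
    while len(nodes) > 1:
        k = rng.randrange(len(nodes) - 1)
        nodes[k:k + 2] = [(nodes[k], nodes[k + 1])]
    return nodes[0]

print(left_branching([0, 1, 2, 3]))    # (((0, 1), 2), 3)
print(random_bottom_up([0, 1, 2, 3]))  # e.g. ((0, (1, 2)), 3)
```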
4.4 Hyperparameters
We set the dimensionalities of the word embeddings, POS embeddings, relation embeddings, forward/backward LSTM hidden layers, and MLP to 300, 10, 10, 125, and 100, respectively. We initialized the word embeddings with the GloVe vectors trained on 840 billion tokens (Pennington et al., 2014). During training, we did not fine-tune the word embeddings. We ran the initialization step for 3 epochs. We used a minibatch size of 10 and the Adam optimizer (Kingma and Ba, 2015).
5 Results and Discussion
In this section, we report the results of the experiments and discuss them. We first compare our method with the baselines and with fully supervised RST parsers, including results published in the literature (Section 5.1). We then investigate the impact of the initialization methods (Section 5.2). Finally, we analyze the discourse constituents induced by our method (Section 5.3).
5.1 Performance Comparison
We compared our method with the baselines described in Section 4.3. We also included the previous work on unsupervised RST parsing (Kobayashi et al., 2019) as a baseline, though the comparison is not entirely fair because they use binarized gold trees for evaluation.8 For reference, we also compared our method with fully supervised parsers: the supervised version of our model9 and recent supervised parsers (Feng and Hirst, 2014; Joty et al., 2015) that incorporate intra-sentential and multi-sentential parsing as our parser does.
Table 1 shows the unlabeled constituency scores in the corrected RST-PARSEVAL (Morey et al., 2018) against non-binarized trees. We also show the traditional RST-PARSEVAL Micro F1 scores in parentheses. 〈fs, fd〉 indicates that we used only sentence boundaries and discarded paragraph boundaries. The scores of external supervised parsers (Feng and Hirst, 2014; Joty et al., 2015) are borrowed from Morey et al. (2018).
Method | UP | UR | Micro F1 |
Unsupervised | |||
RB | 7.5 | 7.7 | 7.6 (54.6) |
〈RBs,RBd〉 | 47.9 | 49.7 | 48.8 (74.8) |
〈RBs,RBp, RBd〉 | 57.9 | 60.2 | 59.0 (79.9) |
LB | 7.5 | 7.7 | 7.6 (54.6) |
〈LBs, LBd〉 | 41.7 | 43.3 | 42.5 (71.7) |
〈LBs, LBp, LBd〉 | 50.5 | 52.5 | 51.5 (76.2) |
BU | 19.2 | 19.9 | 19.5 (60.5) |
〈BUs, BUd〉 | 47.9 | 49.8 | 48.8 (74.9) |
〈BUs, BUp, BUd〉 | 54.5 | 56.6 | 55.5 (78.1) |
⋯ (a) | 64.5 | 67.0 | 65.7 (83.2) |
⋯ (b) | 65.6 | 68.1 | 66.8 (83.7) |
Kobayashi et al. (2019) | − | − | − (80.8) |
Ours, initialized by (a) | 66.2 | 68.8 | 67.5 (84.0) |
Ours, initialized by (b) | 66.8 | 69.4 | 68.0 (84.3) |
Ours (b) + Aug. | 67.3 | 69.9 | 68.6 (84.6) |
Supervised | |||
Ours, supervised | 68.3 | 70.9 | 69.6 (85.1) |
Feng and Hirst (2014)* | − | − | − (84.4) |
Joty et al. (2015)* | − | − | − (82.5) |
Human | − | − | − (88.7) |
We observe that: (1) the incremental tree-construction approach with boundary information consistently improves the parsing performance of the baselines; (2) RB-based CIPs are better than those with LB or BU; and (3) replacing RB with RB* yields further improvements. These results confirm the validity of our prior knowledge about document structures. The best baseline is (b), which achieves a Micro F1 score of 66.8% (83.7%) without any learning. Remarkably, this score is competitive with those of the supervised parsers.
Table 1 also demonstrates that our method outperforms all the baselines and achieves an F1 score of 67.5% (84.0%). If we use the best baseline for initial-tree sampling in Viterbi EM, the performance further improves to 68.0% (84.3%).
To investigate the potential of our unsupervised parser, we also augmented the training dataset with an external unlabeled corpus. We used about 2,000 news articles from the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993) that do not overlap with the RST-DT test set. We segmented the raw documents into EDUs using an external pre-trained EDU segmenter (Wang et al., 2018)10 and found that the larger unlabeled dataset improves parsing performance to 68.6%.
It is worth noting that our method outperforms the baselines used for the initialization, which implies that our method learns some knowledge of discourse constituentness in an unsupervised manner.
Our method also achieves comparable or superior results to supervised models. We suspect that the supervised version of our model outperforms the external supervised parsers (Feng and Hirst, 2014; Joty et al., 2015) mostly because of the feature extraction and the introduction of paragraph boundaries.
5.2 Impact of Initialization Methods
Here, we evaluate the importance of initialization in Viterbi EM. Beginning with uniform initialization, we incrementally applied the initialization techniques introduced in Section 3.3 and investigated their impact on the results.
Table 2 shows the results. Our model yields the lowest score of 58.9% with uniform initialization (no prior knowledge). Introducing Document Hierarchy (Section 3.3(i)) improves parsing performance slightly, to 59.1%. This result is interesting because the unlabeled constituency scores of BU and 〈BUs, BUp, BUd〉 are quite different (19.5 vs. 55.5; see Table 1). We then introduced Discourse Branching Tendency (Section 3.3(ii)) by replacing BU with RB in the CIP, which also slightly improved the performance, to 59.7%. We then introduced Syntax-Aware Branching Tendency (Section 3.3(iii)) by replacing RB with RB* only at the sentence level, which brought a considerable performance gain of 6.6 points (66.3%). Finally, we introduced the Locality Bias (Section 3.3(iv)) and achieved 67.5%. We also found that our model can be improved further, to 68.0%, if we use the best baseline for initialization.
Knowledge | Initial Trees | Micro F1 |
---|---|---|
No (Uniform) | BU | 58.9 |
(i) | 〈BUs, BUp, BUd〉 | 59.1 |
(i)+(ii) | 〈RBs, RBp, RBd〉 | 59.7 |
(i)+(ii)+(iii) | | 66.3 |
(i)+(ii)+(iii)+(iv) | | 67.5 |
Best baseline | | 68.0 |
In total, these initialization techniques made a difference of 9.1 points compared with uniform initialization (i.e., 58.9 → 68.0), which implies that initialization should be carefully considered in unsupervised discourse (constituency) parsing using EM and that the prior knowledge we proposed in Section 3.3(i)-(iv) captures some of the tendencies of document structures. We also found that the Syntax-Aware Branching Tendency is the most effective of these techniques, which suggests that more detailed knowledge can yield further improvements.
5.3 Learned Discourse Constituentness
Here, we further investigate the discourse constituentness learned by our method.
First, we calculated Unlabeled Recall (UR) scores for each relation class in RST-DT. We used the 18 coarse-grained classes. Note that we focus only on constituent spans {(i,j)} because our method does not predict relation labels. Table 3 shows the results for the four best and four worst relation classes of our method. We compare the results with those of the supervised version.
Relation | Ours | Supervised |
---|---|---|
ATTRIBUTION | 90.7 | 92.7 |
ENABLEMENT | 87.0 | 82.6 |
MANNER-MEANS | 77.8 | 85.2 |
TEMPORAL | 76.5 | 64.7 |
TOPIC-CHANGE | 57.1 | 42.9 |
EXPLANATION | 56.4 | 56.4 |
EVALUATION | 56.3 | 55.0 |
SUMMARY | 50.0 | 71.9 |
Total | 69.9 | 70.9 |
We observe that although our method is unsupervised and does not rely on structural annotations, some scores are comparable to those of the supervised version. We also found that relation classes with relatively high scores can be assumed to form right-heavy structures (e.g., ATTRIBUTION, ENABLEMENT), whereas relations with lower scores can be considered to form left-heavy structures (e.g., EVALUATION, SUMMARY). These results are natural because the initialization methods we used in the Viterbi training strongly rely on RB-based CIPs. This implies that, to capture the discourse constituency of SUMMARY or EVALUATION relations, it is necessary to introduce other initialization techniques (or prior knowledge) in the future.
Lastly, we qualitatively inspected the discourse constituentness learned by our method. We computed span scores s(i,j) for all possible spans (i,j) in the RST-DT test set without using any boundary information. We then sampled text spans xi:j with relatively high constituent scores, s(i,j) > 10.0.
As shown in the upper part of Table 4, our method learns some aspects of discourse constituentness that seem linguistically reasonable. In particular, we found that our method has the potential to predict brackets for (1) clauses with connectives qualifying other clauses from right to left (e.g., “X [because B.]”) and (2) attribution structures (e.g., “say that [B]”). These results indicate that our method is good at identifying discourse constituents near the end of sentences (or paragraphs), which is natural because RB is mainly used for generating the initial trees in EM training. The bottom part of Table 4 demonstrates that the beginning position of a text span is also important for estimating constituenthood, along with the ending position.
[The bankruptcy-court reorganization is being challenged ... by a dissident group of claimants][because it places a cap on the total amount of money available][to settle claims.][It also bars future suits against ...](11.74) |
[The first two GAF trials were watched closely on Wall Street][because they were considered to be important tests of goverment’s ability][to convince a jury of allegations][stemming from its insider-trading investigations.][In an eight-court indictment, the goverment charged GAF, ...](10.16) |
[The posters were sold for $1,300 to $6,000,][although the government says][they had a value of only $53 to $200 apiece.][Henry Pitman, the assistant U.S. attorney][handling the case,][said][about ...](11.31) |
[The office, an arm of the Treasury, said][it doesn’t have data on the financial position of applications][and thus can’t determine][why blacks are rejected more often.][Nevertheless, on Capital Hill,][where ...](11.57) |
[After 93 hours of deliberation, the jurors in the second trial said][they were hopelessly deadlocked,][and another mistrial was declared on March 22.][Meanwhile, a federal jury found Mr. Bilzerian ...](11.66) |
[(“I think — she knows me,][but I’m not sure ”)][and Bridget Fonda, the actress][(“She knows me,][but we’re not really the best of friends”).][Mr. Revson, the gossip columnist, says][there are people][who ...](11.11) |
[its vice president ... resigned][and its Houston work force has been trimmed by 40 people, of about 15%.][The maker of hand-held computers and computer systems said][the personnel changes were needed][to improve the efficiency of its manufacturing operation.][The company said][it hasn’t named a successor ...](4.44) |
[its vice president ... resigned][and its Houston work force has been trimmed by 40 people, of about 15%.][The maker of hand-held computers and computer systems said][the personnel changes were needed][to improve the efficiency of its manufacturing operation.][The company said][it hasn’t named a successor...](11.04) |
[its vice president ... resigned][and its Houston work force has been trimmed by 40 people, of about 15%.][The maker of hand-held computers and computer systems said ][the personnel changes were needed][to improve the efficiency of its manufacturing operation.][The company said][it hasn’t named a successor...](5.50) |
[its vice president ... resigned][and its Houston work force has been trimmed by 40 people, of about 15%.][The maker of hand-held computers and computer systems said][the personnel changes were needed][to improve the efficiency of its manufacturing operation.][The company said][it hasn’t named a successor...](7.68) |
6 Conclusion
In this paper, we introduced an unsupervised discourse constituency parsing algorithm that uses Viterbi EM with a margin-based criterion to train a span-based neural parser. We also introduced initialization methods for the Viterbi training of discourse constituents. We observed that our unsupervised parser achieves comparable or even superior performance to the baselines and fully supervised parsers. We also found that the learned discourse constituents depend strongly on the initialization used in Viterbi EM, and that it is necessary to explore other initialization techniques to capture more diverse discourse phenomena.
This study has two limitations. First, it focuses only on unlabeled discourse constituent structures. Although such hierarchical information is useful in downstream applications (Louis et al., 2010), both nuclearity statuses and rhetorical relations are also necessary for a more complete RST analysis. Second, our study uses only English documents for evaluation. However, different languages may have different structural regularities. Hence, it would be interesting to investigate whether the initialization methods are effective in different languages, which we believe would give suggestions on discourse-level universals. We leave these issues for future work.
Acknowledgments
The research results have been achieved by “Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation”, the Commissioned Research of the National Institute of Information and Communications Technology (NICT), Japan. This work was also supported by JSPS KAKENHI grant numbers JP19K22861 and JP18J12366.
Notes
Our code can be found at https://github.com/norikinishida/DiscourseConstituencyInduction-ViterbiEM.
Prior studies on grammar induction generally use sentences up to length 10, 15, or 40. On the other hand, about half the documents in the RST-DT corpus (Carlson et al., 2001) are longer than 40 EDUs.
We apply the Stanford CoreNLP parser (Manning et al., 2014) to the concatenation of the EDUs; https://stanfordnlp.github.io/CoreNLP/.
If there are multiple head words in an EDU, we choose the left-most one.
A detailed investigation of the span-based parsing model using LSTM can be found in Gaddy et al. (2018).
Therefore, our “paragraph” boundaries do not strictly correspond to paragraph segmentation. However, we found that this pseudo “paragraph” segmentation improves the parsing accuracy. We used the raw WSJ files (“*.out”) in RST-DT, e.g., “wsj_1135.out.”
However, scores against the binarized trees and the original trees are quite similar (Morey et al., 2018).
We used the same model and hyperparameters as the unsupervised model. The only difference is that we used conventional supervised learning with manually annotated trees instead of Viterbi EM.