Abstract
We extend a pair of continuous combinator-based constituency parsers (one binary and one multi-branching) into a discontinuous pair. Our parsers iteratively compose constituent vectors from word embeddings without any grammar constraints. Their empirical complexities are subquadratic. Our extension includes 1) a swap action for the orientation-based binary model and 2) biaffine attention for the chunker-based multi-branching model. In tests conducted with the Discontinuous Penn Treebank and TIGER Treebank, we achieved state-of-the-art discontinuous accuracy with a significant speed advantage.
1 Introduction
Discontinuity is common in natural languages, as illustrated in Figure 1. The children of a discontinuous constituent are not necessarily consecutive because each can group with its syntactic cousins in the sentence rather than its two adjacent neighbors. Although this relaxation makes discontinuous parsing more challenging than continuous parsing, it is especially valuable for studies and applications in non-configurational languages (Johnson, 1985), where word order does not determine grammatical function. As continuous parsing accuracy gradually saturates (Zhou and Zhao, 2019; Kitaev and Klein, 2018, 2020; Xin et al., 2021), discontinuous parsing has started gaining more attention (Fernández-González and Gómez-Rodríguez, 2020a, 2021; Corro, 2020).
Typically, constituency parsers fall into two genres (not counting methods that employ dependency parsing, e.g., Fernández-González and Gómez-Rodríguez [2020b]): 1) Global parsers use a fixed chart to search through all parsing possibilities for a global optimum. 2) Local parsers rely on local decisions, which leads to lower complexities. Global parser complexities start at O(n3) for binary parsing (Kitaev and Klein, 2018) or O(n4) for m-ary parsing (Xin et al., 2021), resulting in low speeds for long parses. Global parsers entertain numerous hypotheses, whereas a local shift-reduce or incremental parse exhibits more linguistic interest with fewer outputs (Yoshida et al., 2021; Kitaev et al., 2022). Neural global parsers dominate both continuous and discontinuous parsing (Corro, 2020; Ruprecht and Mörbitz, 2021) in terms of F1 score, but they do not exhibit a strong accuracy advantage over other parsers. Although local parsers may in some cases produce ill-formed trees, global parsers do not guarantee the selection of gold-standard answers from their charts either.
We propose extending a pair of local parsers to achieve high speed, high accuracy, and convenient investigation for both binary and multi-branching parses. Chen et al. (2021) proposed a pair of continuous parsers employing bottom-up vector compositionality; we dub them neural combinatory constituency parsers (NCCP), with binary CB and multi-branching CM. To the best of our knowledge, they possess the top parsing speeds for continuous constituency. CB reflects a linguistic branching tendency, whereas CM represents unsupervised grammatical headedness through composition weights. CM is among the few parsers that do not require preprocessing binarization (Xin et al., 2021) yet maintain a high parsing speed. We dub our extension DCCP, with binary DB and multi-branching DM. The mechanisms of the neural discontinuous combinators are shown in Figure 2.
Specifically, our combinators take a sentence-representing vector sequence as input and predict layers of concurrent tree-constructing actions. Two mechanisms are employed. DB triggers an action if the orientations of two neighboring vectors agree. A swap action exchanges the vectors; a joint action composes a new vector with them. CB only possesses a joint action. Meanwhile, DM takes discontinuous vectors to form biaffine attention matrices and decides their groups collectively; the remaining continuous vectors resort to chunking decisions, as with CM. NCCP and DCCP are unlexicalized supervised greedy parsers.
The contributions of our study are as follows:
We propose a pair of discontinuous parsers1 (i.e., binary and multi-branching) by extending continuous parsers of Chen et al. (2021).
We demonstrate the effectiveness of our work on the Discontinuous Penn Treebank (Evang and Kallmeyer, 2011, DPTB) and the TIGER Treebank (Brants et al., 2004). Our parsers achieve new state-of-the-art discontinuous F1 scores and parsing speeds with a small set of training and inferring tricks, including discontinuity as data augmentation, unsupervised headedness, automatic hyperparameter tuning, and pre-trained language models.
2 Related Work
Global Parsing.
The chart for binary continuous parsing (Kitaev and Klein, 2018) is triangular, as shown in black in Figure 3. The horizontal dimension enumerates the position of each node (i.e., each bit as a word). The vertical C indicates the number of continuous cases included at each node, and nodes at height h share combinatorics of complexity h. Consequently, a binary continuous chart parser has a fixed O(n3) complexity under the CKY decoding algorithm.
In m-ary and/or discontinuous cases (i.e., for multi-branching arity m and/or fan-out k of each constituent), the chart grows beyond that of binary continuous parsing in complexity. Both the horizontal and vertical axes expand to cover the combinations of discrete bits of each lexical node. Because of this expansion, m-ary continuous global parsing (Xin et al., 2021) has O(n4) complexity, whereas binary discontinuous parsing has O(n3k) complexity, exponential in the fan-out (Corro, 2020; Stanojevic and Steedman, 2020), where k ∈ {1, 2} are the special cases of binary CFG (likely in Chomsky Normal Form [CNF]) and binary Linear Context-Free Rewriting Systems with maximum fan-out 2 (Stanojevic and Steedman, 2020, LCFRS-2). M-ary discontinuous global parsing, which certainly has an even higher complexity, is not yet available.
For efficiency, LCFRS-2 parsers commonly restrict expensive rules (Corro, 2020; Stanojevic and Steedman, 2020; Ruprecht and Mörbitz, 2021). A tricky O(n3) variant of Corro (2020) covering the major rules has produced the best results. However, the variant excludes the 2% of sophisticated discontinuous rules in the TIGER Treebank. Limited by the simplified grammar, their discontinuous scores are low, especially the recalls. A global optimum does not guarantee a gold parse, leaving room for local parsing.
Local Parsing.
Local parsers do not adopt the chart framework and consider only one greedy hypothesis or a few hypotheses. Transition-based parsers with a swap or gap action take sequential actions and have low complexities (Maier, 2015; Coavoux and Crabbé, 2017); multiple swaps or gaps combine to construct a large discontinuous constituent. In contrast, stack-free parsing can directly pick up a distant component with one attachment search (Coavoux and Cohen, 2019). Easy-first (Nivre et al., 2009; Versley, 2014) and chunker-based parsers (Ratnaparkhi, 1997; Collobert, 2011) run rapidly.
Fernández-González and Gómez-Rodríguez (2020a) redirected discontinuity to dependency parsing via pointer networks and obtained significant accuracy improvements among greedy parsers. However, it is difficult to determine whether such improvements originate from the model or from the extra head information. In contrast, Fernández-González and Gómez-Rodríguez (2020b) reorder input words with at most O(n2) complexity and delegate to various continuous parsers.
3 Discontinuous Combinatory Parsing
We call a level of partial derivations or subtrees a ply (Jurafsky and Martin, 2009), as depicted in Figure 4. Just as each state of a transition-based parser leads to a single action, a ply is the state of DCCP, and each ply leads to a sequence of concurrent actions on itself.
Starting with a sequence of words (x1,⋯ ,xn) as an initial ply, we assemble an unlabeled discontinuous parse tree in a bottom–up manner by applying concurrent actions to the roots of subtrees in the ply and iterating until sequence length n = 1.
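The following minimal Python sketch pictures this loop; `fold` stands in for the DBFOLD/DMFOLD routines of Section 3.4, and all names are illustrative rather than the released implementation.

```python
from typing import Callable, List

def parse(ply: List["Node"], fold: Callable) -> "Node":
    """Assemble an unlabeled tree by iterating concurrent ply actions.

    `ply` starts as the sequence of word nodes (x1, ..., xn); `fold`
    applies one layer of concurrent actions and returns a shorter ply.
    """
    while len(ply) > 1:
        ply = fold(ply)  # one layer of joint/swap (DB) or chunk/group (DM)
    return ply[0]        # the remaining node is the root
```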
3.1 Binary Ply: Joint and Swap
Summary.
All nodes in a DB ply are derived by Formula 1 under the condition of Formula 2 to form a new ply. As exemplified in Figure 4 for DB, (x1, x2) and (x4, x5) meet the condition of Formula 2 and take a joint and a swap action, respectively. Meanwhile, x3 takes neither action and remains in the ply, because the orientations of x2 and x4 do not agree with x3, regardless of x3's own orientation.
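As a rough illustration of the joint-swap mechanism (not Formulas 1 and 2 themselves, which are omitted here), the sketch below scans a ply once; `orient` and `joint` stand in for the model's FFNN predictions, and the greedy left-to-right pairing is a simplifying assumption.

```python
def db_fold(ply, orient, joint, compose):
    """One DB ply: pair neighbors whose orientations agree (sketch).

    orient(x) is True if x points rightward; a pair acts when the left
    node points right and the right node points left. joint(l, r) then
    decides between composing a parent and swapping the two nodes.
    """
    out, i = [], 0
    while i < len(ply):
        if i + 1 < len(ply) and orient(ply[i]) and not orient(ply[i + 1]):
            if joint(ply[i], ply[i + 1]):
                out.append(compose(ply[i], ply[i + 1]))  # joint: one parent node
            else:
                out.extend([ply[i + 1], ply[i]])         # swap: exchange positions
            i += 2
        else:
            out.append(ply[i])                           # no agreement: carry over
            i += 1
    return out
```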
3.2 Multi-branching Ply: Affinity and Chunk
Dozat and Manning (2017) characterized each dependency tree as a sparse asymmetric matrix via biaffine attention, with each sole positive signal in a row (or column) indicating a lexical dependency (from a word to its head or vice versa). Nevertheless, lexical dependency is not available for constituency parsing, and biaffine attention becomes expensive at O(n2) complexity.
Summary.
DM balances fast chunking against small biaffine attention matrices to increase its efficiency. As exemplified in Figure 4 for DM, the discontinuous nodes (x1, x2, x5) are grouped as one because of their mutual affinity, which is equivalent to a 3 × 3 biaffine attention matrix of ones. Node x2 is selected as the medoid that determines the constituent's location in the new ply. Meanwhile, the continuous (x3, x4) form a constituent because chunk(xi ⊕ xi+1) = affinity(xi, xi+1) = 𝟙(i ∉ {2, 4}) for i ∈ [2, 4].
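A sketch of the grouping step follows, under the simplifying assumption that the booleanized affinity matrix is already consistent (Section 3.4 describes the validity checks); names such as `dm_group` are illustrative.

```python
import numpy as np

def dm_group(nodes, M, theta=0.5):
    """Group discontinuous nodes whose booleanized affinity rows match (sketch).

    Each returned group is then composed into a single node placed at its
    medoid position in the new ply.
    """
    B = M > theta                              # booleanize affinity scores
    groups, seen = [], set()
    for i in range(len(nodes)):
        if i in seen:
            continue
        members = [j for j in range(len(nodes)) if (B[i] == B[j]).all()]
        seen.update(members)
        groups.append(members)
    return groups

# Toy usage: x1, x2, x5 are mutually affine, so they form one group.
M = np.array([[.9, .8, .1],
              [.8, .9, .2],
              [.1, .2, .9]])
print(dm_group(["x1", "x2", "x5"], M))         # -> [[0, 1], [2]]
```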
3.3 Oracle
Empty Node and Unary Branch.
Similar to a range of previous works (Chen et al., 2021; Shen et al., 2018; Kitaev and Klein, 2020; Corro, 2020), we adopt an empty label for our substructures (e.g., those from binarization). Additionally, we collapse each unary branch into a single node and join its constituent labels according to their hierarchical order (e.g., a collapsed label S+VP for the unary derivation from S to VP), with easy restoration during inference. Unary collapse is productive in label types, as shown in Table 1.
Binarization ρDB.
The C children of a constituent join one by one via their orientation and joint signals. For a split point c ∈ [1, C), children 1 through c are set to orientation right (1) and children c+1 through C are set to orientation left (0). Neighboring children have positive joint signals if they are siblings; otherwise, negative joints swap them toward their siblings. We normalize the split point into a binarization factor ρDB for the treebank. In continuous parsing, ρDB ∈ {0, 1} implies CNF.
As illustrated in Figure 5 (a) & (b), we obtain layers of action signals from the binarization of (c). As an extension to CNF, we sample the factor from a beta distribution, ρDB ∼ Beta(αleft, αright) ∈ (0, 1), to create augmented samples.
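A sketch of this oracle, with the mapping from the sampled ρDB to the split point c as an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

def binarize_oracle(C, alpha_left=1.0, alpha_right=1.0):
    """Sample a binarization factor and derive orientation signals (sketch).

    For C children, rho ~ Beta(alpha_left, alpha_right) is mapped to a
    split point c in [1, C); children 1..c get orientation right (1) and
    children c+1..C get orientation left (0). The rho -> c mapping below
    is an assumption, not the paper's exact formula.
    """
    rho = rng.beta(alpha_left, alpha_right)
    c = 1 + int(round(rho * (C - 2))) if C > 2 else 1
    return [1] * c + [0] * (C - c)            # one orientation per child

# Beta(1, 1) is uniform; skewed (alpha_left, alpha_right) biases the
# binarization leftward or rightward, which is used for augmentation.
print(binarize_oracle(5, alpha_left=2.0, alpha_right=1.0))
```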
Medoid ρDM.
We use a set of categorical medoid factors ρDM ∈{random,leftmost,rightmost} to stratify a multi-branching tree: 1) random picks a random child with uniform probability, whereas 2) leftmost and 3) rightmost take the two ends of a discontinuous group.
3.4 Model Implementation
NCCP and DCCP have the same bottom–up iteration on a ply and share two types of neural components: bidirectional Long Short-Term Memory (BiLSTM) and feedforward neural network (FFNN).
In Algorithm 1, BiLSTMcxt contextualizes the sequence of words x1:n as embeddings, and BiLSTMply contextualizes the ply sequence for either DBFOLD in Algorithm 2 or DMFOLD in Algorithm 3, either of which modifies the ply and constructs new layers of embeddings. The necessity of BiLSTMcxt and BiLSTMply contextualization was empirically examined with NCCP. Meanwhile, FFNNtag and FFNNlabel predict the lexical tags and constituent labels from the contextualized individual embeddings without grammar constraints.
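A minimal PyTorch sketch of this wiring, with dimensions from Table 2 (model dimension 300; 6 contextual and 2 ply BiLSTM layers) and everything else (class name, tag/label inventory sizes) assumed for illustration:

```python
import torch
import torch.nn as nn

class CombinatorBackbone(nn.Module):
    """Shared NCCP/DCCP skeleton (sketch): contextualize, then tag/label."""
    def __init__(self, emb_dim=300, num_tags=50, num_labels=100):
        super().__init__()
        self.bilstm_cxt = nn.LSTM(emb_dim, emb_dim // 2, num_layers=6,
                                  bidirectional=True, batch_first=True)
        self.bilstm_ply = nn.LSTM(emb_dim, emb_dim // 2, num_layers=2,
                                  bidirectional=True, batch_first=True)
        self.ffnn_tag = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(),
                                      nn.Linear(emb_dim, num_tags))
        self.ffnn_label = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(),
                                        nn.Linear(emb_dim, num_labels))

    def forward(self, word_emb):
        ctx, _ = self.bilstm_cxt(word_emb)    # sentence-level context
        tags = self.ffnn_tag(ctx)             # lexical tags per word
        ply, _ = self.bilstm_ply(ctx)         # ply-level context for FOLD
        return tags, ply
```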
Binary Combinator.
The FOLD of DB is shown in Algorithm 2. The COMPOSE function uses sigmoid activation “σ” to create a pair of complementary gates λ and (1 − λ) for xL and xR. λ is a vector of the same size as the embeddings.
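A sketch of COMPOSE, assuming a learned linear map `gate_linear` from the concatenated children to the gate vector:

```python
import torch

def compose(x_left, x_right, gate_linear):
    """COMPOSE with complementary sigmoid gates (sketch)."""
    lam = torch.sigmoid(gate_linear(torch.cat([x_left, x_right], dim=-1)))
    return lam * x_left + (1 - lam) * x_right  # lambda has the embedding size

# Usage: the gate maps the concatenated children back to the embedding size.
d = 300
gate = torch.nn.Linear(2 * d, d)
parent = compose(torch.randn(d), torch.randn(d), gate)
```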
Multi-branching Combinator.
To identify discontinuous groups, DM takes the sigmoid-activated biaffine attention matrix M with values in (0, 1) and booleanizes it into B ← (M > θ). It 1) tries the default threshold θ = 0.5, the natural choice under sigmoid activation, and checks whether all the following statements are true:
B is symmetric (i.e., B = Bᵀ),
every row v ∈ B is nonzero (v ≠ 0),
any two rows v, w ∈ B satisfy either v = w or v ⊙ w = 0 (i.e., they are identical or disjoint).
This succeeds in most cases. Otherwise, it 2) tries a value from M as θ, checks, and loops again; the candidate thresholds are ordered by their distance to the default 0.5. If all tries in 2) fail, it 3) falls back to grouping all nodes as one and counts one FAIL.
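A sketch of this decomposition loop; the validity test mirrors the three checks above, and the cap on tries is an illustrative assumption:

```python
import numpy as np

def booleanize(M, max_tries=50):
    """Threshold search for a decomposable boolean affinity matrix (sketch)."""
    def valid(B):
        if not (B == B.T).all():              # 1) must be symmetric
            return False
        if (~B.any(axis=1)).any():            # 2) no all-zero row
            return False
        for v in B:                           # 3) rows identical or disjoint
            for w in B:
                if not ((v == w).all() or not (v & w).any()):
                    return False
        return True

    # Try the default 0.5 first, then values of M nearest to 0.5.
    candidates = [0.5] + sorted(set(M.ravel()), key=lambda t: abs(t - 0.5))
    for theta in candidates[:max_tries]:
        B = M > theta
        if valid(B):
            return B
    return np.ones_like(M, dtype=bool)        # FAIL: group all nodes as one
```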
Basic Losses.
Complexity.
Extreme cases provide the upper bounds for our theoretical complexity. For DB, the worst case consists of fully swapping plies followed by fully joining plies; each ply costs O(n), so the bound is O(n2). For DM, every ply involves a biaffine matrix over all nodes and may decrease n only by one; assuming check 2) is limited to a fixed number of tries, each ply costs O(n2), so the bound is O(n3).
However, DCCP has an empirical O(n2) complexity with strong linearity, as shown in Figure 6. DB has higher linear coefficients because of its slow binary combination, whereas DM shows a stronger quadratic tendency because of biaffine attention. Yet, their coefficient magnitudes are on par with one another.
3.5 Training Tricks
Data Augmentation.
Figure 7 first summarizes (e) basic data augmentation with binarization for DB and medoid for DM (including (a), (b), and (d) in Figure 5). With specific (αleft, αright), the beta distribution can resemble a uniform distribution or other biased distributions for detecting a linguistic branching tendency.
Then, we further leverage the intermediate non-terminal node to create more random subtrees. The augmentation is inspired by CM's deterministic _SUB node, which balances subtree heights and boosts both accuracy and efficiency. (Non-_SUB trees remain at their original heights.) However, the (g) subtree is sampled at random and creates imbalance. It creates only one stretching branch by iteratively grouping nodes with a fixed probability, which has three significant impacts (see the sketch after the example below):
Random stretching branches add mild variations to the context as states for robust ply actions in FOLD.
Random discontinuity creates DB orientation layers that cannot be created by ρDB binarization.
They reduce large (possibly continuous) constituents into smaller (possibly discontinuous) pieces without adding a large payload to the biaffine attention, which narrows the gap between DB and DM (DM is more vulnerable to dramatic many-to-one COMPOSE).
Taking the NP “a good day” for instance, any of “a day,” “a good,” and “good day” can be an intermediate option for creating the NP. On the one hand, these options create varied contexts for the remaining parts of a ply. On the other hand, assume that “a day” (which is not a ρDB product) is selected. DM learns to discern it from other possible “a day” groupings in biaffine attention based on their context.
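A sketch of this augmentation under assumed sampling details (the text specifies only that one stretching branch is grown by iterative grouping with a fixed probability):

```python
import random

def stretch_branch(children, p=0.5):
    """Grow one random stretching _SUB branch under a constituent (sketch).

    Starts from a random pair of children (possibly non-adjacent, hence
    discontinuous) and keeps absorbing further children with probability p,
    producing a single chain of intermediate nodes. The exact sampling
    scheme and the final placement (medoid) are assumptions.
    """
    if len(children) < 3:
        return children
    rest = list(children)
    i, j = sorted(random.sample(range(len(rest)), 2))
    sub = ("_SUB", [rest.pop(j), rest.pop(i)])    # pop j first: keep i valid
    while rest and random.random() < p:
        sub = ("_SUB", [sub, rest.pop(random.randrange(len(rest)))])
    return rest + [sub]                            # one stretching branch
```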
Model Robustness.
To further randomize DB training, we introduce (f) ply shuffle and its resultant losses Lorishfl and Ljntshfl. It shuffles the nodes with respect to each constituent, feeds the new sequence to BiLSTMply and FOLD, and reuses the orientation and joint signals of the ply for those additional losses. For example, a VP to the left of an NP gets shuffled to the right with the same ply signals (right, left), producing the additional loss items.
Continuous affinity and discontinuous affinity in DM undergo different identification processes. To minimize the difference, we introduce (h) additional losses for continuous and interply affinity, in addition to the cardinal discontinuous affinity loss. These reduce the risk of biaffine attention forwarding incorrect nodes, which would evoke exposure bias. We use positive rates βc and βx for the two losses to limit the sample size on layers that contain discontinuous nodes. Fallible signals are more likely to form losses via HINGE-LOSS.
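For reference, a generic hinge loss of this kind can be sketched as follows, assuming ±1 gold affinity signals; the exact formulation used in DCCP may differ:

```python
import torch

def hinge_loss(scores, gold, margin=1.0):
    """Margin loss over affinity signals (sketch).

    scores are raw affinity logits; gold holds +/-1 per pair. Only signals
    within the margin contribute, so confidently correct pairs cost nothing
    and fallible signals dominate the loss.
    """
    return torch.clamp(margin - gold * scores, min=0).mean()
```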
4 Experiment
DCCP takes frozen pre-trained FastText as static word embeddings (PWE), or fine-tuned 12-layer pre-trained XLNet and BERT as contextualized embeddings from pre-trained language models (PLM), as lexical input,2 and parses the English DPTB and German TIGER treebanks. See Table 2.
Hyperparameter | Value
---|---
DCCP model dimension | 300
BiLSTMcxt / BiLSTMply layers | 6 / 2
FFNN{tag, label, ori, jnt, disc} layers | 2
Optimizer | Adam (1-epoch γ warm-up, linear decay, and early stop)
Dropout rate (recurrent) | 0.4 (0.2)
Batch size (non-training) | 80 (160)

Parameter sizes (w/o PLM):

BiLSTMcxt | +CB | +CM | +DB | +DM
---|---|---|---|---
3.25M | 0.36M | 0.55M | 1.32M | 1.45M
Two-stage Training for a PWE Model.
PLM models also use the general hyperparameters, with learning rate 10−6 at S1. PLMs are frozen during the first 50 epochs to avoid noise pollution and are then fine-tuned with learning rate 3 × 10−6. At S2, they inherit the explored hyperparameters from the PWE models, except for the learning rate of 3 × 10−6.
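As a sketch of this schedule (Adam with 1-epoch warm-up and linear decay per Table 2; the function name and the decay endpoint of zero are assumptions):

```python
def lr_schedule(step, total_steps, warmup_steps, peak_lr=1e-6):
    """One-epoch linear warm-up followed by linear decay (sketch)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                # linear warm-up
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1 - frac)                             # linear decay
```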
4.1 Overall Results
Table 3 shows F1 scores of recent neural discontinuous parsers under comparable conditions on test sets. We follow their reported number of significant digits and reduce the effects of random initialization with an average of five runs. The details are shown in Table 4.
Model | Type | Complexity | DPTB F1 | DPTB D.F1 | DPTB Speed | TIGER F1 | TIGER D.F1 | TIGER Speed
---|---|---|---|---|---|---|---|---
without pre-trained language model | | | | | | | |
Coavoux et al. (2019) | Trans-Gap | O(n) | 91.0 | 71.3 | 80 | 82.7 | 55.9 | 126 |
Coavoux and Cohen (2019) | Stack-Free | O(n2) | 90.9 | 67.3 | 38 | 82.5 | 55.9 | 64 |
Pointer-based VG20 w/ Ling et al. (2015) | Seq-Labeling | O(n2) | 88.8 | 45.8 | 611 | 77.5 | 39.5 | 568 |
Pointer-based FG22 w/ Ling et al. (2015) | Multitask† | O(n2) | – | – | – | 86.6 | 62.6 | – |
Stanojevic and Steedman (2020) | Chart | O(n6) | 90.5 | 67.1 | – | 83.4 | 53.5 | – |
Corro (2020) | Chart | O(n3) | 92.9 | 64.9 | 355 | 85.2 | 51.2 | 474 |
Ruprecht and Mörbitz (2021) w/ flair | Chart | – | 91.8 | 76.1 | 86 | 85.1 | 61.0 | 80 |
DB w/ FastText (en & de) | Combinator | O(n2) | 92.0 | 75.6 | 940 | 84.9 | 60.1 | 1160 |
DM w/ FastText (en & de) | Combinator | O(n3) | 92.1 | 78.1 | 970 | 85.1 | 62.0 | 1300 |
with pre-trained language model | | | | | | | |
Pointer-based VG20 w/ BERTBASE | Seq-Labeling | O(n2) | 91.9 | 50.8 | 80 | 84.6 | 51.1 | 80 |
Pointer-based FG22 w/ BERTBASE | Multitask† | O(n2) | – | – | – | 89.8 | 71.0 | – |
Corro (2020) w/ BERT | Chart | O(n3) | 94.8 | 68.9 | – | 90.0 | 62.1 | – |
Ruprecht and Mörbitz (2021) w/ BERT | Chart | – | 93.3 | 80.5 | 57 | 88.3 | 69.0 | 60 |
FG21 w/ XLNet (en) or BERTBASE (de) | Reorder-Chart | O(n3) | 95.1 | 74.1 | 179 | 88.5 | 63.0 | 238 |
FG21 w/ XLNet (en) or BERTBASE (de) | Reorder-Trans | O(n2) | 95.5 | 73.4 | 133 | 88.5 | 62.7 | 157 |
DB w/ XLNet (en) or BERTBASE (de) | Combinator | O(n2) | 94.8 | 76.6 | 275 | 89.5 | 69.7 | 424 |
DM w/ XLNet (en) or BERTBASE (de) | Combinator | O(n3) | 95.0 | 83.0 | 375 | 89.6 | 70.9 | 535 |
Test set | Emb. | Model | F1 | D.F1
---|---|---|---|---
DPTB (test) | PWE | DB | 91.97±0.05 | 75.62±0.82
DPTB (test) | PWE | DM | 92.06±0.10 | 78.14±0.69
DPTB (test) | PLM | DB | 94.84±0.24 | 76.62±2.07
DPTB (test) | PLM | DM | 95.04±0.06 | 83.04±0.79
TIGER (test) | PWE | DB | 84.88±0.08 | 60.08±0.37
TIGER (test) | PWE | DM | 85.11±0.13 | 62.02±0.71
TIGER (test) | PLM | DB | 89.48±0.16 | 69.68±0.55
TIGER (test) | PLM | DM | 89.61±0.09 | 70.93±0.63
DCCP models achieved state-of-the-art performance in terms of discontinuous F1 scores and parsing speeds. Although the speed tests were conducted on different platforms, our parsers lead by a significant margin. In terms of overall F1 score, our parsers outperform some chart parsers (Stanojevic and Steedman, 2020; Ruprecht and Mörbitz, 2021) and slightly underperform the overall best results, marked in boldface.
4.2 Ablation Study
We ablate the PWE models in two-stage training, as shown in Table 5. We show only one representative run per ablation because of the similarly low variability on development sets. DB has two data augmentation items (the random subtree augmentation and the ρDB binarization) as well as one model item (ply shuffle), toggled by the triples in Table 5. For ρDB, on refers to sampling from Beta(1, 1) and off to a static ρDB = 0.5; (0,0,0) shows the performance of bare DB models.
Model (Stage) | Ablation | DPTB dev F1 | DPTB dev D.F1 | TIGER dev F1 | TIGER dev D.F1
---|---|---|---|---|---
DB | (0,0,0) | 90.93 | 63.28 | 87.73 | 56.49
DB | (0,1,1) | 91.61 | 69.84 | 88.70 | 61.15
DB | (1,0,1) | 91.62 | 74.25 | 87.93 | 59.85
DB | (1,1,0) | 91.48 | 70.97 | 89.05 | 63.32
DB (S1) | ‡(1,1,1) | 91.72 | 66.82 | 89.28 | 63.49
DB (S2) | ↪ optuna | 92.25 | 76.60 | 89.59 | 66.03
DM | (0,0,0) | 91.62 | 79.37 | 88.30 | 62.41
DM | (0,1,1) | 91.44 | 78.70 | 88.61 | 65.10
DM | (1,0,1) | 91.74 | 79.02 | 89.64 | 67.40
DM | (1,1,0) | 91.84 | 77.37 | 89.78 | 67.78
DM (S1) | ‡(1,1,1) | 92.16 | 80.29 | 89.77 | 68.20
DM (S2) | ↪ optuna | 92.37 | 82.76 | 89.84 | 68.45
On the flip side, DM's (0,0,0) contains randomness because of ρDM = random. We do not examine a static ρDM, as the static setting yields negative results for DB. Equipped with the effective training tricks, the variants enter the Bayesian optimization (BO) process at S2. DCCP shows its sensitivity to the subtree augmentation rate in Table 6.
Model | Dev set | 0 (off) | 0.1 | ‡0.25 | 0.5
---|---|---|---|---|---
DB | DPTB | 91.61 | 91.79 | 91.72 | 91.95
DB | TIGER | 88.70 | 89.04 | 89.28 | 89.25
DM | DPTB | 91.44 | 91.80 | 92.16 | 89.86
DM | TIGER | 88.61 | 89.45 | 89.77 | 88.61
4.3 Inference with Unsupervised Headedness
Both CM and DM provide unsupervised headedness. Chen et al. (2021) were unable to test the benefits of CM's unsupervised headedness because it is a final product that cannot affect parsing. However, DM's medoid affects parsing performance. On PLM DM, we select different ρDM categories, which affect the locations of all discontinuous constituents, and examine their generalization on the test sets, as shown in Table 7. All models are trained with ρDM = random, but inference with ρDM = uhead yields positive gains in accuracy.
Test set | DM medoid ρDM | F1 | D.F1
---|---|---|---
DPTB | uhead | 95.05 | 83.58
DPTB | leftmost | 95.00 | 81.64
DPTB | rightmost | 95.03 | 82.47
DPTB | random (min) | 95.01 | 82.18
DPTB | random (max) | 95.04 | 83.17
TIGER | uhead | 89.62 | 71.61
TIGER | leftmost | 89.56 | 71.43
TIGER | rightmost | 89.56 | 70.92
TIGER | random (min) | 89.55 | 71.26
TIGER | random (max) | 89.61 | 71.52
Parent (#) | Head child by maximum weight
---|---
NP (14.4K) from CM | DT (4.5K); NP (4.3K); NNP (1.6K); JJ (922); NN (751); NNS (616); etc. (1.6K; 12 of 50 types with “*”)
NP (14.3K) from DM | NP (4.7K); DT (4.5K); NNP (1.6K); JJ (786); NN (715); NNS (565); etc. (1.4K; 15 of 49 types with “*”)
5 Discussion
Properties of DCCP Models.
Table 3 exhibits high speeds and near state-of-the-art accuracies of DCCP compared to recent works. DCCP inherits many properties from NCCP. CB and CM are special cases of DB and DM without swap and discontinuous actions. All models contain compact components without grammar restriction. Each model has no more than 4.7M parameters apart from PWE or PLM, as listed in Table 2.
The variability of all models is low, except for the discontinuous F1 score of PLM DB on DPTB, as shown in Table 4. The main cause may not be random initialization but the different training processes of PWE and PLM models: PLM models reuse the configuration of PWE models at S2, and the degraded PLM models adopted low configurations. Because DPTB has less discontinuity and overall F1 scores are used for model evaluation, high variability in discontinuous F1 scores becomes more common without several BO trials; this phenomenon is also reflected in Table 5. DB lacks explicit discontinuity, and hyperparameter selection appears necessary on DPTB.
Figure 8 presents the F1 scores with respect to discontinuity and multi-branching arity. We select PLM models whose overall F1 scores are close (i.e., most F1 differences are less than 0.1 and DB's performance is high). DM exhibits persistent advantages over DB where these properties are frequent. We further determined that CM shows the same gains over CB starting from 4-ary nodes, with minor score differences on (D)PTB under the same condition, as shown in Table 9. The result supports the argument of Xin et al. (2021) that m-ary constituency parsing without binarization preserves some natural advantages, e.g., predicate-argument structure. Specifically, the subtree augmentation shifts DM's multi-branching advantage to frequent low-arity trees, favoring the overall scores on DPTB, while it enhances both the discontinuity and multi-branching advantages on TIGER, as shown in Table 10, in agreement with Table 6.
M-ary | Gold trees | PWE CB | CM | DB | DM | DB | DM | PLM CB | CM | DB | DM | DB | DM
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 9,073 | 92.25 | 92.02 | 91.43 | 91.35 | 91.68 | 91.53 | 93.80 | 94.33 | 93.36 | 93.41 | 93.91 | 93.82 |
2 | 26,338 | 90.41 | 89.94 | 89.87 | 89.68 | 89.95 | 90.02 | 94.41 | 94.33 | 93.47 | 93.65 | 93.77 | 93.92 |
3 | 7,009 | 84.17 | 83.56 | 83.50 | 83.34 | 83.60 | 83.79 | 90.27 | 89.81 | 88.31 | 88.60 | 88.57 | 88.87 |
4 | 1,490 | 77.87 | 78.95 | 78.19 | 79.82 | 78.50 | 78.86 | 87.42 | 86.51 | 83.98 | 86.50 | 83.88 | 85.90 |
5 | 344 | 74.19 | 77.29 | 76.14 | 78.46 | 74.97 | 78.87 | 81.42 | 84.06 | 78.61 | 85.15 | 81.38 | 83.17 |
6 | 96 | 70.05 | 78.35 | 72.90 | 80.63 | 77.39 | 76.29 | 78.64 | 80.00 | 79.23 | 83.50 | 79.02 | 76.44 |
7 | 32 | 64.71 | 86.15 | 73.53 | 87.10 | 76.47 | 71.43 | 76.47 | 70.18 | 75.76 | 77.42 | 80.60 | 54.90 |
8 | 12 | 64.00 | 72.73 | 81.82 | 85.71 | 75.00 | 80.00 | 78.26 | 86.96 | 78.57 | 83.33 | 91.67 | 63.16 |
9 | 3 | 100.00 | 75.00 | 100.00 | 75.00 | 100.00 | 50.00 | 100.00 | 75.00 | 100.00 | 85.71 | 85.71 | 100.00 |
k > 1 | 731 | – | – | 73.95 | 77.60 | 75.68 | 78.94 | – | – | 78.62 | 83.04 | 78.65 | 82.71 |
All | 44,397 | 92.54 | 92.08 | 91.99 | 92.02 | 92.00 | 92.00 | 95.71 | 95.44 | 94.70 | 94.79 | 95.08 | 95.09 |
M-ary | Gold trees | PWE DB | DM | DB | DM | PLM DB | DM | DB | DM
---|---|---|---|---|---|---|---|---|---
1 | 470 | 45.37 | 49.14 | 51.64 | 55.32 | 55.16 | 56.84 | 54.45 | 57.48 |
2 | 15,379 | 81.83 | 82.57 | 82.25 | 83.36 | 85.91 | 86.34 | 86.50 | 87.31 |
3 | 13,497 | 80.58 | 80.41 | 80.95 | 81.05 | 85.96 | 85.35 | 86.93 | 86.89 |
4 | 6,166 | 73.43 | 73.34 | 74.04 | 74.36 | 80.71 | 79.93 | 81.76 | 81.92 |
5 | 2,202 | 63.66 | 64.14 | 64.27 | 65.15 | 71.09 | 72.40 | 73.37 | 73.88 |
6 | 602 | 50.72 | 52.85 | 51.62 | 55.13 | 59.30 | 63.61 | 61.32 | 64.79 |
7 | 130 | 36.25 | 43.38 | 40.53 | 44.91 | 45.51 | 55.02 | 43.59 | 50.37 |
8 | 20 | 11.24 | 24.39 | 16.44 | 19.51 | 24.32 | 24.24 | 16.67 | 20.00 |
9 | 6 | 12.90 | 66.60 | 21.43 | 28.57 | 16.67 | 33.33 | 12.90 | 54.55 |
k = 1 | 36,317 | 85.95 | 85.92 | 86.41 | 86.39 | 89.85 | 89.75 | 90.69 | 90.66 |
k = 2 | 1,963 | 59.88 | 59.64 | 59.95 | 62.05 | 71.15 | 67.53 | 69.78 | 70.85 |
k = 3 | 194 | 57.46 | 59.83 | 58.89 | 60.61 | 68.23 | 63.91 | 69.21 | 68.95 |
All | 38,474 | 84.56 | 84.50 | 84.99 | 85.08 | 88.82 | 88.57 | 89.55 | 89.58 |
The training process of CB with CNF binarization has a slight impact on parsing accuracy. Chen et al. (2021) obtained the best CB with Bernoulli distribution P(ρDB = 0) = 0.85 (i.e., P(ρDB = 1) = 0.15 or L85R15 in their format “L%R%”) on PTB. They argued that such binarization brings orientation balance.
Similarly, DB's S2 exhibits a slightly leftward exploration with the beta distribution, as shown in Figure 9; DPTB shows a similar situation. Yet, the optimized distributions are relatively uniform and symmetric, which justifies our uniform randomness for ρDB at S1 and indicates a desirable property for future language-agnostic practice.
Linguistic Properties by DCCP.
From Figure 8, we learn that TIGER is more challenging in discontinuity. Incorrect discontinuity predictions seem to cascade into multi-branching predictions, degrading both properties. Meanwhile, DPTB is largely transformed from the PTB via typed traces and automatic rules, where the multi-branching accuracy stays more stable.
As seen in Table 7, ρDM = rightmost and ρDM = leftmost yielded the second-best F1 and D.F1 scores on DPTB and TIGER, respectively. The reversed settings yielded poor results, even if not the worst. This observation implies that many English heads are located rightward (right-branching), whereas German heads tend to be located leftward (verb-second word order, V2). German also has abundant separable verbs with their prefixes at the right-hand side of the clause.
Error Rates of DCCP.
Greedy parsers allow ill-formed outputs without a single root, especially in the case of single-model inference. Our models yielded a few invalid parses, as shown in Table 11. DM models produce more errors. However, unsuccessful decomposition of biaffine attention matrices might not be the direct cause, as also shown in Figure 11: the interply-loss variants cleared the matrices that could not be decomposed with any θ (FAIL). Similar to CM, this genre suffers from more failures where the ply size cannot be reduced to one during iteration. Greedy parsers must bear this defect because of their simplicity; however, invalid parses can still contribute positive F1 scores, just as global parsers can yield inaccurate parses.
Test set | DPTB | TIGER
---|---|---
Total trees | 2,416 | 4,998 |
DB’s ill-formed parses | 1 | 2 |
DM’s ill-formed parses | 15 | 47 |
Biaffine attention matrices | 594 | 7,278 |
θ = 0.5 solutions | 587 | 7,114 |
Average of tries if θ≠0.5 | 12.4 | 42.9 |
FAIL + identity matrices | 0+4 | 0+81 |
We applied methods such as Boolean matrix factorization and singular value decomposition; however, they did not provide any improvement and significantly slowed down parsing, because θ ≠ 0.5 cases are few. Our sequential decomposition tries might be naïve, but they are effective. In Figure 11, the number of tries does not increase significantly for θ < 0.9 within 50 tries. For θ ≥ 0.9, although some tries are expensive, the next section shows that they are worthwhile. The imbalanced signals from both datasets account for the bias of θ: more than 92% of the biaffine attention signals are ones, as shown in Figure 12. All affinity biases (i.e., baff ∈ [−1.60, −0.84]) are significantly negative for counteraction.
Weakness.
The design of affinity as biaffine attention is an initial but coarse attempt, which brings imbalance. If one instead encoded intra-constituent dependencies into the biaffine attention, both the signal balance and multi-grammar parsing might be better addressed. However, as our focus here is constituency parsing, we leave this topic for future study.
6 Sample Analysis
Continuous vs. Discontinuous Parsing.
Figure 13 highlights the value of discontinuous parsing by contrasting it with the respective CM parse. Conspicuously, the branching tendency of the continuous parse is to the right, whereas no such tendency is obvious for the discontinuous parse. Meanwhile, we observe instances of similar unsupervised headedness weights. This sample is not trivial; it challenges our PLM DCCP models.
Parsing Process of DCCP.
In Figure 13, DB shows sinuous travel traces for “I,” “was,” and the larger intermediate subtree nodes that involve turns of orientation. The varying context leads them through this complex movement. DB also creates some grammatical substructures for “How,” “referred,” “to,” “was,” “in,” and “school.”
Meanwhile, the DM parse is more dramatic. The formation of the lower discontinuous VP involves five nodes, two of which are the irrelevant words “How” and “referred,” triggered by incorrect discontinuity signals (they are discontinuous, but for the higher VP). The two nodes create a noisy biaffine attention matrix because their grammatical roles are compatible with the lower VP. As the model is trained for extra robustness, the matrix decomposition found the right θ within five tries and identified the correct VP members, excluding “How” and “referred.” The interply loss and the decoding process gave this parse a chance for perfection.
In Figure 14 for German, DB achieves a long-distance constituent in a more subtle way. The word “zwar” joins “registriert” as an intermediate subtree when the formation of an intermediate NP shortens their distance and prevents a long travel. “Gegenwärtig” follows and forms a VP. However, DM failed on this sample.
The Subtree Augmentation Rate Matters.
The above failure explains why DM is inferior to DB without the subtree augmentation. DB's orientation system allows some free travel before nodes join their correct mates, and constituent formation through accumulated steps creates a more stable context. In contrast, DM's group action happens all at once; an incorrect composition can create a quite different context, leading to unseen reaction chains. The strange unsupervised headedness weights reflect this issue. On the other hand, with the augmentation, DM can also gradually build and discover some semantic substructures, as shown in Figure 15. DB, by contrast, is not sensitive to the augmentation rate because of its incremental nature, in agreement with Table 6.
7 Conclusion
We proposed a pair of efficient and effective discontinuous combinatory constituency parsers, extending the neural combinator family of NCCP. The binary combinator DB extends the orientation-based system with a joint-swap mechanism; the multi-branching combinator DM leverages biaffine attention adapted for constituency. Our models (as in Table 2) achieved state-of-the-art discontinuous F1 scores with a significant advantage in speed.
In the future, we will aim to extend DCCP into a multilingual tool for directed acyclic graph (DAG) parsing with function tag prediction for predicate-argument structure.
Acknowledgments
We extend special thanks to our action editor, anonymous reviewers, and Prof. Yusuke Miyao for their invaluable comments and suggestions. This work was partly supported by TMU research fund for young scientists.
Notes
1. Our code with all model configuration files is available at https://github.com/tmu-nlp/UniTP.
2. To compare to other parsers with a lexical component, NCCP used pre-trained FastText or FastText trained on PTB; the former slightly increased the F1 score by 0.2. We choose BERT (https://www.deepset.ai/german-bert) for German and adopt FFNNcxt instead of BiLSTMcxt for model connection, following NCCP.
References
Author notes
Action Editor: Carlos Gómez-Rodríguez