Abstract
Although current CCG supertaggers achieve high accuracy on the standard WSJ test set, few systems make use of the categories’ internal structure that will drive the syntactic derivation during parsing. The tagset is traditionally truncated, discarding the many rare and complex category types in the long tail. However, supertags are themselves trees. Rather than give up on rare tags, we investigate constructive models that account for their internal structure, including novel methods for tree-structured prediction. Our best tagger is capable of recovering a sizeable fraction of the long-tail supertags and even generates CCG categories that have never been seen in training, while approximating the prior state of the art in overall tag accuracy with fewer parameters. We further investigate how well different approaches generalize to out-of-domain evaluation sets.
1 Introduction
Combinatory Categorial Grammar (CCG; Steedman, 2000) is a strongly lexicalized grammar formalism in which rich syntactic categories at the lexical level impose tight constraints on the constituents that can be formed. Its syntax-semantics interface has been attractive for downstream tasks such as semantic parsing (Artzi et al., 2015) and machine translation (Nǎdejde et al., 2017).
Most CCG parsers operate as a pipeline whose first task is ‘supertagging’, i.e., sequence labeling with a large search space of complex ‘supertags’ (Clark and Curran, 2004; Xu et al., 2015; Vaswani et al., 2016, inter alia). The complex categories specify valency information: expected arguments to the right are signaled with forward slashes, and expected arguments to the left with backward slashes. For example, transitive verbs in English (like “saw” in Figure 1a) are tagged (S∖ NP)/ NP to indicate that they expect a subsequent object noun phrase (NP) and a preceding subject NP to form a clause (S). Given the supertags, all that remains to parsing is applying general rules of (binary) combination between adjacent constituents until the entire input is covered. Supertagging thus represents the crux of the overall parsing process. In contrast to the simpler task of part-of-speech tagging, supertaggers are required to resolve most of the syntactic ambiguity in the input.
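As a minimal illustration of how supertags drive the derivation, the following toy sketch applies forward and backward application over string-encoded categories; it is our own illustration, not the parser used in this work, and the helper names are assumptions for exposition.

```python
# A toy sketch (not the parser used in this work) of forward and backward
# application over string-encoded categories; helper names are illustrative.

def split_outer(cat):
    """Split a category such as '(S\\NP)/NP' into (result, slash, argument)."""
    depth = 0
    for i in range(len(cat) - 1, -1, -1):    # scan for the outermost slash
        c = cat[i]
        if c == ')':
            depth += 1
        elif c == '(':
            depth -= 1
        elif depth == 0 and c in '/\\':
            return cat[:i].strip('()'), c, cat[i + 1:].strip('()')
    return None                               # atomic category, no slash

def combine(left, right):
    """Apply forward (X/Y Y => X) or backward (Y X\\Y => X) application."""
    parts = split_outer(left)
    if parts and parts[1] == '/' and parts[2] == right.strip('()'):
        return parts[0]
    parts = split_outer(right)
    if parts and parts[1] == '\\' and parts[2] == left.strip('()'):
        return parts[0]
    return None

# "saw" first combines with the object NP, then with the subject NP:
vp = combine('(S\\NP)/NP', 'NP')              # 'S\\NP'
print(vp, combine('NP', vp))                  # S\NP S
```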
One key challenge of CCG supertagging is that the tagset is large and open-ended to account for combinatorial possibilities of syntactic constructions. This results in a heavy-tailed distribution of supertags, which is visualized in Figure 1b; a large proportion of unique supertags are rare or unseen (out-of-vocabulary, OOV) even in a training set as large as the Penn Treebank’s. Previous CCG supertaggers have surrendered in the face of this challenge: They treat categories as a fixed set of opaque labels, rather than modeling their compositional structure. Following Clark (2002), the standard approach is to consider only supertags appearing at least 10 times in the training data, sacrificing the possibility of predicting two thirds of the supertag types in CCGbank. Rare supertags may have little impact on overall token accuracy—but the cost of this compromise is a fundamental incapability in truly generalizing to the task.
In this paper, we confront the long-tail problem head-on by proposing a constructive framework in which supertags are built from scratch rather than predicted as opaque labels (Kogkalidis et al., 2019). In contrast to prior constructive supertaggers (Kogkalidis et al., 2019; Bhargava and Penn, 2020), our model builds upon the observation that supertags are themselves tree-structured, and hence can be generated top–down.1 Our experiments on the English CCGbank and its rebanked version show that constructing supertags as trees improves our ability to predict rare and even unseen tags, without sacrificing performance on the more common ones.
Our contributions are threefold:
We introduce a general constructive supertagger that generates each lexical category recursively as a tree. To our knowledge, this is the first tree-structured predictor of its kind.
We apply this model to English CCG supertagging. On frequent supertags, it matches the more traditional approach of using a fixed label set, while on the rare and unseen ones, we see substantial improvements in predictive performance.
We perform an array of in-depth analyses that highlight the impact of different modeling and inference choices for the task of predicting supertags.
2 Motivation
2.1 Anatomy of a Supertag
The internal structure of any CCG supertag is a tree licensed by the CFG in Figure 2. Atomic categories like S and NP are related by slashes to form functional categories, which can in turn participate in larger functional categories. By convention, the infix-notation supertag (S∖ NP)/ NP is equivalent to the tree in Figure 3a, with prefix notation / ∖ S NP NP, where the slash signals the direction in which the category can combine, the right child of any slash is the argument, and the left child is the result of combining the category with its argument. These hierarchical supertags constrain lexical item combination, e.g., specifying that a transitive verb subcategorizes for an object NP to its right. This flexibility leads to infinitely many2 possible supertags; in practice, they follow a power law distribution. CCGbank (comprising the WSJ portion of the Penn Treebank) contains numerous rare supertags, including several that occur only in the test set. Still others can be expected to occur in a much larger English corpus.
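This category grammar can be mirrored directly as a recursive datatype. The sketch below (class and function names are ours, not from the paper) builds the tree for (S∖ NP)/ NP and serializes it in the prefix order used by the sequential models.

```python
# A sketch of the category grammar as a recursive datatype, with the prefix
# serialization used by the sequential models; names are ours, not the paper's.
from dataclasses import dataclass
from typing import Union

@dataclass
class Atom:                       # AtomCat: a leaf such as NP or S[dcl]
    label: str

@dataclass
class Functor:                    # Slash node with result and argument children
    slash: str                    # '/' (argument to the right) or '\' (to the left)
    result: 'Cat'
    argument: 'Cat'

Cat = Union[Atom, Functor]

def to_prefix(cat: Cat) -> list:
    """Serialize a category tree in prefix (slash, result, argument) order."""
    if isinstance(cat, Atom):
        return [cat.label]
    return [cat.slash] + to_prefix(cat.result) + to_prefix(cat.argument)

def depth(cat: Cat) -> int:
    return 0 if isinstance(cat, Atom) else 1 + max(depth(cat.result), depth(cat.argument))

# (S\NP)/NP, the transitive-verb category of "saw" in Figure 1a:
saw = Functor('/', Functor('\\', Atom('S'), Atom('NP')), Atom('NP'))
print(to_prefix(saw))             # ['/', '\\', 'S', 'NP', 'NP']
print(depth(saw))                 # 2
```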
In previous work, CCG supertaggers have skirted this problem by ignoring the long tail of supertags: Specifically, the ones occurring fewer than 10 times in the training set. The consequences of such a threshold can be seen from Figure 1b, which visualizes the distribution of supertag types in terms of depth (representing supertag complexity) and token frequency. The supertags seen in training that would be ignored under a threshold of 10 appear in red, and the test set supertags never seen in training in dark blue. Though these only account for 0.2% of tokens in the test set, they are present in nearly 4% of sentences and represent fully two thirds of supertag types in CCGbank. Further, we see that rarer categories are increasingly more complex, i.e., their argument and result types are in turn composed of FxnCats. Note in particular that the bulk of depth-4 categories and almost all categories with depth 5 or more fall below the 10-count threshold.
Inspired by the recent proposals of Kogkalidis et al. (2019) and Bhargava and Penn (2020), we hypothesize that modeling the structure of supertags, rather than treating them holistically and thresholding by frequency, can successfully generalize to rare and unseen tags. For example, a good model should draw connections between words that are NPs themselves, words that take NPs as arguments (e.g., verbs), and words that yield NPs as their result (e.g., determiners). We examine whether such linguistically informed generalizations can benefit supertags of various frequency and structures, focusing on the rare and complex ones.
2.2 Constructivity in Supertagging
We contrast two general paradigms for supertagging below. (Our experiments will explore multiple specific modeling strategies within each.)
Most previous supervised CCG supertaggers assume a closed tagset and nonconstructively assign one complete category per word (Figure 3b). This paradigm is oblivious to the internal structure of the supertag and incapable of predicting unseen supertags. This is often combined with a frequency cutoff: Only the k supertags seen at least n times in the training data are considered by the model, making each tag decision a k-way classification task. Traditionally (Clark, 2002), systems use a threshold of n = 10 (yielding k = 425 in CCGbank and k = 511 in CCGrebank). The main motivation for this is to sidestep the most sparse and possibly noisy region of the output space without dramatically decreasing token coverage. Below we experiment with both thresholded and non-thresholded models.
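The following sketch (our own illustration, not the released code) shows how such a thresholded label set can be built, with all rare supertags collapsing to a single placeholder symbol.

```python
# A sketch (our own illustration) of the thresholded output space: only
# supertags seen >= n times become labels; everything else is mapped to a
# single placeholder symbol during training.
from collections import Counter

def build_tagset(train_tags, n=10, unk='<UNKNOWN>'):
    counts = Counter(train_tags)                       # tag -> training frequency
    kept = sorted(t for t, c in counts.items() if c >= n)
    label2id = {t: i for i, t in enumerate(kept)}
    label2id[unk] = len(label2id)                      # rare tags collapse to the placeholder
    return label2id

def encode(tag, label2id, unk='<UNKNOWN>'):
    return label2id.get(tag, label2id[unk])

# Toy usage: with n=10, the singleton tag falls out of the label set.
tags = ['NP'] * 50 + ['(S\\NP)/NP'] * 12 + ['((S\\NP)/NP)/PR']
label2id = build_tagset(tags, n=10)
print(len(label2id))                                   # 3 labels, incl. the placeholder
print(encode('((S\\NP)/NP)/PR', label2id))             # maps to the placeholder id
```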
In contrast, a constructive tagger models the internal structure of supertags (Kogkalidis et al., 2019). Supertags are constructed from minimal pieces (which for CCG are slashes and atomic categories).3 There is no frequency cutoff at training time.4 At test time, supertags are predicted piece by piece, and there is no constraint that predicted supertags must have been seen before. This can be done sequentially or recursively, taking the categories’ internal tree structure into account.
Two different methods of sequential decoding have been explored by Kogkalidis et al. (2019) (hereafter ‘K+19’) and Bhargava and Penn (2020) (‘BP20’). K+19 used a sequence-to-sequence model, with a single target sequence consisting of all serialized supertags for a sentence (Figure 3c). They experimented with a type-logical grammar formalism similar to CCG, and a Dutch corpus. BP20 decoded CCG supertags as a separate sequence per token, and additionally conditioned each new supertag on the prediction history.
Here we go a step further and introduce methods for directly decoding supertags as trees, freeing the models from having to learn this fundamental property from sequential data. We hypothesize that this will produce better and more compact representations that generalize to the long tail.
3 Tree-Structured Constructive Supertagging
Given a sequence of words (a sentence), our goal is to predict each word’s supertag. Constructing a supertag from its components requires a scoring function for the parts that is cognizant of both surrounding words and categories. Below we describe the decoding procedure (§3.1) and scoring functions (§3.2) we developed for this purpose, which, in line with §2, explicitly incorporate the categories’ tree structure.
3.1 Predicting Tree-structured Supertags
According to the grammar in Figure 2, each category is a binary tree with the following properties: (1) Slashes are non-terminals with two children: the category’s argument (the syntactic type it seeks to combine with), and its result (the type it yields after combining with its argument). (2) AtomCats are leaf nodes. (3) The root of the tree is either the category’s sole AtomCat, or its outermost functor, whose argument it seeks to combine with first.
Our output supertags are trees, but there is a crucial difference between our work and constituency parsing of sentences. In the latter case, the yield of a predicted tree is constrained to be the input sentence, thereby restricting both its depth and width. But in the case of supertagging, each word is associated with a binary tree–structured supertag whose breadth and depth are unknown at inference time. We therefore grow supertags for each word from the top down (Figure 3a). At step t, the model greedily chooses the most likely node labels at depth t, conditioned on the word encoding and the ancestors predicted so far. The first decision (t = 0) is either an atomic category, or the main functor. In the latter case, the model then moves on to select the argument and result types, which may be atomic categories or functors themselves. We are thus guaranteed to always generate well-formed categories. As CCG supertags are not very deep in practice, we impose an upper limit on the depth of predicted trees based on the most complex categories found in the training and development data, with the main advantage that memory allocation during training can be bounded.5
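The procedure can be summarized as in the sketch below. Here `score_node` stands in for the model's scorer over atomic categories and slashes and is an assumption of this illustration; our implementation predicts all nodes at a given depth in parallel rather than recursing.

```python
# A simplified, recursive sketch of the top-down decoding loop in §3.1.
# `score_node(word_encoding, ancestors, atomic_only)` stands in for the
# model's scorer over atomic categories and slashes and is an assumption
# here; our implementation predicts all nodes at one depth in parallel.
SLASHES = {'/', '\\'}
MAX_DEPTH = 6                    # deepest category in the training/dev data

def decode_supertag(word_encoding, score_node, depth=0, ancestors=()):
    """Greedily grow one word's category tree from the root down."""
    atomic_only = depth >= MAX_DEPTH                   # keep the tree bounded
    label = score_node(word_encoding, ancestors, atomic_only)
    if label not in SLASHES:
        return label                                   # AtomCat leaf: done
    # Slash functor: recursively generate result (left) and argument (right).
    result = decode_supertag(word_encoding, score_node,
                             depth + 1, ancestors + ((label, 'result'),))
    argument = decode_supertag(word_encoding, score_node,
                               depth + 1, ancestors + ((label, 'argument'),))
    return (label, result, argument)
```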
3.2 Modeling Supertags
Contextualized Word Embeddings.
In all conditions, we encode sentences using the pretrained RoBERTa-base encoder (Liu et al., 2019), finetuning it for our task.6 Several recent studies have shown that such models can capture syntactic properties and relations (e.g., Jawahar et al., 2019; Clark et al., 2019; Hewitt and Manning, 2019).
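A sketch of this encoding step, assuming the HuggingFace `transformers` library (not code prescribed by the paper); pooling subwords back to words by taking each word's first subword vector is one reasonable choice among several.

```python
# A sketch of the encoding step, assuming the HuggingFace `transformers`
# library (not code prescribed by the paper); we pool subwords back to
# words by taking each word's first subword vector.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True)
encoder = AutoModel.from_pretrained('roberta-base')    # finetuned during training

def encode_words(words):
    enc = tokenizer(words, is_split_into_words=True, return_tensors='pt')
    hidden = encoder(**enc).last_hidden_state[0]        # (num_subwords, 768)
    first = {}
    for pos, widx in enumerate(enc.word_ids()):          # subword -> word alignment
        if widx is not None and widx not in first:
            first[widx] = pos
    return torch.stack([hidden[first[i]] for i in range(len(words))])

vecs = encode_words(['I', 'saw', 'a', 'squirrel'])       # shape (4, 768)
```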
Output-positional Encoding.
We experiment with two alternative ways of deriving hidden states for category-internal positions (k,i), where i > 0: a tree-structured recursive neural network (TreeRNN; Tai et al., 2015, inter alia), and a deterministic addressing function that accesses each node directly (AddrMLP). Both variants, described below, also take into account the current node’s ancestors.
Using tree-structured RNNs for top–down generation is reminiscent of Zhang et al. (2016).
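One way to realize such a top–down TreeRNN transition is with two GRU cells that derive the result- and argument-child states from the parent state, starting from the word's encoding at the root; the exact inputs and dimensions in the sketch below are our assumptions, not a full specification of our architecture.

```python
# A minimal sketch of one way to realize the top-down TreeRNN transition:
# two GRU cells derive the result- and argument-child states from the
# parent state, starting from the word's encoding at the root. The exact
# inputs and dimensions are our assumptions, not a full specification.
import torch
import torch.nn as nn

class TreeRNNCell(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.result_gru = nn.GRUCell(dim, dim)   # transition to the result (left) child
        self.arg_gru = nn.GRUCell(dim, dim)      # transition to the argument (right) child

    def children(self, parent_state, parent_label_emb):
        # Both transitions are conditioned on the label chosen at the parent node.
        return (self.result_gru(parent_label_emb, parent_state),
                self.arg_gru(parent_label_emb, parent_state))

cell = TreeRNNCell()
root_state = torch.randn(1, 768)                  # the word's contextualized encoding
slash_emb = torch.randn(1, 768)                   # embedding of the slash chosen at the root
result_state, arg_state = cell.children(root_state, slash_emb)
```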
We use a binary addressing scheme to refer to individual nodes: Each node in a category's tree representation is addressed by a sequence of bits a_0a_1a_2…a_T, corresponding to a top-down traversal of the tree. For t > 0, the value a_t = 0 (or 1) is interpreted as branching to the left (or right) at depth t. The root bit a_0 has an arbitrary placeholder value (say, 1).8 In the example in Figure 3, the inner NP argument (the argument of the top-level result) is addressed as 101. We represent the position of a node by a vector of the elements in its address, mapping a_t = 0 to 1 and a_t = 1 to −1 and ignoring a_0. The slashes in the node's ancestors are similarly mapped to a vector consisting of 1 for each forward slash and −1 for each backward slash. We use 0 to pad both feature vectors to a fixed maximum length. We then use a single linear layer to project these features into the encoder's hidden space before adding them to the word's contextualized encoding.9
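Concretely, the positional features can be assembled as in the following sketch; the helper names, the fixed feature length, and the usage example are our own illustration.

```python
# A sketch of the positional features described above: address bits and
# ancestor slashes are mapped to +/-1, zero-padded, projected with one
# linear layer, and added to the word encoding. Helper names, the fixed
# feature length, and the usage example are our own illustration.
import torch
import torch.nn as nn

MAX_FEATS = 6                                     # bounded by the maximum category depth

def position_features(address, ancestor_slashes):
    addr = [1.0 if b == '0' else -1.0 for b in address[1:]]   # ignore the root bit a_0
    slashes = [1.0 if s == '/' else -1.0 for s in ancestor_slashes]
    pad = lambda xs: xs + [0.0] * (MAX_FEATS - len(xs))
    return torch.tensor(pad(addr) + pad(slashes))

class AddrEncoder(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(2 * MAX_FEATS, dim)

    def forward(self, word_vec, address, ancestor_slashes):
        return word_vec + self.proj(position_features(address, ancestor_slashes))

# The inner NP argument from Figure 3 (address 101, under ancestors '/', '\'):
enc = AddrEncoder()
h = enc(torch.zeros(768), '101', ['/', '\\'])     # shape (768,)
```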
Attention.
Models equipped with the additional attention mechanism are marked with ‘@’ in the results tables.
3.3 Learning
Loss function.
To achieve our goal of constructing correct and complete categories, we need our models to be correct in each atomic decision, even and especially for more complex categories. We make the loss function sensitive to this by normalizing the cross-entropy between the predictions and the ground-truth only over the number of words in a batch and retaining the unnormalized sum over individual atomic category decisions. This naturally scales with category complexity.
If instead we were normalizing over atomic decisions, too, the loss contribution of, e.g., NP when it occurs inside a complex category (S∖ NP)/ NP with size 5, would be 5 times smaller than when it occurs as a complete category on its own. The disadvantage that complex categories already have as they tend to be rarer than simpler ones (Figure 1b) would be reinforced. By keeping the atomic losses unnormalized, we therefore essentially put higher weight on the long tail in order to counterbalance this trend and improve generalizability.
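In code, the loss can be written as follows; this is a sketch with illustrative shapes and container types, not our exact training loop.

```python
# A sketch of the loss with illustrative shapes (not our exact training loop):
# cross-entropy is summed over the atomic decisions within each category and
# normalized only by the number of words, so complex categories weigh more.
import torch.nn.functional as F

def supertag_loss(logits_per_word, targets_per_word):
    """logits_per_word[i]: (num_atomic_decisions_i, num_symbols) scores;
    targets_per_word[i]: (num_atomic_decisions_i,) gold symbol ids."""
    total = sum(F.cross_entropy(logits, gold, reduction='sum')   # unnormalized per category
                for logits, gold in zip(logits_per_word, targets_per_word))
    return total / len(logits_per_word)                          # normalize over words only
```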
4 Experimental Setup
Per our quest to supertag the long tail, we compare our TreeRNN and AddrMLP models to the following baselines:
- 1)
Thresholded classification (MLP_10): We compute the output probabilities directly from the encoder’s hidden state. (Because there is always exactly one output position for each input word, no additional encoding function is needed.) Only categories that are seen 10 times or more in training are considered. Supertags that fall below the threshold are replaced with an ⟨UNKNOWN⟩ symbol in training.
- 2)
Non-thresholded classification (MLP_1): Like MLP_10, except that all tags seen in training may be predicted no matter their frequency.
- 3)
Per-sentence sequential (K+19): Kogkalidis et al. (2019) construct type-logical supertags by generating for each sentence a single sequence of atomic types and functors (Figure 3c). Trees are unwrapped in prefix notation and complete tags are separated from one another by a special token. We adapt K+19’s implementation of the sequence-to-sequence Transformer model (Vaswani et al., 2017), accommodating its decoding procedure and memory requirements by training with a batch size of 32 for up to 256 epochs. We achieve the best performance using a cosine-annealed learning rate schedule that is warmed up over 10% of the total training steps and with a warm restart after 128 epochs (Loshchilov and Hutter, 2017).
- 4)
Per-tag sequential (RNN): Instead of generating a single sequence for each sentence, Bhargava and Penn (2020) generate each word’s supertag separately with an RNN. We implement a simplified version of Bhargava and Penn’s model, omitting their prediction history connections between supertags, and using GRUs for decoding. We train this model for up to 50 epochs (batch sizes and learning rates are as with the tree-structured and nonconstructive models).
If not indicated otherwise, we train the models with a batch size of 8 for a maximum of 10 epochs, and use early stopping based on the best development set performance.11 All reported results are averaged over 3 random restarts.
4.1 Model Details and Hyperparameters
In Table 1 we report the model and training hyperparameters we use to facilitate replication of our results. We performed manual grid-search based on the development data to find workable learning rates. We chose a hidden dimensionality of 768 to match RoBERTa’s. We kept the default values for the AdamW hyperparameters. We follow Kogkalidis et al. (2019) in setting up the sequential Transformer model with 8 decoder heads and 2 decoder layers, but swap out the from-scratch encoder with RoBERTa-base.
Table 1: Model and training hyperparameters.

| Hyperparameter | Value | Hyperparameter | Value |
|---|---|---|---|
| Hidden dim d | 768 | Weight decay | .01 |
| Activation | gelu | LRs | 1e-4, 1e-5 (ft) |
| Dropout | .2 | Seeds | 14112, 36125, 92225 |
| AdamW β’s | .9, .999 | Max cat depth | 6 |
| AdamW ϵ | 1e-6 | | |
4.2 Datasets
We use two versions of the English CCGbank as in-domain (financial news) training and test sets: the original (Hockenmaier and Steedman, 2007) and Honnibal et al.’s (2010) “rebanked”, i.e., corrected and enriched, version (training set statistics are reported in Table 2; the results tables show test set counts).
Table 2: Category and token statistics of the CCGbank and Rebank training sets.

| | CCGbank | Rebank |
|---|---|---|
| cat types | 1,285 | 1,574 |
| ≥ 100 | 172 | 199 |
| 10–99 (medium rare) | 253 | 312 |
| < 10 (very rare) | 860 | 1,063 |
| atomic | 34 | 37 |
| sentences | 39,604 | 39,604 |
| tokens | 929,552 | 943,204 |
| medium rare cat | 7,549 | 9,640 |
| very rare cat | 2,055 | 2,527 |
The original CCGbank and Rebank differ in a number of conventions for atomic categories and category construction (Honnibal et al., 2010). Rebank has a larger and more diverse category space, due in large part to a more principled treatment of NP argument structure. Hence, we conduct our main experiments with Rebank and use the original CCGbank for comparisons with prior work.
A limitation of standard test sets for studying the long tail is that category types appearing rarely in training are even less frequent in evaluation (the Rebank test set contains just 107 tokens of categories seen 1–9 times in training, and only 27 tokens of OOV categories). Scores computed over these small samples may thus not reliably estimate the models’ generalization capacity. We counteract this in two ways: 1) by explicitly redistributing the training and test splits; and 2) by evaluating on out-of-domain data, with the assumption that a shift in domains means a shift in category distribution.
In the first case, we train the models on sentences containing exclusively the higher-frequency (≥10) categories, and evaluate them only on sentences with at least one rare category. We split the usual Rebank training set (WSJ sections 02–21) in this way—the distribution follows Figure 4.12 In comparison with the default data splits (Figure 1b), we see that this sampling method captures precisely the long tail of categories, while leaving the rest of the category distribution largely unchanged.
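The redistribution can be carried out as in this short sketch, our own illustration of the procedure described above; the data structures are assumed.

```python
# A sketch of the redistribution described above (data structures assumed):
# sentences whose categories are all frequent form the new training set;
# sentences with at least one rare category are held out for evaluation.
from collections import Counter

def redistribute(sentences, threshold=10):
    """sentences: list of lists of (word, supertag) pairs."""
    freq = Counter(tag for sent in sentences for _, tag in sent)
    train, evaluation = [], []
    for sent in sentences:
        if all(freq[tag] >= threshold for _, tag in sent):
            train.append(sent)          # only high-frequency categories
        else:
            evaluation.append(sent)     # contains at least one rare category
    return train, evaluation
```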
For out-of-domain evaluation we use Honnibal et al.’s (2009) English Wikipedia gold standard and the English gold section of the Parallel Meaning Bank, v3.0, which comprises multiple text types, including literary and biblical texts (PMB; Abzianidze et al., 2017). The Wikipedia dataset follows CCGbank in terms of category conventions, while PMB is more similar to Rebank; we evaluate models trained on one style only on in- and out-of-domain test sets matching that style. That said, PMB contains an unusually large number of unseen categories following idiosyncratic conventions that even Rebank-trained models are unlikely to pick up on without additional training data.
5 Results
We report our main results on Rebank in Table 3. In terms of overall accuracy, the tree-structured constructive supertaggers (best: 94.70%) outperform the sequential ones (90.68%, 93.92%) and are roughly on par with the nonconstructive classifiers (best: 94.83%). Performance is generally very similar across all systems, except K+19. We conjecture that the main disparities between K+19 and the other models lie in the increased “cognitive load” of having to learn the correct structure of categories, as well as the missing hard alignment between words and supertags at test time.
Table 3: Accuracy (%) on the Rebank test set, overall, by category frequency in training, and by category depth (n = tokens, N = category types per bin).

| Model | All | freq ≥ 100 | freq 10–99 | freq 1–9 | OOV | depth 0 | depth 1–2 | depth 3–6 |
|---|---|---|---|---|---|---|---|---|
| n (tokens) | 56,395 | 55,698 | 563 | 107 | 27 | 19,671 | 33,409 | 3,315 |
| N (cat types) | 538 | 199 | 222 | 91 | 26 | 18 | 253 | 267 |
| Nonconstructive Classification | | | | | | | | |
| MLP_10 | 94.77 ± .07 | 95.26 ± .07 | 68.32 ± 1.42 | – | – | 97.80 ± .14 | 94.25 ± .08 | 82.01 ± .18 |
| MLP_10@ | 94.76 ± .17 | 95.25 ± .18 | 68.98 ± 0.89 | – | – | 97.73 ± .16 | 94.29 ± .23 | 81.88 ± .29 |
| MLP_1 | 94.83 ± .09 | 95.27 ± .10 | 68.68 ± 1.09 | 23.99 ± 1.08 | – | 97.71 ± .16 | 94.37 ± .14 | 82.39 ± .11 |
| MLP_1@ | 94.75 ± .18 | 95.18 ± .17 | 70.16 ± 0.81 | 27.10 ± 1.62 | – | 97.84 ± .18 | 94.26 ± .52 | 82.33 ± .51 |
| Constructive: Sequential | | | | | | | | |
| K+19 | 90.68 ± .15 | 91.10 ± .16 | 63.65 ± 0.21 | 34.58 ± 1.62 | 7.41 ± 0.00 | 91.71 ± .29 | 91.28 ± .02 | 78.43 ± .77 |
| RNN | 93.92 ± .01 | 94.39 ± .02 | 65.48 ± 0.62 | 19.00 ± 2.35 | 0.00 ± 0.00 | 95.25 ± .09 | 94.33 ± .04 | 81.77 ± .78 |
| RNN@ | 94.48 ± .08 | 94.93 ± .04 | 66.90 ± 2.32 | 27.41 ± 5.31 | 1.23 ± 2.14 | 97.72 ± .11 | 93.88 ± .09 | 81.33 ± .16 |
| Constructive: Tree-structured | | | | | | | | |
| TreeRNN | 94.62 ± .12 | 95.10 ± .11 | 64.24 ± 2.60 | 25.55 ± 0.54 | 2.47 ± 2.14 | 97.70 ± .21 | 94.14 ± .08 | 81.14 ± .90 |
| TreeRNN@ | 94.44 ± .22 | 94.95 ± .20 | 62.17 ± 3.03 | 22.43 ± 1.87 | 0.00 ± 0.00 | 97.61 ± .05 | 93.95 ± .33 | 80.61 ± .63 |
| AddrMLP | 94.58 ± .16 | 95.01 ± .16 | 67.44 ± 1.45 | 34.89 ± 2.35 | 3.70 ± 0.00 | 97.73 ± .13 | 94.02 ± .17 | 81.47 ± .24 |
| AddrMLP@ | 94.70 ± .05 | 95.11 ± .06 | 68.86 ± 0.57 | 36.76 ± 2.86 | 4.94 ± 2.14 | 97.85 ± .16 | 94.11 ± .03 | 81.92 ± .26 |
Regarding the long tail, we ask: Can constructive models accurately predict rare and complex categories without sacrificing performance on the head of the distribution? To answer this question, we break down performance by the frequency of category types in the training data. The baseline is the thresholded classifier MLP_10, which performs well on frequent categories but cannot access rare categories occurring fewer than 10 times in training. The simplest remedy is to remove the threshold, and indeed we find that MLP_1 is able to predict about a quarter of long-tail categories correctly. Can we do better? The sequence-to-sequence model by K+19 does considerably better on the tail and even retrieves some unseen categories, but at the cost of frequent ones. The per-tag recurrent and tree-recursive generators (RNN and TreeRNN) come close to the nonconstructive classifiers, but do not convincingly improve over them. The AddrMLP model, finally, outperforms all others on the rare tail while matching nonconstructive taggers on frequent and simple ones.
For comparison with existing work (Table 5), we also report results on the original CCGbank (Table 4). Our best constructive and nonconstructive models are on par with the previously reported state of the art in terms of overall accuracy. Tian et al. (2020) only report performance on categories seen at least 10 times in training, i.e., the union of our “≥ 100” and “10–99” bins; our top-3 results on this subset are MLP_1: 96.37%, MLP_10@: 96.27%, AddrMLP@: 96.22%. The rise in absolute scores from Table 3 to Table 4 is consistent with Honnibal et al.’s (2010) finding that Rebank is more difficult to supertag and parse than CCGbank due to its sparser category space. We therefore encourage future researchers to conduct experiments on Rebank and report detailed results for frequency- and complexity-binned subsets of the output space to facilitate more in-depth comparisons.
Table 4: Accuracy (%) by category frequency in training and parsing results on the original CCGbank test set.

| Model | All | freq ≥ 100 | freq 10–99 | freq 1–9 | OOV | LF | Parseability |
|---|---|---|---|---|---|---|---|
| n (tokens) | 55,371 | 54,825 | 442 | 82 | 22 | | 2,407 |
| N (cat types) | 435 | 171 | 176 | 67 | 21 | | |
| Nonconstructive | | | | | | | |
| MLP_10@ | 96.09 ± .07 | 96.50 ± .08 | 67.27 ± 1.02 | – | – | 90.78 ± .09 | 86.95 ± 0.75 |
| MLP_1 | 96.22 ± .06 | 96.58 ± .07 | 70.29 ± 2.35 | 23.17 ± 3.23 | – | 90.91 ± .09 | 88.26 ± 0.39 |
| Constructive | | | | | | | |
| K+19 | 92.12 ± .21 | 92.46 ± .20 | 65.38 ± 0.99 | 34.55 ± 4.28 | 1.52 ± 2.62 | 87.66 ± .19 | 91.14 ± 0.13 |
| RNN@ | 95.10 ± .07 | 95.48 ± .07 | 65.76 ± 1.71 | 26.02 ± 0.70 | 0.00 ± 0.00 | 90.63 ± .04 | 89.53 ± 0.18 |
| AddrMLP@ | 96.09 ± .07 | 96.44 ± .08 | 68.10 ± 1.38 | 37.40 ± 1.41 | 3.03 ± 2.62 | 90.79 ± .08 | 86.03 ± 1.72 |
Table 5: Previously reported results on the original CCGbank test set (V+16: Vaswani et al., 2016; C+18: Clark et al., 2018; T+20: Tian et al., 2020; BP20: Bhargava and Penn, 2020).

| Model | All | ≥ 10 | OOV | LF | P/ability |
|---|---|---|---|---|---|
| Nonconstructive | | | | | |
| V+16 | 94.24 | – | – | 88.32 | – |
| C+18 | 96.05 | – | – | – | – |
| T+20 | – | 96.39 | – | 90.68 | – |
| Constructive | | | | | |
| BP20 | 96.00 | – | 5 | 90.9 | 96.2 |
Evaluating Generalizability.
One of the inherent problems of the supertagging task is the sparsity of the output space. This is, however, not sufficiently captured by standard evaluation sets, as illustrated in Figure 1b. To test how well the models really generalize to the long tail, we evaluate them on alternatively sampled training and evaluation splits of the WSJ data (Table 6) as well as in domains diverging from the WSJ training set (Table 7). These experiments largely confirm our findings from the standard Rebank evaluation set, while the change in category distribution has several important effects on our ability to evaluate model generalization: First, OOV performance is much higher on the redistributed data (Table 6) than on the standard test splits in Tables 3, 4, and 7, highlighting all of the constructive models’ generalization capability, and in turn suggesting that the OOV categories in WSJ section 23 and PMB are truly difficult, noisy, or otherwise inconsistent with the training data. Second, the proportion of evaluation tokens of categories seen fewer than 10 times in training is 1.6% in PMB and 3.8% in our redistributed Rebank evaluation data, compared to only ≈0.2% in the standard CCGbank and Rebank test sets. This 7x–16x increase in relative size renders the tail much more consequential for overall performance. And indeed we observe slightly smaller gaps in overall accuracy between the best-performing nonconstructive and the best-performing constructive systems in Table 7 (0.08 on Wiki, 0.11 on PMB) compared to 0.13 in Tables 3 and 4, while in Table 6 AddrMLP even clearly outperforms the nonconstructive models. Third, performance on rare and unseen categories can now be measured much more reliably due to the larger absolute counts of rare and unseen categories. We provide in-depth analyses of this subset of tags in § 6.2.
Table 6: Accuracy (%) on the redistributed Rebank evaluation split, by category frequency in the redistributed training set.

| Model | All | freq ≥ 100 | freq 10–99 | freq 1–9 | OOV |
|---|---|---|---|---|---|
| n (tokens) | 53,765 | 50,754 | 989 | 292 | 1,730 |
| N (cat types) | 1,351 | 188 | 240 | 118 | 805 |
| Nonconstructive | | | | | |
| MLP_10 | 88.76 | 92.86 | 55.71 | 13.24 | – |
| MLP_1 | 88.79 | 92.87 | 55.61 | 19.29 | – |
| Sequential | | | | | |
| K+19 | 80.20 | 83.49 | 47.72 | 25.11 | 11.62 |
| RNN | 88.73 | 92.64 | 52.92 | 23.52 | 5.38 |
| Tree-structured | | | | | |
| TreeRNN | 88.78 | 92.54 | 49.90 | 20.55 | 9.62 |
| AddrMLP | 89.01 | 92.70 | 54.03 | 26.48 | 10.96 |
Table 7: Out-of-domain accuracy (%) on the Wikipedia and PMB evaluation sets; PMB results are broken down by category frequency in training.

| Model | Wiki Acc | PMB All | PMB ≥ 100 | PMB 10–99 | PMB 1–9 | PMB OOV |
|---|---|---|---|---|---|---|
| n (tokens) | 4,151 | 53,739 | 52,010 | 870 | 191 | 668 |
| N (cat types) | 138 | 243 | 129 | 47 | 14 | 53 |
| Nonconstructive | | | | | | |
| MLP_10 | 92.54 | 90.11 | 92.10 | 57.05 | – | – |
| MLP_1 | 92.31 | 90.27 | 92.10 | 63.41 | 29.14 | – |
| Constructive | | | | | | |
| K+19 | 87.29 | 84.39 | 86.13 | 55.86 | 32.64 | 0.20 |
| RNN | 92.00 | 89.52 | 91.38 | 61.42 | 24.26 | 0.25 |
| AddrMLP | 92.46 | 90.16 | 92.02 | 59.00 | 36.30 | 1.55 |
In both in-domain and out-of-domain data, the performance gap between the nonconstructive MLPs and AddrMLP on the most frequent categories is minimal and in fact lies within the standard deviation. Given the trend we observe from Tables 3 and 4 to Tables 6 and 7, the ability to generalize to the long tail may well outweigh any minor improvement on the most frequent categories when applied to even more diverse data, within other languages, and across languages.
6 Detailed Analysis
6.1 Constructing Complex Categories
Whereas nonconstructive taggers do not distinguish between categories of varying complexity (each supertag prediction is a single k-way decision), constructive taggers are always required to make multiple atomic decisions whenever assigning a complex category, all of which need to be correct in order for the full category to be counted as correct. This raises the question: How difficult are categories of varying complexity for each of the systems?
As Figure 1b shows, deeper, i.e., more complex, categories tend to be rarer and are thus generally more difficult than simpler ones, for all models. Surprisingly, however, we can see in the three rightmost columns of Table 3 that it is not dramatically more difficult for constructive systems to generate complex categories of depth ≥ 1 than it is for nonconstructive systems to simply assign them (apart from K+19, which underperforms on frequent categories regardless of their complexity).
In Figure 5 we take a closer look at the models’ ability to predict categories of the appropriate depth. For the sake of brevity, we only consider three extreme cases: MLP_10, K+19, and AddrMLP. Compared with MLP_10, which tends to choose one of the very frequent but relatively shallow categories of depth 1 or 2, AddrMLP prefers both standalone atomic (depth-0) categories and those of depth 3 and 4 (column totals in the top left matrix). On the head, AddrMLP confuses depth-1 for depth-0 categories and overpredicts the depth of depth-2 and depth-3 categories more frequently than the baseline. On the subset of rare categories (which are deeper than more frequent categories on average), AddrMLP is consistently better at predicting categories of the correct depth (diagonal in the bottom left matrix); the thresholded model consistently chooses categories that are too shallow here. The sequential tagger by K+19 struggles with predicting the correct depth for frequent categories much more than the tree-structured model (top right matrix), which is almost certainly a result of its lack of an inductive bias for the tree structure of categories. On the rare tail, however, its ability to guess the right depth is almost as good as that of AddrMLP (bottom right).
6.2 Generation Behavior and Unseen Tags
Are there any distinct patterns in the output of the different models? By manually searching the corpus, we find that even in the cases where a tagger assigns a category with an incorrect structure, there are systematic confusions such as between argument and adjunct PPs and between fixed particle verbs and (aspectual) adjunct particles. This is difficult to measure at a large scale, but we present two examples in Tables 8 and 9. The thresholded tagger has the option to output an ⟨UNKNOWN⟩ label when it believes the correct category is not in the tagset. It makes use of this option for 0.25% of tokens on average (0.11% with standard train/test splits); when it does, the correct category is indeed missing from the tagset about 2/3 of the time. This happens, e.g., with Wh-words in elliptical questions, as in Table 10.
Table 8: Gold and predicted categories for “garnered from 1984 to 1986”.

| | garnered | from | 1984 to 1986 |
|---|---|---|---|
| Gold | (S[pss]∖ NP) | (ADV/ ADV)/ NP | |
| MLP_10 | ✓ | ✓ | |
| MLP_1 | ✓ | ✓ | |
| K+19 | ✓ | ✓ | |
| RNN | ✓ | ✓ | |
| AddrMLP | (S[pss]∖ NP)/ PP | (PP/ ADV)/ NP | |
Table 9: Gold and predicted categories for “orders began piling up”.

| | orders began | piling | up |
|---|---|---|---|
| Gold | | (S[ng]∖ NP)/ PR | PR |
| MLP_10 | | S[ng]∖ NP | ADV |
| MLP_1 | | S[ng]∖ NP | ADV |
| K+19 | | S[ng]∖ NP | ADV |
| RNN | | (S[ng]∖ NP)/ PP | S[adj]∖ NP |
| AddrMLP | | ✓ | ✓ |
Table 10: Gold and predicted categories for the elliptical question “Why constructive?”.

| | Why | constructive | ? |
|---|---|---|---|
| Gold | S[wq]/(S[adj]∖ NP) | S[adj]∖ NP | |
| MLP_10 | ⟨UNKNOWN⟩ | ✓ | |
| MLP_1 | (S/ S)/(S[adj]∖ NP) | ✓ | |
| K+19 | ✓ | ✓ | |
| RNN | ✓ | ✓ | |
| AddrMLP | ✓ | ✓ | |
In Table 11 we quantify the structural and labeling errors more generally, based on the redistributed evaluation set to ensure reliable estimates on rare phenomena. A substantial portion of erroneous categories actually do have the correct structure (✓ struct).15 For these cases, we perform a detailed error analysis, whose results we present in Figure 6. In fact, if the structure is correct, the predicted category is often only off by the direction of a single slash or the attribute of a single atomic category. K+19 additionally struggles with atomic decisions beyond just differences in attributes.
Table 11: Correct and incorrect predictions on the redistributed evaluation set (top: all tokens; bottom: invented categories only).

| | Model | Correct | Incorrect: ✓struct | Incorrect: ✓formed | Incorrect: ✗formed |
|---|---|---|---|---|---|
| All | MLP_10@ | 47,542 | 1,345 | 4,746 | – |
| | MLP_1@ | 47,552 | 1,401 | 4,811 | – |
| | K+19 | 43,120 | 2,706 | 7,812 | 127 |
| | RNN@ | 47,704 | 1,395 | 4,661 | 5 |
| | TreeRNN@ | 47,733 | 1,373 | 4,659 | 1 |
| | AddrMLP@ | 47,851 | 1,352 | 4,562 | 1 |
| Invented | K+19 | 201 | 96 | 160 | 127 |
| | RNN@ | 93 | 26 | 71 | 5 |
| | TreeRNN@ | 162 | 83 | 213 | 1 |
| | AddrMLP@ | 190 | 89 | 240 | 1 |
To what extent can the constructive models generate categories that were unseen during training? We take a closer look at categories the constructive taggers invented in the bottom halves of Table 11 and Figure 6. K+19 is the most willing to invent categories, closely followed by the tree-structured models and finally RNN, which is rather conservative in this respect (see sums of the last four rows in Table 11). Merely generating more new categories irrespective of their correctness is of course not necessarily an advantage, but it is encouraging to see the models make use of their freedom to do so at an adequate rate, rather than only reproducing known categories or vastly overgenerating invented ones. Interestingly, when an incorrect invented category has the same structure as the gold category, we again see that the majority of errors are due to only a single attribute or slash, suggesting that in these cases the models get the general idea of the category right and only err in fine-grained and context-sensitive subcategorization. In the case of a slash mistake, they are notably also able to recover from it in later predictions.
While the tree-structured taggers are guaranteed to produce valid categories,16 it is possible for the sequential taggers to generate structurally invalid categories, i.e., sequences of atomic categories and slashes that are not licensed by the grammar in Figure 2. With the tag-wise RNN generator, which generally refrains from inventing new categories, this only happens extremely rarely, but in the case of K+19, every 14th sentence is affected by an ill-formed supertag on average (every 66th sentence in the standard Rebank test set). A common source of errors is that too many slashes are predicted, whose argument and result slots can then not be filled by the predicted atomic categories. We show an example in Table 12.
Table 12: An ill-formed category sequence predicted by K+19 for “bring”.

| | bring |
|---|---|
| Predicted sequence | //∖S[b]NP∖NP |
| Predicted supertag | ((S[b]∖ NP)/(NP∖ -))/- |
| Gold sequence | /∖S[b]NPNP |
| Gold supertag | (S[b]∖ NP)/ NP |
Conversely, are there any categories that are not generated despite being seen in training? There are 80 category types in the standard Rebank test set that none of the tree-structured taggers ever predict correctly, although they are attested in the training data, and there are 93 types that are never retrieved by K+19, 73 of which overlap. Out of these 73, none occurs more than three times in the test set and almost all appear fewer than 50 times in training, with three exceptions: (NP∖ NP)∖ (NP∖ NP) (68 times in training), ((N∖ N)∖ (N∖ N))/ NP (50 times), and (NP∖ NP)/ N (50 times). The first is usually used for the last part of complex numerical expressions (such as dates and ranges), but the one token bearing this category in the test set is “not” in “they might not miss one at all”, which is likely an annotation error.17 The second encodes prepositions modifying an appositive bare noun, typically an appellation or postposed proper noun. The third is for determiners of appositions or parentheticals. 67 of the 73 types that are problematic for the constructive models are never accurately predicted by the nonconstructive models either.
6.3 Parts of Speech and Sentence Parsing
Parsing performance is computed using labeled F1-score (LF) over CCG dependencies in all sentences, following Clark and Curran (2007), and Parseability, i.e., the proportion of sentences for which a complete CCG derivation can be constructed.18 Nearly all the models we compare outperform the state of the art in labeled dependency F1-score (right-most columns in Table 4). Interestingly, the K+19 model produces more parseable supertag sequences than the others, despite consistently lagging behind in terms of category accuracy. Apparently this tagger prioritizes self-consistency over producing the correct categories, either due to its multihead attention mechanism, the fact that decisions towards the end of the sequence have access to all previously predicted categories in their entirety (rather than just parts of them), or both.
Long-range Dependencies.
We examine supertagging performance by POS class (a few are shown in Table 13) and find that constructive and nonconstructive taggers perform similarly across classes, with one notable exception: Wh-words, whose supertags are rarely seen in training and have a high type/token ratio at test time. Their special syntactic status raises the question: How important are constructivity, tree structure, and long-tail recall for recovering categories involved in long-range dependencies?
Table 13: Accuracy (%) by part-of-speech class (n = tokens, N = category types).

| Model | Nouns | Verbs | Wh | Other |
|---|---|---|---|---|
| n (tokens) | 16,946 | 7,915 | 542 | 29,968 |
| N (cat types) | 83 | 296 | 54 | 436 |
| f | 1,158 | 129 | 38 | 358 |
| MLP_10@ | 98.58 | 93.18 | 92.25 | 95.51 |
| MLP_1 | 98.62 | 93.49 | 92.68 | 95.65 |
| K+19 | 95.58 | 90.54 | 90.04 | 90.62 |
| RNN@ | 98.60 | 93.17 | 91.88 | 93.68 |
| AddrMLP@ | 98.56 | 93.62 | 93.11 | 95.43 |
Somewhat surprisingly, we find that the RNN is best for these dependencies (Figure 7), which might be related to the two parsing metrics in Table 4: RNN@ strikes a good balance between LF and Parseability. We further examine the average dependency length per category, and contrary to our expectation, dependencies involving Wh-categories are relatively short (usually 3–4 intervening words). We find that the supertags with the longest dependencies on average largely function as subordinators, sentence adverbials, and inverted speech verbs such as (S[dcl]∖ S[dcl])∖ NP. These supertags have in common that they all contain sentential result/argument pairs of the form S[x]—S[x] (where x is an optional attribute). The autoregressive nature of the RNN may be conducive to modeling the matching atomic categories of argument and result. Exploring various decoding orders for both sequential and tree-structured constructive taggers in order to more explicitly take advantage of these intra-category relations is an interesting avenue for future work. We also expect a major boost in Parseability from incorporating inter-category prediction history into our models (Bhargava and Penn, 2020). But this is nontrivial for tree-structured decoding and goes beyond our scope here.
6.4 Runtime and Model Size
While the constructive taggers need to make more individual decisions for each supertag than nonconstructive ones, they only have to consider a much smaller and denser output space. This trade-off between time and space complexity should be considered in addition to tagging accuracy when evaluating each model. Thus we ask: How do the constructive supertaggers compare to nonconstructive ones in terms of efficiency? In Table 14 we report model sizes (i.e., the number of learned parameters), training time until development performance plateaus, and inference speed. As model size and runtime vary greatly between different constructive taggers, the answer to our question depends on how supertags are modeled and inferred.
Table 14: Model size, training time, and inference speed.

| Model | Params (millions) | Train time (hours) | Infer speed (sents/s) |
|---|---|---|---|
| Nonconstructive Classification | | | |
| MLP_10 | 2.0 | 9 | 191 |
| MLP_1 | 2.4 | 11 | 195 |
| Constructive: Sequential | | | |
| K+19 | 11.8 | 120 | 0.3 |
| RNN | 4.8 | 68 | 135 |
| Constructive: Tree-structured | | | |
| TreeRNN | 8.3 | 10 | 125 |
| AddrMLP | 1.3 | 10 | 126 |
The K+19 sequential Transformer model has low efficiency for two reasons: The Transformer architecture itself has a large number of parameters; and sequential inference is slow because individual predictions for the same sentence cannot be parallelized and the number of inference steps per input sentence is linear in the sum of all category sizes (the number of atomic pieces) for that sentence. The GRUs in the RNN and TreeRNN models are much smaller than the Transformer of K+19, but the TreeRNN with its two GRUs for argument and result transitions ends up having almost as many parameters as the Transformer in total. The nonconstructive models map hidden representations into a much larger and sparser output space than the constructive models (and the output space of MLP_1, in turn, is larger and sparser than that of MLP_10). AddrMLP, on the other hand, consists exclusively of feed-forward layers, resulting in the smallest model size among the ones we compare.
The sequential models require relatively many training epochs to converge. The reason total training time is still comparable between K+19 and RNN despite the extreme disparity in inference speed is that the Transformer is trained non-autoregressively and thus performs inference only between epochs, for evaluation on the development set, whereas RNN training inherently relies on inference. The nonconstructive and tree-structured models converge within the first 10 epochs.
For the per-tag constructive models RNN, TreeRNN, and AddrMLP we parallelize inference across all supertags in a batch, and for the tree-structured ones, we further parallelize the prediction of the children of slash functors, making their inference time logarithmic in the size of the largest predicted category in a sentence.
AddrMLP is both time- and space-efficient overall. Its parameter count is only ≈1/10 of the K+19 model and ≈1/2 of the nonconstructive ones.
7 Discussion and Related Work
Researchers have long grappled with the large search space of CCG supertags. Baldridge (2008) and Ravi et al. (2010) were particularly concerned with high lexical ambiguity and counteracted it, respectively, by improving lexicon initialization using linguistic principles and by explicitly minimizing model sizes. Deoskar et al. (2013), working with lexico-syntactic dependencies similar to supertags, addressed difficulties arising from the long tail of rare and unseen words; and Deoskar et al. (2014) addressed a similar issue specifically for generalizing a CCG parser. The problem of out-of-vocabulary words has become much less severe with the advent of deep contextualized sentence encoders operating on subword units.
An alternative way of reducing the burden on the supertagger is to couple it with the parser and jointly optimize lexical and phrasal categories, subject to the combinatory rules of CCG (Auli and Lopez, 2011; Garrette et al., 2015). Garrette et al. (2015) notably included a fully constructive probabilistic model of categories in a weakly-supervised grammar-induction scenario. In the context of grammar induction for semantic parsing specifically, Kwiatkowski et al. (2011) and Artzi et al. (2015) have explored template-based methods to generalize a limited initial lexicon to likely alternative syntactic usages of observed words.
In the special case that all categories in a sentence but one are known, the combinatory rules of CCG can be reverse-engineered to infer the missing category. As an efficient and scalable example of this, Thomforde and Steedman (2011) have proposed Chart Inference.
Since the beginning of the neural era, virtually all advances in CCG supertagging have involved different means of deep sequence encoding, typically in the form of (Bi)LSTMs, with techniques including: predicting categories directly from the word-level encoder (Xu et al., 2015; Lewis et al., 2016); giving credit to likely category sequences (Vaswani et al., 2016; Kadari et al., 2018); forcing the model to distribute its attention over a fixed-size window of neighboring words (Wu et al., 2017); training the encoder specifically to be aware of each word’s neighboring categories (‘cross-view training’; Clark et al., 2018); and latently modeling parse chunks with a graph-convolutional network over word n-grams (Tian et al., 2020).
Similar techniques have been applied to supertagging in the related formalism Tree-Adjoining Grammar (TAG) (Kasai et al., 2017, 2018). Zhu and Sarkar (2019) have formulated TAG supertagging as multitask learning with respect to certain aspects of the elementary trees’ internal structure. Their system predicts the category that optimizes the weighted sum of the scores for each subtask.
A possible objection to generating categories entirely productively is that universal linguistic patterns constrain the shape of categories and the syntactic relations they may engage in (Chomsky and Lasnik, 1993; Baldridge and Kruijff, 2003), and for any given language, word order and other language-specific properties further restrict the underlying grammar. Note that for a FxnCat shape with given argument and result types, the direction of its Slash functors is largely determined by global word order properties of the respective language. Consider the prototypical category shape for adpositions, (NP—NP)—NP, where ‘—’ stands for either forward or backward direction. In English, a predominantly prepositional (as opposed to postpositional) language with postnominal-PP modifiers, this shape is most commonly instantiated as (NP∖ NP)/ NP, but different ordering patterns may dominate in other languages. Languages with more flexible word orders will show a greater variance in slash directionality than those with fixed word orders. While our approach is in principle equipped to pick up on such patterns from data, we do not explicitly prohibit unlikely category types. One potential way of incorporating such information is via logical constraints at training and/or inference time in the style of Li and Srikumar (2019); Li et al. (2019). Another approach could be a hybrid one, bridging between constructive and nonconstructive tagging in a more fluid way. We plan to explore these avenues in future work.
8 Conclusion
We introduced a novel, explicitly tree-structured CCG supertagging method, advancing the nascent paradigm of constructive supertagging. Our analysis of complex and long-tail categories highlights the positive impact of different modeling and inference choices within this paradigm: structural inductive bias as well as adequate contextualization via, e.g., attention contribute to more efficient, robust, and self-consistent models. We hope that our proposed method can be instrumental in researching and applying not only CCG and related syntactic formalisms, but also other paradigms like morphological (de)composition of complex words in morphologically rich languages, or compositional semantic parsing.
Acknowledgments
We would like to thank Aditya Bhargava and Konstantinos Kogkalidis for assistance with replicating their experiments and extended discussions of constructive models; Mark Steedman, Julia Hockenmaier, and Noah Smith for their deep insight into CCG; Kilian Evang and Lasha Abzianidze for explanations of data formats and conventions; Tao Li, Yichu Zhou, and Sean MacAvaney for help with implementing our models in PyTorch; and members of the NERT lab at Georgetown for feedback on an early abstract. We are indebted to our TACL action editor Reut Tsarfaty and editor-in-chief Ani Nenkova, as well as the anonymous reviewers for their diligent assessment and handling of logistics. This research was supported in part by NSF award IIS-1812778 and a generous gift from Google.
Notes
1. Our models and code are available at https://github.com/jakpra/treeconstructive-supertagging.
2. But see §7 for a discussion of how linguistic patterns limit the set of observed tags.
3. For simplicity, we consider linguistic attributes like dcl (declarative) to be part of the atomic category.
4. In principle, a constructive model could be trained with frequency-thresholded training data, but we do not see any value in pursuing this option, as constructivity in itself already mitigates noise and sparsity.
5. The limits on depth and arity are practical simplifications that follow from our task (supertags are always binary trees) and data distribution (there are no categories with depth > 6 in any of the training or development sets we use). However, our model can be generalized to trees of arbitrary depth, and not as easily, but conceivably, to a different or even variable arity. It turns out that none of the evaluation sets contain categories that are deeper than what is seen in training (except the redistributed test set in Figure 4, which contains one), so this measure has virtually no impact on tagging performance.
6. We also experimented with a BiGRU encoder, but obtained consistently worse results.
7. Only Slash operators can have children (Figure 2).
8. Prepending all addresses with 1 has several representational advantages, the most straightforward of which is that addresses can alternatively be read as binary numbers enumerating category pieces in breadth-first traversal.
9. The featurized encoder is, to a large extent, made possible by fixing the arity and maximum depth of categories. The TreeRNN will likely better admit more general setups, where outputs of unbounded depth and/or variable arity are allowed.
10. We also tried self-attention over previously predicted partial outputs but did not find an increase in performance.
11. Preliminary experiments showed that best dev performance is usually reached within 10 epochs; batches larger than 8 make our (single) GPU run out of memory.
12. The few supertags in the 1–9 range of the new training set are those which occurred slightly above 9 times in the original training set, but some of their tokens were moved due to occurring in the same sentence as a low-frequency tag.
13. Because we did not train any models on PMB itself, we analyze performance on all of PMB-gold, but for future comparisons, we also report accuracy on the suggested evaluation split: K+19: 85.43%; RNN@: 90.24%; AddrMLP@: 90.78%; MLP_1@: 90.88%; MLP_10@: 90.91%.
14. ADV is not an actual atomic category. We use it to abbreviate the VP-adjunct category (S∖ NP)∖ (S∖ NP). PP is a conventionalized atomic category for argument-PPs.
15. E.g., for “piling” in Table 9 the RNN predicts (S[ng]∖ NP)/ PP, which exhibits the correct structure (X∖ X)/ X with an incorrect atomic label (PP instead of PR).
16. That TreeRNN and AddrMLP still produced one malformed category can be considered a bug: They attempted to generate a category deeper than the maximally allowed depth and were unable to complete it. This is avoidable in practice.
17. There are a few more instances of such implausible lexical categories in the training data, like S or ((:∖ NP)/ PP)/ NP.
18. The C&C parser also reports coverage, the proportion of sentences for which at least one dependency relation can be recovered. Coverage is 100% in all our conditions.