Supertagging the Long Tail with Tree-Structured Decoding of Complex Categories

Although current CCG supertaggers achieve high accuracy on the standard WSJ test set, few systems make use of the categories' internal structure that will drive the syntactic derivation during parsing. The tagset is traditionally truncated, discarding the many rare and complex category types in the long tail. However, supertags are themselves trees. Rather than give up on rare tags, we investigate constructive models that account for their internal structure, including novel methods for tree-structured prediction. Our best tagger is capable of recovering a sizeable fraction of the long-tail supertags and even generates CCG categories that have never been seen in training, while approximating the prior state of the art in overall tag accuracy with fewer parameters. We further investigate how well different approaches generalize to out-of-domain evaluation sets.


Introduction
Combinatory Categorial Grammar (CCG; Steedman, 2000) is a strongly-lexicalized grammar formalism in which rich syntactic categories at the lexical level impose tight constraints on the constituents that can be formed. Its syntax-semantics interface has been attractive for downstream tasks such as semantic parsing (Artzi et al., 2015) and machine translation (Nǎdejde et al., 2017).
Most CCG parsers operate as a pipeline whose first task is 'supertagging', i.e., sequence labeling with a large search space of complex 'supertags' (Clark and Curran, 2004; Vaswani et al., 2016, inter alia). The complex categories specify valency information: expected arguments to the right are signaled with forward slashes, and expected arguments to the left with backward slashes. For example, transitive verbs in English (like "saw" in Figure 1a) are tagged (S\NP)/NP to indicate that they expect a subsequent object noun phrase (NP) and a preceding subject NP to form a clause (S). Given the supertags, all that remains for parsing is applying general rules of (binary) combination between adjacent constituents until the entire input is covered. Supertagging thus represents the crux of the overall parsing process. In contrast to the simpler task of part-of-speech tagging, supertaggers are required to resolve most of the syntactic ambiguity in the input.
One key challenge of CCG supertagging is that the tagset is large and open-ended to account for combinatorial possibilities of syntactic constructions. This results in a heavy-tailed distribution of supertags, which is visualized in Figure 1b; a large proportion of unique supertags are rare or unseen (out-of-vocabulary, OOV) even in a training set as large as the Penn Treebank's. Previous CCG supertaggers have surrendered in the face of this challenge: they treat categories as a fixed set of opaque labels, rather than modeling their compositional structure. Following Clark (2002), the standard approach is to consider only supertags appearing at least 10 times in the training data, sacrificing the possibility of predicting two thirds of the supertag types in CCGbank. Rare supertags may have little impact on overall token accuracy, but the cost of this compromise is a fundamental inability to truly generalize to the task.
In this paper, we confront the long-tail problem head-on by proposing a constructive framework in which supertags are built from scratch rather than predicted as opaque labels (Kogkalidis et al., 2019). In contrast to prior constructive supertaggers (Kogkalidis et al., 2019; Bhargava and Penn, 2020), our model builds upon the observation that supertags are themselves tree-structured, and hence can be generated top-down. Our experiments on the English CCGbank and its rebanked version show that constructing supertags as trees improves our ability to predict rare and even unseen tags, without sacrificing performance on the more common ones.
Figure 2: The 'syntax' of CCG categories, using infix notation for complex categories (FxnCat). Our model generates supertags of type Cat top-down from this grammar.
Our contributions are threefold:
1. We introduce a general constructive supertagger that generates each lexical category recursively as a tree. To our knowledge, this is the first tree-structured predictor of its kind.
2. We apply this model to English CCG supertagging. On frequent supertags, it matches the more traditional approach of using a fixed label set, while on the rare and unseen ones, we see substantial improvements in predictive performance.
3. We perform an array of in-depth analyses that highlight the impact of different modeling and inference choices for the task of predicting supertags.

Anatomy of a Supertag
The internal structure of any CCG supertag is a tree licensed by the CFG in Figure 2. Atomic categories like S and NP are related by slashes to form functional categories, which can in turn participate in larger functional categories. By convention, the infix-notation supertag (S\NP)/NP is equivalent to the tree in Figure 3a, with prefix notation (/ (\ S NP) NP), where the slash signals the direction in which the category can combine, the right child of any slash is the argument, and the left child is the result of combining the category with its argument. These hierarchical supertags constrain lexical item combination, e.g., specifying subcategorization of verbs for an object NP to the right (/). This flexibility leads to infinitely many possible supertags; in practice, they follow a power law distribution. CCGbank (comprising the WSJ portion of the Penn Treebank) contains numerous rare supertags, including several that occur only in the test set. Still others can be expected to occur in a much larger English corpus.
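To make the tree view of a category concrete, the following minimal sketch parses an infix category string into a binary tree and prints its prefix notation, depth, and size. The class `Cat` and the helpers `parse`, `prefix`, `depth`, and `size` are illustrative names of our own, not part of any released implementation, and the parser assumes well-bracketed input in CCGbank style.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cat:
    """A CCG category as a binary tree (illustrative class, an assumption)."""
    label: str                      # atomic category, or a forward/backward slash
    result: Optional["Cat"] = None  # left child: what the functor yields
    arg: Optional["Cat"] = None     # right child: what the functor seeks


def parse(s: str) -> Cat:
    """Parse a well-bracketed infix category string into a tree."""
    level = 0
    # Scan right to left for the outermost slash (slashes associate to the left).
    for i in range(len(s) - 1, -1, -1):
        ch = s[i]
        if ch == ")":
            level += 1
        elif ch == "(":
            level -= 1
        elif ch in "/\\" and level == 0:
            return Cat(ch, parse(s[:i]), parse(s[i + 1:]))
    if s.startswith("("):           # no top-level slash: strip outer parentheses
        return parse(s[1:-1])
    return Cat(s)                   # atomic category


def prefix(c: Cat) -> str:
    """Render the tree in prefix notation: (slash result argument)."""
    if c.result is None:
        return c.label
    return f"({c.label} {prefix(c.result)} {prefix(c.arg)})"


def depth(c: Cat) -> int:
    """Atomic categories have depth 0."""
    return 0 if c.result is None else 1 + max(depth(c.result), depth(c.arg))


def size(c: Cat) -> int:
    """Number of atomic decisions (slashes plus atomic categories)."""
    return 1 if c.result is None else 1 + size(c.result) + size(c.arg)


verb = parse(r"(S\NP)/NP")
print(prefix(verb), depth(verb), size(verb))  # (/ (\ S NP) NP) 2 5
```

Under this representation, the transitive-verb category has depth 2 and consists of five labeled nodes, which is the unit of prediction for the constructive models discussed below.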
In previous work, CCG supertaggers have skirted this problem by ignoring the long tail of supertags: specifically, the ones occurring fewer than 10 times in the training set. The consequences of such a threshold can be seen from Figure 1b, which visualizes the distribution of supertag types in terms of depth (representing supertag complexity) and token frequency. The supertags seen in training that would be ignored under a threshold of 10 appear in red, and the test set supertags never seen in training in dark blue. Though these only account for 0.2% of tokens in the test set, they are present in nearly 4% of sentences and represent fully two thirds of supertag types in CCGbank. Further, we see that rarer categories are increasingly more complex, i.e., their argument and result types are in turn composed of FxnCats. Note in particular that the bulk of depth-4 categories and almost all categories with depth 5 or more fall below the 10-count threshold.
Inspired by the recent proposals of Kogkalidis et al. (2019) and Bhargava and Penn (2020), we hypothesize that modeling the structure of supertags, rather than treating them holistically and thresholding by frequency, can successfully generalize to rare and unseen tags. For example, a good model should draw connections between words that are NPs themselves, words that take NPs as arguments (e.g., verbs), and words that yield NPs as their result (e.g., determiners). We examine whether such linguistically-informed generalizations can benefit supertags of various frequency and structures, focusing on the rare and complex ones.

Constructivity in Supertagging
We contrast two general paradigms for supertagging below. (Our experiments will explore multiple specific modeling strategies within each.) Most previous supervised CCG supertaggers assume a closed tagset and nonconstructively assign one complete category per word (Figure 3b). This paradigm is oblivious to the internal structure of the supertag and incapable of predicting unseen supertags. This is often combined with a frequency cutoff: only the k supertags seen at least n times in the training data are considered by the model, making each tag decision a k-way classification task. Traditionally (Clark, 2002), systems use a threshold of n = 10 (yielding k = 425 in CCGbank and k = 511 in CCGrebank). The main motivation for this is to sidestep the most sparse and possibly noisy region of the output space without dramatically decreasing token coverage. Below we experiment with both thresholded and non-thresholded models.
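The frequency cutoff used by nonconstructive taggers can be sketched as follows. This is a toy illustration with made-up counts; the helper names are hypothetical, and `<UNKNOWN>` is the placeholder that our thresholded baseline uses for below-threshold tags during training.

```python
from collections import Counter

UNK = "<UNKNOWN>"  # placeholder for supertags below the threshold


def build_tagset(train_tags, min_count=10):
    """Keep the k supertag types seen at least min_count (n) times."""
    counts = Counter(train_tags)
    return {tag for tag, c in counts.items() if c >= min_count}


def apply_threshold(tags, tagset):
    """Replace every tag outside the closed tagset with the placeholder."""
    return [tag if tag in tagset else UNK for tag in tags]


def token_coverage(tags, tagset):
    """Fraction of tokens whose gold tag survives the cutoff."""
    return sum(tag in tagset for tag in tags) / len(tags)


# Toy training data (invented counts, for illustration only).
train = (["NP"] * 50 + ["N"] * 30
         + [r"(S\NP)/NP"] * 12 + [r"((S\NP)/PP)/NP"] * 3)
tagset = build_tagset(train, min_count=10)
print(len(tagset), round(token_coverage(train, tagset), 3))
```

In the toy data, one of four types falls below the cutoff while token coverage stays high, mirroring on a small scale the trade-off between type coverage and token coverage described above.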
In contrast, a constructive tagger models the internal structure of supertags (Kogkalidis et al., 2019). Supertags are constructed from minimal pieces (which for CCG are slashes and atomic categories). 3 There is no frequency cutoff at training time. 4 At test time, supertags are predicted piece by piece, and there is no constraint that predicted supertags must have been seen before. This can be done sequentially or recursively, taking the categories' internal tree structure into account.
Two different methods of sequential decoding have been explored by Kogkalidis et al. (2019) (hereafter 'K+19') and Bhargava and Penn (2020) ('BP20'). K+19 used a sequence-to-sequence model, with a single target sequence consisting of all serialized supertags for a sentence (Figure 3c). They experimented with a type-logical grammar formalism similar to CCG, and a Dutch corpus. BP20 decoded CCG supertags as a separate sequence per token, and additionally conditioned each new supertag on the prediction history.
3 For simplicity, we consider linguistic attributes like dcl (declarative) to be part of the atomic category.
4 In principle, a constructive model could be trained with frequency-thresholded training data, but we do not see any value in pursuing this option, as constructivity in itself already mitigates noise and sparsity.
Figure 3: Schematic of our tree-structured supertagger (left) in contrast with unstructured (top right) and sequential (bottom right) models. Supertag depth also corresponds to decoding steps. Numbers below nodes denote positions or addresses.
Here we go a step further and introduce methods for directly decoding supertags as trees, freeing the models from having to learn this fundamental property from sequential data. We hypothesize that this will produce better and more compact representations that generalize to the long tail.

Tree-Structured Constructive Supertagging
Given a sequence of words (a sentence), our goal is to predict each word's supertag. Constructing a supertag from its components requires a scoring function for the parts that is cognizant of both surrounding words and categories. Below we describe the decoding procedure (§3.1) and scoring functions (§3.2) we developed for this purpose, which, in line with §2, explicitly incorporate the categories' tree structure.

Predicting Tree-structured Supertags
According to the grammar in Figure 2, each category is a binary tree with the following properties: (1) Slashes are non-terminals with two children: the category's argument (the syntactic type it seeks to combine with), and its result (the type it yields after combining with its argument).
(3) The root of the tree is either the category's sole AtomCat, or its outermost functor, whose argument it seeks to combine with first.
Our output supertags are trees, but there is a crucial difference between our work and constituency parsing of sentences. In the latter case, the yield of a predicted tree is constrained to be the input sentence, thereby restricting both its depth and width. But in the case of supertagging, each word is associated with a binary tree-structured supertag whose breadth and depth are unknown at inference time. We therefore grow supertags for each word from the top down (Figure 3a). At the t-th step, the model greedily chooses the most likely node labels at depth t, conditioned on the word encoding and the ancestors predicted so far (Figure 3a). The first decision (t = 0) is either an atomic category, or the main functor. In the latter case, the model then moves on to select the argument and result types, which may be atomic categories or functors themselves. We are thus guaranteed to always generate well-formed categories. As CCG supertags are not very deep in practice, we impose an upper limit on the depth of predicted trees based on the most complex categories found in the training and development data, with the main advantage that memory allocation during training can be bounded. 5
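The level-by-level growth procedure can be sketched as below. This is a simplified illustration, not our implementation: `predict_label` is a hypothetical stand-in for the neural scoring function, nodes are identified by the binary addresses introduced for AddrMLP (root "1", then "0" for the result and "1" for the argument), and forbidding slashes at the maximum depth is one assumed way of enforcing the depth limit.

```python
SLASHES = ("/", "\\")


def grow_supertag(predict_label, max_depth=6):
    """Grow one category tree top-down; returns {address: label}.

    predict_label(address, tree_so_far, allow_slash) is a hypothetical
    callback standing in for the model's greedy label choice.
    """
    tree = {}
    frontier = ["1"]                      # step t = 0: the root
    for t in range(max_depth + 1):
        # All nodes at depth t are labeled given only their ancestors.
        level = {a: predict_label(a, tree, t < max_depth) for a in frontier}
        if t == max_depth and any(l in SLASHES for l in level.values()):
            raise ValueError("slash predicted at the maximum depth")
        tree.update(level)
        # Every slash opens two new positions: result (0) and argument (1).
        frontier = [a + bit for a, l in level.items()
                    if l in SLASHES for bit in "01"]
        if not frontier:
            break
    return tree


def to_infix(tree, addr="1"):
    """Render an address-to-label map as an infix category string."""
    label = tree[addr]
    if label not in SLASHES:
        return label
    parts = []
    for child in (addr + "0", addr + "1"):
        text = to_infix(tree, child)
        parts.append(f"({text})" if tree[child] in SLASHES else text)
    return parts[0] + label + parts[1]


# A lookup table plays the role of a perfect model for a transitive verb.
gold = {"1": "/", "10": "\\", "11": "NP", "100": "S", "101": "NP"}
tree = grow_supertag(lambda addr, partial, allow_slash: gold[addr])
print(to_infix(tree))
```

Because every slash always receives exactly two children and growth stops only at atomic labels, any tree returned by this procedure is a well-formed category.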

Modeling Supertags
All supertagging models we compare consist of (a) a sequence encoder, which generates a d-dimensional contextualized representation h_{k,0} for each word k in a sentence x (eq. (1), together forming the |x| × d matrix H_0); (b) an output-positional encoder, which generates the hidden representation h_{k,i} for a position indexed by i within the k-th word's category tree; and (c) a fully-connected 2-layer perceptron (MLP) with a final softmax layer which maps such a representation to a probability distribution o_{k,i} over the inventory of possible labels L (atomic categories and slashes; eq. (2)). We use the term position and the index i to refer to any atomic part of a category for which a labeling decision has to be made. This could be, for example, the positions of the S category in Figures 3a and 3c, or the single output in Figure 3b.
The label y_{k,i} is the most probable one per the MLP's prediction.
5 The limits on depth and arity are practical simplifications that follow from our task (supertags are always binary trees) and data distribution (there are no categories with depth > 6 in any of the training or development sets we use). However, our model can be generalized to trees of arbitrary depth, and not as easily, but conceivably, to a different or even variable arity. It turns out that none of the evaluation sets contain categories that are deeper than what is seen in training (except the redistributed test set in Figure 4, which contains one), so this measure has virtually no impact on tagging performance.
Contextualized word embeddings. In all conditions, we encode sentences using the pretrained RoBERTa-base encoder (Liu et al., 2019), finetuning it for our task. Several recent studies have shown that such models can capture syntactic properties and relations (e.g., Jawahar et al., 2019; Clark et al., 2019; Hewitt and Manning, 2019).
Output-positional encoding. We experiment with two alternative ways of deriving hidden states for category-internal positions (k, i), where i > 0: a tree-structured recursive neural network (TreeRNN; Tai et al., 2015, inter alia), and a deterministic addressing function that accesses each node directly (AddrMLP). Both variants, described below, also take into account the current node's ancestors. The TreeRNN (eq. (3)) computes the hidden representation for a child node c(i) from a vector embedding of its parent's label y_{k,i} and the hidden representation h_{k,i}. The encodings are separately computed for child nodes representing the result (c = 'left') and argument (c = 'right') of the parent. Following K+19, we use the transpose of the last layer of the MLP to embed labels. Our experiments use gated recurrent units (GRUs; Cho et al., 2014).
Using tree-structured RNNs for top-down generation is reminiscent of Zhang et al. (2016).
For the AddrMLP, we represent the position i of a node and the Slashes in its ancestors (denoted by Y_{k,anc(i)}) as a single feature vector that augments the contextualized word embedding. We employ a binary addressing scheme to refer to individual nodes: each node in a category's tree representation is addressed by a sequence of bits a_0 a_1 a_2 ... a_T, corresponding to a top-down traversal of the tree. The value a_{t>0} = 0 (or 1) is interpreted as branching to the left (or right) at depth t. The root a_0 has an arbitrary placeholder value (say, 1). 8 In the example in Figure 3, the inner NP argument (the argument of the top-level result) is addressed as 101. We represent the position of a node by a vector of elements in its address, mapping a_{t>0} = 0 to 1 and a_{t>0} = 1 to −1 and ignoring a_0. The slashes in a node's ancestors are similarly mapped to a vector consisting of 1s for forward slashes and −1s for backward slashes. We use 0 to pad feature vectors to a fixed maximum length. We then use a single linear layer to project these features into the encoder's hidden space before adding them to the word's contextualized encoding. 9
Attention. While each word's contextualized encoding contains some information about all other words in the sentence, we hope to increase the model's output consistency using attention (Bahdanau et al., 2015; Kim et al., 2017; Wu et al., 2017) over the encoder's hidden state. We compute attention weights α as in eq. (5) and then add the α-weighted context values to the hidden state, eq. (6), replacing the simpler MLP from eq. (2). 10
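The binary addressing scheme and the resulting AddrMLP input features can be illustrated with a short sketch. The function names are hypothetical, the maximum depth of 6 follows the depth limit stated earlier, and concatenating the two vectors is an assumption about how the single feature vector is laid out; the linear projection into the encoder's hidden space is omitted.

```python
MAX_DEPTH = 6  # fixed maximum length of the padded feature vectors


def position_features(address, max_depth=MAX_DEPTH):
    """Map address bits a_1..a_T to 1 (left) / -1 (right); a_0 is ignored."""
    steps = [1 if bit == "0" else -1 for bit in address[1:]]
    return steps + [0] * (max_depth - len(steps))


def ancestor_features(ancestor_slashes, max_depth=MAX_DEPTH):
    """Map ancestor slashes (root first) to 1 (forward) / -1 (backward)."""
    feats = [1 if s == "/" else -1 for s in ancestor_slashes]
    return feats + [0] * (max_depth - len(feats))


def node_features(address, ancestor_slashes):
    """Single feature vector for one node (concatenation is an assumption)."""
    return position_features(address) + ancestor_features(ancestor_slashes)


def breadth_first_index(address):
    """With the leading 1, an address read as a binary number enumerates
    nodes in breadth-first order."""
    return int(address, 2)


# Inner NP argument of a transitive verb: left at depth 1, right at depth 2;
# its ancestors are the root forward slash and the backward slash below it.
print(position_features("101"))          # [1, -1, 0, 0, 0, 0]
print(ancestor_features(["/", "\\"]))    # [1, -1, 0, 0, 0, 0]
print(breadth_first_index("101"))        # 5
```

Since the features depend only on the address and the already-predicted ancestor slashes, every node can be accessed directly without recurrent computation along the path from the root.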

Learning
We train the model using the AdamW optimizer (Loshchilov and Hutter, 2019) and apply teacher forcing (Williams and Zipser, 1989) to avoid a noisy feedback loop during learning.
Loss function. To achieve our goal of constructing correct and complete categories, we need our models to be correct in each atomic decision, even and especially for more complex categories. We make the loss function sensitive to this by normalizing the cross-entropy between the predictions and the ground-truth only over the number of words in a batch and retaining the unnormalized sum over individual atomic category decisions. This naturally scales with category complexity. If instead we were normalizing over atomic decisions, too, the loss contribution of, e.g., NP when it occurs inside a complex category (S\NP)/NP with size 5, would be 5 times smaller than when it occurs as a complete category on its own. The disadvantage that complex categories already have as they tend to be rarer than simpler ones (Figure 1b) would be reinforced. By keeping the atomic losses unnormalized, we therefore essentially put higher weight on the long tail in order to counterbalance this trend and improve generalizability.
8 Prepending all addresses with 1 has several representational advantages, the most straightforward of which is that addresses can alternatively be read as binary numbers enumerating category pieces in breadth-first traversal.
9 The featurized encoder is, to a large extent, made possible by fixing the arity and maximum depth of categories. The TreeRNN will likely better admit more general setups, where outputs of unbounded depth and/or variable arity are allowed.
10 We also tried self-attention over previously predicted partial outputs but did not find an increase in performance.
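The effect of the loss normalization described in the loss-function paragraph can be checked with a small numerical sketch. The function below is illustrative rather than our training code: it takes, for each word, the probabilities a hypothetical model assigned to the gold label at each atomic decision, and the probabilities in the example are invented.

```python
import math


def batch_loss(gold_probs_per_word, normalize_atomic=False):
    """Cross-entropy summed over atomic decisions, averaged over words.

    gold_probs_per_word: one list per word, holding the probability assigned
    to the gold label at each atomic decision of that word's category.
    normalize_atomic=True shows the alternative that is argued against.
    """
    total = 0.0
    for probs in gold_probs_per_word:
        nll = sum(-math.log(p) for p in probs)
        if normalize_atomic:
            nll /= len(probs)     # would shrink each decision by category size
        total += nll
    return total / len(gold_probs_per_word)


# Two words: a bare NP (1 decision) and a transitive verb (5 decisions),
# with every gold label assigned probability 0.5 (made-up values).
batch = [[0.5], [0.5] * 5]
print(batch_loss(batch))                         # 3 * ln 2
print(batch_loss(batch, normalize_atomic=True))  # ln 2
```

With the unnormalized sum, the complex category contributes five times as much as the atomic one, whereas normalizing per decision would make both words contribute equally and each decision inside the complex category five times less.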

Experimental Setup
Per our quest to supertag the long tail, we compare our TreeRNN and AddrMLP models to the following baselines:
1) Thresholded classification (MLP_10): We compute the output probabilities directly from the encoder's hidden state. (Since there is always exactly one output position for each input word, no additional encoding function is needed.) Only categories that are seen 10 times or more in training are considered. Supertags that fall below the threshold are replaced with an <UNKNOWN> symbol in training.
2) Non-thresholded classification (MLP_1): Like MLP_10, except that all tags seen in training may be predicted no matter their frequency.
3) Per-sentence sequential (K+19): Kogkalidis et al. (2019) construct type-logical supertags by generating for each sentence a single sequence of atomic types and functors (Figure 3c). Trees are unwrapped in prefix notation and complete tags are separated from one another by a special token. We adapt K+19's implementation of the sequence-to-sequence Transformer model (Vaswani et al., 2017), accommodating its decoding procedure and memory requirements by training with a batch size of 32 for up to 256 epochs. We achieve the best performance using a cosine-annealed learning rate schedule that is warmed up over 10% of the total training steps and with a warm restart after 128 epochs (Loshchilov and Hutter, 2017).
4) Per-tag sequential (RNN): Instead of generating a single sequence for each sentence, Bhargava and Penn (2020) [...] supertags, and using GRUs for decoding. We train this model for up to 50 epochs (batch sizes and learning rates are as with the tree-structured and nonconstructive models).
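The per-sentence target sequence used by the K+19 baseline (item 3 above) can be sketched as follows. Categories are given here as nested tuples (slash, result, argument); the separator string `<sep>` and the choice to emit it after every tag are illustrative assumptions rather than details of K+19's implementation.

```python
def prefix_tokens(cat):
    """Unwrap one category tree into prefix-ordered atomic symbols.

    cat is either an atomic label (str) or a (slash, result, argument) tuple.
    """
    if isinstance(cat, str):
        return [cat]
    slash, result, arg = cat
    return [slash] + prefix_tokens(result) + prefix_tokens(arg)


def serialize_sentence(cats, sep="<sep>"):
    """Concatenate all supertags of a sentence into one target sequence."""
    out = []
    for cat in cats:
        out += prefix_tokens(cat) + [sep]   # hypothetical separator token
    return out


# Supertags for a three-word sentence: NP, transitive verb, NP.
sentence = ["NP", ("/", ("\\", "S", "NP"), "NP"), "NP"]
print(serialize_sentence(sentence))
```

In such a flat sequence, both the tree structure of each tag and the alignment between tags and words are only implicit, so a sequence-to-sequence decoder has to learn them from data.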
If not indicated otherwise, we train the models with a batch size of 8 for a maximum of 10 epochs, and use early stopping based on the best development set performance. 11 All reported results are averaged over 3 random restarts.
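As a small illustration of how scores from several random restarts can be aggregated and compared, the sketch below averages per-run accuracies and applies the highlighting criterion stated in the caption of Table 3 (a result r is highlighted if r + stdev(r) > b − stdev(b) for the best result b). The accuracy values are invented, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarize(runs):
    """Mean and sample standard deviation over random restarts."""
    return mean(runs), stdev(runs)


def highlighted(result, best):
    """True if result falls within the standard deviation of the best."""
    r, sd_r = result
    b, sd_b = best
    return r + sd_r > b - sd_b


# Made-up accuracies from 3 restarts of three hypothetical systems.
system_a = summarize([94.6, 94.7, 94.8])
system_b = summarize([94.5, 94.6, 94.7])
system_c = summarize([90.0, 90.5, 91.0])
best = max([system_a, system_b, system_c], key=lambda s: s[0])
print([highlighted(s, best) for s in (system_a, system_b, system_c)])
```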
For downstream parsing evaluation (§6.3), we run the C&C parser (Clark and Curran, 2007) with the pretrained CCGbank model and default hyperparameters, providing as input our supertaggers' 1-best predictions and POS tags automatically obtained using Stanza (Qi et al., 2020).

Model Details and Hyperparameters
In Table 1 we report the model and training hyperparameters we use to facilitate replication of our results. We performed manual grid-search based on the development data to find workable learning rates. We chose a hidden dimensionality of 768 to match RoBERTa's. We kept the default values for the AdamW hyperparameters. We follow Kogkalidis et al. (2019) in setting up the sequential Transformer model with 8 decoder heads and 2 decoder layers, but swap out the from-scratch encoder with RoBERTa-base.

Datasets
We use two versions of the English CCGbank as in-domain (financial news) training and test sets: the original (Hockenmaier and Steedman, 2007) and Honnibal et al.'s (2010) 'rebanked', i.e., corrected and enriched version (training sets reported in Table 2; the results tables show test set counts).
The original CCGbank and Rebank differ in a number of conventions for atomic categories and category construction (Honnibal et al., 2010). Rebank has a larger and more diverse category space, due in large part to a more principled treatment of NP argument structure. Hence, we conduct our main experiments with Rebank and use the original CCGbank for comparisons with prior work.
11 Preliminary experiments showed that best dev performance is usually reached within 10 epochs; batches larger than 8 make our (single) GPU run out of memory.
Figure 4: Shifting the tail to evaluation. The new test set (right) consists of those sentences in sections 02-21 that contain a category type occurring less than 10 times, and the new training set of the remaining sentences (left). As a result, we evaluate on many more category types that are not seen at all in training (dark blue circles/right-most horizontal offset for each depth) than before (Figure 1b).
A limitation of standard test sets for studying the long tail is that category types appearing rarely in training are even less frequent in evaluation (the Rebank test set contains just 107 tokens of categories seen 1-9 times in training, and only 27 tokens of OOV categories). Scores computed over these small samples may thus not reliably estimate the models' generalization capacity. We counteract this in two ways: 1. by explicitly redistributing the training and test splits; and 2. by evaluating on out-of-domain data, with the assumption that a shift in domains means a shift in category distribution.
Table 3: Main results on Rebank evaluation set (WSJ section 23). Accuracy scores are computed for bins based on the order of magnitude of category occurrences in training, and complexity of categories in depth, with depth=0 corresponding to atomic categories like NP (Figure 3a has depth 2). Token (n) and type (N) counts for each bin are given in the first two rows. '@' refers to model variants that use an attention mechanism over the encoder's hidden states. (As a Transformer model, the K+19 model attends to both the encoder and previously predicted outputs by default.) In each column, we highlight all results r that fall within the standard deviation of the best result b, i.e., when r + stdev(r) > b − stdev(b). For comparison, the overall tagging accuracy reported in Honnibal et al. (2010) is 92.2%.
In the first case, we train the models on sentences containing exclusively the higher-frequency (≥10) categories, and evaluate them only on sentences with at least one rare category. We split the usual Rebank training set (WSJ sections 02-21) in this way; the distribution follows Figure 4. 12 In comparison with the default data splits (Figure 1b), we see that this sampling method captures precisely the long tail of categories, while leaving the rest of the category distribution largely unchanged.
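The redistribution of sentences into new training and test sets can be sketched as follows. This is a toy illustration with a hypothetical function name and a tiny invented corpus (using a threshold of 2 instead of 10 to keep the example small), not the script used to produce our splits.

```python
from collections import Counter


def redistribute(sentences, min_count=10):
    """Split tagged sentences by whether they contain a rare category type.

    sentences: list of sentences, each a list of supertag strings.
    Returns (train, test): test holds every sentence with at least one
    category type seen fewer than min_count times in the whole set.
    """
    counts = Counter(tag for sent in sentences for tag in sent)
    train, test = [], []
    for sent in sentences:
        has_rare = any(counts[tag] < min_count for tag in sent)
        (test if has_rare else train).append(sent)
    return train, test


# Toy corpus: X is rare (1 occurrence); Y occurs exactly at the threshold.
corpus = [["NP", "N"], ["NP", "Y", "X"], ["Y", "N"], ["N", "NP"]]
train, test = redistribute(corpus, min_count=2)
new_counts = Counter(tag for sent in train for tag in sent)
print(test, new_counts["Y"])
```

In the toy corpus, the category Y meets the threshold overall but drops below it in the new training set because one of its tokens shares a sentence with the rare category X, which is how a few categories can end up in the 1-9 range of the redistributed training data.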
For out-of-domain evaluation we use Honnibal et al.'s (2009) (English) Wikipedia gold standard and the (English) gold section of the Parallel Meaning Bank, v3.0, which comprises multiple text types, including literary and biblical texts (PMB; Abzianidze et al., 2017). The Wikipedia dataset follows CCGbank in terms of category conventions, while PMB is more similar to Rebank; we evaluate models trained on one style only on in- and out-of-domain test sets matching that style. That said, PMB contains an unusually large number of unseen categories following idiosyncratic conventions that even Rebank-trained models are unlikely to pick up on without additional training data.
12 The few supertags in the 1-9 range of the new training set are those which occurred slightly above 9 times in the original training set, but some of their tokens were moved due to occurring in the same sentence as a low-frequency tag.

Results
We report our main results on Rebank in Table 3. In terms of overall accuracy, the tree-structured constructive supertaggers (best: 94.70%) outperform the sequential ones (90.68%, 93.92%) and are roughly on par with the nonconstructive classifiers (best: 94.83%). Performance is generally very similar across all systems, except K+19. We conjecture that the main disparities between K+19 and the other models lie in the increased 'cognitive load' of having to learn the correct structure of categories, as well as the missing hard alignment between words and supertags at test time.
Regarding the long tail, we ask: Can constructive models accurately predict rare and complex categories without sacrificing performance on the head of the distribution? To answer this question, we break down performance by the frequency of category types in the training data. The baseline is the thresholded classifier MLP_10, which performs well on frequent categories but cannot access rare categories occurring less than 10 times in training. The simplest way of resolving this main hurdle is to remove the threshold, and indeed we find that MLP_1 is able to predict about a quarter of long-tail categories correctly. Can we do better? The sequence-to-sequence model by K+19 does a lot better on the tail and even retrieves some unseen categories, but at the cost of frequent ones. The per-tag recurrent and tree-recursive generators (RNN and TreeRNN) come close to the nonconstructive classifiers, but do not convincingly improve over them. The AddrMLP model, finally, outperforms all others on the rare tail while matching nonconstructive taggers on frequent and simple ones.
For comparison with existing work (Table 5), we also report results on the original CCGbank (Table 4). Our best constructive and nonconstructive models are on par with the previously reported state of the art in terms of overall accuracy. Tian et al. (2020) only report performance on categories seen at least 10 times in training, i.e., the union of our '≥ 100' and '10- [...] from Table 3 to Table 4 is consistent with Honnibal et al.'s (2010) finding that Rebank is more difficult to supertag and parse than CCGbank due to its sparser category space. We therefore encourage future researchers to conduct experiments on Rebank and report detailed results for frequency- and complexity-binned subsets of the output space to facilitate more in-depth comparisons.
Evaluating generalizability. One of the inherent problems of the supertagging task is the sparsity of the output space. This is, however, not sufficiently captured by standard evaluation sets, as illustrated in Figure 1b. To test how well the models really generalize to the long tail, we evaluate them on alternatively sampled training and evaluation splits of the WSJ data [...] set (Table 7). These experiments largely confirm our findings from the standard Rebank evaluation set, while the change in category distribution has several important effects on our ability to evaluate model generalization: First, OOV performance is much higher on the redistributed data (Table 6) than on the standard test splits in Tables 3, 4, and 7, highlighting all of the constructive models' generalization capability, and in turn suggesting that the OOV categories in WSJ section 23 and PMB are truly difficult, noisy, or otherwise inconsistent with the training data. Second, the proportion of evaluation tokens of categories seen less than 10 times in training is 1.6% in PMB and 3.8% in our redistributed Rebank evaluation data, compared to only ≈0.2% in the standard CCGbank and Rebank test sets. This 7x-16x increase in relative size renders the tail much more consequential for overall performance. And indeed we observe slightly smaller gaps in overall accuracy between the best-performing nonconstructive and the best-performing constructive systems in Table 7 (0.08 on Wiki, 0.11 on PMB) compared to 0.13 in Tables 3 and 4, while in Table 6 AddrMLP even clearly outperforms the nonconstructive models. Third, performance on rare and unseen categories can now be measured much more reliably due to the larger absolute counts of rare and unseen categories. We provide in-depth analyses of this subset of tags in §6.2.
Table 7: Performance of the best systems (the variants with attention for each paradigm) on the Wikipedia and PMB datasets. The state of the art on the Wikipedia data is 90.00%.
In both in-domain and out-of-domain data, the performance gap between the nonconstructive MLPs and AddrMLP on the most frequent categories is minimal and in fact lies within the standard deviation. Given the trend we observe from Tables 3 and 4 to Tables 6 and 7, the ability to generalize to the long tail may well outweigh any minor improvement on the most frequent categories when applied to even more diverse data, within other languages, and across languages.
Figure 5 (panels: AddrMLP@ - MLP_10@; AddrMLP@ - K+19): Confusion matrices by category depth, based on the standard Rebank evaluation set. Rows (columns) correspond to gold (predicted) categories with the respective depth. Thus, cells above (below) the diagonal refer to categories predicted too deep (shallow). All numbers are absolute differences between confusions made by AddrMLP@ and MLP_10@ / K+19, respectively. Thus, positive numbers (red) are more typical for AddrMLP@ and negative numbers (blue) are more typical for one of the other systems.
Detailed Analysis

Constructing Complex Categories
While nonconstructive taggers do not distinguish between categories of varying complexity (each supertag prediction is a single k-way decision), constructive taggers are always required to make multiple atomic decisions whenever assigning a complex category, all of which need to be correct in order for the full category to be counted as correct. This raises the question: How difficult are categories of varying complexity for each of the systems?
As Figure 1b shows, deeper, i.e., more complex, categories tend to be rarer and thus are more difficult than simple ones in general, for all models. Surprisingly, however, we can see in the three right-most columns of Table 3 that it is not dramatically more difficult for constructive systems to generate complex categories of depth ≥ 1 than it is for nonconstructive systems to simply assign them (apart from K+19, which underperforms on frequent categories regardless of their complexity).
In Figure 5 we take a closer look at the models' ability to predict categories of the appropriate depth. For the sake of brevity, we only consider three extreme cases: MLP_10, K+19, and AddrMLP. Compared with MLP_10, which tends to choose one of the very frequent but relatively shallow categories of depth 1 or 2, AddrMLP prefers both standalone atomic (depth-0) categories and those of depth 3 and 4 (column totals in the top left matrix). On the head, AddrMLP confuses depth-1 for depth-0 categories and overpredicts the depth of depth-2 and depth-3 categories more frequently than the baseline. On the subset of rare categories (which are deeper than more frequent categories on average), AddrMLP is consistently better at predicting categories of the correct depth (diagonal in the bottom left matrix); the thresholded model consistently chooses categories that are too shallow here. The sequential tagger by K+19 struggles with predicting the correct depth for frequent categories much more than the tree-structured model (top right matrix), which is almost certainly a result of its lack of an inductive bias for the tree structure of categories. On the rare tail, however, its ability to guess the right depth is almost as good as that of AddrMLP (bottom right).

Generation Behavior and Unseen Tags
Are there any distinct patterns in the output of the different models? By manually searching the corpus, we find that even in the cases where a tagger assigns a category with an incorrect structure, there are systematic confusions such as between argument and adjunct PPs and between fixed particle verbs and (aspectual) adjunct particles. This is difficult to measure at a large scale, but we present two examples in Tables 8 and 9.
Table 8 (example phrase: "garnered from 1984 to 1986"): AddrMLP treats "garnered" as expecting a PP argument (which would be correct for a source-PP, e.g. "garnered information from the internet", but this is a different sense of "from"). The other models correctly identify "garnered" as an intransitive passive verb with "from" introducing an adverbial PP adjunct. The gold category of "from" is so complicated because it is correlated with "to": First it expects an NP object on the right ("1984"), then an adverbial adjunct on the right (the to-PP), after which it produces an adjunct to a VP [14]. AddrMLP's predictions for "garnered" and "from" are consistent in treating the entire construction "from 1984 to 1986" as an argument of the verb.
[14] ADV is not an actual atomic category. We use it to abbreviate the VP-adjunct category (S\NP)\(S\NP). PP is a conventionalized atomic category for argument-PPs.
Table 9: Here, the intended treatment of the particle (PR) "up" is as an argument selected by the predicate. Only AddrMLP gets this right. We assume this is preferable over treating it as a VP adjunct (as the nonconstructive and K+19 taggers do) from a semantic perspective, because "pile up" is a fixed expression with a meaning distinct from that of "(to) pile" or "pile in". The RNN categories are both wrong and inconsistent (the "piling" category expects a PP and the "up" category is predicative).
Table 11: Analysis of predicted supertag structures in the redistributed evaluation set. Incorrect predictions are broken down in terms of having the correct structure (✓struct: the same number and arrangement of slashes, arguments, and results as the gold category), an incorrect but well-formed structure (✓formed: diverging arrangement of arguments, but still obeying the grammar in Figure 2), or an invalid structure (✗formed, e.g., missing arguments to slashes).
The thresholded tagger has the option to output an <UNKNOWN> label when it believes the correct category is not in the tagset. It makes use of this option for 0.25% of tokens on average (0.11% with standard train/test splits); when it does, the correct category is indeed missing from the tagset about 2/3 of the time. This happens, e.g., with WH-words in elliptical questions, as in Table 10.
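The behavior of the <UNKNOWN> label can be illustrated with a small sketch. It assumes that the thresholded tagset keeps every category seen at least `min_count` times in training (the default of 10 is an assumption suggested by the model name MLP_10) and adds one extra <UNKNOWN> label. The helper names `build_tagset` and `unknown_precision` are hypothetical.

```python
from collections import Counter

UNKNOWN = "<UNKNOWN>"


def build_tagset(train_tags, min_count=10):
    """Keep categories seen at least min_count times, plus the <UNKNOWN> label."""
    counts = Counter(train_tags)
    return {tag for tag, count in counts.items() if count >= min_count} | {UNKNOWN}


def unknown_precision(predicted, gold, tagset):
    """Fraction of <UNKNOWN> predictions whose gold category is indeed not in the tagset."""
    flagged = [g for p, g in zip(predicted, gold) if p == UNKNOWN]
    if not flagged:
        return 0.0
    return sum(g not in tagset for g in flagged) / len(flagged)
```

The second function corresponds to the "about 2/3 of the time" figure reported above: it asks how often the tagger is right to abstain.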
In Table 11 we quantify the structural and labeling errors more generally, based on the redistributed evaluation set to ensure reliable estimates on rare phenomena. A substantial portion of erroneous categories actually do have the correct structure (✓struct) [15]. For these cases, we perform a detailed error analysis, whose results we present in Figure 6. In fact, if the structure is correct, the predicted category is often only off by the direction of a single slash or the attribute of a single atomic category. K+19 additionally struggles with atomic decisions beyond just differences in attributes.
To what extent can the constructive models generate categories that were unseen during training? We take a closer look at categories the constructive taggers invented in the bottom halves of Table 11 and Figure 6. K+19 is the most willing to invent categories, closely followed by the tree-structured models and finally RNN, which is rather conservative in this respect (see sums of the last four rows in Table 11). Merely generating more new categories irrespective of their correctness is of course not necessarily an advantage, but it is encouraging to see the models make use of their freedom to do so at an adequate rate, rather than only reproducing known categories or vastly overgenerating invented ones. Interestingly, given that an incorrect invented category has the same structure as the gold category, we again see that the majority of errors are due to only a single attribute or slash, suggesting that in these cases the models get the general idea of the category right and only err in fine-grained and context-sensitive subcategorization. In the case of a slash mistake, they are notably also able to recover from it in later predictions.
[15] E.g., for "piling" in Table 9 the RNN predicts (S[ng]/NP)/PP, which exhibits the correct structure (X/X)/X with an incorrect atomic label (PP instead of PR).
Figure 6: Fine-grained analysis of correctly structured but incorrectly labeled predictions ('✓struct' in Table 11). The legend distinguishes 1 attribute error, 1 slash error, 1 atom error, and more than 1 labeling error. 'Attribute error' means that the predicted atomic category is correct except for a wrong or missing linguistic attribute (e.g., S vs. S[dcl]); 'atom error' means that an entirely wrong atomic category has been chosen (e.g., PP vs. NP); and 'slash error' means confusing / and \.
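The error types distinguished in Figure 6 can be sketched as a comparison of two category strings. This is only an illustration: it assumes both categories are written with the same bracketing convention, so that identical structure shows up as identical positions of parentheses, slashes, and atoms. The function names are hypothetical.

```python
import re

TOKEN = re.compile(r"[()/\\]|[^()/\\]+")
SLASHES = ("/", "\\")


def tokenize(category):
    return TOKEN.findall(category.replace(" ", ""))


def skeleton(tokens):
    """Abstract away from slash direction and atom labels, keeping only the shape."""
    shape = []
    for token in tokens:
        if token in SLASHES:
            shape.append("|")
        elif token in ("(", ")"):
            shape.append(token)
        else:
            shape.append("X")
    return shape


def classify(gold, predicted):
    """Label a prediction using the error types named in Figure 6."""
    g, p = tokenize(gold), tokenize(predicted)
    if skeleton(g) != skeleton(p):
        return "different structure"
    errors = []
    for a, b in zip(g, p):
        if a == b:
            continue
        if a in SLASHES:
            errors.append("slash error")
        elif a.split("[")[0] == b.split("[")[0]:
            errors.append("attribute error")  # same atom, wrong or missing attribute
        else:
            errors.append("atom error")
    if not errors:
        return "correct"
    if len(errors) == 1:
        return "1 " + errors[0]
    return "more than 1 labeling error"
```

For instance, `classify(r"S[dcl]\NP", r"S\NP")` yields "1 attribute error", mirroring the S vs. S[dcl] example in the caption.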
Table 13: Performance by part-of-speech, based on the original CCGbank test set. n and N refer to token and type counts in the test set, as before; f refers to the average frequency with which a supertag belonging to the respective POS class is seen in training.
While the tree-structured taggers are guaranteed to produce valid categories [16], it is possible for the sequential taggers to generate structurally invalid categories, i.e., sequences of atomic categories and slashes that are not licensed by the grammar in Figure 2. With the tag-wise RNN generator, which generally refrains from inventing new categories, this only happens extremely rarely, but in the case of K+19, every 14th sentence is affected by an ill-formed supertag on average (every 66th sentence in the standard Rebank test set). A common source of errors is that too many slashes are predicted, whose argument and result slots cannot then be filled by the predicted atomic categories. We show an example in Table 12.
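Why surplus slashes lead to ill-formed output can be seen in a short well-formedness check. The sketch assumes, purely for illustration, that a sequential tagger emits each category in prefix order (a slash followed by its two subtrees); the actual serialization used by K+19 may differ. `is_well_formed` is a hypothetical helper.

```python
SLASHES = {"/", "\\"}


def is_well_formed(symbols):
    """Check that a prefix-order sequence of slashes and atoms forms exactly one category."""
    open_slots = 1  # one category is expected at the start
    for symbol in symbols:
        if open_slots == 0:
            return False  # extra symbols after the category is already complete
        # A slash fills one slot and opens two (net +1); an atom fills one slot.
        open_slots += 1 if symbol in SLASHES else -1
    return open_slots == 0  # unfilled slots remain if too many slashes were predicted
```

With this encoding, `["/", "\\", "S", "NP", "NP"]` is accepted, whereas `["/", "/", "S", "NP"]` is rejected because one slot opened by a slash is never filled.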
And vice versa, are there any categories that are not generated despite being seen in training? There are 80 category types in the standard Rebank test set that none of the tree-structured taggers ever predict correctly, although they are attested in the training data, and there are 93 types that are never retrieved by K+19, 73 of which overlap. Out of these 73, none occurs more than three times in the test set and almost all appear fewer than 50 times in training, with three exceptions: (NP/NP)/(NP/NP) (68 times in training), ((N/N)/(N/N))/NP (50 times), and (NP/NP)/N (50 times). The first one is usually used for the last part of complex numerical expressions (such as dates and ranges), but the one token bearing this category in the test set is "not" in "they might not miss one at all", which is likely an annotation error [17]. The second one encodes prepositions modifying an appositive bare noun, typically an appellation or postposed proper noun. The third one is for determiners of appositions or parentheticals. 67 of the 73 types that are problematic for the constructive models are never accurately predicted by the nonconstructive models either.
[17] There are a few more instances of such implausible lexical categories in the training data, like S or ((:/NP)/PP)/NP.
Figure 7: Parsing F1-score for varying dependency lengths, measured in terms of linear distance of the two words involved in the dependency.

Parts of Speech and Sentence Parsing
Parsing performance is computed using labeled F1-score (LF) over CCG dependencies in all sentences, following Clark and Curran (2007), and Parseability, i.e., the proportion of sentences for which a complete CCG derivation can be constructed [18]. Nearly all the models we compare outperform the state of the art in labeled dependency F1-score (right-most columns in Table 4). Interestingly, the K+19 model produces more parseable supertag sequences than others, despite consistently lagging behind in terms of category accuracy. Apparently this tagger prefers to be self-consistent over producing the actual correct categories, either due to its multi-head attention mechanism, the fact that decisions towards the end of the sequence have access to all previously predicted categories in their entirety (rather than just parts of them), or both.
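The two parsing metrics can be sketched as follows. This is a simplified illustration rather than the evaluation script used for Table 4: dependencies are represented as arbitrary hashable tuples, a failed parse is represented as `None`, and the function names are hypothetical.

```python
def labeled_f1(gold_deps, predicted_deps):
    """Harmonic mean of precision and recall over labeled dependencies."""
    gold, predicted = set(gold_deps), set(predicted_deps)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def parseability(derivations):
    """Proportion of sentences for which a complete derivation was found (not None)."""
    return sum(d is not None for d in derivations) / len(derivations)
```

The two scores can diverge: a tagger whose supertags combine into a full derivation more often gains Parseability even if some of its individual categories, and hence some dependencies, are wrong.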
Long-range dependencies. We examine supertagging performance by POS class (a few are shown in Table 13) and find that constructive and nonconstructive taggers perform similarly across classes, with one notable exception: WH-words, whose supertags are rarely seen in training and have a high type/token ratio at test time. Their special syntactic status raises the question: How important are constructivity, tree structure, and long-tail recall for recovering categories involved in long-range dependencies?
Somewhat surprisingly, we find that the RNN is best for these dependencies (Figure 7), which might be related to the two parsing metrics in Table 4: the RNN strikes a good balance between LF and Parseability. We further examine the average dependency length per category, and contrary to our expectation, dependencies involving WH-categories are relatively short (usually 3-4 intervening words). We find that the supertags with the longest dependencies on average largely function as subordinators, sentence adverbials, and inverted speech verbs such as (S[dcl]/S[dcl])/NP. These supertags have in common that they all contain sentential result/argument pairs of the form S[x]|S[x] (where x is an optional attribute). The autoregressive nature of the RNN may be conducive to modeling the matching atomic categories of argument and result. Exploring various decoding orders for both sequential and tree-structured constructive taggers in order to more explicitly take advantage of these intra-category relations is an interesting avenue for future work. We also expect a major boost in Parseability from incorporating inter-category prediction history into our models (Bhargava and Penn, 2020), but this is nontrivial for tree-structured decoding and goes beyond our scope here.

Runtime and Model Size
While the constructive taggers need to make more individual decisions for each supertag than nonconstructive ones, they only have to consider a much smaller and denser output space. This trade-off between time and space complexity should be considered in addition to tagging accuracy when evaluating each model. Thus we ask: How do the constructive supertaggers compare to nonconstructive ones in terms of efficiency? In Table 14 we report model sizes (i.e., the number of learned parameters), training time until development performance plateaus, and inference speed. As model size and runtime vary greatly between different constructive taggers, the answer to our question depends on how supertags are modeled and inferred.
The K+19 sequential Transformer model has low efficiency for two reasons: The Transformer architecture itself has a large number of parameters; and sequential inference is slow because individual predictions for the same sentence cannot be parallelized and the number of inference steps per input sentence is linear in the sum of all category sizes (the number of atomic pieces) for that sentence. The GRUs in the RNN and TreeRNN models are much smaller than the Transformer of K+19, but the TreeRNN with its two GRUs for argument and result transitions ends up having almost as many parameters as the Transformer in total. The nonconstructive models map hidden representations into a much larger and sparser output space than the constructive models (and the output space of MLP_1, in turn, is larger and sparser than that of MLP_10). AddrMLP, on the other hand, consists exclusively of feed-forward layers, resulting in the smallest model size among the ones we compare. The sequential models require relatively many training epochs to converge. The reason total training time is still comparable between K+19 and RNN despite the extreme disparity in inference speed is that the Transformer is trained nonautoregressively and thus performs inference only between epochs, for evaluation on the development set, whereas RNN training inherently relies on inference. The nonconstructive and tree-structured models converge within the first 10 epochs.
For the per-tag constructive models RNN, TreeRNN, and AddrMLP we parallelize inference across all supertags in a batch, and for the tree-structured ones, we further parallelize the prediction of the children of slash functors, making their inference time logarithmic in the size of the largest predicted category in a sentence.
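The difference between the three inference regimes can be illustrated by counting sequential decoding steps for one sentence. The sketch is an idealization under stated assumptions: categories are given as nested `(result, slash, argument)` tuples, the size of a category counts its atoms and slashes, and separator or end-of-sequence symbols are ignored. All names are hypothetical.

```python
def size(category):
    """Number of atomic pieces (atoms and slashes) in a category tree."""
    if isinstance(category, str):
        return 1
    result, _slash, argument = category
    return 1 + size(result) + size(argument)


def depth(category):
    """Tree depth, with atoms at depth 0."""
    if isinstance(category, str):
        return 0
    result, _slash, argument = category
    return 1 + max(depth(result), depth(argument))


def decoding_steps(sentence):
    """Idealized number of sequential steps needed to decode all supertags of a sentence."""
    return {
        # One long sequence for the whole sentence: sum of all category sizes.
        "sequential": sum(size(c) for c in sentence),
        # Supertags decoded in parallel, each one piece by piece.
        "per_tag": max(size(c) for c in sentence),
        # Supertags and sibling subtrees decoded in parallel, one tree level per step.
        "tree_structured": max(depth(c) for c in sentence) + 1,
    }
```

The tree-structured count grows with the depth of the deepest category, which is logarithmic in category size when the category tree is balanced.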
AddrMLP is both time- and space-efficient overall. Its parameter count is only ≈1/10 that of the K+19 model and ≈1/2 that of the nonconstructive ones.