Syntactic Structure Distillation Pretraining for Bidirectional Encoders

Abstract Textual representation learners trained on large amounts of data have achieved notable success on downstream tasks; intriguingly, they have also performed well on challenging tests of syntactic competence. Hence, it remains an open question whether scalable learners like BERT can become fully proficient in the syntax of natural language by virtue of data scale alone, or whether they still benefit from more explicit syntactic biases. To answer this question, we introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining, by distilling the syntactically informative predictions of a hierarchical—albeit harder to scale—syntactic language model. Since BERT models masked words in bidirectional context, we propose to distill the approximate marginal distribution over words in context from the syntactic LM. Our approach reduces relative error by 2–21% on a diverse set of structured prediction tasks, although we obtain mixed results on the GLUE benchmark. Our findings demonstrate the benefits of syntactic biases, even for representation learners that exploit large amounts of data, and contribute to a better understanding of where syntactic biases are helpful in benchmarks of natural language understanding.


Introduction
Large-scale textual representation learners trained with variants of the language modelling (LM) objective have achieved remarkable success on downstream tasks (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019). Furthermore, these models have also been shown to perform remarkably well at syntactic grammaticality judgement tasks (Goldberg, 2019), and to encode substantial amounts of syntax in their learned representations (Liu et al., 2019a; Tenney et al., 2019a,b; Hewitt and Manning, 2019; Jawahar et al., 2019). Intriguingly, this success on syntactic tasks has been achieved by Transformer architectures (Vaswani et al., 2017) that lack explicit notions of hierarchical syntactic structure.
Based on such evidence, it would be tempting to conclude that data scale alone is all we need to learn the syntax of natural language. Nevertheless, recent findings that systematically compare the syntactic competence of models trained at varying data scales suggest that model inductive biases are in fact more important than data scale for acquiring syntactic competence (Hu et al., 2020). Two natural questions, therefore, are the following: can representation learners that work well at scale still benefit from explicit syntactic biases? And where exactly would such syntactic biases be helpful in different language understanding tasks? Here we work towards answering these questions by devising a new pretraining strategy that injects syntactic biases into a BERT (Devlin et al., 2019) learner that works well at scale. We hypothesise that this approach can improve the competence of BERT on various tasks, which provides evidence for the benefits of syntactic biases in large-scale learners.
Our approach builds on prior work that devised an effective knowledge distillation (Bucilǎ et al., 2006; Hinton et al., 2015, KD) procedure for improving the syntactic competence of scalable LMs that lack explicit syntactic biases. More concretely, that KD procedure utilised the predictions of an explicitly hierarchical (albeit hard-to-scale) syntactic LM, recurrent neural network grammars (Dyer et al., 2016, RNNGs), as a syntactically informed learning signal for a sequential LM that works well at scale.
Our setup nevertheless presents a new challenge: here the BERT student is a denoising autoencoder that models a collection of conditionals for words in bidirectional context, while the RNNG teacher is an autoregressive LM that predicts words in a left-to-right fashion, i.e. t_φ(x_i | x_<i). This mismatch crucially means that the RNNG's estimate of t_φ(x_i | x_<i) may fail to take into account the right context x_>i that is accessible to the BERT student (§3). Hence, we propose an approach where the BERT student distills the RNNG's marginal distribution over words in bidirectional context, t_φ(x_i | x_<i, x_>i). We develop an efficient yet effective approximation of this quantity, since exact inference is expensive owing to the RNNG's left-to-right parameterisation.
Our structure-distilled BERT model differs from the standard BERT only in its pretraining objective, and hence retains the scalability afforded by Transformer architectures and specialised hardware like TPUs. Our approach also maintains complete compatibility with standard BERT pipelines: the structure-distilled models can simply be loaded as pretrained BERT weights, which can then be fine-tuned in the exact same fashion.
We hypothesise that the stronger syntactic biases from our new pretraining procedure are useful for a variety of natural language understanding (NLU) tasks that involve structured output spaces, including tasks like semantic role labelling (SRL) and coreference resolution that are not explicitly syntactic in nature. We thus evaluate our models on six diverse structured prediction tasks, including phrase-structure parsing (in-domain and out-of-domain), dependency parsing, SRL, coreference resolution, and a CCG supertagging probe, in addition to the GLUE benchmark. On the structured prediction tasks, our structure-distilled BERT_BASE reduces relative error by 2% to 21%. These gains are more pronounced in the low-resource scenario, suggesting that stronger syntactic biases help improve sample efficiency (§4).
Despite the gains on the structured prediction tasks, we achieve mixed results on GLUE: our approach yields improvements on the corpus of linguistic acceptability (Warstadt et al., 2018, CoLA), and yet performs slightly worse on the rest of GLUE. These findings allude to a partial dissociation between model performance on GLUE, and on other more syntax-sensitive benchmarks of NLU.
Altogether, our findings: (i) showcase the benefits of syntactic biases, even for representation learners that leverage large amounts of data, (ii) help better understand where syntactic biases are most helpful, and (iii) make a case for designing approaches that not only work well at scale, but also integrate stronger notions of syntactic biases.

Recurrent Neural Network Grammars
Here we briefly describe the RNNG (Dyer et al., 2016) that we use as the teacher model. An RNNG is a syntactic LM that defines the joint probability of surface strings x and phrase-structure nonterminals y, henceforth denoted t_φ(x, y), through a series of structure-building actions that traverse the tree in a top-down, left-to-right fashion. Let N and Σ denote the sets of phrase-structure nonterminals and word terminals, respectively. At each time step, the decision over the next action a_t ∈ {NT(n), GEN(w), REDUCE}, where n ∈ N and w ∈ Σ, is parameterised by a stack LSTM (Dyer et al., 2015) that encodes partial constituents. The choice of a_t yields these transitions:
• a_t ∈ {NT(n), GEN(w)} pushes the corresponding embedding e_n or e_w onto the stack;
• a_t = REDUCE pops the top k elements up to the last incomplete non-terminal, composes these elements with a separate bidirectional LSTM, and lastly pushes the composite phrase embedding e_phrase back onto the stack.
The hierarchical inductive bias of RNNGs can be attributed to this composition function, which recursively combines smaller units into larger ones.
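The three transitions above can be sketched as stack operations. This is an illustrative toy, not the authors' code: a real RNNG embeds stack elements with a stack LSTM and composes popped children with a bidirectional LSTM, whereas here composition simply builds a tuple, to show the control flow only.

```python
def nt(stack, n):
    """NT(n): push an open non-terminal onto the stack."""
    stack.append(("open", n))

def gen(stack, w):
    """GEN(w): push a terminal (word) onto the stack."""
    stack.append(("word", w))

def reduce_(stack):
    """REDUCE: pop up to the last open non-terminal, compose the
    popped children into one phrase, and push it back."""
    children = []
    while stack and stack[-1][0] != "open":
        children.append(stack.pop())
    label = stack.pop()[1]          # the open non-terminal's label
    children.reverse()
    stack.append(("phrase", label, tuple(children)))

stack = []
# Action sequence for "(S (NP the dog) (VP barks))":
nt(stack, "S"); nt(stack, "NP"); gen(stack, "the"); gen(stack, "dog"); reduce_(stack)
nt(stack, "VP"); gen(stack, "barks"); reduce_(stack); reduce_(stack)
print(stack)
```

After the final REDUCE, the stack holds a single composite phrase for S, whose children are the composed NP and VP phrases.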
RNNGs are trained to maximise the probability of the correct action sequence relative to each gold tree.

Extension to subwords. Here we extend the RNNG to operate over subword units (Sennrich et al., 2016) to enable compatibility with the BERT student. As each word can be split into an arbitrary-length sequence of subwords, we preprocess the phrase-structure trees to include an additional nonterminal symbol that represents a word sequence, as illustrated by the example "(S (NP (WORD the) (WORD d ##og)) (VP (WORD ba ##rk ##s)))", where tokens prefixed by "##" are subword units.
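The preprocessing step can be sketched as follows. This is a hypothetical reconstruction, not the authors' pipeline: `tokenize` is a toy greedy longest-match stand-in for BERT's WordPiece tokeniser, and trees are represented as nested lists.

```python
def tokenize(word, vocab):
    # Toy WordPiece: greedy longest-match; continuation pieces are "##"-prefixed.
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            return [word]  # fall back to the whole word if no piece matches
    return pieces

def add_word_nonterminals(tree, vocab):
    """Wrap each terminal of a bracketed tree in a (WORD ...) constituent
    spanning its subword pieces."""
    if isinstance(tree, str):                        # a terminal (word)
        return ["WORD"] + tokenize(tree, vocab)
    return [tree[0]] + [add_word_nonterminals(c, vocab) for c in tree[1:]]

vocab = {"the", "d", "##og", "ba", "##rk", "##s"}
tree = ["S", ["NP", "the", "dog"], ["VP", "barks"]]
result = add_word_nonterminals(tree, vocab)
print(result)
```

On this toy vocabulary the output reproduces the paper's example tree: (S (NP (WORD the) (WORD d ##og)) (VP (WORD ba ##rk ##s))).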

Approach
We begin with a brief review of the BERT objective, before outlining our structure distillation approach.
Figure 1: An example of the masked LM task: "The dogs by the window [MASK/chase] the cat", where [MASK] = chase and window is an attractor (red) intervening in the agreement (AGREE) between dogs and the masked verb. We suppress phrase-structure annotations and corruptions on the context tokens for clarity.

BERT Pretraining Objective
The aim of BERT pretraining is to find model parameters θ̂_B that maximise the probability of reconstructing parts of x = x_1, ..., x_k conditional on a corrupted version c(x) = c(x_1), ..., c(x_k), where c(·) denotes the stochastic corruption protocol of Devlin et al. (2019) applied to each word x_i ∈ x. Formally:

θ̂_B = arg max_θ Σ_{x ∈ D} Σ_{i ∈ M(x)} log p_θ(x_i | c(x)),   (1)

where M(x) ⊆ {1, ..., k} denotes the indices of the masked tokens that serve as reconstruction targets. This masked LM objective is then combined with a next-sentence prediction loss that predicts whether the two segments in x are contiguous sequences.
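A minimal sketch of the corruption protocol c(·) may help make M(x) concrete. This is an illustrative reimplementation, not the BERT codebase: Devlin et al. (2019) select 15% of positions as reconstruction targets, and each selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and kept unchanged 10% of the time.

```python
import random

def corrupt(tokens, vocab, p_select=0.15, seed=0):
    """Return (corrupted sequence c(x), target indices M(x))."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < p_select:
            targets.append(i)                      # i enters M(x)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"            # 80%: mask
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)   # 10%: random replacement
            # else: 10%: keep the original token
    return corrupted, targets

tokens = "the dogs by the window chase the cat".split()
cx, m = corrupt(tokens, vocab=tokens)
```

Only positions in M(x) contribute to the reconstruction loss; unselected positions are passed through unchanged.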

Motivation
Since the RNNG teacher is an expert on syntactic generalisations (Kuncoro et al., 2018), we adopt a structure distillation procedure that enables the BERT student to learn from the RNNG's syntactically informative predictions. Our setup nevertheless means that the two models crucially differ in nature: the BERT student is not a left-to-right LM like the RNNG, but rather a denoising autoencoder that models a collection of conditionals for words in bidirectional context (Eq. 1). We now present two strategies for dealing with this challenge. The first, naïve approach is to ignore this difference, and let the BERT student distill the RNNG's marginal next-word distribution for each w ∈ Σ based on the left context alone, i.e. t_φ(w | x_<i). While this approach is surprisingly effective (§4.3), Fig. 1 illustrates an issue with it. The RNNG's strong syntactic biases mean that we can expect t_φ(w | The dogs by the window) to assign high probabilities to plural verbs like bark, chase, fight, and run that are consistent with the agreement controller dogs, despite the presence of a singular attractor (Linzen et al., 2016), window, that can distract the model into predicting singular verbs like chases. Nevertheless, some plural verbs that are favoured based on the left context alone, such as bark and run, are in fact poor choices once the right context is considered (e.g. "The dogs by the window bark/run the cat" is syntactically illicit). Distilling t_φ(w | x_<i) thus fails to take into account the right context x_>i that is accessible to the BERT student, and risks encouraging the student to assign high probabilities to words that fit poorly with the bidirectional context.
Hence, our second approach is to learn from teacher distributions that not only (i) reflect the strong syntactic biases of the RNNG teacher, but also (ii) consider both the left and right context when predicting w ∈ Σ. Formally, we propose to distill the RNNG's marginal distribution over words in bidirectional context, t_φ(w | x_<i, x_>i), henceforth referred to as the posterior probability of generating w under all available information. We now demonstrate that this quantity can, in fact, be computed from left-to-right LMs like RNNGs.

Posterior Inference
Given a pretrained autoregressive, left-to-right LM that factorises t_φ(x) = ∏_{i=1}^{|x|} t_φ(x_i | x_<i), we discuss how to infer an estimate of t_φ(x_i | x_<i, x_>i). By the definition of conditional probability:

t_φ(x_i | x_<i, x_>i) = t_φ(x_<i, x_i, x_>i) / Σ_{w ∈ Σ} t_φ(x_<i, w, x_>i)
                      = [t_φ(x_i | x_<i) · t_φ(x_>i | x_<i, x_i)] / Σ_{w ∈ Σ} [t_φ(w | x_<i) · t_φ(x_>i | x_<i, w)].   (2)

Intuition. After cancelling the common factor t_φ(x_<i), the posterior computation in Eq. 2 decomposes into two terms: (i) the likelihood of producing x_i given its prefix, and (ii) conditional on having generated x_i and its prefix x_<i, the likelihood of producing the observed continuation x_>i. In our running example (Fig. 1), the posterior would assign low probabilities to plural verbs like bark that are nevertheless probable under the left context alone (i.e. t_φ(bark | The dogs by the window) would be high), because they are unlikely to generate the continuation x_>i (i.e. we expect t_φ(the cat | The dogs by the window bark) to be low, since it is syntactically illicit). In contrast, the posterior would assign high probabilities to plural verbs like fight and chase that are consistent with the bidirectional context, since we expect both t_φ(fight | The dogs by the window) and t_φ(the cat | The dogs by the window fight) to be probable.
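The intuition behind Eq. 2 can be illustrated numerically. All the probabilities below are made-up toy values, not outputs of an actual RNNG: `left` stands in for t_φ(w | x_<i) given the left context "The dogs by the window", and `right` for t_φ(x_>i | x_<i, w), the probability of the continuation "the cat" after w.

```python
VOCAB = ["bark", "chase", "cat"]

# t(w | x_<i): toy next-word distribution under the left context alone.
left = {"bark": 0.5, "chase": 0.4, "cat": 0.1}
# t(x_>i | x_<i, w): toy probability of the observed right context after w.
right = {"bark": 0.1, "chase": 0.8, "cat": 0.05}

# Eq. 2: posterior(w) ∝ t(w | x_<i) * t(x_>i | x_<i, w), renormalised over Σ.
unnorm = {w: left[w] * right[w] for w in VOCAB}
z = sum(unnorm.values())
posterior = {w: p / z for w, p in unnorm.items()}

# "bark" wins under the left context alone, but the posterior favours
# "chase", which also explains the right context:
print(posterior)
```

Even though `left` ranks bark above chase, the posterior reverses that ranking because bark explains the continuation poorly, exactly the behaviour described in the running example.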
Computational cost. Let k denote the maximum length of x. Our KD approach requires computing the posterior distribution (Eq. 2) for every masked token x_i in the dataset D, which (excluding the cost of marginalising over y) necessitates O(|Σ| · k · |D|) operations, where each operation returns the RNNG's estimate of t_φ(x_j | x_<j). In the standard BERT setup, where |Σ| ≈ 29,000 (the vocabulary size of BERT-Cased), |D| ≈ 3 × 10^9, and k = 512, this procedure requires a prohibitive number of operations (∼5 × 10^16).

Posterior Approximation
Since exact inference of the posterior is computationally expensive, here we propose an efficient approximation procedure.

Approximating t_φ(x_>i | x_<i, x_i). We first assume that, given x_i, the continuation x_>i is generated independently of the prefix x_<i:

t_φ(x_>i | x_<i, x_i) ≈ t_φ(x_>i | x_i).   (3)

While Eq. 3 is still expensive to compute, it enables us to apply Bayes' rule to compute t_φ(x_>i | x_i):

t_φ(x_>i | x_i) = t_φ(x_i | x_>i) · t_φ(x_>i) / q(x_i),   (4)

where q(·) denotes the unigram distribution. For efficiency, we replace t_φ(x_i | x_>i) with a separately trained "reverse" RNNG that operates in a right-to-left fashion, denoted r_ω(x_i | x_>i); a complete example of the right-to-left RNNG action sequences is provided in Appendix C. We now substitute Eq. 4 and the right-to-left parameterisation r_ω(x_i | x_>i) into Eqs. 2 and 3, and cancel the common factor t_φ(x_>i):

t_φ(x_i | x_<i, x_>i) ≈ [t_φ(x_i | x_<i) · r_ω(x_i | x_>i) / q(x_i)] / Σ_{w ∈ Σ} [t_φ(w | x_<i) · r_ω(w | x_>i) / q(w)].   (5)

This approximation preserves the intuition explained in §3.3: verbs like bark would still be assigned low probabilities, since t_φ(the cat | bark) would be low given that "bark the cat" is syntactically illicit (the alternative "bark at the cat" would be licit).

Our approximation in Eq. 5 crucially reduces the required number of operations from O(|Σ| · k · |D|) to O(|Σ| · |D|), although the actual speedup is much larger in practice, since Eq. 5 involves easily batched operations that benefit considerably from specialised hardware like GPUs. Notably, our proposed approach is a general one: it can approximate the posterior over x_i from any left-to-right LM, irrespective of the LM's parameterisation, and the result can be used as a learning signal for BERT through KD. It does, however, necessitate a separately trained right-to-left LM.
Connection to product of experts. Eq. 5 has a similar form to a product of experts (Hinton, 2002, PoE) between the left-to-right and right-to-left RNNGs' next-word distributions, albeit with extra unigram terms q(w). If we replace the unigram distribution with a uniform one, i.e. q(w) = 1/|Σ| for all w ∈ Σ, Eq. 5 reduces to a standard PoE.
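The combination in Eq. 5, and its reduction to a standard PoE under a uniform q, can be sketched directly. The three distributions below are toy stand-ins for t_φ(· | x_<i), r_ω(· | x_>i), and the unigram q(·); none of the numbers come from trained models.

```python
VOCAB = ["bark", "chase", "cat"]
l2r = {"bark": 0.5, "chase": 0.4, "cat": 0.1}   # t_phi(w | x_<i), toy values
r2l = {"bark": 0.1, "chase": 0.7, "cat": 0.2}   # r_omega(w | x_>i), toy values
uni = {"bark": 0.2, "chase": 0.2, "cat": 0.6}   # q(w), toy unigram

def approx_posterior(l2r, r2l, q):
    """Eq. 5: unigram-scaled product of the two directional
    distributions, renormalised over the vocabulary."""
    unnorm = {w: l2r[w] * r2l[w] / q[w] for w in l2r}
    z = sum(unnorm.values())
    return {w: p / z for w, p in unnorm.items()}

post = approx_posterior(l2r, r2l, uni)       # the "UG" variant
uniform = {w: 1 / len(VOCAB) for w in VOCAB}
poe = approx_posterior(l2r, r2l, uniform)    # reduces to a standard PoE
print(post, poe)
```

With the uniform q, the q(w) terms cancel in the normalisation, so `poe` is exactly the renormalised elementwise product of the two directional distributions.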
Approximating the marginal. The approximation in Eq. 5 requires estimates of t_φ(x_i | x_<i) and r_ω(x_i | x_>i) from the left-to-right and right-to-left RNNGs, respectively, which necessitate expensive marginalisations over all possible tree prefixes y_<i and y_>i. Following prior work, we approximate this marginalisation using a one-best predicted tree ŷ(x) = arg max_{y ∈ Y(x)} s_ψ(y | x), where s_ψ(y | x) is parameterised by the transition-based parser of Fried et al. (2019), and Y(x) denotes the set of all possible trees for x. Formally:

t_φ(x_i | x_<i) ≈ t_φ(x_i | x_<i, ŷ_<i(x)),   (6)

where ŷ_<i(x) denotes the non-terminal symbols in ŷ(x) that occur before x_i. The marginal next-word distribution r_ω(x_i | x_>i) from the right-to-left RNNG is approximated similarly.
Preliminary Experiments. Before proceeding with the KD experiments, we assess the quality and feasibility of our approximation through preliminary language modelling experiments on the Penn Treebank (Marcus et al., 1993, PTB); full details are provided in Appendix A. We find that our approximation is faster than exact inference by a factor of more than 50,000, at the expense of a slightly worse average posterior negative log-likelihood (2.68, rather than 2.5 for exact inference).

Objective Function
In our structure distillation pretraining, we aim to find BERT parameters θ̂_KD that emulate our approximation of t_φ(x_i | x_<i, x_>i) through a word-level cross-entropy loss (Hinton et al., 2015; Kim and Rush, 2016; Furlanello et al., 2018, inter alia):

KD(x; θ) = − Σ_{i ∈ M(x)} Σ_{w ∈ Σ} t̂_φ(w | x_<i, x_>i) · log p_θ(w | c(x)),

where t̂_φ(w | x_<i, x_>i) denotes our approximation of the RNNG's posterior, as defined in Eqs. 5 and 6.
Interpolation. The RNNG teacher is an expert on syntax, although in practice it is only feasible to train it on a much smaller dataset. Hence, we want the BERT student to learn not only from the RNNG's syntactic expertise, but also from the rich common-sense and semantic knowledge contained in large text corpora, by virtue of predicting the true identity of the masked token x_i (note that the KD loss KD(x; θ) is defined independently of x_i), as done in the standard BERT setup. We thus interpolate the KD loss and the original BERT masked LM objective:

L(x; θ) = α · KD(x; θ) − (1 − α) · Σ_{i ∈ M(x)} log p_θ(x_i | c(x)),   (7)

omitting the next-sentence prediction loss for brevity. We henceforth set α = 0.5 unless stated otherwise.
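To make the interpolation concrete, here is a toy sketch (not the authors' code) of the per-position loss. Because cross-entropy is linear in the target distribution, interpolating the KD loss and the masked-LM loss with weight α is equivalent to training against a target that mixes the teacher posterior with a one-hot distribution on the true masked token; `student` and `teacher` below are invented example distributions.

```python
import math

def interpolated_loss(student, teacher, true_word, alpha=0.5):
    """Per-position interpolated objective: cross-entropy of the student
    against alpha * teacher + (1 - alpha) * one-hot(true_word)."""
    loss = 0.0
    for w, s in student.items():
        target = alpha * teacher.get(w, 0.0) + (1 - alpha) * (w == true_word)
        if target > 0:
            loss -= target * math.log(s)
    return loss

student = {"bark": 0.3, "chase": 0.6, "cat": 0.1}   # toy BERT predictions
teacher = {"bark": 0.1, "chase": 0.85, "cat": 0.05} # toy approximated posterior
loss = interpolated_loss(student, teacher, true_word="chase")
```

With α = 0 the loss reduces to the standard masked-LM cross-entropy on the true token; with α = 1 it reduces to pure distillation against the teacher.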

Experiments
Here we outline the evaluation setup, present our results, and discuss the implications of our findings.

Evaluation Tasks and Setup
We conjecture that the improved syntactic competence from our approach will benefit a broad range of tasks that involve structured output spaces, including tasks that are not explicitly syntactic. We thus evaluate our structure-distilled BERTs on six diverse structured prediction tasks spanning syntax, semantics, and coreference resolution, in addition to the GLUE benchmark, which is largely comprised of classification tasks.
Phrase-structure parsing - PTB. We first evaluate our model on phrase-structure parsing on the WSJ section of the PTB. Following prior work, we use sections 02-21 for training, section 22 for validation, and section 23 for testing. We apply our approach on top of the BERT-augmented in-order (Liu and Zhang, 2017) transition-based parser of Fried et al. (2019), which approaches the current state of the art. Since the RNNG teacher that we distill into BERT also employs phrase-structure trees, this setup is related to self-training (Yarowsky, 1995; Charniak, 1997; Zhou and Li, 2005; McClosky et al., 2006; Andor et al., 2016, inter alia).
Phrase-structure parsing - OOD. Still in the context of phrase-structure parsing, we evaluate how well our approach generalises to three out-of-domain (OOD) treebanks: Brown (Francis and Kučera, 1979), Genia (Tateisi et al., 2005), and the English Web Treebank (Petrov and McDonald, 2012). Following Fried et al. (2019), we test the PTB-trained parser on the test splits of these OOD treebanks without any retraining, to simulate the case where no in-domain labelled data are available. We use the same codebase as above.
Dependency parsing -PTB. Our third task is PTB dependency parsing with Stanford Dependencies (De Marneffe and Manning, 2008) v3.3.0. We use the BERT-augmented joint phrase-structure and dependency parser of Zhou and Zhao (2019), which is inspired by head-driven phrase-structure grammar (Pollard and Sag, 1994, HPSG).
Semantic role labelling. Our fourth evaluation task is span-based semantic role labelling (SRL) on the CoNLL 2012 dataset (Pradhan et al., 2013). We apply our approach on top of the BERT-augmented model of Shi and Lin (2019), as implemented in AllenNLP (Gardner et al., 2017).
Coreference resolution. Our fifth evaluation task is coreference resolution on the OntoNotes benchmark (Pradhan et al., 2012). For this task, we use the BERT-augmented model of Joshi et al. (2019), which extends a higher-order coarse-to-fine coreference model.
CCG Supertagging Probe. All the tasks proposed thus far necessitate either fine-tuning the entire BERT model, or training a task-specific model on top of the BERT embeddings. Hence, it remains unclear how much of the gains are due to better structural representations from our new pretraining strategy, rather than to the supervision available at the fine-tuning stage. To better understand the gains from our approach, we evaluate on combinatory categorial grammar (Steedman, 2000, CCG) supertagging (Bangalore and Joshi, 1999; Clark and Curran, 2007) through a classifier probe (Shi et al., 2016; Adi et al., 2017; Belinkov et al., 2017, inter alia), where no BERT fine-tuning takes place. CCG supertagging is a compelling probing task since it necessitates an understanding of bidirectional context; the per-word classification setup also lends itself well to classifier probes. Nevertheless, it remains unclear how much of the accuracy can be attributed to the information encoded in the representation, as opposed to the classifier probe itself. We thus adopt the control task protocol of Hewitt and Liang (2019), which assigns each word type to a random control category and thereby assesses the memorisation capacity of the classifier. In addition to probing accuracy, we report the probe selectivity (the difference between probing accuracy and control task accuracy), where higher selectivity denotes probes that faithfully rely on the linguistic knowledge encoded in the representation. We use linear classifiers to maintain high selectivity.
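A minimal sketch of the control-task bookkeeping described above (illustrative, not taken from the probing codebase): each word type receives a fixed random control category, and selectivity is the gap between probing accuracy on real supertags and accuracy on these control categories.

```python
import random

def control_tags(words, n_categories=10, seed=0):
    """Assign each word *type* a fixed random control category, following
    the control-task protocol of Hewitt and Liang (2019)."""
    rng = random.Random(seed)
    mapping = {}
    tags = []
    for w in words:
        if w not in mapping:
            mapping[w] = rng.randrange(n_categories)
        tags.append(mapping[w])
    return tags

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def selectivity(probe_acc, control_acc):
    """Higher selectivity: the probe relies on the representation's
    linguistic knowledge rather than on its own memorisation capacity."""
    return probe_acc - control_acc

# e.g. a linear probe at 93% on supertags but 60% on the control task:
tags = control_tags(["the", "dog", "the", "cat"])
sel = selectivity(0.93, 0.60)
```

Because control categories are assigned per word type, repeated occurrences of a word always receive the same category, so a probe can only succeed at the control task by memorising word identities.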
Commonality. All our structured prediction experiments are conducted on top of publicly available repositories of BERT-augmented models, with the exception of CCG supertagging that we evaluate as a probe. This setup means that obtaining our results is as simple as changing the pretrained BERT weights to our structure-distilled BERT, and applying the exact same steps as in the baseline.
GLUE. Beyond the six structured prediction tasks above, we evaluate our approach on the classification tasks of the GLUE benchmark, excluding the Winograd NLI (Levesque et al., 2012) for consistency with the original BERT (Devlin et al., 2019), and the semantic textual similarity benchmark (STS-B), which is formulated as a regression task. For fine-tuning on each GLUE task, we run a grid search over five potential learning rates, two batch sizes, and five random seeds (Appendix D), leading to 50 fine-tuning configurations that we run and evaluate on the validation set of each task.

Experimental Setup and Baselines
Here we describe the key aspects of our empirical setup, and outline the baselines for assessing the efficacy of our approach.
RNNG Teacher. We implement the subword-augmented RNNG teachers (§2) in DyNet (Neubig et al., 2017a), and obtain "silver-grade" phrase-structure annotations for the entire BERT training set using the transition-based parser of Fried et al. (2019). These trees are used to train the RNNG (§2), and to approximate its marginal next-word distribution at inference time (Eq. 6). We use the same WordPiece tokenisation and vocabulary as BERT-Cased; Appendix B summarises the complete list of RNNG hyper-parameters. Since our approximation (Eq. 5) makes use of a right-to-left RNNG, we train this variant (Appendix C) with the same hyper-parameters and data as the left-to-right model. We train each directional RNNG teacher on a shared subset of 3.6M sentences (∼3%) of the BERT training set with automatic batching (Neubig et al., 2017b), which takes three weeks on a V100 GPU.
BERT Student. We apply our structure distillation pretraining protocol to BERT_BASE-Cased, using the exact same training dataset, model configuration, WordPiece tokenisation, vocabulary, and hyper-parameters (Appendix D) as the standard pretrained BERT model (https://github.com/google-research/bert). We use BERT_BASE rather than BERT_LARGE to reduce the turnaround of our experiments, although our approach can easily be extended to BERT_LARGE. The sole exception is that we use a larger initial learning rate of 3e-4 based on preliminary experiments, which we apply to all models (including the no-distillation/standard BERT baseline) for fair comparison; we find this larger learning rate to perform better on most of our evaluation tasks, and Liu et al. (2019b) similarly found that tuning BERT's initial learning rate leads to better results.

Baselines and comparisons. We compare the following models in our experiments:
• A standard BERT_BASE-Cased without any structure distillation loss, which benefits from scalability but lacks syntactic biases ("No-KD");
• Four variants of structure-distilled BERTs that: (i) distill only the left-to-right RNNG ("L2R-KD"); (ii) distill only the right-to-left RNNG ("R2L-KD"); (iii) distill the RNNG's approximated marginal for generating x_i under the bidirectional context, where q(w) (Eq. 5) is the uniform distribution ("UF-KD"); and (iv) a variant like (iii), but where q(w) is the unigram distribution ("UG-KD"). All these BERT models crucially benefit from the syntactic biases of RNNGs, although only variants (iii) and (iv) learn from teacher distributions that consider bidirectional context when predicting x_i; and
• A BERT_BASE model that distills the approximated marginal for generating x_i under the bidirectional context, but from sequential LSTM teachers ("Seq-KD") in place of RNNGs. For fair comparison, we train the LSTM on the exact same subset as the RNNG, with a comparable number of model parameters; we use LSTMs rather than Transformers to facilitate fair comparison with RNNGs, which are also based on LSTM architectures. This baseline crucially isolates the importance of learning from hierarchical teachers, since it employs the exact same approximation technique and KD loss as the structure-distilled BERTs.
Learning curves. Given enough labelled data, BERT can acquire the relevant structural information from the fine-tuning (as opposed to pretraining) procedure, although better pretrained representations can nevertheless facilitate sample-efficient generalisations (Yogatama et al., 2019). We thus additionally examine the models' fine-tuning learning curves, as a function of varying amounts of training data, on phrase-structure parsing and SRL.
Random seeds. Since fine-tuning the same pretrained BERT with different random seeds can lead to varying results, we report the mean performance from three random seeds on the structured prediction tasks, and from five random seeds on GLUE.
Test results. To preserve the integrity of the test sets, we first report all performance on the validation set, and only report test set results for: (i) the No-KD baseline, and (ii) the best structure-distilled model on the validation set ("Best-KD").

Findings and Discussion
We report the validation and test results of the structured prediction tasks in Table 1. The validation set learning curves for phrase-structure parsing and SRL that compare the No-KD baseline and the UG-KD variant are provided in Fig. 2.
General discussion. We summarise several key observations from Table 1 and Fig. 2. • All four structure-distilled BERT models consistently outperform the No-KD baseline, including the L2R-KD and R2L-KD variants that only distill the syntactic knowledge of unidirectional RNNGs. Remarkably, this pattern holds true for all six structured prediction tasks. In contrast, we observe no such gains for the Seq-KD baseline, which largely performs worse than the No-KD model. We conclude that the gains afforded by our structure-distilled BERTs can be attributed to the hierarchical bias of the RNNG teacher.
• We conjecture that the surprisingly strong performance of the L2R-KD and R2L-KD models, which distill the knowledge of unidirectional RNNGs, can be attributed to the interpolated objective in Eq. 7 (α = 0.5). This interpolation means that the target distribution assigns a probability mass of at least 0.5 to the true masked word x i , which is guaranteed to be consistent with the bidirectional context. However, the syntactic knowledge contained in the unidirectional RNNGs' predictions can still provide a structurally informative learning signal, via the rest of the probability mass, for the BERT student.
• While all structure-distilled variants outperform the baseline, models that distill our approximation of the RNNG's distribution for words in bidirectional context (UF-KD and UG-KD) yield the best results on four out of six tasks (PTB phrase-structure parsing, SRL, coreference resolution, and the CCG supertagging probe). This finding confirms the efficacy of our approach.
• We observe the largest gains on the syntactic tasks, particularly phrase-structure parsing and CCG supertagging. However, the improvements are not at all confined to purely syntactic tasks: we reduce relative error from strong BERT baselines by 2.2% and 3.6% on SRL and coreference resolution, respectively. While the RNNG's syntactic biases are derived from phrase-structure grammar, the strong improvement on CCG supertagging, in addition to the smaller improvement on dependency parsing, suggests that the RNNG's syntactic biases generalise well across different syntactic formalisms.

Figure 2: Fine-tuning learning curves examining how the number of fine-tuning instances (from 5% to 100% of the full training sets) affects validation set F1 for phrase-structure parsing and SRL. We compare the "No-KD"/standard BERT_BASE-Cased and the "UG-KD" structure-distilled BERT.
• We observe larger improvements in a low-resource scenario, where the model is exposed to fewer fine-tuning instances (Fig. 2), suggesting that syntactic biases help enable more sample-efficient generalisation. This pattern holds for both tasks that we investigated: phrase-structure parsing (syntactic) and SRL (not explicitly syntactic). With only 5% of the fine-tuning data, the UG-KD model improves F1 from 79.9 to 80.6 for SRL (a 3.5% relative error reduction over the No-KD baseline, as opposed to 2.2% on the full data). For phrase-structure parsing, the UG-KD model achieves 93.68 F1 (a 16% relative error reduction, as opposed to 7.6% on the full data) with only 5% of the PTB; this performance is notably better than that of state-of-the-art parsers trained on the full PTB circa 2017 (Kuncoro et al., 2017).
GLUE results and discussion. We report the GLUE validation and test results in Table 2. Since we observe a different pattern of results on the Corpus of Linguistic Acceptability (Warstadt et al., 2018, CoLA) than on the rest of GLUE, we henceforth report: (i) the CoLA results, (ii) the 7-task average that excludes CoLA, and (iii) the average across all 8 tasks. We select the UG-KD model since it achieved the best 8-task average on the GLUE validation sets; the full GLUE breakdown for these two models is provided in Appendix E.

Table 2: GLUE validation and test results. The validation results are derived from the average of five random seeds for each task, which accounts for variance, and the 1-best random seed, which does not. The test results are derived from the 1-best random seed on the validation set.

The results on GLUE provide an interesting contrast to the consistent improvement we observed on the structured prediction tasks. More concretely, our UG-KD model outperforms the baseline on CoLA, but performs slightly worse on the other GLUE tasks in aggregate, leading to a slightly lower overall test set accuracy (80.0 for the UG-KD, as opposed to 80.3 for the No-KD baseline). The improvement on the syntax-sensitive CoLA provides additional evidence, beyond the improvement on the syntactic tasks (Table 1), that our approach indeed yields improved syntactic competence. We conjecture that these improvements do not transfer to the other GLUE tasks because they rely more on lexical and semantic properties, and less on syntactic competence (McCoy et al., 2019).
We defer a more thorough investigation of how much syntactic competence is necessary for solving most of the GLUE tasks to future work, but make two remarks. First, the findings on GLUE are consistent with the hypothesis that our approach yields improved structural competence, albeit at the expense of a slightly less rich meaning representation, which we attribute to the smaller dataset used to train the RNNG teacher. Second, human-level natural language understanding includes the ability to predict structured outputs, e.g. to decipher "who did what to whom" (SRL). Succeeding at these tasks necessitates inference over structured output spaces, which (unlike most of GLUE) cannot be reduced to a single classification decision. Our findings indicate a partial dissociation between model performance on these two types of tasks; hence, supplementing GLUE evaluation with some of these structured prediction tasks can offer a more holistic assessment of progress in NLU.
CCG probe example. The CCG supertagging probe is a particularly interesting test bed, since it clearly assesses the model's ability to use contextual information in making its predictions, without introducing additional confounds from the BERT fine-tuning procedure. We thus provide a representative example of four different BERT variants' predictions on the CCG supertagging probe in Table 3, based on which we discuss two observations. First, the models make different predictions: the No-KD and L2R-KD models produce (coincidentally the same) incorrect prediction, while the R2L-KD and UG-KD models predict the correct supertag. This finding suggests that different teacher distributions impose different biases on the BERT students. Second, the mistakes of the No-KD and L2R-KD BERTs belong to the broader category of challenging argument-adjunct distinctions (Palmer et al., 2005). Here both models fail to subcategorise for the prepositional phrase (PP) "as screens", which serves as an argument of the verb "use", as opposed to the noun phrase "TV sets". Distinguishing between these two potential dependencies naturally requires syntactic information from the right context; hence the R2L-KD BERT, which is trained to emulate the predictions of an RNNG teacher that observes the right context, is able to make the correct prediction. This advantage is crucially retained by the UG-KD model that distills the RNNG's approximate distribution over words in bidirectional context (Eq. 5), and further confirms the efficacy of our proposed approach.

Limitations
We outline two limitations of our approach. First, we assume the existence of decent-quality "silver-grade" phrase-structure trees to train the RNNG teacher. While this assumption holds true for English due to the existence of accurate phrase-structure parsers, this is not necessarily the case for other languages. Second, pretraining the BERT student in our naïve implementation is about half as fast on TPUs compared to the baseline due to an I/O bottleneck. This overhead only applies at pretraining, and can be reduced through parallelisation.

Table 3: An example of the CCG supertag predictions for the verb "use" from four different BERT variants. The correct answer is "((S[b]\NP)/PP)/NP", which both the R2L-KD and UG-KD predict correctly (blue). However, the No-KD baseline and the L2R-KD model produce (the same) incorrect predictions (red); both models fail to subcategorise the prepositional phrase "as screens" as a dependent of the verb "use". Beyond this, all four models predict the correct supertags for all other words (not shown).

Related Work
Earlier work has proposed several ways of introducing notions of hierarchical structure into BERT, for instance by designing structurally motivated auxiliary losses, or by including syntactic information in the embedding layers that serve as inputs for the Transformer (Sundararaman et al., 2019). In contrast, we employ a different technique for injecting syntactic biases, based on an earlier structure distillation technique, although our work features two key differences. First, that earlier work put a sole emphasis on cases where both the teacher and student models are autoregressive, left-to-right LMs; here we extend this objective to the case where the student model is a representation learner with access to bidirectional context. Second, it evaluated structure-distilled LMs only in terms of perplexity and grammatical judgment (Marvin and Linzen, 2018). In contrast, we evaluate our structure-distilled BERT models on 6 diverse structured prediction tasks and the GLUE benchmark. It remains an open question whether, and how much, syntactic biases are helpful for a broader range of NLU tasks beyond grammatical judgment; our work represents a step towards answering this question.
More recently, substantial progress has been made in improving the performance of BERT and the broader class of masked LMs (Lan et al., 2019; Liu et al., 2019b; Raffel et al., 2019; Sun et al., 2020, inter alia). Our structure distillation technique is orthogonal to, and can be applied on top of, these approaches. Lastly, our findings on the benefits of syntactic knowledge for structured prediction tasks that are not explicitly syntactic in nature, such as SRL and coreference resolution, are consistent with those of prior work (Swayamdipta et al., 2018; Strubell et al., 2018, inter alia).

Conclusion
Given the remarkable success of textual representation learners trained on large amounts of data, it remains an open question whether syntactic biases are still relevant for models that work well at scale. Here we present evidence in the affirmative: our structure-distilled BERT models outperform the baseline on a diverse set of 6 structured prediction tasks. We achieve this through a new pretraining strategy that enables the BERT student to learn from the predictions of an explicitly hierarchical, but much less scalable, RNNG teacher model. Since the BERT student is a bidirectional model that estimates the conditional probabilities of masked words in context, we propose to distill an efficient yet surprisingly effective approximation of the RNNG's estimate for generating each word conditional on its bidirectional context.
Our findings suggest that syntactic inductive biases are beneficial for a diverse range of structured prediction tasks, including tasks that are not explicitly syntactic in nature. In addition, these biases are particularly helpful for improving fine-tuning sample efficiency on downstream tasks. Lastly, our findings motivate the broader question of how we can design models that integrate stronger notions of structural bias, and yet remain easily scalable, as a promising (if relatively underexplored) direction for future research.

t_LSTM(x_i | x_<i) can be computed exactly. This setup crucially enables us to isolate the impact of approximating the posterior distribution over x_i under the bidirectional context (Eq. 2) with our proposed approximation (Eq. 5), without introducing further confounds stemming from the RNNG's marginal approximation procedure (Eq. 6).
Dataset and preprocessing. We train the LSTM LM on an open-vocabulary version of the PTB, in order to simulate the main experimental setup where both the RNNG teacher and BERT student are also open-vocabulary by virtue of byte-pair encoding (BPE) preprocessing. To this end, we preprocess the dataset with SentencePiece (Kudo and Richardson, 2018) BPE tokenisation, where |Σ| = 8,000; we preserve all case information. We follow the empirical setup of the parsing experiments (§4.1), with Sections 02-21 for training, Section 22 for validation, and Section 23 for testing.
Model hyper-parameters. We train the LM with 2 LSTM layers, 250 hidden units per layer, and a dropout (Srivastava et al., 2014) rate of 0.2. Model parameters are optimised with stochastic gradient descent (SGD), with an initial learning rate of 0.25 that is decayed exponentially by a factor of 0.92 for every epoch after the tenth. Since our approximation relies on a separately trained right-to-left LM (Eq. 5), we train this variant with the exact same hyper-parameters and dataset split as the left-to-right model.
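For concreteness, the learning rate schedule above can be read as the following sketch; the function name and the exact epoch indexing are our assumptions.

```python
def lstm_lm_lr(epoch: int, base_lr: float = 0.25, decay: float = 0.92) -> float:
    """SGD learning rate for the LSTM LM: held at base_lr for the first
    ten epochs, then multiplied by `decay` once for every epoch thereafter."""
    return base_lr * decay ** max(0, epoch - 10)
```

Under this reading, epochs 1-10 use the initial learning rate of 0.25, and each subsequent epoch shrinks it by a factor of 0.92.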
Evaluation and baselines. We evaluate the models in terms of the average posterior negative log likelihood (NLL) and perplexity; in practice, this perplexity is derived by simply exponentiating the average posterior NLL. Since exact inference of the posterior is expensive, we evaluate the models only on the first 400 sentences of the test set. Note that our open-vocabulary setup means that our results are not directly comparable to prior work on PTB language modelling (Mikolov et al., 2010, inter alia), which mostly employs a special "UNK" token for infrequent or unknown words. We compare the following variants:
• a mixture of experts baseline ("MoE") that simply mixes (α = 0.5) the probabilities from the left-to-right and right-to-left LMs in an additive fashion, as opposed to the multiplicative combination of our PoE-like approximation in Eq. 5;
• our approximation of the posterior over x_i (Eq. 5), where q(w) is the uniform distribution ("Uniform Approx.");

Table 4: Findings from the preliminary experiments that assess the quality of our posterior approximation procedure, reporting the posterior NLL and posterior perplexity of each model. We compare three variants against exact inference (bottom row; Eq. 2) from the left-to-right model.
• our approximation of the posterior over x_i (Eq. 5), but where q(w) is the unigram distribution ("Unigram Approx."); and
• exact inference of the posterior as computed from the left-to-right model, as defined in Eq. 2 ("Exact Inference").
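To make the contrast between the additive (MoE) and multiplicative (PoE-like) combinations concrete, here is a minimal sketch, assuming Eq. 5 takes the form p_l2r(w) * p_r2l(w) / q(w) renormalised over the vocabulary; the probability vectors in the usage example are toy stand-ins, not trained LM outputs.

```python
def moe_posterior(p_l2r, p_r2l, alpha=0.5):
    """Mixture of experts: additive combination of the left-to-right
    and right-to-left LM probabilities for the word in context."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_l2r, p_r2l)]

def poe_posterior(p_l2r, p_r2l, q):
    """PoE-like combination in the spirit of Eq. 5: multiply the two
    directional probabilities, divide by the prior q(w) (uniform or
    unigram), then renormalise over the vocabulary."""
    scores = [a * b / qw for a, b, qw in zip(p_l2r, p_r2l, q)]
    z = sum(scores)
    return [s / z for s in scores]
```

For a toy 3-word vocabulary with p_l2r = [0.7, 0.2, 0.1] and p_r2l = [0.6, 0.3, 0.1], the multiplicative combination concentrates considerably more mass on the word that both directional LMs agree on than the additive one does.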
Discussion. We summarise the findings in Table 4, based on which we make two observations. First, the posterior NLL of our approximation procedure that makes use of the unigram distribution (Unigram Approx.; third row) is not much worse than that of exact inference, in exchange for a more than 50,000-fold speedup in computation time. Nevertheless, using the uniform distribution (second row) for q(w) in place of the unigram one (Eq. 5) results in a much worse posterior NLL. Second, combining the left-to-right and right-to-left LMs using a mixture of experts, a baseline that is not well-motivated by our theoretical analysis, yields the worst result.

B RNNG Hyper-parameters
To train the subword-augmented RNNG teacher ( §2), we use the following hyper-parameters that achieve the best validation perplexity from a grid search: 2-layer stack LSTMs (Dyer et al., 2015) with 512 hidden units per layer, optimised by standard SGD with an initial learning rate of 0.5 that is decayed exponentially by a factor of 0.9 after the tenth epoch. We apply a dropout rate of 0.3.

C Right-to-left RNNG
Here we illustrate the oracle action sequences that we use to train the right-to-left RNNG teacher as part of our approximation of the posterior distribution over x_i (Eq. 5). Recall that the standard RNNG incrementally builds the phrase-structure tree through a top-down, left-to-right traversal in a depth-first fashion. Hence, the right-to-left RNNG employs a similar top-down, depth-first traversal strategy, although the children of each node are recursively expanded in a right-to-left fashion. We provide example action sequences (Table 6) for the subword-augmented left-to-right and right-to-left RNNGs, respectively, for the example tree "(S (NP (WORD The) (WORD d ##og)) (VP (WORD ba ##rk ##s)))", where tokens prefixed by "##" are subword units.

Table 5: Summary of the full results on GLUE, comparing the No-KD baseline with the UG-KD structure-distilled BERT (§4.2). We select the 1-best fine-tuning hyper-parameter configuration (including random seed) on the validation set, which we then evaluate on the test set.
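The right-to-left traversal described above can be sketched as a simple recursion over trees; the tree encoding, the function name, and the choice to also reverse subword order within a WORD node are our assumptions rather than details from the text.

```python
def oracle_actions(tree, right_to_left=False):
    """Top-down, depth-first oracle for a (subword-augmented) RNNG.

    A nonterminal (including a WORD node grouping subword units) is a
    (label, children) pair; a subword token is a plain string. For the
    right-to-left oracle, the children of every node are expanded in
    reverse order (here including the subwords within a WORD node).
    """
    if isinstance(tree, str):
        return ["GEN(%s)" % tree]
    label, children = tree
    if right_to_left:
        children = list(reversed(children))
    actions = ["NT(%s)" % label]
    for child in children:
        actions.extend(oracle_actions(child, right_to_left))
    actions.append("REDUCE")
    return actions

# The example tree from the text:
# (S (NP (WORD The) (WORD d ##og)) (VP (WORD ba ##rk ##s)))
tree = ("S", [("NP", [("WORD", ["The"]),
                      ("WORD", ["d", "##og"])]),
              ("VP", [("WORD", ["ba", "##rk", "##s"])])])
```

Under this sketch, the left-to-right oracle opens with NT(S), NT(NP), NT(WORD), GEN(The), whereas the right-to-left oracle descends into the VP first; both use the same multiset of actions.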

D BERT Hyper-parameters
Here we outline the hyper-parameters of the BERT student in terms of pretraining data creation, masked LM pretraining, and GLUE fine-tuning.
Pretraining data creation. We use the same codebase (https://github.com/google-research/bert) and pretraining data as Devlin et al. (2019), which are derived from a mixture of Wikipedia and Books text corpora. To train our structure-distilled BERTs, we sample maskings from these corpora following the same hyper-parameters used to train the original BERT BASE -Cased model: a maximum sequence length of 512, a per-word masking probability of 0.15 (up to a maximum of 76 masked tokens in a 512-length sequence), and a dupe factor of 10, with a random seed of 12345. We preprocess the raw dataset using NLTK tokenisers, and then apply the same (BPE-augmented) vocabulary and WordPiece tokenisation as in the original BERT model. All other hyper-parameters are set to the same values as in the publicly released original BERT model.
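A simplified sketch of the masking step is below; this is a stand-in for the actual BERT data-creation pipeline (it omits the 80/10/10 mask-replacement rule and document packing, and the function name is ours).

```python
import random

def sample_masked_positions(tokens, mask_prob=0.15, max_masked=76, seed=12345):
    """Sample positions to mask: roughly `mask_prob` of the tokens,
    capped at `max_masked` per sequence (76 for a 512-length sequence)."""
    rng = random.Random(seed)
    n_to_mask = min(max_masked, max(1, round(len(tokens) * mask_prob)))
    return sorted(rng.sample(range(len(tokens)), n_to_mask))
```

For a full 512-token sequence, round(512 * 0.15) = 77 exceeds the cap, so exactly 76 positions are masked.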
Masked LM pretraining. We train all model variants (including the no-distillation/standard BERT baseline, for fair comparison) with the following hyper-parameters: a batch size of 256 sequences and an initial Adam learning rate of 3e-4 (as opposed to 1e-4 in the original BERT model). Following Devlin et al. (2019), we pretrain our models for 1M steps. All other hyper-parameters are similarly set to their default values.
GLUE fine-tuning. For each GLUE task, we fine-tune the BERT model by running a grid search over five potential learning rates {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, two potential batch sizes {16, 32}, and five random seeds, in order to better account for variance. This setup leads to 50 fine-tuning configurations for each GLUE task. Following Devlin et al. (2019), we train each fine-tuning configuration for 4 epochs.
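The grid above can be enumerated directly; the seed values below are placeholders, since the text does not specify them.

```python
from itertools import product

learning_rates = [5e-6, 1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
seeds = [0, 1, 2, 3, 4]  # placeholder seed values

# 5 learning rates x 2 batch sizes x 5 seeds = 50 configurations per task.
configs = list(product(learning_rates, batch_sizes, seeds))
```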
Structured prediction fine-tuning. For each structured prediction model, we use the model's default BERT fine-tuning settings for learning rate, batch size, and learning rate warmup schedule. We use the default settings for BERT BASE where available, and the default settings for BERT LARGE otherwise. These settings are:
• In-order phrase-structure parser: a BERT learning rate of 2e-5, a batch size of 32, and a warmup period of 160 updates.
• HPSG dependency parser: a BERT learning rate of 5e-5, a batch size of 150, and a warmup period of 160 updates.
• Coreference resolution model: a BERT learning rate of 1e-5, a batch size of 1 document, and a warmup period of 2 epochs.
• Semantic role labelling model: a BERT learning rate of 5e-5 and a batch size of 32.
Comparison. Our no-distillation baseline differs from the publicly released BERT BASE -Cased model in its larger pretraining learning rate (3e-4 as opposed to 1e-4), which we empirically find to work better on most of the tasks. Overall, our no-distillation baseline slightly outperforms the publicly released model on all the structured prediction tasks except coreference resolution, where it performs slightly worse. Furthermore, our no-distillation baseline also performs slightly better than the official pretrained BERT BASE on most of the GLUE tasks, although the difference in aggregate GLUE performance is fairly minimal (< 0.5).

E Full GLUE Results
We summarise the full GLUE results for the No-KD baseline and the UG-KD structure-distilled BERT in Table 5.