Abstract
Textual representation learners trained on large amounts of data have achieved notable success on downstream tasks; intriguingly, they have also performed well on challenging tests of syntactic competence. Given this success, it remains an open question whether scalable learners like BERT can become fully proficient in the syntax of natural language by virtue of data scale alone, or whether they still benefit from more explicit syntactic biases. To answer this question, we introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining, by distilling the syntactically informative predictions of a hierarchical—albeit harder to scale—syntactic language model. Since BERT models masked words in bidirectional context, we propose to distill the approximate marginal distribution over words in context from the syntactic LM. Our approach reduces relative error by 2–21% on a diverse set of structured prediction tasks, although we obtain mixed results on the GLUE benchmark. Our findings demonstrate the benefits of syntactic biases, even for representation learners that exploit large amounts of data, and contribute to a better understanding of where syntactic biases are helpful in benchmarks of natural language understanding.
1 Introduction
Large-scale textual representation learners trained with variants of the language modeling (LM) objective have achieved remarkable success on downstream tasks (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019). Furthermore, these models have also been shown to perform remarkably well at syntactic grammaticality judgment tasks (Goldberg, 2019), and encode substantial amounts of syntax in their learned representations (Liu et al., 2019a; Tenney et al., 2019a,b; Hewitt and Manning, 2019; Jawahar et al., 2019). Intriguingly, success on these syntactic tasks has been achieved by Transformer architectures (Vaswani et al., 2017) that lack explicit notions of hierarchical syntactic structures.
Based on such evidence, it would be tempting to conclude that data scale alone is all we need to learn the syntax of natural language. Nevertheless, recent findings that systematically compare the syntactic competence of models trained at varying data scales suggest that model inductive biases are in fact more important than data scale for acquiring syntactic competence (Hu et al., 2020). Two natural questions, therefore, are the following: Can representation learners that work well at scale still benefit from explicit syntactic biases? And where exactly would such syntactic biases be helpful in different language understanding tasks? Here we work towards answering these questions by devising a new pretraining strategy that injects syntactic biases into a BERT (Devlin et al., 2019) learner that works well at scale. We hypothesize that this approach can improve the competence of BERT on various tasks, which provides evidence for the benefits of syntactic biases in large-scale models.
Our approach is based on the prior work of Kuncoro et al. (2019), who devised an effective knowledge distillation (KD; Bucilǎ et al., 2006; Hinton et al., 2015) procedure for improving the syntactic competence of scalable LMs that lack explicit syntactic biases. More concretely, their KD procedure utilized the predictions of an explicitly hierarchical (albeit hard to scale) syntactic LM, recurrent neural network grammars (RNNGs; Dyer et al., 2016) (§2) as a syntactically informed learning signal for a sequential LM that works well at scale.
Our setup nevertheless presents a new challenge: Here the BERT student is a denoising autoencoder that models a collection of conditionals for words in bidirectional context, while the RNNG teacher is an autoregressive LM that predicts words in a left-to-right fashion, that is tϕ(xi|x<i). This mismatch crucially means that the RNNG’s estimate of tϕ(xi|x<i) may fail to take into account the right context x>i that is accessible to the BERT student (§3). Hence, we propose an approach where the BERT student distills the RNNG’s marginal distribution over words in context, tϕ(xi|x<i,x>i). We develop an efficient yet effective approximation for this quantity, since exact inference is expensive owing to the RNNG’s left-to-right parameterization.
Our structure-distilled BERT model differs from the standard BERT model only in its pretraining objective, and thus retains the scalability afforded by Transformer architectures and specialized hardware like TPUs. In fact, our approach maintains compatibility with standard BERT pipelines; the structure-distilled BERT models can simply be loaded as pretrained BERT weights, which can then be fine-tuned in the exact same fashion.
We hypothesize that the stronger syntactic biases from our new pretraining procedure are useful for a variety of natural language understanding (NLU) tasks that involve structured output spaces—including tasks like semantic role labeling (SRL) and coreference resolution that are not explicitly syntactic in nature. We thus evaluate our models on six diverse structured prediction tasks, including phrase-structure parsing (in-domain and out-of-domain), dependency parsing, SRL, coreference resolution, and a combinatory categorial grammar (CCG) supertagging probe, in addition to the GLUE benchmark (Wang et al., 2019). On the structured prediction tasks, our structure-distilled BERTBASE reduces relative error by 2% to 21%. These gains are more pronounced in the low-resource scenario, suggesting that stronger syntactic biases help improve sample efficiency (§4).
Despite the gains on the structured prediction tasks, we achieve mixed results on GLUE: Our approach yields improvements on the corpus of linguistic acceptability (Warstadt et al., 2018, CoLA), but performs slightly worse on the rest of GLUE. These findings allude to a partial dissociation between model performance on GLUE, and on structured prediction benchmarks of NLU.
Altogether, our findings: (i) showcase the benefits of syntactic biases, even for representation learners that leverage large amounts of data, (ii) help better understand where syntactic biases are most helpful, and (iii) make a case for designing approaches that not only work well at scale, but also integrate stronger notions of syntactic biases.
2 Recurrent Neural Network Grammars
Here we briefly describe the RNNG (Dyer et al., 2016) that we use as the teacher model. An RNNG is a syntactic LM that defines the joint probability of surface strings x and phrase-structure nonterminals y, henceforth denoted as tϕ(x,y), through a series of structure-building actions that traverse the tree in a top–down, left-to-right fashion. Let N and Σ denote the set of phrase-structure non-terminals and word terminals, respectively. At each time step, the decision over the next action at ∈{NT(n),GEN(w),REDUCE}, where n ∈ N and w ∈ Σ, is parameterized by a stack LSTM (Dyer et al., 2015) that encodes partial constituents. The choice of at yields these transitions:
- •
at ∈{NT(n),GEN(w)} would push the corresponding non-terminal or word embedding (en or ew) onto the stack;
- •
at = REDUCE would pop the top k elements up to the last incomplete non-terminal, compose these elements with a separate bidirectional LSTM, and lastly push the composite phrase embedding ephrase back onto the stack. The hierarchical inductive bias of RNNGs can be attributed to this composition function,1 which recursively combines smaller units into larger ones.
RNNGs attempt to maximize the probability of correct action sequences relative to each gold tree.2
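As a concrete illustration of these transitions, the following is a minimal Python sketch of the stack dynamics; the class names and the mean-pooling stand-in for the bidirectional-LSTM composition function are simplifications for exposition, not the authors' implementation.

```python
# Illustrative sketch of RNNG stack transitions (NT, GEN, REDUCE).
# The real model parameterizes action and word choices with a stack LSTM and
# composes constituents with a bidirectional LSTM; here plain lists and
# mean-pooling are used purely to show the control flow. All names are hypothetical.
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class Open:               # an open non-terminal, e.g. "(NP"
    label: str
    emb: np.ndarray


@dataclass
class Closed:             # a completed constituent or a generated word
    emb: np.ndarray


@dataclass
class RNNGState:
    stack: List[object] = field(default_factory=list)

    def nt(self, label: str, e_n: np.ndarray):
        self.stack.append(Open(label, e_n))           # push open non-terminal

    def gen(self, e_w: np.ndarray):
        self.stack.append(Closed(e_w))                # push word embedding

    def reduce(self):
        children = []
        while not isinstance(self.stack[-1], Open):   # pop up to last open NT
            children.append(self.stack.pop().emb)
        nt = self.stack.pop()
        # Stand-in for the bidirectional-LSTM composition function:
        e_phrase = np.mean([nt.emb] + children[::-1], axis=0)
        self.stack.append(Closed(e_phrase))           # push composite phrase


# Example: building "(NP the dog)" with 4-dimensional toy embeddings.
state = RNNGState()
state.nt("NP", np.ones(4))
state.gen(np.full(4, 0.5))    # "the"
state.gen(np.full(4, 0.2))    # "dog"
state.reduce()
print(len(state.stack))       # 1 closed constituent remains on the stack
```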
Extension to Subwords.
Here we extend the RNNG to operate over subword units (Sennrich et al., 2016) to enable compatibility with the BERT student. As each word can be split into an arbitrary-length sequence of subwords, we preprocess the phrase-structure trees to include an additional nonterminal symbol that groups the subword units of each word, as illustrated by the example “(S (NP (WORD the) (WORD d ##og)) (VP (WORD ba ##rk ##s)))”, where tokens prefixed by “##” are subword units.3
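A minimal sketch of this preprocessing step, assuming a bracketed tree with whole-word terminals and a WordPiece-style tokenizer; the `tokenize` stub below is a hypothetical placeholder rather than BERT's actual tokenizer.

```python
import re


def tokenize(word):
    """Placeholder WordPiece-style tokenizer: splits a word into subword
    units, prefixing continuation pieces with '##'. A real setup would call
    the BERT WordPiece tokenizer instead."""
    pieces = {"dog": ["d", "##og"], "barks": ["ba", "##rk", "##s"]}
    return pieces.get(word, [word])


def add_word_nonterminals(tree: str) -> str:
    """Wrap each terminal in a WORD non-terminal spanning its subword units."""
    out = []
    for token in re.findall(r"\(|\)|[^\s()]+", tree):
        if token in "()" or token.isupper():   # crude non-terminal test; fine for this toy example
            out.append(token)
        else:                                  # a word terminal
            out.append("(WORD " + " ".join(tokenize(token)) + ")")
    return " ".join(out).replace("( ", "(").replace(" )", ")")


print(add_word_nonterminals("(S (NP the dog) (VP barks))"))
# -> (S (NP (WORD the) (WORD d ##og)) (VP (WORD ba ##rk ##s)))
```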
3 Approach
We begin with a brief review of the BERT objective, before outlining our structure distillation approach.
3.1 BERT Pretraining Objective
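The objective that Eq. 1 refers to elsewhere in the paper is the standard masked LM loss; a hedged LaTeX sketch, written with the corruption function c(·) and the reconstruction targets M(x) described in the notes, is:

```latex
% Standard masked LM objective (hedged sketch of what Eq. 1 denotes):
% c(x) is the corrupted input and M(x) the indices of the masked/altered tokens.
\mathcal{L}_{\mathrm{MLM}}(\theta)
  = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}
    \Bigg[ \sum_{i \in M(\mathbf{x})} -\log p_{\theta}\big(x_i \mid c(\mathbf{x})\big) \Bigg]
```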
3.2 Motivation
Because the RNNG teacher is an expert on syntactic generalizations (Kuncoro et al., 2018; Futrell et al., 2019; Wilcox et al., 2019), we adopt a structure distillation procedure (Kuncoro et al., 2019) that enables the BERT student to learn from the RNNG’s syntactically informative predictions. Our setup nevertheless means that the two models here crucially differ in nature: The BERT student is not a left-to-right LM like the RNNG, but rather a denoising autoencoder that models a collection of conditionals for words in bidirectional context (Eq. 1).
We now present two strategies for dealing with this challenge. The first, naïve approach is to ignore this difference, and let the BERT student distill the RNNG’s marginal next-word distribution for each w ∈ Σ based on the left context alone, that is tϕ(w|x<i). Although this approach is surprisingly effective (§4.3), we illustrate an issue in Figure 1 for “The dogs by the window [MASK=chase] the cat”.
The RNNG’s strong syntactic biases mean that we can expect tϕ(w | The dogs by the window) to assign high probabilities to plural verbs like bark, chase, fight, and run that are consistent with the agreement controller dogs—despite the presence of a singular attractor (Linzen et al., 2016), window, that can distract the model into predicting singular verbs like chases. Nevertheless, some plural verbs that are favored based on the left context alone, such as bark and run, are in fact poor alternatives when considering the right context (e.g., “The dogs by the window bark/run the cat” are syntactically illicit). Distilling tϕ(w|x<i) thus fails to take into account the right context x>i that is accessible to the BERT student, and runs the risk of encouraging the student to assign high probabilities to words that fit poorly with the bidirectional context.
Hence, our second approach is to learn from teacher distributions that not only: (i) reflect the strong syntactic biases of the RNNG teacher, but also (ii) consider both the left and right context when predicting w ∈ Σ. Formally, we propose to distill the RNNG’s marginal distribution over words in bidirectional context, tϕ(w|x<i,x>i), henceforth referred to as the posterior probability for generating w under all available information. We now demonstrate that this quantity can, in fact, be computed from left-to-right LMs like RNNGs.
3.3 Posterior Inference
Intuition.
After cancelling the common factor tϕ(x<i), the posterior computation in Eq. 2 decomposes into two terms: (i) the likelihood of producing xi given its prefix, tϕ(xi|x<i); and (ii) conditional on having generated xi and its prefix x<i, the likelihood of producing the observed continuations x>i, tϕ(x>i|xi,x<i). In our running example (Figure 1), the posterior would assign low probabilities to plural verbs like bark that are nevertheless probable under the left context alone (i.e., tϕ(bark | The dogs by the window) would be high), because they are unlikely to generate the continuations x>i (i.e., we expect tϕ(the cat | The dogs by the window bark) to be low because it is syntactically illicit). In contrast, the posterior would assign high probabilities to plural verbs like fight and chase that are consistent with the bidirectional context, because we expect both tϕ(fight | The dogs by the window) and tϕ(the cat | The dogs by the window fight) to be probable.
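Written out, the decomposition described above takes the following form (a hedged reconstruction of what Eq. 2 expresses, obtained from Bayes' rule under the left-to-right factorization and normalizing over all candidate words w ∈ Σ):

```latex
% Posterior over the masked word under bidirectional context (hedged
% reconstruction of Eq. 2), after cancelling the common factor t_\phi(x_{<i}):
t_{\phi}(x_i = w \mid \mathbf{x}_{<i}, \mathbf{x}_{>i})
  = \frac{t_{\phi}(w \mid \mathbf{x}_{<i}) \; t_{\phi}(\mathbf{x}_{>i} \mid w, \mathbf{x}_{<i})}
         {\sum_{w' \in \Sigma} t_{\phi}(w' \mid \mathbf{x}_{<i}) \; t_{\phi}(\mathbf{x}_{>i} \mid w', \mathbf{x}_{<i})}
```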
Computational Cost.
Let k denote the maximum length of x. Our KD approach requires computing the posterior distribution (Eq. 2) for every masked token xi in the dataset D, which (excluding the marginalization cost over y) necessitates O(|Σ| · k · |D|) operations, where each operation returns the RNNG’s estimate of tϕ(xj|x<j). In the standard BERT setup,6 this procedure leads to a prohibitive number of operations (∼5 × 10^16).
3.4 Posterior Approximation
Notably, our proposed approach here is a general one; it can approximate the posterior over xi from any left-to-right LM, which can be used as a learning signal for BERT through KD, irrespective of the LM’s parameterization. It does, however, necessitate a separately trained right-to-left LM.
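Concretely, the approximation referred to as Eq. 5 can be sketched as follows; this is a hedged reconstruction consistent with the product-of-experts connection discussed below, where t→ and t← denote the left-to-right and right-to-left LMs and q(w) is a unigram distribution over the vocabulary:

```latex
% Hedged sketch of the posterior approximation (Eq. 5): a product of the
% left-to-right and right-to-left next-word distributions, corrected by
% unigram terms q(w) and renormalized over the vocabulary.
t_{\phi}(x_i = w \mid \mathbf{x}_{<i}, \mathbf{x}_{>i})
  \;\approx\;
  \frac{t_{\rightarrow}(w \mid \mathbf{x}_{<i}) \; t_{\leftarrow}(w \mid \mathbf{x}_{>i}) \, / \, q(w)}
       {\sum_{w' \in \Sigma} t_{\rightarrow}(w' \mid \mathbf{x}_{<i}) \; t_{\leftarrow}(w' \mid \mathbf{x}_{>i}) \, / \, q(w')}
```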
Connection to a Product of Experts.
Eq. 5 has a similar form to a product of experts (PoE; Hinton, 2002) between the left-to-right and right-to-left RNNGs’ next-word distributions, albeit with extra unigram terms q(w). If we replace the unigram distribution with a uniform one, namely q(w) = 1/|Σ| for all w ∈ Σ, Eq. 5 reduces to a standard PoE.
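A minimal NumPy sketch of how such a distillation target could be assembled from the two directional teachers is given below, covering both the uniform and unigram choices of q(w) (the variants later referred to as UF-KD and UG-KD); it illustrates the combination rule under the assumptions above and is not the authors' code.

```python
from typing import Optional

import numpy as np


def poe_posterior(p_l2r: np.ndarray,
                  p_r2l: np.ndarray,
                  q: Optional[np.ndarray] = None,
                  eps: float = 1e-12) -> np.ndarray:
    """Combine left-to-right and right-to-left next-word distributions over
    the vocabulary into an approximate posterior for a masked position.

    With q=None the combination is a plain product of experts (uniform q(w));
    passing the corpus unigram distribution as q gives the unigram-corrected
    variant. Everything is done in log space for numerical stability."""
    log_p = np.log(p_l2r + eps) + np.log(p_r2l + eps)
    if q is not None:
        log_p -= np.log(q + eps)
    log_p -= log_p.max()           # stabilize before exponentiating
    post = np.exp(log_p)
    return post / post.sum()


# Toy example with a 5-word vocabulary.
rng = np.random.default_rng(0)
p_l2r = rng.dirichlet(np.ones(5))      # teacher conditioned on the left context
p_r2l = rng.dirichlet(np.ones(5))      # teacher conditioned on the right context
unigram = rng.dirichlet(np.ones(5))

uniform_target = poe_posterior(p_l2r, p_r2l)              # uniform q(w)
unigram_target = poe_posterior(p_l2r, p_r2l, q=unigram)   # unigram q(w)
print(uniform_target.round(3), unigram_target.round(3))
```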
Approximating the Marginal.
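Exactly computing the RNNG's next-word distribution involves an intractable marginalization over tree prefixes; based on the note that this marginal is approximated with a single tree prefix from a separate discriminative parser, a hedged sketch of what Eq. 6 presumably computes is:

```latex
% Hedged sketch of the marginal approximation (Eq. 6): the intractable sum
% over tree prefixes y_{<i} is replaced by a single prefix \hat{y}_{<i}
% produced by a separate discriminative parser.
t_{\phi}(x_i \mid \mathbf{x}_{<i})
  = \sum_{\mathbf{y}_{<i}} t_{\phi}(x_i, \mathbf{y}_{<i} \mid \mathbf{x}_{<i})
  \;\approx\; t_{\phi}(x_i \mid \mathbf{x}_{<i}, \hat{\mathbf{y}}_{<i})
```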
Preliminary Experiments.
Before proceeding with the KD experiments, we assess the quality and feasibility of our approximation through preliminary LM experiments on the Penn Treebank (PTB; Marcus et al., 1993). We find that our approximation is much faster than exact inference by a factor of more than 50,000, at the expense of a slightly worse average posterior negative log-likelihood (2.68 rather than 2.5 for exact inference). More details are provided in Appendix A.
Differences Between the Models.
We now empirically validate our motivating intuition in Figure 1: A model that takes into account the bidirectional context (as is the case for our proposed posterior approximation in Eq. 5) should make different predictions compared with the unidirectional left-to-right and right-to-left models.9 To ascertain whether this is truly the case, we compute the mean Kullback-Leibler (KL) divergence between the distributions from the proposed posterior approximation (Eq. 5) and the distributions from: (i) the left-to-right model, (ii) the right-to-left model, and (iii) a simple product of experts baseline (i.e., Eq. 5, but where q(w) is the uniform distribution). The findings in Table 1 suggest that our proposed posterior approximation approach indeed yields quantifiably different distributions from the left-to-right and right-to-left baselines. To a lesser extent, it also differs from a simple product of experts baseline that similarly incorporates both the left-to-right and right-to-left models’ predictions, albeit with the uniform distribution for q(w).
3.5 Objective Function
Interpolation.
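The exact form of the objective is given by Eq. 7; as a hedged illustration of the interpolation discussed in §4.3 (α = 0.5), the following NumPy sketch builds a target that mixes a one-hot distribution on the gold masked word with the approximate teacher posterior and scores the student with cross-entropy against it. Function names are illustrative, not the authors' implementation.

```python
import numpy as np


def structure_distillation_target(teacher_probs: np.ndarray,
                                  gold_index: int,
                                  alpha: float = 0.5) -> np.ndarray:
    """Interpolate a one-hot distribution on the true masked word with the
    (approximate) teacher posterior, so that the target always places at
    least `alpha` probability mass on the gold token."""
    one_hot = np.zeros_like(teacher_probs)
    one_hot[gold_index] = 1.0
    return alpha * one_hot + (1.0 - alpha) * teacher_probs


def kd_cross_entropy(student_log_probs: np.ndarray, target: np.ndarray) -> float:
    """Cross-entropy between the interpolated target and the student's
    predicted distribution for a single masked position."""
    return float(-(target * student_log_probs).sum())


# Toy example: 5-word vocabulary, gold masked word at index 2.
teacher = np.array([0.05, 0.10, 0.50, 0.30, 0.05])
student_logits = np.array([0.1, 0.2, 2.0, 1.0, -0.5])
student_log_probs = student_logits - np.log(np.exp(student_logits).sum())

target = structure_distillation_target(teacher, gold_index=2)
print(target, kd_cross_entropy(student_log_probs, target))
```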
4 Experiments
Here we outline the evaluation setup, present our results, and discuss the implications of our findings.
4.1 Evaluation Tasks and Setup
We conjecture that the improved syntactic competence from our approach would benefit a broad range of tasks that involve structured output spaces, including tasks that are not explicitly syntactic. We thus evaluate our structure-distilled BERTs on six diverse structured prediction tasks that encompass syntactic, semantic, and coreference resolution tasks, in addition to the GLUE benchmark that is largely composed of classification tasks.
Phrase-structure Parsing - PTB.
We first evaluate our model on phrase-structure parsing on the WSJ section of the PTB. Following prior work, we use sections 02–21 for training, section 22 for validation, and section 23 for testing. We apply our approach on top of the BERT-augmented in-order (Liu and Zhang, 2017) transition-based parser of Fried et al. (2019), which approaches the current state of the art. Because the RNNG teacher that we distill into BERT also uses phrase-structure trees, this setup is related to self-training (Yarowsky, 1995; Charniak, 1997; Zhou and Li, 2005; McClosky et al., 2006; Andor et al., 2016, inter alia).
Phrase-structure Parsing - OOD.
Still in the context of phrase-structure parsing, we evaluate how well our approach generalizes to three out-of-domain (OOD) treebanks: Brown (Francis and Kučera, 1979), Genia (Tateisi et al., 2005), and the English Web Treebank (Petrov and McDonald, 2012). Following Fried et al. (2019), we test the PTB-trained parser on the test splits11 of these OOD treebanks without any retraining, to simulate the case where no in-domain labeled data are available. We use the same codebase as above.
Dependency Parsing - PTB.
Semantic Role Labeling.
Coreference Resolution.
CCG Supertagging Probe.
All proposed tasks thus far necessitate either fine-tuning the entire BERT model, or training a task-specific model on top of the BERT embeddings. Hence, it remains unclear how much of the gains are due to better structural representations from our new pretraining strategy, rather than the available supervision at the fine-tuning stage. To better understand the gains from our approach, we evaluate on CCG (Steedman, 2000) supertagging (Bangalore and Joshi, 1999; Clark and Curran, 2007) through a classifier probe (Shi et al., 2016; Adi et al., 2017; Belinkov et al., 2017, inter alia), where no BERT fine-tuning takes place.12
CCG supertagging is a compelling probing task because it necessitates an understanding of bidirectional context information; the per-word classification setup also lends itself well to classifier probes. Nevertheless, it remains unclear how much of the accuracy can be attributed to the information encoded in the representation, as opposed to the classifier probe itself. We thus adopt the control task protocol of Hewitt and Liang (2019) that assigns each word type to a random control category,13 which assesses the memorization capacity of the classifier. In addition to the probing accuracy, we report the probe selectivity,14 where higher selectivity denotes probes that faithfully rely on the linguistic knowledge encoded in the representation. We use linear classifiers to maintain high selectivities.
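A hedged sketch of this probing protocol: a linear classifier trained on frozen BERT features to predict supertags, a control probe trained on random per-word-type labels of the same cardinality, and selectivity computed as the difference between the two accuracies. The features and labels below are random placeholders rather than real BERT representations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder data: "frozen BERT" features for 2,000 tokens drawn from 300 word
# types, with 50 supertag classes. In the real probe, `features` would be the
# pretrained (not fine-tuned) BERT representations of each token in context.
n_tokens, n_types, n_tags, dim = 2000, 300, 50, 64
features = rng.normal(size=(n_tokens, dim))
word_types = rng.integers(0, n_types, size=n_tokens)
supertags = rng.integers(0, n_tags, size=n_tokens)

# Control task: each word *type* is deterministically mapped to a random
# category with the same cardinality as the supertag set.
type_to_control = rng.integers(0, n_tags, size=n_types)
control_labels = type_to_control[word_types]

split = int(0.8 * n_tokens)


def probe_accuracy(labels):
    clf = LogisticRegression(max_iter=1000).fit(features[:split], labels[:split])
    return clf.score(features[split:], labels[split:])


probe_acc = probe_accuracy(supertags)
control_acc = probe_accuracy(control_labels)
selectivity = probe_acc - control_acc   # higher = probe relies less on memorization
print(probe_acc, control_acc, selectivity)
```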
Commonality.
All our structured prediction experiments are conducted on top of publicly available repositories of BERT-augmented models, with the exception of the CCG supertagging task that we evaluate as a probe. This setup means that obtaining our results is as simple as changing the pretrained BERT weights to our structure-distilled BERT, and applying the exact same steps as for fine-tuning the baseline model. The fine-tuning hyperparameters are summarized in Appendix C.
GLUE.
Beyond the six structured prediction tasks above, we evaluate our approach on the classification15 tasks of the GLUE benchmark except the Winograd NLI (Levesque et al., 2012) for consistency with the BERT paper (Devlin et al., 2019). The BERT GLUE fine-tuning hyperparameters are based on the fine-tuning configurations of Joshi et al. (2020); we summarize these in Appendix C.
4.2 Experimental Setup and Baselines
Here we describe the key aspects of our empirical setup, and outline the baselines for assessing the efficacy of our approach.
RNNG Teacher.
We implement the subword-augmented RNNG teachers (§2) in DyNet (Neubig et al., 2017a), and obtain “silver-grade” phrase-structure annotations for the entire BERT training set using the transition-based parser of Fried et al. (2019). These trees are used to train the RNNG (§2), and to approximate its marginal next-word distribution at inference (Eq. 6). We use the same WordPiece tokenization and vocabulary as BERT-Cased; Appendix B summarizes the complete list of RNNG hyperparameters. Because our approximation (Eq. 5) makes use of a right-to-left RNNG, we train this variant with the same hyperparameters and data as the left-to-right model. We train each directional RNNG teacher on a shared subset of 3.6M sentences (∼3%) from the BERT training set with automatic dynamic batching (Neubig et al., 2017b), which takes three weeks on a V100 GPU.
BERT Student.
We first apply our structure distillation pretraining protocol to BERTBASE-Cased. We use the exact same training dataset, model configuration, WordPiece tokenization, vocabulary, and hyperparameters (Appendix C) as in the standard pretrained BERT model.16 The sole exception is that we use a larger initial learning rate of 3e−4 based on preliminary experiments,17 which we apply to all models (including the no distillation/standard BERT baseline) for fair comparison.
Baselines and Comparisons.
We compare the following set of models in our experiments:
- •
A standard BERTBASE-Cased without any structure distillation loss, which benefits from scalability but lacks syntactic biases (“No-KD”);
- •
Four variants of structure-distilled BERTs that: (i) only distill the left-to-right RNNG (“L2R-KD”), (ii) only distill the right-to-left RNNG (“R2L-KD”), (iii) distill the RNNG’s approximated marginal for generating xi under the bidirectional context, where q(w) (Eq. 5) is the uniform distribution (“UF-KD”), and lastly (iv) a similar variant as (iii), but where q(w) is the unigram distribution (“UG-KD”). All these BERT models crucially benefit from the syntactic biases of RNNGs, although only variants (iii) and (iv) learn from teacher distributions that consider bidirectional context for predicting xi; and
- •
A BERTBASE model that distills the approximate posterior for generating xi under the bidirectional context, but from sequential LSTM teachers (“Seq-KD”) in place of RNNGs.18 This baseline crucially isolates the importance of learning from hierarchical teachers, because it utilizes the exact same approximation technique and KD loss as the structure-distilled BERTs.
Learning Curves.
Given enough labeled data, BERT can acquire the relevant structural information from the fine-tuning (as opposed to pretraining) procedure, although better pretrained representations can nevertheless facilitate sample-efficient generalizations (Yogatama et al., 2019). We thus additionally examine the models’ fine-tuning learning curves, as a function of varying amounts of training data, on phrase-structure parsing and SRL.
Random Seeds.
Because fine-tuning the same pretrained BERT with different random seeds can lead to varying results, we report the mean performance from three random seeds on the structured prediction tasks, and from five random seeds on GLUE.
Test Results.
To preserve the integrity of the test sets, we first report all performance on the validation sets, and only report the test set results for: (i) the No-KD baseline, and (ii) the best structure-distilled model on the validation set (“Best-KD”).
4.3 Findings and Discussion
Table 2: Validation set results for the baseline and structure-distilled BERTBASE models, alongside test set results for the No-KD baseline and the best structure-distilled variant (Best-KD), with the corresponding relative error reduction (Err. Red.).

| Task | No-KD (val) | Seq-KD (val) | L2R-KD (val) | R2L-KD (val) | UF-KD (val) | UG-KD (val) | No-KD (test) | Best-KD (test) | Err. Red. |
|---|---|---|---|---|---|---|---|---|---|
| Const. PTB - F1 | 95.38 | 95.33 | 95.55 | 95.55 | 95.58 | 95.59 | 95.35 | 95.70 | 7.6% |
| Const. PTB - EM | 55.33 | 55.41 | 55.92 | 56.18 | 56.39 | 56.59 | 55.25 | 57.77 | 5.63% |
| Const. OOD - F1† | 86.76 | 86.54 | 87.43 | 87.53 | 87.23 | 87.40 | 89.04 | 89.76 | 6.55% |
| Dep. PTB - UAS | 96.48 | 96.40 | 96.70 | 96.64 | 96.60 | 96.66 | 96.79 | 96.86 | 2.18% |
| Dep. PTB - LAS | 94.65 | 94.56 | 94.90 | 94.80 | 94.79 | 94.83 | 95.13 | 95.23 | 1.99% |
| SRL - OntoNotes | 86.17 | 86.09 | 86.34 | 86.29 | 86.30 | 86.46 | 86.08 | 86.39 | 2.23% |
| Coref. - OntoNotes | 72.53 | 69.27 | 73.74 | 73.49 | 73.79 | 73.33 | 72.71 | 73.69 | 3.58% |
| CCG supertag. probe | 93.69 | 91.59 | 93.97 | 95.21 | 95.13 | 95.21 | 93.88 | 95.2 | 21.57% |
| Probe selectivity | 24.79 | 23.77 | 23.3 | 23.57 | 27.28 | 28.3 | 23.15 | 26.07 | N/A |
General Discussion.
We summarize several key observations from Table 2 and Figure 2.
- •
All four structure-distilled BERT models consistently outperform the No-KD baseline, including the L2R-KD and R2L-KD variants that only distill the syntactic knowledge of unidirectional RNNGs. Remarkably, this pattern holds true for all six structured prediction tasks. In contrast, we observe no such gains for the Seq-KD baseline, which largely performs worse than the No-KD model. We conclude that the gains afforded by our structure-distilled BERTs can be attributed to the syntactic biases of the RNNG teacher.
- •
We conjecture that the surprisingly strong performance of the L2R-KD and R2L-KD models, which distill the knowledge of unidirectional RNNGs, can be attributed to the interpolated objective in Eq. 7 (α = 0.5). This interpolation means that the target distribution assigns a probability mass of at least 0.5 to the true masked word xi, which is guaranteed to be consistent with the bidirectional context. However, the syntactic knowledge contained in the unidirectional RNNGs’ predictions can still provide a structurally informative learning signal, via the rest of the probability mass, for the BERT student.
- •
Although all structure-distilled variants outperform the baseline, models that distill our approximation of the RNNG’s distribution for words in bidirectional context (UF-KD and UG-KD) yield the best results on four out of six tasks (PTB phrase-structure parsing, SRL, coreference resolution, and the CCG supertagging probe). This finding confirms the efficacy of our approach.
- •
We observe the largest gains for the syntactic tasks, particularly for phrase-structure parsing and CCG supertagging. However, the improvements are not at all confined to purely syntactic tasks: we reduce relative error from strong BERT baselines by 2.2% and 3.6% on SRL and coreference resolution, respectively. While the RNNG’s syntactic biases are derived from phrase-structure grammar, the strong improvement on CCG supertagging, in addition to the smaller improvement on dependency parsing, suggests that the RNNG’s syntactic biases generalize well across different syntactic formalisms.
- •
We observe larger improvements in a low-resource scenario, where the model is exposed to fewer fine-tuning instances (Figure 2), suggesting that syntactic biases are helpful for enabling more sample-efficient generalizations. This pattern holds for both tasks that we investigated: phrase-structure parsing (syntactic in nature) and SRL (not explicitly syntactic in nature). With only 5% of the fine-tuning data, the UG-KD model improves F1 score from 79.9 to 80.6 for SRL (a 3.5% error reduction relative to the No-KD baseline, as opposed to 2.2% on the full data). For phrase-structure parsing, the UG-KD model achieves a remarkable 93.68 F1 (a 16% relative error reduction, as opposed to 7.6% on the full data) with only 5% of the PTB—this performance is notably better than state-of-the-art phrase-structure parsers trained on the full PTB circa 2017 (Kuncoro et al., 2017).
GLUE Results and Discussion.
We report the GLUE validation and test results for BERTBASE-Cased in Table 3. Because we observe a different pattern of results on the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) than on the rest of GLUE, we henceforth report: (i) the CoLA results, (ii) the seven task average that excludes CoLA, and (iii) the average across all eight tasks. We select the UG-KD model because it achieved the best validation set eight task average among the structure-distilled BERTs; the per-task GLUE breakdown is provided in Appendix D.
Table 3: GLUE results for the No-KD baseline and the UG-KD BERTBASE-Cased model.

| | No-KD | UG-KD |
|---|---|---|
| Validation set (per-task average / 1-best random seed) | | |
| CoLA | 50.7 / 60.2 | 54.3 / 60.6 |
| 7-task avg. (excl. CoLA) | 85.4 / 87.8 | 84.8 / 86.9 |
| Overall 8-task avg. | 81.1 / 84.4 | 81.0 / 83.6 |
| Test set (per-task 1-best random seed on validation set) | | |
| CoLA | 53.1 | 55.3 |
| 7-task avg. (excl. CoLA) | 84.2 | 83.5 |
| Overall 8-task avg. | 80.3 | 80.0 |
The results on GLUE provide an interesting contrast to the consistent improvements we observed on the structured prediction tasks. More concretely, our UG-KD model outperforms the baseline on CoLA, but performs slightly worse on the other GLUE tasks in aggregate, leading to a slightly lower overall test set average (80.0 for the UG-KD as opposed to 80.3 for the No-KD baseline).
The improvement on the syntax-sensitive CoLA provides additional evidence—beyond the improvement on the syntactic tasks (Table 2)— that our approach indeed yields improved syntactic competence. We conjecture that these improvements do not transfer to the other GLUE tasks because they rely more on lexical and semantic properties, and less on syntactic competence (McCoy et al., 2019).
We defer a more thorough investigation of how much syntactic competence is necessary for solving most of the GLUE tasks to future work, but make two remarks. First, the findings on GLUE are consistent with the hypothesis that our approach yields improved structural competence, albeit at the expense of a slightly less rich meaning representation, which we attribute to the smaller dataset used to train the RNNG teacher. Second, human-level natural language understanding includes the ability to predict structured outputs, for example, to decipher “who did what to whom” (SRL). Succeeding in these tasks necessitates inference about structured output spaces, which (unlike most of GLUE) cannot be reduced to a single classification decision. Our findings indicate a partial dissociation between model performance on these two types of tasks; hence, supplementing GLUE evaluation with some of these structured prediction tasks can offer a more holistic assessment of progress in NLU.
CCG Probe Example.
The CCG supertagging probe is a particularly interesting test bed, because it clearly assesses the model’s ability to use contextual information in making its predictions— without introducing additional confounds from the BERT fine-tuning procedure. We thus provide a representative example of four different BERT variants’ predictions on the CCG supertagging probe in Table 4, based on which we discuss two observations. First, the different models make different predictions, where the No-KD and L2R-KD models produce (coincidentally the same) incorrect predictions, while the R2L-KD and UG-KD models are able to predict the correct supertag. This finding suggests that different teacher models are able to impose different biases on the BERT students.19
Table 4: A representative example of the models' predictions on the CCG supertagging probe; the supertag shown is for the verb "use".

| Sentence Input | No-KD & L2R-KD Pred. | R2L-KD & UG-KD Pred. |
|---|---|---|
| "Apple II owners, for example, had to use their TV sets as screens and stored data on audiocassettes" | (S[b]∖NP)/NP | ((S[b]∖NP)/PP)/NP |
Second, the mistakes of the No-KD and L2R-KD BERTs belong to the broader category of challenging argument-adjunct distinctions (Palmer et al., 2005; Fowlie, 2017). Here both models fail to subcategorize for the prepositional phrase (PP) “as screens”, which serves as an argument of the verb “use”, as opposed to the noun phrase “TV sets”. Distinguishing between these two potential dependencies naturally requires syntactic information from the right context; hence the R2L-KD BERT, which is trained to emulate the predictions of an RNNG teacher that observes the right context, is able to make the correct prediction. This advantage is crucially retained by the UG-KD model that distills the RNNG’s approximate distribution over words in bidirectional context (Eq. 5), and further confirms the efficacy of our proposed approach.
Measuring the Models’ Differences.
Beyond the qualitative example in Table 4, we further quantify the extent to which the different BERT models produce different predictions. To this end, we compute pairwise model agreement for the phrase-structure parsing task, as measured by exact match accuracy. We present the full experimental setup and findings in Appendix E, but summarize two key findings here.
First, the highest exact match agreement between any pair of different models is fairly low at 44.92%, further supporting our conjecture that different teacher models indeed impose different biases on the BERT student, as evidenced by the different model predictions. Second, all four structure-distilled BERT variants have the lowest pairwise agreement score with the No-KD baseline (< 39% pairwise model agreement), suggesting that all variants of our structure distillation objectives yield quantifiably different outputs compared to the no distillation alternative, which does not learn from the syntactic knowledge of RNNGs.
BERTLARGE Results.
Having evaluated our structure-distilled BERTBASE-Cased, we now apply our approach on top of BERTLARGE-Cased, and present the results on the structured prediction tasks in Table 5. Overall, we observe a similar pattern of results with BERTLARGE as we do with BERTBASE: On the structured prediction tasks, our best structure distillation approach reduces relative error by 1.5% to 5.5% compared with the No-KD baseline. Furthermore, our structure-distilled BERTLARGE models establish new state-of-the-art single-model results—among models pretrained on the original BERT training set20—on phrase-structure parsing (PTB and OOD), PTB dependency parsing, and SRL.
Table 5: Test set results for BERTLARGE-Cased, including the relative error reduction of the best structure-distilled variant (Best-KD) and prior BERT-based state of the art (BERT SoTA).

| Task | No-KD | Best-KD | Error Red. | BERT SoTA |
|---|---|---|---|---|
| Const. PTB - F1 | 95.80 | 95.95 | 3.73% | 95.84† |
| Const. PTB - EM | 56.87 | 57.74 | 2.02% | − |
| Const. OOD - F1 | 89.63 | 90.20 | 5.48% | 89.91‡ |
| Dep. PTB - UAS | 96.91 | 97.03 | 3.78% | 97.0† |
| Dep. PTB - LAS | 95.33 | 95.49 | 3.43% | 95.43† |
| SRL - OntoNotes | 87.59 | 87.77 | 1.45% | 86.5♢ |
| Coref. - OntoNotes | 74.03 | 74.69 | 2.55% | 79.6♦ |
4.4 Limitations
We outline two limitations of our approach. First, we assume the existence of decent-quality “silver-grade” phrase-structure trees to train the RNNG teacher. Although this assumption holds true for English because of the existence of accurate phrase-structure parsers, this is not necessarily the case for other languages. Second, pretraining the BERT student in our naïve implementation is about half as fast on TPUs compared with the baseline, due to an I/O bottleneck. This overhead only applies at pretraining, and can be reduced through parallelization.
5 Related Work
Earlier work has proposed a few ways of introducing notions of hierarchical structure into BERT, for instance, through designing structurally motivated auxiliary losses (Wang et al., 2020), or including syntactic information in the embedding layers that serve as inputs for the Transformer (Sundararaman et al., 2019). In contrast, we use a different technique for injecting syntactic biases, which is based on the structure distillation technique of Kuncoro et al. (2019), although our work features two key differences. First, Kuncoro et al. (2019) focused solely on the case where both the teacher and student models are autoregressive, left-to-right LMs; here we extend this objective to the setting where the student model is a representation learner with access to bidirectional context. Second, Kuncoro et al. (2019) only evaluated their structure-distilled LMs in terms of perplexity and grammatical judgment (Marvin and Linzen, 2018). In contrast, we evaluate our structure-distilled BERT models on six diverse structured prediction tasks and the GLUE benchmark. It remains an open question whether, and how much, syntactic biases are helpful for a broader range of NLU tasks beyond grammatical judgment; our work represents a step towards answering this question.
Substantial progress has recently been made in improving the performance of BERT and other masked LMs (Lan et al., 2020; Liu et al., 2019b; Raffel et al., 2019; Sun et al., 2020, inter alia). Our structure distillation technique is orthogonal to, and can be applied on top of, these approaches. Lastly, our findings on the benefits of syntactic knowledge for structured prediction tasks that are not explicitly syntactic in nature, such as SRL and coreference resolution, are consistent with those of prior work (He et al., 2017; Swayamdipta et al., 2018; He et al., 2018; Strubell et al., 2018, inter alia).
6 Conclusion
Given the remarkable success of textual representation learners trained on large amounts of data, it remains an open question whether syntactic biases are still relevant for these models that work well at scale. Here we present evidence to the affirmative: our structure-distilled BERT models outperform the baseline on a diverse set of six structured prediction tasks. We achieve this through a new pretraining strategy that enables the BERT student to learn from the predictions of an explicitly hierarchical, but much less scalable, RNNG teacher model. Because the BERT student is a bidirectional model that estimates the conditional probabilities of masked words in context, we propose to distill an efficient yet surprisingly effective approximation of the RNNG’s posterior estimate for generating each word conditional on its bidirectional context.
Our findings suggest that syntactic inductive biases are beneficial for a diverse range of structured prediction tasks, including for tasks that are not explicitly syntactic in nature. In addition, these biases are particularly helpful for improving fine-tuning sample efficiency on these tasks. Lastly, our findings motivate the broader question of how we can design models that integrate stronger notions of structural biases—and yet can be easily scalable at the same time—as a promising (if relatively underexplored) direction of future research.
A Preliminary Experiments
Here we discuss the preliminary experiments to assess the quality and computational efficiency of our posterior approximation procedure (§3.4). Recall that this approximation procedure only applies at inference; the LM is still trained in a typical autoregressive, left-to-right fashion.
Model.
Because exactly computing the RNNG’s next-word distributions tϕ(xi|x<i) involves an intractable marginalization over all possible tree prefixes y<i, we run our experiments in the context of sequential LSTM language models, where tLSTM(xi|x<i) can be computed exactly. This setup crucially enables us to isolate the impact of approximating the posterior distribution over xi under the bidirectional context (Eq. 2) with our proposed approximation (Eq. 5), without introducing further confounds stemming from the RNNG’s marginal approximation procedure (Eq. 6).
Dataset and preprocessing.
We train the LSTM LM on an open-vocabulary version of the PTB,22 in order to simulate the main experimental setup where both the RNNG teacher and BERT student are also open-vocabulary by virtue of byte-pair encoding (BPE) preprocessing. To this end, we preprocess the dataset with SentencePiece (Kudo and Richardson, 2018) BPE tokenization, where |Σ| = 8,000; we preserve all case information. We follow the empirical setup of the parsing (§4.1) experiments, with Sections 02–21 for training, Section 22 for validation, and Section 23 for testing.
Model Hyperparameters.
We train the LM with 2 LSTM layers, 250 hidden units per layer, and a dropout (Srivastava et al., 2014) rate of 0.2. Model parameters are optimized with stochastic gradient descent (SGD), with an initial learning rate of 0.25 that is decayed exponentially by a factor of 0.92 for every epoch after the tenth. Because our approximation relies on a separately trained right-to-left LM (Eq. 5), we train this variant with the exact same hyperparameters and dataset split as the left-to-right model.
Evaluation and Baselines.
We evaluate the models in terms of the average posterior negative log likelihood (NLL) and perplexity.23 Because exact inference of the posterior is expensive, we evaluate the model only on the first 400 sentences of the test set. We compare the following variants:
- •
A mixture of experts baseline that simply mixes (α = 0.5) the probabilities from the left-to-right and right-to-left LMs in an additive fashion, as opposed to multiplicative as in the case of our PoE-like approximation in Eq. 5 (“MoE”);
- •
Our approximation of the posterior over xi (Eq. 5), where q(w) is the uniform distribution (“Uniform Approx.”);
- •
Our approximation of the posterior over xi (Eq. 5), but where q(w) is the unigram distribution (“Unigram Approx.”); and
- •
Exact inference of the posterior as computed from the left-to-right model, as defined in Eq. 2 (“Exact Inference”).
Discussion.
We summarize the findings in Table 6, based on which we remark on two observations. First, the posterior NLL of our approximation procedure that makes use of the unigram distribution (Unigram Approx.; third row) is not much worse than that of exact inference, in exchange for a more than 50,000 times speedup24 in computation time. Nevertheless, using the uniform distribution (second row) for q(w) in place of the unigram one (Eq. 5) results in a much worse posterior NLL. Second, combining the left-to-right and right-to-left LMs using a mixture of experts—a baseline that is not well-motivated by our theoretical analysis—yields the worst result.
B RNNG Hyperparameters
To train the subword-augmented RNNG teacher (§2), we use the following hyperparameters that achieve the best validation perplexity from a grid search: 2-layer stack LSTMs (Dyer et al., 2015) with 512 hidden units per layer, optimized by standard SGD with an initial learning rate of 0.5 that is decayed exponentially by a factor of 0.9 for every epoch after the tenth. We use dropout with p = 0.3.
C BERT Hyperparameters
Here we outline the hyperparameters of the BERT student in terms of pretraining data creation, masked LM pretraining, and fine-tuning.
Pretraining Data Creation.
We use the same codebase25 and pretraining data as Devlin et al. (2019), which are derived from a mixture of Wikipedia and Books text corpora. To train our structure-distilled BERTs, we create masked pretraining instances from these corpora following the same hyperparameters used to train the original BERTBASE-Cased model: a maximum sequence length of 512, a per-word masking probability of 0.15 (up to a maximum of 76 masked tokens in a 512-length sequence), and a dupe factor of 10. We use a random seed of 12345. We preprocess the raw dataset using NLTK tokenizers, and then apply the same (BPE-augmented) vocabulary and WordPiece tokenization as in the original BERT model. All other hyperparameters are set to the same values as in the publicly released original BERT model.
Masked LM Pretraining.
We train all model variants (including the no distillation/standard BERT baseline for fair comparison) using a batch size of 256 sequences. We use an initial Adam learning rate of 3e−4 for the BERTBASE models (as opposed to 1e−4 in the original BERT model) and 1e−4 for the BERTLARGE models. Following Devlin et al. (2019), we pretrain our models for 1M steps. All other hyperparameters are similarly set to their default values.
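For reference, the pretraining settings above can be collected into a small configuration sketch; the dictionary keys are illustrative and do not correspond to the flags of the BERT codebase.

```python
# Illustrative summary of the masked LM pretraining configuration used for
# all model variants (key names are hypothetical, not the codebase's flags).
PRETRAINING_CONFIG = {
    "batch_size_sequences": 256,
    "num_train_steps": 1_000_000,
    "optimizer": "adam",
    "initial_learning_rate": {"base": 3e-4, "large": 1e-4},
    "max_sequence_length": 512,
    "masked_lm_probability": 0.15,
    "max_masked_tokens_per_sequence": 76,
    "dupe_factor": 10,
    "random_seed": 12345,
}
```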
GLUE Fine-tuning.
For each GLUE task, we fine-tune the BERT model by running a grid search over learning rates of {5e−6,1e−5,2e−5, 3e−5, 5e−5} and batch sizes of {16,32}, with 5 random seeds. Following Joshi et al. (2020), we train each fine-tuning configuration for 10 epochs, except for CoLA, where we train for 4 epochs.
Structured Prediction Fine-tuning.
For the structured prediction tasks, we use the following hyperparameters for learning rate and batch size. These hyperparameters are either the default for a given codebase, or lightly tuned on the No-KD models. We use the same hyperparameters across all models (No-KD and KD) of a given size (BASE or LARGE) on a given task.
- •
In-order phrase-structure parser: a BERT learning rate of 2e−5, a batch size of 32, and a warmup period of 160 updates.
- •
HPSG dependency parser: a BERT learning rate of 5e−5, a batch size of 150, and a warmup period of 160 updates.
- •
Coreference resolution: for the BERTBASE models: a learning rate of 1e−5 and a maximum segment length of 128 word pieces. For the BERTLARGE models: a learning rate of 5e−6 and a maximum segment length of 512 word pieces. Both sizes use a learning rate warmup period of 2 epochs and a batch size of 1 document.
- •
Semantic role labeling: for the BERTBASE models: a learning rate of 5e−5. For the BERTLARGE models: a learning rate of 1e−5. Both sizes use a batch size of 32.
D Full GLUE Results
We present the full GLUE results for the No-KD baseline and the UG-KD BERT in Table 7.
Table 7: Full per-task GLUE results for the No-KD baseline and the UG-KD BERT.

| Split | Model | CoLA | SST-2 | MRPC | QQP | MNLI (m/mm) | QNLI | RTE | GLUE Avg |
|---|---|---|---|---|---|---|---|---|---|
| Dev | No-KD | 60.2 | 92.2 | 90.0 | 89.4 | 90.3/90.9 | 90.7 | 71.1 | 84.4 |
| Dev | UG-KD | 60.6 | 92.0 | 88.9 | 89.3 | 89.6/90.0 | 89.9 | 68.6 | 83.6 |
| Test | No-KD | 53.1 | 92.5 | 88.0 | 88.8 | 82.8/81.8 | 89.9 | 65.4 | 80.3 |
| Test | UG-KD | 55.3 | 91.2 | 87.6 | 88.7 | 81.9/80.8 | 89.5 | 65.0 | 80.0 |
E Quantifying Model Differences
We quantify the extent to which learning from different teacher models results in BERT models that make different predictions. To this end, we compute pairwise model agreement,26 in terms of phrase-structure parsing exact match, between each pair of five model variants (No-KD, L2R-KD, R2L-KD, UF-KD, and UG-KD). We compute this pairwise model agreement score on the PTB dev set (§22). To better understand the differences between the models, we exclude sentences where all five models produce the exact same phrase-structure trees, leaving 826 out of 1700 sentences; further analysis indicates that the excluded sentences tend to be shorter and less ambiguous.
We present the findings in Table 8, and summarize two key observations. First, the highest exact match agreement between any pair of models is fairly low at 44.92%. This finding supports our conjecture that different teacher models indeed impose different biases on the BERT students, as evidenced by the different model predictions. Second, each RNNG-distilled BERT model has the lowest agreement rate with the No-KD baseline. This finding suggests that all variants of our structure distillation approach produce quantifiably different predictions (<39% pairwise model agreement) from the No-KD/standard BERT baseline that does not learn from the syntactic knowledge of RNNGs.
Table 8: Pairwise exact match agreement (%) between model pairs on the PTB development set, excluding sentences where all five models produce identical trees.

| Pairwise Exact Match Agreement | No-KD | L2R-KD | R2L-KD | UF-KD | UG-KD |
|---|---|---|---|---|---|
| No-KD | – | 36.20 | 36.56 | 38.01 | 37.05 |
| L2R-KD | 36.20 | – | 42.25 | 43.58 | 44.43 |
| R2L-KD | 36.56 | 42.25 | – | 39.95 | 41.53 |
| UF-KD | 38.01 | 43.58 | 39.95 | – | 44.92 |
| UG-KD | 37.05 | 44.43 | 41.53 | 44.92 | – |
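A minimal sketch of this pairwise agreement computation, treating one model's predicted trees as the reference and scoring another model's exact match against them; the bracketed-tree strings are placeholders.

```python
def pairwise_exact_match(preds_a, preds_b, exclude_identical_across_all=None):
    """Exact-match agreement between two models' predicted trees. Sentences
    where *all* compared models agree can be filtered out beforehand by
    passing their indices via `exclude_identical_across_all`."""
    skip = set(exclude_identical_across_all or [])
    kept = [(a, b) for i, (a, b) in enumerate(zip(preds_a, preds_b)) if i not in skip]
    return sum(a == b for a, b in kept) / len(kept)


# Toy example with bracketed-tree strings standing in for parser outputs.
no_kd = ["(S (NP a) (VP b))", "(S (NP c) (VP d))", "(S x)"]
ug_kd = ["(S (NP a) (VP b))", "(S (NP c) (ADVP d))", "(S x)"]
print(pairwise_exact_match(no_kd, ug_kd, exclude_identical_across_all=[2]))  # 0.5
```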
Acknowledgments
We would like to thank Mandar Joshi, Zhaofeng Wu, Rui Zhang, Timothy Dozat, and Kenton Lee for answering questions regarding the evaluation of the model. We also thank Sebastian Ruder, John Hale, Kris Cao, Stephen Clark, and the three anonymous reviewers for their helpful suggestions. A. K. is supported by an EPSRC Doctoral Training Partnership studentship and a Balliol Mark Sadler scholarship; D. F. is supported by a Google PhD Fellowship.
Notes
Not all syntactic LMs have hierarchical biases; Choe and Charniak (2016) modeled strings and phrase structures sequentially with LSTMs. This model can be understood as a special case of RNNGs without the composition function.
Unsupervised RNNGs (Kim et al., 2019) exist, although they perform worse on measures of syntactic competence.
An alternative here is to represent each phrase as a flat sequence of subwords, although our preliminary experiments indicate that this approach yields worse perplexity.
In practice, the corruption protocol c(⋅) and the reconstruction targets M(x) are intertwined; M(x) denotes the indices of tokens in x (∼ 15%) that were altered by c(x).
In this setup, we assume that x is a fixed-length sequence. We aim to infer the LM’s estimate for generating a single token xi, relative to all potential single tokens w ∈ Σ (denominator in Eq. 2), conditional on the bidirectional context.
In our BERT pretraining setup, |Σ| ≈ 29,000 (the vocabulary size of BERT-Cased), |D| ≈ 3 × 10^9, and k = 512.
This approximation preserves the intuition explained in §3.3. Concretely, verbs like bark would also be assigned low probabilities under this posterior approximation, since tϕ(the cat | bark) would be low, as it is syntactically illicit (the alternative “bark at the cat” would be syntactically licit).
Our approximation of tϕ(xi|x<i) relies on a tree prefix from a separate discriminative parser, which has access to yet unseen words x>i. This non-incremental procedure is justified, however, because we aim to design the most informative teacher distributions for the non-incremental BERT student, which also has access to bidirectional context.
We use the same setup as Preliminary Experiments.
The KD loss ℓKD(x; θ) is defined independently of xi.
A similar CCG probe was explored by Liu et al. (2019a); we obtain comparable results for the no distillation baseline.
Following Hewitt and Liang (2019), the cardinality of this control category is the same as the number of supertags.
A probe’s selectivity is defined as the difference between the probing task accuracy and the control task accuracy.
This setup excludes the semantic textual similarity benchmark (STS-B), which is formulated as a regression task.
We find this larger learning rate to perform better on most of our evaluation tasks. Liu et al. (2019b) have similarly found that tuning BERT’s initial learning rate leads to better results.
For fair comparison, we train the LSTM on the exact same subset as the RNNG, with comparable number of model parameters. An alternative here is to use Transformers, although we elect to use LSTMs to facilitate fair comparison with RNNGs, which are also based on LSTM architectures.
All four BERTs have access to the full bidirectional context at test time, although some are trained to mimic the predictions of unidirectional RNNGs (L2R-KD and R2L-KD).
This comparison excludes other models like XLNet and RoBERTa, which are trained on more data.
Our open-vocabulary setup means that our results are not directly comparable to prior work on PTB language modeling (Mikolov et al., 2010, inter alia), which mostly utilize a special “UNK” token for infrequent or unknown words.
In practice, this perplexity is derived from simply exponentiating the average posterior negative log likelihood.
All three approximations in Table 6 have similar runtimes.
For instance, when comparing the agreement between the No-KD and the UG-KD models, we treat the UG-KD model’s output as “gold reference”, and compute the exact match from the No-KD model’s output with respect to that.
References
Author notes
Equal contribution.