Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale

Abstract
We introduce Transformer Grammars (TGs), a novel class of Transformer language models that combine (i) the expressive power, scalability, and strong performance of Transformers and (ii) recursive syntactic compositions, which here are implemented through a special attention mask and deterministic transformation of the linearized tree. We find that TGs outperform various strong baselines on sentence-level language modeling perplexity, as well as on multiple syntax-sensitive language modeling evaluation metrics. Additionally, we find that the recursive syntactic composition bottleneck, which represents each sentence as a single vector, harms perplexity on document-level language modeling, providing evidence that a different kind of memory mechanism, one that is independent of composed syntactic representations, plays an important role in current successful models of long text.


Introduction
Transformer language models (LMs) that are trained on vast amounts of data have achieved remarkable success at various NLP benchmarks (Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020, inter alia). Intriguingly, this success is achieved by models that lack an explicit modeling of hierarchical syntactic structures, which were hypothesized by decades of linguistic research to be necessary for good generalization (Chomsky, 1957; Everaert et al., 2015). This naturally raises a question: To what extent can we further improve the performance of Transformer LMs through an inductive bias that encourages the model to explain the data by means of recursive syntactic compositions? Although the benefits of modeling recursive syntax have been shown at small data and model scales, such as in the case of recurrent neural network grammars (Dyer et al., 2016; Futrell et al., 2019; Kim et al., 2019; Hu et al., 2020; Noji and Oseki, 2021, inter alia), it remains an open question whether, and to what extent, a similar design principle is still beneficial for Transformer LMs at larger scales.
In this paper, we aim to answer these questions by introducing Transformer Grammars (TGs), a novel class of Transformer language models that combine: (i) the expressive power, scalability, and strong performance of Transformer-XL (Dai et al., 2019); (ii) joint modeling of surface strings x and their corresponding phrase-structure trees y, i.e., p(x, y); and (iii) an inductive bias that constrains the model to explain the data through built-in recursive syntactic composition operations. By implementing these recursive compositions through a novel modification of the Transformer-XL attention mask, TGs retain the computational efficiency of standard Transformer-XLs, enabling them to avoid the limitations of LSTM-based recurrent neural network grammars (RNNGs; Dyer et al., 2016), which have proven difficult to scale (Kuncoro et al., 2019; Noji and Oseki, 2021).
TGs are related to the recent work of Qian et al. (2021), which similarly aims to augment generative Transformer language models with a stronger modeling of syntactic structures, albeit with two key differences. First, whereas Qian et al. (2021) used syntactic structure to restrict the behavior of a subset of attention heads (Strubell et al., 2018; Astudillo et al., 2020), TGs incorporate a stronger form of syntactic inductive bias by using recursive syntactic compositions to create an explicit composed representation for each constituent, in a similar fashion to RNNGs. Hence, our approach sheds light on whether, and to what extent, the recursive syntactic composition hypothesis (which has been shown to be valuable at the small data and model scale in the case of RNNGs) continues to offer additional benefits beyond specializing a subset of attention heads for syntax. Second, in contrast to prior work, which has been limited to modeling sentences independently of document context, this work explores whether syntactic composition also benefits document-level language modeling.
• In single-sentence language modeling perplexity, the terminal-only Transformer-XL baseline (TXL (terminals)) is outperformed by both TGs and a Transformer-XL that predicts sentences as joint sequences of terminals and tree-building nonterminal symbols (TXL (trees)) but without the TG's attention restrictions (Choe and Charniak, 2016; Qian et al., 2021).
• Although modeling structure improves perplexity compared to terminal-only models, the TXL (trees) model slightly outperforms the more biased TG models. Using a regression analysis, we show that while TG's recursive syntactic compositions benefit syntactic generalization, their implementation in terms of attention restriction interferes with Transformers' lexical copying ability, which turns out to play a role in obtaining low perplexities. This result indicates a partial dissociation between perplexity and syntactic generalization, both of which are important metrics for assessing LM success.
• On the benchmark of Hu et al. (2020), a carefully controlled test of syntactic generalization ability, TGs substantially outperform the syntax-free TXL (terminals) baseline, as well as the much stronger TXL (trees) model. Perhaps even more remarkably, TGs outperform the GPT-2-small (Radford et al., 2019), Gopher (Rae et al., 2021), and Chinchilla (Hoffmann et al., 2022) models, which are between 250× and 1,000× larger than the TG model, trained on vastly more data, and arguably represent the most sophisticated LMs in existence.
• When modeling full documents and evaluating perplexity, we again find that the TXL (trees) model outperforms the terminal-only TXL (terminals) baseline. However, TGs substantially underperform both the TXL (terminals) and TXL (trees) models. The failure of TGs, which represent prior-sentence context purely in terms of a composed syntactic representation, suggests that a different memory mechanism, one that works in part independently of syntactic structures, may play a role in the processing of long-form text.
All in all, our findings show that, under comparable experimental conditions, LMs with notions of syntactic structures (both TXL (trees) and TG) outperform those without on multiple evaluation metrics. We further demonstrate that encouraging the model to explain the data by means of recursive syntactic compositions, as is the case for TGs, is a valuable inductive bias for achieving an even stronger human-like syntactic competence, outperforming prior work that also incorporates syntactic biases, albeit without recursive compositions (Qian et al., 2021), in addition to some of the largest non-syntactic LMs to date. Lastly, our findings motivate the development of scalable LMs that nevertheless incorporate stronger notions of syntactic structures as a promising (albeit relatively under-explored) area of NLP research.

Model
TGs are syntactic language models: they jointly model the probability of syntactic phrase-structure trees y and strings of words x, using the predicted structures to determine the structure of the computations of model states. Following a line of recent work in parsing and syntactic language models (Vinyals et al., 2015; Dyer et al., 2016; Choe and Charniak, 2016), the generation problem is decomposed into modeling a sequence of actions that construct (x, y) in a top-down, left-to-right fashion, by interleaving nonterminal nodes and their children, as shown in Figure 1.
Figure 1: An example of a string x paired with its phrase-structure tree y, represented as a sequence of actions that construct (x, y) in a top-down, left-to-right fashion (Dyer et al., 2016; Choe and Charniak, 2016).
Let $a = (a_0, a_1, \ldots, a_{T-1})$ be a sequence of actions (of length $T$) that generates $(x, y)$, where each action is part of the action vocabulary $V$. TGs define a probability distribution over $a$ through a left-to-right factorization, i.e., $p(x, y) = p(a) = \prod_i p(a_i \mid a_{<i})$.
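As an illustration of the factorization above, the following minimal sketch splits a linearized tree into typed actions and sums conditional log probabilities with the chain rule. The type labels `ONT`, `CNT`, and `T`, the helper names, and the uniform toy model are assumptions made for this example; they stand in for the trained model and are not part of the method.

```python
import math


def linearize(tree: str) -> list[tuple[str, str]]:
    """Split a bracketed string into (action, type) pairs.

    Assumes whitespace-separated tokens where "(X" opens and "X)" closes
    a constituent; this toy convention ignores literal parentheses in text.
    """
    actions = []
    for tok in tree.split():
        if tok.startswith("("):
            kind = "ONT"  # opening nonterminal, e.g. "(NP"
        elif tok.endswith(")"):
            kind = "CNT"  # closing nonterminal, e.g. "NP)"
        else:
            kind = "T"    # terminal (word)
        actions.append((tok, kind))
    return actions


def sequence_log_prob(actions, cond_log_prob) -> float:
    """Chain rule: log p(a) = sum_i log p(a_i | a_<i)."""
    total = 0.0
    for i, (tok, _) in enumerate(actions):
        prefix = [a for a, _ in actions[:i]]
        total += cond_log_prob(tok, prefix)
    return total


def uniform_model(tok, prefix, vocab_size=10):
    """Hypothetical stand-in model: uniform over a 10-symbol vocabulary."""
    return -math.log(vocab_size)


actions = linearize("(S (NP the blue bird NP) (VP sings VP) S)")
log_p = sequence_log_prob(actions, uniform_model)
```

Under the uniform stand-in, each of the ten actions contributes the same term, so `log_p` equals ten times the per-action log probability; a trained model would instead condition each term on the prefix.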

Recursive syntactic composition via attention
In Transformer language models, when generating $a_i$ conditionally on $a_{<i}$, attention is the only mechanism by which information from other positions $j < i$ is incorporated. The rules governing this information flow (i.e., which positions can attend to which other positions) are defined by the attention mask. We design TGs to use recursive syntactic compositions, which have been shown to lead to better generalization in the LSTM-based RNNG model, and we implement them through the Transformer attention mechanism.
In TGs, the action sequence is generated from left to right, and each symbol $a_i$ can be thought of as updating a stack of indices. When the current $a_i$ is a closing nonterminal (i.e., a constituent has just ended), its index $i$ will be represented by a single-vector-sized composed representation obtained by attending to the child positions of the currently ending constituent. Subsequent positions ($> i$) may attend to this composed position, but they may not attend directly to the constituent positions. This restriction imposes a syntactic bottleneck, since everything inside the constituent that influences subsequent predictions must become part of the composed representation. This bottleneck encourages the model to learn informative representations of composed phrases and is inspired by the same design principle as RNNGs and other tree-structured architectures. In the stack, the restriction is instantiated by popping the indices for the child nodes and then pushing the index of the composed constituent. We refer to this process as COMPOSE attention.
In addition to COMPOSE attention, at each position $i$, we apply STACK attention, where $i$ is pushed onto the stack, and attention is restricted to positions on the stack. Both STACK and COMPOSE attention use the same parameters and attention heads; what distinguishes them is only the rule for computing the set of positions that the model can attend to. Importantly, as we need to (i) perform both COMPOSE and STACK for a closing nonterminal (e.g., to first compute a composed representation based on its parts/children, and then add the composed representation onto the stack), while (ii) performing exactly one attention operation per token, we transform the original sequence $a$ by duplicating all closing nonterminals. This yields a sequence $a'$ of length $T'$, e.g., (S (NP the blue bird NP) NP) (VP sings VP) VP) S) S). The first closing nonterminal of each pair is given the type CNT1, and implements COMPOSE, whereas the second is given the type CNT2, and implements STACK. To keep the number of prediction events (i.e., the number of times a probability distribution is emitted by the model) constant, no final prediction is made for COMPOSE positions (see Figure 2). The exact procedure for STACK/COMPOSE is described in Algorithm 1. The positions that may be attended are represented as a binary attention mask $A$ (see Figure 2), such that $A_{ij} = 1$ iff the position $j$ may be attended from $i$, and 0 otherwise. Note that the computation of the attention mask is causal, i.e., no information from positions $j > i$ is used to compute the positions that can be attended from $i$.
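The duplication of closing nonterminals and the STACK/COMPOSE rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names and the `ONT`/`T` labels are chosen for this example, indices are zero-based, and a COMPOSE position is assumed to attend to itself, its children, and its opening nonterminal.

```python
def token_type(tok: str) -> str:
    """Classify a token; 'ONT', 'CNT' and 'T' are labels chosen for this sketch."""
    if tok.startswith("("):
        return "ONT"
    if tok.endswith(")"):
        return "CNT"
    return "T"


def duplicate_closing(tokens: list[str]) -> list[tuple[str, str]]:
    """Duplicate every closing nonterminal into a CNT1/CNT2 pair."""
    out = []
    for tok in tokens:
        kind = token_type(tok)
        if kind == "CNT":
            out += [(tok, "CNT1"), (tok, "CNT2")]
        else:
            out.append((tok, kind))
    return out


def attention_mask(typed: list[tuple[str, str]]) -> list[list[int]]:
    """Return mask[i][j] == 1 iff position j may be attended from position i."""
    n = len(typed)
    mask = [[0] * n for _ in range(n)]
    stack: list[int] = []
    for i, (_, kind) in enumerate(typed):
        if kind == "CNT1":  # COMPOSE: attend to self, children, and opener
            mask[i][i] = 1
            while True:
                j = stack.pop()
                mask[i][j] = 1
                if typed[j][1] == "ONT":
                    break
            stack.append(i)  # the composed position replaces its children
        else:  # STACK: attend to everything currently on the stack
            if kind != "CNT2":
                stack.append(i)
            for j in stack:
                mask[i][j] = 1
    return mask


def attended(mask: list[list[int]], i: int) -> list[int]:
    """List the positions that position i may attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


typed = duplicate_closing("(S (NP the blue bird NP) (VP sings VP) S)".split())
mask = attention_mask(typed)
```

In this example, the word sings sits at zero-based position 8 and `attended(mask, 8)` contains only the opening (S, the composed NP), the opening (VP, and sings itself, so the words inside the subject are reachable only through the composed representation.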
Relative positional encoding. In Transformer-XL, the positional information presented to the model is based on the difference between the attending position $i$ and the attended position $j$, i.e., $i - j$. This distance neither reflects nor uses the topology of the tree. We thus generalize how relative positions are provided to the attention mechanism such that any matrix $R \in \mathbb{Z}^{T' \times T'}$ can be used, where $R_{ij}$ is the relative position between $i$ and $j$. For TGs, we define $R_{ij} = \delta(i) - \delta(j)$, where $\delta(i)$ is the depth of the $i$-th token in the tree. Note that the relative distance $R_{ij}$ will only be computed if $A_{ij} = 1$ (i.e., $j$ may be attended from $i$).
For instance, for the action sequence in Figure 1, the relative distance between (the positions corresponding to) the words sings and bird is never computed, but it will be computed between sings and its sibling NP covering the blue bird.
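To make the depth-based relative position concrete, the sketch below computes a depth for each token of the duplicated example sequence and takes the difference in depth between an attending and an attended position. The depth convention is an assumption of this example: the root opening nonterminal has depth 0, a terminal has the depth of its parent plus one, and both copies of a closing nonterminal share the depth of their constituent.

```python
# Duplicated action sequence for "(S (NP the blue bird NP) (VP sings VP) S)",
# written out explicitly so that this sketch is self-contained.
TYPED = [
    ("(S", "ONT"), ("(NP", "ONT"), ("the", "T"), ("blue", "T"), ("bird", "T"),
    ("NP)", "CNT1"), ("NP)", "CNT2"), ("(VP", "ONT"), ("sings", "T"),
    ("VP)", "CNT1"), ("VP)", "CNT2"), ("S)", "CNT1"), ("S)", "CNT2"),
]


def depths(typed: list[tuple[str, str]]) -> list[int]:
    """Depth of every token under the convention assumed in the lead-in."""
    depth, out = 0, []
    for _, kind in typed:
        if kind == "ONT":
            out.append(depth)   # the opener sits at the constituent's depth
            depth += 1          # its children are one level deeper
        elif kind == "CNT1":
            depth -= 1          # leaving the constituent
            out.append(depth)
        else:                   # terminals and CNT2 use the current depth
            out.append(depth)
    return out


def relative_position(delta: list[int], i: int, j: int) -> int:
    """Depth difference between attending position i and attended position j."""
    return delta[i] - delta[j]


delta = depths(TYPED)
sings_to_np = relative_position(delta, 8, 5)  # sings -> composed NP)
sings_to_s = relative_position(delta, 8, 0)   # sings -> opening (S
```

Under this convention, sings is one level below the composed subject NP) and two levels below the opening (S, regardless of how many words the subject contains.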

Segmentation and recurrence
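Algorithm 1 STACK/COMPOSE attention
Input: a′, sequence of tokens
Output: A, attention mask
    S ← empty stack
    A ← 0
    for i = 0 to T′ − 1 do
        if type(a′_i) = CNT1 then                  ▷ COMPOSE
            j ← i
            while a′_j is not an opening nonterminal do
                A_ij ← 1
                j ← S.pop()
            end while
            A_ij ← 1
            S.push(i)
        else                                       ▷ STACK
            if type(a′_i) ≠ CNT2 then
                S.push(i)
            end if
            for j ∈ S do
                A_ij ← 1
            end for
        end if
    end for
    return A                                       ▷ Attention mask

In the same manner as Transformer-XL, Transformer Grammars are recurrent neural networks that can process arbitrarily long sequences¹ as consecutive segments that contain a fixed number of tokens $L$, maintaining and updating a memory of temporal dimension $M$ from one segment to the next. With $\tau$ the index of the current segment, segment $\tau$ covers positions $\tau L$ through $(\tau + 1)L - 1$. The core of the model is composed of $K$ stacked recurrent layers, i.e., for $1 \leq k \leq K$:

$$\left(h_\tau^{(k)}, m_{\tau+1}^{(k)}\right) = \mathrm{Layer}^{(k)}\left(h_\tau^{(k-1)}, m_\tau^{(k)}, A_\tau, R_\tau\right)$$

where for each segment $\tau$:
• $h_\tau^{(k)} \in \mathbb{R}^{L \times d}$ is the sequence of hidden states, which forms the input for layer $k + 1$,
• $m_\tau^{(k)} \in \mathbb{R}^{M \times d}$ is the memory,
• $A_\tau \in \mathbb{R}^{L \times (M+L)}$ is the attention mask from the current segment to the current segment and the memory,
• and $R_\tau \in \mathbb{Z}^{L \times (M+L)}$ is the corresponding relative positions matrix.

¹ This desirable property of Transformer-XL is the main justification for our using it as baseline and starting point.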
All layers receive the same attention mask and relative positions matrix. Each layer $k$ is composed of a multi-head self-attention (SelfAttn) sub-layer and a position-wise feed-forward network (FFN) sub-layer (with residual connections followed by layer normalization, omitted for clarity), as well as an update to the memory for the next segment. The output of the last layer, $h_\tau^{(K)}$, is multiplied by the transpose of the embedding matrix, $E^\top$, to get the unnormalized next-token log probabilities.
Self-attention. Using the notation of Dai et al. (2019), let $W_q$, $W_{k,E}$, $W_{k,R}$, $W_v$, $u$, and $v$ be the trainable model parameters. Let $[\cdot, \cdot]$ denote a concatenation operation along the time dimension. For a single head, the attention score for an attending position $i$ and an attended position $j$ is the sum of two terms, the second of which depends on the relative position $R_{ij}$. Much like in Transformer-XL, the second term can be computed efficiently as the relative positions take values within a small interval $[R_{\min}, R_{\max}]$.
The mask $A$ (§2.1) is applied element-wise on the scores, which sets masked entries to $-\infty$. The normalized attention weights are then obtained by applying a softmax activation function to the scores; the final attention outputs are the product of the attention weights and the values. In practice, we use multiple heads: the outputs of each are concatenated and passed to a linear transformation.
Memory update. In Transformer-XLs, the memory is updated by shifting the current input into it. Here we take advantage of the fact that positions within a subtree that have been COMPOSEd are never attended to in the future, and a fortiori in the following segments. Hence, only positions that may be attended need to be added or kept in the memory. This requires careful book-keeping of which position in the memory corresponds to which original position in the input sequence, both (i) to perform the update, and (ii) to compute the correct attention mask and relative positions.
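The masking and normalization steps can be sketched for a single attending position as follows. The score function uses the grouped two-term form of the Transformer-XL relative attention score of Dai et al. (2019), which is assumed here rather than taken from this paper; the vectors are toy values, and at least one position is assumed to be unmasked.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_score(q_i, k_j, r_ij, u, v) -> float:
    """Assumed Transformer-XL-style score: (q_i + u).k_j + (q_i + v).r_ij.

    r_ij stands for the embedding of the relative position R_ij.
    """
    content = dot([q + b for q, b in zip(q_i, u)], k_j)
    position = dot([q + b for q, b in zip(q_i, v)], r_ij)
    return content + position


def masked_attention_row(scores, mask_row, values):
    """Mask the scores of one attending position, softmax, and mix the values."""
    masked = [s if m else -math.inf for s, m in zip(scores, mask_row)]
    top = max(masked)  # finite as long as one position is unmasked
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    out = [sum(w * val[d] for w, val in zip(weights, values)) for d in range(dim)]
    return weights, out


score = attention_score([1.0, 0.0], [2.0, 3.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0])
weights, out = masked_attention_row(
    scores=[1.0, 2.0, 3.0],
    mask_row=[1, 0, 1],
    values=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
```

The masked middle position receives exactly zero weight, so its value vector cannot influence the output no matter how high its raw score is.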

Properties
Recursive composition. Transformer Grammars accomplish recursive compositions via a custom attention mask that reflects the hierarchical phrase structures within natural language. Although the mask at a position $i + 1$ depends on the mask at position $i$, during training the entire attention mask matrix can be precomputed, and then applied independently to compute multiple syntactic compositions in parallel for the whole segment. For instance, in the example sequence from Figure 2, during training the representations of NP and VP are computed in parallel, even though their closing nonterminals are at different positions (6 and 10, respectively) in the sequence. Every following layer of Transformer Grammars then takes the composed representations at previous layers, and composes them further. For instance, at position 12, the second layer will form a composed representation of a sentence constituent S) by using as input the first layer representations of NP) and VP). A consequence of this approach is that at least $d$ layers are needed for tokens of depth $d$ to affect the topmost composed representation, a property it shares with conventional Transformers applied to trees (Hahn, 2020).
Context-modulated composition. TGs' composition steps use a COMPOSE attention mask at each closing nonterminal of type CNT1, and all other actions use a STACK attention mask. The stack mask makes available the representations of the completed constituents, words, and open nonterminals on the stack. Thus, in the example in Figure 2, the word sings can attend to the closed constituent NP), as well as ancestor nonterminals (S and (VP. But, importantly, at sings, information about all preceding words is accessible only through the composed NP) representation, thus enforcing the compressive effect of syntactic composition.
At higher layers, STACK and COMPOSE attentions have a subtle interaction worth making explicit.The STACK attention that is used to compute the representation for sings can look at the composed representation of the preceding subject NP), meaning that a certain amount of "outside information" can enter into the computation of the composed VP.The availability of outside information deviates from the strict bottom-up compositionality of RNNGs and similar models.
How does this outside information impact composition? In TGs (in contrast to Transformer-XLs), the influence of outside context on composed representations is indirect, and we therefore argue that the TG learner has a bias against capturing such outside information in the composed representation. Our argument relies on two facts: (i) that learning to compose a representation of a constituent ending at position $i$ is driven by predictions (or prediction failures) of a subsequent symbol $a_j$, where $j > i$; and (ii) that if the prediction of $a_j$ depends crucially on information outside of the constituent ending at $i$, then there will always be a more direct attention path than the one via the composed representation at $a_i$. The existence of two paths with different numbers of operations, a more direct one (directly via attention) and a less direct one (via composition followed by attention), explains the bias against including outside information in composed representations, and in favor of bottom-up information.
Finally, we remark that questions of whether and how contextual information plays a role in composition are complex and unresolved.Bowman et al. (2016) showed that allowing outside information to modulate compositional computations leads to better composed representations, and justified this design on the grounds that outside information may play a crucial disambiguating role in the composition function.

Experiments
We compare Transformer Grammars with two Transformer-XL baselines: (i) one trained only on the terminal word sequences (TXL (terminals)), and (ii) another trained on the linearized tree sequence as done by Choe and Charniak (2016), henceforth denoted as TXL (trees). We remark that model (i) is a word-level language model that estimates the probability of surface strings p(x), whereas model (ii) is a syntactic language model that estimates p(x, y). We additionally compare against two prior syntactic LMs: (i) the "generative parsing as language modeling" approach of Qian et al. (2021), which operates in a similar fashion to the linearized Choe and Charniak (2016) baseline, albeit with two attention heads that are specialized for syntax (though differently from TGs' explicit recursive syntactic compositions); and (ii) the batched RNNG model of Noji and Oseki (2021).
Datasets. We conduct experiments on both the Penn Treebank (PTB; Marcus et al., 1993) dataset (≈ 1M words), and the BLLIP-LG (Charniak et al., 2000) dataset according to the split by Hu et al. (2020) (≈ 40M words). We use the parsed, preprocessed, sentence-level PTB dataset of Dyer et al. (2016), where unseen words and singletons in the training set are mapped to a special set of unknown word symbols as proposed by Petrov and Klein (2007). For BLLIP-LG, we use the parse trees provided by Hu et al. (2020). Tokenization is performed with SentencePiece (Kudo and Richardson, 2018) using a unigram language model subword algorithm (Kudo, 2018) and a vocabulary of 32K word-pieces. For BLLIP-LG, we consider two settings: (i) we model each sentence independently and (ii) we model each document (each of which is composed of multiple sentences) independently.
Experimental details. To account for training variance, for each model (TGs, TXL (terminals), and TXL (trees)), we train 100 models of the same size with independent random initializations. On PTB, we use 16-layer models with 12M parameters, whereas on BLLIP-LG, we use 16-layer models with 252M parameters. We select for each training run the model checkpoint with the lowest validation loss, computed using a single gold proposal tree for each sentence.

Language modeling perplexity
Experimental setup. Whereas the probability of a string $x$ can be computed directly by left-to-right decomposition for models operating on strings, for models operating on the joint distribution of strings and syntax trees, we define $p(x)$ as the marginal distribution: $p(x) = \sum_{y \in Y_x} p(x, y)$, where $Y_x$ is the set of possible trees for $x$. As the cardinality of this set is infinite, exact computation of this probability is intractable. However, we can compute a lower bound on $p(x)$ by approximately marginalizing over a much smaller set of proposal trees $Y'_x \subset Y_x$, i.e., $p(x) \geq \sum_{y \in Y'_x} p(x, y)$. For a given $x$, we would want $Y'_x$ to be the set of trees for which $p(y \mid x)$ is largest. As this parsing distribution is unavailable, we approximate it with a proposal model $q(y \mid x)$. The better this approximation is, the tighter the lower bound on $p(x)$ is. We use as $q(y \mid x)$ a separately trained discriminative RNNG, and $Y'_x$ is a set of $N = 300$ trees, sampled without replacement, as an approximation to the set of $N$ trees with largest $p(y \mid x)$. Naturally, regardless of how $Y'_x$ is chosen, the approximate marginal $\sum_{y \in Y'_x} p(x, y)$ computed on a subset of $Y_x$ is a lower bound of the true probability $p(x)$.
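The approximate marginalization above amounts to a log-sum-exp over the joint log probabilities of the proposal trees. The following minimal sketch assumes that the joint log probabilities have already been computed by the syntactic language model for a set of distinct proposal trees; the numbers are toy values chosen for illustration.

```python
import math


def log_marginal_lower_bound(joint_log_probs: list[float]) -> float:
    """Lower bound on log p(x): log of the sum of p(x, y) over proposal trees.

    The proposal trees are assumed to be distinct; counting a tree twice
    would inflate the sum and break the bound.
    """
    top = max(joint_log_probs)  # subtract the maximum for numerical stability
    return top + math.log(sum(math.exp(lp - top) for lp in joint_log_probs))


# Toy joint probabilities p(x, y) for three distinct proposal trees.
proposals = [math.log(0.10), math.log(0.05), math.log(0.02)]
bound = log_marginal_lower_bound(proposals)
bound_top1 = log_marginal_lower_bound(proposals[:1])
```

Adding more distinct proposal trees can only increase the sum, which is why a better proposal model, one that covers more of the probability mass, yields a tighter bound.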
We compute the word perplexity of the validation and test splits of the datasets under the models as $\mathrm{PPL}(D) = \left(\prod_{x \in D} p(x)\right)^{-1/N_w}$, where $N_w$ is the total number of words in the dataset $D$. It is exact for the models operating on words, and a conservative approximation (an upper bound) for the models operating on the joint distribution of strings and syntax trees.
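In log space, the word perplexity is the exponential of the negative total log probability divided by the number of words. The sketch below assumes that per-sentence log probabilities (exact, or lower bounds for the syntactic models) are given; the toy values are for illustration only.

```python
import math


def word_perplexity(sentence_log_probs: list[float], num_words: int) -> float:
    """PPL(D) = exp(-(1 / N_w) * sum over sentences of log p(x))."""
    return math.exp(-sum(sentence_log_probs) / num_words)


# Toy dataset: two three-word sentences, each with probability (1/2)^3.
log_probs = [3 * math.log(0.5), 3 * math.log(0.5)]
ppl = word_perplexity(log_probs, num_words=6)
```

Because the syntactic models supply lower bounds on log p(x), substituting them into this formula can only overestimate the perplexity, which is the sense in which their reported perplexity is conservative.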
For the document-level language models, given a document that consists of $N_s$ sentences, for each sentence $i$ in the document, we need to marginalize over all possible syntax trees for each of the $i - 1$ preceding sentences in that document. We approximate this by greedily picking the single most likely syntax tree under the model for the first $i - 1$ sentences, before concatenating this single-path prefix with the 300 tree proposals for the last sentence.
Discussion. We report the mean and sample standard deviation of perplexity (first 3 columns) in Table 1, and plot their distributions in Figure 3.
Although all three models share the exact same number of model parameters and training dataset, our very first observation is that both the Transformer-XL baseline trained on the linearized tree sequences (TXL (trees)) and the proposed TG model achieve a lower perplexity than a Transformer-XL trained only on terminals (TXL (terminals)) on PTB and the sentence-level BLLIP, even though their reported perplexity is in fact only an upper bound, whereas the perplexity calculation for TXL (terminals) is exact. This shows that joint modeling of syntactic structures and surface strings in Transformers, even without any explicit inductive bias for making use of the syntactic information (e.g., TXL (trees)), is still helpful for improving perplexity. We conjecture that the next-token prediction task is made easier by the presence of nonterminals within the context, which restricts the word classes that may appear next. Although there are more such prediction events for the linearized tree sequences than for the words-only model, the predictions of the nonterminals are marginalized out at evaluation time. At training time, it might seem that the learning demands placed on the model are higher, and that having to predict the syntactic structures could produce an interference and reduce the available model capacity for predicting the words. Here we do not find this to be the case, as evidenced by both syntactic language models' better perplexity.
relative increase) at the document-level on BLLIP-LG. We investigate the causes of this degradation in the Analysis section below (§4).

Parse reranking
As human-annotated syntax trees are available for the PTB test split, we also compare models in their parsing accuracy by reranking the 300 candidate samples produced by RNNG for each sentence. We report the mean and sample standard deviation of bracketing $F_1$ as computed with EVALB (Sekine and Collins, 1997) in Table 1, and plot its distribution in Figure 3. We observe that TG does slightly better (+0.1%) than TXL (trees) on this task; the small difference in mean is nevertheless statistically significant (two-sided Welch's unequal variances t-test, $p < 10^{-3}$).
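For reference, the statistic behind Welch's unequal variances t-test can be computed as in the sketch below. The two samples are toy numbers chosen only to exercise the formula; they are not the scores reported in this paper. The p-value additionally requires the CDF of Student's t distribution, which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy samples, for illustration only.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```

Unlike Student's t-test, this variant does not pool the two sample variances, which makes it suitable when the two groups of training runs have different spreads.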

Syntactic generalization
Experimental setup. Hu et al. (2020) developed a series of test suites that probe the syntactic ability of language models on a large set of syntactic phenomena. The aim of this task is to comprehensively assess the ability of language models to syntactically generalize in a human-like fashion, which constitutes a key feature of human linguistic ability.
A model succeeds on a given test case when the probabilities it assigns to specifically crafted examples obey an inequality (or conjunctions thereof) justified by how humans process language. We use models trained on independent sentences from BLLIP-LG, evaluate on parse trees provided by Hu et al. (2020) and generated using an RNNG proposal model, and we report the average syntactic generalization (SG) score across the same set of 31 test suites.
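The success criterion can be sketched as a conjunction of inequalities over surprisals (negative log probabilities). In the example below, the condition names, the surprisal values, and the suite names are hypothetical, and the average SG score is assumed to be the mean of per-suite accuracies.

```python
def case_passes(surprisal: dict[str, float], criteria: list[tuple[str, str]]) -> bool:
    """A test case passes when every (lower, higher) surprisal inequality holds."""
    return all(surprisal[lower] < surprisal[higher] for lower, higher in criteria)


def average_sg_score(suites: dict[str, list[bool]]) -> float:
    """Mean over suites of the fraction of passed cases (assumed aggregation)."""
    per_suite = [sum(results) / len(results) for results in suites.values()]
    return sum(per_suite) / len(per_suite)


# Hypothetical surprisals at the critical region of two variants of a sentence.
example_case = {"grammatical": 3.2, "ungrammatical": 7.9}
criteria = [("grammatical", "ungrammatical")]
passed = case_passes(example_case, criteria)

# Hypothetical pass/fail results for two small suites.
suites = {"suite_a": [True, True, False, True], "suite_b": [True, False]}
score = average_sg_score(suites)
```

Averaging per suite first gives each syntactic phenomenon equal weight regardless of how many test cases its suite contains.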
Discussion. We report the mean and standard deviation of the average SG score in Table 1, and plot its distribution in Figure 3. Our first observation is that the average SG score is substantially higher for models trained on linearized trees than on words alone: both TG and TXL (trees) outperform TXL (terminals). Interestingly, this also extends to models that are orders of magnitude larger and trained on much more data, such as GPT-2, Gopher (Rae et al., 2021, 280B params.), and Chinchilla (Hoffmann et al., 2022, 70B params.). We believe that this result can be explained in three steps. First, the modeling of the structure via the nonterminals by TG and TXL (trees) can be seen, during training, as providing additional syntactic supervision. This enables them to pick, from a large number of candidate trees, good parses for a sentence. Second, as the SG score is computed from inequalities involving model surprisals on words, we perform an approximate marginalization step for TG and TXL (trees). In this approximate marginalization, valid parses are therefore heavily weighted. Finally, when the model has a strong preference for syntactically correct parses, the tasks from the test suite become easier, accounting for these models' higher scores. The results of the large LMs show that model scale alone is insufficient to offset this effect.
Our second observation is that, comparing Transformer Grammars to TXL (trees), our approach is most beneficial on tasks that are most related to modeling structure, i.e., parse reranking, in addition to the comprehensive SG test suite. On both tasks, Transformer Grammars achieve higher bracketing $F_1$ and average SG scores with a statistically significant difference. We believe that this performance is explained by the restricted attention in Transformer Grammars, which prevents the model from attending to syntactically irrelevant parts of the input and encourages it to learn informative composed representations of subtrees.
In Figure 4, we present a breakdown of the SG results. As expected from the average SG score, TXL (terminals) performs worse than both TXL (trees) and TG, except for Gross Syntactic State, where it nearly reaches 100%. TG and TXL (trees) have very similar scores on all circuits except Licensing, where TG substantially outperforms TXL (trees).²
Altogether, these results demonstrate the benefits of recursive syntactic compositions for improving LM performance at syntax-sensitive benchmarks of human linguistic competence, even in the case of powerful Transformer-based language models that are trained at the medium data scale (≈ 40M words). Furthermore, our findings shed more light on which syntactic constructions benefit the most from explicit syntactic compositions.
² One might wonder why Licensing and Gross Syntactic State deviate from the pattern of results seen in other circuits. Licensing, where TGs excel, involves evaluating restrictions on pairs of words that are linearly separated, but "structurally local" (specifically, they stand in a c-command relationship). Since TGs are strongly biased to learn to make predictions in terms of such structurally local relations, it is unsurprising to find success in Licensing. On the other hand, success in Gross Syntactic State requires tracking whether an initial clause is subordinate (in which case a main clause should follow), or a main clause (in which case the sentence should end). Since subordinate clauses are introduced by one of a few subordinating conjunctions, and the content of a main clause is relatively independent of its subordinate, a syntax-free learner will easily learn that the subordinating conjunction determines the Gross Syntactic State, explaining the result.

Analysis
To better understand what is causing the pattern of results in the previous section, we perform a series of analysis experiments: ablations, comparative analysis of emitted probabilities, and probing of the representations.

Ablation experiments
As TGs jointly model the words and syntax tree, we perform ablation experiments where the words are preserved but the syntactic structure is transformed. The aim of these experiments is to better understand to what extent having access to the "right" syntactic structures at training time is an important factor behind TGs' success. Would TGs still do just as well when they are trained with trivial or deterministically transformed syntax trees?
We also study the effect of different kinds of position information on our metrics.

Transformed structures
We transform a syntax tree into a binary left-branching one by moving the opening nonterminals to the left without reordering them, and, after the first two terminals, placing the required closing nonterminal after each terminal.For instance, (S (NP the blue bird NP) (VP sings VP) S) is transformed into (S (NP (VP the blue VP) bird NP) sings S).Symmetrically, we form a binary right-branching tree from this example, (S the (NP blue (VP bird sings VP) NP) S).Lastly, we define the reversed trees where the structure is reversed, i.e., the children of a node are put in reverse order, but the order of the terminals is preserved, in this case (S (VP the VP) (NP blue bird sings NP) S).In all three transformations, the apparent syntactic structure, as indicated by the nonterminals, is no longer the true syntactic structure of the sentence.
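The three transformations can be sketched on the running example as follows. This is an illustrative reading of the description above, not the authors' code: the function names are chosen for this example, and the two binarizing transformations are only implemented for the case where the number of nonterminals is one less than the number of terminals, as in the example.

```python
def split_tree(tokens: list[str]) -> tuple[list[str], list[str]]:
    """Return nonterminal labels (in opening order) and terminals (in order)."""
    labels = [t[1:] for t in tokens if t.startswith("(")]
    words = [t for t in tokens if not t.startswith("(") and not t.endswith(")")]
    return labels, words


def left_branching(tree: str) -> str:
    labels, words = split_tree(tree.split())
    if len(labels) != len(words) - 1:
        raise ValueError("sketch only covers len(labels) == len(words) - 1")
    out = ["(" + label for label in labels] + [words[0]]
    for word, label in zip(words[1:], reversed(labels)):
        out += [word, label + ")"]  # close one constituent after each word
    return " ".join(out)


def right_branching(tree: str) -> str:
    labels, words = split_tree(tree.split())
    if len(labels) != len(words) - 1:
        raise ValueError("sketch only covers len(labels) == len(words) - 1")
    out = []
    for label, word in zip(labels, words[:-1]):
        out += ["(" + label, word]  # open one constituent before each word
    out.append(words[-1])
    out += [label + ")" for label in reversed(labels)]
    return " ".join(out)


def parse(tokens: list[str]) -> list:
    """Parse into nested [label, children] lists; terminals stay strings."""
    root, stack = None, []
    for tok in tokens:
        if tok.startswith("("):
            node = [tok[1:], []]
            if stack:
                stack[-1][1].append(node)
            else:
                root = node
            stack.append(node)
        elif tok.endswith(")"):
            stack.pop()
        else:
            stack[-1][1].append(tok)
    return root


def reversed_tree(tree: str) -> str:
    """Reverse the order of children, refilling terminals in original order."""
    tokens = tree.split()
    words = iter(split_tree(tokens)[1])

    def emit(node: list) -> list[str]:
        label, children = node
        out = ["(" + label]
        for child in reversed(children):
            out += [next(words)] if isinstance(child, str) else emit(child)
        out.append(label + ")")
        return out

    return " ".join(emit(parse(tokens)))


TREE = "(S (NP the blue bird NP) (VP sings VP) S)"
left, right, rev = left_branching(TREE), right_branching(TREE), reversed_tree(TREE)
```

On the running example, the three functions return the three transformed sequences given in the text; how trees with other ratios of nonterminals to terminals are handled is not specified above and is therefore left out of this sketch.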
We train and evaluate (on the validation sets) TG and TXL (trees) on such transformed trees, and report our results in Table 2. For TG, perplexity, syntactic generalization, and bracketing $F_1$³ are much worse, regardless of the transformation, compared to using the original trees. This is unsurprising considering that the operations performed by the model mechanically depend on the syntactic structure represented by the nonterminals. An apparent structure that does not correspond to the sentence will therefore lead to an unsuitable sequence of operations. More precisely, perplexity is more impacted on reversed trees than on left-branching trees, and right-branching trees have the least impact. Indeed, left-branching trees are comparatively easier to model, because these sequences are formed of a prefix of opening nonterminals, followed by an interleaved sequence of terminals and closing nonterminals. Right-branching trees are similarly easy, and furthermore, the COMPOSE operations specific to TG only happen when the closing nonterminals are encountered towards the end of the sequence, which is fully determined by its left context. Unlike TG, TXLs (trees) have no constraints to use the syntactic information to model terminals, and thus they are free to use it or not. However, this information is still part of the training data, and model capacity must be used to account for its distribution. This explains why the TXL (trees) performance is harmed by the tree transformations, but less than the TG performance.
³ Here, we do not use the DELETE_LABEL directives from COLLINS.prm.

Positional information
As observed by Haviv et al. (2022), not using positional information for TXL (terminals) and TXL (trees) has a negative but small impact on perplexity, which the authors conjecture is due to the ability of the model to learn positional information using the causal mask. Under this hypothesis, it follows that the impact on the syntactic generalization and the bracketing F1 scores should be small, which is what we observe. The most impacted model is TXL (trees) on BLLIP-LG documents, which we conjecture is due to the long sequences of tokens, including repeated identical nonterminals, making access to a good positional signal important. For TG, the results are similar, and the same mechanism can be posited to be at play. Its attention mask is not only causal, but also very sparse. Because there are few tokens that can be attended to, position-based querying is at the same time less critical and easier to learn.
We train and evaluate a variant of TG using the difference in linear position as the relative position function, instead of the difference in tree depth, and find almost no impact on performance. This is readily explained by the same reasons as above: because TG's attention mask is so sparse, position-based querying matters little. Given the same empirical performance, we solely ground our choice in its theoretical justification (see §2.1).
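To make the contrast between the two relative position functions concrete, the following sketch computes both for a linearized tree. The depth convention used here (a nonterminal's opening and closing tokens sit at the depth of its node, terminals one level below their parent) is an assumption made for illustration; the precise definition is the one given in §2.1, and the function names are hypothetical.

```python
def token_depths(tokens):
    """Tree depth of every token of a linearized tree (assumed convention)."""
    depths, depth = [], 0
    for tok in tokens:
        if tok.startswith("("):      # opening nonterminal: record, then go down
            depths.append(depth)
            depth += 1
        elif tok.endswith(")"):      # closing nonterminal: go up, then record
            depth -= 1
            depths.append(depth)
        else:                        # terminal: child of the innermost open node
            depths.append(depth)
    return depths


def relative_positions(positions):
    """rel[i][j] = positions[i] - positions[j] for every query i and key j."""
    return [[pi - pj for pj in positions] for pi in positions]


tokens = "(S (NP the blue bird NP) (VP sings VP) S)".split()
by_depth = relative_positions(token_depths(tokens))
by_linear_position = relative_positions(list(range(len(tokens))))
```

Under this convention, sibling terminals such as "the" and "bird" are at relative position 0 by depth but 2 by linear position, which shows how the two functions can differ even on short sequences.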

Regression analysis of probabilities
To determine when TGs are more or less successful than the unrestricted TXL (trees) model, we predict the difference ∆ in log probabilities of the true terminal a_i under the two models, i.e., its log probability under TG minus its log probability under TXL (trees), so that negative values indicate that TGs do worse. To reduce variance stemming from model initialization, we use an ensemble of 100 Transformer Grammars and an ensemble of 100 TXLs (trees).

Terminal frequencies
We hypothesize that the syntactically-restricted attention pattern of TGs, where subsequent predictions can only attend to composed representations, prevents them from learning the non-syntactic dimensions of the data distribution, such as rare co-occurrences, to the same extent as TXLs (trees). Based on this hypothesis, we expect the TGs' predictions to be worse for rare tokens.
We therefore compute the empirical unigram distribution of the terminals in the training split of BLLIP-LG documents, and partition terminals into high-frequency (f ≥ 10⁻³), medium-frequency (10⁻⁵ ≤ f < 10⁻³), and low-frequency (f < 10⁻⁵) buckets. We then define three binary variables, indicating whether the terminal at a given position has a high, medium, or low frequency, and use these in an ordinary least squares model to predict the difference in log probabilities on the BLLIP-LG validation set: ∆ ∼ HighFreq + MediumFreq + LowFreq.
This shows that, although TGs can predict the terminals appearing most frequently almost as well as TXLs (trees) do, they struggle to predict rarer ones. We hypothesize that lexical co-occurrences that cross syntactic units can be learnt directly by TXLs (trees), whereas this is more difficult to do for TGs. Indeed, a consequence of STACK/COMPOSE attention is that a terminal A can attend to another terminal B iff B is in A's left context and B is A's sibling. Our result suggests that this is not happening sufficiently often for TGs to predict rare terminals as well as TXLs (trees) do.
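The frequency-bucket regression described above can be sketched in a few lines. The sketch assumes that the model has no intercept (otherwise the three exhaustive indicators would be collinear with it) and that terminals unseen in training have frequency 0; under these assumptions the normal equations decouple and each ordinary least squares coefficient is simply the mean of ∆ within its bucket. Function names and the toy numbers are hypothetical.

```python
from collections import Counter


def unigram_frequencies(train_terminals):
    """Empirical unigram distribution of the training terminals."""
    counts = Counter(train_terminals)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_bucket(f):
    """Map a unigram frequency to the name of its indicator variable."""
    if f >= 1e-3:
        return "HighFreq"
    if f >= 1e-5:
        return "MediumFreq"
    return "LowFreq"


def fit_indicator_ols(deltas, buckets):
    """OLS of delta on mutually exclusive, exhaustive indicators, no intercept.

    X^T X is diagonal for such a design, so the coefficient of each
    indicator is the mean of delta over the positions where it equals 1.
    """
    sums, sizes = {}, {}
    for delta, bucket in zip(deltas, buckets):
        sums[bucket] = sums.get(bucket, 0.0) + delta
        sizes[bucket] = sizes.get(bucket, 0) + 1
    return {bucket: sums[bucket] / sizes[bucket] for bucket in sums}


# Toy illustration with made-up numbers (not results from the paper).
freqs = {"the": 5e-2, "bird": 2e-4, "zyzzyva": 1e-7}
validation_terminals = ["the", "bird", "the", "zyzzyva"]
deltas = [-0.01, -0.20, -0.03, -0.90]
buckets = [frequency_bucket(freqs.get(w, 0.0)) for w in validation_terminals]
coefficients = fit_indicator_ols(deltas, buckets)
```

A negative coefficient for a bucket means that, on average over that bucket, TG assigns a lower log probability to the true terminal than TXL (trees) does.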

Copying
Likewise, we hypothesize that TXL (trees) is better at copying words from the context than are TGs.
We define three binary variables, indicating (i) whether the true terminal to predict appears in the context in a previous sentence, but not in the current one; (ii) whether it appears in the context in the current sentence; or (iii) whether it does not appear in the context at all. We use these in a new ordinary least squares model to predict the difference in log probabilities: ∆ ∼ InContextPrevSentences + InContextCurSentence + NotInContext.
We find an adjusted R² value of 0.010, and coefficients β_InContextPrevSentences = −0.2871, β_InContextCurSentence = −0.1003, β_NotInContext = −0.1340, all statistically different from 0 with a p-value < 10⁻³. This finding suggests that TGs perform worse than TXL (trees) on all three conditions, although the difference is most pronounced for terminals appearing in a previous sentence (but not in the current one). This observation suggests that TXLs (trees) benefit from a priming effect (previously seen tokens becoming more likely), whereas this effect is diminished in TGs.
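The three copying conditions can be assigned in a single pass over a document, as in the sketch below. It assumes that the context of a terminal is every terminal preceding it in the same document and that matching is exact string equality; both are assumptions of this illustration, and the function name is hypothetical.

```python
def copy_conditions(sentences):
    """Label every terminal of a document with its copying condition.

    `sentences` is a list of sentences, each a list of terminals.
    """
    labels = []
    seen_in_previous = set()
    for sentence in sentences:
        seen_in_current = set()
        for terminal in sentence:
            if terminal in seen_in_current:
                # (ii) already seen earlier in the current sentence
                labels.append("InContextCurSentence")
            elif terminal in seen_in_previous:
                # (i) seen in a previous sentence, not yet in the current one
                labels.append("InContextPrevSentences")
            else:
                # (iii) never seen in the context
                labels.append("NotInContext")
            seen_in_current.add(terminal)
        seen_in_previous |= seen_in_current
    return labels


document = [["the", "bird", "sings"], ["the", "bird", "saw", "the", "cat"]]
conditions = copy_conditions(document)
```

The resulting labels are mutually exclusive and exhaustive, so they can be used directly as the three indicator variables of the regression.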

Probing analysis of representations
Experimental setup To quantify how well the representations learnt by TG, TXL (trees) and TXL (terminals) encode syntactic information, we use the information-theoretic probing framework developed by Voita and Titov (2020), capturing in a principled way how well the probes explain the labels given the representations, as well as how readily available the information is in the representations (i.e., how complex the probes are). If the probe labels can be easily predicted from the model's hidden state activations, even with a simple probe, then this provides an indication that the probed phenomenon is more saliently encoded within the model's learnt vector representations. Here we use four probing tasks. The first two probing tasks are taken from Liu et al. (2019), where we predict (i) the part-of-speech tag of each terminal, and (ii) the grandparent constituent tag of each completed phrase/subtree, which requires an understanding of the relevant phrase-structure information. We then use two other probing tasks from Conneau et al. (2018): (iii) predicting the top constituent of the syntax tree and (iv) word-content tagging. We train the probes on the activations output by each layer on the training set of BLLIP-LG, then evaluate the code length on the validation set. For the POS tagging task, we use a single representation for each word. For the three other tasks, we use the representation corresponding to the closing nonterminal for each subtree. For TG, two such representations are available due to the closing nonterminal duplication (§2.1), and we consider them separately; for TXL (terminals), we aggregate the representations of the first and last words of the phrase/subtree.
Discussion We report in Figure 5 the code length of the probing labels given the model representations on the four probing tasks. We observe that the representations from TG and TXL (trees) explain equally well the POS tags. Predicting the ancestor constituents is much easier using the representations from the closing nonterminals in the STACK positions in TG compared to TXL (trees), which is expected considering that the ancestors, by construction of the tree, are still on the stack when the subtree is closed. Conversely, predicting the top constituents (i.e., the sequence of immediate children) is easy to do from the COMPOSE representations, which precisely attend over them. On these three syntactic tasks, TXL (terminals), which does not benefit from syntactic supervision, does predictably worse. Finally, it is much easier to predict whether a subtree contains a word from a given set from the representations from TXLs (trees) compared to either type of representations from TG, and even from the TXL (terminals), suggesting that TXL (trees) models retain which words appear in the context better than TGs do, echoing our regression analysis results.
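As background for the code lengths discussed above, the sketch below illustrates the online (prequential) code of Voita and Titov (2020), one of the two estimators described in that framework; whether it is the estimator used for Figure 5 is not stated here, so this is only an illustration. The first block of labels is transmitted with a uniform code, and every later block is encoded with a probe trained on all preceding examples. The `fit_probe` interface is hypothetical.

```python
import math


def online_code_length(labels, num_classes, block_ends, fit_probe):
    """Online (prequential) code length of `labels`, in bits.

    block_ends: increasing indices t_1 < ... < t_S, with t_S == len(labels).
    fit_probe(train_indices): trains a probe on the given examples and
        returns a function mapping an example index to a list of class
        probabilities (hypothetical interface for this sketch).
    """
    # The first block is sent with a uniform code over the classes.
    bits = block_ends[0] * math.log2(num_classes)
    for start, end in zip(block_ends[:-1], block_ends[1:]):
        predict = fit_probe(range(start))  # probe trained on examples [0, start)
        for i in range(start, end):
            bits -= math.log2(predict(i)[labels[i]])
    return bits


# A probe that ignores its input and always predicts the uniform distribution
# compresses nothing: 8 labels over 4 classes cost 8 * log2(4) = 16 bits.
labels = [0, 1, 2, 3, 0, 1, 2, 3]
uniform_probe = lambda train_indices: (lambda i: [0.25] * 4)
uniform_bits = online_code_length(labels, 4, [2, 4, 8], uniform_probe)
```

A shorter code length means that the labels are both predictable from the representations and learnable from few examples, which is why it is read as a measure of how saliently the information is encoded.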

Related Work
A variety of work has augmented language models with syntactic or hierarchical biases. The RNNG model (Dyer et al., 2016; Kuncoro et al., 2017) jointly models trees and strings and uses recursive networks to build representations of phrases (similar to the approach taken here); however, scaling RNNG training is nontrivial (Noji and Oseki, 2021). Other forms of structural bias that do not use observed syntax trees have come in the form of stack-structured memory (Yogatama et al., 2018), running RNNs at multiple scales (Chung et al., 2017), and structuring the "forget" gates in LSTMs to encourage hierarchy (Shen et al., 2019).
With the advent of Transformers and large-scale pretraining, the question of whether syntax and hierarchy "are still needed" received renewed interest. While bidirectional encoders do learn a great deal about syntax from pretraining (Manning et al., 2020), long-tail syntactic phenomena continue to pose a problem, and may be implicated in systematic semantic failures (Ettinger, 2020). Prior work has devised multiple strategies for injecting syntactic inductive biases into Transformers (Wang et al., 2020; Sundararaman et al., 2019; Kuncoro et al., 2020; Sachan et al., 2021; Bai et al., 2021). However, improved syntactic awareness is not found to be beneficial for some language understanding tasks (Warstadt et al., 2020; Kuncoro et al., 2020; Pruksachatkun et al., 2020; Sachan et al., 2021).
Our approach to injecting syntactic biases into generative transformer language models combines two modeling traditions: (i) syntactic language models that estimate the joint probability of strings and trees (Jurafsky et al., 1995; Chelba and Jelinek, 2000; Roark, 2001; Henderson, 2004; Mirowski and Vlachos, 2015; Choe and Charniak, 2016; Kim et al., 2019), and (ii) constraining attention patterns in accordance with syntactic structures (Strubell et al., 2018; Wang et al., 2019; Peng et al., 2019; Zhang et al., 2020; Nguyen et al., 2020; Astudillo et al., 2020). TGs are perhaps most closely related to the model proposed in Qian et al. (2021), who similarly combined syntactic language modeling with syntax-based attention constraints. We differ from this model in two primary ways. First, TGs use a new kind of typed-attention mask with duplicated closing nonterminal symbols that implement recursive syntactic compositions, which were identified as a critical component of RNN-based syntax models (Dyer et al., 2016; Kim et al., 2019; Wilcox et al., 2019; Futrell et al., 2019). Second, this paper explores an extension of sentence-level syntactic models to models of full documents. Modeling of multi-sentence sequences has been a key feature behind recent language modeling successes (Radford et al., 2019; Brown et al., 2020), and thus understanding how syntax interacts with this modeling problem is of considerable interest.

Conclusion
Transformer Grammars are a new syntactic language model that implements recursive syntactic composition of phrase representations through attention. Experiments show that TGs outperform prior work on two syntax-sensitive language modeling evaluation metrics. On sentence-level language modeling, TGs outperform a strong Transformer-XL that operates only on the word sequences, although we find that they perform worse at the document level, when restricted to use a single composed representation for each previous sentence. We also find that models trained with structural information are strictly better on all metrics than Transformers trained on words alone. While just as efficient to sample from as any autoregressive language model, TGs do not provide probability estimates as easily, and using them where these are needed requires accepting more computation, or further research into efficient methods for probability estimation. Taken more broadly, our findings emphasize the ongoing importance of finding better and scalable ways to encourage language models to internalize the structural properties of language.
Our implementation is available upon request.
Figure 1: An example that represents a pair of a string x and its phrase-structure tree y, which are then represented as a sequence of actions that construct (x, y) in a top-down, left-to-right fashion (Dyer et al., 2016; Choe and Charniak, 2016).
<s> (S (NP the blue bird NP) NP) (VP sings VP) VP) S
Attention mask with STACK/COMPOSE attention. STACK is represented in blue, whereas COMPOSE is denoted in orange.

Figure 2: Processing of an example sentence: (S (NP the blue bird NP) (VP sings VP) S)

Figure 3: Distributions of the metrics of interest on the test sets, with 100 random initializations for each model. All the differences in means are statistically significant (p < 10⁻³).

Figure 4: Per-circuit breakdown of the SG scores.

Figure 5: Code length of the labels given the model representations on probing tasks.

Table 1: Results on the test sets obtained for 100 models of TG, TXL (trees) and TXL. The results marked ♢ are directly taken from prior work, and may not be directly comparable due to differences in model […]. ♡ We select the best perplexity results for the batched RNNG (35M parameters, beam size of 1,000). ♥ Model trained on 100M Wikipedia tokens. The PTB parsing results of the batched RNNG (Noji and Oseki, 2021, Table 2) are not directly comparable since they reported results with beam search inference, whereas we use a parse reranking setup. Perplexities of the large LMs (last 3 rows) are not reported because of likely test data contamination.

Table 2: Results on the validation split of the datasets. † Perplexities reported for TG (all variants) and TXL (trees) are upper bounds, derived from approximately marginalizing over a set of proposal trees.
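The reason these perplexities are upper bounds can be made explicit with a short sketch. Since p(x) is the sum of p(x, y) over all trees y, summing over only a set of distinct proposal trees gives a lower bound on p(x), and hence an upper bound on perplexity. The sketch assumes that the joint log probabilities log p(x, y) have already been computed by the model for each proposal tree, and that perplexity is normalized by a word count supplied by the caller; function names and the toy numbers are hypothetical.

```python
import math


def log_marginal_lower_bound(joint_log_probs):
    """log of the sum of p(x, y) over distinct proposal trees (log-sum-exp).

    `joint_log_probs` maps each proposal tree (e.g., its linearized string)
    to log p(x, y); using a dict guarantees each tree is counted once.
    """
    values = list(joint_log_probs.values())
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values))


def perplexity_upper_bound(per_sentence_proposals, num_words):
    """Upper bound on perplexity from lower bounds on each log p(x)."""
    total = sum(log_marginal_lower_bound(p) for p in per_sentence_proposals)
    return math.exp(-total / num_words)


# Toy illustration with made-up probabilities: one 2-word sentence and two
# proposal trees with joint probability 0.125 each, so p(x) >= 0.25.
proposals = [{"tree_a": math.log(0.125), "tree_b": math.log(0.125)}]
bound = perplexity_upper_bound(proposals, num_words=2)
```

Adding further distinct proposal trees can only increase the estimated marginal, so the bound tightens as the proposal set grows.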