The Return of Lexical Dependencies: Neural Lexicalized PCFGs

Abstract In this paper we demonstrate that context-free grammar (CFG) based methods for grammar induction benefit from modeling lexical dependencies. This contrasts with the most popular current methods for grammar induction, which focus on discovering either constituents or dependencies. Previous approaches to marrying these two disparate syntactic formalisms (e.g., lexicalized PCFGs) have been plagued by sparsity, making them unsuitable for unsupervised grammar induction. However, in this work, we present novel neural models of lexicalized PCFGs that allow us to overcome sparsity problems and effectively induce both constituents and dependencies within a single model. Experiments demonstrate that this unified framework yields stronger results on both representations than are achieved when modeling either formalism alone.1


Introduction
Unsupervised grammar induction aims at building a formal device for discovering syntactic structure from natural language corpora. Within the scope of grammar induction, there are two main directions of research: unsupervised constituency parsing, which attempts to discover the underlying structure of phrases, and unsupervised dependency parsing, which attempts to discover the underlying relations between words. Early work on induction of syntactic structure focused on learning phrase structure and generally used some variant of probabilistic context-free grammars (PCFGs; Lari and Young, 1990; Charniak, 1996; Clark, 2001). In recent years, dependency grammars have gained favor as an alternative syntactic formulation (Yuret, 1998; Carroll and Charniak, 1992; Paskin, 2002). Specifically, the dependency model with valence (DMV) (Klein and Manning, 2004) forms the basis for many modern approaches in dependency induction. Most recent models for grammar induction, be they for PCFGs, DMVs, or other formulations, have generally coupled these models with some variety of neural model to use embeddings to capture word similarities, improve the flexibility of model parameterization, or both (He et al., 2018; Jin et al., 2019; Kim et al., 2019; Han et al., 2019).
1 Code is available at https://github.com/neulab/neural-lpcfg.
Notably, the two different syntactic formalisms capture very different views of syntax. Phrase structure takes advantage of an abstracted recursive view of language, while the dependency structure more concretely focuses on the propensity of particular words in a sentence to relate to each other syntactically. However, few attempts at unsupervised grammar induction have been made to marry the two and let both benefit each other. This is precisely the issue we attempt to tackle in this paper.
As a specific formalism that allows us to model both formalisms at once, we turn to lexicalized probabilistic context-free grammars (L-PCFGs; Collins, 2003). L-PCFGs borrow the underlying machinery from PCFGs but expand the grammar by allowing rules to include information about the lexical heads of each phrase, an example of which is shown in Figure 1. The head annotation in the L-PCFG provides lexical dependencies that can be informative in estimating the probabilities of generation rules. For example, the probability of VP[CHASING] → VBZ[IS] VP[CHASING] is much higher than VP → VBZ VP, because ''chasing'' is a present participle. Historically, these grammars have been mostly used for supervised parsing, combined with traditional count-based estimators of rule probabilities (Collins, 2003). Within this context, lexicalized grammar rules are powerful, but the counts available are sparse, and thus required extensive smoothing to achieve competitive results (Bikel, 2004; Hockenmaier and Steedman, 2002).
Figure 1: Lexicalized phrase structure tree for ''the dog is chasing the cat.'' The head word of each constituent is indicated with parentheses.
In this paper, we contend that with recent advances in neural modeling, it is time to return to modeling lexical dependencies, specifically in the context of unsupervised constituent-based grammar induction. We propose neural L-PCFGs as a parameter-sharing method to alleviate the sparsity problem of lexicalized PCFGs. Figure 2 illustrates the generation procedure of a neural L-PCFG. Different from traditional lexicalized PCFGs, the probabilities of production rules are not independently parameterized, but rather conditioned on the representations of non-terminals, preterminals, and lexical items (§3). Apart from devising lexicalized production rules (§2.3) and their corresponding scoring function, we also follow Kim et al.'s (2019) compound PCFG model for (non-lexicalized) constituency parsing with compound variables (§3.2), enabling modeling of a continuous mixture of grammar rules. We define how to efficiently train (§4.1) and perform inference (§4.2) in this model using dynamic programming and variational inference.
Put together, we expect this to result in a model that is effective and simultaneously induces both phrase structure and lexical dependencies, whereas previous work has focused on only one. Our empirical evaluation examines this hypothesis, asking the following questions: In neural grammar induction models, is it possible to jointly and effectively learn both phrase structure and lexical dependencies? Is using both in concert better at the respective tasks than specialized methods that model only one at a time?
Our experiments (§5.3) answer in the affirmative, with better performance than baselines designed specially for either dependency or constituency parsing under multiple settings. Importantly, our detailed ablations show that methods of factorization play an important role in the performance of neural L-PCFGs (§5.3.2). Finally, qualitatively (§5.4), we find that latent labels induced by our model align with annotated gold non-terminals in PTB.

Motivation and Definitions
In this section, we will first provide the background of constituency grammars and dependency grammars, and then formally define the general L-PCFG, illustrating how both dependencies and phrase structures can be induced from L-PCFGs.

Phrase Structures and CFGs
The phrase structure of a sentence is formed by recursively splitting constituents. In the parse above, the sentence is split into a noun phrase (NP) and a verb phrase (VP), which can themselves be further split into smaller constituents; for example, the NP comprises a determiner (DT) ''the'' and a normal noun (NN) ''dog.'' Such phrase structures are represented as a context-free grammar 4 (CFG), which can generate an infinite set of sentences via the repeated application of a finite set R of rules:
$$S \to A, \quad A \in N$$
$$A \to B\, C, \quad A \in N,\; B, C \in N \cup P$$
$$A \to \alpha, \quad A \in P,\; \alpha \in \Sigma$$
Here S denotes a start symbol, N is a finite set of non-terminals, P is a finite set of preterminals, and Σ is a set of terminal symbols (i.e., words and punctuation).

Dependency Structures and Grammars
[Dependency tree for ''The dog is chasing the cat'' with arc labels ROOT, det, nsubj, aux, nobj, det]
In a dependency tree of a sentence, the syntactic nodes are the words in the sentence. Here the root is the root word of the sentence, and the children of each word are its dependents. Above, the root word is chasing, which has three dependents: its subject (nsubj) dog, auxiliary verb (aux) is, and object (nobj) cat. A dependency grammar 5 specifies the possible head-dependent pairs D = {(α_i, β_i)} ⊆ (V ∪ {ROOT}) × V, where the set V denotes the vocabulary.

Lexicalized CFGs
Although both the constituency and dependency grammars capture some aspects of syntax, we aim to leverage their relative strengths in a single unified formalism. In a unified grammar, these two types of structure can benefit each other. For example, in The dog is chasing the cat of my neighbor's, while the phrase of my neighbor's might be incorrectly marked as the adverbial phrase of chasing in a dependency model, the constituency parser can provide the constraint that the cat of my neighbor's is a constituent, thereby requiring chasing to be the head of the phrase.
Lexicalized CFGs are based on a backbone similar to standard CFGs but parameterized to be sensitive to lexical dependencies such as those used in dependency grammars. Similarly to CFGs, L-CFGs are defined as a five-tuple T = (S, N, P, Σ, R). The differences lie in the formulation of rules R:
$$S \to A[\alpha], \quad A \in N \tag{1}$$
$$A[\alpha] \to B[\alpha]\, C[\beta], \quad A \in N,\; B, C \in N \cup P \tag{2l}$$
$$A[\alpha] \to B[\beta]\, C[\alpha], \quad A \in N,\; B, C \in N \cup P \tag{2r}$$
$$A[\alpha] \to \alpha, \quad A \in P \tag{3}$$
where α, β ∈ Σ are words, and mark the head of a constituent when they appear in ''[·]''. Branching rules 2l and 2r encode the dependencies (α, β). In a lexicalized CFG, a sentence x can be generated by iterative binary splitting and emission, forming a parse tree t = [r_1, r_2, ..., r_{2|x|}], where rules r_i are sorted from top to bottom and from left to right. We will denote the set of parse trees that generate x within grammar T as T_x.
4 Note that ε ∉ T and Σ ∩ T = ∅, so this formulation does not capture the structure of sentences of length zero or one.
5 This work assumes a projective tree.
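To illustrate how a single lexicalized tree carries both representations, the following minimal sketch reads constituent spans and head-dependent pairs off the tree for ''the dog is chasing the cat''. The `Node` class, the use of word indices for heads, and the function name are illustrative assumptions, not part of the formalism.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str                       # non-terminal or preterminal symbol
    head: int                        # index of the head word in the sentence
    left: Optional["Node"] = None    # both children are None for preterminals
    right: Optional["Node"] = None


def spans_and_deps(node, start=0):
    """Return (end, spans, deps) for the subtree that begins at word `start`.

    spans holds half-open (start, end) constituents; deps holds
    (head_index, dependent_index) pairs, one per branching rule.
    """
    if node.left is None:            # preterminal: covers exactly one word
        return start + 1, set(), set()
    mid, left_spans, left_deps = spans_and_deps(node.left, start)
    end, right_spans, right_deps = spans_and_deps(node.right, mid)
    # The child that does not inherit the parent's head supplies the dependent.
    if node.left.head == node.head:
        dependent = node.right.head
    else:
        dependent = node.left.head
    spans = left_spans | right_spans | {(start, end)}
    deps = left_deps | right_deps | {(node.head, dependent)}
    return end, spans, deps


# words: the(0) dog(1) is(2) chasing(3) the(4) cat(5)
subject = Node("NP", 1, Node("DT", 0), Node("NN", 1))
obj = Node("NP", 5, Node("DT", 4), Node("NN", 5))
inner_vp = Node("VP", 3, Node("VBG", 3), obj)
outer_vp = Node("VP", 3, Node("VBZ", 2), inner_vp)
tree = Node("S", 3, subject, outer_vp)

end, spans, deps = spans_and_deps(tree)
```

Here `deps` contains the three dependents of chasing (dog, is, cat) plus the two determiner attachments, while `spans` contains the five constituents of the phrase structure.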

Grammar Induction with L-PCFGs
In this subsection, we will introduce L-PCFGs, the probabilistic formulation for L-CFGs. The task of grammar induction is to ask, given a corpus C ⊂ Σ^+, how can we obtain the probabilistic generative grammar that maximizes its likelihood? With the induced grammar, we are also interested in how to obtain the trees that are most likely given an individual sentence; in other words, syntactic parsing according to this grammar.
We begin by defining the probability distribution over sentences x, by marginalizing over all parse trees that may have generated x:
$$p_\theta(x \mid z) = \sum_{t \in T_x} \frac{\tilde{p}_z(t)}{Z(T, z)},$$
where $\tilde{p}_z(t)$ is an unnormalized probability of a parse tree (which we will refer to as an energy function), $Z(T, z) = \sum_{t \in T} \tilde{p}_z(t)$ is the normalizing constant, and z is a compound variable (§3.2) that allows for more complex and expressive generative grammars (Robbins et al., 1951). We define the energy function of a parse tree by exponentiating a score $G_\theta(t, z)$, that is, $\tilde{p}_z(t) = \exp(G_\theta(t, z))$, where θ is the parameter of function $G_\theta$. Theoretically, $G_\theta(t, z)$ could be an arbitrary scoring function, but in this paper, as with most previous work, we consider a context-free scoring function, where the score of each rule $r_i$ is independent of the other rules in the parse tree t:
$$G_\theta(t, z) = \sum_{r_i \in t} g_\theta(r_i, z),$$
where $g_\theta(r, z) : R \times \mathbb{R}^n \to \mathbb{R}$ is the rule-scoring function, which maps the rule and latent variable $z \in \mathbb{R}^n$ to real space, assigning a log likelihood to each rule. This formulation allows for efficient calculation using dynamic programming. We also include a restriction that the energies must be top-down locally normalized, under which the partition function automatically equals 1, that is, $Z(T, z) = 1$.
To train an L-PCFG, we maximize the log likelihood of the corpus (the latent variable is marginalized out):
$$\theta^* = \arg\max_\theta \sum_{x \in C} \log \int_z p_\theta(x \mid z)\, p(z)\, dz,$$
and obtain the most likely parse tree of a sentence by maximizing the posterior probability:
$$t^* = \arg\max_{t \in T_x} \int_z p_\theta(t \mid x, z)\, p(z \mid x)\, dz.$$
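As a small illustration of the context-free scoring function, the sketch below sums per-rule log scores into a tree score, exponentiates it into an energy, and marginalizes over two candidate trees for a two-word sentence with z held fixed. The rule strings and all numbers are made up for illustration and are not taken from any trained model.

```python
import math


def tree_log_score(tree_rules, g):
    """G(t, z) for a fixed z: the sum of the per-rule log scores g(r, z)."""
    return sum(g[rule] for rule in tree_rules)


# Hypothetical rule log scores for the sentence "dog barks" (illustrative only).
g = {
    "S -> A[barks]": math.log(0.7),
    "S -> A[dog]": math.log(0.3),
    "A[barks] -> NN[dog] VBZ[barks]": math.log(0.5),   # right-headed (2r)
    "A[dog] -> NN[dog] VBZ[barks]": math.log(0.2),     # left-headed (2l)
    "NN[dog] -> dog": 0.0,                             # emission rules score 0
    "VBZ[barks] -> barks": 0.0,
}

# Each tree for a sentence of length 2 uses 2 * 2 = 4 rules.
trees = [
    ["S -> A[barks]", "A[barks] -> NN[dog] VBZ[barks]",
     "NN[dog] -> dog", "VBZ[barks] -> barks"],
    ["S -> A[dog]", "A[dog] -> NN[dog] VBZ[barks]",
     "NN[dog] -> dog", "VBZ[barks] -> barks"],
]

energies = [math.exp(tree_log_score(t, g)) for t in trees]
p_x = sum(energies)                      # marginal over the candidate trees
best_tree = trees[energies.index(max(energies))]
```

With these toy numbers the two trees contribute 0.35 and 0.06, so the marginal is 0.41 and the right-headed tree is the most probable one.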

Neural Lexicalized PCFGs
As noted, one advantage of L-PCFGs is that the obtained t* encodes both dependencies and phrase structures, allowing both to be induced simultaneously. We also expect this to improve performance, because different information is captured by each of these two structures. However, this expressivity comes at a price: more complex rules. In contrast to the traditional PCFG, which has O(|N|(|N| + |P|)^2) production rules, the L-PCFG requires O(|V||N|(|N| + |P|)^2) production rules. Because traditionally rules of L-PCFGs have been parameterized independently by scalars, namely, g_θ(r_i, z) = θ_i (Collins, 2003), these parameters were hard to estimate because of data sparsity.
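To give a feel for the size of this blow-up, the following sketch evaluates the two asymptotic rule counts with constant factors ignored. The grammar sizes (10 non-terminals, 20 preterminals) match the experimental setup reported later in the paper; the vocabulary size of 10,000 is an assumption chosen only for illustration.

```python
def num_pcfg_rules(num_nonterminals, num_preterminals):
    """Order of the number of PCFG rules: |N| (|N| + |P|)^2."""
    return num_nonterminals * (num_nonterminals + num_preterminals) ** 2


def num_lpcfg_rules(vocab_size, num_nonterminals, num_preterminals):
    """Order of the number of L-PCFG rules: |V| |N| (|N| + |P|)^2."""
    return vocab_size * num_pcfg_rules(num_nonterminals, num_preterminals)


pcfg = num_pcfg_rules(10, 20)               # 9,000
lpcfg = num_lpcfg_rules(10_000, 10, 20)     # 90,000,000 (vocabulary size assumed)
```

Under these assumptions, independent scalar parameters would need on the order of ninety million rule weights, which is the sparsity problem that parameter sharing is meant to address.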
We propose an alternate parameterization, the neural L-PCFG, which ameliorates these sparsity problems through parameter sharing, and the compound L-PCFG, which allows a more flexible sentence-by-sentence parameterization of the model. Below, we explain the neural L-PCFG factorization that we found performed best but include ablations of our decisions in Section 5.3.2.

Neural L-PCFG Factorization
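The score of an individual rule is calculated as the combination of several component probabilities, where $\curvearrowleft$ and $\curvearrowright$ mark the headedness direction of rules 2l and 2r, respectively:
root to non-terminal probability $p_z(S \to A)$: Probability that the start symbol produces a non-terminal A.
word emission probability $p_z(A \to \alpha)$: Probability that the head word of a constituent is α, conditioned on the non-terminal of the constituent being A.
head non-terminal probability $p_z(B, \curvearrowleft \mid A, \alpha)$ or $p_z(C, \curvearrowright \mid A, \alpha)$: Probability of the headedness direction and head-inheriting child, conditioned on the parent non-terminal and head word.
non-inheriting child probability $p_z(C \mid A, B, \alpha, \curvearrowleft)$ or $p_z(B \mid A, C, \alpha, \curvearrowright)$: Probability of the non-inheriting child, conditioned on the headedness direction, and parent and head-inheriting child non-terminals.
The score of root-seeking rule 1 is factorized as the sum of the root to non-terminal score and the word emission score, as shown in Equation (7):
$$g_\theta(S \to A[\alpha], z) = \log p_z(S \to A) + \log p_z(A \to \alpha) \tag{7}$$
The scores of branching rules 2l and 2r are factorized as the sum of a head non-terminal score, a non-inheriting child score, and a word emission score. Equation (8) describes the factorization of the scores of rules 2l and 2r:
$$g_\theta(A[\alpha] \to B[\alpha]\, C[\beta], z) = \log p_z(B, \curvearrowleft \mid A, \alpha) + \log p_z(C \mid A, B, \alpha, \curvearrowleft) + \log p_z(C \to \beta)$$
$$g_\theta(A[\alpha] \to B[\beta]\, C[\alpha], z) = \log p_z(C, \curvearrowright \mid A, \alpha) + \log p_z(B \mid A, C, \alpha, \curvearrowright) + \log p_z(B \to \beta) \tag{8}$$
Because the head of preterminals is already specified upon generation of one of the ancestor non-terminals, the score of emission rule 3 is 0.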
The component probabilities are all similarly parameterized: vectors corresponding to component non-terminals or terminals are fed through a multi-layer perceptron denoted f(·), with ; denoting concatenation of its inputs, and a dot product is taken with another vector corresponding to a component non-terminal or terminal. This holds for the root to non-terminal probability, the word emission probability, the non-inheriting child probabilities for left- and right-headed dependencies, and the head non-terminal probabilities. The partition functions of the non-inheriting child probabilities satisfy $\sum_C p_z(C \mid A, B, \alpha, \curvearrowleft) = 1$ and $\sum_B p_z(B \mid A, C, \alpha, \curvearrowright) = 1$, and the partition function of the head non-terminal probabilities satisfies $\sum_B p_z(B, \curvearrowleft \mid A, \alpha) + \sum_C p_z(C, \curvearrowright \mid A, \alpha) = 1$.
Here vectors $u, v, w \in \mathbb{R}^d$ represent the embeddings of non-terminals, preterminals, and words. $f^{(i)}$, i = 1, 2, 3, are multilayer perceptrons with different sets of parameters, where we use residual connections (He et al., 2016) between layers to facilitate training of deeper models.
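The shared pattern described above (concatenate embeddings with z, apply a residual multilayer perceptron, take dot products with output embeddings, normalize) can be sketched for the word emission probability as follows. This is a minimal sketch under stated assumptions: the ReLU activation, the final projection `W_out`, the dimensions, and the random weights are all illustrative choices, not the exact formula used in the paper.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def residual_mlp(layers, x):
    """Apply square layers with a ReLU and a residual connection (assumed form)."""
    for W in layers:
        h = matvec(W, x)
        x = [xi + max(0.0, hi) for xi, hi in zip(x, h)]
    return x


def softmax(scores):
    m = max(scores)                           # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def emission_probs(u_A, z, layers, W_out, word_embs):
    """p_z(A -> alpha) for every word alpha, given embedding u_A and latent z."""
    hidden = residual_mlp(layers, u_A + z)    # list concatenation plays [u_A; z]
    query = matvec(W_out, hidden)             # hypothetical projection to d dims
    return softmax([sum(w * q for w, q in zip(emb, query)) for emb in word_embs])


rng = random.Random(0)
d, n, vocab = 4, 2, 5                         # embedding, latent, vocabulary sizes


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


layers = [rand_matrix(n + d, n + d) for _ in range(2)]
W_out = rand_matrix(d, n + d)
word_embs = rand_matrix(vocab, d)
u_A = [rng.gauss(0.0, 1.0) for _ in range(d)]
z = [rng.gauss(0.0, 1.0) for _ in range(n)]

probs = emission_probs(u_A, z, layers, W_out, word_embs)
```

Because every rule probability is computed from a small set of shared embeddings and perceptron weights, words and symbols with similar embeddings receive similar probabilities, which is how the parameterization counters sparsity.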

Compound Grammar
Among various existing grammar induction models, the compound PCFG model of Kim et al. (2019) both shows highly competitive results and follows a PCFG-based formalism similar to ours, and thus we build upon this method. The compound in compound PCFG refers to the fact that it uses a compound probability distribution (Robbins et al., 1951) in modeling and estimation of its parameters. A compound probability distribution enables continuous variants of grammars, allowing the probabilities of the grammar to change based on the unique characteristics of the sentence. In general, compound variables can be devised in any way that may inform the specification of the rule probabilities (e.g., a structured variable to provide frame semantics or the social context in which the sentence is situated). In this way, compound grammar increases the capacity of the original PCFG.
In this paper, we use a latent compound variable z that is sampled from a standard spherical Gaussian distribution.
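The recursive, pre-order generation of a sentence from the start symbol and a sampled compound variable (Algorithm 1) can be sketched as below. The tables `P1` and `P2` are hypothetical toy stand-ins for the rule distributions; in the actual model they would be computed from the rule scores and from z, whereas here z is only sampled and passed along.

```python
import random

PRETERMINALS = {"DT", "NN", "VBZ", "VBG"}

# Toy root distribution over (non-terminal, head word); illustrative only.
P1 = [(1.0, ("S", "chasing"))]

# Toy branching distributions: (A, head) -> [(prob, (B, C, dependent, left_headed))].
P2 = {
    ("S", "chasing"): [(1.0, ("NP", "VP", "dog", False))],
    ("NP", "dog"): [(1.0, ("DT", "NN", "the", False))],
    ("NP", "cat"): [(1.0, ("DT", "NN", "the", False))],
    ("VP", "chasing"): [(0.5, ("VBZ", "VP", "is", False)),
                        (0.5, ("VBG", "NP", "cat", True))],
}


def sample(dist, rng):
    """Draw an outcome from a list of (probability, outcome) pairs."""
    r, acc = rng.random(), 0.0
    for prob, outcome in dist:
        acc += prob
        if r < acc:
            return outcome
    return dist[-1][1]


def recursive(A, alpha, z, rng):
    """Expand symbol A with head word alpha and return the generated words."""
    if A in PRETERMINALS:
        return [alpha]                          # emission rule 3
    B, C, beta, left_headed = sample(P2[(A, alpha)], rng)
    if left_headed:                             # rule 2l: B inherits the head
        return recursive(B, alpha, z, rng) + recursive(C, beta, z, rng)
    return recursive(B, beta, z, rng) + recursive(C, alpha, z, rng)  # rule 2r


def generate(rng, latent_dim=2):
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]   # z ~ N(0, I)
    A, alpha = sample(P1, rng)                              # root-seeking rule 1
    return recursive(A, alpha, z, rng)


sentence = generate(random.Random(0))
```

With these toy tables every sample has the form ''the dog'', then zero or more repetitions of ''is'', then ''chasing the cat'', since the head word chasing is passed down the right spine until the VBG preterminal emits it.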
Algorithm 1 Generative Procedure of Neural L-PCFGs: Sentences Are Generated from Start Symbol S and Compound Variable z Recursively.
Require: N, T, P_1, P_2
function RECURSIVE(N, α, z)
Algorithm 1 shows the procedure to generate a sentence recursively from a random compound variable and a distribution over the production rules in a pre-order traversal manner, where P_1 and P_2 are defined using g_θ from Equations (7) and (8), respectively.

Training and Inference

Training
It is intractable to obtain the exact log likelihood by integration over z, and estimation by Monte Carlo sampling would be hopelessly inefficient. However, we can optimize the evidence lower bound (ELBo):
$$\mathrm{ELBo}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p(z)\right],$$
where $q_\phi(z \mid x)$ is the proposal probability parameterized by an inference network, similar to those used in variational autoencoders (Kingma and Welling, 2014). The ELBo can be estimated by Monte Carlo sampling:
$$\mathrm{ELBo}(\theta, \phi; x) \approx \frac{1}{M} \sum_{i=1}^{M} \log p_\theta(x \mid \hat{z}^{(i)}) - \mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p(z)\right],$$
where $\hat{z}^{(i)} \sim q_\phi(z \mid x)$ for i = 1, ..., M. We model the proposal probability as an orthogonal Gaussian distribution:
$$q_\phi(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2)),$$
where (µ, σ) are output by the inference network:
$$\mu = f_\mu(x), \quad \sigma = f_\sigma(x).$$
Both $f_\mu$ and $f_\sigma$ are parameterized as LSTMs (Hochreiter and Schmidhuber, 1997). Note that the inference network can be optimized by the reparameterization trick (Kingma and Welling, 2014):
$$\hat{z} = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$$
where ⊙ denotes the Hadamard product. The KL divergence between $q_\phi(z \mid x)$ and $\mathcal{N}(0, I)$ is
$$\mathrm{KL}\left[q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\right] = -\frac{1}{2} \sum_{j=1}^{n} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right).$$

Initialization
We initialize word embeddings using GloVe embeddings (Pennington et al., 2014). We further cluster word embeddings with k-means (MacQueen, 1967), as shown in Figure 3, and use the centroids of the clusters to initialize the embeddings of preterminals. The k-means algorithm is initialized using the k-means++ method and trained until convergence. The intuition is that this gives the model a rough idea of syntactic categories before starting grammar induction. We also consider the variant without pretrained word embeddings, where we initialize word embeddings and preterminals both by drawing from N(0, I). Other parameters are initialized by Xavier normal initialization (Glorot and Bengio, 2010).
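Returning to the variational objective used for training, the sketch below shows the reparameterized sample, the closed-form KL divergence to a standard Gaussian, and a Monte Carlo ELBo estimate. It assumes that σ denotes per-dimension standard deviations; the function names and the `log_lik_fn` callback, which stands in for the dynamic program computing log p(x | z), are hypothetical.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), elementwise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)] in closed form."""
    return -0.5 * sum(1.0 + math.log(s * s) - m * m - s * s
                      for m, s in zip(mu, sigma))


def elbo_estimate(log_lik_fn, mu, sigma, rng, num_samples=1):
    """Monte Carlo reconstruction term minus the analytic KL term."""
    draws = [reparameterize(mu, sigma, rng) for _ in range(num_samples)]
    reconstruction = sum(log_lik_fn(z) for z in draws) / num_samples
    return reconstruction - kl_to_standard_normal(mu, sigma)


rng = random.Random(0)
z_hat = reparameterize([0.5, -0.5], [1.0, 2.0], rng)
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])    # q equals the prior
kl_shifted = kl_to_standard_normal([1.0], [1.0])           # mean shifted by 1
toy_elbo = elbo_estimate(lambda z: -3.0, [0.0], [1.0], rng, num_samples=4)
```

Writing the sample as a deterministic function of (µ, σ) and independent noise is what lets gradients flow into the inference network; the KL term needs no sampling at all.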

Curriculum Learning
We also apply curriculum learning (Bengio et al., 2009;Spitkovsky et al., 2010) to learn the grammar gradually. Starting at half of the maximum length in the training set, we raise the length limit by α% each epoch.
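One possible reading of this schedule is sketched below. The text does not say whether the increase is relative to the current limit or to the initial one; the sketch assumes compounding growth, a cap at the maximum training length, and rounding down, all of which are assumptions. The function and parameter names are hypothetical.

```python
def length_limit(epoch, max_len, alpha_pct):
    """Sentence length limit at a given epoch (0-indexed), capped at max_len."""
    limit = (max_len / 2) * (1 + alpha_pct / 100) ** epoch
    return min(max_len, int(limit))


# Example with a maximum length of 40 and a 10% raise per epoch (made-up values).
schedule = [length_limit(epoch, 40, 10) for epoch in range(12)]
```

Under this reading, training starts on sentences of at most 20 words and reaches the full length limit after a handful of epochs.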

Inference
We are interested in the induced parse tree for each sentence in the task of unsupervised parsing, that is, the most probable tree
$$\hat{t} = \arg\max_{t} \int_z p_\theta(t \mid x, z)\, p(z \mid x)\, dz,$$
where p(z | x) is the posterior over compound variables. However, it is intractable to obtain the most probable tree exactly. Hence we use the mean µ = f_µ(x) predicted by the inference network and replace p(z | x) with a Dirac delta distribution δ(z − µ) in place of the real distribution to approximate the integral:11
$$\hat{t} \approx \arg\max_{t} p_\theta(t \mid x, \mu).$$
The most probable tree can then be obtained via the CYK algorithm.
11 Note that it is also possible to use other methods for approximation. For example, we can use q_φ(z | x) in place of the posterior distribution. However, using it still results in high prediction variance of the max function. We did not observe a significant improvement with other methods.
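For a fixed z, the search for the most probable lexicalized tree can be sketched as a CYK-style Viterbi dynamic program over spans, head positions, and symbols. This is an illustrative sketch, not the released implementation: the three scoring callbacks are hypothetical stand-ins for the log probabilities in Equations (7) and (8), symbol names are assumed to be distinct across non-terminals and preterminals, and the sentence is assumed to have at least two words.

```python
import math
from itertools import product


def lexicalized_cky(words, nonterms, preterms, root_score, branch_score, emit_score):
    """Return (best log score, root head index, sorted dependency arcs)."""
    n = len(words)
    symbols = list(nonterms) + list(preterms)
    best = {}  # (i, j, h, X) -> (score, backpointer) for X over words[i:j], head h
    dep = {}   # (i, j, X) -> (best score plus emission of X's head, head index)

    def close(i, j, candidates):
        # Best way for X over [i, j) to serve as the non-inheriting child.
        for X in candidates:
            options = [(best[i, j, h, X][0] + emit_score(X, words[h]), h)
                       for h in range(i, j) if (i, j, h, X) in best]
            if options:
                dep[i, j, X] = max(options)

    def relax(key, score, back):
        if score > best.get(key, (-math.inf, None))[0]:
            best[key] = (score, back)

    for i in range(n):
        for T in preterms:
            best[i, i + 1, i, T] = (0.0, None)      # emission rule 3 scores 0
        close(i, i + 1, preterms)

    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k, A, B, C in product(range(i + 1, j), nonterms, symbols, symbols):
                if (i, k, B) not in dep or (k, j, C) not in dep:
                    continue
                left_dep, left_h = dep[i, k, B]
                right_dep, right_h = dep[k, j, C]
                for h in range(i, k):               # rule 2l: B inherits the head
                    if (i, k, h, B) in best:
                        s = (branch_score(A, words[h], B, C, True)
                             + best[i, k, h, B][0] + right_dep)
                        relax((i, j, h, A), s, (k, B, C, h, right_h))
                for h in range(k, j):               # rule 2r: C inherits the head
                    if (k, j, h, C) in best:
                        s = (branch_score(A, words[h], B, C, False)
                             + left_dep + best[k, j, h, C][0])
                        relax((i, j, h, A), s, (k, B, C, left_h, h))
            close(i, j, nonterms)

    score, head, top = max((root_score(A, words[h]) + best[0, n, h, A][0], h, A)
                           for A in nonterms for h in range(n)
                           if (0, n, h, A) in best)
    arcs = []

    def walk(i, j, h, X):
        back = best[i, j, h, X][1]
        if back is not None:
            k, B, C, hl, hr = back
            arcs.append((h, hr if hl == h else hl))  # (head, dependent)
            walk(i, k, hl, B)
            walk(k, j, hr, C)

    walk(0, n, head, top)
    return score, head, sorted(arcs)


def toy(good):
    """Toy log score: 0 for listed events, a fixed penalty otherwise."""
    return lambda *event: 0.0 if event in good else -10.0


score, head, arcs = lexicalized_cky(
    ["the", "dog", "barks"], ["S", "NP"], ["DT", "NN", "VB"],
    root_score=toy({("S", "barks")}),
    branch_score=toy({("S", "barks", "NP", "VB", False),
                      ("NP", "dog", "DT", "NN", False)}),
    emit_score=toy({("DT", "the"), ("NN", "dog"), ("NP", "dog"), ("VB", "barks")}),
)
```

The toy scores are not normalized probabilities; they only make one analysis clearly preferable, in which barks heads the sentence, dog depends on barks, and the depends on dog. The nested loops over span boundaries, split point, and head position give the quartic dependence on sentence length.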

Data Setup
All models are evaluated using the Penn Treebank (Marcus et al., 1993) as the test corpus, following the splits and preprocessing methods, including removing punctuation, provided by Kim et al. (2019). To convert the original phrase bracket and label annotations to dependency annotations, we use Stanford typed dependency representations (De Marneffe and Manning, 2008). We use three standard metrics to measure the performance of models on the validation and test sets: directed and undirected attachment score (DAS and UAS) for dependency parsing, and unlabeled constituent F1 score for constituency parsing.
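The three metrics can be sketched for a single sentence as follows. Evaluation conventions (for example, whether trivial spans are counted, or whether F1 is averaged per sentence or over the corpus) vary between papers and are not specified here, so this sketch makes the simplest choices; the function names are hypothetical.

```python
def attachment_scores(pred_arcs, gold_arcs):
    """Directed and undirected attachment scores over (head, dependent) pairs."""
    pred, gold = set(pred_arcs), set(gold_arcs)
    directed = len(pred & gold) / len(gold)
    gold_undirected = {frozenset(arc) for arc in gold}
    undirected = sum(frozenset(arc) in gold_undirected for arc in pred) / len(gold)
    return directed, undirected


def span_f1(pred_spans, gold_spans):
    """Unlabeled F1 between two sets of (start, end) constituent spans."""
    pred, gold = set(pred_spans), set(gold_spans)
    matched = len(pred & gold)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(pred), matched / len(gold)
    return 2 * precision * recall / (precision + recall)


# "the dog is chasing the cat": one predicted arc has its direction flipped.
gold_arcs = {(3, 1), (1, 0), (3, 2), (3, 5), (5, 4)}
pred_arcs = {(3, 1), (1, 0), (2, 3), (3, 5), (5, 4)}
das, uas = attachment_scores(pred_arcs, gold_arcs)

gold_spans = {(0, 2), (2, 6), (3, 6), (4, 6)}
pred_spans = {(0, 2), (0, 3), (3, 6), (4, 6)}
f1 = span_f1(pred_spans, gold_spans)
```

The flipped arc costs the directed score one point out of five but leaves the undirected score untouched, which is why both are reported.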
We tune hyperparameters of the model to minimize perplexity on the validation set. We choose perplexity because it requires only plain text and not annotated parse trees. Specifically, we tuned the architecture of f^(i), i = 1, 2, 3, in the space of multilayer perceptrons, with the dimension of each layer being n + d, with residual connections and different non-linear activation functions. Table 1 shows the final hyper-parameters of our model. Due to memory constraints on a single graphics card, we set the number of non-terminals and preterminals to 10 and 20, respectively. Later we will show that the compound PCFG's performance benefits from a larger grammar; it is therefore possible that the same is true for our neural L-PCFG. Section 7 includes a more detailed discussion of space complexity.

Baselines
We compare our neural L-PCFGs with the following baselines:
Compound PCFG The main difference between the compound PCFG (Kim et al., 2019) and the neural L-PCFG is the modeling of headedness and the dependency between head word and generated non-terminals or preterminals. We apply the same hyperparameters and techniques, including number of non-terminals and preterminals, initialization, curriculum learning, and variational training, to compound PCFGs for a fair comparison. Because compound PCFGs have no notion of dependencies, we extract dependencies from the compound PCFG with three kinds of heuristic head rules: left-headed, right-headed, and large-headed. Left-/right-headed mean always choosing the root of the left/right child constituent as the root of the parent constituent, whereas large-headedness is generated by a heuristic rule that chooses the root of the larger child constituent as the root of the parent constituent. Among these, we choose the method that obtains the best parsing accuracy on the dev set (making these results an oracle with access to more information than our proposed method).
DMV The DMV (Klein and Manning, 2004) can be seen as a special case of lexicalized PCFG where the generation rules provide inductive biases for dependency parsing but are also restricted; for example, a void-valence constituent cannot produce a full-valence constituent with the same head. Note that DMV uses far fewer parameters than the PCFG-based models, O(|P|^2). The compound PCFG uses a similar number of parameters as we do, O(n(|P| + |N|) + n^2).
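The three heuristic head rules used to read dependencies off unlexicalized compound PCFG trees can be sketched as follows. The tree encoding (nested pairs with word indices at the leaves) is an assumption, as is the tie-break in the large-headed rule, which the text does not specify; here ties go to the left child.

```python
def head_and_arcs(tree, rule):
    """Return (head index, size, arcs) for a binary tree of word indices."""
    if isinstance(tree, int):
        return tree, 1, []
    left_head, left_size, left_arcs = head_and_arcs(tree[0], rule)
    right_head, right_size, right_arcs = head_and_arcs(tree[1], rule)
    if rule == "left":
        use_left = True
    elif rule == "right":
        use_left = False
    elif rule == "large":
        use_left = left_size >= right_size     # tie-break is an assumption
    else:
        raise ValueError(f"unknown head rule: {rule}")
    head, dependent = (left_head, right_head) if use_left else (right_head, left_head)
    arcs = left_arcs + right_arcs + [(head, dependent)]
    return head, left_size + right_size, arcs


# ((the dog) (is (chasing (the cat)))) with words indexed 0..5
tree = ((0, 1), (2, (3, (4, 5))))
left_arcs = sorted(head_and_arcs(tree, "left")[2])
right_arcs = sorted(head_and_arcs(tree, "right")[2])
large_head = head_and_arcs(tree, "large")[0]
```

None of the three rules recovers chasing as the root of this sentence, which illustrates why such heuristics are a weak substitute for learned head rules.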
We compare models under two settings: (1) with gold tag information and (2) without it, denoted by ✓ and ✗, respectively in Table 2. To use gold tag information in training the neural L-PCFG, we assign the 19 most frequent tags as categories and combine the rest into a 20th ''other'' category. These categories are used as supervision for the preterminals. In this setting, instead of optimizing the log probability of the sentence, we optimize the log joint probability of the sentence and the tags.
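The tag-collapsing step can be sketched as below. The function name and the label used for the catch-all category are hypothetical, and the toy corpus keeps only two tags so that the effect is visible on a tiny example.

```python
from collections import Counter


def build_tag_map(tag_sequences, num_kept=19, other="OTHER"):
    """Map each tag to itself if it is among the most frequent, else to `other`."""
    counts = Counter(tag for sequence in tag_sequences for tag in sequence)
    kept = {tag for tag, _ in counts.most_common(num_kept)}
    return {tag: (tag if tag in kept else other) for tag in counts}


toy_corpus = [["DT", "NN", "VBZ"], ["DT", "NN", "JJ", "NN"]]
tag_map = build_tag_map(toy_corpus, num_kept=2)
collapsed = [[tag_map[tag] for tag in sequence] for sequence in toy_corpus]
```

With `num_kept=19` this yields the 20 categories that supervise the 20 preterminals in the gold-tag setting.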

Quantitative Results
First, in this section, we present and discuss quantitative results, as shown in Table 2.

Main Results
First comparing neural L-PCFGs with compound PCFGs, we can see that L-PCFGs perform slightly better on phrase structure prediction and achieve much better dependency accuracy. This shows that (1) lexical dependencies contribute somewhat to the learning of phrase structure, and (2) head rules learned by neural L-PCFGs are significantly more accurate than the heuristics that we applied to standard compound PCFGs. We also find that GloVe embeddings can help (unsupervised) dependency parsing, but do not benefit constituency parsing.
Next, we can compare the dependency induction accuracy of the neural L-PCFGs with the DMV. The results indicate that neural L-PCFGs without gold tags achieve even better accuracy than DMV with gold tags on both directed accuracy and undirected accuracy. As discussed before, DMV can be seen as a special case of L-PCFG where the attachment of children is conditioned on the valence of the parent tag, while in L-PCFG the generated head directions are conditioned on the parent non-terminal and the head word, which is more general. Comparatively positive results show that conditioning on generation rules not only is more general but also yields a better prediction of attachment.
Table 3 shows label-level recall (i.e., unlabeled recall of constituents annotated by each non-terminal). We observe that the neural L-PCFG outperforms all baselines on these frequent constituent categories.
Table 3: Fraction of ground truth constituents that were predicted as a constituent by the models broken down by label (i.e., label recall). Results of PRPN and ON are from Kim et al. (2019).

Impact of Factorization
Table 4 compares the effects of three alternate factorizations, including F III: $\log p_z(B, \curvearrowleft \mid A, \alpha) + \log p_z(C \mid A, B, \alpha)$. Factorization I assumes that the child non-terminals do not depend on the head lexical item, which influences the parsing result significantly. Although Factorization II is as general as our proposed method, it uses separate representations for different directions, $v_{BC\curvearrowleft}$ and $v_{BC\curvearrowright}$. Factorization III assumes the independence between direction and dependent non-terminals. These results indicate that our factorization strikes a good balance between modeling lexical dependencies and directionality, and avoiding over-parameterization of the model that may lead to sparsity and difficulties in learning.
Table 4: An ablation of dependency and constituency parsing results on the validation set with different settings of neural L-PCFG. All models are trained with GloVe word embeddings and without gold tags. ''w/ xavier init'' means that preterminals are initialized not by clustering centroids but by Xavier normal initialization. ''w/ Factorization N'' represents different factorization methods (§5.3.2).

Qualitative Analysis
We analyze our best model without gold tags in detail. Figure 4 visualizes the alignment between our induced non-terminals and gold constituent labels on the overlapping constituents of induced trees and the ground truth. For each constituent label, we show how frequently it annotates the same span as each induced non-terminal. We observe from the first map a clear alignment between certain linguistic labels and induced non-terminals (e.g., PP). But for other non-terminals, there is no clear alignment with induced classes. One hypothesis for this diffusion is the diversity of the syntactic roles of these constituents. To investigate this, we zoom in on noun phrases in the second map, and observe that NP-SBJ, NP-TMP, and NP-MNR are combined into a single non-terminal NT-5 in the induced grammar, and that NP, NP-PRD, and NP-CLR correspond to NT-2, NT-6, and NT-0, respectively.
Figure 4: Alignment between all induced non-terminals (x-axis) and gold non-terminals annotated in the PTB (y-axis). In the upper figure, we show the seven most frequent gold non-terminals, and list them by frequency from left to right. For each gold non-terminal, we show the proportion of each induced non-terminal. In the lower map, we break down the results of the noun phrase (NP) into subcategories. Darker color indicates higher proportion, and vice versa.
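The proportions visualized in such an alignment map can be computed as sketched below: for every span that is a constituent in both the induced and the gold tree, count the (gold label, induced label) pair and normalize per gold label. The function name and the toy label pairs are hypothetical and do not reproduce the reported results.

```python
from collections import Counter, defaultdict


def label_alignment(pairs):
    """Per gold label, the proportion of each induced label on shared spans."""
    counts = defaultdict(Counter)
    for gold_label, induced_label in pairs:
        counts[gold_label][induced_label] += 1
    return {
        gold: {induced: c / sum(row.values()) for induced, c in row.items()}
        for gold, row in counts.items()
    }


# Made-up pairs for spans that are constituents in both trees.
toy_pairs = [("NP", "NT-2"), ("NP", "NT-2"), ("NP", "NT-5"), ("PP", "NT-1")]
alignment = label_alignment(toy_pairs)
```

Each row of the resulting table sums to one, so a row concentrated on a single induced label corresponds to a dark cell in the map.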
We also include an example set of parses for comparing the DMV and neural L-PCFG in Table 5. Note that whereas the DMV uses ''to'' as the head of ''know'', the neural L-PCFG correctly inverts this relationship to produce a parse that is better aligned with the gold tree. One possible reason that the DMV tends to use ''to'' as the head is that the DMV has to carry the information that the verb is in the infinitive form, which would be lost if it used ''know'' as the head. In our model, however, such information is contained in the types of non-terminals. In this way, our model uses the open class word ''know'' as the root. Note that we also illustrate a similar failure case in this example: the neural L-PCFG uses ''if'' as the head of the if-clause, which is probably due to the independence between the root of the if-clause and ''know''.
A common mistake made by the neural L-PCFG is treating auxiliary verbs like adjectives that combine with the subject instead of modifying verb phrases. For example, the neural L-PCFG parses ''...the exchange will look at the performance...'' as ''((the exchange) will) (look (at (the performance)))'', whereas the compound PCFG produces the correct parse ''((the exchange) (will (look (at (the performance)))))''. A possible reason for this mistake is that English verb phrases are commonly left-headed, which makes attaching an auxiliary verb less probable as the left child of a verb phrase. This type of error may stem from the model's inability to assess the semantic function of auxiliary verbs (Bisk and Hockenmaier, 2015).

Related Work
Dependency vs Constituency Induction The decision to focus on modeling dependencies or constituencies has largely split the grammar induction community into two camps. The most popular approach has been focused on dependency formalisms (Klein and Manning, 2004; Spitkovsky et al., 2010, 2011, 2013; Mareček and Straka, 2013; Jiang et al., 2016; Tran and Bisk, 2018), whereas a second community has focused on inducing constituencies (Lane and Henderson, 2001; Ponvert et al., 2011; Golland et al., 2012; Jin et al., 2018). Induced constituencies can, in the case of CCG (Bisk and Hockenmaier, 2012, 2013), produce dependencies, but unlike our proposal, existing approaches do not jointly model both representations. CFGs have been used for decades to represent, analyze, and model the phrase structure of language (Chomsky, 1956; Pullum and Gazdar, 1982; Lari and Young, 1990; Klein and Manning, 2002; Bod, 2006).
Similarly, the compound PCFG (Kim et al., 2019), which we extend, falls into this camp of models that induce only phrase-structure grammar. However, in this paper we demonstrate a novel lexically informed neural parameterization that extends their model to induce a unified phrase-structure and dependency-structure grammar.
Unifying Phrase Structure and Dependency Grammar Head-driven phrase structure grammar (Sag and Pollard, 1987) and lexicalized tree adjoining grammar (Schabes et al., 1988) are approaches to representing dependencies directly in phrase structure.
The notion that abstract syntactic structure should provide scaffolding for dependencies, and that lexical dependencies should provide a semantic guide for syntax, was most famously explored in Collins (2003) through the introduction of an L-PCFG. In addition, Carroll and Rooth (1998) explored the problem of head induction in L-PCFGs; Charniak and Johnson (2005) improve L-PCFGs with coarse-to-fine parsing and reranking. Recently, Green and Žabokrtský (2012), Ren et al. (2013), and Yoshikawa et al. (2017) explored various methods to jointly infer phrase structure and dependencies. Klein and Manning (2004) show that a combined DMV and CCM (Klein and Manning, 2002) model, where each tree is scored with the product of the probabilities from the individual models, outperforms either individual model. These results demonstrate that the two varieties of unsupervised parsing models can benefit from ensembling. In contrast, our model considers both phrase and dependency structure jointly. Seginer (2007) introduces a parser using a representation like dependency structure, which helps constituency parsing.
Bikel's (2004) analysis of prominent models at the time found that lexical dependencies provided only very minor benefits and that choosing appropriate smoothing parameters was key to performance and robustness. Hockenmaier and Steedman (2002) also explore this for combinatory categorial grammar (CCG), showing that lexical sparsity and smoothing have dramatic effects regardless of the formalism. The sparsity and expense of lexicalized PCFGs have precluded their use in most contexts, though Prescher (2005) proposes a latent-head model to alleviate the sparse data problem.

Conclusion
In this paper, we propose neural L-PCFG, a neural parameterization method for lexicalized PCFGs, for both unsupervised dependency parsing and constituency parsing. We also provide a variational inference method to train our model. By modeling both representations together, our approach outperforms methods specially designed for either grammar formalism alone.
Importantly, our work also adds novel insights for the unsupervised grammar induction literature by probing the role that factorizations and initialization have on model performance. Different factorizations of the same probability distribution can lead to dramatically different performance and should be viewed as playing an important role in the inductive bias of learning syntax. Additionally, where others have used pretrained word vectors before, we show that they too contain abstract syntactic information which can bias learning.
Finally, although out of scope for one paper, our results point to several interesting potential roads forward, including the study of the effectiveness of jointly modeling constituency-dependency representations on freer word order languages, and whether other distributed word representations (e.g., large-scale transformers) might provide even stronger syntactic signals for grammar induction.
Despite the demonstrated success of lexical dependencies, it should be noted that these are only unilexical dependencies, in contrast to bilexical dependencies, which also consider the dependencies between head and dependent words. Modeling these dependencies would require marginalizing over all possible dependents for each span-head pair. In this case, the time complexity of exhaustive dynamic programming over one sentence would become O(L^5 |N|(|N| + |P|)^2), where L stands for the length of the sentence. Assuming enough parallel workers, this time complexity can be reduced to O(L), but it still requires O(L^4 |N|(|N| + |P|)^2) auxiliary space. In contrast, our model runs in O(L^4 |N|(|N| + |P|)^2) time. Assuming enough parallel workers, this time complexity can also be reduced to O(L), but it still requires O(L^3 |N|(|N| + |P|)^2) auxiliary space. These auxiliary data can be stored in a 32GB graphics card in our experiments (e.g., with N = 20), whereas those of the bilexical model cannot. There are several potential methods to side-step this problem, including the use of sampling in lieu of dynamic programming, using heuristic methods to prune the grammar, and designing acceleration methods on GPU (Hall et al., 2014).
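A rough back-of-the-envelope calculation illustrates why the extra factor of L matters for memory. The sketch below plugs numbers into the two asymptotic space bounds; it ignores constant factors, batching, and gradient storage, and it assumes a sentence length of 40, 4-byte floats, and the grammar sizes from the experimental setup, so the outputs are only indicative.

```python
def chart_entries(length, num_nonterminals, num_preterminals, length_power):
    """Entries implied by O(L^p |N| (|N| + |P|)^2), constants ignored."""
    grammar = num_nonterminals * (num_nonterminals + num_preterminals) ** 2
    return length ** length_power * grammar


def gibibytes(entries, bytes_per_entry=4):
    return entries * bytes_per_entry / 2 ** 30


unilexical = chart_entries(40, 10, 20, length_power=3)   # this model
bilexical = chart_entries(40, 10, 20, length_power=4)    # bilexical variant
unilexical_gib = gibibytes(unilexical)                    # roughly 2 GiB
bilexical_gib = gibibytes(bilexical)                      # roughly 86 GiB
```

Under these assumptions a single sentence already pushes the bilexical chart past the capacity of a 32GB card, while the unilexical chart leaves room for batching.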