Abstract
In this paper we demonstrate that context-free grammar (CFG) based methods for grammar induction benefit from modeling lexical dependencies. This contrasts with the most popular current methods for grammar induction, which focus on discovering either constituents or dependencies. Previous approaches to marrying these two disparate syntactic formalisms (e.g., lexicalized PCFGs) have been plagued by sparsity, making them unsuitable for unsupervised grammar induction. However, in this work, we present novel neural models of lexicalized PCFGs that allow us to overcome sparsity problems and effectively induce both constituents and dependencies within a single model. Experiments demonstrate that this unified framework yields stronger results on both representations than are achieved when modeling either formalism alone.1
1 Introduction
Unsupervised grammar induction aims at building a formal device for discovering syntactic structure from natural language corpora. Within the scope of grammar induction, there are two main directions of research: unsupervised constituency parsing, which attempts to discover the underlying structure of phrases, and unsupervised dependency parsing, which attempts to discover the underlying relations between words. Early work on induction of syntactic structure focused on learning phrase structure and generally used some variant of probabilistic context-free grammars (PCFGs; Lari and Young, 1990; Charniak, 1996; Clark, 2001). In recent years, dependency grammars have gained favor as an alternative syntactic formulation (Yuret, 1998; Carroll and Charniak, 1992; Paskin, 2002). Specifically, the dependency model with valence (DMV) (Klein and Manning, 2004) forms the basis for many modern approaches in dependency induction. Most recent models for grammar induction, be they for PCFGs, DMVs, or other formulations, have generally coupled these models with some variety of neural model to use embeddings to capture word similarities, improve the flexibility of model parameterization, or both (He et al., 2018; Jin et al., 2019; Kim et al., 2019; Han et al., 2019).
Notably, the two different syntactic formalisms capture very different views of syntax. Phrase structure takes advantage of an abstracted recursive view of language, while the dependency structure more concretely focuses on the propensity of particular words in a sentence to relate to each other syntactically. However, few attempts have been made in unsupervised grammar induction to marry the two and let each benefit the other. This is precisely the issue we attempt to tackle in this paper.
As a specific formalism that allows us to model both at once, we turn to lexicalized probabilistic context-free grammars (L-PCFGs; Collins, 2003). L-PCFGs borrow the underlying machinery from PCFGs but expand the grammar by allowing rules to include information about the lexical heads of each phrase, an example of which is shown in Figure 1. The head annotation in the L-PCFG provides lexical dependencies that can be informative in estimating the probabilities of generation rules. For example, the probability of VP[chasing] → VBZ[is] VP[chasing] is much higher than that of VP → VBZ VP, because “chasing” is a present participle. Historically, these grammars have mostly been used for supervised parsing, combined with traditional count-based estimators of rule probabilities (Collins, 2003). Within this context, lexicalized grammar rules are powerful, but the counts available are sparse and thus require extensive smoothing to achieve competitive results (Bikel, 2004; Hockenmaier and Steedman, 2002).
In this paper, we contend that with recent advances in neural modeling, it is time to return to modeling lexical dependencies, specifically in the context of unsupervised constituent-based grammar induction. We propose neural L-PCFGs as a parameter-sharing method to alleviate the sparsity problem of lexicalized PCFGs. Figure 2 illustrates the generation procedure of a neural L-PCFG. Different from traditional lexicalized PCFGs, the probabilities of production rules are not independently parameterized, but rather conditioned on the representations of non-terminals, preterminals, and lexical items (§3). Apart from devising lexicalized production rules (§2.3) and their corresponding scoring function, we also follow Kim et al.’s (2019) compound PCFG model for (non-lexicalized) constituency parsing with compound variables (§3.2), enabling modeling of a continuous mixture of grammar rules.2 We define how to efficiently train (§4.1) and perform inference (§4.2) in this model using dynamic programming and variational inference.
Put together, we expect this to result in a model that is both effective and able to simultaneously induce phrase structure and lexical dependencies,3 whereas previous work has focused on only one. Our empirical evaluation examines this hypothesis, asking the following questions:
In neural grammar induction models, is it possible to jointly and effectively learn both phrase structure and lexical dependencies? Is using both in concert better at the respective tasks than specialized methods that model only one at a time?
Our experiments (§5.3) answer in the affirmative, with better performance than baselines designed specially for either dependency or constituency parsing under multiple settings. Importantly, our detailed ablations show that methods of factorization play an important role in the performance of neural L-PCFGs (§5.3.2). Finally, qualitatively (§5.4), we find that latent labels induced by our model align with annotated gold non-terminals in PTB.
2 Motivation and Definitions
In this section, we will first provide the background of constituency grammars and dependency grammars, and then formally define the general L-PCFG, illustrating how both dependencies and phrase structures can be induced from L-PCFGs.
2.1 Phrase Structures and CFGs
The phrase structure of a sentence is formed by recursively splitting constituents. In the parse above, the sentence is split into a noun phrase (NP) and a verb phrase (VP), which can themselves be further split into smaller constituents; for example, the NP comprises a determiner (DT) “the” and a noun (NN) “dog.”
2.2 Dependency Structures and Grammars
In a dependency tree of a sentence, the syntactic nodes are the words in the sentence. Here the root is the root word of the sentence, and the children of each word are its dependents. Above, the root word is chasing, which has three dependents: its subject (nsubj) dog, auxiliary verb (aux) is, and object (dobj) cat. A dependency grammar5 specifies the possible head-dependent pairs in Σ × Σ, where Σ denotes the vocabulary.
2.3 Lexicalized CFGs
Although both the constituency and dependency grammars capture some aspects of syntax, we aim to leverage their relative strengths in a single unified formalism. In a unified grammar, these two types of structure can benefit each other. For example, in The dog is chasing the cat of my neighbor’s, while the phrase of my neighbor’s might be incorrectly marked as the adverbial phrase of chasing in a dependency model, the constituency parser can provide the constraint that the cat of my neighbor’s is a constituent, thereby requiring chasing to be the head of the phrase.
Lexicalized CFGs are based on a backbone similar to standard CFGs but parameterized to be sensitive to lexical dependencies such as those used in dependency grammars. Similarly to CFGs, L-CFGs are defined as a five-tuple consisting of a start symbol, a set of non-terminals, a set of preterminals, the vocabulary Σ, and a set of rules. The differences lie in the formulation of the rules, sketched below: every non-terminal and preterminal is annotated with a head word, where α, β ∈ Σ are words and mark the head of a constituent when they appear in “[⋅]”.6 Branching rules, in which one child inherits the parent’s head α and the other child introduces a new head β, encode the dependencies (α, β).7
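For concreteness, the rule forms can be sketched as follows; the symbol names (S for the start symbol, A, B, C for non-terminals, T for preterminals) are shorthand of our own rather than the paper’s notation.

```latex
\begin{align*}
S &\to A[\alpha]                     && \text{root rule: the sentence is headed by } \alpha \\
A[\alpha] &\to B[\alpha]\; C[\beta]  && \text{left-headed branching, encoding the dependency } (\alpha, \beta) \\
A[\alpha] &\to B[\beta]\; C[\alpha]  && \text{right-headed branching, encoding the dependency } (\alpha, \beta) \\
T[\alpha] &\to \alpha                && \text{emission: a preterminal rewrites to its head word}
\end{align*}
```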
In a lexicalized CFG, a sentence x can be generated by iterative binary splitting and emission, forming a parse tree t = [r1, r2, …, r2|x|], where the rules ri are sorted from top to bottom and from left to right. We will denote the set of parse trees that generate x within the grammar by T(x).
2.4 Grammar Induction with L-PCFGs
In this subsection, we will introduce L-PCFGs, the probabilistic formulation of L-CFGs. The task of grammar induction asks: given a corpus of sentences, how can we obtain the probabilistic generative grammar that maximizes its likelihood? With the induced grammar, we are also interested in how to obtain the most likely trees for an individual sentence—in other words, syntactic parsing according to this grammar.
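Written out (a sketch in notation of our own choosing, with X for the training corpus, T(x) for the set of parse trees of x under the grammar, and p_θ for the probability a parameterized grammar assigns to a tree), the two problems are:

```latex
\begin{align*}
% Grammar induction: maximize the marginal likelihood of the corpus,
% summing over all parses of each sentence.
\theta^{*} &= \arg\max_{\theta} \sum_{x \in X} \log \sum_{t \in T(x)} p_{\theta}(t) \\
% Parsing: recover the most probable tree for an individual sentence.
t^{*} &= \arg\max_{t \in T(x)} p_{\theta}(t \mid x)
\end{align*}
```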
3 Neural Lexicalized PCFGs
As noted, one advantage of L-PCFGs is that the obtained t* encodes both dependencies and phrase structures, allowing both to be induced simultaneously. We also expect this to improve performance, because different information is captured by each of these two structures. However, this expressivity comes at a price: more complex rules. In contrast to the traditional PCFG, whose number of production rules grows only with the number of grammar symbols, the L-PCFG requires a set of production rules that additionally grows with the size of the vocabulary. Because rules of L-PCFGs have traditionally been parameterized independently by scalars, namely, gθ(ri, z) = θi (Collins, 2003), these parameters were hard to estimate because of data sparsity.
We propose an alternate parameterization, the neural L-PCFG, which ameliorates these sparsity problems through parameter sharing, and the compound L-PCFG, which allows a more flexible sentence-by-sentence parameterization of the model. Below, we explain the neural L-PCFG factorization that we found performed best but include ablations of our decisions in Section 5.3.2.
3.1 Neural L-PCFG Factorization
The score of an individual rule is calculated as the combination of several component probabilities:
- Root-to-non-terminal probability: the probability that the start symbol produces a non-terminal A.8
- Word emission probability: the probability that the head word of a constituent is α, conditioned on the non-terminal of the constituent being A.
- Head non-terminal probability: the probability of the headedness direction and the head-inheriting child9 conditioned on the parent non-terminal and head word.
- Non-inheriting child probability: the probability of the non-inheriting child, conditioned on the headedness direction and on the parent and head-inheriting child non-terminals.
Because the head of a preterminal is already specified upon generation of one of its ancestor non-terminals, the score of the emission rule is 0.
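Putting these pieces together, one plausible rendering of the factorization is sketched below. The probability terms follow the verbal descriptions above, but the exact notation (S, A, B, C for symbols, α, β for head words, z for the compound variable) and the ordering of the conditioning sets are our assumptions rather than the paper’s definitions.

```latex
\begin{align*}
% Root rule: choose the root non-terminal, then its head word.
g_\theta(S \to A[\alpha] \mid z) &= p(A \mid S; z)\, p(\alpha \mid A; z) \\
% Left-headed branching: direction and head-inheriting child B,
% then the non-inheriting child C, then C's head word.
g_\theta(A[\alpha] \to B[\alpha]\, C[\beta] \mid z) &= p(\mathrm{left}, B \mid A, \alpha; z)\,
    p(C \mid \mathrm{left}, A, B; z)\, p(\beta \mid C; z) \\
% The right-headed case is symmetric; emission rules T[alpha] -> alpha have score 0.
g_\theta(A[\alpha] \to B[\beta]\, C[\alpha] \mid z) &= p(\mathrm{right}, C \mid A, \alpha; z)\,
    p(B \mid \mathrm{right}, A, C; z)\, p(\beta \mid B; z)
\end{align*}
```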
3.2 Compound Grammar
Among various existing grammar induction models, the compound PCFG model of Kim et al. (2019) both shows highly competitive results and follows a PCFG-based formalism similar to ours, and thus we build upon this method. The compound in compound PCFG refers to the fact that it uses a compound probability distribution (Robbins et al., 1951) in modeling and estimation of its parameters. A compound probability distribution enables continuous variants of grammars, allowing the probabilities of the grammar to change based on the unique characteristics of the sentence. In general, compound variables can be devised in any way that may inform the specification of the rule probabilities (e.g., a structured variable to provide frame semantics or the social context in which the sentence is situated). In this way, compound grammar increases the capacity of the original PCFG.
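As an illustration only, the following sketch shows the general shape of such a compound parameterization: a per-sentence latent vector z is drawn from a prior (or, during training, from an amortized posterior) and combined with symbol embeddings before computing rule probabilities. All class names, dimensions, and architectural details here are hypothetical and not taken from the actual implementation.

```python
import torch
import torch.nn as nn

class CompoundRuleScorer(nn.Module):
    """Toy compound parameterization: rule probabilities depend on a
    per-sentence latent vector z (hypothetical sketch, not the paper's code)."""

    def __init__(self, n_nonterminals: int, sym_dim: int = 64, z_dim: int = 32):
        super().__init__()
        self.nt_emb = nn.Parameter(torch.randn(n_nonterminals, sym_dim))
        self.z_dim = z_dim
        # Scores every (A -> B C) triple from the parent embedding and z.
        self.scorer = nn.Sequential(
            nn.Linear(sym_dim + z_dim, sym_dim),
            nn.ReLU(),
            nn.Linear(sym_dim, n_nonterminals * n_nonterminals),
        )

    def sample_z(self, batch_size: int) -> torch.Tensor:
        # In training, z would come from an amortized posterior q(z|x);
        # here we simply draw from the standard normal prior.
        return torch.randn(batch_size, self.z_dim)

    def rule_log_probs(self, z: torch.Tensor) -> torch.Tensor:
        """Return log p(A -> B C | z) with shape [batch, A, B, C]."""
        batch = z.size(0)
        n = self.nt_emb.size(0)
        parent = self.nt_emb.unsqueeze(0).expand(batch, -1, -1)   # [batch, A, sym]
        zz = z.unsqueeze(1).expand(-1, n, -1)                     # [batch, A, z]
        logits = self.scorer(torch.cat([parent, zz], dim=-1))     # [batch, A, B*C]
        return logits.log_softmax(dim=-1).view(batch, n, n, n)

scorer = CompoundRuleScorer(n_nonterminals=10)
z = scorer.sample_z(batch_size=2)
print(scorer.rule_log_probs(z).shape)  # torch.Size([2, 10, 10, 10])
```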
4 Training and Inference
4.1 Training
Initialization
We initialize word embeddings using GloVe embeddings (Pennington et al., 2014). We further cluster the word embeddings with k-means (MacQueen, 1967), as shown in Figure 3, and use the centroids of the clusters to initialize the embeddings of preterminals. The k-means algorithm is initialized using the k-means++ method and trained until convergence. The intuition is that this gives the model a rough idea of syntactic categories before grammar induction begins. We also consider a variant without pretrained word embeddings, in which we initialize both word embeddings and preterminal embeddings by drawing from 𝒩(0, I). Other parameters are initialized with Xavier normal initialization (Glorot and Bengio, 2010).
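A minimal sketch of this initialization procedure, assuming a standard GloVe text file and scikit-learn’s k-means; the function name, file handling, and fallback for out-of-vocabulary words are our own choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_embeddings(glove_path: str, vocab: list, n_preterminals: int, dim: int = 300):
    """Sketch of the initialization described above (paths and defaults are
    hypothetical): load GloVe vectors for the vocabulary, cluster them with
    k-means++, and use the cluster centroids as preterminal embeddings."""
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

    # Words missing from GloVe fall back to small random vectors.
    word_emb = np.stack([
        vectors.get(w, np.random.normal(scale=0.1, size=dim).astype(np.float32))
        for w in vocab
    ])

    # k-means++ initialization, run to convergence; centroids seed the preterminals.
    km = KMeans(n_clusters=n_preterminals, init="k-means++", n_init=10, tol=1e-6)
    km.fit(word_emb)
    preterminal_emb = km.cluster_centers_.astype(np.float32)
    return word_emb, preterminal_emb
```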
Curriculum Learning
4.2 Inference
The most probable tree can be obtained via the CYK algorithm.
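For reference, a minimal Viterbi-style CYK sketch for an unlexicalized PCFG in Chomsky normal form is given below; the lexicalized case additionally tracks a head-word index per span. The rule-table format and function names are our own, not the paper’s implementation.

```python
import math
from collections import defaultdict

def cyk_viterbi(words, emit_logp, rule_logp, start="S"):
    """Most probable parse of `words` under a CNF PCFG (illustrative sketch).

    emit_logp[(T, w)]    : log p(T -> w) for preterminal T and word w
    rule_logp[(A, B, C)] : log p(A -> B C) for symbols A, B, C
    Returns (log probability, bracketed tree string) or None if no parse.
    """
    n = len(words)
    best = defaultdict(lambda: -math.inf)   # (i, j, A) -> best log prob of A over words[i:j]
    back = {}                               # backpointers for tree recovery

    for i, w in enumerate(words):           # width-1 spans: emissions
        for (T, word), lp in emit_logp.items():
            if word == w and lp > best[(i, i + 1, T)]:
                best[(i, i + 1, T)] = lp
                back[(i, i + 1, T)] = w

    for width in range(2, n + 1):           # wider spans: binary rules
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (A, B, C), lp in rule_logp.items():
                    score = lp + best[(i, k, B)] + best[(k, j, C)]
                    if score > best[(i, j, A)]:
                        best[(i, j, A)] = score
                        back[(i, j, A)] = (k, B, C)

    def build(i, j, A):
        entry = back[(i, j, A)]
        if isinstance(entry, str):
            return f"({A} {entry})"
        k, B, C = entry
        return f"({A} {build(i, k, B)} {build(k, j, C)})"

    if best[(0, n, start)] == -math.inf:
        return None
    return best[(0, n, start)], build(0, n, start)

# Example usage with a toy grammar (log probabilities are illustrative):
emit = {("DT", "the"): 0.0, ("NN", "dog"): 0.0, ("VBZ", "barks"): 0.0}
rules = {("NP", "DT", "NN"): 0.0, ("S", "NP", "VBZ"): 0.0}
print(cyk_viterbi(["the", "dog", "barks"], emit, rules))
```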
5 Experiments
5.1 Data Setup
All models are evaluated using the Penn Treebank (Marcus et al., 1993) as the test corpus, following the splits and preprocessing methods, including removing punctuation, provided by Kim et al. (2019). To convert the original phrase bracket and label annotations to dependency annotations, we use Stanford typed dependency representations (De Marneffe and Manning, 2008).
We use three standard metrics to measure the performance of models on the validation and test sets: directed and undirected attachment score (DAS and UAS) for dependency parsing, and unlabeled constituent F1 score for constituency parsing.
We tune the hyperparameters of the model to minimize perplexity on the validation set. We choose perplexity because it requires only plain text and not annotated parse trees. Specifically, we tuned the architecture of f^(i), i = 1, 2, 3, in the space of multilayer perceptrons, with the dimension of each layer being n + d, with residual connections and different non-linear activation functions. Table 1 shows the final hyperparameters of our model. Due to memory constraints on a single graphics card, we set the number of non-terminals and preterminals to 10 and 20, respectively. Later we will show that the compound PCFG’s performance benefits from a larger grammar; it is therefore possible that the same is true for our neural L-PCFG. Section 7 includes a more detailed discussion of space complexity.
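For concreteness, the residual block referenced here (and spelled out in the notes as f(x) = σ(W2(σ(W1x + b1)) + b2) + x) can be sketched as follows; the choice of ReLU and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class ResidualMLP(nn.Module):
    """Residual block of the form f(x) = sigma(W2 sigma(W1 x + b1) + b2) + x,
    matching the formula given in the notes (dimensions are placeholders)."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()  # the paper tunes the choice of non-linearity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.fc2(self.act(self.fc1(x)))) + x
```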
5.2 Baselines
We compare our neural L-PCFGs with the following baselines:
Compound PCFG
The compound PCFG (Kim et al., 2019) is an unsupervised constituency parsing model, a PCFG with neural scoring. The main difference between this model and the neural L-PCFG is the modeling of headedness and of the dependency between the head word and the generated non-terminals or preterminals. We apply the same hyperparameters and techniques, including the number of non-terminals and preterminals, initialization, curriculum learning, and variational training, to compound PCFGs for a fair comparison. Because compound PCFGs have no notion of dependencies, we extract dependencies from the compound PCFG with three kinds of heuristic head rules: left-headed, right-headed, and large-headed. Left-/right-headed means always choosing the root of the left/right child constituent as the root of the parent constituent, whereas large-headed chooses the root of the larger child constituent as the root of the parent constituent (sketched below). Among these, we choose the method that obtains the best parsing accuracy on the dev set (making these results an oracle with access to more information than our proposed method).
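A sketch of how such heuristic head rules read a dependency structure off a binary constituency tree (our own illustration, not the exact evaluation code used in the paper):

```python
def count_words(tree):
    """A tree is either a word (str) or a pair (left_subtree, right_subtree)."""
    return 1 if isinstance(tree, str) else sum(count_words(t) for t in tree)

def extract_dependencies(tree, rule="large"):
    """Heuristically extract (head, dependent) arcs from a binary constituency
    tree; returns (head_word_of_tree, list_of_arcs). Illustrative sketch only."""
    if isinstance(tree, str):
        return tree, []

    left, right = tree
    left_head, left_arcs = extract_dependencies(left, rule)
    right_head, right_arcs = extract_dependencies(right, rule)

    if rule == "left":
        head, dep = left_head, right_head
    elif rule == "right":
        head, dep = right_head, left_head
    else:  # "large": the root of the larger child constituent becomes the head
        if count_words(left) >= count_words(right):
            head, dep = left_head, right_head
        else:
            head, dep = right_head, left_head

    return head, left_arcs + right_arcs + [(head, dep)]

# Example on a toy bracketing of "the dog is chasing the cat":
tree = (("the", "dog"), ("is", ("chasing", ("the", "cat"))))
print(extract_dependencies(tree, rule="large"))
```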
Dependency Model with Valence (DMV)
The DMV (Klein and Manning, 2004) is a model for unsupervised dependency parsing, where valence stands for the number of arguments controlled by a head word. The choices to attach words as children are conditioned on the head words and valences (sketched below). As shown in Smith (2006), the DMV model can be expressed as a head-driven context-free grammar with a set of generation rules and scores, where the non-terminals represent the valence of head words. For example, “L[chasing] → L0[is] R[chasing]” denotes that a left-hand constituent with full left valence produces a word and a constituent with full right valence. Therefore, it can be seen as a special case of a lexicalized PCFG where the generation rules provide inductive biases for dependency parsing but are also restricted—for example, a void-valence constituent cannot produce a full-valence constituent with the same head.
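For readers unfamiliar with the DMV, its generative probability of a dependency tree factorizes roughly as follows; this is a standard textbook rendering rather than a formula quoted from this paper.

```latex
% DMV sketch: each head h, in each direction, repeatedly decides whether to
% continue generating dependents (conditioned on its valence val) and which
% dependent d to attach, and finally decides to stop.
\[
P(t) = \prod_{h,\,dir} \Bigl[ \prod_{d \in \mathrm{deps}(h,\,dir)}
       P_{\mathrm{STOP}}(\neg\mathrm{stop} \mid h, dir, val)\,
       P_{\mathrm{CHOOSE}}(d \mid h, dir) \Bigr]
       P_{\mathrm{STOP}}(\mathrm{stop} \mid h, dir, val)
\]
```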
Note that the DMV uses far fewer parameters than the PCFG-based models; the compound PCFG uses a similar number of parameters to our neural L-PCFG.
We compare models under two settings: (1) with gold tag information and (2) without it, denoted by ✓ and ✗, respectively in Table 2. To use gold tag information in training the neural L-PCFG, we assign the 19 most frequent tags as categories and combine the rest into a 20th “other” category. These categories are used as supervision for the preterminals. In this setting, instead of optimizing the log probability of the sentence, we optimize the log joint probability of the sentence and the tags.
| Model | Gold Tags | Word Embedding | DAS (Dev) | DAS (Test) | UAS (Dev) | UAS (Test) | F1 (Dev) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|
| Compound PCFG** | ✗ | 𝒩(0, N) | 21.2 | 23.5 | 38.9 | 40.8 | - | 55.2 |
| Compound PCFG | ✗ | 𝒩(0, I) | 15.6 (3.9) | 17.8 (4.2) | 27.7 (4.1) | 30.2 (5.3) | 45.63 (1.71) | 47.79 (2.32) |
| Compound PCFG | ✗ | GloVe | 16.4 (2.4) | 18.6 (3.7) | 28.7 (3.5) | 31.6 (4.5) | 45.52 (2.14) | 48.20 (2.53) |
| DMV | ✗ | - | 24.7 (1.5) | 27.2 (1.9) | 43.2 (1.9) | 44.3 (2.2) | - | - |
| DMV | ✓ | - | 28.5 (1.9) | 29.9 (2.5) | 45.5 (2.8) | 47.3 (2.7) | - | - |
| Neural L-PCFGs | ✗ | 𝒩(0, I) | 37.5 (2.7) | 39.7 (3.1) | 50.6 (3.1) | 53.3 (4.2) | 52.90 (3.72) | 55.31 (4.03) |
| Neural L-PCFGs | ✗ | GloVe | 38.2 (2.1) | 40.5 (2.9) | 54.4 (3.6) | 55.9 (3.8) | 45.67 (0.95) | 47.23 (2.06) |
| Neural L-PCFGs | ✓ | 𝒩(0, I) | 35.4 (0.5) | 39.2 (1.1) | 50.0 (1.3) | 53.8 (1.7) | 51.16 (5.11) | 54.49 (6.32) |
5.3 Quantitative Results
In this section, we present and discuss the quantitative results, shown in Table 2.
5.3.1 Main Results
First comparing neural L-PCFGs with compound PCFGs, we can see that L-PCFGs perform slightly better on phrase structure prediction and achieve much better dependency accuracy. This shows that (1) lexical dependencies contribute somewhat to the learning of phrase structure, and (2) the head rules learned by neural L-PCFGs are significantly more accurate than the heuristics that we applied to standard compound PCFGs. We also find that GloVe embeddings can help (unsupervised) dependency parsing, but do not benefit constituency parsing.
Next, we can compare the dependency induction accuracy of the neural L-PCFGs with the DMV. The results indicate that neural L-PCFGs without gold tags achieve even better accuracy than the DMV with gold tags on both directed and undirected attachment. As discussed before, the DMV can be seen as a special case of the L-PCFG where the attachment of children is conditioned on the valence of the parent tag, while in the L-PCFG the generated head directions are conditioned on the parent non-terminal and the head word, which is more general. The comparatively positive results show that this conditioning is not only more general but also yields better predictions of attachment.
Table 3 shows label-level recall (i.e., unlabeled recall of constituents annotated by each non-terminal). We observe that the neural L-PCFG outperforms all baselines on these frequent constituent categories.
| Label | PRPN | ON | Compound PCFG | Neural L-PCFG |
|---|---|---|---|---|
| SBAR | 50.0% | 51.2% | 42.36% | 53.60% |
| NP | 59.2% | 64.5% | 59.25% | 67.38% |
| VP | 46.7% | 41.0% | 39.50% | 48.58% |
| PP | 57.2% | 54.4% | 62.66% | 65.25% |
| ADJP | 44.3% | 38.1% | 49.16% | 49.83% |
| ADVP | 32.8% | 31.6% | 50.58% | 58.86% |
5.3.2 Impact of Factorization
| Model | DAS | UAS | F1 |
|---|---|---|---|
| Neural L-PCFG | 35.5 | 51.4 | 44.5 |
| w/ Xavier init | 27.2 | 47.6 | 43.6 |
| w/ Factorization I | 16.4 | 33.3 | 25.7 |
| w/ Factorization II | 22.3 | 42.7 | 39.6 |
| w/ Factorization III | 25.9 | 46.9 | 34.7 |
Factorization I assumes that the child non-terminals do not depend on the head lexical item, which hurts parsing performance significantly. Although Factorization II is as general as our proposed method, it uses separate representations for the two headedness directions. Factorization III assumes independence between the direction and the dependent non-terminals. These results indicate that our factorization strikes a good balance between modeling lexical dependencies and directionality on the one hand, and avoiding over-parameterization that may lead to sparsity and difficulties in learning on the other.
5.4 Qualitative Analysis
We analyze our best model without gold tags in detail. Figure 4 visualizes the alignment between our induced non-terminals and gold constituent labels on the overlapping constituents of the induced trees and the ground truth. For each constituent label, we show how frequently it annotates the same span as each induced non-terminal. We observe from the first map a clear alignment between certain linguistic labels and induced non-terminals (e.g., VP and NT-4, S and NT-2, PP and NT-8). For other labels, however, there is no clear alignment with the induced classes. One hypothesis for this diffusion is the diversity of the syntactic roles of these constituents. To investigate this, we zoom in on noun phrases in the second map and observe that NP-SBJ, NP-TMP, and NP-MNR are merged into a single non-terminal, NT-5, in the induced grammar, and that NP, NP-PRD, and NP-CLR correspond to NT-2, NT-6, and NT-0, respectively.
We also include an example set of parses comparing the DMV and the neural L-PCFG in Table 5. Whereas the DMV uses “to” as the head of “know”, the neural L-PCFG correctly inverts this relationship to produce a parse that is better aligned with the gold tree. One possible reason that the DMV tends to use “to” as the head is that the DMV has to carry the information that the verb is in the infinitive form, which would be lost if it used “know” as the head. In our model, however, such information is carried by the types of the non-terminals, allowing the model to use the open-class word “know” as the root. Note that this example also illustrates a failure case: the neural L-PCFG uses “if” as the head of the if-clause, probably because the root of the if-clause is generated independently of “know”.
A common mistake made by the neural L-PCFG is treating auxiliary verbs like adjectives that combine with the subject instead of modifying verb phrases. For example, the neural L-PCFG parses “...the exchange will look at the performance...” as “((the exchange) will) (look (at (the performance)))”, whereas the compound PCFG produces the correct parse “((the exchange) (will (look (at (the performance)))))”. A possible reason for this mistake is that English verb phrases are commonly left-headed, which makes attaching an auxiliary verb less probable as the left child of a verb phrase. This type of error may stem from the model’s inability to assess the semantic function of auxiliary verbs (Bisk and Hockenmaier, 2015).
6 Related Work
Dependency vs Constituency Induction
The decision of whether to focus on modeling dependencies or constituencies has largely split the grammar induction community into two camps. The most popular approach has focused on dependency formalisms (Klein and Manning, 2004; Spitkovsky et al., 2010, 2011, 2013; Mareček and Straka, 2013; Jiang et al., 2016; Tran and Bisk, 2018), whereas a second community has focused on inducing constituencies (Lane and Henderson, 2001; Ponvert et al., 2011; Golland et al., 2012; Jin et al., 2018). Induced constituencies can, in the case of CCG (Bisk and Hockenmaier, 2012, 2013), produce dependencies, but unlike our proposal, existing approaches do not jointly model both representations. CFGs have been used for decades to represent, analyze, and model the phrase structure of language (Chomsky, 1956; Pullum and Gazdar, 1982; Lari and Young, 1990; Klein and Manning, 2002; Bod, 2006).
Similarly, the compound PCFG (Kim et al., 2019), which we extend, falls into this camp of models that induce only phrase-structure grammar. However, in this paper we demonstrate a novel lexically informed neural parameterization that extends their model to induce a unified phrase-structure and dependency-structure grammar.
Unifying Phrase Structure and Dependency Grammar
Head-driven phrase structure grammar (Sag and Pollard, 1987) and lexicalized tree adjoining grammar (Schabes et al., 1988) are approaches to representing dependencies directly in phrase structure.
The notion that abstract syntactic structure should provide scaffolding for dependencies, and that lexical dependencies should provide a semantic guide for syntax, was most famously explored in Collins (2003) through the introduction of an L-PCFG. In addition, Carroll and Rooth (1998) explored the problem of head induction in L-PCFGs, and Charniak and Johnson (2005) improved L-PCFGs with coarse-to-fine parsing and reranking. Recently, Green and Žabokrtský (2012), Ren et al. (2013), and Yoshikawa et al. (2017) explored various methods to jointly infer phrase structure and dependencies.
Klein and Manning (2004) show that a combined DMV and CCM (Klein and Manning, 2002) model, where each tree is scored with the product of the probabilities from the individual models, outperforms either individual model. These results demonstrate that the two varieties of unsupervised parsing models can benefit from ensembling. In contrast, our model considers both phrase and dependency structure jointly. Seginer (2007) introduces a parser that uses a dependency-like representation to aid constituency parsing.
Bikel (2004)’s analysis of prominent models at the time found that lexical dependencies provided only very minor benefits and that choosing appropriate smoothing parameters was key to performance and robustness. Hockenmaier and Steedman (2002) also explore this for combinatory categorial grammar (CCG), showing that lexical sparsity and smoothing have dramatic effects regardless of the formalism. The sparsity and expense of lexicalized PCFGs have precluded their use in most contexts, though Prescher (2005) proposes a latent-head model to alleviate the sparse data problem.
7 Conclusion
In this paper, we propose the neural L-PCFG, a neural parameterization of lexicalized PCFGs, for joint unsupervised dependency and constituency parsing. We also provide a variational inference method to train our model. By modeling both representations together, our approach outperforms methods specially designed for either grammar formalism alone.
Importantly, our work also adds novel insights for the unsupervised grammar induction literature by probing the role that factorizations and initialization have on model performance. Different factorizations of the same probability distribution can lead to dramatically different performance and should be viewed as playing an important role in the inductive bias of learning syntax. Additionally, where others have used pretrained word vectors before, we show that they too contain abstract syntactic information which can bias learning.
Finally, although out of scope for one paper, our results point to several interesting potential roads forward, including the study of the effectiveness of jointly modeling constituency-dependency representations on freer word order languages, and whether other distributed word representations (e.g., large-scale transformers) might provide even stronger syntactic signals for grammar induction.
Despite the demonstrated success of lexical dependencies, it should be noted that these are only unilexical dependencies, in contrast to bilexical dependencies, which also consider the dependencies between the head and dependent words. Modeling bilexical dependencies would require marginalizing over all possible dependents for each span-head pair, which multiplies the time complexity of exhaustive dynamic programming over one sentence by an additional factor of the sentence length L. Assuming enough parallel workers, the time complexity can be reduced, but the auxiliary space requirement of the bilexical model remains correspondingly larger than that of our unilexical model. In our experiments (e.g., with N = 20), the auxiliary data of our model fit on a 32GB graphics card, whereas those of the bilexical model would not. There are several potential methods to side-step this problem, including the use of sampling in lieu of dynamic programming, using heuristic methods to prune the grammar, and designing acceleration methods on GPUs (Hall et al., 2014).
Acknowledgments
This work was supported by the DARPA GAILA project (award HR00111990063), and some experiments made use of computation credits graciously provided by Amazon AWS. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on. The authors would like to thank Junxian He and Yoon Kim for helpful feedback about the project.
Notes
Code is available at https://github.com/neulab/neural-lpcfg.
In other words, we do not induce a single PCFG, but a distribution over a family of PCFGs.
Note that by “lexical dependencies” we are referring to unilexical dependencies between the head word and child non-terminals, as opposed to bilexical dependencies between two words (as are modeled in many dependency parsing models).
Note that this formulation does not capture the structure of sentences of length zero or one.
This work assumes a projective tree.
Without loss of generality, we only consider binary branching rules.
Note that the root-seeking rule encodes the dependency (ROOT, α).
That is, A is the non-terminal of the whole sentence.
Child non-terminals that inherit the parent’s head word.
f(x) = σ(W2(σ(W1x + b1)) + b2) + x.
Note that it is also possible to use other methods for approximation. For example, we can use qϕ(z∣x) in place of posterior distribution. However, using it still results in high prediction variance of the max function. We did not observe a significant improvement with other methods.