Abstract
This article describes a simple PCFG induction model with a fixed category domain that predicts a large majority of attested constituent boundaries, and predicts labels consistent with nearly half of attested constituent labels on a standard evaluation data set of child-directed speech. The article then explores the idea that the difference between simple grammars exhibited by child learners and fully recursive grammars exhibited by adult learners may be an effect of increasing working memory capacity, where the shallow grammars are constrained images of the recursive grammars. An implementation of these memory bounds as limits on center embedding in a depth-specific transform of a recursive grammar yields a significant improvement over an equivalent but unbounded baseline, suggesting that this arrangement may indeed confer a learning advantage.
1. Introduction
Chomsky (1965) postulates that as human children are naturally exposed to a language, the quantity and nature of the linguistic examples to which they are exposed is insufficient to fully explain the children’s successful acquisition of the grammar of the language; Chomsky (1980) dubs this claim the poverty of the stimulus. Chomsky (1965) asserts that the space of possible human languages must therefore be constrained by a set of linguistic universals with which children’s brains are innately primed, and that this biological fact is a necessary precondition for human language learning. Chomsky (1986) uses the term Universal Grammar to describe this proposed innate mental model that underlies human language acquisition.
The argument from the poverty of the stimulus and the associated claim of an innate Universal Grammar gained wide acceptance within the Chomskyan generative tradition. The specific details of exactly which aspects of language cannot be learned without Universal Grammar have not always been well defined; similarly, the question of exactly which proposed linguistic universals constitute Universal Grammar has been widely debated. In striving to identify empirical mechanisms by which poverty of the stimulus claims might be rigorously tested, Pullum and Scholz (2002) conclude that although such claims could potentially be true, the linguistic examples most widely cited in their support fail to hold up to close scrutiny.
Pullum and Scholz argue that mathematical learning theory and corpus linguistics have a key role to play in empirically testing poverty of the stimulus claims. Preliminary work along these lines using manually constructed grammars of child-directed speech was performed by Perfors, Tenenbaum, and Regier (2006), who demonstrate empirically that a basic learner, when presented with a corpus of child-directed speech, can learn to prefer a hierarchical grammar (a probabilistic context-free grammar) over linear and regular grammars using a simple Bayesian probabilistic measure of the complexity of a grammar.
However, full induction of probabilistic context-free grammars (PCFGs) has long been considered a difficult problem (Solomonoff 1964; Fu and Booth 1975; Carroll and Charniak 1992; Johnson, Griffiths, and Goldwater 2007; Liang et al. 2007; Tu 2012). Lack of success for direct estimation has been attributed either to a lack of correlation between linguistic accuracy and the optimization objective (Johnson, Griffiths, and Goldwater 2007), or to a likelihood function or posterior filled with weak local optima (Smith 2006; Liang et al. 2007). The first contribution of this article is to describe a simple PCFG induction model with a fixed category domain that predicts a large majority of attested constituent boundaries, and predicts labels consistent with nearly half of attested constituent labels on the Eve corpus, a standard evaluation data set of child-directed speech.
But evidence suggests that children learn very constrained grammars (Lieven, Pine, and Baldwin 1997; Tomasello 2003, among others). These non-nativist models (Bannard, Lieven, and Tomasello 2009) usually assume that the grammar children first acquire is linear and templatic, consisting of multiword frames with slots to be filled in, or simply of n-grams. The grammar may also include various kinds of rule-like probabilities for the frames, or transition probabilities for the words or n-grams. Much work (Redington, Chater, and Finch 1998; Mintz 2003; Freudenthal et al. 2007; Thompson and Newport 2007) shows that syntactic categories and surface word order may be captured with these simple statistics without hypothesizing hierarchical structures. However, the transition between these linear or very shallow grammars and fully recursive grammars has never been explicitly modeled; therefore, there is no empirical evidence from computational modeling about how easy this transition may be. The second contribution of this article is to explore the idea that this difference between shallow and fully recursive grammars is determined by working memory, so that the shallow and recursive grammars are unified as different performance grammars sharing the same underlying competence grammar.
There has long been a distinction within the linguistic discipline of theoretical syntax between a hypothesized model of language that is posited to exist within the brain of each speaker of that language and the phenomenon of language as it is actually spoken and encountered in the real world. The concept of a mental model of language has been described in terms of langue (de Saussure 1916), linguistic competence (Chomsky 1965), or simply as the grammar of the language, while the details of how language is actually spoken and used have been described as parole (de Saussure 1916), linguistic performance (Chomsky 1965), or sometimes as usage.
Chomsky (1965) argues that models of linguistic performance should be informed by models of linguistic competence, but that models of competence should not take performance into account: “Linguistic theory is concerned primarily with an ideal speaker-listener, in a completely homogeneous speech-community, who knows its language perfectly and is unaffected by such grammatically irrelevant conditions as memory limitations, distractions, shifts of attention and interest, and errors” (page 3). Within the Chomskyan generative tradition, this idea that syntactic theory should model an idealized grammar of linguistic competence (rather than one that incorporates performance) has remained dominant in the decades since (see Newmeyer 2010, for example). Others outside this tradition have criticized the Chomskyan position in part for its failure to connect idealized theories of competence to actual language usage (for example, see Pylyshyn 1973; Miller 1975; Kates 1976).
The framework for unsupervised grammar induction presented in this article is significant in that it represents a concrete discovery procedure that can produce both a competence grammar G (a PCFG in Chomsky normal form) and a corresponding formally defined performance grammar GD (another PCFG defined to be sensitive to center-embedding depth). Although PCFGs in principle allow for unlimited recursion in the form of center embedding (Chomsky and Miller 1963), evidence from corpus studies of spoken and written language use strongly indicates that such recursion essentially never extends beyond the limits of human cognitive memory constraints (Schuler et al. 2010; Noji, Miyao, and Johnson 2016). Given a cognitively motivated recursive depth bound D, performance grammar GD can be viewed as a specific instantiation of competence grammar G that is guaranteed to never violate the depth bound. In an analysis of model behavior and depth-bounding (Section 8), we observe that by utilizing a depth bound, the grammar induction procedure is more consistent in discovering a highly accurate grammar than it is when inducing an unbounded grammar over the same corpus. This fact argues against Chomsky's assertion that memory limitations are an irrelevant consideration in the search for a grammar of a language.
This article is an extended presentation of Jin et al. (2018a) with additional evaluation and analyses of PCFG induction prior to depth bounding. These additional evaluations and analyses include quantitative analyses of effects of manipulation of hyperparameters, and quantitative and qualitative linguistic analyses of categories and rules in generated grammars for several languages. Code used in this work can be found at https://github.com/lifengjin/pcfg_induction.
The remainder of this article is organized as follows: Section 2 describes related work in unsupervised grammar induction. Section 3 describes an unbounded PCFG induction model based on Gibbs sampling. Section 4 describes a depth-bounded version of this model. Section 5 describes a method for evaluating labeled parsing accuracy for unsupervised grammar induction. Section 6 describes experiments to evaluate the unbounded PCFG induction model on synthetic data with a known solution. Section 7 describes experiments to evaluate the unbounded PCFG induction model on child-directed speech. Section 8 describes experiments to evaluate the depth-bounded PCFG induction model on child-directed speech. Section 9 describes experiments to explore the phenomena of natural bounding in induction on child-directed and adult language data. Section 10 describes replication of these results on newswire data. Finally, Section 11 provides some concluding remarks.
2. Related Work
Unsupervised grammar inducers hypothesize hierarchical structures for strings of words. Using context-free grammars (CFGs) to define these structures with labels, previous attempts at either CFG parameter estimation (Carroll and Charniak 1992; Pereira and Schabes 1992; Johnson, Griffiths, and Goldwater 2007) or directly inducing a CFG as well as its probabilities (Liang et al. 2007; Tu 2012) have not achieved as much success as experiments with other kinds of formalisms that produce unlabeled constituents (Klein and Manning 2004; Seginer 2007a; Ponvert, Baldridge, and Erk 2011). The assumption has been made that the space of grammars is so big that constraints must be applied to the learning process to reduce the burden of the learner (Gold 1967; Cramer 2007; Liang et al. 2007).
Much of this grammar induction work used strong linguistically motivated constraints or direct linguistic annotation to help the inducer eliminate some local optima. Pereira and Schabes (1992) use bracketed corpora to provide extra structural information to the inducer. Use of part-of-speech (POS) sequences in place of word strings is popular in the dependency grammar induction literature (Klein and Manning 2002; Berg-Kirkpatrick et al. 2010; Jiang, Han, and Tu 2016; Noji, Miyao, and Johnson 2016). Combinatory Categorial Grammar (CCG) induction also relies on a limited number of POS tags to assign basic categories to words (Bisk and Hockenmaier 2012; Bisk, Christodoulopoulos, and Hockenmaier 2015), among other constraints such as CCG combinators, to induce labeled dependencies. Other linguistic constraints and heuristics such as constraints on root nodes (Noji, Miyao, and Johnson 2016), attachment rules (Naseem et al. 2010), acoustic cues (Pate and Goldwater 2013), and punctuation as phrasal boundaries (Seginer 2007a; Ponvert, Baldridge, and Erk 2011) have also been used in induction. More recently, neural PCFG induction systems (Jin et al. 2019; Kim et al. 2019; Kim, Dyer, and Rush 2019) and unsupervised parsing models (Shen et al. 2018, 2019; Drozdov et al. 2019) have been shown to predict accurate syntactic structures. These more complex neural network models may not contain explicit biases, but may contain implicit confounding factors implemented during development on English or other natural languages, which may function like linguistic universals in constraining the search over possible grammars. Experiments described in this article use only Bayesian PCFG induction in order to eliminate these possible confounds and evaluate the hypothesis that grammar may be acquired using only event categorization and decomposition into categorized sub-events using mathematically transparent parameters (see note 1).
Depth-like constraints have been applied in work by Seginer (2007a) and Ponvert, Baldridge, and Erk (2011) to help constrain the search over possible structures. Both of these systems are successful in inducing phrase structure trees from only words, but only generate unlabeled constituents. Center-embedding constraints on recursion depth have also been applied to parsing (Schuler et al. 2010; Ponvert, Baldridge, and Erk 2011; Shain et al. 2016; Noji, Miyao, and Johnson 2016; Jin et al. 2018b), motivated by human cognitive constraints on memory capacity (Chomsky and Miller 1963). Center-embedding recursion depth can be defined in a left-corner parsing paradigm (Rosenkrantz and Lewis 1970; Johnson-Laird 1983; Abney and Johnson 1991) as the number of left children of right children that occur on the path from a word to the root of a parse tree. Left-corner parsers require only minimal stack memory to process left-branching and right-branching structures, but require an extra stack element to process each center embedding in a structure. For example, a left-corner parser must add a stack element for each of the first three words in the sentence, For parts the plant built to fail was awful, shown in Figure 1. These kinds of depth bounds in sentence processing have been used to explain the relative difficulty of center-embedded sentences compared with more right-branching paraphrases like It was awful for the plant’s parts to fail. However, depth-bounded grammar induction has never been compared against unbounded induction in the same system, in part because most previous depth-bounding models are built around sequence models, the complexity of which grows exponentially with the maximum allowed depth.
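A minimal sketch of this center-embedding depth measure may help make it concrete: the depth of a tree is the maximum, over words, of the number of left children of right children on the word's path to the root. The tree encoding ((label, left, right) tuples with bare strings as words) and the decision to let a word itself count as an embedded left child are illustrative assumptions about details the definition leaves open, not the article's implementation.

```python
def embedding_depth(tree, count=0, is_right_child=False):
    """Max over words of the number of left children of right children
    on the word's path to the root of the tree."""
    if isinstance(tree, str):          # reached a word: report its path count
        return count
    _, left, right = tree
    # Descending into the left child of a node that is itself a right child
    # crosses one more center embedding.
    return max(
        embedding_depth(left, count + (1 if is_right_child else 0), False),
        embedding_depth(right, count, True),
    )

# A purely right-branching structure needs only minimal memory (depth 1):
print(embedding_depth(("S", "for", ("S", "parts", ("S", "to", "fail")))))  # 1
```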
Stack elements after the word the in a left-corner parse of the sentence For parts the plant built to fail was awful.
In order to compare the effects of depth-bounding more directly, this work extends a chart-based Bayesian PCFG induction model (Johnson, Griffiths, and Goldwater 2007) to include depth bounding, which allows both bounded and unbounded PCFGs to be induced from unannotated text. Experiments reported in this article confirm that depth-bounding does empirically have the effect of significantly limiting the search space of the inducer. This work also shows that it is possible to induce an accurate unbounded PCFG from raw text with no strong linguistic constraints.
3. Unbounded Statistical Grammar Induction Model
Example matrix representation (b) of a probabilistic context-free grammar (a).
Indexing using a Kronecker delta (a), and a Kronecker product of Kronecker deltas (b).
Finally, because each tree contains the words of a sentence, the probability of sentences given trees P(sentences ∣ trees) in Equation (1) is simply one if the words in all the trees match the sentences in the corpus, and zero otherwise.
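The following is a toy sketch of the matrix view of a Chomsky normal form PCFG suggested by Figure 2 and the Kronecker-product indexing of note 2, together with the indicator probability just described. The shapes, the tree encoding, and all names are illustrative assumptions, not the article's code: row c of G holds the probabilities of all expansions of category c, with a binary rule c → a b at column a·C + b and lexical rules in a separate block.

```python
import numpy as np

C, V = 3, 4                                   # toy category and vocabulary sizes
word_id = {"we": 0, "eat": 1, "red": 2, "apples": 3}
rng = np.random.default_rng(0)
G = rng.dirichlet(np.full(C * C + V, 0.2), size=C)   # row c: expansions of c

def tree_prob(tree):
    """Probability of a tree as a product of rule probabilities; nodes are
    (cat, left, right) tuples and preterminals are (cat, word) tuples."""
    if len(tree) == 2:                        # lexical rule c -> w
        cat, word = tree
        return G[cat, C * C + word_id[word]]
    cat, left, right = tree                   # binary rule c -> a b at column a*C + b
    return G[cat, left[0] * C + right[0]] * tree_prob(left) * tree_prob(right)

def sentence_given_tree(tree, sentence):
    """P(sentence | tree) as described above: one iff the yield matches."""
    def leaves(t):
        return [t[1]] if len(t) == 2 else leaves(t[1]) + leaves(t[2])
    return 1.0 if leaves(tree) == list(sentence) else 0.0

t = (0, (1, "we"), (2, (1, "eat"), (2, "apples")))
print(tree_prob(t), sentence_given_tree(t, ["we", "eat", "apples"]))
```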
4. Bounded Statistical Grammar Induction Model
Example depth- and side-specific grammar matrix G2, based on the grammar in Figure 2.
5. Labeled Parsing Evaluation
Note that this use of recall and homogeneity is distinct from the commonly used F-score and V-measure for hypothesized constituents and category labels, respectively. F-score is the harmonic mean of recall and precision (which has the same form as recall but with the roles of the attested and hypothesized trees reversed), and V-measure is the harmonic mean of homogeneity and completeness (which has the same form as homogeneity but with the roles of the attested and hypothesized categories reversed). These aggregated measures are usually used as checks on evaluated models that can generate unlimited numbers of hypotheses or hypotheses of unlimited granularity. However, in the present application, hypothesized constituents in parse trees are limited by the number of words in each sentence, and hypothesized category labels are limited to a constant set of categories of size C, so checks on the number and granularity of hypotheses are not necessary. Moreover, the use of recall rather than F-score in these evaluations assumes that the decision to suppress annotation of constituents to make flatter trees is motivated by expediency on the part of the annotators, rather than by linguistic theory, so extra constituents in binary-branching trees that are not present in attested trees are not counted against induced grammars unless they interfere with the recall of other attested constituents. Likewise, the use of homogeneity rather than V-measure in these evaluations assumes that the decision to suppress annotation of information about case or subcategorization in category labels is motivated by expediency rather than by linguistic theory, so the use of categories to make such additional distinctions is not counted against induced grammars unless it interferes with the homogeneity of predictions of other attested categories from hypothesized categories.
Experiments in Section 7.2 show that even without completeness as a check on the size of the category label set, results peak at C = 45 and decline thereafter.
Notwithstanding this use of RH in tuning and internal evaluations, comparisons of models proposed in this article to other existing models do use F-score, in order to ensure a fair comparison using the same measure to which these other models have been optimized.
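An illustrative sketch of the two ingredients discussed above appears below, assuming constituents are (label, i, j) tuples; sklearn's homogeneity_score provides the conditional-entropy-based homogeneity. How recall and homogeneity are combined into the RH score follows the article's Section 5 definitions, which this sketch does not reproduce.

```python
from sklearn.metrics import homogeneity_score

def span_recall(attested, hypothesized):
    """Fraction of attested constituents whose (i, j) span is hypothesized,
    along with the matched constituents for the labeling step below."""
    hyp_spans = {(i, j) for _, i, j in hypothesized}
    matched = [c for c in attested if (c[1], c[2]) in hyp_spans]
    return len(matched) / len(attested), matched

def label_homogeneity(matched, hypothesized):
    """Homogeneity of attested labels given hypothesized labels on spans the
    hypothesis recovered; extra induced distinctions are not penalized."""
    hyp_label = {(i, j): lab for lab, i, j in hypothesized}
    true_labels = [lab for lab, i, j in matched]
    pred_labels = [hyp_label[(i, j)] for _, i, j in matched]
    return homogeneity_score(true_labels, pred_labels)
```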
6. Experiment 1: Evaluation of Unbounded PCFG Induction on Synthetic Data
The unbounded model described in Section 3 is evaluated first on synthetic data (Jin et al. 2018b) to determine whether it can reliably learn a recursive grammar from data with a known optimum solution. The symmetric concentration hyperparameter β is set to 0.2, following Jin et al. (2018b). The corpus consists of 50 sentences each of the form a b c, a b b c, a b a b c, and a b b a b b c, which have the optimal tree structures shown in Figure 7 (see note 4). The (b) and (d) trees require the system to hypothesize depth 2 structures. The system was able to recall all optimal tree structures with an equivalent category allocation.
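A short sketch of this synthetic corpus follows; the per-token uniform draws come from note 4, while details the text leaves open (e.g., independent sampling with replacement) are assumptions.

```python
import random

random.seed(0)
templates = ["a b c", "a b b c", "a b a b c", "a b b a b b c"]

def sample_sentence(template):
    # each a, b, c token drawn uniformly from 50 subtypes, e.g., a1 ... a50
    return [f"{sym}{random.randint(1, 50)}" for sym in template.split()]

corpus = [sample_sentence(t) for t in templates for _ in range(50)]
print(len(corpus))   # 200 sentences, 50 per template
```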
Synthetic center-embedding structure. Note that tree structures (b) and (d) have depth 2 because they have complex sub-trees spanning a b and a b b, respectively, embedded in the center of the yield of their roots.
The accuracy of the unbounded model was also compared against that of existing induction models by Seginer (2007a), Ponvert, Baldridge, and Erk (2011), and Shain et al. (2016), as well as Kim, Dyer, and Rush (2019). The two models from Kim, Dyer, and Rush (2019) differ in that the model with z induces sentence-specific grammars, and the model without z induces one grammar for all sentences. The results are shown in Table 1. No other system was able to recall all optimal tree structures.
The oracle best accuracy scores of unlabeled parse evaluation of different systems on synthetic data.
| System | Recall | Precision | F1 | RH |
|---|---|---|---|---|
| Seginer (2007a) | 0.71 | 0.83 | 0.77 | – |
| Ponvert, Baldridge, and Erk (2011) | 0.81 | 0.91 | 0.86 | – |
| Shain et al. (2016) | 0.38 | 0.38 | 0.38 | – |
| Kim, Dyer, and Rush (2019) without z | 0.73 | 0.73 | 0.73 | 0.73 |
| Kim, Dyer, and Rush (2019) with z | 0.73 | 0.73 | 0.73 | 0.73 |
| Unbounded PCFG (Section 3) | 1.00 | 1.00 | 1.00 | 1.00 |
7. Experiment 2: Evaluation of Unbounded PCFG Induction on Child-Directed Speech
Observing that the model is able to correctly identify known grammars from data, we then evaluate the unbounded PCFG inducer on a corpus of child-directed speech from the Adam and Eve sections of the Brown corpus (Brown 1973) of CHILDES (MacWhinney 1992). The Adam data set consists of transcripts of interactions between Adam and his caregivers recorded at ages ranging from 2 years 3 months to 5 years 2 months. Eve is similar, with interactions recorded between ages 1 year 6 months and 2 years 3 months. Penn Treebank–style syntactic annotation for the child-directed utterances is provided by Pearl and Sprouse (2013) using an automatic parser (Charniak and Johnson 2005) and human annotators. There are 28,779 sentences in the annotated Adam corpus, with an average sentence length of 6 words. There are 67 unique syntactic categories used in the data set. N-ary branching is not binarized in the human annotation, but unary branching chains are collapsed and the topmost category in the chain is used as the category for the constituent. The Eve section has 14,251 sentences, with 64 unique syntactic categories and an average sentence length of 5.6 words. The numbers of unique phrasal categories after unary chain collapse are 25 and 21, respectively.
Hyperparameters β and C are set to optimize accuracy on the Adam section. Several analyses are performed using grammars and trees induced on Adam. Finally, held-out evaluation is performed on the Eve section.
Following previous work, these experiments leave all punctuation in the input for learning, as a proxy for prosodic cues about phrasal boundaries (Seginer 2007b). Punctuation is then removed in all evaluations on development and test data. All results reported for each condition include induced grammars and trees from running the system with 10 random seeds. Each run contains 700 sampling cycles, and the final sampled grammar is used to generate the final parses of the corpus. Accuracy is evaluated by comparing optimal (Viterbi) parses instead of sampled parses. These evaluated parses are strictly binary-branching trees, although annotations may contain flatter n-ary trees. Results include all runs for each condition, shown in plots as boxes with boundaries at the first and third quartiles, with medians as green lines inside, and with upper and lower whiskers showing the minimum and maximum of each set of data points. Circles are used for outliers, which are data points with values more extreme than 1.5 times the interquartile range (the distance between the first and third quartiles).
7.1 Optimization of Concentration Parameter on Exploratory Partition
In Bayesian induction, the Dirichlet concentration hyperparameter β controls the shape of sampled multinomial distributions, with high values yielding more uniform distributions over expansion rules in the grammar and low values concentrating the probability mass on only a few expansions. Figure 8 shows RH scores for runs with different β values on Adam with the number of syntactic categories C = 30. Results show a peak at β = 0.1, indicating a preference for sparse, highly concentrated probabilities over a few expansion rules.
Indeed, human grammars are generally sparse in this way (Johnson, Griffiths, and Goldwater 2007; Goldwater and Griffiths 2007). For example, in the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993), there are 73 unique nonterminal categories. In theory, there can be more than 28 million possible unary, binary, and ternary branching rules in the grammar. However, only 17,020 unique rules are found in the corpus, showing the high sparsity of attested rules in the grammar. In other frameworks like CCG (Steedman 2002), where lexical categories can number in the thousands, the number of attested lexical categories is still small compared to all possible lexical categories.
This sparsity also shows up in the POS assignments of words: the number of POS tags a word can have is usually very small. For low-frequency words and hapax legomena, β has a particularly strong influence on the posterior uniformity of POS assignment, with natural language grammars clearly preferring low uniformity.
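The following is a quick numerical illustration of the sparsity preference described above: low β concentrates a sampled rule distribution on a few expansions, while high β spreads it nearly uniformly. The rule count of 20 is an arbitrary assumption for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
for beta in (0.1, 1.0, 10.0):
    draw = rng.dirichlet(np.full(20, beta))    # one categorical over 20 rules
    top3 = np.sort(draw)[-3:].sum()
    print(f"beta={beta:>4}: mass on top 3 of 20 rules = {top3:.2f}")
```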
Constituency grammar induction is often measured using F1 scores over unlabeled spans (Seginer 2007a; Ponvert, Baldridge, and Erk 2011, inter alia). Figure 9 shows unlabeled F1 scores with different β values on Adam. Contrary to the prediction, grammar accuracy peaks at high values of β when measured using unlabeled F1. However, upon closer inspection, the grammars with high unlabeled F1 are almost purely right-branching grammars, which do indeed perform very well on English child-directed speech in unlabeled parsing evaluation, but these right-branching grammars have phrasal labels that do not correlate with human annotation when evaluated with RH. This indicates that instead of capturing human intuitions about syntactic structure, such grammars have only captured broad branching tendencies.
Unlabeled F1 scores for various β values on exploratory partition (Adam).
7.2 Optimization of Category Domain Size on Exploratory Partition
Previous work on PCFG induction has usually used fewer than 20 syntactic categories (Shain et al. 2016; Jin et al. 2018b). This number is substantially smaller than the number of categories in human annotations, but this may be expected because there may not be enough statistical clues in the data for the inducer to distinguish some categories from others. For example, determiners and cardinal numbers may appear very similar distributionally, because both usually occur before bare nouns and bare noun phrases. However, the labeled evaluation with the sparsity parameter in Section 7.1 indicates that unlabeled evaluation is not informative enough about the accuracy of induced grammars, as it includes no measure of accuracy for induced constituent labels. Figure 10 shows induction results on the Adam data set with several category domain sizes given the optimal value of β. Results show a peak of RH at C = 45. This suggests that the inducer may have insufficient categories to use at lower domain sizes, yielding much lower RH values at C = 15. The accuracy at C = 45 is a well-formed peak with the smallest variance among all experimental settings, but there is a secondary peak at C = 75, with some induced grammars as accurate as induced grammars with 45 categories. This may indicate some statistical evidence in the data for further subcategorization of the grammars with 45 categories, but such evidence may not be strong enough to reduce posterior multimodality.
RH scores for various C values and β = 0.1 on exploratory partition (Adam) (***: p < 0.001).
7.3 Correlation of Model Fit and Parsing Accuracy
Model fit, or data likelihood, has been reported to be uncorrelated or only weakly correlated with parsing accuracy for some unsupervised grammar induction models when the model has converged to a local maximum (Smith 2006; Johnson, Griffiths, and Goldwater 2007; Liang et al. 2007). Figure 11 shows the correlation between data likelihood and RH at convergence for all 70 runs with β = 0.1. There is a significant (p < 0.001) positive correlation (Pearson's r = 0.737) between data likelihood and RH at convergence for our model. This indicates that although noisy and unreliable, data likelihood can be used as a metric for preliminary model selection. The figure also shows that the distribution of likelihoods across C values reflects this correlation, with most of the induced grammars with high-performing C values such as 45 or 75 in the region of the highest likelihoods, and most of those with low-performing C values such as 15 or 90 in the region of the lowest likelihoods. The difference between this significant correlation of parsing accuracy and data likelihood and previous findings of weak or no correlation may be due to the use here of labeled (RH) accuracy as a more natural measure of parsing accuracy than unlabeled (F1) accuracy. It may also be due to the simpler language used in Adam compared to that of the newswire data sets used in previous work. Finally, the discrepancy may be due to the use of Expectation Maximization in previous work, which may overfit a grammar to a data set, and could give unrealistically high likelihood to grammars that are too specific to a particular set of sentences.
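A minimal sketch of this likelihood-based preliminary model selection appears below: check the likelihood/RH correlation across converged runs and keep the grammar from the highest-likelihood run. The run-record format is an assumption for illustration.

```python
from scipy.stats import pearsonr

def select_grammar(runs):
    """runs: list of dicts with 'log_likelihood' and 'rh' keys, one per
    converged induction run; returns the highest-likelihood run."""
    r, p = pearsonr([run["log_likelihood"] for run in runs],
                    [run["rh"] for run in runs])
    print(f"Pearson r = {r:.3f} (p = {p:.3g})")
    return max(runs, key=lambda run: run["log_likelihood"])
```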
The correlation between likelihood and RH on Adam over various C values for β = 0.1.
7.4 Results for Unbounded Induction on Held-Out Partition
With the hyperparameters tuned on Adam, experiments are run on the held-out section of Eve. Results are shown in Figure 12. The median unlabeled F1 score is around 0.6, and the median RH score is 0.38. The RH of the highest-likelihood run is 0.44. Table 2 shows the unlabeled F1 and labeled RH scores for published systems, using the grammar induced in this work from the run with the highest likelihood on the whole section. The inducer optimized for RH still achieves good unlabeled parsing accuracy, although the unlabeled F1 score is lower than that of a purely right-branching baseline. Figure 9 shows that β = 1.0 does help induce grammars that are mostly right-branching but still retain some linguistically meaningful constituents, which pushes the unlabeled F1 score above the right-branching baseline accuracy of 0.75 on the Adam section. It is reasonable to assume that using β = 1.0 on Eve would achieve the same result. However, the deterioration of the quality of constituent labeling at high values of β makes optimizing for unlabeled F1 much less attractive. Some of the published systems have no way to produce labeled trees, so the labeled evaluation is not applicable to them. For the right-branching baseline, because there is no trivial and automatic way to assign different category labels to constituents, its RH score is 0.0.
Unbounded induction experiment on the held-out partition (Eve) with β = 0.1 and C = 45.
PARSEVAL scores on Eve data set with previously published induction systems.
| System | F1 | RH |
|---|---|---|
| Seginer (2007a) | 0.52 | – |
| Ponvert, Baldridge, and Erk (2011) | 0.56 | – |
| Shain et al. (2016) | 0.66 | – |
| Kim, Dyer, and Rush (2019) without z | 0.51 | 0.44 |
| Kim, Dyer, and Rush (2019) with z | 0.31 | 0.39 |
| this work (D = ∞, C = 45) | 0.62 | 0.44 |
| Right-branching | 0.76 | 0.00 |
7.5 Analysis of Learned Syntactic Categories and Grammatical Rules
We are interested in examining the learned categories and rules and comparing them to the annotation. Many of the most common induced rules look linguistically sensible. The twenty most frequent rules generated by the run of the unbounded inducer with the highest likelihood, using the optimal parameters β = 0.1 and C = 45, on the Adam data set are shown in Table 3. Each rule is followed by the most common attested rule for the same decomposition (on the upper line), and some randomly sampled examples (on the lower line, with a vertical bar showing the split point between left and right child spans; see note 5). The recall homogeneity for this run is 0.57. Of these twenty most frequent rules, only six (the first, seventh, eighth, ninth, nineteenth, and twentieth) do not seem to correspond to any linguistically recognizable syntactic analysis. Those that do are the following:
The second and sixteenth most frequent rules seem to simply undo the programmatic tokenization of contractions of modals and negation adverbs (e.g., is and n't) that is common in Penn Treebank annotations (Marcus, Santorini, and Marcinkiewicz 1993).
The third rule, the fourth and fifteenth rules, and the eighteenth rule fairly selectively attach subjects to verb phrases, direct objects to transitive verbs, and complements to prepositions, respectively.
The fifth rule decomposes content questions into question words followed by sentences containing gaps (but also conflates these with sentences followed by echo questions).
The sixth most common rule right-adjoins particles and adverbial modifiers onto verb phrases.
The tenth rule left-adjoins interjections onto sentences.
Rules eleven through thirteen decompose noun phrases into determiners followed by common nouns. It is also interesting to note that the model reliably distinguishes subjects (category 33 in this run) from direct objects (category 8) and complements of prepositions (category 7; see note 6). This suggests that case systems, which treat subjects, direct objects, and oblique objects as different categories, might naturally arise from distributions of words in sentences, rather than from a biological bias. Rules eleven and thirteen have the same child categories but different parent categories. Further inspection of the rules using the parent categories shows that these two types of noun phrases are distinguished by whether the main verb needs further complements or adjuncts.
Rules fourteen and seventeen perform subject-auxiliary inversion by attaching subjects to auxiliaries below the complements. This kind of structure is unusual in movement-based analyses, but is a common feature of categorial grammar analyses because it allows both the subject and the complement to be adjacent to the auxiliary as its arguments.
The most frequent rules induced in Adam (β = 0.1, C = 45) and their correspondences in the attested trees. The examples are randomly sampled from the induced trees. Question marks (‘??’) for parent, left child, or right child indicate no constituent was attested at that location.

Figure 13a shows a confusion matrix for this same highest-likelihood run on Adam with β = 0.1 and C = 45, showing percentages of several common attested non-preterminal categories that are hypothesized as one of the ten most common induced categories. Preterminal categories are not included because their boundaries as single words are trivially induced. This run correctly recalled most noun phrases and prepositional phrases, but missed a large proportion of clauses and a majority of verb phrases. Figure 13b shows the same confusion matrix with percentages of hypothesized categories that correspond to each attested category. This shows that the hypothesized categories that correspond to noun phrases, verb phrases, and prepositional phrases almost exclusively represent these categories.
Table 4 shows the 20 most frequent induced syntactic categories at the preterminal positions and the corresponding human annotated POS tags, showing the percentage of each induced category attested with each tag. Attested POS tags that have fewer than 100 word tokens or fewer than 5% of the induced category instances are not included in the table. This time all but two induced categories (the eighteenth and twentieth) seem linguistically meaningful:
93% of instances of the most common induced preterminal category correspond to attested determiners or possessive pronouns.
99% of the second and twelfth most common induced preterminal categories (categories 33 and 28) and 86% and 89% of the sixth and tenth most common categories (categories 37 and 8) correspond to attested pronouns and other single-word noun phrases. Table 5 lists examples for these four induced categories found in the Viterbi parses. Category 37 is usually a third-person singular subject (that, it, he), category 33 is almost always a plural or second-person subject (you, they, we), category 8's most common instances are accusative pronouns (it, them, me), and category 28 is mostly the nominative first-person pronoun (I) occurring in the subject position. Word order information and subject–verb agreement seem to drive this subcategorization, which resembles a combination of case and number. Because the inducer has no sub-word phonological information, it relies solely on word order information to distinguish nominative and accusative cases, which is especially important for pronouns like it, and for common nouns in English, where the two cases are syncretic.
100% of the third most common induced preterminal category (category 11) and 83% of the sixteenth most common category (category 36) correspond to attested verbs. It is also interesting to note that at least half of category 11 appear as the left child in the fourth and fifteenth most common induced rules, generally in a transitive verb context (an attested verb followed by a noun phrase), and many of category 36 appear as the left child in the sixth most common rule, often in a non-transitive verb context (an attested verb followed by a particle or prepositional phrase). This suggests that the inducer distinguishes transitive and intransitive verbs. This is especially interesting because the inducer does not appear to distinguish base, participial, and past-tense forms of verbs, presumably deriving a higher overall posterior probability from subcategorization distinctions.
There are also common homogeneous categories corresponding to auxiliaries (73% of the fourth most common induced preterminal and 89% of the seventh most common), common nouns (91% of the fifth most common induced preterminal and 96% of the thirteenth most common), interrogative pronouns (84% of the eighth most common preterminal), prepositions (80% of the ninth most common preterminal), and others.
Recall of gold POS tags in the top 20 most frequent induced syntactic categories at the preterminal positions. Note that because of unary chain collapse, phrasal tags like NP can appear at preterminal positions.
| Rank | Induced category | Category count | Attested category and relative frequency |
|---|---|---|---|
| 1. | 0 | 11,327 | DT (0.77); PRP$ (0.16) |
| 2. | 33 | 9,983 | NP (0.99) |
| 3. | 11 | 8,853 | VB (0.71); VBP (0.12); VBD (0.07) |
| 4. | 30 | 8,031 | COP (0.64); AUX (0.09); VBZ (0.07); VP (0.05) |
| 5. | 32 | 7,865 | NN (0.81); NNS (0.10) |
| 6. | 37 | 7,402 | NP (0.86) |
| 7. | 35 | 7,333 | AUX (0.72); MD (0.17) |
| 8. | 38 | 6,900 | WHNP (0.54); WHADVP (0.23); WP (0.07) |
| 9. | 40 | 6,712 | IN (0.80); RB (0.05) |
| 10. | 8 | 6,013 | NP (0.89) |
| 11. | 6 | 5,424 | INTJ (0.60); ADVP (0.10); NP (0.09); CC (0.05) |
| 12. | 28 | 4,004 | NP (0.99) |
| 13. | 10 | 3,880 | NN (0.88); NNS (0.08) |
| 14. | 43 | 3,171 | ADJP (0.27); NP (0.22); VP (0.13); JJ (0.10); VBG (0.07) |
| 15. | 31 | 3,086 | NOT (0.99) |
| 16. | 36 | 3,043 | VB (0.60); VP (0.18); VBP (0.05) |
| 17. | 13 | 2,705 | PRT (0.32); ADVP (0.32); NP (0.12) |
| 18. | 1 | 2,483 | JJ (0.66); NN (0.20) |
| 19. | 3 | 2,388 | TO (0.80); IN (0.15) |
| 20. | 18 | 2,220 | RB (0.20); NOT (0.19); VBG (0.13); IN (0.11) |
Recall of the top 3 most frequent words in the four induced categories that correspond to noun phrases.

| Rank | Induced category | Category count | Attested words and relative frequency |
|---|---|---|---|
| 1. | 33 | 9,983 | you (0.86); they (0.05); we (0.02) |
| 2. | 37 | 7,402 | that (0.38); it (0.27); he (0.07) |
| 3. | 8 | 6,013 | it (0.36); them (0.06); me (0.06) |
| 4. | 28 | 4,004 | I (0.70); he (0.10); it (0.06) |
8. Experiment 3: Evaluation of Bounded PCFG Induction on Child-Directed Speech
The Adam and Eve sections of the Brown corpus are then used to evaluate the depth-bounded model defined in Section 4. Transcribed child-directed speech data in Mandarin Chinese (Tong; Deng et al. 2018) and German (Leo; Behrens 2006) are also collected from the CHILDES corpus, with reference trees automatically generated using the state-of-the-art supervised parser of Kitaev and Klein (2018) trained on the Penn Chinese Treebank (Xia et al. 2000) and the NEGRA German treebank (Skut et al. 1998). These are used as held-out data sets for the bounded grammar induction experiments, using cross-linguistic hyperparameters tuned on English. The Tong data set contains 19,541 sentences recorded between ages 1 year 0 months and 4 years 5 months, with an average sentence length of 5.7 words and 55 unique syntactic categories. The Leo data set contains 20,000 child-directed utterances randomly sampled from the original Leo corpus, which densely records interactions between Leo and his caregivers between ages 1 year 11 months and 4 years 11 months. There are 72 unique syntactic categories in the parsed data set, with an average sentence length of 6.7 words. Disfluencies in all corpora are removed, and only sentences spoken by caregivers are kept in the data.
The hyperparameter β = 0.1 is used for all experiments as it is found to be optimal in experiments described in Section 7. The optimal C was found to be 45 with Adam, but all depth-bounding experiments described here use C = 30 because the memory of available graphics processing units is not sufficient to contain depth-specific grammars at higher values of C. All the other settings follow the unbounded experiments: 10 randomly seeded runs for each experimental setting with results reported using box and whisker plots, induction with punctuation and evaluation without punctuation, and labeled evaluation with Viterbi parses.
8.1 Optimization of Depth on Exploratory Partition
The exploratory partition (Adam) is first used to determine an optimal depth bound D. Figure 14 shows the interaction between depth and RH scores on Adam at β = 0.1 and C = 30. RH peaks at D = 3, which is consistent with previous results showing that three levels of nested center embedding appear to be the maximum in natural language text in many languages (Karlsson 2007, 2010; Schuler et al. 2010), which is in turn attributed to the limited capacity of working memory. Induced grammars at D = 1 appear to be inadequate for capturing child-directed speech. D = ∞ grammars show great variance in accuracy, with induced grammars among both the most and the least accurate. This shows the value of depth-bounding: it acts as an inductive bias, removing low-accuracy posterior modes from the space of possible grammars, so that the inducer is more likely to find grammars that are high in data likelihood and also consistent with human memory constraints.
Significance testing with the labeled evaluation metric RH, described in Section 5, is used in all experiments reporting significance levels. The parses from all 10 runs for each experimental condition are concatenated, and random permutations between parses from the two experimental conditions are carried out to calculate the probability of the observed accuracy difference between the conditions. Results show that the difference between D = 3 and D = ∞ is highly significant (p < 0.001) on Adam, showing that depth-bounding significantly improves the chance of inducing more accurate grammars.
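The following is a sketch of the permutation test described above, assuming a scoring function such as RH computed over the concatenated parses of one condition; swapping whole sentences' parses between the two conditions is an assumption about the permutation unit.

```python
import random

def permutation_test(parses_a, parses_b, score, n=10_000, seed=0):
    """p-value for the observed accuracy difference between two conditions."""
    rng = random.Random(seed)
    observed = abs(score(parses_a) - score(parses_b))
    hits = 0
    for _ in range(n):
        a, b = [], []
        for x, y in zip(parses_a, parses_b):   # randomly swap paired items
            if rng.random() < 0.5:
                a.append(x); b.append(y)
            else:
                a.append(y); b.append(x)
        if abs(score(a) - score(b)) >= observed:
            hits += 1
    return hits / n
```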
Figure 15 shows trees from the highest-likelihood D = ∞ and D = 3 runs on Adam. The analysis from the unbounded grammar (a) has a depth of 5, shown by the deeply nested center-embedding analysis of the span scratch or cut you can clean it with a ball of cotton, which does not resemble any kind of linguistic analysis. With depth-bounding, such analyses are never entertained by the inducer, even though in this case the unbounded grammar may have a higher likelihood than the bounded grammar. The analysis from the bounded grammar (b) is closer to linguistic annotation, where the if clause is separated from the main clause. Some of the noun and prepositional phrases are also clearly identified, and the depth of the tree is 3.
Example syntactic analyses from D = ∞ and D = 3 runs on Adam with the highest likelihood.
8.2 Results for Bounded Induction on Held-Out Partition
The bounded induction model with β = 0.1, C = 30, and D = 3 is then evaluated on held-out data sets in three languages: Eve in English, Tong in Mandarin Chinese, and Leo in German. Figure 16 shows that the models bounded at depth 3 are more accurate than unbounded models with both unlabeled and labeled evaluation metrics for all data sets, similar to what has been observed in Adam. Significance testing with item-level permutation with unlabeled F1 shows the accuracy differences across three data sets are all highly significant (p < 0.001).
Comparison of labeled and unlabeled evaluation of grammars bounded at depth 3 and unbounded grammars on English (Eve), Chinese Mandarin (Tong), and German (Leo) data sets from CHILDES (β = 0.1, C = 30).
8.3 Analysis of Learned Syntactic Categories and Grammatical Rules on Chinese
Table 6 shows the top 20 most frequent induced rules and annotated rules found in the automatically parsed data. Induced rules that capture linguistic phenomena that are different from English are described below.
The first and fourth rules show two different ways to form sentences, with the first rule used mainly for declarative sentences and the fourth rule for questions. The fourth rule splits a question into an ordinary sentence followed by a sentence-final particle, such as 吗 (ma, the particle for a yes–no question) and 好不好 (hao bu hao, 'good or not', a phrase that turns a declarative sentence into a question). The fifteenth rule also splits a sentence into a sentence and a particle, but the whole sentence is declarative. The sentence-final punctuation helps the inducer distinguish these two types of sentences, but it must rely on statistics to split the particle off of the rest of the sentence, because in many cases the particle is not present. It is worth noting that the two most frequent rules are also the rules with the largest number of unattested constituents, because these rules are for sentence-level constituents, and a bracketing error at any node below may cause the top-level rule to have unattested constituents.
Mandarin Chinese is a classifier language: a classifier is usually needed when a determiner or a number occurs before a noun. The twelfth and thirteenth rules are used for combining noun phrases with prenominal modifiers like a determiner phrase or a quantifier phrase, which in turn is formed by the eighteenth rule. The twelfth rule constructs noun phrases at the subject position, and the thirteenth rule is for objects. As in English, a nominative–accusative case distinction is induced in the grammar even though there is no morphological or lexical marking of case.
There appear to be statistical cues for the semantics of nouns, too. Table 7 shows the four main induced noun categories in this grammar. The distinction is clear: the first category is for personal pronouns and relative terms, the second for general nouns, the third for location terms (annotated as LC in the Penn Chinese Treebank, but appearing as NPs because of unary chain removal), which are used as nouns in Chinese, and the fourth for wh-pronouns. These four classes of nouns seem to have distinct statistical properties. For example, personal pronouns and relative terms almost never appear with determiner phrases and quantifier phrases, but general nouns almost always do. The location terms are most often modified by noun phrases, and the wh-pronouns appear in questions.
Ba in Mandarin Chinese takes the object of the verb and moves it to the preverbal position, making a normally SVO sentence SOV. It can be analyzed as a light verb (Ding 2001; Duan and Schuler 2015) or as a preposition and case marker (Ye, Zhan, and Zhou 2007). Figure 17 shows that the automatic annotation is consistent with the annotation guidelines for the Penn Chinese Treebank. The induced analysis seems to support treating ba as a preposition: the ba phrase combines with the VP after it, and the result is also a VP.
The most frequent rules induced in Tong and their correspondences in the gold annotation. The examples are randomly sampled from the bounded induced trees.

Recall of the top 3 most frequent words in the four induced categories that correspond to nouns in Tong.
| Rank | Induced category | Category count | Attested words and relative frequency |
|---|---|---|---|
| 1. | 3 | 12,425 | – |
| 2. | 29 | 4,073 | – |
| 3. | 18 | 1,991 | – |
| 4. | 15 | 1,707 | – |
Example syntactic analyses for a ba construction in Tong ('Mom takes off this').
9. Experiment 4: Natural Bounding in Child-Directed and Adult Language Data
It seems likely that children and adults have different working memory capacities, and therefore different capacities for center embedding. This section describes experiments to measure the difference between adult and child-directed speech during statistical grammar induction.
Following earlier work (Klein and Manning 2004; Seginer 2007a; Ponvert, Baldridge, and Erk 2011; Shain et al. 2016), these experiments use the Wall Street Journal section of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) as adult language data. Sentences shorter than or equal to 20 words are used in the experiments, partitioned into a development set, WSJ20Dev, and a held-out set, WSJ20Test, following Jin et al. (2018a).
We are first interested in how the sparsity preference related to the hyperparameter β behaves on adult-directed newswire data. Figure 18 shows the interaction between β values and the evaluation metrics. The right-branching bias contributed by high β can still be seen on this data set. The unlabeled F1 peaks at β = 0.2, but when labels are taken into account, the evaluation metric RH peaks at a much lower β = 0.01, again indicating a preference for sparse priors in labeled grammar induction. The observation that the optimal β on WSJ20Dev is lower than on child-directed speech suggests that on child-directed speech the right-branching bias from high β benefits the structural component of RH enough for some degree of right-branching bias to be favored. On WSJ20Dev, however, the advantage from the right-branching bias is outweighed by the damage it does to labeling accuracy, so the optimal β is low. Experiments with the WSJ20 data sets use β = 0.01.
Because the sentences in both transcribed child-directed speech corpora and adult-directed newswire corpora are produced by humans, several predictions can be made about the distribution of tree depths (the proportion of sentences whose induced tree has a given maximum depth) in the induced trees on both kinds of data sets. First, the distribution of tree depths at initialization should show a wide range, because at this point tree depths are correlated only with sentence length. Second, the distribution of tree depths for unbounded models should substantially narrow after training, as the induced grammar implicitly learns the human memory limits from data; imperfect learning is also expected, shown by a small number of sentences with trees of depth 4 or higher. Third, compared to child-directed data, adult-directed newswire data have more complicated structure, so the expected tree depth on adult-directed data should be higher than on child-directed data. Note that the child-directed data are still generated by adults; therefore, the maximum memory depth is still expected to be similar to that of other adult-generated data. However, the trees generated by the grammar may reflect the fact that the sentences in child-directed data are relatively short and simple in structure.
Figure 19 shows the distribution of tree depths on the development sets of the child-directed data, Adam, and the adult-directed data, WSJ20Dev. The three bars represent the induced trees from initialized unbounded grammars (blue), unbounded grammars after training (orange), and grammars bounded at depth 3 (green). The predictions listed above are borne out in the results. The expected tree depth on Adam is 1.59 for the unbounded models and 1.55 for the bounded models, a significant difference (p < 0.001 using a permutation test). The expected tree depth on WSJ20Dev is 2.49 for the unbounded models and 2.28 for the bounded models, also a significant difference (p < 0.001 using a permutation test), confirming that depth-bounding leads to trees with lower usage of stack elements. Also notably, the percentage of trees of depth 4 or higher from unbounded models on both Adam and WSJ20Dev is very small, indicating that the unbounded models are able to learn the human memory constraints implicitly from data. However, the unbounded grammars face a larger search space of possible grammars, and they may need to allocate rules, categories, or probability mass to deep tree structures, which makes them less accurate, as shown in previous experiments.
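The following is a short sketch of the tree-depth summary used above: the distribution of maximum center-embedding depths over a set of parses and its expectation. It assumes the embedding_depth function from the sketch in Section 2 is in scope.

```python
from collections import Counter

def depth_distribution(parses):
    """Proportion of parses at each maximum depth, plus the expected depth."""
    depths = [embedding_depth(tree) for tree in parses]
    counts = Counter(depths)
    total = len(depths)
    expected = sum(d * c for d, c in counts.items()) / total
    return {d: c / total for d, c in sorted(counts.items())}, expected
```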
Distribution of tree depths on child-directed and adult-directed sentences. The difference of expected tree depths of bounded and unbounded models is significant on both data sets.
This experiment points to a potential unsupervised method for determining the optimal depth limit for a data set, although the number of candidate values for the depth limit parameter is already very small. The optimal maximum depth on Adam appears to be 3. On WSJ20Dev it appears to be 3 or 4, taking into account results from the psycholinguistic literature (Karlsson 2007, 2010; Schuler et al. 2010). The following experiments choose depth 3 as the maximum allowed depth for the depth-bounding model. We leave the empirical investigation of the accuracy of grammars bounded at depth 4 for future work, because of its extensive resource and runtime requirements.
10. Experiment 5: Replication of Depth Bound Effects in Newswire Corpora
Independent of its value for modeling child language acquisition, there may also be engineering benefits to applying center-embedding depth bounds during grammar induction on newswire corpora. The proposed bounded and unbounded models are run with 10 random seeds, with the mean accuracy and standard deviation shown in Table 8. Both models achieve similar unlabeled accuracy, but the significant difference in labeled evaluation accuracy (p < 0.001) indicates that depth-bounding facilitates discovery of grammars with better labeling accuracy, leading to overall better accuracy when labels are taken into consideration. This shows that the depth-bounded grammar induction model, proposed as a model of human language acquisition, also works with more syntactically complex newswire text.
Mean and standard deviation of scores on WSJ20Test data set with proposed models on 10 runs. The difference of RH between the two models is significant (p < 0.001).
| System | F1 | RH |
|---|---|---|
| this work (D = ∞, C = 30) | 0.49 ± 0.02 | 0.28 ± 0.03 |
| this work (D = 3, C = 30) | 0.49 ± 0.02 | 0.31 ± 0.02 |
Results of the induced grammars with the highest likelihoods from several induction systems are presented in Table 9 for comparison. Neural induction systems achieve higher accuracy than the purely statistical systems in this work, but are not as easy to augment with depth bounds. The larger accuracy difference between statistical and neural models on WSJ compared with child-directed data may indicate that more categories are required to capture relatively complex syntactic structures: the neural models use 90 categories, whereas the statistical models here use 30. We therefore leave the integration of depth bounding into neural induction systems for future work.
Accuracy scores on WSJ20Test data set with previously published induction systems.
| System | F1 | RH |
|---|---|---|
| Seginer (2007a) | 0.61 | – |
| Ponvert, Baldridge, and Erk (2011) | 0.44 | – |
| Jin et al. (2018b) | 0.61 | – |
| Jin et al. (2019) | 0.51 | – |
| Kim, Dyer, and Rush (2019) without z | 0.52 | 0.35 |
| Kim, Dyer, and Rush (2019) with z | 0.54 | 0.37 |
| this work (D = ∞, C = 30) | 0.51 | 0.28 |
| this work (D = 3, C = 30) | 0.51 | 0.32 |
11. Conclusion
This article describes unbounded and depth-bounded PCFG induction models, intended to represent something akin to a competence grammar and a performance model that takes working memory and processing constraints into account, and evaluates them on transcribed corpora of child-directed speech.
Results from Section 7 show that the model predicts 44%—nearly half—of labeled attested constituents in the held-out partition, according to the RH measure. It is interesting that so much linguistic structure can be predicted from words alone, without semantics, and without universal linguistic constraints. It is anticipated that this assessment will encourage future research to discover how the other half can be predicted. Moreover, properties of the induced structures might be used to guide future linguistic analysis—for example, incorporating readily induced properties of number, case, and subcategorization into standard part-of-speech tag sets.
Results in Section 8 also show a statistically significant positive effect of depth bounding on grammar induction. This suggests a natural explanation for the simplified grammatical behaviors observed in child production: a basic memory-bounding mechanism that facilitates acquisition.
Acknowledgments
The authors would like to thank the anonymous reviewers and the editor for their helpful comments. Computations for this project were partly run on the Ohio Supercomputer Center. This research was funded by Defense Advanced Research Projects Agency award HR0011-15-2-0022 and by National Science Foundation grant 1816891. The content of the information does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred.
Notes
1. It is also not straightforward to augment neural network models to test the contribution of depth bounds.
2. A Kronecker product multiplies two matrices of dimension m × n and o × p (or vectors, in the case that n and p equal one) into a matrix of dimension mo × np consisting of a copy of the first matrix with each element replaced by a copy of the second matrix multiplied by that element.
3. Here, τn,i,j ∉ C if no constituent in τ yields words i to j, and τ̂n,i,j ∉ C if no constituent in τ̂ yields words i to j.
4. The tokens a, b, and c are randomly chosen uniformly from {a1, …, a50}, {b1, …, b50}, and {c1, …, c50}, respectively.
5. Question marks ('??') for parent, left child, or right child indicate no constituent was attested at that location.
6. Category 43 is used for complements of merged contractions. This analysis appears to be consistent, but is not a linguistically familiar analysis.