## Abstract

Learning probabilistic context-free grammars (PCFGs) from strings has been a classic problem in computational linguistics since Horning (1969). Here we present an algorithm based on distributional learning that is a consistent estimator for a large class of PCFGs that satisfy certain natural conditions including being anchored (Stratos et al., 2016). We proceed via a reparameterization of (top–down) PCFGs that we call a bottom–up weighted context-free grammar. We show that if the grammar is anchored and satisfies additional restrictions on its ambiguity, then the parameters can be directly related to distributional properties of the anchoring strings; we show the asymptotic correctness of a naive estimator and present some simulations using synthetic data that show that algorithms based on this approach have good finite sample behavior.

## 1 Introduction

This paper presents an approach for strongly learning a linguistically interesting subclass of probabilistic context-free grammars (PCFGs) from strings in the realizable case. Unpacking this, we assume that we have some PCFG that we are interested in learning and that we have access only to a sample of strings generated by the PCFG (i.e., sampled from the distribution defined by the context-free grammar). Crucially, we do not observe the derivation trees—the hierarchical latent structure. Strong learning means that we want the learned grammar to define the same distribution over labeled trees as the original grammar and not just the same distribution over strings.

Clearly, there can be many structurally different PCFGs that define the same distribution over strings. Consider for example the distribution that generates a single string of length 3 with probability one and the various PCFGs that give rise to that same distribution; for these reasons, which we discuss in more detail later, we cannot have an algorithm that does this for all PCFGs. Accordingly, we define some sufficient conditions on PCFGs for this algorithm to perform correctly. More precisely, we define some simple structural conditions on the underlying CFGs (in Section 3), and we will show that the resulting class of PCFGs is *identifiable from strings*, in the sense that any two PCFGs that define the same distribution over strings will be isomorphic.

We then provide a computationally trivial learning algorithm in Section 4, together with a proof that it will strongly learn every grammar in this class. The algorithm is not intended to be a realistic algorithm, but merely to illustrate the fundamental correctness of this general approach. We then show that general PCFGs in Chomsky normal form (CNF) that approximate the observable properties of natural language syntax are efficiently learnable using some simulations with synthetic data in Section 5.

Our primary scientific motivation is to understand the process of first-language acquisition, in particular the early phases of the acquisition of syntactic structure. Importantly, the grammar is not just a decision procedure that classifies strings as being grammatical or ungrammatical, but additionally assigns a tree structure to the grammatical sentences, a structure the primary role of which is to support semantic interpretation. The standard view is that children learn the syntactic structure of their languages not by purely syntactic means, but rather by using information about the range of available interpretations, derived from the situational context of the sentences they hear and inferences about the intentions and goals of the speaker (e.g., Abend et al., 2017). Indeed there is ample direct evidence from the developmental psycholinguistics literature that this does in fact happen at certain stages of language acquisition: For example, Gropen et al. (1991) showed that the acquisition of argument structure of verbs exploits semantic information about the verb and the arguments. However, the children in these experiments—the youngest cohort being nearly 4 years old—have already acquired a great deal of knowledge about English syntax.

Here, we are exploring an alternative or perhaps complementary hypothesis: namely, that the acquisition of the syntactic categories and rules of the language can to a certain extent be learned using only information derived from the surface strings without any appeal to external information about the hierarchical structure of the language that is being learned. In other words, the initial phases of language acquisition are based on purely syntactic information rather than the semantic bootstrapping discussed above.

The contributions of this paper are as follows. First, we provide a reparameterization of PCFGs within the space of weighted context-free grammars (WCFGs) that we call Bottom–up WCFGs. Next, we define three structural conditions on CFGs and show that they imply the identifiability of the class of all PCFGs based on those grammars. We then present a naive computationally trivial estimator and prove its asymptotic consistency for that class of PCFGs. We present some experiments on synthetic grammars that show that a variant of this algorithm has good finite sample behavior. Finally, we examine the extent to which these conditions are plausible, using a corpus of child-directed speech.

## 2 Definitions

We assume a fixed finite nonempty set of symbols *Σ*. The set of finite strings over this set is written *Σ*^{*}, nonempty finite strings are denoted by *Σ*^{+}, and the empty string is *λ*. We will typically write *a*, *b*, *c*,… for elements of *Σ* and *u*, *v*, *w*,… for elements of *Σ*^{*}. A (formal) language *L* is a subset of *Σ*^{*}. A context is an ordered pair of strings, that is, an element of *Σ*^{*} × *Σ*^{*} that we write as *l*,*r*. If *U*, *V* are languages, then their concatenation is *UV* defined in the normal way, and we will also write *uV* where *u* is a string instead of {*u*}*V* and so on. Given a fixed language *L*, we define for a set of strings *U* a set of contexts *U*^{⊳} as

$$U^{\triangleright} = \{\, l,r \mid lur \in L \text{ for all } u \in U \,\}.$$

If $U=\{u\}$ we will write *u*^{⊳} for the distribution of *u*, that is, the set of contexts in which it can occur.
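As a concrete illustration of the context operator, the following minimal sketch computes *u*^{⊳} and *U*^{⊳} for a small finite language. It assumes that terminals are single characters and that the language is given as an explicit Python set; the function names and the toy language are hypothetical and chosen only for this example.

```python
def contexts(u, language):
    """Contexts (l, r) of the nonempty string u: all pairs with l + u + r in the language."""
    found = set()
    for w in language:
        start = w.find(u)
        while start != -1:
            found.add((w[:start], w[start + len(u):]))
            start = w.find(u, start + 1)   # step by one so overlapping occurrences count
    return found


def contexts_of_set(strings, language):
    """U^> for a nonempty set U: the contexts shared by every string in U."""
    return set.intersection(*(contexts(u, language) for u in strings))


toy_language = {"s", "ab", "db"}            # hypothetical three-string language
print(contexts("b", toy_language))           # the two contexts ('a', '') and ('d', '')
print(contexts_of_set({"a", "d"}, toy_language))
```

For an infinite language the same definition applies, but the set of contexts can no longer be enumerated in this way.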

A stochastic language is a function ℙ: *Σ*^{*} → [0,1] such that $\sum_{w \in \Sigma^{*}} \mathbb{P}(w) = 1$. Note that the support of this distribution is a formal language as defined above. We assume for the rest of the paper that the expected length of strings drawn from this distribution is finite.

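For a string *u* ∈ *Σ*^{+}, we write $E(u)$ for the expected number of times that *u* will occur as a substring in a string distributed according to ℙ:^{1}

$$E(u) = \sum_{l,r \in \Sigma^{*} \times \Sigma^{*}} \mathbb{P}(lur).$$

We define for each string *u* its context distribution, which is a probability distribution over its contexts written $D(u)$, whose support will be *u*^{⊳}, given for *l*,*r* ∈ *Σ*^{*} × *Σ*^{*} by

$$D(u)[l,r] = \frac{\mathbb{P}(lur)}{E(u)}.$$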
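The two quantities just introduced can be computed exactly when the stochastic language has finite support. The sketch below assumes the reading in which the substring expectation of *u* sums ℙ(*lur*) over all contexts and the context distribution normalizes ℙ(*lur*) by that expectation; the dictionary `toy_prob` and the function names are hypothetical.

```python
from collections import defaultdict


def substring_expectation(u, prob):
    """E(u): expected number of occurrences of u as a substring of a random string."""
    total = 0.0
    for w, p in prob.items():
        start = w.find(u)
        while start != -1:
            total += p
            start = w.find(u, start + 1)
    return total


def context_distribution(u, prob):
    """D(u): maps each context (l, r) to P(l u r) / E(u)."""
    mass = defaultdict(float)
    for w, p in prob.items():
        start = w.find(u)
        while start != -1:
            mass[(w[:start], w[start + len(u):])] += p
            start = w.find(u, start + 1)
    e_u = sum(mass.values())              # equals E(u)
    return {ctx: m / e_u for ctx, m in mass.items()}


toy_prob = {"s": 0.5, "ab": 0.3, "db": 0.2}   # hypothetical stochastic language
print(substring_expectation("b", toy_prob))
print(context_distribution("b", toy_prob))
```

Each context of *u* corresponds to exactly one string and one position, so accumulating probability mass per occurrence gives ℙ(*lur*) for every context.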

### Context-Free Grammars

We consider context-free grammars (CFGs) in Chomsky normal form 〈*Σ*,*V*,*S*,*P*〉 where *Σ* is a nonempty finite set of terminal symbols; *V* is a nonempty finite set of nonterminal symbols, disjoint from *Σ*; *S* is a distinguished element of *V*, the start symbol; and *P* is a finite nonempty set of productions, each of which is either of the form *A* → *a* where *A* ∈ *V* and *a* ∈ *Σ*, or of the form *A* → *BC* where *A* ∈ *V* and $B, C \in V \setminus \{S\}$.^{2}

We write *A*,*B*,*C*,… for elements of *V* and *α* for strings over *V* ∪ *Σ*. A *derivation tree* *τ* is a singly rooted ordered tree where every node is labeled with an element of *V* ∪ *Σ* and each local tree is in *P*. The yield of a derivation is the string of symbols of leaves of the tree taken left to right; we write this as *y*(*τ*). The set of all derivations licensed by *G* and rooted by a nonterminal *A*, and with a yield in a set *Γ* is written as *Ω*(*G*,*A*,*Γ*); here we follow the notation of Smith and Johnson (2007) among others. We will omit *G* when it is clear.

We want to be able to combine trees using tree substitution; thus, if we have a tree *τ*_{1} whose yield is *lBr*, where *l* and *r* are strings over *Σ*, and a tree *τ*_{2} whose root is *B* and whose yield is *α*, we can combine them to get a tree *τ*_{1} ⊗ *τ*_{2} whose yield is *lαr*.

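We define the language of a nonterminal *A* to be

$$\mathcal{L}(G,A) = \{\, w \in \Sigma^{+} \mid \Omega(G,A,w) \neq \emptyset \,\}.$$

The language of *G* is ℒ(*G*) = ℒ(*G*,*S*).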

For a tree *τ* and a production *A* → *α* we write *f*(*A* → *α*;*τ*) for the number of times the production occurs in *τ*. We write |*τ*| for the number of nonterminal symbols in a tree, and |*w*| for the length of a string.

### 2.1 WCFGs

We will now consider the probabilistic case where we have a (discrete) probability distribution over trees, that is, over *Ω*(*G*,*S*,*Σ*^{+}), which will then define a stochastic language, whose support will be a context-free language. We will only consider those distributions which satisfy some simple conditional independence assumptions and can be represented by weighted CFGs.

A weighted CFG (WCFG) is a CFG *G* together with a parameter function *θ*: *P* → ℝ that maps productions to nonnegative real values; we will write this as *G*;*θ*. The weight or score of a tree *τ* is the product of the weights of the productions used in it. Formally *s*: *Ω*(*G*) → ℝ is defined as

$$s(\tau) = \prod_{A \to \alpha \in P} \theta(A \to \alpha)^{f(A \to \alpha;\tau)}.$$

Note that *s*(*τ*_{1} ⊗ *τ*_{2}) = *s*(*τ*_{1})*s*(*τ*_{2}). In general we will define the score of a set of trees *Ω* to be the sum of the scores of the trees in that set: $s(\Omega) = \sum_{\tau \in \Omega} s(\tau)$. The weight of a string *w* is the sum of the weights of each derivation tree which yields *w*; *s*(*w*) = *s*(*Ω*(*G*,*S*,*w*)).
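To make the score of a tree concrete, here is a minimal sketch that counts the productions used in a derivation tree and multiplies the corresponding weights. The tree encoding (nested tuples whose first element is the node label), the toy parameters, and all function names are assumptions made for this example only.

```python
from collections import Counter
from math import prod


def productions(tree):
    """Yield the productions used in a CNF derivation tree, top-down."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))                          # lexical: A -> a
    else:
        yield (label, tuple(child[0] for child in children))   # binary: A -> B C
        for child in children:
            yield from productions(child)


def tree_yield(tree):
    """The string of terminal leaves, read left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]
    return "".join(tree_yield(child) for child in children)


def score(tree, theta):
    """Product over productions of theta(production) ** (number of uses in the tree)."""
    counts = Counter(productions(tree))
    return prod(theta[rule] ** n for rule, n in counts.items())


# Hypothetical weights for a five-production grammar.
toy_theta = {("S", ("A", "B")): 2.0, ("S", ("s",)): 0.5,
             ("A", ("a",)): 0.3, ("A", ("d",)): 0.2, ("B", ("b",)): 0.5}
toy_tree = ("S", ("A", "a"), ("B", "b"))
print(tree_yield(toy_tree), score(toy_tree, toy_theta))
```

The weight of a string is then the sum of `score` over all of its derivation trees; in this toy grammar every string has a single tree.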

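The inside value of a nonterminal *A*, written *I*(*A*) is

$$I(A) = s(\Omega(G,A,\Sigma^{+})).$$

This is also called the *partition function*, written *Z*(*A*). The outside value, *O*(*A*), is defined likewise as

$$O(A) = s(\Omega(G,S,\Sigma^{*}A\Sigma^{*})).$$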

Note that *O*(*S*) = 1 by definition, since *Ω*(*G*,*S*,*Σ*^{*}*SΣ*^{*}) is a single element set consisting of the trivial tree with one node *S*, which has score 1.

A WCFG is globally normalized if *I*(*S*) = 1. In this case it defines a probability distribution over trees, where we identify the probability of a tree with its score, ℙ(*τ*) = *s*(*τ*), and via that a stochastic language.

### 2.2 Expectations

We define expectations of nonterminals, terminals, and productions, with respect to the distribution over trees defined by a globally normalized WCFG.

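The expectation of a production is the expected number of times that *A* → *α* occurs in a tree generated by the distribution induced by the grammar:

$$E(A \to \alpha) = \sum_{\tau \in \Omega(G,S,\Sigma^{+})} \mathbb{P}(\tau)\, f(A \to \alpha;\tau).$$

The expectations of a nonterminal and of a terminal are obtained by summing over productions: $E(A) = \sum_{\alpha} E(A \to \alpha)$ and $E(a) = \sum_{A} E(A \to a)$. For nonterminals *A*, *B*, *C* and terminals *a*, the following identities relate the expectations and the inside and outside values, which can be established using the methods of, for example, Chi (1999).

$$E(A) = I(A)\,O(A), \qquad E(A \to a) = O(A)\,\theta(A \to a), \qquad E(A \to BC) = O(A)\,\theta(A \to BC)\,I(B)\,I(C) \tag{1}$$

The parameters of a WCFG are not uniquely determined by the distribution over trees. For any nonterminal *A* that is not *S*, and any *β* > 0, we can scale all parameters for productions with *A* on the left-hand side by *β*, and every production with *A* on the right-hand side by *β*^{−1} (or *β*^{−2} if *A* occurs twice on the right-hand side), and the score of every tree will remain the same. There are two natural ways of resolving this arbitrariness: one is to stipulate that for all nonterminals *I*(*A*) = 1, which gives us the familiar PCFG. The parameters of a tight PCFG satisfy

$$\theta(A \to \alpha) = \frac{E(A \to \alpha)}{E(A)}. \tag{2}$$

The other is to stipulate that *O*(*A*) = 1 for all nonterminals. We now define this alternative parameterization, which we call a *bottom–up* WCFG, in contrast to the top–down generative process associated with a PCFG.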

We say that a WCFG is in bottom–up form if *I*(*S*) = 1, and for all nonterminals *A*, *O*(*A*) = 1.

Note that in this form, we condition the parameters on the right-hand side of the production not on the left-hand side as is done with a PCFG.
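The following short derivation spells out why the bottom–up parameters are conditioned on the right-hand side. It is a sketch that assumes the standard inside–outside identities for expected production counts in a globally normalized WCFG, namely that the expectation of a production is its weight multiplied by the outside value of its left-hand side and the inside values of the nonterminals on its right-hand side.

```latex
\begin{align*}
E(A \to a)  &= O(A)\,\theta(A \to a) = \theta(A \to a), \\
E(B)        &= I(B)\,O(B) = I(B), \\
E(A \to BC) &= O(A)\,\theta(A \to BC)\,I(B)\,I(C)
             = \theta(A \to BC)\,E(B)\,E(C),
\end{align*}
% using O(A) = O(B) = O(C) = 1 in bottom-up form, and therefore
\[
\theta(A \to a) = E(A \to a),
\qquad
\theta(A \to BC) = \frac{E(A \to BC)}{E(B)\,E(C)}.
\]
```

By contrast, a tight PCFG divides the expectation of a production by the expectation of its left-hand side.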

There is a natural bijection between the class of tight PCFGs and bottom–up WCFGs; we can easily convert from one form to the other. We can efficiently compute the inside and outside values of a convergent WCFG using standard techniques (Hutchins, 1972; Nederhof and Satta, 2008; Etessami et al., 2012); these involve solving a system of quadratic equations (since the grammar is in Chomsky normal form) in the case of the inside values, which can be done using Newton's method or a fixed point iteration, and a linear system in the case of the outside values. The expectations of each production can then be computed using Equation 1 and then converted into a PCFG or bottom–up WCFG as desired using Equations 2 and 3, respectively.
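The conversion from a tight PCFG to bottom–up form can be sketched in a few lines. This is an illustrative sketch rather than the procedure used in the paper: it assumes that the input PCFG is tight and trim, so that all inside values are 1 and all outside values are positive, and it solves the linear system for the outside values by plain fixed point iteration. The rule encoding and all names are hypothetical.

```python
def outside_values(binary, nonterminals, start, iterations=200):
    """Outside values of a tight PCFG (all inside values equal 1), by fixed point iteration."""
    outside = {A: 0.0 for A in nonterminals}
    outside[start] = 1.0
    for _ in range(iterations):
        new = {A: 0.0 for A in nonterminals}
        new[start] = 1.0                      # the start symbol never occurs on a right-hand side
        for (A, B, C), p in binary.items():
            new[B] += outside[A] * p          # B as left child (inside value of C is 1)
            new[C] += outside[A] * p          # C as right child (inside value of B is 1)
        outside = new
    return outside


def to_bottom_up(binary, lexical, nonterminals, start):
    """Rescale PCFG parameters so that every outside value becomes 1."""
    O = outside_values(binary, nonterminals, start)
    bu_lexical = {(A, a): O[A] * p for (A, a), p in lexical.items()}
    bu_binary = {(A, B, C): O[A] * p / (O[B] * O[C])
                 for (A, B, C), p in binary.items()}
    return bu_binary, bu_lexical


# Hypothetical tight PCFG: S -> A B | s, A -> a | d, B -> b.
pcfg_binary = {("S", "A", "B"): 0.5}
pcfg_lexical = {("S", "s"): 0.5, ("A", "a"): 0.6, ("A", "d"): 0.4, ("B", "b"): 1.0}
bu_binary, bu_lexical = to_bottom_up(pcfg_binary, pcfg_lexical, ["S", "A", "B"], "S")
print(bu_binary, bu_lexical)
```

Note that a bottom–up binary parameter can exceed 1, as it does in this example, because it is a ratio of expectations rather than a conditional probability.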

## 3 Identifiability

We assume that we have a sequence of strings generated independently and identically distributed (i.i.d.) from some distribution generated by an unknown PCFG or WCFG, which we call the target grammar.

We are interested in the problem of producing a PCFG from this input data that is close to the target PCFG; namely, the underlying CFG is isomorphic to the underlying CFG of the target grammar and additionally the parameters are within *ε* of the corresponding parameters of the target grammar: we call this being *ε*-close. Two CFGs are isomorphic if they are identical apart from the labels of the nonterminals; the isomorphism is just a bijection between the nonterminals and productions in the natural way.

Two WCFGs, *G*;*θ* and *G′*;*θ′*, are *ε*-close if there is a CFG-isomorphism *ϕ* from *G* to *G′* such that for all *A* → *α* in the grammars,

$$|\theta(A \to \alpha) - \theta'(\phi(A \to \alpha))| < \epsilon.$$

More precisely, we say that a learning algorithm *A* is a consistent estimator for a class of globally normalized WCFGs, $\mathcal{G}$, if for every WCFG *G*_{*};*θ*_{*} in the class, for every *ε*,*δ* > 0, there is an *N* such that if the algorithm receives a sample of *m* strings, sampled i.i.d., where *m* ≥ *N*, then it outputs a WCFG $\hat{G};\hat{\theta}$ such that with probability at least 1 − *δ* we have that $\hat{G};\hat{\theta}$ is *ε*-close to *G*_{*};*θ*_{*}.

### 3.1 Structural Conditions on Grammars

We now define three structural conditions on PCFGs that will be sufficient to guarantee identifiability of the class from strings.

*Condition* 3.1.

A grammar *G* is *anchored* if for every nonterminal *A*, there exists a terminal *a* such that *A* → *a* ∈ *P* and, if *B* → *a* ∈ *P* then *B* = *A*. In other words *a* occurs on the right-hand side of exactly one production.

We will call such a terminal a characterizing terminal of *A*, and if *a* characterizes *A* we will sometimes write [[*a*]] for *A*.
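The anchoring condition is easy to check mechanically from the lexical productions alone. The sketch below assumes that lexical productions are given as (nonterminal, terminal) pairs; the function names and the example rules are hypothetical.

```python
from collections import defaultdict


def characterizing_terminals(lexical_rules):
    """Map each nonterminal to the terminals that occur in its lexical productions only."""
    parents = defaultdict(set)
    for A, a in lexical_rules:
        parents[a].add(A)
    anchors = defaultdict(set)
    for a, owners in parents.items():
        if len(owners) == 1:              # a is on the right-hand side of exactly one production
            (A,) = owners
            anchors[A].add(a)
    return dict(anchors)


def is_anchored(nonterminals, lexical_rules):
    """True if every nonterminal has at least one characterizing terminal."""
    anchors = characterizing_terminals(lexical_rules)
    return all(A in anchors for A in nonterminals)


toy_rules = [("S", "s"), ("A", "a"), ("A", "d"), ("B", "b"), ("B", "d")]
print(characterizing_terminals(toy_rules))     # 'd' is shared by A and B, so it anchors neither
print(is_anchored(["S", "A", "B"], toy_rules))
```

A learner of course does not see the productions; this check only illustrates the condition on a known grammar.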

This condition is very close to a number of conditions that have been proposed in the literature both for topic modeling and for grammatical inference: We use here the terminology of Stratos et al. (2016), but similar ideas occur in, for example, Adriaans’s (1999) approach to learning CFGs and Denis et al.’s (2004) approach to learning regular languages. This is also very closely related to what is called the 1-Finite Kernel Property in distributional learning of CFGs (Clark and Yoshinaka, 2016).

The key idea behind the learning algorithm is this: If every nonterminal has a characterizing terminal then we can infer the probabilities of the productions of the grammar from distributional properties of the strings of corresponding terminals. Thus if *A*, *B*, and *C* are nonterminals characterized by *a*, *b*, and *c*, respectively, then we can infer something about the parameter of the production *A* → *BC* by looking at the distributional properties of *a* and *bc*. And if *A* is a nonterminal characterized by *a* and *b* is any terminal, then we can infer something about the parameter of the production *A* → *b* by looking at the distributional properties of *a* and *b*.

### 3.2 Divergences

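The Rényi *α*-divergence (Rényi, 1961) between two discrete distributions *P* and *Q* is defined for $\alpha = \infty$ as

$$R_{\infty}(P \,\|\, Q) = \log \sup_{x \,:\, P(x) > 0} \frac{P(x)}{Q(x)}.$$

For strings *u*, *v* we will be concerned with *ρ*(*u* → *v*), defined as $R_{\infty}(D(u) \,\|\, D(v))$, an asymmetric measure of the distributional dissimilarity of *u* and *v*, which takes the value 0 only when they are identical. Note that, because *u*^{⊳} is the support of $D(u)$,

$$\rho(u \to v) = \log \sup_{l,r \in u^{\triangleright}} \frac{E(v)\,\mathbb{P}(lur)}{E(u)\,\mathbb{P}(lvr)}.$$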
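A small sketch of the infinite-order Rényi divergence between two discrete distributions may help fix the direction of the asymmetry. It assumes the usual definition as the logarithm of the largest ratio between the two probability mass functions over the support of the first argument; the dictionaries and the function name are hypothetical.

```python
import math


def renyi_inf(P, Q):
    """log sup_x P(x) / Q(x) over the support of P; infinite if Q misses part of that support."""
    worst = 0.0
    for x, p in P.items():
        if p <= 0:
            continue
        q = Q.get(x, 0.0)
        if q <= 0:
            return math.inf
        worst = max(worst, p / q)
    return math.log(worst)                # P is assumed to have nonempty support


narrow = {"x": 0.5, "y": 0.5}
broad = {"x": 0.25, "y": 0.25, "z": 0.5}
print(renyi_inf(narrow, broad))   # finite: the support of `narrow` is inside that of `broad`
print(renyi_inf(broad, narrow))   # infinite: `narrow` gives no mass to "z"
```

Applied to context distributions, the divergence from *u* to *v* is finite only if every context of *u* is also a context of *v*.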

We can now state a foundational result, which relates the parameters of a production to these divergences. We will start by proving an inequality, that we will later strengthen to an equality under additional conditions.

*Suppose G;θ is a bottom–up WCFG, and G is anchored. Let D be the distribution it defines, and P the set of productions. Suppose that a, b, c are characterizing terminals for nonterminals A, B, C respectively. Then for any terminal d, if A → d ∈ P,*

$$\theta(A \to d) \leq E(d)\,e^{-\rho(a \to d)},$$

*and if A → BC ∈ P,*

$$\theta(A \to BC) \leq \frac{E(bc)}{E(b)\,E(c)}\,e^{-\rho(a \to bc)}.$$

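*Proof.*

Suppose *A* is a nonterminal in *G* that is characterized by *a*. Then, for every context *l*,*r*, since the only way that we can derive an *a* is via *A*, ℙ(*lar*) = *s*(*Ω*(*S*,*lAr*))*θ*(*A* → *a*). Summing both sides with respect to *l*,*r* we obtain $E(a) = O(A)\,\theta(A \to a)$. Since *O*(*A*) = 1 in a bottom–up WCFG we have that $\theta(A \to a) = E(a)$, and hence $s(\Omega(S,lAr)) = \mathbb{P}(lar)/E(a)$.

Now suppose we have a production *A* → *d* in the grammar, where *a* characterizes *A*. Consider some *l*,*r* ∈ *a*^{⊳}. Since *a* is an anchor of *A*, we know that *s*(*Ω*(*S*,*lAr*)) > 0, and therefore ℙ(*ldr*) > 0. Clearly

$$\mathbb{P}(ldr) \geq s(\Omega(S,lAr))\,\theta(A \to d) = \frac{\mathbb{P}(lar)}{E(a)}\,\theta(A \to d), \tag{8}$$

so for all *l*,*r* ∈ *a*^{⊳} we have

$$\theta(A \to d) \leq \frac{E(a)\,\mathbb{P}(ldr)}{\mathbb{P}(lar)}.$$

Taking the infimum of the right-hand side over *a*^{⊳} gives $\theta(A \to d) \leq E(d)\,e^{-\rho(a \to d)}$.

Now suppose we have *A*, *B*, *C* nonterminals characterized by *a*, *b*, *c*, respectively, and a production *A* → *BC* with parameter *θ*(*A* → *BC*). Let *l*,*r* be some context in *a*^{⊳}, then ℙ(*lar*) > 0 and ℙ(*lbcr*) > 0. Clearly

$$\mathbb{P}(lbcr) \geq s(\Omega(S,lAr))\,\theta(A \to BC)\,\theta(B \to b)\,\theta(C \to c) = \frac{\mathbb{P}(lar)}{E(a)}\,\theta(A \to BC)\,E(b)\,E(c), \tag{9}$$

so *θ*(*A* → *BC*) is smaller than or equal to $E(a)\,\mathbb{P}(lbcr) / (\mathbb{P}(lar)\,E(b)\,E(c))$. Since this holds for all *l*,*r* ∈ *a*^{⊳} we have

$$\theta(A \to BC) \leq \frac{E(bc)}{E(b)\,E(c)}\,e^{-\rho(a \to bc)}.$$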

This shows us that we have an *upper bound* on the parameters from a distributional property. But looking at Equations 8 and 9, we can consider the circumstances under which this inequality will be tight, in which case we can recover the parameters directly.

In particular, if the grammar is unambiguous (i.e., if every string has at most one derivation tree) then if the left-hand side of the inequality is nonzero we can immediately see that the inequality will become an equality. As it happens, there will also be equality under some much weaker conditions that we now define.

### 3.3 Ambiguity

We now define two closely related conditions that are both related to the degree of ambiguity of the grammar.

*Condition* 3.2.

Suppose *G* contains a production *A* → *α*. We say that *G* has an unambiguous context for that production if there is a string *w* and strings *l*, *u*, *r* such that *w* = *lur*, *Ω*(*G*,*S*,*w*) is nonempty, *Ω*(*G*,*S*,*w*) = *Ω*(*G*,*S*,*lAr*) ⊗ *Ω*(*G*,*A*,*u*), and all elements of *Ω*(*G*,*A*,*u*) have an occurrence of *A* → *α* at the root. A CFG is *locally unambiguous* if it has an unambiguous context for every production in its set of productions.

Informally this condition says that for every production there is some string which, although it can be ambiguous, always uses that production at the same point. Note that if *G* is locally unambiguous and is anchored, then for every binary production, [[*a*]] → [[*b*]][[*c*]] there will be a context *l*,*r* such that *lbcr* satisfies the condition; and for every production [[*a*]] → *b* there will be a context *l*,*r* such that *lbr* satisfies the condition.

If a grammar is unambiguous, then every context is an unambiguous context for every derivation that uses it, but this condition is much weaker than that; indeed, we don’t need there to be any unambiguous strings, since *Ω*(*G*,*S*,*lAr*) can have more than one element.

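*Lemma* 3.1.

*If G;θ is a bottom–up WCFG and G is anchored and is locally unambiguous, then if* [[*a*]] → *b* ∈ *P*

$$\theta([\![a]\!] \to b) = E(b)\,e^{-\rho(a \to b)}$$

*and if* [[*a*]] → [[*b*]][[*c*]] ∈ *P*

$$\theta([\![a]\!] \to [\![b]\!][\![c]\!]) = \frac{E(bc)}{E(b)\,E(c)}\,e^{-\rho(a \to bc)}.$$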

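*Proof.*

Given a production [[*a*]] → [[*b*]][[*c*]] in the grammar, we know there is a context *l*,*r* and a string *w* such that *Ω*(*S*,*lwr*) = *Ω*(*S*,*l*[[*a*]]*r*) ⊗ *Ω*([[*a*]],*w*), where all the elements of *Ω*([[*a*]],*w*) have an occurrence of [[*a*]] → [[*b*]][[*c*]] at the root. We know that *Ω*([[*a*]],*bc*) consists of a single tree using [[*a*]] → [[*b*]][[*c*]], and that *Ω*(*S*,*l*[[*b*]][[*c*]]*r*) = *Ω*(*S*,*l*[[*a*]]*r*) ⊗ *Ω*([[*a*]],[[*b*]][[*c*]]); therefore *Ω*(*S*,*lbcr*) = *Ω*(*S*,*l*[[*a*]]*r*) ⊗ *Ω*([[*a*]],*bc*). Now we apply the same manipulations to get that for this *l*,*r*

$$\mathbb{P}(lbcr) = \frac{\mathbb{P}(lar)}{E(a)}\,\theta([\![a]\!] \to [\![b]\!][\![c]\!])\,E(b)\,E(c),$$

so the upper bound is attained at this context. The lexical case is analogous.

Taking logarithms, the binary parameter decomposes into two terms:

$$\log \theta([\![a]\!] \to [\![b]\!][\![c]\!]) = \log \frac{E(bc)}{E(b)\,E(c)} - \rho(a \to bc).$$

The first term is the pointwise mutual information between *b* and *c*.^{3} The second term penalizes cases where the right-hand side is distributionally dissimilar from the left-hand side. For the lexical productions, similarly we have two terms:

$$\log \theta([\![a]\!] \to b) = \log E(b) - \rho(a \to b).$$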
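The relationship between parameters and distributional quantities can be checked numerically on a tiny example. The sketch below assumes the reading in which a lexical parameter equals the substring expectation of the terminal discounted by the exponential of the divergence from the anchor, and a binary parameter equals the ratio of the bigram expectation to the product of the unigram expectations, discounted in the same way. The toy distribution is the string distribution of a hypothetical grammar S → A B | s, A → a | d, B → b; all names are illustrative.

```python
import math


def occurrences(u, w):
    """Yield the context (l, r) of every occurrence of u in w."""
    start = w.find(u)
    while start != -1:
        yield w[:start], w[start + len(u):]
        start = w.find(u, start + 1)


def expectation(u, prob):
    """E(u): expected number of occurrences of u as a substring."""
    return sum(p for w, p in prob.items() for _ in occurrences(u, w))


def rho(u, v, prob):
    """log sup over contexts (l, r) of u of D(u)[l, r] / D(v)[l, r]; u must occur."""
    e_u, e_v = expectation(u, prob), expectation(v, prob)
    worst = 0.0
    for w, p in prob.items():
        for l, r in occurrences(u, w):
            q = prob.get(l + v + r, 0.0)
            if q == 0.0:
                return math.inf           # a context of u in which v never occurs
            worst = max(worst, (p / e_u) / (q / e_v))
    return math.log(worst)


def lexical_parameter(a, b, prob):
    """Candidate bottom-up weight of [[a]] -> b."""
    return expectation(b, prob) * math.exp(-rho(a, b, prob))


def binary_parameter(a, b, c, prob):
    """Candidate bottom-up weight of [[a]] -> [[b]][[c]]."""
    ratio = expectation(b + c, prob) / (expectation(b, prob) * expectation(c, prob))
    return ratio * math.exp(-rho(a, b + c, prob))


toy_prob = {"s": 0.5, "ab": 0.3, "db": 0.2}
print(binary_parameter("s", "a", "b", toy_prob))   # weight of S -> A B
print(lexical_parameter("a", "d", toy_prob))       # weight of A -> d
print(lexical_parameter("a", "b", toy_prob))       # 0.0: A -> b is not a production
```

Because `math.exp(-math.inf)` is 0.0, productions whose divergence term is infinite receive weight zero without any special casing.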

### 3.4 Upward Monotonicity

We need one more condition, however. There may be many different grammars that define the same distribution over strings that satisfy these two conditions because we may have multiple nonterminals that could be merged together.

*Condition* 3.3.

A grammar *G* = 〈*Σ*,*V*,*S*,*P*〉 is *strictly upward monotonic* if for all $Q \supset P$, $\mathcal{L}(\langle \Sigma,V,S,Q \rangle) \supset \mathcal{L}(G)$. (Here *Q* is restricted to CNF productions, that is, to subsets of *V* × (*Σ* ∪ *V*^{2}).)

Informally, if we add a new production to the grammar, then the language defined increases. Note that of course all grammars have the property that if $Q \supseteq P$, then $\mathcal{L}(\langle \Sigma,V,S,Q \rangle) \supseteq \mathcal{L}(G)$. Here we require this monotonicity to be strict.

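We define the set of contexts of a nonterminal *A* to be

$$C(G,A) = \{\, l,r \mid \Omega(G,S,lAr) \neq \emptyset \,\}.$$

*Lemma* 3.2.

Suppose *G* is anchored and upward monotonic. If *A*,*B* are nonterminals and $C(G,A)=C(G,B)$ then *A* = *B*.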

*Proof.*

Let *a* be an anchor for *A*; we can clearly add the production *B* → *a* without increasing the language generated. Therefore, *B* → *a* is in the grammar, and so *A* = *B* as *a* is an anchor.

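*Lemma* 3.3.

Suppose *G* is anchored and upward monotonic. Then [[*a*]] → *b* ∈ *P* if and only if *a*^{⊳} ⊆ *b*^{⊳}, and [[*a*]] → [[*b*]][[*c*]] ∈ *P* if and only if *a*^{⊳} ⊆ (*bc*)^{⊳}.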

Using the same condition we can show that productions not in the grammar will have parameters zero, because of an infinite divergence term.

*Lemma* 3.4.

Suppose *G* is anchored and upward monotonic. Then

- If [[*a*]] → *b* is not in the grammar, then $\rho(a \to b) = \infty$.
- If [[*a*]] → [[*b*]][[*c*]] is not in the grammar, then $\rho(a \to bc) = \infty$.

*Proof.*

If *A* → *b* is not in the grammar, then by Lemma 3.3, there is some *l*,*r* such that *lar* is in the language but *lbr* is not, and so $\rho(a \to b) = \infty$. The argument for binary rules is similar.

### 3.5 Selecting Nonterminals

The preceding discussion shows that if we have a set of terminals that are anchors for the true nonterminals in the original grammar, then the productions and the (bottom–up) parameters of the associated productions will be fixed correctly, but it says nothing about parameters that might be associated to productions that use other nonterminals. However, it is easy to show that under these assumptions there can be no other nonterminals.

*Lemma* 3.5.

Suppose *G*_{1} and *G*_{2} are anchored and strictly upward monotonic, and are weakly equivalent. Then they are isomorphic, and there is a unique isomorphism between them.

*Proof.*

Let *A* be a nonterminal in *G*_{1}, and let *a* be an anchor for *A*. Let *B* → *a* be some production in *G*_{2}, and let *b* be an anchor for *B*. Then *a*^{⊳} ⊇ *b*^{⊳}. By a similar argument there must be a nonterminal *C* in *G*_{1} and a terminal *c* that anchors *C* such that *b*^{⊳} ⊇ *c*^{⊳}. But because *a*^{⊳} ⊇ *c*^{⊳}, we must have a production *C* → *a* in *G*_{1}. Since *a* is an anchor, *C* = *A*, and therefore *a*^{⊳} = *b*^{⊳} = *c*^{⊳}. Therefore $C(G_1,A) = C(G_2,B)$.

Let *ϕ* then be the CFG-morphism from *G*_{1} to *G*_{2}, defined by *ϕ*(*A*) = *A′* iff $C(G_1,A) = C(G_2,A')$. This is well defined by Lemma 3.2, and is clearly a bijection. Given this bijection, by Lemma 3.3, they will have the same set of productions, and thus be isomorphic.

### 3.6 Identifiability

We can now define the classes of grammars that we are interested in. Let $\mathcal{G}_A$ be the set of all trim CFGs that are in Chomsky normal form, are anchored (Condition 3.1), are locally unambiguous (Condition 3.2), and are strictly upward monotonic (Condition 3.3).

Let $\mathcal{P}_A$ be the set of all tight PCFGs with finite expectations, with CFGs in $\mathcal{G}_A$, and let $\mathcal{W}_A$ be the set of all WCFGs in bottom–up form with CFGs in $\mathcal{G}_A$.

Suppose *G*_{1};*θ*_{1} and *G*_{2};*θ*_{2} are in $\mathcal{W}_A$ and are stochastically equivalent, that is, for all *w* ∈ *Σ*^{+}, ℙ(*w*;*G*_{1}) = ℙ(*w*;*G*_{2}). Then *G*_{1} is isomorphic to *G*_{2}, and if *ϕ* is the unique such isomorphism, then for all *A* → *α*, *θ*_{1}(*A* → *α*) = *θ*_{2}(*ϕ*(*A* → *α*)).

*Proof.*

Because they are stochastically equivalent, the support of their distributions is equal, and thus *G*_{1} and *G*_{2} are weakly equivalent. Therefore by Lemma 3.5 there is a unique isomorphism between them, *ϕ*. By Lemma 3.1 the parameters of corresponding productions must also be equal.

Because there is a bijection between $\mathcal{W}_A$ and $\mathcal{P}_A$, $\mathcal{P}_A$ is also identifiable from strings.

## 4 Naive Estimators

We now analyze the properties of a particular estimator that we call the *naive plugin estimator*, which we will show can learn all grammars in $\mathcal{W}_A$ and $\mathcal{P}_A$. This approach uses a trivial manner of estimating the *ρ* values, and from this we derive a consistent estimator for the class. This approach has poor sample complexity but is algorithmically trivial.

We will need to estimate the *ρ* divergences from a sample of strings drawn i.i.d. from the distribution defined by the grammar. Given a sample of strings, the most naive approach is to estimate ℙ(*w*) and $E(a)$ by the empirical distribution, to estimate the ratio as the ratio of these estimates, and to take the supremum over the frequent contexts of *a* rather than over the infinite set *a*^{⊳}.

We are interested in convergence in probability, which we will write as $\hat{X}_N \xrightarrow{N \to \infty} X$; in other words, for any *ε*,*δ* > 0, there is an *n* such that for all *N* > *n*, with probability greater than 1 − *δ* we have $|\hat{X}_N - X| < \epsilon$.

Let *w*_{1},…,*w*_{N} be the sample of *N* strings drawn i.i.d. from a target PCFG, let *n*(*w*) be the number of times that *w* occurs in the sample (as a whole string), and let *m*(*w*) be the number of times that *w* occurs as a substring; clearly, $\sum_{l,r} n(lwr) = m(w)$. Define $\hat{\mathbb{P}}(w) = n(w)/N$ to be the empirical probability of *w* and $\hat{E}(w) = m(w)/N$ to be the empirical expectation of *w*. Clearly, for any string *w* we have $\hat{\mathbb{P}}(w) \xrightarrow{N \to \infty} \mathbb{P}(w)$ and $\hat{E}(w) \xrightarrow{N \to \infty} E(w)$.

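The naive plugin estimator is given by the following. For all *a*, *b*, *c* ∈ *Σ* we define

$$\hat{\rho}_N(a \to bc) = \log \max_{l,r \,:\, n(lar) > \sqrt{N}} \frac{n(lar)\,m(bc)}{m(a)\,n(lbcr)}$$

and for all *a*, *b* ∈ *Σ* we define

$$\hat{\rho}_N(a \to b) = \log \max_{l,r \,:\, n(lar) > \sqrt{N}} \frac{n(lar)\,m(b)}{m(a)\,n(lbr)}.$$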

Note that $\hat{\rho}_N(a \to bc) = \infty$ if there is some context *l*,*r* such that $n(lar) > \sqrt{N}$ and *n*(*lbcr*) = 0.
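The plugin estimate can be sketched directly from the counts *n* and *m*. This sketch makes two assumptions that should be read as such: a context of *a* counts as frequent when the whole string *lar* occurs more than √*N* times in the sample, and when no context is frequent the function returns `None` rather than a number. The function name and the sample are hypothetical.

```python
import math
from collections import Counter


def plugin_rho(a, v, sample):
    """Naive plug-in estimate of rho(a -> v) from a list of sampled strings."""
    N = len(sample)
    n = Counter(sample)                                   # whole-string counts n(w)

    def m(u):                                             # substring counts m(u)
        return sum(count * sum(1 for i in range(len(w) - len(u) + 1)
                               if w.startswith(u, i))
                   for w, count in n.items())

    m_a, m_v = m(a), m(v)
    worst = 0.0
    for w, count in n.items():
        if count <= math.sqrt(N):                         # assumed frequency threshold
            continue
        for i in range(len(w) - len(a) + 1):
            if w.startswith(a, i):
                l, r = w[:i], w[i + len(a):]
                n_v = n.get(l + v + r, 0)
                if n_v == 0:
                    return math.inf                       # frequent context of a never seen with v
                worst = max(worst, (count / m_a) / (n_v / m_v))
    return math.log(worst) if worst > 0 else None         # None: no frequent context of a


toy_sample = ["ab"] * 30 + ["db"] * 20 + ["s"] * 50       # hypothetical sample, N = 100
print(plugin_rho("a", "d", toy_sample))                   # 0.0: same context distribution
print(plugin_rho("a", "b", toy_sample))                   # inf: "bb" never occurs
```

Restricting the maximum to frequent contexts is what keeps a single rare string from forcing an infinite estimate.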

We can show the convergence of the estimators when one side is anchored, starting with the case when the divergence is infinite.

*Lemma* 4.1.

For some $G;\theta \in \mathcal{W}_A$ suppose that *a* is an anchor for a nonterminal *A* and suppose that for some *b* ∈ *Σ*, $\rho(a \to b) = \infty$. Then for every *δ* > 0, there is an *N* such that with probability at least 1 − *δ*, $\hat{\rho}_N(a \to b) = \infty$. Similarly, if there is a *c* such that $\rho(c \to a) = \infty$, then there is an *N* such that with probability at least 1 − *δ*, $\hat{\rho}_N(c \to a) = \infty$.

*Lemma* 4.2.

For some $G;\theta \in \mathcal{W}_A$ suppose that *a* is an anchor for a nonterminal *A*, *b* for *B*, and *c* for *C*. If $\rho(a \to bc) = \infty$, then for every *δ* > 0, there is an *N* such that with probability at least 1 − *δ*, $\hat{\rho}_N(a \to bc) = \infty$.

*Proof.*

If *A* → *BC* were in *P* then *ρ*(*a* → *bc*) would be finite. So *A* → *BC* is not in *P*. By Condition 3.3, there must be some context *l*_{*},*r*_{*} in *a*^{⊳} but not in (*bc*)^{⊳}, and so for sufficiently large *N*, *l*_{*}*ar*_{*} will occur more than $\sqrt{N}$ times.

*Lemma* 4.3.

For some $G;\theta \in \mathcal{W}_A$ suppose that *a* is an anchor for a nonterminal *A*. Suppose *ρ*(*a* → *b*) is finite; then $\hat{\rho}_N(a \to b) \xrightarrow{N \to \infty} \rho(a \to b)$.

*Lemma* 4.4.

For some $G;\theta \in \mathcal{W}_A$ suppose that *a* is an anchor for a nonterminal *A*, *b* for *B*, and *c* for *C*; if *ρ*(*a* → *bc*) is finite, then $\hat{\rho}_N(a \to bc) \xrightarrow{N \to \infty} \rho(a \to bc)$.

When *ρ* is finite the convergence is straightforward since $|\{l,r : n(lar) > \sqrt{N}\}| \leq \sqrt{N}$ and so we can use Chernoff bounds in a standard way.

### 4.1 Definition of the Algorithm

We can now define the algorithm, taking as input a sequence of strings ⟨*w*_{1},…,*w*_{N}⟩ and using the trivial plugin estimators $\hat{\rho}_N$. The pseudocode is presented in Algorithm A. The algorithm starts by identifying the set of terminals that are anchors, which is illustrated in Figure 1. If a terminal *d* is not an anchor then there will be some terminal *a* which is an anchor such that $\rho(a \to d) < \infty$ and $\rho(d \to a) = \infty$; in other words, such that *a*^{⊳} ⊂ *d*^{⊳}. If the $\hat{\rho}_N$ estimates are infinite iff *ρ* is infinite, then we can see that *Γ* will be the set of possible anchors; that is, those terminals that occur on the right-hand side of exactly one production. Clearly, if *a* and *b* are anchors for the same nonterminal then *ρ*(*a* → *b*) = *ρ*(*b* → *a*) = 0, and if they are anchors for different nonterminals then $\rho(a \to b) = \rho(b \to a) = \infty$, so we can just group them into equivalence classes and pick the most frequent one from each class as the anchor. The start symbol will be anchored by the symbol that occurs most frequently as a whole sentence.
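The anchor selection step described above can be sketched as follows. This is not the pseudocode of Algorithm A, which is not reproduced here, but one plausible reading of the prose: a terminal is kept as a candidate unless some other terminal has a strictly smaller set of contexts, and candidates are then grouped whenever the divergence is finite in both directions. The function `select_anchors`, its arguments, and the toy divergence are all hypothetical.

```python
import math


def select_anchors(terminals, rho, frequency):
    """Pick one anchor per equivalence class of candidate terminals.

    rho(u, v) returns an estimate of rho(u -> v); frequency(u) returns how often u occurs.
    """
    # Candidates: no terminal a with rho(a -> d) finite and rho(d -> a) infinite.
    gamma = [d for d in terminals
             if not any(rho(a, d) < math.inf and rho(d, a) == math.inf
                        for a in terminals if a != d)]
    classes = []
    for a in gamma:
        for cls in classes:
            if rho(a, cls[0]) < math.inf and rho(cls[0], a) < math.inf:
                cls.append(a)             # same contexts as the class representative
                break
        else:
            classes.append([a])
    return [max(cls, key=frequency) for cls in classes]


# Toy divergence: 0 if the contexts of u are contained in those of v, infinite otherwise.
toy_contexts = {"a": {1}, "d": {1}, "b": {2}, "e": {1, 2}}
toy_frequency = {"a": 0.3, "d": 0.2, "b": 0.5, "e": 0.1}


def toy_rho(u, v):
    return 0.0 if toy_contexts[u] <= toy_contexts[v] else math.inf


print(select_anchors(["a", "d", "b", "e"], toy_rho, toy_frequency.get))
```

In the toy example the terminal `e` occurs in the contexts of two different classes, so it is rejected as a candidate, and `a` is preferred over `d` because it is more frequent.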

We can now prove that this algorithm is a consistent estimator for the class of WCFGs that we consider, $\mathcal{W}_A$.

For every grammar $G_*;\theta_* \in \mathcal{W}_A$, for every *ε*,*δ* > 0, there is an *n* such that when Algorithm A is run on a sample of *N* strings, *N* > *n*, generated i.i.d. from *G*_{*};*θ*_{*} it produces a WCFG *G*;*θ* such that with probability at least 1 − *δ*

- *G*_{*} is CFG-isomorphic to *G*, and
- if *ϕ* is an isomorphism from *G*_{*} to *G*, then |*θ*_{*}(*A* → *α*) − *θ*(*ϕ*(*A* → *α*))| < *ε* for all productions *A* → *α*.

*Proof.*

(Sketch) Assume first that *N* is sufficiently large that $\hat{\rho}_N(a \to b)$ is close to *ρ*(*a* → *b*) for all *a*,*b* such that either *a* or *b* is an anchor; we can then show that *Γ* in Line 2 is just the set of possible anchors, and that *a* ∼ *b* will be true iff *a*,*b* are anchors for the same nonterminal. We define a bijection between the nonterminals of the hypothesis and the target. Line 5 picks the start symbol to be the unique anchor that can occur in a length 1 string. The grammar will have the right productions via Lemma 3.3, and the parameters will converge via Lemmas 4.3 and 4.4.

The output of this is a WCFG that may be divergent. We therefore define Algorithm B, which uses the inside–outside (IO) algorithm (Eisner, 2016) to normalize the WCFG produced by Algorithm A: we take the output WCFG and run one iteration of the IO algorithm on the same data to estimate the expectations of all the rules, which are then normalized to produce a PCFG. Proving the convergence of this estimator requires a little bit of care. Chi (1999) shows that the result of this procedure will always be a tight PCFG; the finite expectation of |*τ*| allows us to apply a variant of the dominated convergence theorem combined with the law of large numbers to show that this is a consistent estimator for the class of grammars $\mathcal{P}_A$.

## 5 Experiments

The contributions of this paper are primarily theoretical but the reader may have legitimate concerns about the practicality of this approach given the naive estimator, the assumptions that are required, and the asymptotic nature of the correctness result. Here we present some computational simulations that address these issues, using synthetic PCFGs that mimic to a certain extent the observable properties of child-directed speech (Pearl and Sprouse, 2012). We generate CFGs that have 10 nonterminals, 1,000 terminal symbols, and all possible rules in CNF; none of these grammars are in $\mathcal{G}_A$. To obtain a PCFG, we sample the parameters for the binary productions and an extra parameter for the lexical rules from a symmetric Dirichlet distribution with parameter *α*, which we vary to control the degree of ambiguity of the grammar. We then train these parameters using the IO algorithm to get a distribution of lengths close to a zero-truncated Poisson with parameter 5. We then sample the conditional lexical parameters from a multivariate log-normal distribution with *σ* = 5.^{4}

To obtain a practical algorithm we follow Stratos et al. (2016). We consider only the local context (the immediately preceding and following word, including a distinguished sentence boundary marker) and use Ney-Essen clustering (Ney et al., 1994) with 20 clusters to get a low-dimensional feature space. We give the learning algorithm the true number of nonterminals as a hyperparameter (in contrast to Algorithm A, which learns the number of nonterminals) and run the NMF algorithm of Stratos et al. (2016) to find the anchors, considering only those that occur at least 1,000 times. We set the lexical parameters using the Frank-Wolfe algorithm, and the binary parameters using the Rényi divergence with *α* = 5. To alleviate data sparsity when estimating the distribution of the anchor bigrams for the binary rule parameters, we use all bigrams consisting of words that have probability at least 0.9 of being derived from the respective nonterminal. This produces a WCFG (A) which may be divergent. We then run one iteration of the IO algorithm^{5} to obtain a PCFG (B), and then a further 10 iterations to get another PCFG (C); this is guaranteed to increase the likelihood of the model. If the PCFG B is sufficiently close to the target then this will converge towards the global optimum, the ML estimate; if not it will only converge to a local optimum.

For efficiency reasons we run the IO algorithm only on sentences of length at most 10, and we evaluate on lengths up to 20. The performance continues to improve with further iterations.

### 5.1 Results

After fixing the hyperparameters, we generate 100 different PCFGs for each condition, and sample $10^6$ sentences from each. We evaluate the results according to how well they recover the true tree structures. We sample 1,000 trees from the target PCFG and evaluate the Viterbi parse of the yield of the tree using labeled exact match in Figure 2 and micro-averaged unlabeled precision/recall in Figure 3.^{6} In all cases we exclude all forced choices so it is possible to score zero. The performance of the original grammar is a measure of the ambiguity of the grammar.
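The unlabeled metrics can be sketched as below. This is an illustrative reading rather than the evaluation code used for the figures: trees are assumed to be nested tuples, `(label, word)` for a lexical rule and `(label, left, right)` for a binary rule, and "forced choices" are assumed to mean single-word spans and the whole-sentence span, so sentences of length one or two contribute nothing.

```python
def binary_spans(tree, start=0):
    """Return (end position, set of (start, end) spans of all binary nodes)."""
    if len(tree) == 2:  # lexical rule: (label, word)
        return start + 1, set()
    _, left, right = tree  # binary rule: (label, left, right)
    mid, left_spans = binary_spans(left, start)
    end, right_spans = binary_spans(right, mid)
    return end, left_spans | right_spans | {(start, end)}


def unlabeled_scores(gold_trees, predicted_trees):
    """Micro-averaged unlabeled precision and unlabeled exact match."""
    matched = total = exact = scored = 0
    for gold, pred in zip(gold_trees, predicted_trees):
        n, gold_spans = binary_spans(gold)
        _, pred_spans = binary_spans(pred)
        # The root span is present in every parse, so it is a forced choice.
        gold_spans.discard((0, n))
        pred_spans.discard((0, n))
        if not gold_spans:  # nothing left to get right or wrong
            continue
        scored += 1
        matched += len(gold_spans & pred_spans)
        total += len(pred_spans)
        exact += int(gold_spans == pred_spans)
    precision = matched / total if total else 0.0
    exact_match = exact / scored if scored else 0.0
    return precision, exact_match
```

Because both trees are binary over the same yield, they contain the same number of spans, which is why a single number serves as both precision and recall.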

To see the effect of varying the degree of ambiguity, Figure 4 plots unlabeled exact match against the supervised baseline for values of *α* ∈ {0.01, 0.1, 1.0}. For *α* = 1 both are close to the random baseline; apart from that extreme case we find the performance degrading smoothly as predicted by theory. The labeled exact match (not shown here) in contrast shows a more pronounced decrease.

These grammars are about an order of magnitude smaller than plausible natural language grammars for child-directed speech as derived from the treebank in Pearl and Sprouse (2012). This is largely because of resource limitations: whereas Algorithm A is very fast, the IO algorithm is computationally expensive, and running these experiments on hundreds of synthetic grammars/languages at a time would be prohibitively expensive. It is certainly computationally feasible to run these experiments on single grammars with up to 100 nonterminals and 20,000 terminals. In small-scale experiments the results appear comparable with those we report here. The major failure mode is when there are nonterminals *A* where $\sum_a E(A \to a)$ is very small. In those cases, though the grammar may be technically anchored, the anchors will be below the frequency threshold being considered.^{7}

## 6 Applicability to Natural Language Corpora

An important question is whether this approach is directly applicable to natural language corpora either of transcribed child-directed speech or of text; a number of the assumptions we make are clearly false. First, even looking at English, we can see that the anchoring assumption is too strong. For example, the expletive pronouns in English, *there* and *it*, are both ambiguous, since *there* is also an adverb and *it* is also a personal pronoun, and so if there is a nonterminal representing such pronouns, then it will not be anchored.

When we consider phrasal categories, the question of whether such nonterminals are anchored requires asking two questions: first, whether such nonterminals generate single words at all, and secondly whether among those words we can find anchors. The existence of pro-forms, such as pronouns in the case of noun phrases, guarantees this for at least some categories. Clearly, this is genre-dependent, because it is sensitive to sentence length. Here we look at the Adam corpus of child-directed speech in English as syntactically annotated in the Penn treebank style by Pearl and Sprouse (2012). Table 1 shows the results. We can see that nonclausal categories are mostly anchored at this crude level of analysis, but that clausal categories are not. This implies that simple sentences without embedded clauses can be learned using this approach, but that learning complex clausal structures will require this approach to be extended at least to anchors of length more than one.

| t | P(l = 1) | $w_{\max}$ | $P(t \mid w_{\max})$ |
|---|---|---|---|
| ADJP | 0.67 | careful | 0.85 |
| ADVP | 0.84 | already | 1.0 |
| FRAG | 0.3 | seal | 0.2 |
| INTJ | 0.87 | hmm | 1.0 |
| NP | 0.7 | he | 1.0 |
| PP | 0.078 | for | 0.13 |
| PRT | 0.99 | off | 0.72 |
| S | 0.017 | - | - |
| SBAR | 0.0046 | if | 0.0024 |
| SBARQ | 0.0 | - | - |
| SQ | 0.021 | - | - |
| VP | 0.11 | crying | 0.82 |
| WHADVP | 0.98 | when | 1.0 |
| WHNP | 0.8 | who | 0.95 |

Most fundamentally, simple PCFGs of the type that we consider here are very poor models of natural language syntax. To obtain reasonable results, such grammars need to be lexicalized, since otherwise the independence assumptions of the PCFG are violated by semantic relations, for example, between a verb and its subject. Thus the realizability assumption that the approach relies on is dramatically false.

## 7 Discussion and Conclusion

There are two ways of thinking about PCFGs: one is as a nontrivial CFG with parameters attached, where the support of the distribution is the language generated by the CFG, and the other is where the CFG is trivial, containing all possible productions, and where the support is the set of all strings; we can call these *sparse* and *dense* PCFGs, respectively. Hsu et al. (2013) show that in the dense case the class of PCFGs is not identifiable without additional constraints, even when one can exclude a set of grammars of measure zero.^{8} The class of sparse PCFGs we consider, $PA$, has measure zero in their framework, and thus there is no incompatibility between their result and Theorem 3.2. However, there is some incompatibility between the empirical results in Section 5 and the result of Hsu et al. (2013). With the protocol used in Section 5 we are indeed trying to learn a nonidentifiable class because the PCFGs are dense. However, the grammars are approximately anchored in the sense that for each nonterminal *A* there is a terminal *a* such that $E(A \to a)$ is very close to $E(a)$. In these cases, even though there are different parameter settings that give rise to the same distribution over strings, they will all be quite close to each other.
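The notion of approximate anchoring can be made concrete with a small check. The sketch below assumes that the expected counts $E(A \to a)$ are available as a nested dictionary and that $E(a) = \sum_A E(A \to a)$, since every occurrence of a terminal is produced by exactly one lexical rule; the function name is hypothetical.

```python
def best_anchors(expected_lexical):
    """For each nonterminal A, find the terminal a maximizing E(A -> a) / E(a).

    expected_lexical[A][a] holds the expected count E(A -> a); each
    nonterminal is assumed to have at least one lexical rule.
    """
    # E(a) is the sum over all nonterminals of E(A -> a).
    totals = {}
    for row in expected_lexical.values():
        for a, e in row.items():
            totals[a] = totals.get(a, 0.0) + e
    anchors = {}
    for A, row in expected_lexical.items():
        ratios = ((a, e / totals[a]) for a, e in row.items() if totals[a] > 0)
        anchors[A] = max(ratios, key=lambda pair: pair[1])
    return anchors
```

A ratio of exactly 1 means the terminal is a true anchor for that nonterminal; ratios close to 1 correspond to the approximately anchored case.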

There have been many different attempts to solve this problem over the decades since the learning problem was initially introduced by Horning (1969); a useful survey of older work on learning CFGs is contained in Lee (1996). One strand of research looks at using the IO algorithm to train some heuristically initialized grammar (Baker, 1979; Lari and Young, 1990; Pereira and Schabes, 1992; de Marcken, 1999). However, this approach is only guaranteed to converge to a local maximum of the likelihood, and does not work well in practice. A related problem that we do not discuss in this paper is learning when the labeled tree structures are observed—essentially that of estimating a PCFG from a treebank, a problem which is algorithmically trivial and statistically well behaved, as Cohen and Smith (2012) show. The approach we take is most closely related to the work by Stratos et al. (2016) and work on weakly learning CFGs from samples generated by PCFGs developed by Shibata and Yoshinaka (2016). However, there are very few approaches to learning PCFGs with any nontrivial theoretical guarantees.

The approach here is essentially an exemplar-based model: The syntactic categories are based on single strings of length 1. This can be naturally extended, mutatis mutandis, to sets of exemplars, and to exemplars with length greater than 1. The extension beyond CFGs to mildly context-sensitive grammars such as MCFGs (Seki et al., 1991) seems to present some problems that do not occur in the nonprobabilistic case (Clark and Yoshinaka, 2016); although the same bounds on the bottom–up parameters can be derived, identifying the set of anchors seems to be challenging.

The variant of Algorithm A discussed in Section 5 is also interesting because it only uses local information in the initial phase: Indeed, it only uses the bigram and trigram counts, and it is only in the use of the IO algorithm that a pass through the data using the full sentence is used; this is compatible with psycholinguistic evidence about infants’ abilities to track transitional probabilities (e.g., work following Saffran et al., 1996). Of course the original version in Section 4 uses complete sentences and not just the low-order counts.

Note that Equation 10 provides some theoretical justification for the long literature (Harris, 1955; McCauley and Christiansen, 2019) on using mutual information as a heuristic for unsupervised chunking. Although it is intuitively reasonable that chunks should correspond to subsequences that have high pointwise mutual information, it is gratifying to finally have some mathematical basis for these intuitions.
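For readers unfamiliar with the heuristic, the following sketch computes the standard pointwise mutual information of adjacent word pairs from relative frequencies. It illustrates the chunking heuristic from the literature and is not a rendering of Equation 10; the function name is hypothetical and no smoothing is applied.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Return {(x, y): log p(x, y) / (p(x) p(y))} for observed adjacent pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (x, y): math.log((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }
```

Pairs with high scores co-occur more often than their individual frequencies would suggest, which is the property that chunking heuristics exploit.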

## Acknowledgments

This work was partially carried out while the first author was a visiting researcher at The Alan Turing Institute. The second author was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and the DeLTA project (ANR-16-CE40-0007). We would like to thank the reviewers for helpful comments that have improved the paper.

## Notes

1. This is the expectation because if *u* occurs *n* times in a string *w*, there will be *n* distinct contexts *l*,*r* such that *lur* = *w*.

2. We follow the classical definition of Chomsky normal form in not allowing *S* to occur on the right-hand side of any rules. This simplifies various parts of the analysis, and makes the learning problem slightly harder, but it is not hard to remove this restriction if it is desired. Note that we do not allow an empty right-hand side of a production.

3. With an adjustment of $\log E(|w|)$ because they are expectations and not probabilities.

4. This gives a Zipfian long-tailed distribution. We experimented also with a truncation of a Pitman-Yor process with similar results.

5. We are grateful to Mark Johnson for his efficient C implementation.

6. Because both trees are binary, precision is equal to recall.

7. Full code for reproducing these experiments is available at https://github.com/alexc17/locallearner.

8. For technical reasons they consider only grammars where all probability mass is evenly distributed over all possible binary trees of a given length, and which are as a result highly ambiguous.

## References

*A Study of Grammatical Inference*