Abstract
Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. They are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. By making assumptions about the underlying distribution that are appropriate for natural language scenarios, we are able to derive distribution-dependent sample complexity bounds for probabilistic grammars. We also give simple algorithms for carrying out empirical risk minimization using this framework in both the supervised and unsupervised settings. In the unsupervised case, we show that the problem of minimizing empirical risk is NP-hard. We therefore suggest an approximate algorithm, similar to expectation-maximization, to minimize the empirical risk.
1. Introduction
Learning from data is central to contemporary computational linguistics. It is common in such learning to estimate a model in a parametric family using the maximum likelihood principle. This principle applies in the supervised case (i.e., using annotated data) as well as in semisupervised and unsupervised settings (i.e., using unannotated data). Probabilistic grammars constitute a range of such parametric families we can estimate (e.g., hidden Markov models, probabilistic context-free grammars). These parametric families are used in diverse NLP problems ranging from syntactic and morphological processing to applications like information extraction, question answering, and machine translation.
Estimation of probabilistic grammars, in many cases, indeed starts with the principle of maximum likelihood estimation (MLE). In the supervised case, and with traditional parametrizations based on multinomial distributions, MLE amounts to normalization of rule frequencies as they are observed in data. In the unsupervised case, on the other hand, algorithms such as expectation-maximization are available. MLE is attractive because it offers statistical consistency if some conditions are met (i.e., if the data are distributed according to a distribution in the family, then we will discover the correct parameters if sufficient data is available). In addition, under some conditions it is also an unbiased estimator.
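For concreteness, the following minimal Python sketch shows what MLE with a traditional multinomial parametrization amounts to in the supervised case: counting rule occurrences and normalizing within each multinomial. The toy "treebank" and data structures here are hypothetical illustrations, not the representation used later in the article.

```python
from collections import Counter, defaultdict

# A hypothetical toy treebank: each derivation is a list of
# (left-hand side, right-hand side) rule tokens observed in a parse tree.
derivations = [
    [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("runs",))],
    [("S", ("NP", "VP")), ("NP", ("they",)), ("VP", ("runs",))],
    [("S", ("VP",)), ("VP", ("runs",))],
]

def mle_supervised(derivations):
    """Relative-frequency MLE: count rule tokens, normalize per left-hand side."""
    counts = Counter(rule for d in derivations for rule in d)
    totals = defaultdict(float)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

theta = mle_supervised(derivations)
print(theta[("S", ("NP", "VP"))])  # 2/3 on this toy sample
```

The per-left-hand-side normalization corresponds to the multinomial structure of probabilistic grammars discussed in Section 3.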
An issue that has been far less explored in the computational linguistics literature is the sample complexity of MLE. Here, we are interested in quantifying the number of samples required to accurately learn a probabilistic grammar either in a supervised or in an unsupervised way. If bounds on the requisite number of samples (known as “sample complexity bounds”) are sufficiently tight, then they may offer guidance to learner performance, given various amounts of data and a wide range of parametric families. Being able to reason analytically about the amount of data to annotate, and the relative gains in moving to a more restricted parametric family, could offer practical advantages to language engineers.
We note that grammar learning has been studied in formal settings as a problem of grammatical inference—learning the structure of a grammar or an automaton (Angluin 1987; Clark and Thollard 2004; de la Higuera 2005; Clark, Eyraud, and Habrard 2008, among others). Our setting in this article is different. We assume that we have a fixed grammar, and our goal is to estimate its parameters. This approach has shown great empirical success, both in the supervised (Collins 2003; Charniak and Johnson 2005) and the unsupervised (Carroll and Charniak 1992; Pereira and Schabes 1992; Klein and Manning 2004; Cohen and Smith 2010a) settings. There has also been some discussion of sample complexity bounds for statistical parsing models, in a distribution-free setting (Collins 2004). The distribution-free setting, however, is not ideal for analysis of natural language, as it has to account for pathological cases of distributions that generate data.
We develop a framework for deriving sample complexity bounds using the maximum likelihood principle for probabilistic grammars in a distribution-dependent setting. Distribution dependency is introduced here by making empirically justified assumptions about the distributions that generate the data. Our framework uses and significantly extends ideas that have been introduced for deriving sample complexity bounds for probabilistic graphical models (Dasgupta 1997). Maximum likelihood estimation is put in the empirical risk minimization framework (Vapnik 1998) with the loss function being the log-loss. Following that, we develop a set of learning theoretic tools to explore rates of estimation convergence for probabilistic grammars. We also develop algorithms for performing empirical risk minimization.
Much research has been devoted to the problem of learning finite state automata (which can be thought of as a class of grammars) in the Probably Approximately Correct setting, leading to the conclusion that it is a very hard problem (Kearns and Valiant 1989; Pitt 1989; Terwijn 2002). Typically, the setting in these cases is different from our setting: Error is measured as the probability mass of strings that are not identified correctly by the learned finite state automaton, instead of measuring KL divergence between the automaton and the true distribution. In addition, in many cases, there is also a focus on the distribution-free setting. To the best of our knowledge, it is still an open problem whether finite state automata are learnable in the distribution-dependent setting when measuring the error as the fraction of misidentified strings. Other work (Ron 1995; Ron, Singer, and Tishby 1998; Clark and Thollard 2004; Palmer and Goldberg 2007) also gives treatment to probabilistic automata with an error measure which is more suitable for the probabilistic setting, such as Kullback-Leibler (KL) divergence or variation distance. These also focus on learning the structure of finite state machines. As mentioned earlier, in our setting we assume that the grammar is fixed, and that our goal is to estimate its parameters.
We note an important connection to an earlier study about the learnability of probabilistic automata and hidden Markov models by Abe and Warmuth (1992). In that study, the authors provided positive results for the sample complexity for learning probabilistic automata—they showed that a polynomial sample is sufficient for MLE. We demonstrate positive results for the more general class of probabilistic grammars which goes beyond probabilistic automata. Abe and Warmuth also showed that the problem of finding or even approximating the maximum likelihood solution for a two-state probabilistic automaton with an alphabet of an arbitrary size is hard. Even though these results extend to probabilistic grammars to some extent, we provide a novel proof that illustrates the NP-hardness of identifying the maximum likelihood solution for probabilistic grammars in the specific framework of “proper approximations” that we define in this article. Whereas Abe and Warmuth show that the problem of maximum likelihood maximization for two-state HMMs is not approximable within a certain factor in time polynomial in the alphabet and the length of the observed sequence, we show that there is no polynomial algorithm (in the length of the observed strings) that identifies the maximum likelihood estimator in our framework. In our reduction, from 3-SAT to the problem of maximum likelihood estimation, the alphabet used is binary and the grammar size is proportional to the length of the formula. In Abe and Warmuth, the alphabet size varies, and the number of states is two.
This article proceeds as follows. In Section 2 we review the background necessary from Vapnik's (1998) empirical risk minimization framework. This framework is reduced to maximum likelihood estimation when a specific loss function is used: the log-loss.1 There are some shortcomings to using the empirical risk minimization (ERM) framework in its simplest form, which is distribution-free: we make no assumptions about the distribution that generated the data. Naively attempting to apply the ERM framework to probabilistic grammars in the distribution-free setting does not lead to the desired sample complexity bounds. The reason for this is that the log-loss diverges whenever small probabilities are allocated in the learned hypothesis to structures or strings that have a rather large probability in the probability distribution that generates the data. With a distribution-free assumption, therefore, we would have to give treatment to distributions that are unlikely to be true for natural language data (e.g., where some extremely long sentences are very probable).
To correct for this, we move to an analysis in a distribution-dependent setting, by presenting a set of assumptions about the distribution that generates the data. In Section 3 we discuss probabilistic grammars in a general way and introduce assumptions about the true distribution that are reasonable when our data come from natural language examples. It is important to note that this distribution need not be a probabilistic grammar.
The next step we take, in Section 4, is approximating the set of probabilistic grammars over which we maximize likelihood. This is again required in order to overcome the divergence of the log-loss for probabilities that are very small. Our approximations are based on bounded approximations that have been used for deriving sample complexity bounds for graphical models in a distribution-free setting (Dasgupta 1997).
Our approximations have two important properties: They are, by themselves, probabilistic grammars from the family we are interested in estimating, and they become a tighter approximation around the family of probabilistic grammars we are interested in estimating as more samples are available.
Moving to the distribution-dependent setting and defining proper approximations enables us to derive sample complexity bounds. In Section 5 we present the sample complexity results for both the supervised and unsupervised cases. A question that lingers at this point is whether it is computationally feasible to maximize likelihood in our framework even when given enough samples.
In Section 6, we describe algorithms we use to estimate probabilistic grammars in our framework, when given access to the required number of samples. We show that in the supervised case, we can indeed maximize likelihood in our approximation framework using a simple algorithm. For the unsupervised case, however, we show that maximizing likelihood is NP-hard. This fact is related to a notion known in the learning theory literature as inherent unpredictability (Kearns and Vazirani 1994): Accurate learning is computationally hard even with enough samples. To overcome this difficulty, we adapt the expectation-maximization algorithm (Dempster, Laird, and Rubin 1977) to approximately maximize likelihood (or minimize log-loss) in the unsupervised case with proper approximations.
In Section 7 we discuss some related ideas. These include the failure of an alternative kind of distributional assumption and connections to regularization by maximum a posteriori estimation with Dirichlet priors. Longer proofs are included in the appendices. A table of notation that is used throughout is included as Table D.1 in Appendix D.
This article builds on two earlier papers. In Cohen and Smith (2010b) we presented the main sample complexity results described here; the present article includes significant extensions, a deeper analysis of our distributional assumptions, and a discussion of variants of these assumptions, as well as related work, such as that about the Tsybakov noise condition. In Cohen and Smith (2010c) we proved NP-hardness for unsupervised parameter estimation of probabilistic context-free grammars (PCFGs) (without approximate families). The present article uses a similar type of proof to achieve results adapted to empirical risk minimization in our approximation framework.
2. Empirical Risk Minimization and Maximum Likelihood Estimation
We begin by introducing some notation. We seek to construct a predictive model that maps inputs from an input space to outputs from an output space. In this work, the input space is a set of strings over some alphabet, and the output space is a set of derivations allowed by a grammar (e.g., a context-free grammar). We assume the existence of an unknown joint probability distribution p(x,z) over input–output pairs. (For the most part, we will be discussing discrete input and output spaces. This means that p will denote a probability mass function.) We are interested in estimating the distribution p from examples, either in a supervised setting, where we are provided with examples of the form (x, z), or in the unsupervised setting, where we are provided only with examples of the form x. We first consider the supervised setting and return to the unsupervised setting in Section 5. We will use q to denote the estimated distribution.




We are interested in bounding the excess risk for q*, that is, the difference between the risk of q* and the smallest risk achievable within the concept class. The excess risk reduces to the KL divergence between p and q if p itself belongs to the concept class, because in this case the expected log-loss is minimized with q′ = p, where it equals the entropy of p. In the typical case, where p does not necessarily belong to the concept class, the excess risk of q is bounded from above by the KL divergence between p and q.





Unfortunately, the regularity conditions required for this convergence need not hold, because the log-loss can be unbounded. This means that the empirical process must be modified in a way that actually guarantees some kind of convergence. We give a treatment of this in the next section.
We note that all discussion of convergence in this section has been about convergence in probability. For example, we want Equation (6) to hold with high probability—for most samples of size n. We will make this notion more rigorous in Section 2.2.
2.1 Empirical Risk Minimization and Structural Risk Minimization Methods
It has been noted in the literature (Vapnik 1998; Koltchinskii 2006) that often the concept class is too complex for empirical risk minimization using a fixed number of data points. It is therefore desirable in these cases to create a family of subclasses of increasing complexity. The more data we have, the more complex the subclass used for empirical risk minimization can be. Structural risk minimization (Vapnik 1998) and the method of sieves (Grenander 1981) are examples of methods that adopt such an approach. Structural risk minimization, for example, can be represented in many cases as a penalization of the empirical risk method, using a regularization term.
In our case, the level of “complexity” is related to the allocation of small probabilities to derivations in the grammar by a distribution in the class. The basic problem is this: Whenever we have a derivation with a small probability, the log-loss becomes very large (in absolute value), and this makes it hard to show the convergence of the empirical process. Because grammars can define probability distributions over infinitely many discrete outcomes, probabilities can be arbitrarily small and log-loss can be arbitrarily large.




In Section 4 we show that the minimizer is an asymptotic empirical risk minimizer (in our specific framework), which means that
. Because we have
, the implication of having asymptotic empirical risk minimization is that we have
.
2.2 Sample Complexity Bounds
Knowing that we are interested in the convergence of this empirical process, a natural question to ask is: “At what rate does this empirical process converge?”

Because the deviation between the empirical risk and the expected risk is a random variable, we need to give a probabilistic treatment to its convergence. More specifically, we ask the question that is typically asked when learnability is considered (Vapnik 1998): “How many samples n are required so that, with probability 1 − δ, this deviation is smaller than a given ε?” Bounds on this number of samples are also called “sample complexity bounds,” and in a distribution-free setting they are described as a function of ε and δ, independent of the distribution p that generates the data.

A complete distribution-free setting is not appropriate for analyzing natural language. This setting poses technical difficulties with the convergence of the empirical process and needs to take into account pathological cases that can be ruled out in natural language data. Instead, we will make assumptions about p, parametrize these assumptions in several ways, and then calculate sample complexity bounds that also depend on p, where the dependence on the distribution is expressed as dependence on the parameters in the assumptions about p.
The learning setting, then, can be described as follows. The user decides on a level of accuracy (ε) which the learning algorithm has to reach with confidence (1 − δ). Then, samples are drawn from p and presented to the learning algorithm. The learning algorithm then returns a hypothesis according to Equation (9).
3. Probabilistic Grammars
We begin this section by discussing the family of probabilistic grammars. A probabilistic grammar defines a probability distribution over a certain kind of structured object (a derivation of the underlying symbolic grammar) explained step-by-step as a stochastic process. Hidden Markov models (HMMs), for example, can be understood as a random walk through a probabilistic finite-state network, with an output symbol sampled at each state. PCFGs generate phrase-structure trees by recursively rewriting nonterminal symbols as sequences of “child” symbols (each itself either a nonterminal symbol or a terminal symbol analogous to the emissions of an HMM).
Each step or emission of an HMM and each rewriting operation of a PCFG is conditionally independent of the others given a single structural element (one HMM or PCFG state); this Markov property permits efficient inference over derivations given a string.
We denote by ΘG this parameter space for θ. The grammar G dictates the support of q in Equation (11). As is often the case in probabilistic modeling, there are different ways to carve up the random variables. We can think of x and z as correlated structure variables (often x is known if z is known), or the derivation event counts as an integer-vector random variable. In this article, we assume that x is always a deterministic function of z, so we use the distribution p(z) interchangeably with p(x,z).
Note that there may be many derivations z for a given string x—perhaps even infinitely many in some kinds of grammars. For HMMs, there are three kinds of multinomials: a starting state multinomial, a transition multinomial per state and an emission multinomial per state. In that case K = 2s + 1, where s is the number of states. The value of Nk depends on whether the kth multinomial is the starting state multinomial (in which case Nk = s), transition multinomial (Nk = s), or emission multinomial (Nk = t, with t being the number of symbols in the HMM). For PCFGs, each multinomial among the K multinomials corresponds to a set of Nk context-free rules headed by the same nonterminal. The parameter θk,i is then the probability of the ith rule for the kth nonterminal.
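As a concrete check of this bookkeeping, the following sketch (with hypothetical sizes s and t) enumerates the multinomials of such an HMM and verifies that K = 2s + 1.

```python
def hmm_multinomials(s, t):
    """Return a list of (name, N_k) pairs for an HMM with s states and t symbols."""
    multinomials = [("start", s)]                                 # one starting-state multinomial, N_k = s
    multinomials += [(f"transition[{i}]", s) for i in range(s)]   # one transition multinomial per state, N_k = s
    multinomials += [(f"emission[{i}]", t) for i in range(s)]     # one emission multinomial per state, N_k = t
    return multinomials

ms = hmm_multinomials(s=3, t=10)
assert len(ms) == 2 * 3 + 1   # K = 2s + 1
print(ms)
```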
We assume that G denotes a fixed grammar, such as a context-free or regular grammar. We let N = ∑k Nk denote the total number of derivation event types. We use D(G) to denote the set of all possible derivations of G. We define Dx(G) = {z ∈ D(G) | yield(z) = x}. We use deg(G) to denote the “degree” of G, i.e., deg(G) = maxk Nk. We let |x| denote the length of the string x, and |z| denote the “length” (number of event tokens) of the derivation z.
Going back to the notation in Section 2, the concept class would be a collection of probabilistic grammars, parametrized by θ, and q would be a specific probabilistic grammar with a specific θ. We therefore treat the problem of ERM with probabilistic grammars as the problem of parameter estimation—identifying θ from complete data or incomplete data (strings x are visible but the derivations z are not). We can also view parameter estimation as the identification of a hypothesis from the concept space of distributions hθ of the form of Equation (11) or, equivalently, from the corresponding negated log-concept space. For simplicity of notation, we assume that there is a fixed grammar G and suppress the dependence on G when referring to these concept spaces.
3.1 Distributional Assumptions about Language
In this section, we describe a parametrization of assumptions we make about the distribution p(x,z), the distribution that generates derivations from D(G) (note that p does not have to be a probabilistic grammar). We first describe empirical evidence about the decay of the frequency of long strings x.
Figure 1: A plot of the tail of frequency vs. sentence length in treebanks for English, German, Bulgarian, Turkish, Spanish, and Chinese. Red lines denote data from the treebank; blue lines denote an approximation which uses an exponential function of the form f(l; c, α) = c·α^l (the blue line uses data different from the data used to estimate the curve parameters, c and α). The parameters (c, α) are (0.19, 0.92) for English, (0.06, 0.94) for German, (0.26, 0.89) for Bulgarian, (0.26, 0.83) for Turkish, (0.11, 0.93) for Spanish, and (0.03, 0.97) for Chinese. Squared errors are 0.0005, 0.0003, 0.0007, 0.0003, 0.001, and 0.002 for English, German, Bulgarian, Turkish, Spanish, and Chinese, respectively.
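The curve parameters (c, α) reported in the caption can be obtained by a standard least-squares fit of an exponential of this form to empirical length frequencies. The following sketch illustrates such a fit on synthetic frequencies; the data here are hypothetical, not the treebank counts used in the figure.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: relative frequency of sentences of each length.
lengths = np.arange(1, 51)
freqs = 0.2 * 0.9 ** lengths + 0.001 * np.random.rand(50)  # synthetic, for illustration

def f(l, c, alpha):
    # Exponential model of the tail: f(l; c, alpha) = c * alpha**l
    return c * alpha ** l

(c_hat, alpha_hat), _ = curve_fit(f, lengths, freqs, p0=(0.1, 0.9))
print(c_hat, alpha_hat)  # roughly recovers (0.2, 0.9) on this synthetic data
```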
As a consequence of this observation, we make a few assumptions about G and p(x,z):
Derivation length proportional to sentence length: There is an α ≥ 1 such that, for all z, |z| ≤ α|yield(z)|. Further, |z| ≥ |x|. (This prohibits unary cycles.)
Exponential decay of derivations: There is a constant r < 1 and a constant L ≥ 0 such that p(z) ≤ L·r^|z|. Note that the assumption here is about the probability of individual derivations as a function of their length, and not the aggregated frequency of all sentences of a certain length (cf. the discussion above referring to Figure 1).
Exponential decay of strings: Let Λ(k) = |{z ∈ D(G) : |z| = k}| be the number of derivations of length k in G. We assume that Λ(k) is an increasing function, and complete it so that it is defined over all positive numbers. Taking r as before, we assume there exists a constant q < 1 such that Λ²(k)·r^k ≤ q^k (and as a consequence, Λ(k)·r^k ≤ q^k). This implies that the number of derivations of length k may be exponentially large (e.g., as with many PCFGs), but is bounded by (q/r)^k.
Bounded expectations of rules: There is a B < ∞ such that the expected number of times the ith event of the kth multinomial fires in a derivation drawn from p is at most B, for all k and i.
These assumptions necessarily hold for any p whose support consists of a finite set. These assumptions also hold in many cases when p itself is a probabilistic grammar. Also, we note that the last requirement of bounded expectations is optional, because it can be inferred from the rest of the requirements, with B = L/(1 − q)². We make this requirement explicit for simplicity of notation later. We denote the family of distributions that satisfy all of these requirements by .
There are other cases in the literature of language learning where additional assumptions are made on the learned family of models in order to obtain positive learnability results. For example, Clark and Thollard (2004) put a bound on the expected length of strings generated from any state of probabilistic finite state automata, which resembles the exponential decay of strings we have for p in this article.
An immediate consequence of these assumptions is that the entropy of p is finite and bounded by a quantity that depends on L, r, and q.5 Bounding the entropy of labels (derivations) given inputs (sentences) is a common way to quantify the noise in a distribution. Here, both the sentential entropy Hs(p) = −∑x p(x) log p(x) and the derivational entropy Hd(p) = −∑x,z p(x,z) log p(x,z) are bounded. This is stated in the following result.
Proposition 1
Proof

We note that another common way to quantify the noise in a distribution is through the notion of Tsybakov noise (Tsybakov 2004; Koltchinskii 2006). We discuss this further in Section 7.1, where we show that Tsybakov noise is too permissive, and probabilistic grammars do not satisfy its conditions.
3.2 Limiting the Degree of the Grammar
When approximating a family of probabilistic grammars, it is much more convenient when the degree of the grammar is limited. In this article, we limit the degree of the grammar by making the assumption that all Nk ≤ 2. This assumption may seem, at first glance, somewhat restrictive, but we show next that for PCFGs (and as a consequence, other formalisms), this assumption does not limit the total generative capacity that we can have across all context-free grammars.

Example of a context-free grammar and its equivalent binarized form.
It is easy to verify that the resulting grammar G′ has an equivalent capacity to the original CFG, G. A simple transformation that converts each derivation in the new grammar to a derivation in the old grammar would involve collapsing any path of nonterminals added to G′ (i.e., all Ai for nonterminal A) so that we end up with nonterminals from the original grammar only. Similarly, any derivation in G can be converted to a derivation in G′ by adding new nonterminals through unary application of rules of the form Ai → Ai+1. Given a derivation z in G, we denote by z′ the corresponding derivation in G′ after adding the new nonterminals Ai to z. Throughout this article, we will refer to the normalized form of G′ as a “binary normal form.”6
Note that K′, the number of multinomials in the binary normal form, is a function of both the number of nonterminals in the original grammar and the number of rules in that grammar. To make the equivalence complete, we need to show that any probabilistic context-free grammar can be translated to a PCFG with maxk Nk ≤ 2 such that the two PCFGs induce equivalent distributions over derivations.
Utility Lemma 1
Let ai ∈ [0,1], i ∈ {1, …, N}, such that ∑i ai = 1. Define b1 = a1, c1 = 1 − a1, bi = ai / (1 − a1 − ⋯ − ai−1), and ci = 1 − bi for i ≥ 2. Then ai = bi · c1 ⋯ ci−1 for all i ∈ {1, …, N}.
See Appendix A for the proof of Utility Lemma 1.
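As a numeric illustration of the construction in the lemma (as reconstructed above), the following sketch converts a multinomial into a chain of binary choices and checks that the products of the binary parameters recover the original probabilities.

```python
import math

def binarize(a):
    """Turn a multinomial (a_1, ..., a_N) into binary parameters (b_i, c_i)."""
    b, c = [], []
    remaining = 1.0
    for a_i in a:
        b_i = a_i / remaining          # b_i = a_i / (1 - a_1 - ... - a_{i-1})
        b.append(b_i)
        c.append(1.0 - b_i)
        remaining -= a_i
    return b, c

a = [0.5, 0.2, 0.2, 0.1]
b, c = binarize(a)
for i, a_i in enumerate(a):
    prefix = math.prod(c[:i])          # c_1 * ... * c_{i-1}
    assert abs(a_i - b[i] * prefix) < 1e-12
```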
Theorem 1
Let 〈G, θ〉 be a probabilistic context-free grammar. Let G′ be the binarizing transformation of G as defined earlier. Then, there exists θ′ for G′ such that for any z ∈ D(G) we have p(z | θ, G) = p(z′ | θ′, G′).
Proof
For the grammar G, index the set {1, …, K} with nonterminals ranging from A1 to AK. Define G′ as before. We need to define θ′. Index the multinomials in G′ by (k,i), each having two events. Set μ(k,1),1 = θk,1 and μ(k,1),2 = 1 − θk,1, and for i ≥ 2 set μ(k,i),1 = θk,i/(1 − θk,1 − ⋯ − θk,i−1) and μ(k,i),2 = 1 − μ(k,i),1, following Utility Lemma 1, so that the product of the weights along the chain for the ith rule of the kth nonterminal equals θk,i.
From Chi (1999), we know that the weighted grammar 〈G′, μ〉 can be converted to a probabilistic context-free grammar 〈G′, θ′〉, through a construction of θ′ based on μ, such that p(z′ | μ, G′) = p(z′ | θ′, G′).▪
The proof for Theorem 1 gives a construction of the parameters θ′ of G′ such that 〈G, θ〉 is equivalent to 〈G′, θ′〉. The construction of θ′ can also be reversed: Given θ′ for G′, we can construct θ for G so that again we have equivalence between 〈G, θ〉 and 〈G′, θ′〉.
In this section, we focused on presenting parametrized, empirically justified distributional assumptions about language data that will make the analysis in later sections more manageable. We showed that these assumptions bound the amount of entropy as a function of the assumption parameters. We also made an assumption about the structure of the grammar family, and showed that it entails no loss of generality for CFGs. Many other formalisms can follow similar arguments to show that the structural assumption is justified for them as well.
4. Proper Approximations
In order to follow the empirical risk minimization approach described in Section 2.1, we have to define a series of approximations for the original concept space, which we express through a corresponding sequence of log-concept spaces. We also have to replace two-sided uniform convergence (Equation [16]) with convergence on the sequence of concept spaces we define (Equation [10]). The concept spaces in the sequence vary as a function of the number of samples we have. We next construct the sequence of concept spaces, and in Section 5 we return to the learning model. Our approximations are based on the concept of bounded approximations (Abe, Takeuchi, and Warmuth 1991; Dasgupta 1997), which were originally designed for graphical models.7 A bounded approximation is a subset of a concept space which is controlled by a parameter that determines its tightness. Here we use this idea to define a series of subsets of the original concept space as approximations, while requiring two asymptotic properties that control the series' tightness.
Consider a sequence of concept spaces indexed by m ∈ {1, 2, …}. We consider three properties of elements of this sequence, which should hold for all m > M, for a fixed M.






We say that the sequence properly approximates the original concept space if there exist εtail(m), εbound(m), and Cm such that, for all m larger than some M, containment, boundedness, and tightness all hold.
In a good approximation, Km would increase at a fast rate as a function of m, and εtail(m) and εbound(m) would decrease quickly as a function of m. As we will see in Section 5, we cannot have an arbitrarily fast convergence rate (by, for example, taking a subsequence of the approximations), because the size of Km has a great effect on the number of samples required to obtain accurate estimation.
4.1 Constructing Proper Approximations for Probabilistic Grammars
We now focus on constructing proper approximations for probabilistic grammars whose degree is limited to 2. Proper approximations could, in principle, be used with losses other than the log-loss, though their main use is for unbounded losses. Starting from this point in the article, we focus on using such proper approximations with the log-loss.





When considering our approach of approximating a probabilistic grammar by increasing its small parameter probabilities to be above a certain threshold, it becomes clear why we are required to limit each multinomial of the grammar to two rules and why we are required to use the binary normal form from Section 3.2 with grammars of degree 2. Consider the PCFG rules in Table 1. There are different ways to move probability mass to the rule with small probability. This leads to a problem with identifiability of the approximation: How does one decide how to reallocate probability to the small-probability rules? By binarizing the grammar in advance, we arrive at a single way to reallocate mass when required (i.e., move mass from the high-probability rule to the low-probability rule). This leads to a simpler proof for sample complexity bounds and a single bound (rather than different bounds depending on different smoothing operators). We note, however, that the choices made in binarizing the grammar imply a particular way of smoothing the probability across the original rules.
Table 1: Example of a PCFG where there is more than a single way to approximate it by truncation with γ = 0.1, because it has more than two rules. Any value of η ∈ [0, γ] will lead to a different approximation.
Rule | θ | General | η = 0 | η = 0.01 | η = 0.005
---|---|---|---|---|---
S → NP VP | 0.09 | 0.01 | 0.1 | 0.1 | 0.1
S → NP | 0.11 | 0.11 − η | 0.11 | 0.1 | 0.105
S → VP | 0.8 | 0.8 − γ + η | 0.79 | 0.8 | 0.795
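For a grammar in binary normal form, the truncation just discussed is unambiguous. The following sketch (a minimal illustration, with γ as a parameter) shows the operation for a single two-event multinomial.

```python
def truncate_binary(theta, gamma):
    """Project a two-event multinomial (p, 1-p) so that both events have
    probability at least gamma, moving mass from the larger event.
    Assumes gamma <= 0.5 so the projection is well defined."""
    p, q = theta
    if p < gamma:
        return (gamma, 1.0 - gamma)
    if q < gamma:
        return (1.0 - gamma, gamma)
    return (p, q)

print(truncate_binary((0.02, 0.98), gamma=0.1))  # (0.1, 0.9): a single possible outcome
```

With three or more rules, as in Table 1, the analogous operation is underdetermined; the binary normal form removes exactly this ambiguity.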
We now describe how this construction of approximations satisfies the properties mentioned in Section 4, specifically, the boundedness property and the tightness property.
Proposition 2
Let and let
be as defined earlier. There exists a constant β = β(L,q,p,N) > 0 such that
has the boundedness property with Km = sN log3m and
.
See Appendix A for the proof of Proposition 2.
Next, is tight with respect to
with
.
Proposition 3
See Appendix A for the proof of Proposition 3.
We now have proper approximations for probabilistic grammars. These approximations are defined as a series of probabilistic grammars, related to the family of probabilistic grammars we are interested in estimating. They consist of three properties: containment (they are a subset of the family of probabilistic grammars we are interested in estimating), boundedness (their log-loss does not diverge to infinity quickly), and they are tight (there is a small probability mass at which they are not tight approximations).
4.2 Coupling Bounded Approximations with Number of Samples
At this point, the number of samples n is decoupled from the bounded approximation that we choose for grammar estimation. To couple the two, we need to define m as a function of the number of samples, m(n). As mentioned earlier, there is a clear trade-off between choosing a fast rate for m(n) (such as m(n) = n^k for some k > 1) and a slower rate (such as m(n) = log n). The faster the rate, the tighter the family of approximations that we use for n samples. If the rate is too fast, however, then Km grows quickly as well. In that case, because our sample complexity bounds are increasing functions of Km, the bounds will degrade.
To balance the trade-off, we choose m(n) = n. As we see later, this gives sample complexity bounds which are asymptotically interesting for both the supervised and unsupervised case.
4.3 Asymptotic Empirical Risk Minimization
It would be compelling to determine whether the empirical risk minimizer over the approximating family is an asymptotic empirical risk minimizer. This would mean that the risk of the empirical risk minimizer over the approximating family converges to the risk of the maximum likelihood estimate. As a conclusion to this section about proper approximations, we motivate the three requirements that we posed on proper approximations by showing that this is indeed true. We now unify n, the number of samples, and m, the index of the approximation of the concept space. Consider the minimizer of the empirical risk over the nth approximating concept space, and let gn be the minimizer of the empirical risk over the original concept space.
Let D = {z1,…,zn} be a sample from p(z). An estimation operator is an asymptotic empirical risk minimizer if the difference between its empirical risk and that of the empirical risk minimizer vanishes as n → ∞ (Shalev-Shwartz et al. 2009). Then, we have the following.
Lemma 1
See Appendix A for the proof of Lemma 1.
Proposition 4
Let D = {z1,…,zn} be a sample of derivations from G. Then the empirical risk minimizer over the approximating family is an asymptotic empirical risk minimizer.
Proof








5. Sample Complexity Bounds
Equipped with the framework of proper approximations as described previously, we now give our main sample complexity results for probabilistic grammars. These results hinge on the convergence of the relevant empirical process. Indeed, proper approximations replace the original concept class in these convergence results. The rate of this convergence can be fast, if the covering numbers of the approximating classes do not grow too fast.
5.1 Covering Numbers and Bounds on Covering Numbers
We next give a brief overview of covering numbers. A cover provides a way to reduce a class of functions to a much smaller (finite, in fact) representative class such that each function in the original class is represented using a function in the smaller class. Consider a class of functions, and let d(f,g) be a distance measure between two functions f and g from that class. An ε-cover is a subset of the class such that for every function f in the class there exists an f′ in the subset with d(f,f′) < ε. The covering number is the size of the smallest ε-cover of the class for the distance measure d.







Lemma 2







See Pollard (1984; Chapter 2, pages 30–31) for the proof of Lemma 2. See also Appendix A.
Covering numbers are rather complex combinatorial quantities which are hard to compute directly. Fortunately, they can be bounded using the pseudo-dimension (Anthony and Bartlett 1999), a generalization of the Vapnik-Chervonenkis (VC) dimension for real functions. In the case of our “binomialized” probabilistic grammars, the pseudo-dimension of the log-concept space is bounded by N, because its functions are linear with N parameters. Hence, the approximating concept spaces also have pseudo-dimension at most N. We then have the following.
Lemma 3
5.2 Supervised Case
We now give an analysis of the supervised case. This analysis is mostly described as a preparation for the unsupervised case. In general, the families of probabilistic grammars we treat are parametric families, and the maximum likelihood estimator for these families is a consistent estimator in the supervised case. In the unsupervised case, however, lack of identifiability prevents us from getting these traditional consistency results. Also, the traditional results about the consistency of MLE are based on the assumption that the sample is generated from the parametric family we are trying to estimate. This is not the case in our analysis, where the distribution that generates the data does not have to be a probabilistic grammar.
Lemmas 2 and 3 can be combined to get the following sample complexity result.
Theorem 2


Proof Sketch
Theorem 2 gives little intuition about the number of samples required for accurate estimation of a grammar, because it considers the “additive” setting: The empirical risk is within ε of the expected risk. More specifically, it is not clear how we should pick ε for the log-loss, because the log-loss can take arbitrary values.
where H(p) is the Shannon entropy of p. This stems from the fact that the expected risk under the log-loss is at least H(p) for any hypothesis f. This means that if we are interested in computing a sample complexity bound such that the ratio between the empirical risk and the expected risk (for log-loss) is close to 1 with high probability, we need to pick ρ such that the right-hand side of Equation (17) is smaller than the desired accuracy level (between 0 and 1). Note that Equation (17) is an oracle inequality—it requires knowing the entropy of p or some upper bound on it.
5.3 Unsupervised Case













It does not immediately follow that this construction is a proper approximation for the unsupervised concept space. It is not hard to show that the boundedness property is satisfied with the same Kn and the same form of εbound(n) as in Proposition 2 (with some constant β′(L,q,p,N) = β′ > 0). This relies on the property of bounded derivation length under p (see Appendix A, Proposition 7). The following result shows that we have tightness as well.
Utility Lemma 2
For ai, bi ≥ 0, if −log ∑i ai + log ∑i bi ≥ ε, then there exists an i such that −log ai + log bi ≥ ε.
Proposition 5
Proof Sketch
Computing either the covering number or the pseudo-dimension of the unsupervised concept class is a hard task, because the functions in this class include a “log-sum-exp.” Dasgupta (1997) overcomes this problem for Bayesian networks with fixed structure by giving a bound on the covering number of (his respective) class with hidden variables which depends on the covering number of the fully observed class.
Unfortunately, we cannot fully adopt this approach, because the derivations of a probabilistic grammar can be arbitrarily large. Instead, we present the following proposition, which is based on the “Hidden Variable Rule” from Dasgupta (1997). This proposition shows that the covering number of the unsupervised concept class (or, more accurately, of its bounded approximations) can be bounded in terms of the covering number of the bounded approximations of the supervised concept class, and the constants which control the underlying distribution p mentioned in Section 3.
Utility Lemma 3
For any two positive-valued sequences (a1,…,an) and (b1,…,bn) we have that .
Proposition 6 (Hidden Variable Rule for Probabilistic Grammars)
Let . Then,
.
Proof




Set m to be the quantity that appears in the proposition to get the necessary result (f′ and f are arbitrary functions in the respective classes; f0 and its counterpart are then taken to be functions from the respective covers).▪
For the unsupervised case, then, we get the following sample complexity result.
Theorem 3


Theorem 3 states that the number of samples we require in order to accurately estimate a probabilistic grammar from unparsed strings depends on the level of ambiguity in the grammar, represented as Λ(m). We note that this dependence is polynomial, and we consider this a positive result for unsupervised learning of grammars. More specifically, if Λ is an exponential function (such as the case with PCFGs), when compared to the supervised learning, there is an extra multiplicative factor in the sample complexity in the unsupervised setting that behaves like .

6. Algorithms for Empirical Risk Minimization
We turn now to describing algorithms and their properties for minimizing empirical risk using the framework described in Section 4.
6.1 Supervised Case
ERM with proper approximations leads to simple algorithms for estimating the probabilities of a probabilistic grammar in the supervised setting. Given an ε > 0 and a δ > 0, we draw n examples according to Theorem 2. We then set γ = n^−s. To minimize the log-loss with respect to these n examples, we use the corresponding proper approximation constructed in Section 4.1.



This truncation to a margin γ can be thought of as a smoothing factor that enables us to compute sample complexity bounds. We explore this connection to smoothing with a Dirichlet prior in a Maximum a posteriori (MAP) Bayesian setting in Section 7.2.
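Putting the pieces together, a minimal sketch of the resulting supervised estimator is given below. It assumes the grammar is already in binary normal form, takes s as a user-chosen constant larger than 1, and reuses the hypothetical truncate_binary helper sketched after Table 1.

```python
def erm_supervised(multinomial_counts, n, s=2.0):
    """Supervised ERM in the approximation framework: normalize rule counts for each
    two-event multinomial, then truncate at gamma = n**(-s)."""
    gamma = n ** (-s)
    theta = {}
    for k, (c1, c2) in multinomial_counts.items():
        p = c1 / (c1 + c2)                               # relative-frequency MLE
        theta[k] = truncate_binary((p, 1.0 - p), gamma)  # project into Theta(gamma)
    return theta

# Hypothetical counts for two binarized multinomials, from n = 10 derivations:
print(erm_supervised({"S_1": (9, 1), "NP_1": (10, 0)}, n=10, s=2.0))
```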
6.2 Unsupervised Case
6.2.1 Hardness of ERM with Proper Approximations
It turns out that minimizing Equation (23) under the specified constraints is actually an NP-hard problem when G is a PCFG. This result follows using a proof similar to the one in Cohen and Smith (2010c) for the hardness of Viterbi training and of maximizing log-likelihood for PCFGs. We now give the full derivation of this hardness result for PCFGs and the modification required for adapting the results from Cohen and Smith to the case of having an arbitrary γ margin constraint.
In order to show an NP-hardness result, we need to “convert” the problem of the maximization of Equation (23) to a decision problem. We do so by stating the following decision problem.
Problem 1 (Unsupervised Minimization of the Log-Loss with Margin)
Input: A binarized context-free grammar G, a set of sentences x1, … , xn, a value γ, and a value α ∈ [0, 1].
We will show the hardness result both when γ is not restricted at all and when we require γ > 0. The proof of the hardness result is achieved by reducing the problem 3-SAT (Sipser 2006), known to be NP-complete, to Problem 1. The problem 3-SAT is defined as follows:
Problem 2 (3-SAT)
Input: A formula φ in conjunctive normal form, such that each clause has three literals.
Output: 1 if there is a satisfying assignment for φ, and 0 otherwise.
Given an instance of the 3-SAT problem, the reduction will, in polynomial time, create a grammar and a single string such that solving Problem 1 for this grammar and string will yield a solution for the instance of the 3-SAT problem.
Let φ = C1 ∧ ⋯ ∧ Cm be an instance of the 3-SAT problem, where Cj = aj ∨ bj ∨ cj is the jth clause and aj, bj, and cj are literals over the set of variables {Y1,…,YN} (a literal refers to a variable Yj or its negation). We define the following CFG Gφ and string to parse sφ:
- 1.
The terminals of Gφ are the binary digits Σ = {0,1}.
- 2.
We create N nonterminals
, r ∈ {1,…,N} and rules
and
.
- 3.
We create N nonterminals
, r ∈ {1,…,N} and rules
and
.
- 4.
We create
and
.
- 5.
We create the rule S1 → A1. For each j ∈ {2,…,m}, we create a rule Sj → Sj−1 Aj, where Sj is a new nonterminal indexed by j and Aj is also a new nonterminal indexed by j ∈ {1,…,m}.
- 6. Let Cj = aj ∨ bj ∨ cj be clause j in φ. Let Y(aj) be the variable that aj mentions. Let (y1,y2,y3) be a satisfying assignment for Cj, where yk ∈ {0,1} is the value of Y(aj), Y(bj), and Y(cj), respectively, for k ∈ {1,2,3}. For each such clause-satisfying assignment, we add a rule that rewrites Aj accordingly. For each Aj, we would have at most seven rules of this form, because one of the eight possible assignments is logically inconsistent with aj ∨ bj ∨ cj.
- 7.
The grammar's start symbol is Sn.
- 8.
The string to parse is sφ = (10)^3m, that is, 3m consecutive occurrences of the string 10.
A parse of the string sφ using Gφ will be used to get an assignment by setting Yr = 0 if the rule or
is used in the derivation of the parse tree, and 1 otherwise. Notice that at this point we do not exclude “contradictions” that come from the parse tree, such as
used in the tree together with
or
. To maintain the restriction on the degree of grammars, we convert Gφ to the binary normal form described in Section 3.2. The following lemma gives a condition under which the assignment is consistent (so that contradictions do not occur in the parse tree).
Lemma 4
Let φ be an instance of the 3-SAT problem, and let Gφ be a probabilistic CFG based on the given grammar with weights θφ. If the (multiplicative) weight of the Viterbi parse (i.e., the highest scoring parse according to the PCFG) of sφ is 1, then the assignment extracted from the parse tree is consistent.
Proof
Because the probability of the Viterbi parse is 1, all rules that appear in the parse tree have probability 1 as well. There are two possible types of inconsistencies. We show that neither exists in the Viterbi parse:
- 1.
For any r, an appearance of both rules of the form
and
cannot occur because all rules that appear in the Viterbi parse tree have probability 1.
- 2.
For any r, an appearance of rules of the form
and
cannot occur, because whenever we have an appearance of the rule
, we have an adjacent appearance of the rule
(because we parse substrings of the form 10), and then we again use the fact that all rules in the parse tree have probability 1. The case of
and
is handled analogously.
An example of a Viterbi parse tree which represents a satisfying assignment for an example formula φ. In θφ, all rules appearing in the parse tree have probability 1. The extracted assignment would be Y1 = 0, Y2 = 1, Y3 = 1, Y4 = 0. Note that there is no usage of two different rules for a single nonterminal.
Lemma 5
Define φ and Gφ as before. There exists θφ such that the weight of the Viterbi parse of sφ is 1 if and only if φ is satisfiable. Moreover, the satisfying assignment is the one extracted from the weight-1 parse tree of sφ under θφ.
Proof






We are now ready to prove the following result.
Theorem 4
Problem 1 is NP-hard when either requiring γ > 0 or when fixing γ = 0.
Proof
We first describe the reduction for the case of γ = 0. In Problem 1, set γ = 0, α = 1, G = Gφ, and x1 = sφ. If φ is satisfiable, then the left side of Equation (24) can get value 0, by setting the rule probabilities according to Lemma 5; hence we would return 1 as the result of running Problem 1.
If φ is unsatisfiable, then we would still get value 0 only if L(G) = {sφ}. If Gφ generates a single derivation for (10)^3m, then we actually do have a satisfying assignment from Lemma 4. Otherwise (more than a single derivation), the optimal θ must spread probability mass over rules. In that case, it is no longer true that (10)^3m is the only generated sentence, and this is a contradiction to getting value 0 for Problem 1.


6.2.2 An Expectation-Maximization Algorithm
Instead of solving the optimization problem implied by Equation (21), we propose a rather simple modification to the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) to approximate the optimal solution—this algorithm finds a local maximum for the maximum likelihood problem using proper approximations. The modified algorithm is given in Algorithm 1.
The modification from the usual expectation-maximization algorithm is done in the M-step: Instead of using the expected value of the sufficient statistics by counting and normalizing, we truncate the values by γ. It can be shown that if θ(0) ∈ Θ(γ), then the likelihood is guaranteed to increase (and hence, the log-loss is guaranteed to decrease) after each iteration of the algorithm.
The reason for this likelihood increase stems from the fact that the M-step solves the optimization problem of minimizing the log-loss (with respect to θ ∈ Θ(γ)) using, as the base distribution, the posterior computed at the E-step. This means that the M-step at iteration t minimizes the expected log-loss, where the expectation over derivations is taken with respect to the posterior distribution computed from θ(t−1). With this notion in mind, the likelihood increase after each iteration follows from principles similar to those described in Bishop (2006) for the EM algorithm.
7. Discussion
Our framework can be specialized to trade off between two main criteria: the tightness of the proper approximation and the sample complexity. For example, we can improve the tightness of our proper approximations by taking a subsequence of the approximations. This will make the sample complexity bound degrade, however, because Kn will grow faster. Table 2 shows the trade-offs between parameters in our model and the effectiveness of learning.
Table 2: Trade-off between quantities in our learning model and effectiveness of different criteria. Kn is the constant that satisfies the boundedness property (Theorems 2 and 3) and s is a fixed constant larger than 1 (Section 4.1).
criterion | as Kn increases … | as s increases …
---|---|---
tightness of proper approximation | improves | improves
sample complexity bound | degrades | degrades
We note that the sample complexity bounds that we give in this article give insight about the asymptotic behavior of grammar estimation, but are not necessarily sufficiently tight to be used in practice. It still remains an open problem to obtain sample complexity bounds which are sufficiently tight in this respect. For a discussion about the connection of grammar learning in theory and practice, we refer the reader to Clark and Lappin (2010).
It is also important to note that MLE is not the only option for estimating finite state probabilistic grammars. There have been some recent advances in learning finite state models (HMMs and finite state transducers) by using spectral analysis of matrices which consist of quantities estimated from observations only (Hsu, Kakade, and Zhang 2009; Balle, Quattoni, and Carreras 2011), based on the observable operator models of Jaeger (1999). These algorithms are not prone to local minima, and converge to the correct model as the number of samples increases, but require some assumptions about the underlying model that generates the data.
7.1 Tsybakov Noise
In this article, we chose to introduce assumptions about distributions that generate natural language data. The choice of these assumptions was motivated by observations about properties shared among treebanks. The main consequence of making these assumptions is bounding the amount of noise in the distribution (i.e., the amount of variation in probabilities across labels given a fixed input).
There are other ways to restrict the noise in a distribution. One condition for such noise restriction, which has received considerable recent attention in the statistical literature, is the Tsybakov noise condition (Tsybakov 2004; Koltchinskii 2006). Showing that a distribution satisfies the Tsybakov noise condition enables the use of techniques (e.g., from Koltchinskii 2006) for deriving distribution-dependent sample complexity bounds that depend on the parameters of the noise. It is therefore of interest to see whether Tsybakov noise holds under the assumptions presented in Section 3.1. We show that this is not the case, and that Tsybakov noise is too permissive. In fact, we show that p can be a probabilistic grammar itself (and hence, satisfy the assumptions in Section 3.1), and still not satisfy the Tsybakov noise conditions.
Tsybakov noise was originally introduced for classification problems (Tsybakov 2004), and was later extended to more general settings, such as the one we are facing in this article (Koltchinskii 2006). We now explain the definition of Tsybakov noise in our context.


Theorem 5
Let G be a grammar with K ≥ 2 and degree 2. Assume that p is 〈G, θ*〉 for some θ*, such that and that c1 ≤ c2. If AG(θ*) is positive definite, then p does not satisfy the Tsybakov noise condition for any (C,κ), where C > 0 and κ ≥ 1.
See Appendix C for the proof of Theorem 5.
In Appendix C we show that AG(θ) is positive semi-definite for any choice of θ. The main intuition behind the proof is that given a probabilistic grammar p, we can construct a hypothesis h such that the KL divergence between p and h is small, but dist(p,h) is lower-bounded and is not close to 0.
We conclude that probabilistic grammars, as generative distributions of data, do not generally satisfy the Tsybakov noise condition. This motivates an alternative choice of assumptions that could lead to better understanding of rates of convergences and bounds on the excess risk. Section 3.1 states such assumptions which were also justified empirically.
7.2 Comparison to Dirichlet Maximum A Posteriori Solutions


The effect of Dirichlet smoothing becomes weaker as we have more samples, because the frequency counts become dominant in both the numerator and the denominator when there are more data. In this sense, the prior's effect on learning diminishes as we use more data. A similar effect occurs in our framework: γ = n−s where n is the number of samples—the more samples we have, the more we trust the counts in the data to be reliable. There is a subtle difference, however. With the Dirichlet MAP solution, the smoothing is less dominant only if the counts of the features are large, regardless of the number of samples we have. With our framework, smoothing depends only on the number of samples we have. These two scenarios are related, of course: The more samples we have, the more likely it is that the counts of the events will grow large.
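The contrast can be seen directly in the two estimators for a single two-event multinomial; the sketch below compares a symmetric Dirichlet MAP estimate with the truncated estimate (reusing the hypothetical truncate_binary helper from the earlier sketch).

```python
def dirichlet_map(c1, c2, alpha=1.1):
    """MAP estimate under a symmetric Dirichlet(alpha) prior (alpha > 1):
    add alpha - 1 pseudo-counts to each event and normalize."""
    p = (c1 + alpha - 1.0) / (c1 + c2 + 2.0 * (alpha - 1.0))
    return (p, 1.0 - p)

n_small, n_large = 10, 10000
# With few data, both approaches move the zero count away from zero ...
print(dirichlet_map(10, 0), truncate_binary((1.0, 0.0), gamma=n_small ** -2.0))
# ... but the Dirichlet effect depends on the counts themselves, while truncation
# depends only on the number of samples n (through gamma = n**-s).
print(dirichlet_map(10000, 0), truncate_binary((1.0, 0.0), gamma=n_large ** -2.0))
```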
7.3 Other Derivations of Sample Complexity Bounds
In this section, we discuss other possible solutions to the problem of deriving sample complexity bounds for probabilistic grammars.
7.3.1 Using Talagrand's Inequality
Our bounds are based on VC theory together with classical results for empirical processes (Pollard 1984). There have been some recent developments to the derivation of rates of convergence in statistical learning theory (Massart 2000; Bartlett, Bousquet, and Mendelson 2005; Koltchinskii 2006), most prominently through the use of Talagrand's inequality (Talagrand 1994), which is a concentration of measure inequality, in the spirit of Lemma 2.
The bounds achieved with Talagrand's inequality are also distribution-dependent, and are based on the diameter of the ε-minimal set—the set of hypotheses which have an excess risk smaller than ε. We saw in Section 7.1 that the diameter of the ε-minimal set does not follow the Tsybakov noise condition, but it is perhaps possible to find meaningful bounds for it, in which case we may be able to get tighter bounds using Talagrand's inequality. We note that it may be possible to obtain data-dependent bounds for the diameter of the ε-minimal set, following Koltchinskii (2006), by calculating the diameter of the ε-minimal set using .
7.3.2 Simpler Bounds for the Supervised Case
As noted in Section 6.1, minimizing empirical risk with the log-loss leads to a simple frequency count for calculating the estimated parameters of the grammar. In Corazza and Satta (2006), it has also been noted that to minimize the non-empirical risk, it is necessary to set the parameters of the grammar to the normalized expected count of the features.
This means that we can get bounds on the deviation of a certain parameter from the optimal parameter by applying modifications to rather simple inequalities such as Hoeffding's inequality, which determines the probability of the average of a set of i.i.d. random variables deviating from its mean. The modification would require us to split the event space into two cases: one in which the count of some features is larger than some fixed value (which will happen with small probability because of the bounded expectation of features), and one in which they are all smaller than that fixed value. Handling these two cases separately is necessary because Hoeffding's inequality requires that the count of the rules is bounded.
A bound on the deviation of the parameters from their true values can, in turn, lead to a bound on the excess risk in the supervised case. This formulation does not generalize to the unsupervised case, however, where empirical risk minimization does not amount to a simple frequency count.
7.4 Open Problems
We conclude the discussion with some directions for further exploration and future work.
7.4.1 Sample Complexity Bounds with Semi-Supervised Learning
Our bounds focus on the supervised and unsupervised cases. There is a trivial extension to the semi-supervised case: take the objective function to be the (possibly weighted) sum of the likelihood of the labeled data and the marginalized likelihood of the unlabeled data, and then use the sample complexity bounds for each summand to derive a sample complexity bound for the sum.
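Schematically, for labeled examples (x_i, z_i), i = 1, …, n_ℓ, unlabeled examples x_j, j = 1, …, n_u, and a weight λ ≥ 0 (the notation here is illustrative), such an objective is

\[
\hat\theta \;=\; \arg\max_{\theta}\; \sum_{i=1}^{n_\ell} \log p(x_i, z_i \mid \theta) \;+\; \lambda \sum_{j=1}^{n_u} \log \sum_{z} p(x_j, z \mid \theta),
\]

and the supervised and unsupervised bounds can be combined, for example through a union bound over the two summands.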
It would be more interesting to extend our results to frameworks such as the one described by Balcan and Blum (2010). In that case, our discussion of sample complexity would attempt to identify how unannotated data can reduce the space of candidate probabilistic grammars to a smaller set, after which we can use the annotated data to estimate the final grammar. This reduction of the space is accomplished through a notion of compatibility, a type of fitness that the learner believes the estimated grammar should have given the distribution that generates the data. The key challenge for probabilistic grammars would be to define this compatibility notion so that it fits the log-loss. If this is achieved, then machinery similar to that described in this article (with proper approximations) could be used to derive semi-supervised sample complexity bounds for probabilistic grammars.
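Schematically, and following the spirit of Balcan and Blum (2010) rather than their exact definitions, a compatibility function χ(θ, p) ∈ [0, 1] measures how well a grammar 〈G, θ〉 fits the distribution over unannotated data; its empirical estimate \(\hat\chi\) is computed from the unlabeled sample, and attention is restricted to the set

\[
\Theta_{\chi}(\tau) \;=\; \bigl\{ \theta \in \Theta \;:\; \hat\chi(\theta) \ge 1 - \tau \bigr\},
\]

so that the annotated sample only needs to distinguish among grammars in this smaller set. The open question raised above is how to define χ so that this restriction meaningfully tightens the log-loss bounds.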
7.4.2 Sharper Bounds for the Pseudo-Dimension of Probabilistic Grammars
The pseudo-dimension of a probabilistic grammar with the log-loss is bounded by the number of parameters in the grammar, because the logarithm of a distribution generated by a probabilistic grammar is a linear function. Typically, however, the set of feature-count vectors of a probabilistic grammar resides in a subspace whose dimension is smaller than the full dimension specified by the number of parameters. The reason is that there are usually relationships (often linear) among the elements of the feature-count vector. For example, with HMMs, the total feature count for emissions should equal the total feature count for transitions. With PCFGs, the total number of times rules headed by a given nonterminal fire equals the total number of times that nonterminal appears on the right-hand side of fired rules (up to a correction for the start symbol), again reducing the pseudo-dimension. An open problem that remains is the characterization of the exact pseudo-dimension of a given grammar, determined by various properties of that grammar. We conjecture, however, that a lower bound on the pseudo-dimension would be rather close to the full dimension of the grammar (the number of parameters).
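As a small illustration of such a linear relationship (in schematic notation), consider an HMM whose features are transition counts c_trans(s → s′) and emission counts c_emit(s ↑ w). Under the convention that every emission is paired with a transition, each derivation satisfies

\[
\sum_{s, s'} c_{\mathrm{trans}}(s \to s') \;=\; \sum_{s, w} c_{\mathrm{emit}}(s \uparrow w)
\]

(possibly up to boundary terms for start and stop states, depending on the exact parametrization). The feature-count vectors therefore lie in a proper linear subspace, which is why the pseudo-dimension can be strictly smaller than the number of parameters.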
It is interesting to note that there has been some work on identifying the VC dimension and pseudo-dimension of certain types of grammars. Bane, Riggle, and Sonderegger (2010), for example, calculated the VC dimension of constraint-based grammars. Ishigami and Tani (1993, 1997) computed the VC dimension of finite-state automata with various properties.
7.5 Conclusion
We presented a framework for performing empirical risk minimization for probabilistic grammars, in which sample complexity bounds, for the supervised case and the unsupervised case, can be derived. Our framework is based on the idea of bounded approximations used in the past to derive sample complexity bounds for graphical models.
Our framework required assumptions about the probability distribution that generates sentences or derivations in the language of the given grammar. These assumptions were tested using corpora, and found to fit the data well.
We also discussed algorithms that can be used for minimizing empirical risk in our framework, given enough samples. We showed that directly trying to minimize empirical risk in the unsupervised case is NP-hard, and suggested an approximation based on an expectation-maximization algorithm.
Appendix A. Proofs
We include in this appendix proofs for several results in the article.
Utility Lemma 1
Let a1, …, aN ≥ 0 be such that ∑i ai = 1. Define b1 = a1, c1 = 1 − a1, bi = ai / (1 − a1 − ⋯ − ai−1), and ci = 1 − bi for i ≥ 2. Then ai = bi c1 ⋯ ci−1 for every i.
Proof
Lemma 1
Proof


Proposition 2
Let the function class and its proper approximation family be as defined earlier. There exists a constant β = β(L, q, p, N) > 0 such that the approximation family has the boundedness property with Km = sN log³ m.
Proof
Utility Lemma 4
(From Dasgupta [1997].) Let a ∈ [0,1] and let b = a if a ∈ [γ, 1 − γ], b = γ if a ≤ γ, and b = 1 − γ if a ≥ 1 − γ. Then for any ε ≤ 1/2 such that γ ≤ ε/(1 + ε), we have log a/b ≤ ε.
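A short verification sketch (not taken from Dasgupta's wording): the only nontrivial case is a ≥ 1 − γ, since a ≤ γ gives b = γ ≥ a and hence log a/b ≤ 0 ≤ ε, and a ∈ [γ, 1 − γ] gives b = a. When a ≥ 1 − γ we have b = 1 − γ and

\[
\log\frac{a}{b} \;\le\; \log\frac{1}{1-\gamma} \;\le\; \log(1+\varepsilon) \;\le\; \varepsilon,
\]

where the middle inequality uses γ ≤ ε/(1 + ε), which implies 1 − γ ≥ 1/(1 + ε).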
Proposition 3
Proof












Proposition 7
There exists a constant β′ = β′(L, p, q, N) > 0 such that the corresponding approximation family has the boundedness property with Km = sN log³ m.
Proof
From the requirement on p, we know that for any x there exists a z such that yield(z) = x and |z| ≤ α|x|. The boundedness property then follows similarly to the proof of Proposition 2; denote by f1(x, z) the corresponding function in the approximation family.
Utility Lemma 2
For ai, bi ≥ 0, if −log ∑i ai + log ∑i bi ≥ ε then there exists an i such that −log ai + log bi ≥ ε.
Proof
Assume −log ai + log bi < ε for all i. Then bi < e^ε ai for every i, therefore ∑i bi < e^ε ∑i ai, therefore −log ∑i ai + log ∑i bi < ε, which is a contradiction to −log ∑i ai + log ∑i bi ≥ ε.▪
The next lemma is the main concentration-of-measure result that we use. Its proof requires a simple modification of the proof given for Theorem 24 in Pollard (1984, pages 30–31).
Lemma 2
Proof
At this point, we can follow the proof of Theorem 24 in Pollard (1984), and its extension on pages 30–31, to obtain Lemma 2 using the shifted set of functions.▪
Appendix B. Minimizing Log-Loss for Probabilistic Grammars







Equation (B.6) is precisely the definition of the KL divergence between the distribution of a coin with probability γ of heads and the distribution of a coin with probability β of heads; therefore the right-hand side of Equation (B.6) is positive, which gives the required result.
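Explicitly (writing the two coins as Bernoulli distributions), that divergence is

\[
D\bigl(\mathrm{Bernoulli}(\gamma)\,\big\|\,\mathrm{Bernoulli}(\beta)\bigr)
\;=\; \gamma \log\frac{\gamma}{\beta} \;+\; (1-\gamma)\log\frac{1-\gamma}{1-\beta} \;\ge\; 0,
\]

with equality only when γ = β, so the right-hand side of Equation (B.6) is strictly positive whenever the two probabilities differ.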
Appendix C. Counterexample to Tsybakov Noise (Proofs)
Lemma 6
A = AG(θ) is positive semi-definite for any probabilistic grammar 〈G, θ〉.
Proof
Lemma 7
Proof
Theorem 5
Let G be a grammar with K ≥ 2 and degree 2. Assume that p is induced by 〈G, θ*〉 for some θ* and that c1 ≤ c2. If AG(θ*) is positive definite, then p does not satisfy the Tsybakov noise condition for any (C, κ) with C > 0 and κ ≥ 1.
Proof




Appendix D. Notation
Table D.1 gives a table of notation for symbols used throughout this article.
Acknowledgements
The authors thank the anonymous reviewers for their comments and Avrim Blum, Steve Hanneke, Mark Johnson, John Lafferty, Dan Roth, and Eric Xing for useful conversations. This research was supported by National Science Foundation grant IIS-0915187.
Notes
It is important to remember that minimizing the log-loss does not equate to minimizing the error of a linguistic analyzer or natural language processing application. In this article we focus on the log-loss case because we believe that probabilistic models of language phenomena have inherent usefulness as explanatory tools in computational linguistics, aside from their use in systems.
We note that this quantity is itself a random variable, because it depends on the sample drawn from p.
We note that the minimum is not necessarily attained by any hypothesis q* in the general case. In our instantiations of ERM for probabilistic grammars, however, the minimum can be attained. In fact, in the unsupervised case the minimum can be attained by more than one hypothesis; in such cases, q* is chosen arbitrarily from among the minimizers.
Treebanks offer samples of cleanly segmented sentences. It is important to note that the distributions estimated may not generalize well to samples from other domains in these languages. Our argument is that the family of the estimated curve is reasonable, not that we can correctly estimate the curve's parameters.
For simplicity and consistency with the log-loss, we measure entropy in nats, which means we use the natural logarithm when computing entropy.
We note that this notion of binarization is different from previous uses of binarization in computational linguistics. In previous work on binarized grammars such as CFGs, the grammars are constrained to have at most two nonterminals on the right-hand side, as in Chomsky normal form. Another form of binarization, for linear context-free rewriting systems, restricts the fan-out of the rules to two (Gómez-Rodríguez and Satta 2009; Gildea 2010). We, however, limit the number of rules for each nonterminal (or, more generally, the number of elements in each multinomial).
There are other ways to manage the unboundedness of KL divergence in the language learning literature. Clark and Thollard (2004), for example, decompose the KL divergence between probabilistic finite-state automata into several terms according to a decomposition of Carrasco (1997) and then bound each term separately.
By varying s we get a family of approximations. The larger s is, the tighter the approximation is. Also, the larger s is, as we see later, the looser our sample complexity bound will be.
The “permissible class” requirement is a mild regularity condition regarding measurability that holds for proper approximations. We refer the reader to Pollard (1984) for more details.
References
Author notes
Department of Computer Science, Columbia University, New York, NY 10027, United States. E-mail: [email protected]. This research was completed while the first author was at Carnegie Mellon University.
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States. E-mail: [email protected].