## Abstract

Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. They are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. By making assumptions about the underlying distribution that are appropriate for natural language scenarios, we are able to derive distribution-dependent sample complexity bounds for probabilistic grammars. We also give simple algorithms for carrying out empirical risk minimization using this framework in both the supervised and unsupervised settings. In the unsupervised case, we show that the problem of minimizing empirical risk is NP-hard. We therefore suggest an approximate algorithm, similar to expectation-maximization, to minimize the empirical risk.

## 1. Introduction

Learning from data is central to contemporary computational linguistics. It is common in such learning to estimate a model in a parametric family using the maximum likelihood principle. This principle applies in the supervised case (i.e., using annotated data) as well as semisupervised and unsupervised settings (i.e., using unannotated data). *Probabilistic grammars* constitute a range of such parametric families we can estimate (e.g., hidden Markov models, probabilistic context-free grammars). These parametric families are used in diverse NLP problems ranging from syntactic and morphological processing to applications like information extraction, question answering, and machine translation.

Estimation of probabilistic grammars, in many cases, indeed starts with the principle of maximum likelihood estimation (MLE). In the supervised case, and with traditional parametrizations based on multinomial distributions, MLE amounts to normalization of rule frequencies as they are observed in data. In the unsupervised case, on the other hand, algorithms such as expectation-maximization are available. MLE is attractive because it offers statistical consistency if some conditions are met (i.e., if the data are distributed according to a distribution in the family, then we will discover the correct parameters if sufficient data is available). In addition, under some conditions it is also an unbiased estimator.

An issue that has been far less explored in the computational linguistics literature is the *sample complexity* of MLE. Here, we are interested in quantifying the number of samples required to accurately learn a probabilistic grammar either in a supervised or in an unsupervised way. If bounds on the requisite number of samples (known as “sample complexity bounds”) are sufficiently tight, then they may offer guidance to learner performance, given various amounts of data and a wide range of parametric families. Being able to reason analytically about the amount of data to annotate, and the relative gains in moving to a more restricted parametric family, could offer practical advantages to language engineers.

We note that grammar learning has been studied in formal settings as a problem of *grammatical inference*—learning the structure of a grammar or an automaton (Angluin 1987; Clark and Thollard 2004; de la Higuera 2005; Clark, Eyraud, and Habrard 2008, among others). Our setting in this article is different. We assume that we have a fixed grammar, and our goal is to estimate its parameters. This approach has shown great empirical success, both in the supervised (Collins 2003; Charniak and Johnson 2005) and the unsupervised (Carroll and Charniak 1992; Pereira and Schabes 1992; Klein and Manning 2004; Cohen and Smith 2010a) settings. There has also been some discussion of sample complexity bounds for statistical parsing models, in a distribution-free setting (Collins 2004). The distribution-free setting, however, is not ideal for analysis of natural language, as it has to account for pathological cases of distributions that generate data.

We develop a framework for deriving sample complexity bounds using the maximum likelihood principle for probabilistic grammars in a distribution-dependent setting. Distribution dependency is introduced here by making empirically justified assumptions about the distributions that generate the data. Our framework uses and significantly extends ideas that have been introduced for deriving sample complexity bounds for probabilistic graphical models (Dasgupta 1997). Maximum likelihood estimation is put in the empirical risk minimization framework (Vapnik 1998) with the loss function being the log-loss. Following that, we develop a set of learning theoretic tools to explore rates of estimation convergence for probabilistic grammars. We also develop algorithms for performing empirical risk minimization.

Much research has been devoted to the problem of learning finite state automata (which can be thought of as a class of grammars) in the Probably Approximately Correct setting, leading to the conclusion that it is a very hard problem (Kearns and Valiant 1989; Pitt 1989; Terwijn 2002). Typically, the setting in these cases is different from our setting: Error is measured as the probability mass of strings that are not identified correctly by the learned finite state automaton, instead of measuring KL divergence between the automaton and the true distribution. In addition, in many cases, there is also a focus on the distribution-free setting. To the best of our knowledge, it is still an open problem whether finite state automata are learnable in the distribution-dependent setting when measuring the error as the fraction of misidentified strings. Other work (Ron 1995; Ron, Singer, and Tishby 1998; Clark and Thollard 2004; Palmer and Goldberg 2007) also gives treatment to probabilistic automata with an error measure which is more suitable for the probabilistic setting, such as Kullback-Leibler (KL) divergence or variation distance. These also focus on learning the structure of finite state machines. As mentioned earlier, in our setting we assume that the grammar is fixed, and that our goal is to estimate its parameters.

We note an important connection to an earlier study about the learnability of probabilistic automata and hidden Markov models by Abe and Warmuth (1992). In that study, the authors provided positive results for the sample complexity for learning probabilistic automata; they showed that a polynomial sample is sufficient for MLE. We demonstrate positive results for the more general class of probabilistic grammars which goes beyond probabilistic automata. Abe and Warmuth also showed that the problem of finding or even approximating the maximum likelihood solution for a two-state probabilistic automaton with an alphabet of an arbitrary size is hard. Even though these results extend to probabilistic grammars to some extent, we provide a novel proof that illustrates the NP-hardness of identifying the maximum likelihood solution for probabilistic grammars in the specific framework of “proper approximations” that we define in this article. Whereas Abe and Warmuth show that the problem of likelihood maximization for two-state HMMs is not approximable within a certain factor in time polynomial in the alphabet and the length of the observed sequence, we show that there is no polynomial algorithm (in the length of the observed strings) that identifies the maximum likelihood estimator in our framework. In our reduction, from 3-SAT to the problem of maximum likelihood estimation, the alphabet used is binary and the grammar size is proportional to the length of the formula. In Abe and Warmuth, the alphabet size varies, and the number of states is two.

This article proceeds as follows. In Section 2 we review the background necessary from Vapnik's (1998) empirical risk minimization framework. This framework is reduced to maximum likelihood estimation when a specific loss function is used: the log-loss.^{1} There are some shortcomings in using the empirical risk minimization framework in its simplest form. In its simplest form, the ERM framework is *distribution-free*, which means that we make no assumptions about the distribution that generated the data. Naively attempting to apply the ERM framework to probabilistic grammars in the distribution-free setting does not lead to the desired sample complexity bounds. The reason for this is that the log-loss diverges whenever small probabilities are allocated in the learned hypothesis to structures or strings that have a rather large probability in the probability distribution that generates the data. With a distribution-free assumption, therefore, we would have to give treatment to distributions that are unlikely to be true for natural language data (e.g., where some extremely long sentences are very probable).

To correct for this, we move to an analysis in a distribution-dependent setting, by presenting a set of assumptions about the distribution that generates the data. In Section 3 we discuss probabilistic grammars in a general way and introduce assumptions about the true distribution that are reasonable when our data come from natural language examples. It is important to note that this distribution need not be a probabilistic grammar.

The next step we take, in Section 4, is *approximating* the set of probabilistic grammars over which we maximize likelihood. This is again required in order to overcome the divergence of the log-loss for probabilities that are very small. Our approximations are based on *bounded approximations* that have been used for deriving sample complexity bounds for graphical models in a distribution-free setting (Dasgupta 1997).

Our approximations have two important properties: They are, by themselves, probabilistic grammars from the family we are interested in estimating, and they become a tighter approximation around the family of probabilistic grammars we are interested in estimating as more samples are available.

Moving to the distribution-dependent setting and defining proper approximations enables us to derive sample complexity bounds. In Section 5 we present the sample complexity results for both the supervised and unsupervised cases. A question that lingers at this point is whether it is computationally feasible to maximize likelihood in our framework even when given enough samples.

In Section 6, we describe algorithms we use to estimate probabilistic grammars in our framework, when given access to the required number of samples. We show that in the supervised case, we can indeed maximize likelihood in our approximation framework using a simple algorithm. For the unsupervised case, however, we show that maximizing likelihood is NP-hard. This fact is related to a notion known in the learning theory literature as **inherent unpredictability** (Kearns and Vazirani 1994): Accurate learning is computationally hard even with enough samples. To overcome this difficulty, we adapt the expectation-maximization algorithm (Dempster, Laird, and Rubin 1977) to approximately maximize likelihood (or minimize log-loss) in the unsupervised case with proper approximations.

In Section 7 we discuss some related ideas. These include the failure of an alternative kind of distributional assumption and connections to regularization by maximum a posteriori estimation with Dirichlet priors. Longer proofs are included in the appendices. A table of notation that is used throughout is included as Table D.1 in Appendix D.

This article builds on two earlier papers. In Cohen and Smith (2010b) we presented the main sample complexity results described here; the present article includes significant extensions, a deeper analysis of our distributional assumptions, and a discussion of variants of these assumptions, as well as related work, such as that about the Tsybakov noise condition. In Cohen and Smith (2010c) we proved NP-hardness for unsupervised parameter estimation of probabilistic context-free grammars (PCFGs) (without approximate families). The present article uses a similar type of proof to achieve results adapted to empirical risk minimization in our approximation framework.

## 2. Empirical Risk Minimization and Maximum Likelihood Estimation

We begin by introducing some notation. We seek to construct a predictive model that maps inputs from space $\mathcal{X}$ to outputs from space $\mathcal{Z}$. In this work, $\mathcal{X}$ is a set of strings using some alphabet $\Sigma$, and $\mathcal{Z}$ is a set of derivations allowed by a grammar (e.g., a context-free grammar). We assume the existence of an unknown joint probability distribution *p*(*x,z*) over $\mathcal{X} \times \mathcal{Z}$. (For the most part, we will be discussing discrete input and output spaces. This means that *p* will denote a probability mass function.) We are interested in estimating the distribution *p* from examples, either in a supervised setting, where we are provided with examples of the form $(x, z) \in \mathcal{X} \times \mathcal{Z}$, or in the unsupervised setting, where we are provided only with examples of the form $x \in \mathcal{X}$. We first consider the supervised setting and return to the unsupervised setting in Section 5. We will use *q* to denote the estimated distribution.

In order to estimate *p* as accurately as possible using *q*(*x,z*), we are interested in minimizing the *log-loss*, that is, in finding *q*_{opt}, from a fixed family of distributions $\mathcal{Q}$ (also called “the concept space”), such that

$$q_{\text{opt}} = \operatorname*{argmin}_{q \in \mathcal{Q}} E_p[-\log q] = \operatorname*{argmin}_{q \in \mathcal{Q}} \; -\sum_{x,z} p(x,z) \log q(x,z). \tag{1}$$

Note that if $p \in \mathcal{Q}$, then this quantity achieves the minimum when *q*_{opt} = *p*, in which case the value of the log-loss is the entropy of *p*. Indeed, more generally, this optimization is equivalent to finding *q* such that it minimizes the KL divergence from *p* to *q*.
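As a numerical illustration of the equivalence between minimizing the log-loss and minimizing KL divergence, the following Python sketch checks that the expected log-loss of *q* under *p* splits into the entropy of *p* plus the KL divergence from *p* to *q*. The three outcomes and their probabilities are hypothetical values chosen for illustration only, and the function names are ours.

```python
import math

def expected_log_loss(p, q):
    """E_p[-log q]: expected log-loss of model q under the true distribution p."""
    return -sum(p[e] * math.log(q[e]) for e in p if p[e] > 0)

def entropy(p):
    """Entropy of p (natural logarithm)."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def kl_divergence(p, q):
    """KL divergence from p to q."""
    return sum(p[e] * math.log(p[e] / q[e]) for e in p if p[e] > 0)

# Hypothetical toy distributions over three (string, derivation) pairs.
p = {("a", "z1"): 0.5, ("ab", "z2"): 0.3, ("abb", "z3"): 0.2}
q = {("a", "z1"): 0.4, ("ab", "z2"): 0.4, ("abb", "z3"): 0.2}

risk = expected_log_loss(p, q)
# The log-loss decomposes into the entropy of p plus KL(p || q).
assert abs(risk - (entropy(p) + kl_divergence(p, q))) < 1e-12
# With q = p the KL term vanishes and the risk equals the entropy of p.
assert abs(expected_log_loss(p, p) - entropy(p)) < 1e-12
```

Because the entropy term does not depend on *q*, any *q* that lowers the expected log-loss lowers the KL divergence by the same amount.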

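Because *p* is unknown, we cannot hope to minimize the log-loss directly. Given a set of examples (*x*_{1},*z*_{1}),…,(*x*_{n},*z*_{n}), however, there is a natural candidate, the empirical distribution $\tilde{p}_n$, for use in Equation (1) instead of *p*, defined as:

$$\tilde{p}_n(x,z) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{(x,z) = (x_i,z_i)\}, \tag{2}$$

where $\mathbb{I}\{(x,z) = (x_i,z_i)\}$ is 1 if (*x,z*) = (*x*_{i},*z*_{i}) and 0 otherwise.^{2} We then set up the problem as the problem of **empirical risk minimization** (ERM), that is, trying to find *q** in the concept space $\mathcal{Q}$ such that

$$q^* = \operatorname*{argmin}_{q \in \mathcal{Q}} E_{\tilde{p}_n}[-\log q] = \operatorname*{argmin}_{q \in \mathcal{Q}} \; -\frac{1}{n} \sum_{i=1}^{n} \log q(x_i,z_i) = \operatorname*{argmax}_{q \in \mathcal{Q}} \; \frac{1}{n} \sum_{i=1}^{n} \log q(x_i,z_i). \tag{3}$$

Equation (3) immediately shows that minimizing empirical risk using the log-loss is equivalent to maximizing likelihood, which is a common statistical principle used for estimating a probabilistic grammar in computational linguistics (Charniak 1993; Manning and Schütze 1999).^{3}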
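The equivalence between empirical risk minimization with the log-loss and maximum likelihood can be checked on a toy case. The sketch below assumes the simplest possible concept space, a single unconstrained multinomial over three outcomes, and a hypothetical sample of six labeled examples; neither is taken from our experiments. In this setting the relative frequencies (the empirical distribution) minimize the empirical risk.

```python
import math
from collections import Counter

def empirical_risk(q, sample):
    """Average log-loss of q on the sample: -(1/n) * sum_i log q(example_i)."""
    return -sum(math.log(q[e]) for e in sample) / len(sample)

# Hypothetical sample of n = 6 fully observed examples, each reduced to a label.
sample = ["z1", "z1", "z2", "z1", "z3", "z2"]
n = len(sample)

# Empirical distribution: relative frequency of each observed example.
p_tilde = {e: c / n for e, c in Counter(sample).items()}

# A few arbitrary competitors from the same multinomial family.
candidates = [
    {"z1": 0.4, "z2": 0.4, "z3": 0.2},
    {"z1": 0.6, "z2": 0.2, "z3": 0.2},
    {"z1": 1 / 3, "z2": 1 / 3, "z3": 1 / 3},
]

# The relative-frequency estimate attains the lowest empirical risk,
# which is the same as attaining the highest likelihood.
best = empirical_risk(p_tilde, sample)
assert all(best <= empirical_risk(q, sample) for q in candidates)
```

For probabilistic grammars the same normalization is applied separately to each multinomial, which is why supervised MLE amounts to normalizing rule frequencies.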

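We are interested in bounding the excess risk for *q**, $\mathcal{E}_p(q^*; \mathcal{Q})$, where the excess risk of a distribution *q* is the gap between its expected log-loss and the best expected log-loss attainable in the concept space $\mathcal{Q}$: $\mathcal{E}_p(q; \mathcal{Q}) = E_p[-\log q] - \min_{q' \in \mathcal{Q}} E_p[-\log q']$. The excess risk is reduced to KL divergence between *p* and *q* if $p \in \mathcal{Q}$, because in this case the quantity $\min_{q' \in \mathcal{Q}} E_p[-\log q']$ is minimized with *q*′ = *p*, and equals the entropy of *p*. In a typical case, where we do not necessarily have $p \in \mathcal{Q}$, the excess risk of *q* is bounded from above by the KL divergence between *p* and *q*.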

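We can bound the excess risk by showing two-sided uniform convergence of the empirical process $R_n(\mathcal{Q})$. If, for any *ε* > 0 and large enough *n*, it holds that

$$R_n(\mathcal{Q}) = \sup_{q \in \mathcal{Q}} \left| E_{\tilde{p}_n}[-\log q] - E_p[-\log q] \right| < \varepsilon \tag{6}$$

(with high probability), where $\tilde{p}_n$ is the empirical distribution, then we can “sandwich” the following quantities:

$$E_p[-\log q_{\text{opt}}] \le E_p[-\log q^*] \tag{7}$$

$$\le E_{\tilde{p}_n}[-\log q^*] + \varepsilon \le E_{\tilde{p}_n}[-\log q_{\text{opt}}] + \varepsilon \le E_p[-\log q_{\text{opt}}] + 2\varepsilon, \tag{8}$$

where the inequalities come from the fact that *q*_{opt} minimizes the expected risk $E_p[-\log q]$ for $q \in \mathcal{Q}$, and *q** minimizes the empirical risk $E_{\tilde{p}_n}[-\log q]$ for $q \in \mathcal{Q}$. The consequence of Equations (7) and (8) is that the expected risk of *q** is at most 2*ε* away from the expected risk of *q*_{opt}, and as a result, we find the excess risk $\mathcal{E}_p(q^*; \mathcal{Q})$, for large enough *n*, is smaller than 2*ε*. Intuitively, this means that, under a large sample, *q** does not give much worse results than *q*_{opt} under the criterion of the log-loss.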

Unfortunately, the regularity conditions which are required for the convergence of the empirical process do not hold, because the log-loss can be unbounded. This means that the empirical process has to be modified in a way that will actually guarantee some kind of convergence. We give a treatment to this in the next section.

We note that all discussion of convergence in this section has been about convergence *in probability*. For example, we want Equation (6) to hold with high probability—for most samples of size *n*. We will make this notion more rigorous in Section 2.2.

### 2.1 Empirical Risk Minimization and Structural Risk Minimization Methods

It has been noted in the literature (Vapnik 1998; Koltchinskii 2006) that often the class $\mathcal{Q}$ is too complex for empirical risk minimization using a fixed number of data points. It is therefore desirable in these cases to create a family of subclasses $\{\mathcal{Q}_\alpha\}_{\alpha \in A}$ that have increasing complexity. The more data we have, the more complex our $\mathcal{Q}_\alpha$ can be for empirical risk minimization. Structural risk minimization (Vapnik 1998) and the method of sieves (Grenander 1981) are examples of methods that adopt such an approach. Structural risk minimization, for example, can be represented in many cases as a penalization of the empirical risk method, using a regularization term.

In our case, the level of “complexity” is related to allocation of small probabilities to derivations in the grammar by a distribution $q \in \mathcal{Q}$. The basic problem is this: Whenever we have a derivation with a small probability, the log-loss becomes very large (in absolute value), and this makes it hard to show the convergence of the empirical process $R_n(\mathcal{Q})$. Because grammars can define probability distributions over infinitely many discrete outcomes, probabilities can be arbitrarily small and log-loss can be arbitrarily large.

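To address this issue with the complexity of $\mathcal{Q}$, we define in Section 4 a series of approximations $\mathcal{Q}_1, \mathcal{Q}_2, \ldots$ to the family of probabilistic grammars. Our framework for empirical risk minimization is then set up to minimize the empirical risk with respect to $\mathcal{Q}_n$, where *n* is the number of samples we draw for the learner:

$$q_n^* = \operatorname*{argmin}_{q \in \mathcal{Q}_n} E_{\tilde{p}_n}[-\log q]. \tag{9}$$

We are then interested in the convergence of the empirical process

$$R_n(\mathcal{Q}_n) = \sup_{q \in \mathcal{Q}_n} \left| E_{\tilde{p}_n}[-\log q] - E_p[-\log q] \right|. \tag{10}$$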

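In Section 4 we show that the minimizer $q_n^*$ of the empirical risk over the *n*th approximation is an *asymptotic* empirical risk minimizer (in our specific framework), which means that, as *n* grows, its empirical risk approaches the smallest empirical risk attainable in the full class $\mathcal{Q}$. Together with the convergence of the empirical process, asymptotic empirical risk minimization implies that the expected risk of $q_n^*$ approaches that of *q*_{opt}.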

### 2.2 Sample Complexity Bounds

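Knowing that we are interested in the convergence of $R_n(\mathcal{Q}_n) = \sup_{q \in \mathcal{Q}_n} \left| E_{\tilde{p}_n}[-\log q] - E_p[-\log q] \right|$, a natural question to ask is: “At what *rate* does this empirical process converge?”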

Because the quantity $R_n(\mathcal{Q}_n)$ is a random variable, we need to give a probabilistic treatment to its convergence. More specifically, we ask the question that is typically asked when learnability is considered (Vapnik 1998): “How many samples *n* are required so that with probability 1 − δ we have $R_n(\mathcal{Q}_n) < \varepsilon$?” Bounds on this number of samples are also called “sample complexity bounds,” and in a distribution-free setting they are described as a function $N(\varepsilon, \delta, \mathcal{Q})$, independent of the distribution *p* that generates the data.

A complete distribution-free setting is not appropriate for analyzing natural language. This setting poses technical difficulties with the convergence of $R_n(\mathcal{Q}_n)$ and needs to take into account pathological cases that can be ruled out in natural language data. Instead, we will make assumptions about *p*, parametrize these assumptions in several ways, and then calculate sample complexity bounds of the form $N(\varepsilon, \delta, \mathcal{Q}, p)$, where the dependence on the distribution is expressed as dependence on the parameters in the assumptions about *p*.

The learning setting, then, can be described as follows. The user decides on a level of accuracy (*ε*) which the learning algorithm has to reach with confidence (1 − δ). Then, samples are drawn from *p* and presented to the learning algorithm. The learning algorithm then returns a hypothesis according to Equation (9).

## 3. Probabilistic Grammars

We begin this section by discussing the family of probabilistic grammars. A probabilistic grammar defines a probability distribution over a certain kind of structured object (a derivation of the underlying symbolic grammar) explained step-by-step as a stochastic process. Hidden Markov models (HMMs), for example, can be understood as a random walk through a probabilistic finite-state network, with an output symbol sampled at each state. PCFGs generate phrase-structure trees by recursively rewriting nonterminal symbols as sequences of “child” symbols (each itself either a nonterminal symbol or a terminal symbol analogous to the emissions of an HMM).

Each step or emission of an HMM and each rewriting operation of a PCFG is conditionally independent of the others given a single structural element (one HMM or PCFG state); this Markov property permits efficient inference over derivations given a string.

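A probabilistic grammar 〈*G*, θ〉 defines the joint probability of a string *x* and a grammatical derivation *z*:

$$q(x,z \mid \theta, G) = \prod_{k=1}^{K} \prod_{i=1}^{N_k} \theta_{k,i}^{\psi_{k,i}(x,z)} = \exp \sum_{k=1}^{K} \sum_{i=1}^{N_k} \psi_{k,i}(x,z) \log \theta_{k,i}, \tag{11}$$

where *ψ*_{k,i} is a function that “counts” the number of times the *k*th distribution's *i*th event occurs in the derivation. The parameters θ are a collection of *K* multinomials 〈θ_{1}, … , θ_{K}〉, the *k*th of which includes *N*_{k} competing events. If we let θ_{k} = 〈*θ*_{k,1}, … , *θ*_{k,N_k}〉, each *θ*_{k,i} is a probability, such that

$$\forall k, \forall i: \; \theta_{k,i} \ge 0, \qquad \forall k: \; \sum_{i=1}^{N_k} \theta_{k,i} = 1.$$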
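The product form of Equation (11) can be made concrete with a short Python sketch. The grammar, its rule probabilities, and the derivation below are hypothetical and serve only to illustrate how the event counts *ψ*_{k,i} enter the computation; the function name is ours.

```python
import math

def derivation_log_prob(theta, counts):
    """log q(x, z | theta, G) = sum over (k, i) of psi_{k,i}(x, z) * log theta_{k,i}."""
    return sum(c * math.log(theta[k][i]) for (k, i), c in counts.items())

# Hypothetical PCFG with K = 2 multinomials, one per nonterminal.
theta = {
    "S": {"S -> S S": 0.3, "S -> A": 0.7},
    "A": {"A -> a": 0.6, "A -> b": 0.4},
}
# Each multinomial must sum to one.
assert all(abs(sum(m.values()) - 1.0) < 1e-12 for m in theta.values())

# Event counts psi_{k,i} for the derivation S => S S => A A => a b.
counts = {
    ("S", "S -> S S"): 1,
    ("S", "S -> A"): 2,
    ("A", "A -> a"): 1,
    ("A", "A -> b"): 1,
}

prob = math.exp(derivation_log_prob(theta, counts))
# Same value as multiplying the probabilities of the rules used.
assert abs(prob - 0.3 * 0.7**2 * 0.6 * 0.4) < 1e-12

# Number of event tokens in the derivation.
derivation_length = sum(counts.values())
```

Working in log space, as in the exponential form of Equation (11), avoids numerical underflow for long derivations.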

We denote by Θ_{*G*} this parameter space for θ. The grammar *G* dictates the support of *q* in Equation (11). As is often the case in probabilistic modeling, there are different ways to carve up the random variables. We can think of *x* and *z* as correlated structure variables (often *x* is known if *z* is known), or the derivation event counts as an integer-vector random variable. In this article, we assume that *x* is always a deterministic function of *z*, so we use the distribution *p*(*z*) interchangeably with *p*(*x,z*).

Note that there may be many derivations *z* for a given string *x*—perhaps even infinitely many in some kinds of grammars. For HMMs, there are three kinds of multinomials: a starting state multinomial, a transition multinomial per state and an emission multinomial per state. In that case *K* = 2*s* + 1, where *s* is the number of states. The value of *N*_{k} depends on whether the *k*th multinomial is the starting state multinomial (in which case *N*_{k} = *s*), transition multinomial (*N*_{k} = *s*), or emission multinomial (*N*_{k} = *t*, with *t* being the number of symbols in the HMM). For PCFGs, each multinomial among the *K* multinomials corresponds to a set of *N*_{k} context-free rules headed by the same nonterminal. The parameter *θ*_{k,i} is then the probability of the *i*th rule for the *k*th nonterminal.
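As a small worked instance of the HMM bookkeeping above, the following sketch lists the sizes *N*_{k} of the multinomials for an HMM and checks that their number is 2*s* + 1. The function name and the example values *s* = 3 and *t* = 5 are hypothetical.

```python
def hmm_multinomial_sizes(s, t):
    """Return the list of N_k for an HMM with s states and t output symbols."""
    start = [s]            # one starting-state multinomial over s states
    transitions = [s] * s  # one transition multinomial per state
    emissions = [t] * s    # one emission multinomial per state
    return start + transitions + emissions

sizes = hmm_multinomial_sizes(s=3, t=5)
K = len(sizes)
assert K == 2 * 3 + 1

# Total number of event types across all multinomials: 3 + 3*3 + 3*5.
total_event_types = sum(sizes)
```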

We assume that *G* denotes a fixed grammar, such as a context-free or regular grammar. We let $N = \sum_{k=1}^{K} N_k$ denote the total number of derivation event types. We use *D*(*G*) to denote the set of all possible derivations of *G*. We define *D*_{x}(*G*) = {*z* ∈ *D*(*G*) | yield(*z*) = *x*}. We use deg(*G*) to denote the “degree” of *G*, i.e., deg(*G*) = max_{k} *N*_{k}. We let |*x*| denote the length of the string *x*, and $|z| = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \psi_{k,i}(z)$ denote the “length” (number of event tokens) of the derivation *z*.

Going back to the notation in Section 2, $\mathcal{Q}$ would be a collection of probabilistic grammars, parametrized by θ, and *q* would be a specific probabilistic grammar with a specific θ. We therefore treat the problem of ERM with probabilistic grammars as the problem of parameter estimation, that is, identifying θ from complete data or incomplete data (strings *x* are visible but the derivations *z* are not). We can also view parameter estimation as the identification of a *hypothesis* from the concept space $\mathcal{Q} = \mathcal{H}(G) = \{h_\theta(z) \mid \theta \in \Theta_G\}$ (where *h*_{θ} is a distribution of the form of Equation [11]) or, equivalently, from negated log-concept space $\mathcal{F}(G) = \{-\log h_\theta(z) \mid \theta \in \Theta_G\}$. For simplicity of notation, we assume that there is a fixed grammar *G* and use $\mathcal{H}$ to refer to $\mathcal{H}(G)$ and $\mathcal{F}$ to refer to $\mathcal{F}(G)$.

### 3.1 Distributional Assumptions about Language

In this section, we describe a parametrization of assumptions we make about the distribution *p*(*x,z*), the distribution that generates derivations from *D*(*G*) (note that *p* does not have to be a probabilistic grammar). We first describe empirical evidence about the decay of the frequency of long strings *x*.

Figure 1 shows the frequency of sentence lengths in treebanks for several languages.^{4} The trend in the plots clearly shows that in the extended tail of the curve, all languages have an exponential decay of probabilities as a function of sentence length. To test this, we performed a simple regression of frequencies using an exponential curve. We estimated each curve for each language using a curve of the form *f*(*l*; *c*, *α*) = *cl*^{α}. This estimation was done by minimizing squared error between the frequency versus sentence length curve and the approximate version of this curve. The data points used for the approximation are (*l*_{i}, *p*_{i}), where *l*_{i} denotes sentence length and *p*_{i} denotes frequency, selected from the extended tail of the distribution. Extended tail here refers to all points with length longer than *l*_{1}, where *l*_{1} is the length with the highest frequency in the treebank. The goal of focusing on the tail is to avoid approximating the head of the curve, which is actually a monotonically increasing function. We plotted the approximate curve together with a length versus frequency curve for new syntactic data. It can be seen (Figure 1) that the approximation is rather accurate in these corpora.

As a consequence of this observation, we make a few assumptions about *G* and *p*(*x,z*):

- Derivation length proportional to sentence length: There is an *α* ≥ 1 such that, for all *z*, |*z*| ≤ *α*|yield(*z*)|. Further, |*z*| ≥ |*x*|. (This prohibits unary cycles.)
- Exponential decay of derivations: There is a constant *r* < 1 and a constant *L* ≥ 0 such that *p*(*z*) ≤ *Lr*^{|z|}. Note that the assumption here is about the frequency of length of separate derivations, and not the aggregated frequency of all sentences of a certain length (cf. the discussion above referring to Figure 1).
- Exponential decay of strings: Let *Λ*(*k*) = |{*z* ∈ *D*(*G*) | |*z*| = *k*}| be the number of derivations of length *k* in *G*. We assume that *Λ*(*k*) is an increasing function, and complete it such that it is defined over positive numbers by taking $\Lambda(t) \triangleq \Lambda(\lceil t \rceil)$. Taking *r* as before, we assume there exists a constant *q* < 1, such that *Λ*^{2}(*k*)*r*^{k} ≤ *q*^{k} (and as a consequence, *Λ*(*k*)*r*^{k} ≤ *q*^{k}). This implies that the number of derivations of length *k* may be exponentially large (e.g., as with many PCFGs), but is bounded by (*q*/*r*)^{k}.
- Bounded expectations of rules: There is a *B* < ∞ such that $E_p[\psi_{k,i}(z)] \le B$ for all *k* and *i*.

These assumptions must hold for any *p* whose support consists of a finite set. These assumptions also hold in many cases when *p* itself is a probabilistic grammar. Also, we note that the last requirement of bounded expectations is optional, and it can be inferred from the rest of the requirements: *B* = *L*/(1 − *q*)^{2}. We make this requirement explicit for simplicity of notation later. We denote the family of distributions that satisfy all of these requirements by $\mathcal{P}(\alpha, L, r, q, B, G)$.
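To see why the bound on expected event counts can be inferred from the other requirements, the following derivation sketches one route to *B* = *L*/(1 − *q*)^{2}. It assumes that the bounded-expectations requirement reads $E_p[\psi_{k,i}(z)] \le B$, writes $\ell$ for derivation length (to avoid a clash with the multinomial index *k*), and uses only the exponential decay assumptions; it is a sketch under these assumptions, not a formal proof.

$$
\begin{aligned}
E_p[\psi_{k,i}(z)] \;\le\; E_p[|z|] &= \sum_{\ell \ge 1} \ell \sum_{z : |z| = \ell} p(z) \\
&\le \sum_{\ell \ge 1} \ell \, \Lambda(\ell) \, L r^{\ell} \;\le\; L \sum_{\ell \ge 1} \ell \, q^{\ell} \;=\; \frac{L q}{(1-q)^2} \;\le\; \frac{L}{(1-q)^2}.
\end{aligned}
$$

The first inequality holds because a single event cannot occur more often than the total number of event tokens |*z*|. The second uses *p*(*z*) ≤ *Lr*^{|z|} together with the definition of *Λ*, and the third uses *Λ*(*k*)*r*^{k} ≤ *q*^{k}.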

There are other cases in the literature of language learning where additional assumptions are made on the learned family of models in order to obtain positive learnability results. For example, Clark and Thollard (2004) put a bound on the expected length of strings generated from any state of probabilistic finite state automata, which resembles the exponential decay of strings we have for *p* in this article.

An immediate consequence of these assumptions is that the entropy of *p* is finite and bounded by a quantity that depends on *L*, *r* and *q*.^{5} Bounding entropy of labels (derivations) given inputs (sentences) is a common way to quantify the *noise* in a distribution. Here, both the *sentential* entropy (*H*_{s}(*p*) = − ∑_{x}*p*(*x*) log *p*(*x*)) and the *derivational* entropy (*H*_{d}(*p*) = − ∑_{x,z}*p*(*x,z*) log *p*(*x,z*)) are bounded. This is stated in the following result.

**Proposition 1**

**Proof**

The inequality *H*_{s}(*p*) ≤ *H*_{d}(*p*) holds by the data processing inequality (Cover and Thomas 1991) because the sentential probability distribution *p*(*x*) is a coarser version of the derivational probability distribution *p*(*x,z*). Now, consider *p*(*x,z*). For simplicity of notation, we use *p*(*z*) instead of *p*(*x,z*). The yield of *z*, *x*, is a function of *z*, and therefore can be omitted from the distribution. It holds that

$$H_d(p) = -\sum_{z \in Z_1} p(z) \log p(z) - \sum_{z \in Z_2} p(z) \log p(z) = H_d(p, Z_1) + H_d(p, Z_2),$$

where *Z*_{1} = {*z* | *p*(*z*) > 1/*e*} and *Z*_{2} = {*z* | *p*(*z*) ≤ 1/*e*}. Note that the function −*α* log *α* reaches its maximum for *α* = 1/*e*. We therefore have

$$H_d(p, Z_1) = -\sum_{z \in Z_1} p(z) \log p(z) \le \sum_{z \in Z_1} \frac{1}{e} = \frac{|Z_1|}{e}.$$

We give a bound on |*Z*_{1}|, the number of “high probability” derivations. Because we have *p*(*x,z*) ≤ *Lr*^{|z|}, we can find the maximum length of a derivation that has a probability of more than 1/*e* (and hence, it may appear in *Z*_{1}) by solving 1/*e* ≤ *Lr*^{|z|} for |*z*|, which leads to |*z*| ≤ log(1/*eL*)/log *r*. Therefore, there are at most $\sum_{k=1}^{\log(1/eL)/\log r} \Lambda(k)$ derivations in *Z*_{1} and therefore we have

$$H_d(p, Z_1) \le \frac{|Z_1|}{e} \le \frac{\log(1/eL)}{e \log r} \, \Lambda\!\left(\frac{\log(1/eL)}{\log r}\right), \tag{12}$$

where we use the monotonicity of *Λ*. Consider *H*_{d}(*p*, *Z*_{2}) (the “low probability” derivations). This term is bounded in Equations (13) and (14), where Equation (13) holds from the assumptions about *p*. Putting Equation (12) and Equation (14) together, we obtain the result.▪

We note that another common way to quantify the noise in a distribution is through the notion of Tsybakov noise (Tsybakov 2004; Koltchinskii 2006). We discuss this further in Section 7.1, where we show that Tsybakov noise is too permissive, and probabilistic grammars do not satisfy its conditions.

### 3.2 Limiting the Degree of the Grammar

When approximating a family of probabilistic grammars, it is much more convenient when the degree of the grammar is limited. In this article, we limit the degree of the grammar by making the assumption that all *N*_{k} ≤ 2. This assumption may seem, at first glance, somewhat restrictive, but we show next that for PCFGs (and as a consequence, other formalisms), this assumption does not limit the total generative capacity that we can have across all context-free grammars.

We first show that any context-free grammar with arbitrary degree can be mapped to a corresponding grammar with all *N*_{k} ≤ 2 that generates derivations equivalent to derivations in the original grammar. Such a grammar is also called a “covering grammar” (Nijholt 1980; Leermakers 1989). Let *G* be a CFG. Let *A* be the *k*th nonterminal. Consider the rules *A* → *α*_{i} for *i* ≤ *N*_{k} where *A* appears on the left side. For each rule *A* → *α*_{i}, *i* < *N*_{k}, we create a new nonterminal *A*_{i} in *G*′ such that *A*_{i} has two rewrite rules: *A*_{i} → *α*_{i} and *A*_{i} → *A*_{i+1}. In addition, we create rules *A* → *A*_{1} and *A*_{N_k} → *α*_{N_k}. Figure 2 demonstrates an example of this transformation on a small context-free grammar.
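The transformation just described can be sketched in a few lines of Python. The data layout (a dictionary from each nonterminal to its list of right-hand sides), the naming scheme `A_i` for the new nonterminals, and the example grammar are our own illustrative choices. The sketch also assumes that the chain is closed by a rule from the last new nonterminal to the last right-hand side, that every nonterminal has at least one rule, and that the generated names do not collide with existing symbols.

```python
def binarize(rules):
    """Map a CFG to a covering grammar in which every nonterminal has <= 2 rules.

    `rules` maps each nonterminal A to its list of right-hand sides
    [alpha_1, ..., alpha_N]; each right-hand side is a tuple of symbols.
    """
    new_rules = {}
    for a, rhss in rules.items():
        n = len(rhss)
        chain = [f"{a}_{i}" for i in range(1, n + 1)]  # new nonterminals A_1..A_N
        new_rules[a] = [(chain[0],)]                   # A -> A_1
        for i in range(n - 1):
            # A_i -> alpha_i  and  A_i -> A_{i+1}
            new_rules[chain[i]] = [rhss[i], (chain[i + 1],)]
        new_rules[chain[-1]] = [rhss[-1]]              # A_N -> alpha_N (assumed closing rule)
    return new_rules

# Hypothetical grammar of degree 3.
grammar = {"S": [("S", "S"), ("a",), ("b",)]}
covering = binarize(grammar)

# Every nonterminal of the covering grammar has at most two rules.
assert max(len(r) for r in covering.values()) <= 2
```

Collapsing each chain of `A_i` nonterminals back into a single application of the original rule recovers the derivation in the original grammar.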

It is easy to verify that the resulting grammar *G*′ has an equivalent capacity to the original CFG, *G*. A simple transformation that converts each derivation in the new grammar to a derivation in the old grammar would involve collapsing any path of nonterminals added to *G*′ (i.e., all *A*_{i} for nonterminal *A*) so that we end up with nonterminals from the original grammar only. Similarly, any derivation in *G* can be converted to a derivation in *G*′ by adding new nonterminals through unary application of rules of the form *A*_{i} → *A*_{i+1}. Given a derivation *z* in *G*, we denote by $\Upsilon_{G \mapsto G'}(z)$ the corresponding derivation in *G*′ after adding the new nonterminals *A*_{i} to *z*. Throughout this article, we will refer to the normalized form of *G*′ as a “binary normal form.”^{6}

Note that *K*′, the number of multinomials in the binary normal form, is a function of both the number of nonterminals in the original grammar and the number of rules in that grammar. More specifically, counting one multinomial per nonterminal of *G*′ (the *K* original nonterminals plus the *N*_{k} new nonterminals introduced for the *k*th of them), we have that $K' = K + \sum_{k=1}^{K} N_k$. To make the equivalence complete, we need to show that any *probabilistic* context-free grammar can be translated to a PCFG with max_{k}*N*_{k} ≤ 2 such that the two PCFGs induce equivalent distributions over derivations.

**Lemma 1**

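Let *a*_{i} ∈ [0,1], *i* ∈ {1, … , *N*} such that ∑_{i}*a*_{i} = 1. Define *b*_{1} = *a*_{1}, *c*_{1} = 1 − *a*_{1}, $b_i = \left(\frac{a_i}{a_{i-1}}\right)\left(\frac{b_{i-1}}{c_{i-1}}\right)$, and *c*_{i} = 1 − *b*_{i} for *i* ≥ 2. Then $a_i = \left(\prod_{j=1}^{i-1} c_j\right) b_i$.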

See Appendix A for the proof of Utility Lemma 1.
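The lemma can be checked numerically on a small example. The sketch below assumes that the recurrence is *b*_{i} = (*a*_{i}/*a*_{i−1})(*b*_{i−1}/*c*_{i−1}) and that the conclusion is *a*_{i} = (*c*_{1} ⋯ *c*_{i−1}) *b*_{i}; the input vector is hypothetical, and all *a*_{i} are taken to be strictly positive so that no division by zero occurs.

```python
def stick_breaking(a):
    """Given a_1..a_N (positive, summing to one), return lists (b, c) with
    b_1 = a_1, c_i = 1 - b_i, and b_i = (a_i / a_{i-1}) * (b_{i-1} / c_{i-1})."""
    b, c = [a[0]], [1.0 - a[0]]
    for i in range(1, len(a)):
        b_i = (a[i] / a[i - 1]) * (b[i - 1] / c[i - 1])
        b.append(b_i)
        c.append(1.0 - b_i)
    return b, c

a = [0.5, 0.3, 0.2]
b, c = stick_breaking(a)

# Check a_i = (c_1 * ... * c_{i-1}) * b_i for every i.
prefix = 1.0
for a_i, b_i, c_i in zip(a, b, c):
    assert abs(a_i - prefix * b_i) < 1e-12
    prefix *= c_i
```

Read in terms of the binarized grammar, *b*_{i} plays the role of the probability of stopping at the *i*th new nonterminal and *c*_{i} that of continuing along the chain, so the product recovers the probability of the original rule.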

**Theorem 1**

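Let 〈*G*, θ〉 be a probabilistic context-free grammar. Let *G*′ be the binarizing transformation of *G* as defined earlier. Then, there exists θ′ for *G*′ such that for any *z* ∈ *D*(*G*) we have $p(z \mid \theta, G) = p(\Upsilon_{G \mapsto G'}(z) \mid \theta', G')$, where $\Upsilon_{G \mapsto G'}(z)$ is the derivation in *G*′ that corresponds to *z*.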

**Proof**

For the grammar *G*, index the set {1, …,*K*} with nonterminals ranging from *A*_{1} to *A*_{K}. Define *G*′ as before. We need to define θ′. Index the multinomials in *G*′ by (*k*,*i*), each having two events. Let *μ*_{(k,i),1} = *θ*_{k,i}, *μ*_{(k,i),2} = 1 − *θ*_{k,i} for *i* = 1 and set *μ*_{(k,i),1} = *θ*_{k,i}/*μ*_{(k,i − 1),2}, and *μ*_{(k,i),2} = 1 − *μ*_{(k,i),1} for *i* > 1.

From Chi (1999), we know that the weighted grammar 〈*G*′, μ〉 can be converted to a probabilistic context-free grammar 〈*G*′, θ′〉, through a construction of θ′ based on μ, such that *p*(*z*′ | μ, *G*′) = *p*(*z*′ | θ′, *G*′).▪

The proof for Theorem 1 gives a construction of the parameters θ′ of *G*′ such that 〈*G*, θ〉 is equivalent to 〈*G*′, θ′〉. The construction of θ′ can also be reversed: Given θ′ for *G*′, we can construct θ for *G* so that again we have equivalence between 〈*G*, θ〉 and 〈*G*′, θ′〉.
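The construction and its reversal can be illustrated on a single multinomial. The following is a minimal sketch, not the article's own code: the function names `binarize` and `unbinarize` are hypothetical, and for simplicity the chain keeps one binomial per event (so the last "continue" weight is 0), which is an assumption about the layout of *G*′ rather than something stated above.

```python
from math import prod

def binarize(theta):
    """Split a multinomial into a chain of binomials (b_i, c_i).

    b_i is the probability of choosing event i given that all earlier
    events were skipped; c_i = 1 - b_i is the probability of moving on.
    """
    stop, cont = [], []
    remaining = 1.0  # equals the product of all c_j computed so far
    for a in theta:
        # Guard against division by zero and tiny floating-point overshoot.
        b = min(a / remaining, 1.0) if remaining > 0 else 0.0
        stop.append(b)
        cont.append(1.0 - b)
        remaining *= 1.0 - b
    return stop, cont

def unbinarize(stop, cont):
    """Reverse direction: recover a_i = b_i * prod_{j<i} c_j."""
    return [stop[i] * prod(cont[:i]) for i in range(len(stop))]

theta = [0.09, 0.11, 0.8]
stop, cont = binarize(theta)
recovered = unbinarize(stop, cont)
```

Here `recovered` agrees with `theta` up to floating-point error, which mirrors the claim that the two parametrizations assign the same probability to corresponding derivations.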

In this section, we focused on presenting parametrized, empirically justified distributional assumptions about language data that will make the analysis in later sections more manageable. We showed that these assumptions bound the amount of entropy as a function of the assumption parameters. We also made an assumption about the *structure* of the grammar family, and showed that it entails no loss of generality for CFGs. Many other formalisms can follow similar arguments to show that the structural assumption is justified for them as well.

## 4. Proper Approximations

In order to follow the empirical risk minimization described in Section 2.1, we have to define a series of approximations for , which we denote by the log-concept spaces . We also have to replace two-sided uniform convergence (Equation [16]) with convergence on the sequence of concept spaces we defined (Equation [10]). The concept spaces in the sequence vary as a function of the number of samples we have. We next construct the sequence of concept spaces, and in Section 5 we return to the learning model. Our approximations are based on the concept of *bounded approximations* (Abe, Takeuchi, and Warmuth 1991; Dasgupta 1997), which were originally designed for graphical models.^{7} A bounded approximation is a subset of a concept space which is controlled by a parameter that determines its tightness. Here we use this idea to define a series of subsets of the original concept space as approximations, while having two asymptotic properties that control the series' tightness.

Let (for *m* ∈ {1, 2, …}) be a sequence of concept spaces. We consider three properties of elements of this sequence, which should hold for *m* > *M* for a fixed *M*.

**containment**: every concept space in the sequence is a subset of the original concept space.

**boundedness**: where *ε*_{bound} is a non-increasing function such that . This states that the expected values of functions from on values larger than some *K*_{m} are small. This is required to obtain uniform convergence results in the revised empirical risk minimization model from Section 2.1. Note that *K*_{m} can grow arbitrarily large.

**tightness**: where *ε*_{tail} is a non-increasing function such that , and *C*_{m} denotes an operator that maps functions in to . This ensures that our approximation actually converges to the original concept space . We will show in Section 4.3 that this is actually a well-motivated characterization of convergence for probabilistic grammars in the supervised setting.

We say that the sequence *properly approximates* if there exist *ε*_{tail}(*m*), *ε*_{bound}(*m*), and *C*_{m} such that, for all *m* larger than some *M*, containment, boundedness, and tightness all hold.

In a good approximation, *K*_{m} would increase at a fast rate as a function of *m*, and *ε*_{tail}(*m*) and *ε*_{bound}(*m*) would decrease quickly as a function of *m*. As we will see in Section 5, we cannot have an arbitrarily fast convergence rate (by, for example, taking a subsequence of ), because the size of *K*_{m} has a great effect on the number of samples required to obtain accurate estimation.

### 4.1 Constructing Proper Approximations for Probabilistic Grammars

We now focus on constructing proper approximations for probabilistic grammars whose degree is limited to 2. Proper approximations could, in principle, be used with losses other than the log-loss, though their main use is for unbounded losses. Starting from this point in the article, we focus on using such proper approximations with the log-loss.

*T*(*f*, *γ*) that shifts every binomial parameter θ_{k} = 〈θ_{k,1}, θ_{k,2}〉 in the probabilistic grammar by at most *γ*:

Note that for any *γ* ≤ 1/2. Fix a constant *s* > 1.^{8} We denote by *T*(θ, *γ*) the same transformation on θ (which outputs the new shifted parameters) and we denote by *Θ*_{G}(*γ*) = *Θ*(*γ*) the set {*T*(θ, *γ*) | θ ∈ *Θ*_{G}}. For each *m* ∈ ℕ, define .

When considering our approach to approximate a probabilistic grammar by increasing its parameter probabilities to be over a certain threshold, it becomes clear why we are required to limit the grammar to have only two rules and why we are required to use the normal form from Section 3.2 with grammars of degree 2. Consider the PCFG rules in Table 1. There are different ways to move probability mass to the rule with small probability. This leads to a problem with identifiability of the approximation: How does one decide how to reallocate probability to the small-probability rules? By binarizing the grammar in advance, we arrive at a single way to reallocate mass when required (i.e., move mass from the high-probability rule to the low-probability rule). This leads to a simpler proof for sample complexity bounds and a single bound (rather than different bounds depending on different smoothing operators). We note, however, that the choices made in binarizing the grammar imply a particular way of smoothing the probability across the original rules.

Rule | θ | General | η = 0 | η = 0.01 | η = 0.005 |
---|---|---|---|---|---|
S → NP VP | 0.09 | 0.1 | 0.1 | 0.1 | 0.1 |
S → NP | 0.11 | 0.11 − η | 0.11 | 0.1 | 0.105 |
S → VP | 0.8 | 0.8 − γ + η | 0.79 | 0.8 | 0.795 |
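
The reallocation choices in Table 1 can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the shift is taken to be γ = 0.01 (the value that reproduces the numeric columns), and the helper names `reallocate` and `shift_binomial` are hypothetical, introduced only for illustration.

```python
def reallocate(theta, gamma, eta):
    """Raise the first (smallest) rule by gamma, taking eta of that mass
    from the second rule and gamma - eta from the third rule."""
    if not 0.0 <= eta <= gamma:
        raise ValueError("eta must lie in [0, gamma]")
    low, mid, high = theta
    return (low + gamma, mid - eta, high - gamma + eta)

def shift_binomial(theta, gamma):
    """With only two rules there is a single choice: the high-probability
    rule pays for the whole shift."""
    low, high = theta
    return (low + gamma, high - gamma)

# One approximation per value of eta, as in the three numeric columns.
table_1 = {eta: reallocate((0.09, 0.11, 0.8), 0.01, eta)
           for eta in (0.0, 0.01, 0.005)}
```

Every value of η in [0, γ] gives a different but equally valid three-rule approximation, whereas `shift_binomial` has no free parameter; this is the identifiability issue that binarization avoids.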

We now describe how this construction of approximations satisfies the properties mentioned in Section 4, specifically, the boundedness property and the tightness property.

**Proposition 2**

Let and let be as defined earlier. There exists a constant β = β(*L*,*q*,*p*,*N*) > 0 such that has the boundedness property with *K*_{m} = *sN* log^{3}*m* and .

See Appendix A for the proof of Proposition 2.

Next, is tight with respect to with .

**Proposition 3**

See Appendix A for the proof of Proposition 3.

We now have proper approximations for probabilistic grammars. These approximations are defined as a series of probabilistic grammars, related to the family of probabilistic grammars we are interested in estimating. They satisfy three properties: containment (they are a subset of the family of probabilistic grammars we are interested in estimating), boundedness (their log-loss does not diverge to infinity quickly), and tightness (there is only a small probability mass on which they are not tight approximations).

### 4.2 Coupling Bounded Approximations with Number of Samples

At this point, the number of samples *n* is decoupled from the bounded approximation () that we choose for grammar estimation. To couple between these two, we need to define *m* as a function of the number of samples, *m*(*n*). As mentioned earlier, there is a clear trade-off between choosing a fast rate for *m*(*n*) (such as *m*(*n*) = *n*^{k} for some *k* > 1) and a slower rate (such as *m*(*n*) = log*n*). The faster the rate is, the tighter the family of approximations that we use for *n* samples. If the rate is too fast, however, then *K*_{m} grows quickly as well. In that case, because our sample complexity bounds are increasing functions of such *K*_{m}, the bounds will degrade.

To balance the trade-off, we choose *m*(*n*) = *n*. As we see later, this gives sample complexity bounds which are asymptotically interesting for both the supervised and unsupervised case.
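To make the trade-off concrete, the sketch below tabulates the two quantities that the choice *m*(*n*) = *n* ties to the sample size: the truncation margin γ = *n*^{−s} (as set in Section 6.1) and the bound *K*_{n} = *sN* log^{3}*n*. The function name `approximation_schedule` is hypothetical, the logarithm is assumed to be natural, and the values of *s* and *N* are arbitrary illustrative choices.

```python
import math

def approximation_schedule(n, s, N):
    """Return (gamma, K_n) for n samples, with m(n) = n."""
    if n < 2 or s <= 1:
        raise ValueError("expects n >= 2 and s > 1")
    gamma = n ** (-s)               # margin used to truncate parameters
    K_n = s * N * math.log(n) ** 3  # K_n = s N log^3 n (natural log assumed)
    return gamma, K_n

# Each entry is (n, gamma, K_n).
schedule = [(n, *approximation_schedule(n, s=2.0, N=10)) for n in (10, 100, 1000)]
```

As *n* grows the margin shrinks polynomially (a tighter approximation) while *K*_{n} grows only polylogarithmically, which is the balance this choice of *m*(*n*) is meant to strike.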

### 4.3 Asymptotic Empirical Risk Minimization

It would be compelling to determine whether the empirical risk minimizer over is an *asymptotic empirical risk minimizer*. This would mean that the risk of the empirical risk minimizer over converges to the risk of the maximum likelihood estimate. As a conclusion to this section about proper approximations, we motivate the three requirements that we posed on proper approximations by showing that this is indeed true. We now unify *n*, the number of samples, and *m*, the index of the approximation of the concept space . Let be the minimizer of the empirical risk over , () and let *g*_{n} be the minimizer of the empirical risk over ().

Let *D* = {*z*_{1},…,*z*_{n}} be a sample from *p*(*z*). The operator is an asymptotic empirical risk minimizer if as *n* → ∞ (Shalev-Shwartz et al. 2009). Then, we have the following

**Lemma 1**

See Appendix A for the proof of Lemma 1.

**Proposition 4**

Let *D* = {*z*_{1},…,*z*_{n}} be a sample of derivations from *G*. Then is an asymptotic empirical risk minimizer.

**Proof**

*A*_{j,ε,n} for *j* ∈ {1,…,*n*} be the event “”. Then *A*_{ε,n} = ∪_{j}*A*_{j,ε,n}. We have that where Equation (16) comes from *z*_{l} being independent. Also, *B* is the constant from Section 3.1. Therefore, we have:

^{2}*n* or greater can be in . Therefore where *κ* > 0 is a constant. Similarly, we have . This means that . In addition, it can be shown that using the same proof technique we used here, while relying on the fact that , and therefore .▪

## 5. Sample Complexity Bounds

Equipped with the framework of proper approximations as described previously, we now give our main sample complexity results for probabilistic grammars. These results hinge on the convergence of . Indeed, proper approximations replace the use of in these convergence results. The rate of this convergence can be fast, if the *covering numbers* for do not grow too fast.

### 5.1 Covering Numbers and Bounds on Covering Numbers

We next give a brief overview of covering numbers. A cover provides a way to reduce a class of functions to a much smaller (finite, in fact) representative class such that each function in the original class is represented using a function in the smaller class. Let be a class of functions. Let *d*(*f*,*g*) be a distance measure between two functions *f*,*g* from . An *ε*-cover is a subset of , denoted by , such that for every there exists an such that *d*(*f*,*f*′) < *ε*. The **covering number** is the size of the smallest *ε*-cover of for the distance measure *d*.
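The definition can be tried out on a small finite class. The following is a minimal sketch, assuming the distance *d* is the empirical L1 distance over a fixed sample (an assumption made for illustration) and using log-losses of a single binomial as the function class. The names `empirical_l1`, `greedy_cover`, and `log_loss` are hypothetical. A greedy pass yields a valid *ε*-cover but not necessarily the smallest one, so its size is only an upper bound on the covering number.

```python
import math

def empirical_l1(f, g, sample):
    """Average absolute difference between f and g on the sample."""
    return sum(abs(f(z) - g(z)) for z in sample) / len(sample)

def greedy_cover(functions, sample, eps):
    """Keep a function only if no already-kept function is within eps of it."""
    cover = []
    for f in functions:
        if all(empirical_l1(f, g, sample) >= eps for g in cover):
            cover.append(f)
    return cover

def log_loss(theta):
    """Log-loss of a binomial with P(z = 1) = theta."""
    return lambda z: -math.log(theta if z == 1 else 1.0 - theta)

sample = [1, 0, 1, 1, 0]
functions = [log_loss(t / 20) for t in range(1, 20)]
cover = greedy_cover(functions, sample, eps=0.25)
```

Every function in `functions` is within 0.25 of some element of `cover`, yet `cover` is strictly smaller than the original class; this is the reduction to a representative class described above.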

*z*_{1},…,*z*_{n}. Let . We will use

Instead of using directly, we bound this quantity with , where we consider all possible samples (yielding ). The following is the key result regarding the connection between covering numbers and the double-sided convergence of the empirical process as *n* → ∞. This result is a general-purpose result that has been used frequently to prove the convergence of empirical processes of the type we discuss in this article.

**Lemma 2**

See Pollard (1984; Chapter 2, pages 30–31) for the proof of Lemma 2. See also Appendix A.

Covering numbers are rather complex combinatorial quantities which are hard to compute directly. Fortunately, they can be bounded using the pseudo-dimension (Anthony and Bartlett 1999), a generalization of the Vapnik-Chervonenkis (VC) dimension for real functions. In the case of our “binomialized” probabilistic grammars, the pseudo-dimension of is bounded by *N*, because we have , and the functions in are linear with *N* parameters. Hence, also has pseudo-dimension that is at most *N*. We then have the following.

**Lemma 3**

### 5.2 Supervised Case

We now turn to an analysis of the supervised case. This analysis is described mostly as preparation for the unsupervised case. In general, the families of probabilistic grammars we treat are parametric families, and the maximum likelihood estimator for these families is a consistent estimator in the supervised case. In the unsupervised case, however, lack of identifiability prevents us from getting these traditional consistency results. Also, the traditional results about the consistency of MLE are based on the assumption that the sample is generated from the parametric family we are trying to estimate. This is not the case in our analysis, where the distribution that generates the data does not have to be a probabilistic grammar.

Lemmas 2 and 3 can be combined to get the following sample complexity result.

**Theorem 2**

Let *G* be a grammar. Let (Section 3.1). Let be a proper approximation for the corresponding family of probabilistic grammars. Let *z*_{1},…,*z*_{n} be a sample of derivations. Then there exists a constant β(*L*,*q*,*p*,*N*) and constant *M* such that for any 0 < *δ* < 1 and 0 < *ε* < *K*_{n} and any *n* > *M* and if then we have where *K*_{n} = *sN* log^{3}*n*.

**Proof Sketch**

*β*(*L*,*q*,*p*,*N*) is the constant from Proposition 2. The main idea in the proof is to solve for *n* in the following two inequalities (based on Equation [17] [see the following]) while relying on Lemma 3:

Theorem 2 gives little intuition about the number of samples required for accurate estimation of a grammar because it considers the “additive” setting: The empirical risk is within *ε* of the expected risk. More specifically, it is not clear how we should pick *ε* for the log-loss, because the log-loss can take arbitrarily large values.

*ρ* ∈ (0,1) and choose *ε* = *ρK*_{n}. Then, substituting this *ε* in Theorem 2, we get that if then, with probability 1 − *δ*,

where *H*(*p*) is the Shannon entropy of *p*. This stems from the fact that for any *f*. This means that if we are interested in computing a sample complexity bound such that the ratio between the empirical risk and the expected risk (for log-loss) is close to 1 with high probability, we need to pick *ρ* such that the right-hand side of Equation (17) is smaller than the desired accuracy level (between 0 and 1). Note that Equation (17) is an oracle inequality: it requires knowing the entropy of *p* or some upper bound on it.

### 5.3 Unsupervised Case

*n* **yields** of derivations from the grammar, *x*_{1},…,*x*_{n}, and our goal again is to identify grammar parameters θ from these yields. Our concept classes are now the sets of log marginalized distributions from . For each , we define as

We denote the set of by . Analogously, we define . Note that we also need to define the operator as a first step towards defining as proper approximations (for ) in the unsupervised setting. Let . Let *f* be the concept in such that . Then we define .

It does not immediately follow that is a proper approximation for . It is not hard to show that the boundedness property is satisfied with the same *K*_{n} and the same form of *ε*_{bound}(*n*) as in Proposition 2 (we would have for some *β*′(*L*,*q*,*p*,*N*) = *β*′ > 0). This relies on the property of bounded derivation length of *p* (see Appendix A, Proposition 7). The following result shows that we have tightness as well.

**Utility Lemma 2**

For *a*_{i},*b*_{i} ≥ 0, if − log ∑_{i}*a*_{i} + log ∑_{i}*b*_{i} ≥ ε then there exists an *i* such that − log*a*_{i} + log*b*_{i} ≥ ε.

**Proposition 5**

**Proof Sketch**

Computing either the covering number or the pseudo-dimension of is a hard task, because the functions in the class include the “log-sum-exp.” Dasgupta (1997) overcomes this problem for Bayesian networks with fixed structure by giving a bound on the covering number for (his respective) which depends on the covering number of .

Unfortunately, we cannot fully adopt this approach, because the derivations of a probabilistic grammar can be arbitrarily large. Instead, we present the following proposition, which is based on the “Hidden Variable Rule” from Dasgupta (1997). This proposition shows that the covering number of (or more accurately, its bounded approximations) can be bounded in terms of the covering number of the bounded approximations of , and the constants which control the underlying distribution *p* mentioned in Section 3.

**Utility Lemma 3**

For any two positive-valued sequences (*a*_{1},…,*a*_{n}) and (*b*_{1},…,*b*_{n}) we have that .

**Proposition 6 (Hidden Variable Rule for Probabilistic Grammars)**

Let . Then, .

**Proof**

*m*. Consider . Let *f*′ and be the corresponding functions in . Then, for any distribution *p*, where *p*′(*x*,*z*) is a probability distribution that uniformly divides the probability mass *p*(*x*) across all derivations for the specific *x*, that is:

The inequality in Equation (18) stems from Utility Lemma 3.

Set *m* to be the quantity that appears in the proposition to get the necessary result (*f*′ and *f* are arbitrary functions in and respectively. Then consider and *f*_{0} to be functions from the respective covers.).▪

For the unsupervised case, then, we get the following sample complexity result.

**Theorem 3**

Let *G* be a grammar. Let be a proper approximation for the corresponding family of probabilistic grammars. Let *p*(*x*,*z*) be a distribution over derivations which satisfies the requirements in Section 3.1. Let *x*_{1},…,*x*_{n} be a sample of strings from *p*(*x*). Then there exists a constant β′(*L*,*q*,*p*,*N*) and constant *M* such that for any 0 < *δ* < 1, 0 < *ε* < *K*_{n}, any *n* > *M*, and if where , we have that where *K*_{n} = *sN* log^{3}*n*.

Theorem 3 states that the number of samples we require in order to accurately estimate a probabilistic grammar from unparsed strings depends on the level of ambiguity in the grammar, represented as *Λ*(*m*). We note that this dependence is polynomial, and we consider this a positive result for unsupervised learning of grammars. More specifically, if *Λ* is an exponential function (as is the case with PCFGs), then, compared to supervised learning, there is an extra multiplicative factor in the sample complexity in the unsupervised setting that behaves like .

*K*_{n} (*ρ* ∈ (0,1)), we get the following requirement on *n*: where .

## 6. Algorithms for Empirical Risk Minimization

We turn now to describing algorithms and their properties for minimizing empirical risk using the framework described in Section 4.

### 6.1 Supervised Case

ERM with proper approximations leads to simple algorithms for estimating the probabilities of a probabilistic grammar in the supervised setting. Given an ε > 0 and a δ > 0, we draw *n* examples according to Theorem 2. We then set γ = *n*^{−s}. To minimize the log-loss with respect to these *n* examples, we use the proper approximation .

*G*) ≤ 2 (Section 3.2), we have

To minimize the log-loss with respect to , we need to minimize Equation (21) under the constraint that γ ≤ *θ*_{k,i} ≤ 1 − γ and *θ*_{k,1} + *θ*_{k,2} = 1. It can be shown that the solution for this optimization problem is where is the number of times that *ψ*_{k,i} fires in example *j*. (We include a full derivation of this result in Appendix B.) The interpretation of Equation (22) is simple: We count the number of times a rule appears in the samples and then normalize this value by the total number of times rules associated with the same multinomial appear in the samples. This frequency count is the maximum likelihood solution with respect to the full hypothesis class (Corazza and Satta 2006; see Appendix B). Because we constrain ourselves to obtain a value away from 0 or 1 by a margin of *γ*, we need to truncate this solution, as done in Equation (22).

This truncation to a margin *γ* can be thought of as a smoothing factor that enables us to compute sample complexity bounds. We explore this connection to smoothing with a Dirichlet prior in a maximum a posteriori (MAP) Bayesian setting in Section 7.2.
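The supervised estimator described in words above (count, normalize, truncate) can be sketched as follows. This is an illustrative sketch rather than a transcription of Equation (22): the function name `truncated_mle` and the dictionary layout of the counts are hypothetical, and the fallback of 0.5 for a binomial that never fires is an assumption not discussed in the text.

```python
def truncated_mle(counts, gamma):
    """Relative-frequency estimate for each binomial, clipped to [gamma, 1 - gamma].

    counts maps a binomial's name to (c1, c2), the number of times each of
    its two events fires over all examples.
    """
    if not 0.0 <= gamma <= 0.5:
        raise ValueError("gamma must lie in [0, 1/2]")
    theta = {}
    for k, (c1, c2) in counts.items():
        total = c1 + c2
        # Assumption: an unseen binomial falls back to the uniform estimate.
        p = c1 / total if total > 0 else 0.5
        p = min(max(p, gamma), 1.0 - gamma)  # truncate to the margin
        theta[k] = (p, 1.0 - p)
    return theta

estimate = truncated_mle({"S": (0, 10), "NP": (3, 7)}, gamma=0.01)
```

In this example the binomial `"S"` would receive the unsmoothed estimate (0, 1) and is pulled back to (0.01, 0.99), while `"NP"` already lies inside the margin and is left at its relative frequencies.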

### 6.2 Unsupervised Case

#### 6.2.1 Hardness of ERM with Proper Approximations

It turns out that minimizing Equation (23) under the specified constraints is actually an NP-hard problem when *G* is a PCFG. This result follows using a similar proof to the one in Cohen and Smith (2010c) for the hardness of Viterbi training and maximizing log-likelihood for PCFGs. We turn to giving the full derivation of this hardness result for PCFGs and the modification required for adapting the results from Cohen and Smith to the case of having an arbitrary γ margin constraint.

In order to show an NP-hardness result, we need to “convert” the problem of the minimization of Equation (23) to a decision problem. We do so by stating the following decision problem.

**Problem 1 (Unsupervised Minimization of the Log-Loss with Margin)**

**Input:** A binarized context-free grammar *G*, a set of sentences *x*_{1}, … , *x*_{n}, a value γ ∈ , and a value α ∈ [0, 1].

We will show the hardness result both when γ is not restricted at all as well as when we allow γ > 0. The proof of the hardness result is achieved by reducing the problem 3-SAT (Sipser 2006), known to be NP-complete, to Problem 1. The problem 3-SAT is defined as follows:

**Problem 2 (3-SAT)**

**Input:** A formula φ in conjunctive normal form, such that each clause has three literals.

**Output:** 1 if there is a satisfying assignment for φ, and 0 otherwise.

Given an instance of the 3-SAT problem, the reduction will, in polynomial time, create a grammar and a single string such that solving Problem 1 for this grammar and string will yield a solution for the instance of the 3-SAT problem.

Let be an instance of the 3-SAT problem, where *a*_{i}, *b*_{i}, and *c*_{i} are literals over the set of variables {*Y*_{1},…,*Y*_{N}} (a literal refers to a variable *Y*_{j} or its negation, ). Let *C*_{j} be the *j*th clause in φ, such that *C*_{j} = *a*_{j} ∨ *b*_{j} ∨ *c*_{j}. We define the following CFG *G*_{φ} and string to parse *s*_{φ}:

1. The terminals of *G*_{φ} are the binary digits Σ = {0,1}.
2. We create *N* nonterminals , *r* ∈ {1,…,*N*} and rules and .
3. We create *N* nonterminals , *r* ∈ {1,…,*N*} and rules and .
4. We create and .
5. We create the rule *S*_{1} → *A*_{1}. For each *j* ∈ {2,…,*m*}, we create a rule *S*_{j} → *S*_{j−1}*A*_{j} where *S*_{j} is a new nonterminal indexed by *j* and *A*_{j} is also a new nonterminal indexed by *j* ∈ {1,…,*m*}.
6. Let *C*_{j} = *a*_{j} ∨ *b*_{j} ∨ *c*_{j} be clause *j* in *φ*. Let *Y*(*a*_{j}) be the variable that *a*_{j} mentions. Let (*y*_{1}, *y*_{2}, *y*_{3}) be a satisfying assignment for *C*_{j} where *y*_{k} ∈ {0,1} and is the value of *Y*(*a*_{j}), *Y*(*b*_{j}), and *Y*(*c*_{j}), respectively, for *k* ∈ {1,2,3}. For each such clause-satisfying assignment, we add the rule For each *A*_{j}, we would have at most seven rules of this form, because one rule will be logically inconsistent with *a*_{j} ∨ *b*_{j} ∨ *c*_{j}.
7. The grammar's start symbol is *S*_{m}.
8. The string to parse is *s*_{φ} = (10)^{3m}, that is, 3*m* consecutive occurrences of the string 10.
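
Two pieces of this construction are easy to check mechanically: the clause-satisfying tuples that give rise to the rules for each *A*_{j} in step 6, and the string to parse in step 8. The sketch below is only an illustration. The encoding of a literal as a pair `(variable_index, is_positive)` and the function names are assumptions, and the grammar rules themselves are not built.

```python
from itertools import product

def satisfying_assignments(clause):
    """Return the tuples (y1, y2, y3) of values for the three mentioned
    variables that make at least one literal of the clause true.

    clause is a list of three literals, each (variable_index, is_positive).
    """
    result = []
    for values in product((0, 1), repeat=3):
        # A positive literal is true when its variable is 1,
        # a negated literal when its variable is 0.
        if any((v == 1) == positive for v, (_, positive) in zip(values, clause)):
            result.append(values)
    return result

def string_to_parse(num_clauses):
    """s_phi = (10)^{3m}: 3m consecutive occurrences of the string 10."""
    return "10" * (3 * num_clauses)

clause = [(1, True), (2, False), (3, True)]  # Y1 or (not Y2) or Y3
rules_for_clause = satisfying_assignments(clause)
```

For this clause the only excluded tuple is (0, 1, 0), which leaves seven tuples and matches the count of at most seven rules per *A*_{j}. The sketch treats the three positions independently and does not handle a variable that is repeated within a clause.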

A parse of the string *s*_{φ} using *G*_{φ} will be used to get an assignment by setting *Y*_{r} = 0 if the rule or is used in the derivation of the parse tree, and 1 otherwise. Notice that at this point we do not exclude “contradictions” that come from the parse tree, such as used in the tree together with or . To maintain the restriction on the degree of grammars, we convert *G*_{φ} to the binary normal form described in Section 3.2. The following lemma gives a condition under which the assignment is consistent (so that contradictions do not occur in the parse tree).

**Lemma 4**

Let φ be an instance of the 3-SAT problem, and let *G*_{φ} be a probabilistic CFG based on the given grammar with weights θ_{φ}. If the (multiplicative) weight of the Viterbi parse (i.e., the highest scoring parse according to the PCFG) of *s*_{φ} is 1, then the assignment extracted from the parse tree is consistent.

**Proof**

Because the probability of the Viterbi parse is 1, all rules of the form which appear in the parse tree have probability 1 as well. There are two possible types of inconsistencies. We show that neither exists in the Viterbi parse:

1. For any *r*, an appearance of both rules of the form and cannot occur because all rules that appear in the Viterbi parse tree have probability 1.
2. For any *r*, an appearance of rules of the form and cannot occur, because whenever we have an appearance of the rule , we have an adjacent appearance of the rule (because we parse substrings of the form 10), and then we again use the fact that all rules in the parse tree have probability 1. The case of and is handled analogously.

**Lemma 5**

Define φ and *G*_{φ} as before. There exists θ_{φ} such that the weight of the Viterbi parse of *s*_{φ} is 1 if and only if φ is satisfiable. Moreover, the satisfying assignment is the one extracted from the parse tree with weight 1 of *s*_{φ} under θ_{φ}.

**Proof**

*C*_{j} = *a*_{j} ∨ *b*_{j} ∨ *c*_{j} is satisfied using a tuple (*y*_{1}, *y*_{2}, *y*_{3}), which assigns values for *Y*(*a*_{j}), *Y*(*b*_{j}), and *Y*(*c*_{j}). This assignment corresponds to the following rule:

Set its probability to 1, and set all other rules of *A*_{j} to 0. In addition, for each *r*, if *Y*_{r} = *y*, set the probabilities of the rules and to 1 and and to 0. The rest of the weights for *S*_{j} → *S*_{j−1}*A*_{j} are set to 1. This assignment of rule probabilities results in a Viterbi parse of weight 1.

*C*_{j} we have a rule that is assigned probability 1, for some (*y*_{1}, *y*_{2}, *y*_{3}). One can verify that (*y*_{1}, *y*_{2}, *y*_{3}) are the values of the assignment for the corresponding variables in clause *C*_{j}, and that they satisfy this clause. This means that each clause is satisfied by the assignment we extracted.▪

We are now ready to prove the following result.

**Theorem 4**

Problem 1 is NP-hard when either requiring γ > 0 or when fixing γ = 0.

**Proof**

We first describe the reduction for the case of γ = 0. In Problem 1, set γ = 0, α = 1, *G* = *G*_{φ}, and *x*_{1} = *s*_{φ}. If φ is satisfiable, then the left side of Equation (24) can get value 0, by setting the rule probabilities according to Lemma 5, hence we would return 1 as the result of running Problem 1.

If φ is unsatisfiable, then we would still get value 0 only if *L*(*G*) = {*s*_{φ}}. If *G*_{φ} generates a single derivation for (10)^{3m}, then we actually do have a satisfying assignment from Lemma 4. Otherwise (more than a single derivation), the optimal θ (or ). In that case, it is no longer true that (10)^{3m} is the only generated sentence, and this is a contradiction to getting value 0 for Problem 1.

^{3m} is at most 10*m* (using the binarized *G*_{φ}) and assuming α = γ < (1 − γ)^{10m}. This inequality indeed holds whenever . Therefore, we have − log *h*(*x*_{1}|θ) > − log α. Problem 1 would return 0 in this case.

#### 6.2.2 An Expectation-Maximization Algorithm

Instead of solving the optimization problem implied by Equation (21), we propose a rather simple modification to the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) to approximate the optimal solution—this algorithm finds a local maximum for the maximum likelihood problem using proper approximations. The modified algorithm is given in Algorithm 1.

The modification from the usual expectation-maximization algorithm is done in the M-step: Instead of using the expected value of the sufficient statistics by counting and normalizing, we truncate the values by *γ*. It can be shown that if θ^{(0)} ∈ Θ(γ), then the likelihood is guaranteed to increase (and hence, the log-loss is guaranteed to decrease) after each iteration of the algorithm.

The reason for this likelihood increase stems from the fact that the M-step solves the optimization problem of minimizing the log-loss (with respect to θ ∈ *Θ*(*γ*)) when the posterior calculated at the E-step is used as the base distribution. This means that the M-step minimizes (in iteration *t*): where the expectation is taken with respect to the distribution . With this notion in mind, the likelihood increase after each iteration follows from principles similar to those described in Bishop (2006) for the EM algorithm.
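Since Algorithm 1 is not reproduced in this text, the following sketch illustrates the truncated M-step on a deliberately small latent-variable model rather than on a PCFG. A hidden coin (chosen with probability π) generates a sequence of flips, so all three parameters are binomials. The model, the data, and the names `clamp`, `joint`, `log_likelihood`, and `truncated_em` are assumptions made for illustration only.

```python
import math

def clamp(p, gamma):
    """Truncate a binomial parameter to [gamma, 1 - gamma]."""
    return min(max(p, gamma), 1.0 - gamma)

def joint(h, t, theta):
    """p(x, z) for both values of the hidden coin z, where x is a flip
    sequence with h heads and t tails."""
    pi, q = theta
    return [w * qz ** h * (1.0 - qz) ** t for w, qz in zip((pi, 1.0 - pi), q)]

def log_likelihood(data, theta):
    return sum(math.log(sum(joint(h, t, theta))) for h, t in data)

def truncated_em(data, theta, gamma, iterations=25):
    history = [log_likelihood(data, theta)]
    for _ in range(iterations):
        # E-step: expected counts under the posterior over the hidden coin.
        picks, heads, flips = 0.0, [0.0, 0.0], [0.0, 0.0]
        for h, t in data:
            p = joint(h, t, theta)
            total = sum(p)
            post = [v / total for v in p]
            picks += post[0]
            for c in (0, 1):
                heads[c] += post[c] * h
                flips[c] += post[c] * (h + t)
        # M-step: normalize the expected counts, then truncate by gamma.
        theta = (clamp(picks / len(data), gamma),
                 tuple(clamp(heads[c] / flips[c], gamma) for c in (0, 1)))
        history.append(log_likelihood(data, theta))
    return theta, history

data = [(9, 1), (8, 2), (1, 9), (2, 8), (10, 0)]
theta, history = truncated_em(data, (0.5, (0.6, 0.4)), gamma=0.05)
```

Because the starting point lies in *Θ*(*γ*) and each truncated update maximizes a concave one-dimensional objective over [γ, 1 − γ], the entries of `history` should be non-decreasing, in line with the guarantee stated above.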

## 7. Discussion

Our framework can be specialized to improve the two main criteria which have a trade-off: the tightness of the proper approximation and the sample complexity. For example, we can improve the tightness of our proper approximations by taking a subsequence of . This will make the sample complexity bound degrade, however, because *K*_{n} will grow faster. Table 2 shows the trade-offs between parameters in our model and the effectiveness of learning.

Criterion | as *K*_{n} increases … | as *s* increases … |
---|---|---|
tightness of proper approximation | improves | improves |
sample complexity bound | degrades | degrades |

We note that the sample complexity bounds that we give in this article give insight about the asymptotic behavior of grammar estimation, but are not necessarily sufficiently tight to be used in practice. It still remains an open problem to obtain sample complexity bounds which are sufficiently tight in this respect. For a discussion about the connection of grammar learning in theory and practice, we refer the reader to Clark and Lappin (2010).

It is also important to note that MLE is not the only option for estimating finite state probabilistic grammars. There have been some recent advances in learning finite state models (HMMs and finite state transducers) by using spectral analysis of matrices which consist of quantities estimated from observations only (Hsu, Kakade, and Zhang 2009; Balle, Quattoni, and Carreras 2011), based on the observable operator models of Jaeger (1999). These algorithms are not prone to local minima, and converge to the correct model as the number of samples increases, but require some assumptions about the underlying model that generates the data.

### 7.1 Tsybakov Noise

In this article, we chose to introduce assumptions about distributions that generate natural language data. The choice of these assumptions was motivated by observations about properties shared among treebanks. The main consequence of making these assumptions is bounding the amount of *noise* in the distribution (i.e., the amount of variation in probabilities across labels given a fixed input).

There are other ways to restrict the noise in a distribution. One condition for such noise restriction, which has received considerable recent attention in the statistical literature, is the Tsybakov noise condition (Tsybakov 2004; Koltchinskii 2006). Showing that a distribution satisfies the Tsybakov noise condition enables the use of techniques (e.g., from Koltchinskii 2006) for deriving distribution-dependent sample complexity bounds that depend on the parameters of the noise. It is therefore of interest to see whether Tsybakov noise holds under the assumptions presented in Section 3.1. We show that this is not the case, and that Tsybakov noise is too permissive. In fact, we show that *p* can be a probabilistic grammar itself (and hence, satisfy the assumptions in Section 3.1), and still not satisfy the Tsybakov noise conditions.

Tsybakov noise was originally introduced for classification problems (Tsybakov 2004), and was later extended to more general settings, such as the one we are facing in this article (Koltchinskii 2006). We now explain the definition of Tsybakov noise in our context.

Let *C* > 0 and *κ* ≥ 1. We say that a distribution *p*(*x*,*z*) satisfies the (*C*, *κ*) Tsybakov noise condition if for any *ε* > 0 and such that , we have

This interpretation of Tsybakov noise implies that the diameter of the set of functions from the concept class that has small excess risk should shrink to 0 at the rate in Equation (25). Distribution-dependent bounds from Koltchinskii (2006) are monotone with respect to the diameter of this set of functions, and therefore demonstrating that it goes to 0 enables sharper derivations of sample complexity bounds.

**Theorem 5**

Let *G* be a grammar with *K* ≥ 2 and degree 2. Assume that *p* is 〈*G*, θ*〉 for some θ*, such that and that *c*_{1} ≤ *c*_{2}. If *A*_{G}(θ*) is positive definite, then *p* does not satisfy the Tsybakov noise condition for any (*C*, *κ*), where *C* > 0 and *κ* ≥ 1.

See Appendix C for the proof of Theorem 5.

In Appendix C we show that *A*_{G}(θ) is positive semi-definite for any choice of θ. The main intuition behind the proof is that given a probabilistic grammar *p*, we can construct a hypothesis *h* such that the KL divergence between *p* and *h* is small, but dist(*p*,*h*) is lower-bounded and is not close to 0.

We conclude that probabilistic grammars, as generative distributions of data, do not generally satisfy the Tsybakov noise condition. This motivates an alternative choice of assumptions that could lead to a better understanding of rates of convergence and bounds on the excess risk. Section 3.1 states such assumptions, which were also justified empirically.

### 7.2 Comparison to Dirichlet Maximum A Posteriori Solutions

*T*(θ, *γ*) from Section 4.1 can be thought of as a *smoother* for the probabilities θ: It ensures that the probability of each rule is at least *γ* (and as a result, the probabilities of all rules cannot exceed 1 − *γ*). Adding pseudo-counts to frequency counts is also a common way to smooth probabilities in models based on multinomial distributions, including probabilistic grammars (Manning and Schütze 1999). These pseudo-counts can be framed as a maximum a posteriori (MAP) alternative to the maximum likelihood problem, with the choice of Bayesian prior over the parameters in the form of a Dirichlet distribution. In comparison to our framework, with (symmetric) Dirichlet smoothing, instead of truncating the probabilities with a margin *γ* we would set the probability of each rule (in the supervised setting) to the smoothed relative frequency given in Equation (26) for *i* = 1,2, which is computed from the counts in the data of event *i* in multinomial *k* for example *j*. Dirichlet smoothing can be formulated as the result of adding a symmetric Dirichlet prior over the parameters *θ*_{k,i} with hyperparameter *α*. Then Equation (26) is the mode of the posterior after observing the appearances of event *i* in multinomial *k*.

The effect of Dirichlet smoothing becomes weaker as we have more samples, because the frequency counts become dominant in both the numerator and the denominator when there are more data. In this sense, the prior's effect on learning diminishes as we use more data. A similar effect occurs in our framework: *γ* = *n*^{−s} where *n* is the number of samples—the more samples we have, the more we trust the counts in the data to be reliable. There is a subtle difference, however. With the Dirichlet MAP solution, the smoothing is less dominant only if the counts of the features are large, regardless of the number of samples we have. With our framework, smoothing depends *only* on the number of samples we have. These two scenarios are related, of course: The more samples we have, the more likely it is that the counts of the events will grow large.
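The difference between the two smoothers can be illustrated with a small sketch. It assumes a two-outcome multinomial, takes the truncation to be a clip into [*γ*, 1 − *γ*] as in Utility Lemma 4, and uses a plain add-*α* pseudo-count estimate as a stand-in for Dirichlet smoothing (the exact form of Equation (26) is not reproduced here). The function names and the choice *s* = 0.5 are hypothetical.

```python
def truncate(theta, gamma):
    """Clip a binomial parameter into [gamma, 1 - gamma] (truncation smoother)."""
    return min(max(theta, gamma), 1.0 - gamma)


def add_alpha(count_i, count_total, alpha):
    """Add-alpha pseudo-count estimate for a two-outcome multinomial (assumed form)."""
    return (count_i + alpha) / (count_total + 2.0 * alpha)


def margin(n, s):
    """Margin gamma = n^(-s), which depends only on the number of samples n."""
    return n ** (-s)


# A rarely used multinomial: it fires 4 times in total and event 1 is never seen,
# no matter how many samples are drawn.
hits, uses = 0, 4
for n in (10, 1000, 100000):
    gamma = margin(n, s=0.5)
    print(n, truncate(hits / uses, gamma), add_alpha(hits, uses, alpha=1.0))
```

In this scenario the truncated estimate shrinks with *n* because the margin does, whereas the pseudo-count estimate stays fixed because the counts of this particular multinomial did not grow.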

### 7.3 Other Derivations of Sample Complexity Bounds

In this section, we discuss other possible solutions to the problem of deriving sample complexity bounds for probabilistic grammars.

#### 7.3.1 Using Talagrand's Inequality

Our bounds are based on VC theory together with classical results for empirical processes (Pollard 1984). There have been some recent developments in the derivation of rates of convergence in statistical learning theory (Massart 2000; Bartlett, Bousquet, and Mendelson 2005; Koltchinskii 2006), most prominently through the use of Talagrand's inequality (Talagrand 1994), which is a concentration of measure inequality in the spirit of Lemma 2.

The bounds achieved with Talagrand's inequality are also distribution-dependent, and are based on the diameter of the *ε*-minimal set—the set of hypotheses which have an excess risk smaller than *ε*. We saw in Section 7.1 that the diameter of the *ε*-minimal set does not follow the Tsybakov noise condition, but it is perhaps possible to find meaningful bounds for it, in which case we may be able to get tighter bounds using Talagrand's inequality. We note that it may be possible to obtain *data-dependent* bounds for the diameter of the *ε*-minimal set, following Koltchinskii (2006), by calculating the diameter of the *ε*-minimal set using .

#### 7.3.2 Simpler Bounds for the Supervised Case

As noted in Section 6.1, minimizing empirical risk with the log-loss leads to a simple frequency count for calculating the estimated parameters of the grammar. Corazza and Satta (2006) also noted that to minimize the non-empirical risk, it is necessary to set the parameters of the grammar to the normalized *expected* count of the features.

This means that we can get bounds on the deviation of a certain parameter from the optimal parameter by applying modifications to rather simple inequalities such as Hoeffding's inequality, which bounds the probability that the average of a set of i.i.d. random variables deviates from its mean. The modification would require us to split the event space into two cases: one in which the count of some features is larger than some fixed value (which will happen with small probability because of the bounded expectation of features), and one in which they are all smaller than that fixed value. Handling these two cases separately is necessary because Hoeffding's inequality requires that the count of the rules be bounded.
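To make the role of the boundedness requirement concrete, here is a minimal sketch of the standard two-sided Hoeffding bound for i.i.d. variables taking values in [lo, hi]. It is not the modified inequality alluded to above: the function names, the threshold values, and the tolerance and confidence settings are all illustrative assumptions.

```python
import math


def hoeffding_tail(n, t, lo=0.0, hi=1.0):
    """Two-sided Hoeffding bound on P(|mean - E[mean]| >= t) for n i.i.d. variables in [lo, hi]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t * t / (hi - lo) ** 2))


def samples_needed(t, delta, lo=0.0, hi=1.0):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil((hi - lo) ** 2 * math.log(2.0 / delta) / (2.0 * t * t))


# The required sample size grows quadratically with the cap on the rule counts,
# which is why the unbounded case has to be handled separately.
for cap in (1, 5, 20):
    print(cap, samples_needed(t=0.1, delta=0.05, lo=0.0, hi=float(cap)))
```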

The bound on the deviation of the estimated parameters from their mean (the true probability) can potentially lead to a bound on the excess risk in the supervised case. This formulation of the problem would not generalize to the unsupervised case, however, where empirical risk minimization does not amount to a simple frequency count.

### 7.4 Open Problems

We conclude the discussion with some directions for further exploration and future work.

#### 7.4.1 Sample Complexity Bounds with Semi-Supervised Learning

Our bounds focus on the supervised case and the unsupervised case. There is a trivial extension to the semi-supervised case. Consider the objective function to be the sum of the likelihood for the labeled data together with the marginalized likelihood of the unlabeled data (this sum could be a weighted sum). Then, use the sample complexity bounds for each summand to derive a sample complexity bound on this sum.

It would be more interesting to extend our results to frameworks such as the one described by Balcan and Blum (2010). In that case, our discussion of sample complexity would attempt to identify how unannotated data can reduce the space of candidate probabilistic grammars to a smaller set, after which we can use the annotated data to estimate the final grammar. This reduction of the space is accomplished through a notion of compatibility, a type of fitness that the learner believes the estimated grammar should have given the distribution that generates the data. The key challenge in the case of probabilistic grammars would be to properly define this compatibility notion such that it fits the log-loss. If this is achieved, then similar machinery to that described in this paper (with proper approximations) can be followed to derive semi-supervised sample complexity bounds for probabilistic grammars.

#### 7.4.2 Sharper Bounds for the Pseudo-Dimension of Probabilistic Grammars

The pseudo-dimension of a probabilistic grammar with the log-loss is bounded by the number of parameters in the grammar, because the logarithm of a distribution generated by a probabilistic grammar is a linear function. Typically, however, the set of counts for the feature vectors of a probabilistic grammar resides in a subspace of a dimension which is smaller than the full dimension specified by the number of parameters. The reason for this is that there are usually relationships (which are often linear) between the elements in the feature counts. For example, with HMMs, the total feature count for emissions should equal the total feature count for transitions. With PCFGs, the total number of times that nonterminal rules fire equals the total number of times that features with that nonterminal in the right-hand side fired, again reducing the pseudo-dimension. An open problem that remains is the characterization of the exact value of the pseudo-dimension for a given grammar, determined by consideration of various properties of that grammar. We conjecture, however, that a lower bound on the pseudo-dimension would be rather close to the full dimension of the grammar (the number of parameters).

It is interesting to note that there has been some work to identify the VC dimension and pseudo-dimension for certain types of grammars. Bane, Riggle, and Sonderegger (2010), for example, calculated the VC dimension for constraint-based grammars. Ishigami and Tani (1993, 1997) computed the VC dimension for finite state automata with various properties.
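The linear relationships among feature counts can be observed numerically. The sketch below assumes a toy HMM with two states and two output symbols, with features for the initial transition, state-to-state transitions, and emissions (10 features in total). These choices, and the random sampling of derivations, are illustrative assumptions rather than anything from the article. Because the number of emissions from a state equals the number of transitions into it, the count vectors span a subspace of dimension at most 10 − 2 = 8.

```python
import random
from fractions import Fraction

STATES = (0, 1)
SYMBOLS = (0, 1)

# Feature order: start->s, s->s', s emits o  (2 + 4 + 4 = 10 features).
FEATURES = ([("start", s) for s in STATES]
            + [("trans", s, t) for s in STATES for t in STATES]
            + [("emit", s, o) for s in STATES for o in SYMBOLS])
INDEX = {f: i for i, f in enumerate(FEATURES)}


def count_vector(states, symbols):
    """Feature counts of one HMM derivation (state sequence plus emitted symbols)."""
    v = [0] * len(FEATURES)
    v[INDEX[("start", states[0])]] += 1
    for s, t in zip(states, states[1:]):
        v[INDEX[("trans", s, t)]] += 1
    for s, o in zip(states, symbols):
        v[INDEX[("emit", s, o)]] += 1
    return v


def rank(rows):
    """Exact matrix rank by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


rng = random.Random(0)
rows = []
for _ in range(200):
    length = rng.randint(1, 8)
    states = [rng.choice(STATES) for _ in range(length)]
    symbols = [rng.choice(SYMBOLS) for _ in range(length)]
    rows.append(count_vector(states, symbols))

# Number of features versus the dimension actually spanned by the count vectors.
print(len(FEATURES), rank(rows))
```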

### 7.5 Conclusion

We presented a framework for performing empirical risk minimization for probabilistic grammars, in which sample complexity bounds, for the supervised case and the unsupervised case, can be derived. Our framework is based on the idea of bounded approximations used in the past to derive sample complexity bounds for graphical models.

Our framework required assumptions about the probability distribution that generates sentences or derivations in the language of the given grammar. These assumptions were tested using corpora, and found to fit the data well.

We also discussed algorithms that can be used for minimizing empirical risk in our framework, given enough samples. We showed that directly trying to minimize empirical risk in the unsupervised case is NP-hard, and suggested an approximation based on an expectation-maximization algorithm.

## Appendix A. Proofs

We include in this appendix proofs for several results in the article.

**Utility Lemma 1**

Let such that ∑_{i}*a*_{i} = 1. Define *b*_{1} = *a*_{1}, *c*_{1} = 1 − *a*_{1}, *b*_{i} = , and *c*_{i} = 1 − *b*_{i} for *i* ≥ 2. Then

**Proof**

**Lemma 1**

**Proof**

*g*_{n} as the minimizer of the empirical risk. We next bound . We know from the requirement of proper approximation that we have and that equals the right side of Equation (A.1).▪

**Proposition 2**

Let and let be as defined earlier. There exists a constant β = β(*L*, *q*, *p*, *N*) > 0 such that has the boundedness property with *K*_{m} = *sN* log^{3} *m* and .

**Proof**

**Utility Lemma 4**

(From Dasgupta [1997].) Let *a* ∈ [0,1] and let *b* = *a* if *a* ∈ [*γ*,1 − *γ*], *b* = *γ* if *a* ≤ *γ*, and *b* = 1 − *γ* if *a* ≥ 1 − *γ*. Then for any *ε* ≤ 1/2 such that γ ≤ *ε*/ (1 + *ε*) we have log *a*/*b* ≤ ε.
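The statement of Utility Lemma 4 can be sanity-checked numerically. The sketch below evaluates log(*a*/*b*) on a grid of values of *a* in (0, 1] for the largest margin the lemma permits, *γ* = *ε*/(1 + *ε*). The grid resolution and function names are arbitrary choices, and a finite grid check is of course not a proof.

```python
import math


def truncate(a, gamma):
    """b = a on [gamma, 1 - gamma], clipped to the nearest endpoint otherwise."""
    return min(max(a, gamma), 1.0 - gamma)


def worst_log_ratio(gamma, grid=10000):
    """Largest value of log(a / b) over a grid of a in (0, 1]."""
    return max(math.log((i / grid) / truncate(i / grid, gamma))
               for i in range(1, grid + 1))


for eps in (0.5, 0.1, 0.01):
    gamma = eps / (1.0 + eps)  # largest margin the lemma allows for this eps
    assert worst_log_ratio(gamma) <= eps
```

The worst case is attained at *a* = 1, where log(*a*/*b*) = −log(1 − *γ*) ≤ log(1 + *ε*) ≤ *ε*.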

**Proposition 3**

**Proof**

^{2} *m*. Let . Define *f*′ = *T*(*f*, *m*^{−s}). For any we have that Without loss of generality, assume . Let . From Utility Lemma 4 we have that . Plug this into Equation (A.2) (*N* = 2*K*) to get that for all we have . It remains to show that the measure . Note that for *m* > *M* where *M* is fixed.▪

**Proposition 7**

There exists a β′(*L*,*p*,*q*,*N*) > 0 such that has the boundedness property with *K*_{m} = *sN* log^{3} *m* and .

**Proof**

From the requirement of *p*, we know that for any *x* we have a *z* such that yield(*z*) = *x* and |*z*| ≤ *α*|*x*|. Therefore, if we let , then we have for any and that (similarly to the proof of Proposition 2). Denote by *f*_{1}(*x,z*) the function in such that .

**Utility Lemma 2**

For *a*_{i}, *b*_{i} ≥ 0, if − log ∑_{i} *a*_{i} + log ∑_{i} *b*_{i} ≥ *ε* then there exists an *i* such that − log *a*_{i} + log *b*_{i} ≥ *ε*.

**Proof**

Assume − log *a*_{i} + log *b*_{i} < *ε* for all *i*. Then *b*_{i} < *e*^{ε}*a*_{i} for all *i*, therefore ∑_{i} *b*_{i} < *e*^{ε} ∑_{i} *a*_{i}, therefore − log ∑_{i} *a*_{i} + log ∑_{i} *b*_{i} < *ε*, which is a contradiction to − log ∑_{i} *a*_{i} + log ∑_{i} *b*_{i} ≥ *ε*.▪
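Utility Lemma 2 is equivalent to saying that the largest term-wise gap log *b*_{i} − log *a*_{i} is never smaller than the gap between the logarithms of the sums. A quick randomized check of that inequality, with hypothetical helper names and strictly positive entries to keep the logarithms finite, might look as follows.

```python
import math
import random


def gap_of_sums(a, b):
    """-log(sum a_i) + log(sum b_i)."""
    return math.log(sum(b)) - math.log(sum(a))


def largest_termwise_gap(a, b):
    """max over i of -log(a_i) + log(b_i)."""
    return max(math.log(bi) - math.log(ai) for ai, bi in zip(a, b))


rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(1, 6)
    a = [rng.uniform(0.01, 1.0) for _ in range(n)]
    b = [rng.uniform(0.01, 1.0) for _ in range(n)]
    # Small tolerance guards against floating-point rounding only.
    assert largest_termwise_gap(a, b) >= gap_of_sums(a, b) - 1e-12
```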

The next lemma is the main concentration of measure result that we use. Its proof requires some simple modification to the proof given for Theorem 24 in Pollard (1984, pages 30–31).

**Lemma 2**

**Proof**

At this point, we can follow the proof of Theorem 24 in Pollard (1984), and its extension on pages 30–31 to get Lemma 2, using the shifted set of functions .▪

## Appendix B. Minimizing Log-Loss for Probabilistic Grammars

*c*_{k,i}, which depend on or some other intermediate distribution in the case of the expectation-maximization algorithm, and *γ*, which is a margin determined by the number of samples. This minimization problem can be decomposed into several optimization problems, one for each *k*, each having the following form: where *c*_{i} ≥ 0 and 1/2 > *γ* ≥ 0.

Ignore for a moment the constraints *γ* ≤ *β*_{i} ≤ 1 − *γ*. In that case, this can be thought of as a regular maximum likelihood estimation problem, so *β*_{i} = *c*_{i}/(*c*_{1} + *c*_{2}). We give a derivation of this result in this simple case for completeness. We use Lagrangian multipliers to solve this problem. Let *F*(*β*_{1}, *β*_{2}) = *c*_{1} log *β*_{1} + *c*_{2} log *β*_{2}. Define the Lagrangian: *g*(*λ*) is the objective function of the dual problem of Equations (B.1)–(B.2). We would like to minimize Equation (B.5) with respect to *λ*. The derivative of *g*(*λ*) is hence when equating the derivative of *g*(*λ*) to 0, we get *λ* = −(*c*_{1} + *c*_{2}), and therefore the solution is .

We need to verify that the solution to the dual problem indeed gets the optimal value for the primal. Because the primal problem is convex, it is sufficient to verify that the Karush-Kuhn-Tucker (KKT) conditions hold (Boyd and Vandenberghe 2004). Indeed, we have where stands for the equality constraint. The rest of the KKT conditions trivially hold, therefore *β*^{*} is the optimal solution for Equations (B.1)–(B.2).

If *γ* < *c*_{i}/(*c*_{1} + *c*_{2}) < 1 − *γ*, then this is the solution even when again adding the constraints in Equations (B.3) and (B.4). When *c*_{1}/(*c*_{1} + *c*_{2}) < *γ*, then the solution is *β*^{*}_{1} = *γ* and *β*^{*}_{2} = 1 − *γ*. Similarly, when *c*_{2}/(*c*_{1} + *c*_{2}) < *γ* then the solution is *β*^{*}_{1} = 1 − *γ* and *β*^{*}_{2} = *γ*. We describe why this is true for the first case. The second case follows very similarly. Assume *c*_{1}/(*c*_{1} + *c*_{2}) < *γ*. We want to show that for any choice of *β* ∈ [0,1] such that *β* > *γ* we have

Equation (B.6) is precisely the definition of the KL divergence between the distribution of a coin with probability *γ* of heads and the distribution of a coin with probability *β* of heads, and therefore the right side in Equation (B.6) is positive, and we get what we need.
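The case analysis above amounts to clipping the relative frequency into [*γ*, 1 − *γ*]. The following sketch implements that closed form for a single two-outcome multinomial and compares it against a brute-force grid search over the feasible interval. It assumes 0 < *γ* < 1/2 and *c*_{1} + *c*_{2} > 0 so that all logarithms are finite, and the function names are hypothetical.

```python
import math


def solve_truncated(c1, c2, gamma):
    """Maximizer of c1*log(b1) + c2*log(b2) subject to b1 + b2 = 1 and
    gamma <= b1, b2 <= 1 - gamma: clip the relative frequency into the box."""
    b1 = min(max(c1 / (c1 + c2), gamma), 1.0 - gamma)
    return b1, 1.0 - b1


def objective(c1, c2, b1):
    """Weighted log-likelihood of a coin with heads probability b1."""
    return c1 * math.log(b1) + c2 * math.log(1.0 - b1)


def grid_argmax(c1, c2, gamma, steps=10000):
    """Brute-force maximizer over an evenly spaced grid of feasible b1 values."""
    grid = [gamma + (1.0 - 2.0 * gamma) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda b1: objective(c1, c2, b1))


# Relative frequency 0.1 falls below the margin 0.2, so the constraint is active.
print(solve_truncated(1, 9, 0.2), grid_argmax(1, 9, 0.2))
# Relative frequency 0.3 lies inside the box, so the unconstrained MLE is kept.
print(solve_truncated(3, 7, 0.2), grid_argmax(3, 7, 0.2))
```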

## Appendix C. Counterexample to Tsybakov Noise (Proofs)

**Lemma 6**

*A* = *A*_{G}(θ) is positive semi-definite for any probabilistic grammar 〈*G*, θ〉.

**Proof**

**Lemma 7**

**Proof**

*t*(*ε*) ≤ 0 if First, show that which happens if (after substituting *a* = *α*_{1}*μ*, *b* = *α*_{2}*μ*) Note we have *α*_{1}*α*_{2} > 1 because *c*_{1} ≤ *c*_{2}. In addition, we have *α*_{1} + *α*_{2} − 2 ≥ 0 for small enough *ε* (this can be shown by taking the derivative, with respect to *ε*, of *α*_{1} + *α*_{2} − 2, which is always positive for small enough *ε*, and in addition, noticing that the value of *α*_{1} + *α*_{2} − 2 is 0 when *ε* = 0). Therefore, Equation (C.2) is true.

*t*(*ε*) ≤ 0 if which is equivalent to Taking again the derivative of the left side of Equation (C.3), we have that it is an increasing function of *ε* (if *c*_{1} ≤ *c*_{2}), and in addition at *ε* = 0 it obtains the value *c*_{1} + *c*_{2}. Therefore, Equation (C.3) holds, and therefore *t*(*ε*) ≤ 0 for small enough *ε*.▪

**Theorem 5**

Let *G* be a grammar with *K* ≥ 2 and degree 2. Assume that *p* is 〈*G*, θ*〉 for some θ*, such that and that *c*_{1} ≤ *c*_{2}. If *A*_{G}(θ*) is positive definite, then *p* does not satisfy the Tsybakov noise condition for any (*C*, *κ*), where *C* > 0 and *κ* ≥ 1.

**Proof**

Define *λ* to be the eigenvalue of *A*_{G}(θ) with the smallest value (*λ* is positive). Also, define **v**(θ) to be a vector indexed by *k*, *i* such that Simple algebra shows that for any (and the fact that ), we have

For a *C* > 0 and *κ* ≥ 1, define *α* = *Cε*^{1/κ}. Let *ε* < *α*. First, we construct an *h* such that *D*_{KL}(*p*||*h*) < *ε* + *ε*/2 but dist(*p*, *h*) > *Cε*^{1/κ} as *ε* → 0. The construction follows. Parametrize *h* by θ such that θ is identical to θ^{*} except for *k* = 1,2, in which case we have Note that *μ* ≤ *θ*_{1,1} ≤ 1/2 and *θ*_{2,1} < *μ*. Then, we have that We also have if (This can be shown by dividing Equation [C.6] by *c*_{1} + *c*_{2} and then using the concavity of the logarithm function.) From Lemma 7, we have that Equation (C.7) holds. Therefore,

Now, consider the following, which can be shown through algebraic manipulation: Then, additional algebraic simplification shows that *λ* is the smallest eigenvalue in *A*. From the construction of *θ* and Equations (C.4)–(C.5), we have that . Therefore, which means . Therefore, *p* does not satisfy the Tsybakov noise condition with parameters (*D*, *κ*) for any *D* > 0.

## Appendix D. Notation

Table D.1 gives a table of notation for symbols used throughout this article.

## Acknowledgements

The authors thank the anonymous reviewers for their comments and Avrim Blum, Steve Hanneke, Mark Johnson, John Lafferty, Dan Roth, and Eric Xing for useful conversations. This research was supported by National Science Foundation grant IIS-0915187.

## Notes

It is important to remember that minimizing the log-loss does not equate to minimizing the error of a linguistic analyzer or natural language processing application. In this article we focus on the log-loss case because we believe that probabilistic models of language phenomena have inherent usefulness as explanatory tools in computational linguistics, aside from their use in systems.

We note that itself is a random variable, because it depends on the sample drawn from *p*.

We note that being able to attain the minimum through a hypothesis *q** is not necessarily possible in the general case. In our instantiations of ERM for probabilistic grammars, however, the minimum can be attained. In fact, in the unsupervised case the minimum can be attained by more than a single hypothesis. In these cases, *q** is arbitrarily chosen to be one of these minimizers.

Treebanks offer samples of cleanly segmented sentences. It is important to note that the distributions estimated may not generalize well to samples from other domains in these languages. Our argument is that the family of the estimated curve is reasonable, not that we can correctly estimate the curve's parameters.

For simplicity and consistency with the log-loss, we measure entropy in nats, which means we use the natural logarithm when computing entropy.

We note that this notion of binarization is different from previous types of binarization appearing in computational linguistics for grammars. Typically in previous work about binarized grammars such as CFGs, the grammars are constrained to have at most two nonterminals in the right side in Chomsky normal form. Another form of binarization for linear context-free rewriting systems is restriction of the *fan-out* of the rules to two (Gómez-Rodríguez and Satta 2009; Gildea 2010). We, however, limit the number of *rules* for each nonterminal (or more generally, the number of elements in each multinomial).

There are other ways to manage the unboundedness of KL divergence in the language learning literature. Clark and Thollard (2004), for example, decompose the KL divergence between probabilistic finite-state automata into several terms according to a decomposition of Carrasco (1997) and then bound each term separately.

By varying *s* we get a family of approximations. The larger *s* is, the tighter the approximation is. Also, the larger *s* is, as we see later, the looser our sample complexity bound will be.

The “permissible class” requirement is a mild regularity condition regarding measurability that holds for proper approximations. We refer the reader to Pollard (1984) for more details.


## Author notes

Department of Computer Science, Columbia University, New York, NY 10027, United States. E-mail: scohen@cs.columbia.edu. This research was completed while the first author was at Carnegie Mellon University.

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States. E-mail: nasmith@cs.cmu.edu.