Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. They are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. By making assumptions about the underlying distribution that are appropriate for natural language scenarios, we are able to derive distribution-dependent sample complexity bounds for probabilistic grammars. We also give simple algorithms for carrying out empirical risk minimization using this framework in both the supervised and unsupervised settings. In the unsupervised case, we show that the problem of minimizing empirical risk is NP-hard. We therefore suggest an approximate algorithm, similar to expectation-maximization, to minimize the empirical risk.

Learning from data is central to contemporary computational linguistics. It is common in such learning to estimate a model in a parametric family using the maximum likelihood principle. This principle applies in the supervised case (i.e., using annotated data) as well as in semisupervised and unsupervised settings (i.e., using unannotated data). Probabilistic grammars constitute a range of such parametric families we can estimate (e.g., hidden Markov models, probabilistic context-free grammars). These parametric families are used in diverse NLP problems ranging from syntactic and morphological processing to applications like information extraction, question answering, and machine translation.

Estimation of probabilistic grammars, in many cases, indeed starts with the principle of maximum likelihood estimation (MLE). In the supervised case, and with traditional parametrizations based on multinomial distributions, MLE amounts to normalization of rule frequencies as they are observed in data. In the unsupervised case, on the other hand, algorithms such as expectation-maximization are available. MLE is attractive because it offers statistical consistency if some conditions are met (i.e., if the data are distributed according to a distribution in the family, then we will discover the correct parameters if sufficient data is available). In addition, under some conditions it is also an unbiased estimator.

An issue that has been far less explored in the computational linguistics literature is the sample complexity of MLE. Here, we are interested in quantifying the number of samples required to accurately learn a probabilistic grammar either in a supervised or in an unsupervised way. If bounds on the requisite number of samples (known as “sample complexity bounds”) are sufficiently tight, then they may offer guidance to learner performance, given various amounts of data and a wide range of parametric families. Being able to reason analytically about the amount of data to annotate, and the relative gains in moving to a more restricted parametric family, could offer practical advantages to language engineers.

We note that grammar learning has been studied in formal settings as a problem of grammatical inference—learning the structure of a grammar or an automaton (Angluin 1987; Clark and Thollard 2004; de la Higuera 2005; Clark, Eyraud, and Habrard 2008, among others). Our setting in this article is different. We assume that we have a fixed grammar, and our goal is to estimate its parameters. This approach has shown great empirical success, both in the supervised (Collins 2003; Charniak and Johnson 2005) and the unsupervised (Carroll and Charniak 1992; Pereira and Schabes 1992; Klein and Manning 2004; Cohen and Smith 2010a) settings. There has also been some discussion of sample complexity bounds for statistical parsing models, in a distribution-free setting (Collins 2004). The distribution-free setting, however, is not ideal for analysis of natural language, as it has to account for pathological cases of distributions that generate data.

We develop a framework for deriving sample complexity bounds using the maximum likelihood principle for probabilistic grammars in a distribution-dependent setting. Distribution dependency is introduced here by making empirically justified assumptions about the distributions that generate the data. Our framework uses and significantly extends ideas that have been introduced for deriving sample complexity bounds for probabilistic graphical models (Dasgupta 1997). Maximum likelihood estimation is put in the empirical risk minimization framework (Vapnik 1998) with the loss function being the log-loss. Following that, we develop a set of learning theoretic tools to explore rates of estimation convergence for probabilistic grammars. We also develop algorithms for performing empirical risk minimization.

Much research has been devoted to the problem of learning finite state automata (which can be thought of as a class of grammars) in the Probably Approximately Correct setting, leading to the conclusion that it is a very hard problem (Kearns and Valiant 1989; Pitt 1989; Terwijn 2002). Typically, the setting in these cases is different from our setting: Error is measured as the probability mass of strings that are not identified correctly by the learned finite state automaton, instead of measuring KL divergence between the automaton and the true distribution. In addition, in many cases, there is also a focus on the distribution-free setting. To the best of our knowledge, it is still an open problem whether finite state automata are learnable in the distribution-dependent setting when measuring the error as the fraction of misidentified strings. Other work (Ron 1995; Ron, Singer, and Tishby 1998; Clark and Thollard 2004; Palmer and Goldberg 2007) treats probabilistic automata with error measures that are more suitable for the probabilistic setting, such as Kullback-Leibler (KL) divergence or variation distance. These works also focus on learning the structure of finite state machines. As mentioned earlier, in our setting we assume that the grammar is fixed, and that our goal is to estimate its parameters.

We note an important connection to an earlier study about the learnability of probabilistic automata and hidden Markov models by Abe and Warmuth (1992). In that study, the authors provided positive results for the sample complexity for learning probabilistic automata—they showed that a polynomial sample is sufficient for MLE. We demonstrate positive results for the more general class of probabilistic grammars which goes beyond probabilistic automata. Abe and Warmuth also showed that the problem of finding or even approximating the maximum likelihood solution for a two-state probabilistic automaton with an alphabet of an arbitrary size is hard. Even though these results extend to probabilistic grammars to some extent, we provide a novel proof that illustrates the NP-hardness of identifying the maximum likelihood solution for probabilistic grammars in the specific framework of “proper approximations” that we define in this article. Whereas Abe and Warmuth show that the problem of maximum likelihood maximization for two-state HMMs is not approximable within a certain factor in time polynomial in the alphabet and the length of the observed sequence, we show that there is no polynomial algorithm (in the length of the observed strings) that identifies the maximum likelihood estimator in our framework. In our reduction, from 3-SAT to the problem of maximum likelihood estimation, the alphabet used is binary and the grammar size is proportional to the length of the formula. In Abe and Warmuth, the alphabet size varies, and the number of states is two.

This article proceeds as follows. In Section 2 we review the necessary background from Vapnik's (1998) empirical risk minimization framework. This framework is reduced to maximum likelihood estimation when a specific loss function is used: the log-loss.1 There are some shortcomings in using the empirical risk minimization (ERM) framework in its simplest form, which is distribution-free: we make no assumptions about the distribution that generated the data. Naively attempting to apply the ERM framework to probabilistic grammars in the distribution-free setting does not lead to the desired sample complexity bounds. The reason for this is that the log-loss diverges whenever small probabilities are allocated in the learned hypothesis to structures or strings that have a rather large probability under the distribution that generates the data. With a distribution-free assumption, therefore, we would have to give treatment to distributions that are unlikely to be true for natural language data (e.g., where some extremely long sentences are very probable).

To correct for this, we move to an analysis in a distribution-dependent setting, by presenting a set of assumptions about the distribution that generates the data. In Section 3 we discuss probabilistic grammars in a general way and introduce assumptions about the true distribution that are reasonable when our data come from natural language examples. It is important to note that this distribution need not be a probabilistic grammar.

The next step we take, in Section 4, is approximating the set of probabilistic grammars over which we maximize likelihood. This is again required in order to overcome the divergence of the log-loss for probabilities that are very small. Our approximations are based on bounded approximations that have been used for deriving sample complexity bounds for graphical models in a distribution-free setting (Dasgupta 1997).

Our approximations have two important properties: They are, by themselves, probabilistic grammars from the family we are interested in estimating, and they become a tighter approximation around the family of probabilistic grammars we are interested in estimating as more samples are available.

Moving to the distribution-dependent setting and defining proper approximations enables us to derive sample complexity bounds. In Section 5 we present the sample complexity results for both the supervised and unsupervised cases. A question that lingers at this point is whether it is computationally feasible to maximize likelihood in our framework even when given enough samples.

In Section 6, we describe algorithms we use to estimate probabilistic grammars in our framework, when given access to the required number of samples. We show that in the supervised case, we can indeed maximize likelihood in our approximation framework using a simple algorithm. For the unsupervised case, however, we show that maximizing likelihood is NP-hard. This fact is related to a notion known in the learning theory literature as inherent unpredictability (Kearns and Vazirani 1994): Accurate learning is computationally hard even with enough samples. To overcome this difficulty, we adapt the expectation-maximization algorithm (Dempster, Laird, and Rubin 1977) to approximately maximize likelihood (or minimize log-loss) in the unsupervised case with proper approximations.

In Section 7 we discuss some related ideas. These include the failure of an alternative kind of distributional assumption and connections to regularization by maximum a posteriori estimation with Dirichlet priors. Longer proofs are included in the appendices. A table of notation that is used throughout is included as Table D.1 in Appendix D.

This article builds on two earlier papers. In Cohen and Smith (2010b) we presented the main sample complexity results described here; the present article includes significant extensions, a deeper analysis of our distributional assumptions, and a discussion of variants of these assumptions, as well as related work, such as that about the Tsybakov noise condition. In Cohen and Smith (2010c) we proved NP-hardness for unsupervised parameter estimation of probabilistic context-free grammars (PCFGs) (without approximate families). The present article uses a similar type of proof to achieve results adapted to empirical risk minimization in our approximation framework.

We begin by introducing some notation. We seek to construct a predictive model that maps inputs (strings over some alphabet) to outputs (derivations allowed by a grammar, e.g., a context-free grammar). We assume the existence of an unknown joint probability distribution p(x,z) over input–output pairs. (For the most part, we will be discussing discrete input and output spaces. This means that p will denote a probability mass function.) We are interested in estimating the distribution p from examples, either in a supervised setting, where we are provided with examples of the form (x, z), or in the unsupervised setting, where we are provided only with examples of the form x. We first consider the supervised setting and return to the unsupervised setting in Section 5. We will use q to denote the estimated distribution.

In order to estimate p as accurately as possible using q(x,z), we are interested in minimizing the log-loss, that is, in finding qopt, from a fixed family of distributions (also called “the concept space”), such that
Note that if p belongs to the concept space, then this quantity achieves its minimum at qopt = p, in which case the value of the log-loss is the entropy of p. Indeed, more generally, this optimization is equivalent to finding the q that minimizes the KL divergence from p to q.
Because p is unknown, we cannot hope to minimize the log-loss directly. Given a set of examples (x1,z1),…,(xn,zn), however, there is a natural candidate, the empirical distribution , for use in Equation (1) instead of p, defined as:
where the indicator function is 1 if (x,z) = (xi,zi) and 0 otherwise.2 We then set up the problem as one of empirical risk minimization (ERM), that is, trying to find q such that
Equation (3) immediately shows that minimizing empirical risk using the log-loss is equivalent to maximizing likelihood, which is a common statistical principle used for estimating a probabilistic grammar in computational linguistics (Charniak 1993; Manning and Schütze 1999).3
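For concreteness, these quantities can be written out in standard form; the symbols $\mathcal{Q}$ (for the concept space) and $\tilde{p}_n$ (for the empirical distribution) are stand-ins introduced here for readability rather than the article's own notation:

\begin{align*}
q_{\mathrm{opt}} &= \arg\min_{q \in \mathcal{Q}} \; \mathbb{E}_p\left[-\log q(x,z)\right], \\
\tilde{p}_n(x,z) &= \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left\{(x,z) = (x_i,z_i)\right\}, \\
q^{*} &= \arg\min_{q \in \mathcal{Q}} \left(-\frac{1}{n}\sum_{i=1}^{n} \log q(x_i,z_i)\right),
\end{align*}

so that minimizing the empirical risk under the log-loss coincides with maximizing the likelihood of the sample.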
As mentioned earlier, our goal is to estimate the probability distribution p while quantifying how accurate our estimate is. One way to quantify the estimation accuracy is by bounding the excess risk, which is defined as

We are interested in bounding the excess risk of q*. The excess risk reduces to the KL divergence between p and q* when p itself belongs to the concept space, because in this case the quantity is minimized by q′ = p and equals the entropy of p. In the typical case, where p does not necessarily belong to the concept space, the excess risk of q is bounded from above by the KL divergence between p and q.

We can bound the excess risk by showing the double-sided convergence of the empirical process , defined as follows:
as n → ∞. For any ε > 0, if, for large enough n it holds that
(with high probability), then we can “sandwich” the following quantities:
where the inequalities come from the fact that qopt minimizes the expected risk and q* minimizes the empirical risk. The consequence of Equations (7) and (8) is that the expected risk of q* is at most 2ε away from the expected risk of qopt, and, as a result, the excess risk of q*, for large enough n, is smaller than 2ε. Intuitively, this means that, given a large sample, q* does not give much worse results than qopt under the criterion of the log-loss.
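Spelled out, the "sandwich" is the standard argument (writing $\mathbb{E}_p[\cdot]$ for the expected risk and $\hat{\mathbb{E}}_n[\cdot]$ for the empirical risk; these operator symbols are introduced here only for this display):

\[
\mathbb{E}_p[-\log q^{*}]
\;\le\; \hat{\mathbb{E}}_n[-\log q^{*}] + \varepsilon
\;\le\; \hat{\mathbb{E}}_n[-\log q_{\mathrm{opt}}] + \varepsilon
\;\le\; \mathbb{E}_p[-\log q_{\mathrm{opt}}] + 2\varepsilon,
\]

where the first and last steps use the uniform convergence condition of Equation (6) and the middle step uses the fact that $q^{*}$ minimizes the empirical risk.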

Unfortunately, the regularity conditions required for this convergence do not hold in general, because the log-loss can be unbounded. This means that the empirical process must be modified in a way that actually guarantees some kind of convergence. We give a treatment of this in the next section.

We note that all discussion of convergence in this section has been about convergence in probability. For example, we want Equation (6) to hold with high probability—for most samples of size n. We will make this notion more rigorous in Section 2.2.

2.1 Empirical Risk Minimization and Structural Risk Minimization Methods

It has been noted in the literature (Vapnik 1998; Koltchinskii 2006) that the concept space is often too complex for empirical risk minimization using a fixed number of data points. It is therefore desirable in these cases to create a family of subclasses of increasing complexity. The more data we have, the more complex the subclass used for empirical risk minimization can be. Structural risk minimization (Vapnik 1998) and the method of sieves (Grenander 1981) are examples of methods that adopt such an approach. Structural risk minimization, for example, can in many cases be represented as a penalization of the empirical risk method, using a regularization term.

In our case, the level of “complexity” is related to allocation of small probabilities to derivations in the grammar by a distribution . The basic problem is this: Whenever we have a derivation with a small probability, the log-loss becomes very large (in absolute value), and this makes it hard to show the convergence of the empirical process . Because grammars can define probability distributions over infinitely many discrete outcomes, probabilities can be arbitrarily small and log-loss can be arbitrarily large.

To solve this issue with the complexity of the concept space, we define in Section 4 a series of approximations for probabilistic grammars, each contained in the original space. Our framework for empirical risk minimization is then set up to minimize the empirical risk with respect to the nth approximation, where n is the number of samples we draw for the learner:
We are then interested in the convergence of the empirical process

In Section 4 we show that the minimizer is an asymptotic empirical risk minimizer (in our specific framework), which means that . Because we have , the implication of having asymptotic empirical risk minimization is that we have .

2.2 Sample Complexity Bounds

Knowing that we are interested in the convergence of , a natural question to ask is: “At what rate does this empirical process converge?”

Because the quantity is a random variable, we need to give a probabilistic treatment to its convergence. More specifically, we ask the question that is typically asked when learnability is considered (Vapnik 1998): “How many samples n are required so that with probability 1 − δ we have ?” Bounds on this number of samples are also called “sample complexity bounds,” and in a distribution-free setting they are described as a function , independent of the distribution p that generates the data.

A complete distribution-free setting is not appropriate for analyzing natural language. This setting poses technical difficulties with the convergence of and needs to take into account pathological cases that can be ruled out in natural language data. Instead, we will make assumptions about p, parametrize these assumptions in several ways, and then calculate sample complexity bounds of the form , where the dependence on the distribution is expressed as dependence on the parameters in the assumptions about p.

The learning setting, then, can be described as follows. The user decides on a level of accuracy (ε) that the learning algorithm has to reach with confidence (1 − δ). Then, samples are drawn from p and presented to the learning algorithm. The learning algorithm then returns a hypothesis according to Equation (9).

We begin this section by discussing the family of probabilistic grammars. A probabilistic grammar defines a probability distribution over a certain kind of structured object (a derivation of the underlying symbolic grammar) explained step-by-step as a stochastic process. Hidden Markov models (HMMs), for example, can be understood as a random walk through a probabilistic finite-state network, with an output symbol sampled at each state. PCFGs generate phrase-structure trees by recursively rewriting nonterminal symbols as sequences of “child” symbols (each itself either a nonterminal symbol or a terminal symbol analogous to the emissions of an HMM).

Each step or emission of an HMM and each rewriting operation of a PCFG is conditionally independent of the others given a single structural element (one HMM or PCFG state); this Markov property permits efficient inference over derivations given a string.

In general, a probabilistic grammar 〈G, θ〉 defines the joint probability of a string x and a grammatical derivation z:
where ψk,i is a function that "counts" the number of times the kth distribution's ith event occurs in the derivation. The parameters θ are a collection of K multinomials 〈θ1, … , θK〉, the kth of which includes Nk competing events. If we let θk = 〈θk,1, … , θk,Nk〉, then each θk,i is a probability, such that θk,i ≥ 0 and ∑i θk,i = 1.

We denote by ΘG this parameter space for θ. The grammar G dictates the support of q in Equation (11). As is often the case in probabilistic modeling, there are different ways to carve up the random variables. We can think of x and z as correlated structure variables (often x is known if z is known), or the derivation event counts as an integer-vector random variable. In this article, we assume that x is always a deterministic function of z, so we use the distribution p(z) interchangeably with p(x,z).
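Written out from this description (the article's Equation [11]), the joint probability takes the standard multinomial form

\[
p(x, z \mid \theta, G) \;=\; \prod_{k=1}^{K} \prod_{i=1}^{N_k} \theta_{k,i}^{\,\psi_{k,i}(x,z)} ,
\]

where the exponent $\psi_{k,i}(x,z)$ is the count of the $k$th multinomial's $i$th event in the derivation.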

Note that there may be many derivations z for a given string x—perhaps even infinitely many in some kinds of grammars. For HMMs, there are three kinds of multinomials: a starting state multinomial, a transition multinomial per state and an emission multinomial per state. In that case K = 2s + 1, where s is the number of states. The value of Nk depends on whether the kth multinomial is the starting state multinomial (in which case Nk = s), transition multinomial (Nk = s), or emission multinomial (Nk = t, with t being the number of symbols in the HMM). For PCFGs, each multinomial among the K multinomials corresponds to a set of Nk context-free rules headed by the same nonterminal. The parameter θk,i is then the probability of the ith rule for the kth nonterminal.
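As a quick check of this bookkeeping for an HMM with s states and t output symbols (a worked count, following the description above):

\[
K = 2s + 1, \qquad
\sum_{k=1}^{K} N_k \;=\; \underbrace{s}_{\text{start}} + \underbrace{s \cdot s}_{\text{transitions}} + \underbrace{s \cdot t}_{\text{emissions}} .
\]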

We assume that G denotes a fixed grammar, such as a context-free or regular grammar. We let N denote the total number of derivation event types. We use D(G) to denote the set of all possible derivations of G. We define Dx(G) = {z ∈ D(G) | yield(z) = x}. We use deg(G) to denote the "degree" of G, i.e., deg(G) = maxk Nk. We let |x| denote the length of the string x, and |z| the "length" (number of event tokens) of the derivation z.

Going back to the notation in Section 2, the concept space would be a collection of probabilistic grammars, parametrized by θ, and q would be a specific probabilistic grammar with a specific θ. We therefore treat the problem of ERM with probabilistic grammars as the problem of parameter estimation—identifying θ from complete data or incomplete data (strings x are visible but the derivations z are not). We can also view parameter estimation as the identification of a hypothesis from the concept space (where hθ is a distribution of the form of Equation [11]) or, equivalently, from the negated log-concept space. For simplicity of notation, we assume that there is a fixed grammar G and suppress it in the notation for these concept spaces.

3.1 Distributional Assumptions about Language

In this section, we describe a parametrization of assumptions we make about the distribution p(x,z), the distribution that generates derivations from D(G) (note that p does not have to be a probabilistic grammar). We first describe empirical evidence about the decay of the frequency of long strings x.

Figure 1 shows the frequency of sentence length for treebanks in various languages.4 The trend in the plots clearly shows that in the extended tail of the curve, all languages have an exponential decay of probabilities as a function of sentence length. To test this, we performed a simple regression of frequencies using an exponential curve. We estimated each curve for each language using a curve of the form f(l; c, α) = cα^l. This estimation was done by minimizing squared error between the frequency versus sentence length curve and the approximate version of this curve. The data points used for the approximation are (li, pi), where li denotes sentence length and pi denotes frequency, selected from the extended tail of the distribution. Extended tail here refers to all points with length longer than l1, where l1 is the length with the highest frequency in the treebank. The goal of focusing on the tail is to avoid approximating the head of the curve, which is actually a monotonically increasing function. We plotted the approximate curve together with a length versus frequency curve for new syntactic data. It can be seen (Figure 1) that the approximation is rather accurate in these corpora.
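The regression just described is easy to reproduce; the following is a minimal sketch, assuming length–frequency pairs extracted from a treebank (the helper names and the use of SciPy's curve fitting are illustrative choices, not the article's code):

import numpy as np
from scipy.optimize import curve_fit

def f(l, c, alpha):
    # Exponential decay curve f(l; c, alpha) = c * alpha**l.
    return c * alpha ** l

def fit_tail(lengths, freqs):
    """Least-squares fit of the exponential curve to the extended tail of a
    sentence-length frequency curve (tail = lengths beyond the modal length)."""
    lengths = np.asarray(lengths, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    tail = lengths > lengths[np.argmax(freqs)]  # drop the increasing "head"
    (c, alpha), _ = curve_fit(f, lengths[tail], freqs[tail], p0=(0.1, 0.9))
    return c, alpha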
Figure 1

A plot of the tail of frequency vs. sentence length in treebanks for English, German, Bulgarian, Turkish, Spanish, and Chinese. Red lines denote data from the treebank, blue lines denote an approximation which uses an exponential function of the form f(l; c, α) = cα^l (the blue line uses data which is different from the data used to estimate the curve parameters, c and α). The parameters (c, α) are (0.19, 0.92) for English, (0.06, 0.94) for German, (0.26, 0.89) for Bulgarian, (0.26, 0.83) for Turkish, (0.11, 0.93) for Spanish, and (0.03, 0.97) for Chinese. Squared errors are 0.0005, 0.0003, 0.0007, 0.0003, 0.001, and 0.002 for English, German, Bulgarian, Turkish, Spanish, and Chinese, respectively.


As a consequence of this observation, we make a few assumptions about G and p(x,z):

  • Derivation length proportional to sentence length: There is an α ≥ 1 such that, for all z, |z| ≤ α|yield(z)|. Further, |z| ≥ |x|. (This prohibits unary cycles.)

  • Exponential decay of derivations: There is a constant r < 1 and a constant L ≥ 0 such that p(z) ≤ Lr^{|z|}. Note that the assumption here is about the frequency of individual derivations as a function of their length, and not about the aggregated frequency of all sentences of a certain length (cf. the discussion above referring to Figure 1).

  • Exponential decay of strings: Let Λ(k) = |{z ∈ D(G) : |z| = k}| be the number of derivations of length k in G. We assume that Λ(k) is an increasing function, and we complete it so that it is defined over the positive reals. Taking r as before, we assume there exists a constant q < 1 such that Λ²(k)r^k ≤ q^k (and, as a consequence, Λ(k)r^k ≤ q^k). This implies that the number of derivations of length k may be exponentially large (e.g., as with many PCFGs), but is bounded by (q/r)^k.

  • Bounded expectations of rules: There is a B < ∞ such that Ep[ψk,i] ≤ B for all k and i.

These assumptions must hold for any p whose support consists of a finite set. These assumptions also hold in many cases when p itself is a probabilistic grammar. Also, we note that the last requirement of bounded expectations is optional, and it can be inferred from the rest of the requirements: B = L/(1 − q)². We make this requirement explicit for simplicity of notation later. We denote the family of distributions that satisfy all of these requirements by .
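A short sketch of why the bounded-expectations requirement indeed follows from the others (using ψk,i(z) ≤ |z| and the constants defined in the list above; ℓ indexes derivation length):

\[
\mathbb{E}_p[\psi_{k,i}]
\;\le\; \mathbb{E}_p[|z|]
\;=\; \sum_{\ell \ge 1} \ell \sum_{z : |z| = \ell} p(z)
\;\le\; \sum_{\ell \ge 1} \ell\, \Lambda(\ell)\, L r^{\ell}
\;\le\; L \sum_{\ell \ge 1} \ell\, q^{\ell}
\;=\; \frac{Lq}{(1-q)^{2}}
\;\le\; \frac{L}{(1-q)^{2}} .
\]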

There are other cases in the literature of language learning where additional assumptions are made on the learned family of models in order to obtain positive learnability results. For example, Clark and Thollard (2004) put a bound on the expected length of strings generated from any state of probabilistic finite state automata, which resembles the exponential decay of strings we have for p in this article.

An immediate consequence of these assumptions is that the entropy of p is finite and bounded by a quantity that depends on L, r, and q.5 Bounding the entropy of labels (derivations) given inputs (sentences) is a common way to quantify the noise in a distribution. Here, both the sentential entropy (Hs(p) = −∑x p(x) log p(x)) and the derivational entropy (Hd(p) = −∑x,z p(x,z) log p(x,z)) are bounded. This is stated in the following result.

Proposition 1

Let be a distribution. Then, we have

Proof

First note that Hs(p) ≤ Hd(p) holds by the data processing inequality (Cover and Thomas 1991) because the sentential probability distribution p(x) is a coarser version of the derivational probability distribution p(x,z). Now, consider p(x,z). For simplicity of notation, we use p(z) instead of p(x,z). The yield of z, x, is a function of z, and therefore can be omitted from the distribution. It holds that
where Z1 = {z | p(z) > 1/e} and Z2 = {z | p(z) ≤ 1/e}. Note that the function −α log α reaches its maximum for α = 1/e. We therefore have
We give a bound on |Z1|, the number of "high probability" derivations. Because we have p(x,z) ≤ Lr^{|z|}, we can find the maximum length of a derivation that has probability more than 1/e (and hence may appear in Z1) by solving 1/e ≤ Lr^{|z|} for |z|, which leads to |z| ≤ log(1/(eL))/log r. Therefore, there are at most derivations in Z1, and hence we have
where we use the monotonicity of Λ. Consider Hd(p,Z2) (the “low probability” derivations). We have:
where Equation (13) holds from the assumptions about p. Putting Equation (12) and Equation (14) together, we obtain the result.▪

We note that another common way to quantify the noise in a distribution is through the notion of Tsybakov noise (Tsybakov 2004; Koltchinskii 2006). We discuss this further in Section 7.1, where we show that Tsybakov noise is too permissive, and probabilistic grammars do not satisfy its conditions.

3.2 Limiting the Degree of the Grammar

When approximating a family of probabilistic grammars, it is much more convenient when the degree of the grammar is limited. In this article, we limit the degree of the grammar by making the assumption that all Nk ≤ 2. This assumption may seem, at first glance, somewhat restrictive, but we show next that for PCFGs (and as a consequence, other formalisms), this assumption does not limit the total generative capacity that we can have across all context-free grammars.

We first show that any context-free grammar with arbitrary degree can be mapped to a corresponding grammar with all Nk ≤ 2 that generates derivations equivalent to derivations in the original grammar. Such a grammar is also called a "covering grammar" (Nijholt 1980; Leermakers 1989). Let G be a CFG. Let A be the kth nonterminal, and consider the rules A → αi for i ≤ Nk in which A appears on the left side. For each rule A → αi with i < Nk, we create a new nonterminal Ai in G′ such that Ai has two rewrite rules: Ai → αi and Ai → Ai+1. In addition, we create the rule A → A1 and the rule ANk → αNk. Figure 2 demonstrates an example of this transformation on a small context-free grammar.
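A small sketch of this transformation on explicit rule lists (the grammar representation and function names are illustrative, not the article's):

def binarize(rules_by_lhs):
    """Given a dict mapping a nonterminal A to its list of right-hand sides
    [alpha_1, ..., alpha_Nk], return a covering grammar in which every
    nonterminal has at most two rules, following the construction in the text."""
    new_rules = []
    for A, alternatives in rules_by_lhs.items():
        n = len(alternatives)
        if n <= 2:
            new_rules += [(A, alt) for alt in alternatives]
            continue
        new_rules.append((A, [f"{A}_1"]))               # A -> A_1
        for i, alt in enumerate(alternatives, start=1):
            Ai = f"{A}_{i}"
            new_rules.append((Ai, alt))                 # A_i -> alpha_i
            if i < n:
                new_rules.append((Ai, [f"{A}_{i+1}"]))  # A_i -> A_{i+1}
        # The last new nonterminal A_n keeps the single rule A_n -> alpha_n.
    return new_rules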
Figure 2

Example of a context-free grammar and its equivalent binarized form.


It is easy to verify that the resulting grammar G′ has generative capacity equivalent to that of the original CFG G. A simple transformation that converts each derivation in the new grammar to a derivation in the old grammar involves collapsing any path of nonterminals added to G′ (i.e., all Ai for a nonterminal A) so that we end up with nonterminals from the original grammar only. Similarly, any derivation in G can be converted to a derivation in G′ by adding the new nonterminals through unary applications of rules of the form Ai → Ai+1. Given a derivation z in G, we denote by z′ the corresponding derivation in G′ after adding the new nonterminals Ai to z. Throughout this article, we will refer to the normalized form of G′ as a "binary normal form."6

Note that K′, the number of multinomials in the binary normal form, is a function of both the number of nonterminals in the original grammar and the number of rules in that grammar. More specifically, we have that . To make the equivalence complete, we need to show that any probabilistic context-free grammar can be translated to a PCFG with maxk Nk ≤ 2 such that the two PCFGs induce equivalent distributions over derivations.

Utility Lemma 1

Let ai ∈ [0,1], i ∈ {1, … , N}, be such that ∑i ai = 1. Define b1 = a1, c1 = 1 − a1, and, for i ≥ 2, bi = ai / (c1⋯ci−1) and ci = 1 − bi. Then ai = bi ∏j<i cj for all i.

See Appendix A for the proof of Utility Lemma 1.
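As a quick numerical check of this construction (a worked example, not taken from the article), take a = (0.2, 0.3, 0.5):

\[
b_1 = 0.2,\quad c_1 = 0.8,\qquad
b_2 = \tfrac{0.3}{0.8} = 0.375,\quad c_2 = 0.625,\qquad
b_3 = \tfrac{0.5}{0.8 \cdot 0.625} = 1,
\]

and indeed $b_2 c_1 = 0.3$ and $b_3 c_1 c_2 = 0.5$, so each original probability $a_i$ is recovered as $b_i \prod_{j<i} c_j$ by a chain of two-event (binomial) decisions.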

Theorem 1

Let 〈G, θ〉 be a probabilistic context-free grammar. Let G′ be the binarizing transformation of G as defined earlier. Then, there exists θ′ for G′ such that for any z ∈ D(G) we have p(z | θ, G) = p(z′ | θ′, G′).

Proof

For the grammar G, index the set {1, …, K} with nonterminals ranging from A1 to AK. Define G′ as before. We need to define θ′. Index the multinomials in G′ by (k,i), each having two events. Let μ(k,1),1 = θk,1 and μ(k,1),2 = 1 − θk,1, and for i ≥ 2 set μ(k,i),1 = θk,i / (μ(k,1),2⋯μ(k,i−1),2) and μ(k,i),2 = 1 − μ(k,i),1.

〈G′, μ〉 is a weighted context-free grammar such that μ(k,i),1 corresponds to the ith event in the kth multinomial of the original grammar. Let z be a derivation in G and z′ its corresponding derivation in G′. Then, from Utility Lemma 1 and the construction of 〈G′, μ〉, we have that:

From Chi (1999), we know that the weighted grammar 〈G′, μ〉 can be converted to a probabilistic context-free grammar 〈G′, θ′〉, through a construction of θ′ based on μ, such that p(z′ | μ, G′) = p(z′ | θ′, G′).▪

The proof of Theorem 1 gives a construction of the parameters θ′ of G′ such that 〈G, θ〉 is equivalent to 〈G′, θ′〉. The construction of θ′ can also be reversed: Given θ′ for G′, we can construct θ for G so that again we have equivalence between 〈G, θ〉 and 〈G′, θ′〉.

In this section, we focused on presenting parametrized, empirically justified distributional assumptions about language data that will make the analysis in later sections more manageable. We showed that these assumptions bound the amount of entropy as a function of the assumption parameters. We also made an assumption about the structure of the grammar family, and showed that it entails no loss of generality for CFGs. Many other formalisms can follow similar arguments to show that the structural assumption is justified for them as well.

In order to follow the empirical risk minimization approach described in Section 2.1, we have to define a series of approximations for the concept space, which we denote by the log-concept spaces . We also have to replace two-sided uniform convergence (Equation [6]) with convergence on the sequence of concept spaces we defined (Equation [10]). The concept spaces in the sequence vary as a function of the number of samples we have. We next construct the sequence of concept spaces, and in Section 5 we return to the learning model. Our approximations are based on the concept of bounded approximations (Abe, Takeuchi, and Warmuth 1991; Dasgupta 1997), which were originally designed for graphical models.7 A bounded approximation is a subset of a concept space whose tightness is controlled by a parameter. Here we use this idea to define a series of subsets of the original concept space as approximations, with asymptotic properties that control the tightness of the series.

Let (for m ∈ {1, 2, …}) be a sequence of concept spaces. We consider three properties of elements of this sequence, which should hold for m > M for a fixed M.

The first is containment in :
The second property is boundedness:
where εbound is a non-increasing function such that . This states that the expected contribution of functions from the approximation on values larger than some Km is small. This is required to obtain uniform convergence results in the revised empirical risk minimization model from Section 2.1. Note that Km can grow arbitrarily large.
The third property is tightness:
where εtail is a non-increasing function such that , and Cm denotes an operator that maps functions in to . This ensures that our approximation actually converges to the original concept space . We will show in Section 4.3 that this is actually a well-motivated characterization of convergence for probabilistic grammars in the supervised setting.

We say that the sequence properly approximates if there exist εtail(m), εbound(m), and Cm such that, for all m larger than some M, containment, boundedness, and tightness all hold.

In a good approximation, Km would increase at a fast rate as a function of m and εtail(m) and εbound(m) decrease quickly as a function of m. As we will see in Section 5, we cannot have an arbitrarily fast convergence rate (by, for example, taking a subsequence of ), because the size of Km has a great effect on the number of samples required to obtain accurate estimation.

4.1 Constructing Proper Approximations for Probabilistic Grammars

We now focus on constructing proper approximations for probabilistic grammars whose degree is limited to 2. Proper approximations could, in principle, be used with losses other than the log-loss, though their main use is for unbounded losses. Starting from this point in the article, we focus on using such proper approximations with the log-loss.

We construct . For each we define a transformation T(f, γ) that shifts every binomial parameter θk = 〈θk,1, θk,2〉 in the probabilistic grammar by at most γ:
Note that for any γ ≤ 1/2. Fix a constant s > 1.8 We denote by T(θ,γ) the same transformation on θ (which outputs the new shifted parameters) and we denote by ΘG(γ) = Θ(γ) the set {T(θ,γ) |θ ∈ ΘG}. For each m ∈ ℕ, define .

When considering our approach of approximating a probabilistic grammar by raising its small parameter probabilities above a certain threshold, it becomes clear why we are required to limit the grammar to have only two rules per multinomial and why we are required to use the normal form from Section 3.2 with grammars of degree 2. Consider the PCFG rules in Table 1. There are different ways to move probability mass to the rule with small probability. This leads to a problem with the identifiability of the approximation: How does one decide how to reallocate probability to the small-probability rules? By binarizing the grammar in advance, we arrive at a single way to reallocate mass when required (i.e., move mass from the high-probability rule to the low-probability rule). This leads to a simpler proof for sample complexity bounds and a single bound (rather than different bounds depending on different smoothing operators). We note, however, that the choices made in binarizing the grammar imply a particular way of smoothing the probability across the original rules.
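For the binarized (two-event) case, this truncation amounts to a simple clamp; a minimal sketch (names are illustrative, and the exact definition of T(θ, γ) in the article may differ in details):

def truncate_binomial(theta, gamma):
    """Map a two-event multinomial (theta_1, theta_2) into [gamma, 1 - gamma]
    while keeping it normalized: an event that falls outside the margin is
    clamped, and the other event absorbs the difference."""
    assert 0 <= gamma <= 0.5
    t1, _t2 = theta
    t1 = min(max(t1, gamma), 1.0 - gamma)
    return (t1, 1.0 - t1)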

Table 1

Example of a PCFG where there is more than a single way to approximate it by truncation with γ = 0.1, because it has more than two rules. Any value of η ∈ [0, γ] will lead to a different approximation.

Rule         θ      General                η = 0    η = 0.01    η = 0.005
S → NP VP    0.09   0.1                    0.1      0.1         0.1
S → NP       0.11   0.11 − η               0.11     0.1         0.105
S → VP       0.8    0.8 − (γ − 0.09) + η   0.79     0.8         0.795

We now describe how this construction of approximations satisfies the properties mentioned in Section 4, specifically, the boundedness property and the tightness property.

Proposition 2

Let and let be as defined earlier. There exists a constant β = β(L,q,p,N) > 0 such that has the boundedness property with Km = sN log³m and .

See Appendix A for the proof of Proposition 2.

Next, we show that the approximation is tight with respect to the original concept space, with the rate given in the following proposition.

Proposition 3

Let and let as defined earlier. There exists an M such that for any m > M we have
for and Cm(f) = T(f, m^{−s}).

See Appendix A for the proof of Proposition 3.

We now have proper approximations for probabilistic grammars. These approximations are defined as a series of probabilistic grammars, related to the family of probabilistic grammars we are interested in estimating. They consist of three properties: containment (they are a subset of the family of probabilistic grammars we are interested in estimating), boundedness (their log-loss does not diverge to infinity quickly), and they are tight (there is a small probability mass at which they are not tight approximations).

4.2 Coupling Bounded Approximations with Number of Samples

At this point, the number of samples n is decoupled from the bounded approximation that we choose for grammar estimation. To couple the two, we need to define m as a function of the number of samples, m(n). As mentioned earlier, there is a clear trade-off between choosing a fast rate for m(n) (such as m(n) = n^k for some k > 1) and a slower rate (such as m(n) = log n). The faster the rate is, the tighter the family of approximations that we use for n samples. If the rate is too fast, however, then Km grows quickly as well. In that case, because our sample complexity bounds are increasing functions of Km, the bounds will degrade.

To balance the trade-off, we choose m(n) = n. As we see later, this gives sample complexity bounds which are asymptotically interesting for both the supervised and unsupervised case.

4.3 Asymptotic Empirical Risk Minimization

It would be compelling to determine whether the empirical risk minimizer over is an asymptotic empirical risk minimizer. This would mean that the risk of the empirical risk minimizer over converges to the risk of the maximum likelihood estimate. As a conclusion to this section about proper approximations, we motivate the three requirements that we posed on proper approximations by showing that this is indeed true. We now unify n, the number of samples, and m, the index of the approximation of the concept space . Let be the minimizer of the empirical risk over , () and let gn be the minimizer of the empirical risk over ().

Let D = {z1,…,zn} be a sample from p(z). The operator is an asymptotic empirical risk minimizer if as n → ∞ (Shalev-Shwartz et al. 2009). Then, we have the following

Lemma 1

Denote by the set . Denote by Aε,n the event "one of zi ∈ D is in ." If properly approximates , then:
where the expectations are taken with respect to the data set D.

See Appendix A for the proof of Lemma 1.

Proposition 4

Let D = {z1,…,zn} be a sample of derivations from G. Then is an asymptotic empirical risk minimizer.

Proof

Let be the concept that puts uniform weights over θ, namely, for all k. Note that
Let Aj,ε,n for j ∈ {1,…,n} be the event “”. Then Aε,n = ∪ jAj,ε,n. We have that
where Equation (16) comes from zl being independent. Also, B is the constant from Section 3.1. Therefore, we have:
From the construction of our proper approximations (Proposition 3), we know that only derivations of length log²n or greater can be in . Therefore
where κ > 0 is a constant. Similarly, we have . This means that . In addition, it can be shown that using the same proof technique we used here, while relying on the fact that , and therefore .▪

Equipped with the framework of proper approximations as described previously, we now give our main sample complexity results for probabilistic grammars. These results hinge on the convergence of . Indeed, proper approximations replace the use of in these convergence results. The rate of this convergence can be fast, if the covering numbers for do not grow too fast.

5.1 Covering Numbers and Bounds on Covering Numbers

We next give a brief overview of covering numbers. A cover provides a way to reduce a class of functions to a much smaller (finite, in fact) representative class such that each function in the original class is represented using a function in the smaller class. Let be a class of functions. Let d(f,g) be a distance measure between two functions f,g from . An ε-cover is a subset of , denoted by , such that for every there exists an such that d(f,f′) < ε. The covering number is the size of the smallest ε-cover of for the distance measure d.

We are interested in a specific distance measure which is dependent on the empirical distribution that describes the data z1,…,zn. Let . We will use
Instead of using directly, we bound this quantity with , where we consider all possible samples (yielding ). The following is the key result regarding the connection between covering numbers and the double-sided convergence of the empirical process as n→ ∞. This result is a general-purpose result that has been used frequently to prove the convergence of empirical processes of the type we discuss in this article.
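In this style of argument the distance is typically the empirical L1 distance, and we assume that is the intended definition here:

\[
d_{\tilde{p}_n}(f, g) \;=\; \frac{1}{n}\sum_{i=1}^{n} \bigl| f(z_i) - g(z_i) \bigr| ,
\]

with the covering number then taken with respect to this (sample-dependent) metric and subsequently maximized over all samples of size $n$.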

Lemma 2

Let be a permissible class9 of functions such that for every we have . Let , namely, the set of functions from after being truncated by Kn. Then for ε > 0 we have
provided and εbound(n) < ε.

See Pollard (1984; Chapter 2, pages 30–31) for the proof of Lemma 2. See also Appendix A.

Covering numbers are rather complex combinatorial quantities which are hard to compute directly. Fortunately, they can be bounded using the pseudo-dimension (Anthony and Bartlett 1999), a generalization of the Vapnik-Chervonenkis (VC) dimension for real functions. In the case of our “binomialized” probabilistic grammars, the pseudo-dimension of is bounded by N, because we have , and the functions in are linear with N parameters. Hence, also has pseudo-dimension that is at most N. We then have the following.

Lemma 3

(From Pollard [1984] and Haussler [1992].) Let be the proper approximations for probabilistic grammars. For any 0 < ε < Kn we have:

5.2 Supervised Case

We now give an analysis for the supervised case. This analysis serves mostly as preparation for the unsupervised case. In general, the families of probabilistic grammars we treat are parametric families, and the maximum likelihood estimator for these families is a consistent estimator in the supervised case. In the unsupervised case, however, lack of identifiability prevents us from obtaining these traditional consistency results. Also, the traditional results about the consistency of MLE are based on the assumption that the sample is generated from the parametric family we are trying to estimate. This is not the case in our analysis, where the distribution that generates the data does not have to be a probabilistic grammar.

Lemmas 2 and 3 can be combined to get the following sample complexity result.

Theorem 2

Let G be a grammar. Let (Section 3.1). Let be a proper approximation for the corresponding family of probabilistic grammars. Let z1,…,zn be a sample of derivations. Then there exist a constant β(L,q,p,N) and a constant M such that for any 0 < δ < 1, any 0 < ε < Kn, and any n > M, if
then we have
where Kn = sN log³n.

Proof Sketch

β(L,q,p,N) is the constant from Proposition 2. The main idea in the proof is to solve for n in the following two inequalities (based on Equation [17] [see the following]) while relying on Lemma 3:

Theorem 2 gives little intuition about the number of samples required for accurate estimation of a grammar because it considers the “additive” setting: The empirical risk is within ε from the expected risk. More specifically, it is not clear how we should pick ε for the log-loss, because the log-loss can obtain arbitrary values.

We turn now to converting the additive bound in Theorem 2 to a multiplicative bound. Multiplicative bounds can be more informative than additive bounds when the range of the values that the log-loss can obtain is not known a priori. It is important to note that the two views are equivalent (i.e., it is possible to convert a multiplicative bound to an additive bound and vice versa). Let ρ ∈ (0,1) and choose ε = ρKn. Then, substituting this ε in Theorem 2, we get that if
then, with probability 1 − δ,

where H(p) is the Shannon entropy of p. This stems from the fact that for any f. This means that if we are interested in computing a sample complexity bound such that the ratio between the empirical risk and the expected risk (for log-loss) is close to 1 with high probability, we need to pick ρ such that the right-hand side of Equation (17) is smaller than the desired accuracy level (between 0 and 1). Note that Equation (17) is an oracle inequality—it requires knowing the entropy of p or some upper bound on it.

5.3 Unsupervised Case

In the unsupervised setting, we have n yields of derivations from the grammar, x1,…,xn, and our goal again is to identify the grammar parameters θ from these yields. Our concept classes are now the sets of log marginalized distributions from . For each , we define as
We denote the set of by . Analogously, we define . Note that we also need to define the operator as a first step towards defining as proper approximations (for ) in the unsupervised setting. Let . Let f be the concept in such that . Then we define .

It does not immediately follow that is a proper approximation for . It is not hard to show that the boundedness property is satisfied with the same Kn and the same form of εbound(n) as in Proposition 2 (we would have for some β′(L,q,p,N) = β′ > 0). This relies on the property of bounded derivation length of p (see Appendix A, Proposition 7). The following result shows that we have tightness as well.

Utility Lemma 2

For ai, bi ≥ 0, if −log ∑i ai + log ∑i bi ≥ ε, then there exists an i such that −log ai + log bi ≥ ε.

Proposition 5

There exists an M such that for any n > M we have
for and the operator as defined earlier.

Proof Sketch

From Utility Lemma 2 we have
Define to be the set of all x such that there exists a z with yield(z) = x and |z| ≥ log²n. From the proof of Proposition 3 and the requirements on p, we know that there exists an α ≥ 1 such that
where the last inequality happens for some n larger than a fixed M.▪

Computing either the covering number or the pseudo-dimension of is a hard task, because the functions in these classes include a "log-sum-exp." Dasgupta (1997) overcomes this problem for Bayesian networks with fixed structure by giving a bound on the covering number for (his respective) which depends on the covering number of .

Unfortunately, we cannot fully adopt this approach, because the derivations of a probabilistic grammar can be arbitrarily large. Instead, we present the following proposition, which is based on the “Hidden Variable Rule” from Dasgupta (1997). This proposition shows that the covering number of (or more accurately, its bounded approximations) can be bounded in terms of the covering number of the bounded approximations of , and the constants which control the underlying distribution p mentioned in Section 3.

Utility Lemma 3

For any two positive-valued sequences (a1,…,an) and (b1,…,bn) we have that ∑i ai / ∑i bi ≤ maxi (ai / bi).

Proposition 6 (Hidden Variable Rule for Probabilistic Grammars)

Let . Then, .

Proof

Let be the subset of derivations of length shorter than m. Consider . Let f′ and be the corresponding functions in . Then, for any distribution p,
where p′(x,z) is a probability distribution that uniformly divides the probability mass p(x) across all derivations for the specific x, that is:
The inequality in Equation (18) stems from Utility Lemma 3.

Set m to be the quantity that appears in the proposition to get the necessary result (f′ and f are arbitrary functions in and respectively. Then consider and f0 to be functions from the respective covers.).▪

For the unsupervised case, then, we get the following sample complexity result.

Theorem 3

Let G be a grammar. Let be a proper approximation for the corresponding family of probabilistic grammars. Let p(x,z) be a distribution over derivations that satisfies the requirements in Section 3.1. Let x1,…,xn be a sample of strings from p(x). Then there exist a constant β′(L,q,p,N) and a constant M such that for any 0 < δ < 1, any 0 < ε < Kn, and any n > M, if
where , we have that
where Kn = sN log³n.

Theorem 3 states that the number of samples we require in order to accurately estimate a probabilistic grammar from unparsed strings depends on the level of ambiguity in the grammar, represented as Λ(m). We note that this dependence is polynomial, and we consider this a positive result for unsupervised learning of grammars. More specifically, if Λ is an exponential function (as is the case with PCFGs), then, compared to supervised learning, there is an extra multiplicative factor in the sample complexity in the unsupervised setting that behaves like .

We note that the following Equation (20) again converts the bound to a multiplicative form, similarly to the way we described it for the supervised case. Setting ε = ρKn (ρ ∈ (0,1)), we get the following requirement on n:
where .

We turn now to describing algorithms and their properties for minimizing empirical risk using the framework described in Section 4.

6.1 Supervised Case

ERM with proper approximations leads to simple algorithms for estimating the probabilities of a probabilistic grammar in the supervised setting. Given an ε > 0 and a δ > 0, we draw n examples according to Theorem 2. We then set γ = n^{−s}. To minimize the log-loss with respect to these n examples, we use the proper approximation .

Note that the value of the empirical log-loss for a probabilistic grammar parametrized by θ is
Because we make the assumption that deg(G) ≤ 2 (Section 3.2), we have
To minimize the log-loss with respect to , we need to minimize Equation (21) under the constraints that γ ≤ θk,i ≤ 1 − γ and θk,1 + θk,2 = 1. It can be shown that the solution to this optimization problem is
where is the number of times that ψk,i fires in Example j. (We include a full derivation of this result in Appendix B.) The interpretation of Equation (22) is simple: We count the number of times a rule appears in the samples and then normalize this value by the total number of times rules associated with the same multinomial appear in the samples. This frequency count is the maximum likelihood solution with respect to the full hypothesis class (Corazza and Satta 2006; see Appendix B). Because we constrain ourselves to stay away from 0 and 1 by a margin of γ, we need to truncate this solution, as done in Equation (22).
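Putting the pieces together, the supervised estimator is rule counting followed by the clamp to [γ, 1 − γ]; the following sketch assumes each derivation is given as a list of (k, i) event tokens and uses γ = n^{−s} as above (the function and variable names, and the default value of s, are illustrative):

from collections import Counter

def estimate_supervised(derivations, s=2.0):
    """Truncated relative-frequency estimation for a binarized grammar.
    Each derivation is a list of (k, i) event tokens with i in {1, 2}."""
    n = len(derivations)
    gamma = n ** (-s)
    counts = Counter()
    for z in derivations:
        counts.update(z)
    theta = {}
    for k in {k for (k, _i) in counts}:
        c1, c2 = counts[(k, 1)], counts[(k, 2)]
        rel = c1 / (c1 + c2)                               # relative-frequency (MLE) solution
        theta[(k, 1)] = min(max(rel, gamma), 1.0 - gamma)  # clamp to the margin
        theta[(k, 2)] = 1.0 - theta[(k, 1)]
    return theta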

This truncation to a margin γ can be thought of as a smoothing factor that enables us to compute sample complexity bounds. We explore this connection to smoothing with a Dirichlet prior in a Maximum a posteriori (MAP) Bayesian setting in Section 7.2.

6.2 Unsupervised Case

Similarly to the supervised case, minimizing the empirical log-loss in the unsupervised setting requires minimizing (with respect to θ) the following:
with the constraint that γ ≤ θk,i ≤ 1 − γ (i.e., θ ∈ Θ(γ)), where γ = n^{−s}. This is done after drawing n examples according to Theorem 3.
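Exact minimization of this objective turns out to be hard (Section 6.2.1), and the article therefore later adapts expectation-maximization to it; the following sketch only illustrates the shape of such a procedure, with the E-step abstracted as an assumed helper expected_counts(x, theta) that returns expected (k, i) event counts for sentence x (e.g., computed with inside–outside):

def em_with_margin(sentences, init_theta, expected_counts, gamma, iterations=50):
    """EM-style coordinate ascent for the unsupervised log-loss with the
    margin constraint theta in Theta(gamma)."""
    theta = dict(init_theta)
    for _ in range(iterations):
        totals = {}
        for x in sentences:                    # E-step: accumulate expected counts
            for key, val in expected_counts(x, theta).items():
                totals[key] = totals.get(key, 0.0) + val
        for k in {k for (k, _i) in theta}:     # M-step: normalize, then clamp
            c1 = totals.get((k, 1), 0.0)
            c2 = totals.get((k, 2), 0.0)
            rel = c1 / (c1 + c2) if (c1 + c2) > 0 else theta[(k, 1)]
            theta[(k, 1)] = min(max(rel, gamma), 1.0 - gamma)
            theta[(k, 2)] = 1.0 - theta[(k, 1)]
    return theta

For a two-event multinomial, the clamped relative frequency is exactly the constrained M-step, since the expected complete-data log-likelihood is concave in θk,1.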

6.2.1 Hardness of ERM with Proper Approximations

It turns out that minimizing Equation (23) under the specified constraints is actually an NP-hard problem when G is a PCFG. This result follows from a proof similar to the one in Cohen and Smith (2010c) for the hardness of Viterbi training and of maximizing log-likelihood for PCFGs. We now give the full derivation of this hardness result for PCFGs, together with the modification required to adapt the results of Cohen and Smith to the case of an arbitrary margin constraint γ.

In order to show an NP-hardness result, we need to "convert" the problem of minimizing Equation (23) into a decision problem. We do so by stating the following decision problem.

Problem 1 (Unsupervised Minimization of the Log-Loss with Margin)

Input: A binarized context-free grammar G, a set of sentences x1, … , xn, a value γ ∈ , and a value α ∈ [0, 1].

Output: 1 if there exists θ ∈ Θ(γ) (and hence, ) such that
and 0 otherwise.

We will show the hardness result both when γ is not restricted at all and when we require γ > 0. The proof of the hardness result is achieved by reducing 3-SAT (Sipser 2006), known to be NP-complete, to Problem 1. The 3-SAT problem is defined as follows:

Problem 2 (3-SAT)

Input: A formula φ in conjunctive normal form, such that each clause has three literals.

Output: 1 if there is a satisfying assignment for φ, and 0 otherwise.

Given an instance of the 3-SAT problem, the reduction will, in polynomial time, create a grammar and a single string such that solving Problem 1 for this grammar and string will yield a solution for the instance of the 3-SAT problem.

Let φ be an instance of the 3-SAT problem with m clauses, where the aj, bj, and cj are literals over the set of variables {Y1,…,YN} (a literal refers to a variable Yj or its negation). Let Cj be the jth clause in φ, such that Cj = aj ∨ bj ∨ cj. We define the following CFG Gφ and string to parse sφ:

  1. The terminals of Gφ are the binary digits Σ = {0,1}.

  2. We create N nonterminals , r ∈ {1,…,N}, and rules and .

  3. We create N nonterminals , r ∈ {1,…,N}, and rules and .

  4. We create and .

  5. We create the rule S1 → A1. For each j ∈ {2,…,m}, we create a rule Sj → Sj−1 Aj, where Sj is a new nonterminal indexed by j and Aj is also a new nonterminal indexed by j ∈ {1,…,m}.

  6. Let Cj = aj ∨ bj ∨ cj be clause j in φ. Let Y(aj) be the variable that aj mentions. Let (y1, y2, y3) be a satisfying assignment for Cj, where yk ∈ {0,1} is the value of Y(aj), Y(bj), and Y(cj), respectively, for k ∈ {1,2,3}. For each such clause-satisfying assignment, we add the rule
     For each Aj, we would have at most seven rules of this form, because exactly one assignment of (y1, y2, y3) is logically inconsistent with aj ∨ bj ∨ cj.

  7. The grammar's start symbol is Sm.

  8. The string to parse is sφ = (10)^{3m}, that is, 3m consecutive occurrences of the string 10.

A parse of the string sφ using Gφ will be used to get an assignment by setting Yr = 0 if the rule or is used in the derivation of the parse tree, and 1 otherwise. Notice that at this point we do not exclude “contradictions” that come from the parse tree, such as used in the tree together with or . To maintain the restriction on the degree of grammars, we convert Gφ to the binary normal form described in Section 3.2. The following lemma gives a condition under which the assignment is consistent (so that contradictions do not occur in the parse tree).

Lemma 4

Let φ be an instance of the 3-SAT problem, and let Gφ be a probabilistic CFG based on the given grammar with weights θφ. If the (multiplicative) weight of the Viterbi parse (i.e., the highest scoring parse according to the PCFG) of sφ is 1, then the assignment extracted from the parse tree is consistent.

Proof

Because the probability of the Viterbi parse is 1, all rules that appear in the parse tree have probability 1 as well. There are two possible types of inconsistencies. We show that neither occurs in the Viterbi parse:

  1. For any r, an appearance of both rules of the form and cannot occur, because all rules that appear in the Viterbi parse tree have probability 1.

  2. For any r, an appearance of rules of the form and cannot occur, because whenever we have an appearance of the rule , we have an adjacent appearance of the rule (because we parse substrings of the form 10), and then we again use the fact that all rules in the parse tree have probability 1. The case of and is handled analogously.

Thus, both possible inconsistencies are ruled out, resulting in a consistent assignment.▪

Figure 3 gives an example of an application of the reduction.

Figure 3
An example of a Viterbi parse tree which represents a satisfying assignment for . In θφ, all rules appearing in the parse tree have probability 1. The extracted assignment would be Y1 = 0, Y2 = 1, Y3 = 1, Y4 = 0. Note that there is no usage of two different rules for a single nonterminal.

Lemma 5

Define φ and Gφ as before. There exists θφ such that the Viterbi parse of sφ has weight 1 if and only if φ is satisfiable. Moreover, the satisfying assignment is the one extracted from the weight-1 parse tree of sφ under θφ.

Proof

Assume that there is a satisfying assignment. Each clause Cj = aj ∨ bj ∨ cj is satisfied by a tuple (y1, y2, y3), which assigns values to Y(aj), Y(bj), and Y(cj). This assignment corresponds to the following rule:
Set its probability to 1, and set all other rules of Aj to 0. In addition, for each r, if Yr = y, set the probabilities of the rules and to 1 and and to 0. The remaining rule probabilities, for S1 → A1 and Sj → Sj−1 Aj, are set to 1. This assignment of rule probabilities results in a Viterbi parse of weight 1.
For the other direction, assume that the Viterbi parse has probability 1. From Lemma 4, we know that we can extract a consistent assignment from the Viterbi parse. In addition, for each clause Cj we have a rule
that is assigned probability 1, for some (y1, y2, y3). One can verify that (y1, y2, y3) are the values of the assignment for the corresponding variables in clause Cj, and that they satisfy this clause. This means that each clause is satisfied by the assignment we extracted.▪

We are now ready to prove the following result.

Theorem 4

Problem 1 is NP-hard when either requiring γ > 0 or when fixing γ = 0.

Proof

We first describe the reduction for the case of γ = 0. In Problem 1, set γ = 0, α = 1, G = Gφ, and x1 = sφ. If φ is satisfiable, then the left side of Equation (24) can attain the value 0 by setting the rule probabilities according to Lemma 5, and hence we would return 1 as the result of running Problem 1.

If φ is unsatisfiable, then we would still get the value 0 only if L(Gφ) = {sφ}. If Gφ generates (10)^{3m} through a single derivation, then by Lemma 4 we actually do have a satisfying assignment, a contradiction. Otherwise (more than a single derivation), the optimal θ must give positive probability to two different rules headed by the same nonterminal. In that case, it is no longer true that (10)^{3m} is the only generated sentence, which contradicts getting the value 0 for Problem 1.

We next show that Problem 1 is NP-hard even if we require γ > 0. Let γ > 0 be sufficiently small. Set α = γ, and set the rest of the inputs to Problem 1 as before. Assume that φ is satisfiable. Let θ be the rule probabilities from the proof of Lemma 5 after being shifted by a margin of γ. Then, because there is a derivation that uses only rules that have probability 1 − γ, we have h(x1 | θ) ≥ (1 − γ)^{10m} > γ = α, because the size of the parse tree for (10)^{3m} is at most 10m (using the binarized Gφ) and α = γ < (1 − γ)^{10m}; the latter inequality indeed holds whenever γ is sufficiently small. Therefore, we have − log h(x1 | θ) ≤ − log α, and Problem 1 would return 1 in this case.
Now, assume that φ is not satisfiable. That means that any parse tree for the string (10)^{3m} would have to contain two different rules headed by the same nonterminal. This implies that, for every θ ∈ Θ(γ), we have − log h(x1 | T(θ, γ)) > − log α, and Problem 1 would return 0.▪

6.2.2 An Expectation-Maximization Algorithm

Instead of solving the optimization problem implied by Equation (23) directly, we propose a rather simple modification to the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) to approximate the optimal solution; this algorithm finds a local maximum for the maximum likelihood problem using proper approximations. The modified algorithm is given in Algorithm 1.

The modification from the usual expectation-maximization algorithm is done in the M-step: Instead of using the expected value of the sufficient statistics by counting and normalizing, we truncate the values by γ. It can be shown that if θ(0) ∈ Θ(γ), then the likelihood is guaranteed to increase (and hence, the log-loss is guaranteed to decrease) after each iteration of the algorithm.

Algorithm 1: The expectation-maximization algorithm with the truncated M-step (pseudocode figure).

The reason for this likelihood increase is that the M-step solves the optimization problem of minimizing the log-loss (with respect to θ ∈ Θ(γ)) when the posterior calculated at the E-step is used as the base distribution. This means that the M-step minimizes, at iteration t, an expected log-loss in which the expectation is taken with respect to that posterior distribution. With this in mind, the likelihood increase after each iteration follows from principles similar to those described in Bishop (2006) for the EM algorithm.
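To make the modification concrete, the following is a minimal sketch in Python with NumPy (not taken from the article; the function names are illustrative, and e_step stands in for the grammar-specific computation of expected feature counts, for instance by the inside-outside algorithm):

import numpy as np

def truncated_mstep(expected_counts, gamma):
    """M-step with a margin: normalize expected event counts for each
    multinomial and clip the probabilities to [gamma, 1 - gamma].

    expected_counts: array of shape (K, 2); row k holds the expected counts
    of the two events of multinomial k (this sketch assumes deg(G) <= 2,
    so clipping keeps each row summing to 1)."""
    counts = np.asarray(expected_counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Relative frequencies; fall back to a uniform multinomial if no counts.
    theta = np.where(totals > 0, counts / np.maximum(totals, 1e-12), 0.5)
    return np.clip(theta, gamma, 1.0 - gamma)

def em_with_margin(e_step, theta0, gamma, iterations=50):
    """Modified EM loop: e_step is a caller-supplied function returning the
    posterior expected feature counts under the current parameters; it is a
    stand-in here, not something defined in the article."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        expected_counts = e_step(theta)                   # E-step
        theta = truncated_mstep(expected_counts, gamma)   # truncated M-step
    return theta

Because the clipped estimates always lie in Θ(γ), each iteration carries out exactly the truncated M-step described above.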

Our framework can be specialized to trade off two main criteria: the tightness of the proper approximation and the sample complexity. For example, we can improve the tightness of our proper approximations by taking a subsequence of . Doing so degrades the sample complexity bound, however, because Kn grows faster. Table 2 summarizes the trade-offs between the quantities in our model and the effectiveness of learning.

Table 2

Trade-off between quantities in our learning model and effectiveness of different criteria. Kn is the constant that satisfies the boundedness property (Theorems 2 and 3) and s is a fixed constant larger than 1 (Section 4.1).

criterion                             as Kn increases …    as s increases …
tightness of proper approximation     improves             improves
sample complexity bound               degrades             degrades

We note that the sample complexity bounds that we give in this article give insight about the asymptotic behavior of grammar estimation, but are not necessarily sufficiently tight to be used in practice. It still remains an open problem to obtain sample complexity bounds which are sufficiently tight in this respect. For a discussion about the connection of grammar learning in theory and practice, we refer the reader to Clark and Lappin (2010).

It is also important to note that MLE is not the only option for estimating finite-state probabilistic grammars. There have been recent advances in learning finite-state models (HMMs and finite-state transducers) using spectral analysis of matrices whose entries are quantities estimated from observations only (Hsu, Kakade, and Zhang 2009; Balle, Quattoni, and Carreras 2011), based on the observable operator models of Jaeger (1999). These algorithms are not prone to local minima and converge to the correct model as the number of samples increases, but they require some assumptions about the underlying model that generates the data.

7.1 Tsybakov Noise

In this article, we chose to introduce assumptions about distributions that generate natural language data. The choice of these assumptions was motivated by observations about properties shared among treebanks. The main consequence of making these assumptions is bounding the amount of noise in the distribution (i.e., the amount of variation in probabilities across labels given a fixed input).

There are other ways to restrict the noise in a distribution. One condition for such noise restriction, which has received considerable recent attention in the statistical literature, is the Tsybakov noise condition (Tsybakov 2004; Koltchinskii 2006). Showing that a distribution satisfies the Tsybakov noise condition enables the use of techniques (e.g., from Koltchinskii 2006) for deriving distribution-dependent sample complexity bounds that depend on the parameters of the noise. It is therefore of interest to see whether Tsybakov noise holds under the assumptions presented in Section 3.1. We show that this is not the case, and that Tsybakov noise is too permissive. In fact, we show that p can be a probabilistic grammar itself (and hence, satisfy the assumptions in Section 3.1), and still not satisfy the Tsybakov noise conditions.

Tsybakov noise was originally introduced for classification problems (Tsybakov 2004), and was later extended to more general settings, such as the one we are facing in this article (Koltchinskii 2006). We now explain the definition of Tsybakov noise in our context.

Let C > 0 and κ ≥ 1. We say that a distribution p(x,z) satisfies the (C,κ) Tsybakov noise condition if for any ε > 0 and such that , we have
This interpretation of Tsybakov noise implies that the diameter of the set of functions from the concept class that has small excess risk should shrink to 0 at the rate in Equation (25). Distribution-dependent bounds from Koltchinskii (2006) are monotone with respect to the diameter of this set of functions, and therefore demonstrating that it goes to 0 enables sharper derivations of sample complexity bounds.
We turn now to illustrating that the Tsybakov condition does not hold for probabilistic grammars in most cases. Let G be a probabilistic grammar. Define A = AG(θ) as a matrix such that

Theorem 5

Let G be a grammar with K ≥ 2 and degree 2. Assume that p is 〈G, θ*〉 for some θ*, such that and that c1c2. If AG(θ*) is positive definite, then p does not satisfy the Tsybakov noise condition for any (C,κ), where C > 0 and κ ≥ 1.

See Appendix C for the proof of Theorem 5.

In Appendix C we show that AG(θ) is positive semi-definite for any choice of θ. The main intuition behind the proof is that given a probabilistic grammar p, we can construct an hypothesis h such that the KL divergence between p and h is small, but dist(p,h) is lower-bounded and is not close to 0.

We conclude that probabilistic grammars, as generative distributions of data, do not generally satisfy the Tsybakov noise condition. This motivates an alternative choice of assumptions that could lead to better understanding of rates of convergences and bounds on the excess risk. Section 3.1 states such assumptions which were also justified empirically.

7.2 Comparison to Dirichlet Maximum A Posteriori Solutions

The transformation T(θ,γ) from Section 4.1 can be thought of as a smoother for the probabilities θ: It ensures that the probability of each rule is at least γ (and, as a result, the probabilities of all rules cannot exceed 1 − γ). Adding pseudo-counts to frequency counts is also a common way to smooth probabilities in models based on multinomial distributions, including probabilistic grammars (Manning and Schütze 1999). These pseudo-counts can be framed as a maximum a posteriori (MAP) alternative to the maximum likelihood problem, with the choice of Bayesian prior over the parameters taking the form of a Dirichlet distribution. In comparison to our framework, with (symmetric) Dirichlet smoothing, instead of truncating the probabilities with a margin γ we would set the probability of each rule (in the supervised setting) to
for i = 1,2, where ψk,i(xj, zj) are the counts in the data of event i in multinomial k for Example j. Dirichlet smoothing can be formulated as the result of adding a symmetric Dirichlet prior over the parameters θk,i with hyperparameter α. Equation (26) is then the mode of the posterior after observing the appearances of event i in multinomial k in the data.
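For instance, under one common parametrization, a symmetric Dirichlet prior with hyperparameter α > 1 gives the posterior mode
\[
\hat\theta_{k,i} \;=\; \frac{(\alpha - 1) + \sum_{j=1}^{n}\psi_{k,i}(x_j,z_j)}{2(\alpha-1) + \sum_{j=1}^{n}\bigl(\psi_{k,1}(x_j,z_j)+\psi_{k,2}(x_j,z_j)\bigr)}, \qquad i = 1, 2,
\]
so the pseudo-count α − 1 plays a role analogous to the margin γ in the truncated estimator.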

The effect of Dirichlet smoothing becomes weaker as we have more samples, because the frequency counts dominate both the numerator and the denominator when there are more data. In this sense, the prior's effect on learning diminishes as we use more data. A similar effect occurs in our framework: γ = n^{−s}, where n is the number of samples, so the more samples we have, the more we trust the counts in the data to be reliable. There is a subtle difference, however. With the Dirichlet MAP solution, the smoothing is less dominant only if the counts of the features are large, regardless of the number of samples we have. With our framework, smoothing depends only on the number of samples we have. These two scenarios are related, of course: The more samples we have, the more likely it is that the counts of the events will grow large.

7.3 Other Derivations of Sample Complexity Bounds

In this section, we discuss other possible solutions to the problem of deriving sample complexity bounds for probabilistic grammars.

7.3.1 Using Talagrand's Inequality

Our bounds are based on VC theory together with classical results for empirical processes (Pollard 1984). There have been some recent developments in the derivation of rates of convergence in statistical learning theory (Massart 2000; Bartlett, Bousquet, and Mendelson 2005; Koltchinskii 2006), most prominently through the use of Talagrand's inequality (Talagrand 1994), a concentration-of-measure inequality in the spirit of Lemma 2.

The bounds achieved with Talagrand's inequality are also distribution-dependent, and are based on the diameter of the ε-minimal set—the set of hypotheses which have an excess risk smaller than ε. We saw in Section 7.1 that the diameter of the ε-minimal set does not follow the Tsybakov noise condition, but it is perhaps possible to find meaningful bounds for it, in which case we may be able to get tighter bounds using Talagrand's inequality. We note that it may be possible to obtain data-dependent bounds for the diameter of the ε-minimal set, following Koltchinskii (2006), by calculating the diameter of the ε-minimal set using .

7.3.2 Simpler Bounds for the Supervised Case

As noted in Section 6.1, minimizing the empirical risk with the log-loss leads to a simple frequency count for calculating the estimated parameters of the grammar. Corazza and Satta (2006) also noted that minimizing the non-empirical risk requires setting the parameters of the grammar to the normalized expected counts of the features.

This means that we can obtain bounds on the deviation of a given parameter from the optimal parameter by applying modifications to rather simple inequalities such as Hoeffding's inequality, which bounds the probability that the average of a set of i.i.d. random variables deviates from its mean. The modification would require us to split the event space into two cases: one in which the count of some feature is larger than some fixed value (which happens with small probability because of the bounded expectation of the features), and one in which all counts are smaller than that fixed value. Handling these two cases separately is necessary because Hoeffding's inequality requires that the counts of the rules be bounded.
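For reference, Hoeffding's inequality in the bounded case states that if X1, …, Xn are independent random variables taking values in [0, B], then for any t > 0,
\[
\Pr\!\left(\left|\frac{1}{n}\sum_{i=1}^{n}X_i \;-\; \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}X_i\right]\right| \ge t\right) \;\le\; 2\exp\!\left(-\frac{2nt^2}{B^2}\right),
\]
which is why the case analysis above conditions on the feature counts being bounded by a fixed value.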

The bound on the deviation from the mean of the parameters (the true probability) can potentially lead to a bound on the excess risk in the supervised case. This formulation of the problem would not generalize to the unsupervised case, however, where the empirical risk minimization does not amount to simple frequency count.

7.4 Open Problems

We conclude the discussion with some directions for further exploration and future work.

7.4.1 Sample Complexity Bounds with Semi-Supervised Learning

Our bounds focus on the supervised case and the unsupervised case. There is a trivial extension to the semi-supervised case. Consider the objective function to be the sum of the likelihood for the labeled data together with the marginalized likelihood of the unlabeled data (this sum could be a weighted sum). Then, use the sample complexity bounds for each summand to derive a sample complexity bound on this sum.
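Concretely (with l, u, and λ introduced here only for illustration), such a combined objective would take the form
\[
-\frac{1}{l}\sum_{j=1}^{l}\log h(x_j, z_j \mid \theta)\;-\;\frac{\lambda}{u}\sum_{j=1}^{u}\log \sum_{z} h(x'_j, z \mid \theta),
\]
where (x1, z1), …, (xl, zl) are the annotated examples, x′1, …, x′u are the unannotated sentences, and λ > 0 weights the two parts; a sample complexity bound for each summand then yields a bound for the weighted sum.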

It would be more interesting to extend our results to frameworks such as the one described by Balcan and Blum (2010). In that case, our discussion of sample complexity would attempt to identify how unannotated data can reduce the space of candidate probabilistic grammars to a smaller set, after which we can use the annotated data to estimate the final grammar. This reduction of the space is accomplished through a notion of compatibility, a type of fitness that the learner believes the estimated grammar should have given the distribution that generates the data. The key challenge in the case of probabilistic grammars would be to define this compatibility notion so that it fits the log-loss. If this is achieved, then machinery similar to that described in this article (with proper approximations) can be used to derive semi-supervised sample complexity bounds for probabilistic grammars.

7.4.2 Sharper Bounds for the Pseudo-Dimension of Probabilistic Grammars

The pseudo-dimension of a probabilistic grammar with the log-loss is bounded by the number of parameters in the grammar, because the logarithm of a distribution generated by a probabilistic grammar is a linear function. Typically, however, the set of feature-count vectors of a probabilistic grammar resides in a subspace of dimension smaller than the full dimension specified by the number of parameters. The reason is that there are usually relationships (often linear) between the elements of the feature counts. For example, with HMMs, the total feature count for emissions should equal the total feature count for transitions. With PCFGs, the total number of times that nonterminal rules fire equals the total number of times that features with that nonterminal in the right-hand side fire, again reducing the pseudo-dimension. An open problem that remains is the characterization of the exact pseudo-dimension of a given grammar, determined by consideration of various properties of that grammar. We conjecture, however, that a lower bound on the pseudo-dimension would be rather close to the full dimension of the grammar (the number of parameters).

It is interesting to note that there has been some work on identifying the VC dimension and pseudo-dimension of certain types of grammars. Bane, Riggle, and Sonderegger (2010), for example, calculated the VC dimension of constraint-based grammars. Ishigami and Tani (1993, 1997) computed the VC dimension of finite-state automata with various properties.

7.5 Conclusion

We presented a framework for performing empirical risk minimization for probabilistic grammars, in which sample complexity bounds, for the supervised case and the unsupervised case, can be derived. Our framework is based on the idea of bounded approximations used in the past to derive sample complexity bounds for graphical models.

Our framework required assumptions about the probability distribution that generates sentences or derivations in the language of the given grammar. These assumptions were tested using corpora, and found to fit the data well.

We also discussed algorithms that can be used for minimizing empirical risk in our framework, given enough samples. We showed that directly trying to minimize empirical risk in the unsupervised case is NP-hard, and suggested an approximation based on an expectation-maximization algorithm.

Appendix A

We include in this appendix proofs for several results in the article.

Utility Lemma 1

Let such that ∑iai = 1. Define b1 = a1, c1 = 1 − a1, bi = , and ci = 1 − bi for i ≥ 2. Then

Proof

Proof by induction on i ∈ {1, …, N}. Clearly, the statement holds for i = 1. Assume it holds for arbitrary i < N. Then:
and this completes the proof.▪

Lemma 1

Denote by the set . Denote by A∈,n the event “one of ziD is in .” If properly approximates , then:
where the expectations are taken with respect to the data set D.

Proof

Consider the following:
Note first that , by the definition of gn as the minimizer of the empirical risk. We next bound . We know from the requirement of proper approximation that we have
and that equals the right side of Equation (Appendix A.1).▪

Proposition 2

Let and let be as defined earlier. There exists a constant β = β(L, q, p, N) > 0 such that has the boundedness property with Km = sN log3m and .

Proof

Let . Let . Then, for all we have , where the first inequality follows from () and the second from |z| ≤ log2m. In addition, from the requirements on p we have
for . Finally, for and if m > 1 then .▪

Utility Lemma 4

(From Dasgupta [1997].) Let a ∈ [0,1] and let b = a if a ∈ [γ, 1 − γ], b = γ if a ≤ γ, and b = 1 − γ if a ≥ 1 − γ. Then for any ε ≤ 1/2 such that γ ≤ ε / (1 + ε), we have log a/b ≤ ε.

Proposition 3

Let and let as defined earlier. There exists an M such that for any m > M we have
for and .

Proof

Let be the set of derivations of size bigger than log2m. Let . Define f′ = T(f, ms). For any we have that
Without loss of generality, assume . Let . From Utility Lemma 4 we have that . Plug this into Equation A.2 (N = 2K) to get that for all we have . It remains to show that the measure . Note that for m > M where M is fixed.▪

Proposition 7

There exists a β′(L,p,q,N) > 0 such that has the boundedness property with Km = sN log3m and .

Proof

From the requirement of p, we know that for any x we have a z such that yield(z) = x and |z| ≤ α|x|. Therefore, if we let , then we have for any and that (similarly to the proof of Proposition 2). Denote by f1(x,z) the function in such that .

In addition, from the requirements on p and the definition of Km we have
where z(x) is some derivation for x. We have
for some constant κ > 0. Finally, for some β′(L,p,q,N) = β′ > 0 and some constant M, if m > M then .

Utility Lemma 2

For ai, bi ≥ 0, if − log ∑iai + log ∑ibi ≥ ε then there exists an i such that − log ai + log bi ≥ ε.

Proof

Assume − log ai + log bi < ε for all i. Then bi < e^ε ai for every i, therefore ∑i bi < e^ε ∑i ai, and therefore − log ∑i ai + log ∑i bi < ε, which is a contradiction to − log ∑i ai + log ∑i bi ≥ ε.▪

The next lemma is the main concentration-of-measure result that we use. Its proof requires a simple modification of the proof given for Theorem 24 in Pollard (1984, pages 30–31).

Lemma 2

Let be a permissible class of functions such that for every we have . Let , that is, the set of functions from after being truncated by Kn. Then for ε > 0 we have
provided and εbound(n) < ε.

Proof

First note that
We have , and also, from Markov's inequality, we have

At this point, we can follow the proof of Theorem 24 in Pollard (1984), and its extension on pages 30–31 to get Lemma 2, using the shifted set of functions .▪

Appendix B

Central to our algorithms for minimizing the log-loss (both in the supervised case and the unsupervised case) is a convex optimization problem of the form
for constants ck,i that depend on the empirical distribution (or, in the case of the expectation-maximization algorithm, on some other intermediate distribution) and a margin γ determined by the number of samples. This minimization problem can be decomposed into several optimization problems, one for each k, each having the following form:
where ci ≥ 0 and 1/2 > γ ≥ 0. Ignore for a moment the constraints γ ≤ βi ≤ 1 − γ. In that case, this can be thought of as a regular maximum likelihood estimation problem, so βi = ci / (c1 + c2). We give a derivation of this result in this simple case for completeness. We use Lagrange multipliers to solve this problem. Let F(β1, β2) = c1 log β1 + c2 log β2. Define the Lagrangian:
Taking the derivative of the term we minimize in the Lagrangian, we have
Setting the derivatives to 0 for minimization, we have
g(λ) is the objective function of the dual problem of Equations (B.1)–(B.2). We would like to minimize Equation (B.5) with respect to λ. The derivative of g(λ) is
hence, equating the derivative of g(λ) to 0, we get λ = −(c1 + c2), and therefore the solution is β*i = ci / (c1 + c2). We need to verify that the solution to the dual problem indeed attains the optimal value for the primal. Because the primal problem is convex, it is sufficient to verify that the Karush-Kuhn-Tucker (KKT) conditions hold (Boyd and Vandenberghe 2004). Indeed, we have
where λ stands for the multiplier of the equality constraint. The rest of the KKT conditions hold trivially, therefore β* is the optimal solution for Equations (B.1)–(B.2).
Note that if γ ≤ ci / (c1 + c2) ≤ 1 − γ, then this is the solution even when we add back the constraints in Equations (B.3) and (B.4). When c1 / (c1 + c2) < γ, the solution is β1 = γ and β2 = 1 − γ. Similarly, when c2 / (c1 + c2) < γ, the solution is β2 = γ and β1 = 1 − γ. We describe why this is true for the first case; the second case follows very similarly. Assume c1 / (c1 + c2) < γ. We want to show that for any choice of β ∈ [0,1] such that β > γ we have
Divide both sides of the inequality by c1 + c2 and we get that we need to show that
Because we have β > γ, and we also have c1 / (c1 + c2) < γ, it is sufficient to show that

Equation (B.6) is precisely the definition of the KL divergence between the distribution of a coin with probability γ of heads and the distribution of a coin with probability β of heads, and therefore the right side in Equation (B.6) is positive, and we get what we need.
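As a small numerical check, take c1 = 1, c2 = 9, and γ = 0.2: the unconstrained solution gives c1 / (c1 + c2) = 0.1 < γ, so the constrained optimum is β1 = γ = 0.2 and β2 = 1 − γ = 0.8, in line with the case analysis above.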

Appendix C

Lemma 6

A = AG(θ) is positive semi-definite for any probabilistic grammar 〈G, θ〉.

Proof

Let dk,i be a collection of constants. Define the random variable:
We have that
which is always greater than or equal to 0. Therefore, A is positive semi-definite.▪

Lemma 7

Let 0 < μ < 1/2, c1,c2 ≥ 0. Let κ,C > 0. Also, assume that c1c2. For any ε > 0, define:
Then, for small enough ε, we have t(ε) ≤ 0.

Proof

We have that t(ε) ≤ 0 if
First, show that
which happens if (after substituting a = α1μ, b = α2μ)
Note we have α1α2 > 1 because c1c2. In addition, we have α1 + α2 − 2 ≥ 0 for small enough ε (can be shown by taking the derivative, with respect to ε of α1 + α2 − 2, which is always positive for small enough ε, and in addition, noticing that the value of α1 + α2 − 2 is 0 when ε = 0.) Therefore, Equation (C.2) is true.
Substituting Equation (C.2) in Equation (C.1), we have that t(ε) ≤ 0 if
which is equivalent to
Taking again the derivative of the left side of Equation (C.3), we have that it is an increasing function of ε (if c1c2), and in addition at ε = 0 it obtains the value c1 + c2. Therefore, Equation (C.3) holds, and therefore t(ε) ≤ 0 for small enough ε.▪

Theorem 5

Let G be a grammar with K ≥ 2 and degree 2. Assume that p is 〈G, θ*〉 for some θ*, such that and that c1c2. If AG(θ*) is positive definite, then p does not satisfy the Tsybakov noise condition for any (C, κ), where C > 0 and κ ≥ 1.

Proof

Define λ to be the eigenvalue of AG(θ) with the smallest value (λ is positive). Also, define v(θ) to be a vector indexed by k,i such that
Simple algebra shows that for any (and the fact that ), we have
For a C > 0 and κ ≥ 1, define α = 1/κ. Let ε < α. First, we construct an h such that DKL(p || h) < ε + ε/2 but dist(p, h) > 1/κ as ε→0. The construction follows. Parametrize h by θ such that θ is identical to θ* except for k = 1,2, in which case we have
Note that μθ1,1 ≤ 1/2 and θ2,1 < μ. Then, we have that
We also have
if
(This can be shown by dividing Equation [C.6] by c1 + c2 and then using the concavity of the logarithm function.) From Lemma 7, we have that Equation (C.7) holds. Therefore,
Now, consider the following, which can be shown through algebraic manipulation:
Then, additional algebraic simplification shows that
A fact from linear algebra states that
where λ is the smallest eigenvalue in A. From the construction of θ and Equation (C.4)(C.5), we have that . Therefore,
which means . Therefore, p does not satisfy the Tsybakov noise condition with parameters (D, κ) for any D > 0.

Appendix D

Table D.1 gives a table of notation for symbols used throughout this article.

Table D.1: Table of notation symbols used in this article.

Acknowledgments

The authors thank the anonymous reviewers for their comments and Avrim Blum, Steve Hanneke, Mark Johnson, John Lafferty, Dan Roth, and Eric Xing for useful conversations. This research was supported by National Science Foundation grant IIS-0915187.

1 

It is important to remember that minimizing the log-loss does not equate to minimizing the error of a linguistic analyzer or natural language processing application. In this article we focus on the log-loss case because we believe that probabilistic models of language phenomena have inherent usefulness as explanatory tools in computational linguistics, aside from their use in systems.

2 

We note that itself is a random variable, because it depends on the sample drawn from p.

3 

We note that being able to attain the minimum through an hypothesis q* is not necessarily possible in the general case. In our instantiations of ERM for probabilistic grammars, however, the minimum can be attained. In fact, in the unsupervised case the minimum can be attained by more than a single hypothesis. In these cases, q* is arbitrarily chosen to be one of these minimizers.

4 

Treebanks offer samples of cleanly segmented sentences. It is important to note that the distributions estimated may not generalize well to samples from other domains in these languages. Our argument is that the family of the estimated curve is reasonable, not that we can correctly estimate the curve's parameters.

5 

For simplicity and consistency with the log-loss, we measure entropy in nats, which means we use the natural logarithm when computing entropy.

6 

We note that this notion of binarization differs from previous types of binarization used for grammars in computational linguistics. Typically, in previous work on binarized grammars such as CFGs, the grammars are constrained to have at most two nonterminals on the right-hand side, as in Chomsky normal form. Another form of binarization, for linear context-free rewriting systems, restricts the fan-out of the rules to two (Gómez-Rodríguez and Satta 2009; Gildea 2010). We, however, limit the number of rules for each nonterminal (or, more generally, the number of elements in each multinomial).

7 

There are other ways to manage the unboundedness of KL divergence in the language learning literature. Clark and Thollard (2004), for example, decompose the KL divergence between probabilistic finite-state automata into several terms according to a decomposition of Carrasco (1997) and then bound each term separately.

8 

By varying s we get a family of approximations. The larger s is, the tighter the approximation is. Also, the larger s is, as we see later, the looser our sample complexity bound will be.

9 

The “permissible class” requirement is a mild regularity condition regarding measurability that holds for proper approximations. We refer the reader to Pollard (1984) for more details.

References

Abe, N., J. Takeuchi, and M. Warmuth. 1991. Polynomial learnability of probabilistic concepts with respect to the Kullback-Leibler divergence. In Proceedings of the Conference on Learning Theory, pages 277–289.
Abe, N. and M. Warmuth. 1992. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 2:205–260.
Angluin, D. 1987. Learning regular sets from queries and counterexamples. Information and Computation, 75:87–106.
Anthony, M. and P. L. Bartlett. 1999. Neural Network Learning: Theoretical Foundations. Cambridge University Press.
Balcan, M. and A. Blum. 2010. A discriminative model for semisupervised learning. Journal of the Association for Computing Machinery, 57(3):1–46.
Balle, B., A. Quattoni, and X. Carreras. 2011. A spectral learning algorithm for finite state transducers. In Proceedings of the European Conference on Machine Learning/the Principles and Practice of Knowledge Discovery in Databases, pages 156–171.
Bane, M., J. Riggle, and M. Sonderegger. 2010. The VC dimension of constraint-based grammars. Lingua, 120(5):1194–1208.
Bartlett, P., O. Bousquet, and S. Mendelson. 2005. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537.
Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer, Berlin.
Boyd, S. and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
Carrasco, R. 1997. Accurate computation of the relative entropy between stochastic regular grammars. Theoretical Informatics and Applications, 31(5):437–444.
Carroll, G. and E. Charniak. 1992. Two experiments on learning probabilistic dependency grammars from corpora. Technical report, Brown University, Providence, RI.
Charniak, E. 1993. Statistical Language Learning. MIT Press, Cambridge, MA.
Charniak, E. and M. Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of the Association for Computational Linguistics, pages 173–180.
Chi, Z. 1999. Statistical properties of probabilistic context-free grammars. Computational Linguistics, 25(1):131–160.
Clark, A., R. Eyraud, and A. Habrard. 2008. A polynomial algorithm for the inference of context free languages. In Proceedings of the International Colloquium on Grammatical Inference, pages 29–42.
Clark, A. and S. Lappin. 2010. Unsupervised learning and grammar induction. In Alexander Clark, Chris Fox, and Shalom Lappin, editors, The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, London, pages 197–220.
Clark, A. and F. Thollard. 2004. PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research, 5:473–497.
Cohen, S. B. and N. A. Smith. 2010a. Covariance in unsupervised learning of probabilistic grammars. Journal of Machine Learning Research, 11:3017–3051.
Cohen, S. B. and N. A. Smith. 2010b. Empirical risk minimization with approximations of probabilistic grammars. In Proceedings of the Advances in Neural Information Processing Systems, pages 424–432.
Cohen, S. B. and N. A. Smith. 2010c. Viterbi training for PCFGs: Hardness results and competitiveness of uniform initialization. In Proceedings of the Association for Computational Linguistics, pages 1502–1511.
Collins, M. 2003. Head-driven statistical models for natural language processing. Computational Linguistics, 29:589–637.
Collins, M. 2004. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In H. Bunt, J. Carroll, and G. Satta, editors, Text, Speech and Language Technology (New Developments in Parsing Technology). Kluwer, Dordrecht, pages 19–55.
Corazza, A. and G. Satta. 2006. Cross-entropy and estimation of probabilistic context-free grammars. In Proceedings of the North American Chapter of the Association for Computational Linguistics, pages 335–342.
Cover, T. M. and J. A. Thomas. 1991. Elements of Information Theory. Wiley, London.
Dasgupta, S. 1997. The sample complexity of learning fixed-structure Bayesian networks. Machine Learning, 29(2–3):165–180.
de la Higuera, C. 2005. A bibliographical study of grammatical inference. Pattern Recognition, 38:1332–1348.
Dempster, A., N. Laird, and D. Rubin. 1977. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38.
Gildea, D. 2010. Optimal parsing strategies for linear context-free rewriting systems. In Proceedings of the North American Chapter of the Association for Computational Linguistics, pages 769–776.
Gómez-Rodríguez, C. and G. Satta. 2009. An optimal-time binarization algorithm for linear context-free rewriting systems with fan-out two. In Proceedings of the Association for Computational Linguistics-International Joint Conference on Natural Language Processing, pages 985–993.
Grenander, U. 1981. Abstract Inference. Wiley, New York.
Haussler, D. 1992. Decision-theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150.
Hsu, D., S. M. Kakade, and T. Zhang. 2009. A spectral algorithm for learning hidden Markov models. In Proceedings of the Conference on Learning Theory.
Ishigami, Y. and S. Tani. 1993. The VC-dimensions of finite automata with n states. In Proceedings of Algorithmic Learning Theory, pages 328–341.
Ishigami, Y. and S. Tani. 1997. VC-dimensions of finite automata and commutative finite automata with k letters and n states. Applied Mathematics, 74(3):229–240.
Jaeger, H. 1999. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398.
Kearns, M. and L. Valiant. 1989. Cryptographic limitations on learning Boolean formulae and finite automata. In Proceedings of the 21st Association for Computing Machinery Symposium on the Theory of Computing, pages 433–444.
Kearns, M. J. and U. V. Vazirani. 1994. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA.
Klein, D. and C. D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the Association for Computational Linguistics, pages 478–487.
Koltchinskii, V. 2006. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656.
Leermakers, R. 1989. How to cover a grammar. In Proceedings of the Association for Computational Linguistics, pages 135–142.
Manning, C. D. and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Massart, P. 2000. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX(2):245–303.
Nijholt, A. 1980. Context-Free Grammars: Covers, Normal Forms, and Parsing (volume 93 of Lecture Notes in Computer Science). Springer-Verlag, Berlin.
Palmer, N. and P. W. Goldberg. 2007. PAC-learnability of probabilistic deterministic finite state automata in terms of variation distance. In Proceedings of Algorithmic Learning Theory, pages 157–170.
Pereira, F. C. N. and Y. Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In Proceedings of the Association for Computational Linguistics, pages 128–135.
Pitt, L. 1989. Inductive inference, DFAs, and computational complexity. Analogical and Inductive Inference, 397:18–44.
Pollard, D. 1984. Convergence of Stochastic Processes. Springer-Verlag, New York.
Ron, D. 1995. Automata Learning and Its Applications. Ph.D. thesis, Hebrew University of Jerusalem.
Ron, D., Y. Singer, and N. Tishby. 1998. On the learnability and usage of acyclic probabilistic finite automata. Journal of Computer and System Sciences, 56(2):133–152.
Shalev-Shwartz, S., O. Shamir, K. Sridharan, and N. Srebro. 2009. Learnability and stability in the general learning setting. In Proceedings of the Conference on Learning Theory.
Sipser, M. 2006. Introduction to the Theory of Computation, Second Edition. Thomson Course Technology, Boston, MA.
Talagrand, M. 1994. Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22:28–76.
Terwijn, S. A. 2002. On the learnability of hidden Markov models. In P. Adriaans, H. Fernau, and M. van Zaanen, editors, Grammatical Inference: Algorithms and Applications (Lecture Notes in Computer Science). Springer, Berlin, pages 344–348.
Tsybakov, A. 2004. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166.
Vapnik, V. N. 1998. Statistical Learning Theory. Wiley-Interscience, New York.

Author notes

*

Department of Computer Science, Columbia University, New York, NY 10027, United States. E-mail: [email protected]. This research was completed while the first author was at Carnegie Mellon University.

**

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States. E-mail: [email protected].