Abstract
This paper presents an attempt at building a large scale distributed composite language model that is formed by seamlessly integrating an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm to simultaneously account for local word lexical information, mid-range sentence syntactic structure, and long-span document semantic content. The composite language model has been trained by performing a convergent N-best list approximate EM algorithm and a follow-up EM algorithm to improve word prediction power on corpora with up to a billion tokens and stored on a supercomputer. The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the Bleu score and “readability” of translations when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
1. Introduction
The Markov chain (n-gram) source models, which predict each word on the basis of the previous n − 1 words, have been the workhorses of state-of-the-art speech recognizers and machine translators that help to resolve acoustic or foreign language ambiguities by placing higher probability on more likely original underlying word strings. Although the Markov chains are efficient at encoding local word interactions, the n-gram model clearly ignores the rich syntactic and semantic structures that constrain natural languages. Attempting to increase the order of an n-gram to capture longer range dependencies in natural language immediately runs into the curse of dimensionality (Bengio et al. 2003). The performance of conventional n-gram technology has essentially reached a plateau (Rosenfeld 2000b; Zhang 2008), and it has proven remarkably difficult to improve on n-grams (Jelinek 1991; Jelinek and Chelba 1999). Research groups (Och 2005; Zhang, Hildebrand, and Vogel 2006; Brants et al. 2007; Emami, Papineni, and Sorensen 2007) have shown that using an immense distributed computing paradigm, up to 6-grams can be trained on up to billions and trillions of tokens, yielding consistent system improvements because of excellent n-gram hit ratios on unseen test data, but Zhang (2008) did not observe much improvement beyond 6-grams. As the machine translation (MT) working groups stated in their final report (Lavie et al. 2006, page 3), “These approaches have resulted in small improvements in MT quality, but have not fundamentally solved the problem. There is a dire need for developing novel approaches to language modeling.”
Over the past two decades, more sophisticated models have been developed that outperform n-grams; these are mainly the syntactic language models (Della Pietra et al. 1994; Chelba 2000; Chelba and Jelinek 2000; Charniak 2001; Roark 2001; Wang and Harper 2002; Jelinek 2004; Benedí and Sánchez 2005; Van Uytsel and Compernolle 2005) that effectively exploit sentence-level syntactic structure of natural language, and the topic language models (Saul and Pereira 1997; Gildea and Hofmann 1999; Bellegarda 2000; Wallach 2006) that exploit document-level semantic content. Unfortunately, each of these language models only targets some specific, distinct linguistic phenomena (Pereira 2000; Rosenfeld 2000a, 2000b); thus, each captures and exploits different aspects of natural language regularity. A natural question we should ask is whether/how we can construct more complex and powerful but computationally tractable language models by integrating many existing/emerging language model components, with each component focusing on specific linguistic phenomena like syntactic structure, semantic topic, morphology, and pragmatics in complementary, supplementary, and coherent ways (Bellegarda 2001, 2003).
Several techniques for combining language models have been investigated. The most commonly used method is linear interpolation (Chen and Goodman 1999; Jelinek and Mercer 1980; Goodman 2001), where each individual model is trained separately and then combined by a weighted linear combination. All of the syntactic structure-based models have used linear interpolation to combine trigrams to achieve further improvement over using their own models alone (Charniak 2001; Chelba and Jelinek 2000; Chelba 2000; Roark 2001). The weights in this case are trained using held-out data. Even though this technique is simple and easy to implement, it does not generally yield very effective combinations (Rosenfeld 1996) because the linear additive form is a strong assumption in capturing subtleties in each of the component models (see more explanation and analysis in Section 6.2 and Appendix A). The second method is based on the maximum entropy philosophy, which became very popular in the machine learning and natural language processing communities due to the work by Berger, Della Pietra, and Della Pietra (1996), Della Pietra, Della Pietra, and Lafferty (1997), Lau et al. (1993), and Rosenfeld (1996). In fact, for the complete data case, maximum entropy is nothing but maximum likelihood estimation for undirected Markov random fields (MRFs) (Berger, Della Pietra, and Della Pietra 1996; Della Pietra, Della Pietra, and Lafferty 1997). As stated in Wang et al. (2005b), however, there are two weaknesses with the maximum entropy approach. The first weakness is that this approach can only model distributions over explicitly observed features, but we know there is hidden information in natural language, such as syntactic structure and semantic topic. The second weakness is that if the statistical model is too complex, it becomes intractable to estimate model parameters; computationally very expensive Markov chain Monte Carlo sampling methods (Mark, Miller, and Grenander 1996; Rosenfeld 2000b; Rosenfeld, Chen, and Zhu 2001) would have to be used. One way to overcome the first hurdle is to use a preprocessing tool to extract hidden features: Rosenfeld (1996) used a mutual information clustering method to find word pair triggers and then combined these triggers with trigrams through a maximum conditional entropy approach to allow the discourse topic to influence word prediction; Khudanpur and Wu (2000) used Chelba and Jelinek's structured language model and a word clustering model to extract relevant grammatical and semantic features, then again combined these features with trigrams through a maximum conditional entropy approach to form a syntactic, semantic, and lexical language model. Wang and colleagues (Wang et al. 2005a; Wang, Schuurmans, and Zhao 2012) have proposed the latent maximum entropy (LME) principle, which extends standard maximum entropy estimation by incorporating hidden dependency structure, but the LME still does not overcome the second hurdle. The third method is the directed Markov random field (Wang et al. 2005b), which overcomes both weaknesses of the maximum entropy approach. Wang et al. used this approach to combine trigram, probabilistic context-free grammar (PCFG), and probabilistic latent semantic analysis (PLSA) models; a generalized inside–outside algorithm was derived that alters the well-known inside–outside algorithm for PCFGs (Baker 1979; Lari and Young 1990) with modular modifications to take into account the effect of the n-gram and PLSA components while remaining at the same cubic time complexity.
When applied to the Wall Street Journal corpus with 40 million tokens, this approach achieved moderate perplexity reductions. Because the probabilistic dependency structure in a structured language model (SLM) (Chelba 2000; Chelba and Jelinek 2000) is more complex and powerful than that in a PCFG, Wang et al. (2006) studied the stochastic properties of the composite language model that integrates n-gram, SLM, and PLSA under the directed MRF framework (Wang et al. 2005b) and derived another generalized inside–outside algorithm to train a composite n-gram/SLM/PLSA language model from a general expectation maximization (EM) (Dempster, Laird, and Rubin 1977) algorithm by following Jelinek's ingenious definition of the inside and outside probabilities for the SLM (Jelinek 2004). Again, the generalized inside–outside algorithm alters Jelinek's inside–outside algorithm with modular modifications and has the same sixth-order sentence-length time complexity. Unfortunately, no experimental results were reported.
In this article, we study the same composite n-gram, SLM, and PLSA model under the directed MRF framework as in Wang et al. (2006). The composite n-gram/SLM/PLSA language model under the directed MRF paradigm is first introduced in Section 2. In Section 3, instead of using the sixth-order generalized inside–outside algorithm proposed in Wang et al. (2006), we show how to train this composite model via an N-best list approximate EM algorithm that has linear time complexity, together with a follow-up EM algorithm to improve word prediction power. We prove the convergence of the N-best list approximate EM algorithm. To resolve the data sparseness problem, we generalize Jelinek and Mercer's recursive mixing scheme for Markov sources (Jelinek and Mercer 1980) to a mixture of Markov chains. To handle large-scale corpora of up to a billion tokens, we demonstrate how to implement these algorithms under a distributed computing environment and how to store this language model on a supercomputer. In Section 4, we describe how to use the model for testing. Related work is then summarized and compared in Section 5. Because language modeling is a data-rich and feature-rich density estimation problem, there is always a trade-off between approximation error and estimation error; thus, in Section 6 we conduct comprehensive experiments on corpora with 44 million tokens, 230 million tokens, and 1.3 billion tokens, and compare perplexity results with n-grams (n = 3, 4, 5, respectively) on these three corpora under various situations; drastic perplexity reductions are obtained. We explain why the composite language models lead to better predictive capacity than linear interpolation. The proposed composite language models are applied to the task of re-ranking the N-best list from Hiero (Chiang 2005; Chiang 2007), a state-of-the-art parsing-based machine translation system; we achieve significantly better translation quality measured by the Bleu score and the “readability” of translations. Finally, we draw our conclusions and propose future work in Section 7.
The main theme of our approach is “to exploit information, be it syntactic structure or semantic fabric, which involves a fairly high degree of cognition. This is precisely the kind of knowledge that humans naturally and inherently use to process natural language, so it can be reasonably conjectured to represent a key ingredient for success” (Bellegarda 2003, p. 105). In that light, the directed MRF framework, “whose ultimate goal is to integrate all available knowledge sources, appears most likely to harbor a potential breakthrough. It is hoped that the on-going effort conducted in this work to leverage such latent synergies will lead, in the not-too-distant future, to more polyvalent, multi-faceted, effective and tractable solutions for language modeling – this is only beginning to scratch the surface in developing systems capable of deep understanding of natural language” (Bellegarda 2003, p. 105).
2. The Composite n-gram/SLM/PLSA Language Model
The n-gram (Jelinek 1998; Jurafsky and Martin 2008) language model is essentially a WORD-PREDICTOR, that is, given its entire document history, it predicts the next word wk+1 ∈ V based on the last n − 1 words with probability p(wk+1 | wk−n+2, ⋯, wk), where V denotes the vocabulary.
The SLM proposed in Chelba and Jelinek (1998, 2000) and Chelba (2000) uses syntactic information beyond the regular n-gram models to capture sentence-level long-range dependencies. The SLM is based on statistical parsing techniques that allow syntactic analysis of sentences; it assigns a probability p(W, T) to every sentence W and every possible binary parse T. The terminals of T are the words of W with part-of-speech (POS) tags, and the nodes of T are annotated with phrase headwords and non-terminal labels. Let W be a sentence of length n words to which we have prepended the sentence beginning marker 〈s〉 and appended the sentence end marker 〈/s〉 so that w0 = 〈s〉 and wn+1 = 〈/s〉. Let Wk = w0, ⋯, wk be the word k-prefix of the sentence (the words from the beginning of the sentence up to the current position k) and WkTk be the word-parse k-prefix. A word-parse k-prefix has a set of exposed heads h−1, h−2, ⋯ (from most recent to least recent), with each head being a pair (headword, non-terminal label), or, in the case of a root-only tree, a pair (word, POS tag). The exposed heads at a given position k in the input sentence are a function of the word-parse k-prefix.
The WORD-PREDICTOR predicts the next word wk+1 ∈ V based on the m most recently exposed headwords h−m, ⋯, h−1 in the word-parse k-prefix with probability p(wk+1 | h−m, ⋯, h−1), and then passes control to the TAGGER.
The TAGGER predicts the POS tag tk+1 of the next word wk+1 based on the next word wk+1 and the POS tags of the m most recently exposed headwords in the word-parse k-prefix with probability p(tk+1 | wk+1, h−m.tag, ⋯, h−1.tag).
The CONSTRUCTOR builds the partial parse Tk+1 from Tk, wk+1, and tk+1 in a series of moves ending with NULL, where each parse move a is chosen from {(unary, NTlabel), (adjoin-left, NTlabel), (adjoin-right, NTlabel), NULL} with probability p(a | h−m, ⋯, h−1). Depending on the action a = adjoin-right or adjoin-left, the headword h−1 or h−2 is percolated up by one tree level, the indices of the current exposed headwords h−3, h−4, ⋯ are increased by 1, and these headwords together with h−1 or h−2 become the new exposed headwords. Once the CONSTRUCTOR hits NULL, the headword indexing and current parse structure remain as they are, and the CONSTRUCTOR passes control to the WORD-PREDICTOR.
A PLSA model (Hofmann 2001) is a generative probabilistic model of word-document co-occurrences using the bag-of-words assumption described as follows:
Choose a document d with probability p(d).
The SEMANTIZER selects a semantic class (topic) g ∈ G with probability p(g|d), where G denotes the set of topics.
The WORD-PREDICTOR picks a word w ∈ V with probability p(w|g).
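To make the bag-of-words generative story above concrete, here is a minimal sketch in Python; the document prior, topic mixtures, and topic-conditional word distributions are hypothetical toy values chosen for illustration, not parameters estimated in this work.

```python
import random

# Hypothetical toy PLSA parameters: p(d), p(g|d), and p(w|g).
p_d = {"doc1": 0.5, "doc2": 0.5}
p_g_given_d = {"doc1": {"finance": 0.8, "sports": 0.2},
               "doc2": {"finance": 0.1, "sports": 0.9}}
p_w_given_g = {"finance": {"bank": 0.5, "stock": 0.4, "game": 0.1},
               "sports":  {"bank": 0.05, "stock": 0.05, "game": 0.9}}

def draw(dist):
    """Sample a key from a {item: probability} dictionary."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r <= acc:
            return item
    return item  # guard against floating-point round-off

def generate_word():
    d = draw(p_d)                 # choose a document d with probability p(d)
    g = draw(p_g_given_d[d])      # SEMANTIZER: choose a topic g with probability p(g|d)
    w = draw(p_w_given_g[g])      # WORD-PREDICTOR: pick a word w with probability p(w|g)
    return d, g, w

if __name__ == "__main__":
    print([generate_word() for _ in range(5)])
```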
Looking at the example in Figure 1, for the composite n-gram/m-SLM/PLSA language model there is a SEMANTIZER action that chooses a topic g before each WORD-PREDICTOR action. Moreover, for the m-SLM, its WORD-PREDICTOR predicts the next word, such as a, based on the m most recently exposed headwords “〈s〉-SB, show-np, has-vp,” whereas for the composite model, the WORD-PREDICTOR predicts the next word a based on the m most recently exposed headwords “〈s〉-SB, show-np, has-vp,” the preceding n-gram words “as its host,” and a topic g. These are the only differences between the SLM and our proposed composite language model.
3. Training Algorithm
3.1 N-best List Approximate EM
The N-best list approximate EM involves two steps:
- N-best list search: For each sentence in each document, find the N-best hidden events (parse trees and topics) under the current model parameters (see Section 3.1.1).
- EM update: Perform one iteration (or several iterations) of the EM algorithm to estimate model parameters that maximize the N-best list likelihood of the training corpus (see Section 3.1.2). That is,
E-step: Compute the expected complete-data log-likelihood (the auxiliary function) over the N-best hidden events under the current parameters p.
M-step: Maximize the auxiliary function with respect to p′ to get the new update for p.
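As a concrete (toy) illustration of the N-best restriction described above, the following sketch runs EM for a simple two-component unigram mixture while keeping only the N most likely hidden components per document in the E-step; the corpus, the mixture model, and all names are hypothetical stand-ins for the composite model's parse-tree and topic hidden events.

```python
from collections import defaultdict

# Toy corpus: each "document" is a bag of words; the hidden event is which of
# two unigram components generated it.
docs = [["bank", "stock", "bank"], ["game", "score", "game"],
        ["stock", "bank", "stock"], ["score", "game", "score"]]
vocab = sorted({w for d in docs for w in d})

def nbest_approx_em(docs, n_components=2, n_best=1, iters=20):
    # Near-uniform initialization with a small perturbation to break symmetry.
    prior = [1.0 / n_components] * n_components
    word_prob = [{w: 1.0 + 0.1 * ((i + j) % 2) for j, w in enumerate(vocab)}
                 for i in range(n_components)]
    for probs in word_prob:
        z = sum(probs.values())
        for w in probs:
            probs[w] /= z
    for _ in range(iters):
        counts = [defaultdict(float) for _ in range(n_components)]
        comp_counts = [0.0] * n_components
        for d in docs:
            # Joint score of each hidden component for this document.
            scores = []
            for i in range(n_components):
                s = prior[i]
                for w in d:
                    s *= word_prob[i][w]
                scores.append(s)
            # N-best restriction: keep only the n_best most likely hidden
            # events and renormalize the posterior over them (E-step).
            top = sorted(range(n_components), key=lambda i: -scores[i])[:n_best]
            z = sum(scores[i] for i in top)
            for i in top:
                post = scores[i] / z
                comp_counts[i] += post
                for w in d:
                    counts[i][w] += post
        # M-step: relative-frequency re-estimation from the expected counts.
        prior = [c / sum(comp_counts) for c in comp_counts]
        for i in range(n_components):
            z = sum(counts[i].values()) or 1.0
            word_prob[i] = {w: counts[i][w] / z for w in vocab}
    return prior, word_prob

if __name__ == "__main__":
    prior, word_prob = nbest_approx_em(docs)
    print(prior)
```

With n_best equal to the total number of hidden events this reduces to standard EM; with n_best = 1 it becomes a hard (Viterbi-style) EM, the extreme case of the approximation.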
We use Zangwill's global convergence theorem (Zangwill 1969) to analyze the behavior of convergence of the N-best list approximate EM.
First, we define two concepts needed for Zangwill's global convergence theorem. A map M from points of Θ to subsets of Θ is called a point-to-set map on Θ. It is said to be closed at θ if θi → θ with θi ∈ Θ, and λi → λ with λi ∈ M(θi), together imply λ ∈ M(θ). For a point-to-point map, continuity implies closedness. Then the global convergence theorem (Zangwill 1969) states the following.
Theorem
Let M be a point-to-set map (an algorithm) on Θ that, given a point θ0 ∈ Θ, generates a sequence {θi} through the iteration θi+1 ∈ M(θi). Let Ω ⊆ Θ be the set of fixed points of M. Suppose (i) M is closed over the complement of Ω; (ii) there is a continuous function φ on Θ such that (a) if θ ∉ Ω, φ(λ) > φ(θ) for all λ ∈ M(θ), and (b) if θ ∈ Ω, φ(λ) ≥ φ(θ) for all λ ∈ M(θ).
Then all the limit points of {θi} are in Ω and φ(θi) converges monotonically to φ(θ) for some θ ∈ Ω.
This theorem has been used by Wu (1983) to prove the convergence of the standard EM algorithm (Dempster, Laird, and Rubin 1977). We now use this theorem to show that the N-best list approximate EM algorithm globally converges to the stationary points of the N-best list likelihood. We encounter one difficulty at this point, however, due to the maximization operator in Equation (11): after each iteration the N-best list may have changed, and therefore the set of data presented for the estimation of model parameters may be different from the previous one. Nevertheless, we prove the convergence of the N-best list approximate EM algorithm by checking whether it satisfies the two conditions in Zangwill's global convergence theorem. Because the composite model is essentially a mixture model of a curved exponential family through a complex hierarchy, there is a closed-form solution to the M-step irrespective of the N-best list parse trees, so the N-best list approximate EM algorithm is a point-to-point map. Because the auxiliary function is continuous in both p′ and p, the map is continuous and hence closed; thus condition (i) is satisfied.
This completes the proof that the N-best list approximate EM algorithm monotonically increases the N-best list likelihood and converges in the sense of Zangwill's global convergence.
In the following, we formally derive the N-best list approximate EM algorithm with linear sentence length time complexity.
3.1.1 N-best List Search Strategy
For each sentence W in document d, instead of scanning all the hidden events (both allowed parse trees and semantic annotation strings), we restrict the algorithm to operate with the N-best hidden events. We find that, for each document, a large number of topics should be pruned and only a small set of allowed topics should be kept, due to considerations of both computational time and resource demand; otherwise, we would have to use many more machines to store the WORD-PREDICTOR's parameters.
We can either find the N-best parses for each sentence and the N-best topics for each document simultaneously, or find them separately. The latter is much preferred, because the former is much more computationally expensive.
To extract the N-best topics, we run an EM algorithm for a PLSA model on the training corpus, then keep the N most likely topics for each document according to the values of p(g|d); the rest of the topics are purged.
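A minimal sketch of this pruning step, assuming p(g|d) has already been estimated by PLSA training; whether the kept topics are renormalized is an implementation choice made here for illustration, not a detail taken from the paper.

```python
import random

def prune_topics(p_g_given_d, n_best=5, renormalize=True):
    """Keep the N most likely topics for a document and drop the rest.

    `p_g_given_d` maps topic id -> p(g|d) for one document, e.g. the output
    of PLSA training on the corpus.
    """
    kept = dict(sorted(p_g_given_d.items(), key=lambda kv: -kv[1])[:n_best])
    if renormalize:
        z = sum(kept.values())
        kept = {g: p / z for g, p in kept.items()}
    return kept

# Example: 200 hypothetical topics, keep the top 5.
random.seed(0)
raw = [random.random() for _ in range(200)]
p_g_given_d = {g: x / sum(raw) for g, x in enumerate(raw)}
print(prune_topics(p_g_given_d, n_best=5))
```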
3.1.2 EM Update
Once we have both the N-best parse trees for each sentence in document d and the N-best topics for document d, we derive the EM algorithm to estimate model parameters.
For the TAGGER and the CONSTRUCTOR, we use Equations (20) and (21): the expected count of each TAGGER event and each CONSTRUCTOR event over parse Tl of sentence Wl in document d is the real count of that event appearing in parse tree Tl of sentence Wl in document d, multiplied by the conditional probability of parse Tl given sentence Wl and document d.
When only the SLM is considered, the expected count for each model component (WORD-PREDICTOR, TAGGER, and CONSTRUCTOR) over parse Tl of sentence Wl in document d is the real count that appeared in parse Tl of sentence Wl in document d times the posterior probability of that parse, as is done in Chelba and Jelinek (1998, 2000) and Chelba (2000).
In the M-step, the recursive linear interpolation scheme (Jelinek and Mercer 1980) is used to obtain a smooth probability estimate for each model component (WORD-PREDICTOR, TAGGER, and CONSTRUCTOR). The TAGGER and CONSTRUCTOR are conditional probabilistic models of the type p(u|z1, ⋯, zn), where u, z1, ⋯, zn belong to a mixed set of words, POS tags, NTtags, and CONSTRUCTOR actions (u only), and z1, ⋯, zn form a linear Markov chain. The recursive mixing scheme is the standard one among relative frequency estimates of different orders k = 0, ⋯, n and has been explained in Chelba and Jelinek (1998, 2000) and Chelba (2000). The WORD-PREDICTOR is, however, a conditional probabilistic model with three kinds of context (the preceding n − 1 words, the m most recently exposed headwords, and the topic g), where the word context and the headword context each form a linear Markov chain. The model therefore has a combinatorial number of relative frequency estimates of different orders among these Markov chains. We generalize Jelinek and Mercer's (1980) original recursive mixing scheme to handle the situation where the context is a mixture of Markov chains. The factored language (FL) model (Bilmes and Kirchhoff 2003) is close to the smoothing technique we propose here; the major difference is that FL considers all possible combinations of the context of the conditional probability, which can be concisely represented by a factor graph, whereas our approach strictly respects the order of the Markov chains for the word sequence and the headword sequence because we believe natural language tightly follows these orders; moreover, where FL uses a backoff technique, we use linear interpolation.
In the M-step, assuming that the count ranges and the corresponding interpolation values for each order are kept fixed to their initial values, the only parameters to be re-estimated using the EM algorithm are the maximal order counts for each model component. The interpolation scheme outlined here is then used to obtain a smooth probability estimate for each model component.
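The following sketch shows the recursive linear interpolation idea for a single Markov chain context (the WORD-PREDICTOR additionally mixes the word chain, the headword chain, and the topic, which is not shown); using one fixed weight per order instead of count-range-dependent weights is a simplification made here to keep the sketch short.

```python
def interpolated_prob(u, context, counts, lambdas, vocab_size):
    """Recursive Jelinek-Mercer mixing of relative-frequency estimates.

    p_k(u | z_1..z_k) = lam_k * f(u | z_1..z_k) + (1 - lam_k) * p_{k-1}(u | shorter context)

    `counts[ctx]` maps a context tuple to a {word: count} dict; `lambdas[k]`
    is the single mixing weight used here for order k. The lower-order
    context is obtained by dropping the most distant element (context[0]).
    """
    if not context:
        return 1.0 / vocab_size                         # order-0 fallback: uniform
    ctx = tuple(context)
    c = counts.get(ctx, {})
    total = sum(c.values())
    rel_freq = c.get(u, 0) / total if total > 0 else 0.0
    lam = lambdas[len(ctx)]
    lower = interpolated_prob(u, context[1:], counts, lambdas, vocab_size)
    return lam * rel_freq + (1.0 - lam) * lower

# Toy usage: a trigram-style context of two preceding words.
counts = {("the", "cat"): {"sat": 3, "ran": 1},
          ("cat",): {"sat": 4, "ran": 2, "is": 1}}
lambdas = {1: 0.6, 2: 0.7}
print(interpolated_prob("sat", ["the", "cat"], counts, lambdas, vocab_size=10))
```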
3.2 Follow-up EM
As explained in Chelba and Jelinek (2000) and Chelba (2000), for the SLM component a large fraction of the partial parse trees that can be used for assigning probability to the next word do not survive in the synchronous, multi-stack search strategy, thus they are not used in the N-best approximate EM algorithm for the estimation of WORD-PREDICTOR to improve its predictive power. To remedy this weakness, we estimate a separate WORD-PREDICTOR (and SEMANTIZER) model using the partial parse trees exploited by the synchronous, multi-stack search strategy.
3.3 Distributed Architecture
Similarly, we use a distributed architecture as in Figure 4 to perform the follow-up EM algorithm to re-estimate WORD-PREDICTOR.
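The sketch below simulates, on a single machine, the kind of client/server split implied by the distributed architecture: clients hold partitions of the corpus and compute local expected counts, while servers hold hash-partitioned slices of the parameters and aggregate the counts. The partitioning scheme and all function names are hypothetical, not the paper's implementation.

```python
from collections import defaultdict

def client_expected_counts(doc_partition):
    """Each client computes expected counts for its own slice of the corpus.
    Here the 'expected count' is simply a word count, standing in for the
    E-step statistics of the composite model."""
    counts = defaultdict(float)
    for doc in doc_partition:
        for w in doc:
            counts[w] += 1.0
    return counts

def server_id(event, n_servers):
    """Hash-partition events (parameters) across servers."""
    return hash(event) % n_servers

def distributed_estep(corpus, n_clients=4, n_servers=2):
    # Split the corpus across clients, run the local E-steps, and route each
    # event's count to the server that owns that parameter.
    partitions = [corpus[i::n_clients] for i in range(n_clients)]
    servers = [defaultdict(float) for _ in range(n_servers)]
    for part in partitions:                      # would run in parallel
        for event, c in client_expected_counts(part).items():
            servers[server_id(event, n_servers)][event] += c
    return servers

corpus = [["the", "stock", "rose"], ["the", "game", "ended"], ["stock", "fell"]]
print([dict(s) for s in distributed_estep(corpus)])
```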
4. Using the Model for Testing
When we use Equation (30) to compute perplexity, the system only uses information coming from previous words to generate a topic distribution, which then is used to predict the next word, so the sum over all next words is 1.
We find that the perplexity results are sensitive to these three methods (one-step on-line EM, on-line EM with a fixed learning rate, and batch EM; see Section 6.2) and to the initial values. For example, for batch EM, if we set the initial values to be those obtained by using the pseudo-document up to the previous word and trained by batch EM, we obtain worse perplexity results. Table 8 in Section 6.2 gives perplexity results that use these three methods to re-estimate the parameters of the SEMANTIZER, where the on-line EM with a fixed learning rate not only has the cheapest computational cost but also leads to the highest perplexity reductions.
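A minimal sketch of on-line EM with a fixed learning rate for adapting the test document's topic mixture p(g|d) word by word, with p(w|g) held fixed; the specific update rule and the toy parameters are assumptions made for illustration rather than the paper's exact procedure.

```python
def online_em_topic_update(words, p_w_given_g, p_g_given_d, lr=0.2):
    """Interpolate the current topic mixture toward the per-word topic
    posterior with a fixed learning rate `lr` (one plausible reading of
    'on-line EM with fixed learning rate')."""
    for w in words:
        # E-step for the current word: posterior over topics.
        scores = {g: p_g_given_d[g] * p_w_given_g[g].get(w, 1e-9)
                  for g in p_g_given_d}
        z = sum(scores.values())
        posterior = {g: s / z for g, s in scores.items()}
        # Fixed-learning-rate update: move p(g|d) toward the posterior.
        p_g_given_d = {g: (1 - lr) * p_g_given_d[g] + lr * posterior[g]
                       for g in p_g_given_d}
    return p_g_given_d

# Hypothetical toy parameters.
p_w_given_g = {"finance": {"stock": 0.6, "bank": 0.3, "game": 0.1},
               "sports":  {"stock": 0.1, "bank": 0.1, "game": 0.8}}
p_g_given_d = {"finance": 0.5, "sports": 0.5}
print(online_em_topic_update(["stock", "bank", "stock"], p_w_given_g, p_g_given_d))
```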
5. Related Work
Besides the work by Wang et al. (2005b, 2006) that was discussed in the Introduction, the closest work to ours is that by Khudanpur and Wu (2000), where the authors used the SLM and a word clustering model to extract relevant grammatical and semantic features, then integrated these features with n-grams by a maximum conditional entropy approach. Our composite language model is a generative model in which all features play important roles during the EM iterations, allowing maximal order events for the WORD-PREDICTOR to appear; in Khudanpur and Wu (2000), however, the counts for all events are fixed after feature extraction from the SLM and word clustering, and no new maximal order events for the WORD-PREDICTOR can be extracted; this potentially hinders the predictive power of the WORD-PREDICTOR. Moreover, the training algorithm in Khudanpur and Wu is computationally expensive. Both methods use the first-stage N-best list approximate EM to extract headwords, thus the complexity is of the same order at this stage; at the second stage, however, where we use the follow-up EM, they use the maximum entropy approach. The maximum entropy approach is more expensive, mainly in computing feature expectations and normalization as well as in optimization (such as iterative scaling or the quasi-Newton method); ours is quite simple, consisting of expected relative frequency estimates with proper smoothing.
The highest reported perplexity reductions are those by Goodman (2001), where the author examines the techniques of caching, clustering, higher-order n-grams, skipping models, and sentence-mixture models in various combinations (mainly linear interpolation). The author compares to the baseline of a Katz smoothed trigram with no count cutoffs. On a small training corpus with 100k tokens, a 50% perplexity reduction (1 bit improvement) is obtained. On a larger corpus with 284 million tokens without punctuation, the improvement declines to 38%; we assume that this improvement shrinks to 30% when compared with 4-gram as the baseline.
6. Experimental Results
In this section, we first explain the experimental set-up for our experiments, we then show comprehensive perplexity results in various situations, and we end by reporting the results when we apply the composite language model to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
6.1 Experimental Set-up
In previous work (Gildea and Hofmann 1999; Bellegarda 2000; Chelba 2000; Chelba and Jelinek 2000; Charniak 2001; Roark 2001), all complex language models have been trained on relatively small data sets. There is the impression that complex language models only lead to better results than n-grams on small training corpora. For example, Jurafsky and Martin (2008, page 482) state, “We said earlier that statistical parsers can take advantage of longer-distance information than n-grams, which suggests that they might do a better job at language modeling/word prediction. It turns out that if we have a very large amount of training data, a 4-gram or 5-gram is nonetheless still the best way to do language modeling.” To verify whether this is true, we have trained our language models using three different training sets: one has 44 million tokens, another has 230 million tokens, and the third has 1.3 billion tokens. An independent test set with 354k tokens is chosen. The independent check data set used to determine the linear interpolation coefficients has 1.7 million tokens for the 44 million token training corpus, and 13.7 million tokens for both the 230 million and 1.3 billion token training corpora. All these data sets are taken from the LDC English Gigaword corpus with non-verbalized punctuation, and we remove all punctuation. Table 1 provides the detailed information on how these data sets were chosen from the LDC English Gigaword corpus.
1.3 billion token training corpus | |
afp | 19940512.0003 ∼ 19961015.0568 |
afw | 19941111.0001 ∼ 19960414.0652 |
nyt | 19940701.0001 ∼ 19950131.0483 |
nyt | 19950401.0001 ∼ 20040909.0063 |
xin | 19970901.0001 ∼ 20041125.0119 |
230 million token training corpus | |
afp | 19940622.0336 ∼ 19961031.0797 |
apw | 19941111.0001 ∼ 19960419.0765 |
nyt | 19940701.0001 ∼ 19941130.0405 |
44 million token training corpus | |
afp | 19940601.0001 ∼ 19950721.0137 |
13.7 million token check corpus | |
nyt | 19950201.0001 ∼ 19950331.0494 |
1.7 million token check corpus | |
afp | 19940512.0003 ∼ 19940531.0197 |
354k token test corpus | |
cna | 20041101.0006 ∼ 20041217.0009 |
These are selected from the LDC English Gigaword corpus. AFP = Agence France-Presse; AFW = Associated Press Worldstream; NYT = New York Times; XIN = Xinhua News Agency; and CNA = Central News Agency of Taiwan denote the sections of the LDC English Gigaword corpus.
The vocabulary sizes in all three cases are:
word (also WORD-PREDICTOR operation) vocabulary: 60k, open; all words outside the vocabulary are mapped to the 〈unk〉 token; these 60k words are chosen from the most frequently occurring words in the 44 million token corpus;
POS tag (also TAGGER operation) vocabulary: 69, closed;
non-terminal tag vocabulary: 54, closed;
CONSTRUCTOR operation vocabulary: 157, closed.
The out-of-vocabulary (OOV) rate on the 44 million, 230 million, 1.3 billion token training corpora is 0.6%, 0.9%, and 1.2%, respectively. The OOV rate on the 1.7 million and 13.7 million token check corpora is 0.6% and 1.3%, respectively. The OOV rate on the 354k token test corpus is 2.0%. Table 2 lists the statistics about the number of types of n-grams on these three corpora.
corpus | n = 3 | n = 4 | n = 5 |
---|---|---|---|
44 M | 14,302,355 | 23,833,023 | 29,068,173 |
230 M | 51,115,539 | 94,617,433 | 120,978,281 |
1.3 B | 224,767,319 | 481,645,099 | 660,599,586 |
Similar to the SLM (Chelba 2000; Chelba and Jelinek 2000), after the parses undergo headword percolation and binarization, each model component (WORD-PREDICTOR, TAGGER, and CONSTRUCTOR) is initialized from a set of parsed sentences. We use the openNLP software2 to parse a large number of sentences in the LDC English Gigaword corpus to generate an automatic treebank, which has a slightly different word tokenization than that of a manual treebank such as the Penn Treebank used in Chelba and Jelinek (2000) and Chelba (2000). For the 44 and 230 million token corpora, all sentences are automatically parsed and used to initialize model parameters, whereas for the 1.3 billion token corpus, we parse the sentences from a portion of the corpus that contains 230 million tokens, and then use them to initialize model parameters. The parser in openNLP is trained on the Penn Treebank, which has only one million tokens, and there is a mismatch between the Penn Treebank and the LDC English Gigaword corpus. Nevertheless, experimental results show that this approach is effective for providing initial values of model parameters.
6.2 Perplexity Results
Table 3 gives the perplexity results (Bahl et al. 1977) of n-grams (n = 3, 4, and 5) using linear interpolation and Kneser-Ney (1995) smoothing when the training corpus has 44 million, 230 million, and 1.3 billion tokens, respectively. We have implemented a distributed n-gram with linear interpolation smoothing, but we have not implemented distributed n-grams with Kneser-Ney smoothing. Instead, we use the SRI Language Modeling Toolkit to obtain perplexity results of n-grams with Kneser-Ney smoothing for the 44 million and 230 million token corpora using a single machine with 20 GB of memory at the Ohio Supercomputer Center. We are not able to compute perplexity results of n-grams with Kneser-Ney smoothing on the 1.3 billion token corpus, thus we leave these results blank in Table 3. From the results in Table 3, we decided to use a linearly smoothed trigram as the baseline model for the 44 million token corpus, a linearly smoothed 4-gram as the baseline model for the 230 million token corpus, and a linearly smoothed 5-gram as the baseline model for the 1.3 billion token corpus.
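For reference, the perplexity figures reported in Table 3 and in the following tables are understood in the standard way: for a test corpus of T tokens,

\[
\mathrm{PPL} \;=\; \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\ln p\!\left(w_t \mid \text{context}_t\right)\right),
\]

where the context is whatever conditioning information the particular model uses.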
44 M | linear | Kneser-Ney |
---|---|---|
n = 3 | 262 | 244 |
n = 4 | 258 | 235 |
n = 5 | 260 | 235 |
230 M | linear | Kneser-Ney |
n = 3 | 217 | 195 |
n = 4 | 200 | 183 |
n = 5 | 201 | 183 |
1.3 B | linear | Kneser-Ney |
n = 3 | 161 | — |
n = 4 | 141 | — |
n = 5 | 138 | — |
As we mentioned in Section 3.1.1, we can keep only a small set of topics due to considerations of computational time and resource demand. Table 4 shows the perplexity results and computation time of composite n-gram/PLSA language models that are trained on the three corpora when the pre-defined number of total topics is 200, but different numbers of most-likely topics are kept for each document in PLSA; the rest are pruned. For the composite 5-gram/PLSA model trained on the 1.3 billion token corpus, 400 cores have to be used to keep the top five most likely topics. For the composite trigram/PLSA model trained on the 44 million token corpus, the computation time increases drastically with less than 5% perplexity improvement. In the following experiments, therefore, we keep the top five topics for each document from a total of 200 topics; all other 195 topics are pruned.
corpus | n | # of topics | ppl | time (hours) | # of servers | # of clients | # of WORD-PREDICTOR types |
---|---|---|---|---|---|---|---|
44M | 3 | 5 | 196 | 0.5 | 40 | 100 | 120.1M |
44M | 3 | 10 | 194 | 1.0 | 40 | 100 | 218.6M |
44M | 3 | 20 | 190 | 2.7 | 80 | 100 | 537.8M |
44M | 3 | 50 | 189 | 6.3 | 80 | 100 | 1.123B |
44M | 3 | 100 | 189 | 11.2 | 80 | 100 | 1.616B |
44M | 3 | 200 | 188 | 19.3 | 80 | 100 | 2.280B |
230M | 4 | 5 | 146 | 25.6 | 280 | 100 | 0.681B |
1.3B | 5 | 2 | 111 | 26.5 | 400 | 100 | 1.790B |
1.3B | 5 | 5 | 102 | 75.0 | 400 | 100 | 4.391B |
All composite language models are first trained by performing the N-best list approximate EM algorithm until convergence, and then by performing the follow-up EM algorithm for a second stage of parameter re-estimation for the WORD-PREDICTOR and SEMANTIZER, again until convergence. We fix the number of topics in the PLSA to 200 and then prune to 5 in the experiments, where the unpruned 5 topics in general account for 70% of the probability mass in p(g|d). Table 5 shows comprehensive perplexity results for a variety of different models such as composite n-gram/m-SLM, n-gram/PLSA, m-SLM/PLSA, their linear combinations, and so on, where we use on-line EM with a fixed learning rate to re-estimate the parameters of the SEMANTIZER of the test document. The m-SLM performs competitively with its counterpart n-gram (n = m + 1) on the large-scale corpora. Table 6 lists the statistics about the number of types in the predictor of the m-SLMs on these three corpora, where for the 230 million token and 1.3 billion token corpora we cut off the fractional expected counts that are less than a predefined threshold of 0.005, which significantly reduces the number of the predictor's types (by 70%).
language model | 44M n = 3, m = 2 | reduction | 230M n = 4, m = 3 | reduction | 1.3B n = 5, m = 4 | reduction |
---|---|---|---|---|---|---|
baseline n-gram (linear) | 262 | | 200 | | 138 | |
n-gram (Kneser-Ney) | 244 | 6.9% | 183 | 8.5% | — | — |
m-SLM | 279 | −6.5% | 190 | 5.0% | 137 | 0.0% |
PLSA | 825 | −214.9% | 812 | −306.0% | 773 | −460.0% |
n-gram + m-SLM | 247 | 5.7% | 184 | 8.0% | 129 | 6.5% |
n-gram + PLSA | 235 | 10.3% | 179 | 10.5% | 128 | 7.2% |
n-gram + m-SLM + PLSA | 222 | 15.3% | 175 | 12.5% | 123 | 10.9% |
n-gram/m-SLM | 243 | 7.3% | 171 | 14.5% | (125) | 9.4% |
n-gram/PLSA | 196 | 25.2% | 146 | 27.0% | 102 | 26.1% |
m-SLM/PLSA | 198 | 24.4% | 140 | 30.0% | (103) | 25.4% |
n-gram/PLSA + m-SLM/PLSA | 183 | 30.2% | 140 | 30.0% | (93) | 32.6% |
n-gram/m-SLM+m-SLM/PLSA | 183 | 30.2% | 139 | 30.5% | (94) | 31.9% |
n-gram/m-SLM + n-gram/PLSA | 184 | 29.8% | 137 | 31.5% | (91) | 34.1% |
n-gram/m-SLM + n-gram/PLSA + m-SLM/PLSA | 180 | 31.3% | 130 | 35.0% | — | — |
n-gram/m-SLM/PLSA | 176 | 32.8% | — | — | — | — |
corpus | m = 2 | m = 3 | m = 4 |
---|---|---|---|
44 M | 189,002,525 | 269,685,833 | 318,174,025 |
230 M | 267,507,672 | 1,154,020,346 | 1,417,977,184 |
1.3 B | 946,683,807 | 1,342,323,444 | 1,849,882,215 |
In Table 5, for the composite n-gram/m-SLM model (n = 3, m = 2 and n = 4, m = 3) trained on 44 million tokens and 230 million tokens, we cut off its fractional expected counts that are less than a threshold of 0.005; this significantly reduces the number of the predictor's types by 85%. When we train the composite language model on the 1.3 billion token corpus, we have to both aggressively prune the parameters of the WORD-PREDICTOR and shrink the order of the n-gram and m-SLM in order to store them on a supercomputer having 1,000 cores. In particular, for the composite 5-gram/4-SLM model, its size is too big to store, thus we use its approximation, a linear combination of 5-gram/2-SLM and 2-gram/4-SLM. For the 5-gram/2-SLM or 2-gram/4-SLM, again we cut off its fractional expected counts that are less than a threshold of 0.005, which significantly reduces the number of the predictor's types by 85%. For the composite 4-SLM/PLSA model, we cut off its fractional expected counts that are less than a threshold of 0.002; again this significantly reduces the number of the predictor's types by 85%. For the composite 4-SLM/PLSA model or its linear combination with other models, we ignore all the tags and use only the words in the four headwords. We have checked that the conditional language model (Equation [30]) sums to 1 for a large number of randomly selected conditional events. The composite n-gram/m-SLM/PLSA model gives significant perplexity reductions over baseline n-grams (n = 3, 4, 5) and m-SLMs (m = 2, 3, 4). The majority of the gains comes from the PLSA component, but when adding the SLM component into the n-gram/PLSA, there is a further 10% relative perplexity reduction.
Table 7 shows how large the composite 5-gram/PLSA, 5-gram/2-SLM (or 2-gram/4-SLM), and 4-SLM/PLSA models are when trained on the 1.3 billion token corpus after aggressive pruning. The total minimum number of servers used to store the parameters of the predictor for the composite 5-gram/PLSA, 5-gram/2-SLM (or 2-gram/4-SLM), and 4-SLM/PLSA models is, respectively, 400, 240, and 400, and the number of clients used to store the partitioned data of the 1.3 billion token corpus is 100 for these three composite language models. There is no way to store the parameters of the linear combination of the composite 5-gram/PLSA, 5-gram/2-SLM (or 2-gram/4-SLM), and 4-SLM/PLSA models with our currently available supercomputer resources.
composite model | # of types | # of servers | # of clients |
---|---|---|---|
5-gram/PLSA | 4.39 B | 400 | 100 |
5-gram/2-SLM + 2-gram/4-SLM | 2.01 B | 240 | 100 |
4-SLM/PLSA | 4.88 B | 400 | 100 |
Appendix A shows an example of the sentence probabilities provided by the 5-gram, 5-gram/PLSA, and 5-gram/4-SLM + 5-gram/PLSA models, respectively; these language models are trained using the 1.3 billion token corpus. The example demonstrates that our composite model is able to extract topic information and grammatical structure to improve word prediction for natural language.
Table 8 shows the perplexity results for composite n-gram/PLSA and n-gram/m-SLM/PLSA language models when three methods are used to re-estimate the parameters of the SEMANTIZER of the test document; we use superscripts 1, 2, and 3 to denote that during testing we used one-step on-line EM, on-line EM with a fixed learning rate, and batch EM, respectively. The on-line EM with a fixed learning rate gives the best perplexity results as well as the least computation time. Again, when we train the composite language model on the 1.3 billion token corpus, we have to shrink the order of the n-gram and m-SLM in order to store them on a supercomputer having 1,000 cores. For the composite 4-SLM/PLSA model or its linear combination with other models, we ignore all the tags and use only the words in the four headwords. For the composite 5-gram/4-SLM model or its linear combination with other models, we in fact use its approximation, a linear combination of the 5-gram/2-SLM and 2-gram/4-SLM models.
language model | 44M n = 3, m = 2 | reduction | 230M n = 4, m = 3 | reduction | 1.3B n = 5, m = 4 | reduction |
---|---|---|---|---|---|---|
n-gram (linear) | 262 | | 200 | | 138 | |
n-gram/PLSA1 | 202 | 22.9% | 150 | 25.0% | 107 | 22.5% |
n-gram/m-SLM + n-gram/PLSA1 | 192 | 26.7% | 142 | 29.0% | (97) | 29.1% |
n-gram/PLSA2 | 196 | 25.2% | 146 | 27.0% | 102 | 26.1% |
n-gram/m-SLM + n-gram/PLSA2 | 184 | 29.8% | 137 | 31.5% | (91) | 34.1% |
n-gram/PLSA3 | 201 | 23.3% | 148 | 26.0% | 104 | 24.6% |
n-gram/m-SLM + n-gram/PLSA3 | 189 | 27.9% | 140 | 30.0% | (92) | 33.3% |
To better explain and analyze our model, we mark the perplexity results for the 44 million token corpus in Table 5 on the vertices in Figure 3 to reveal many insights. The baseline trigram result is given by the vertex p(w|w−2w−1), the 2-SLM result is given by the vertex p(w|h−2h−1), the PLSA result is given by the vertex p(w|g), the trigram/2-SLM result is given by the vertex p(w|w−2w−1h−2h−1), the trigram/PLSA result is given by the vertex p(w|w−2w−1g), and the trigram/2-SLM/PLSA result is given by the vertex p(w|w−2w−1h−2h−1g). The trigram + 2-SLM result is given by a linear combination of vertices p(w|w−2w−1) and p(w|h−2h−1); the trigram + PLSA result is given by a linear combination of vertices p(w|w−2w−1) and p(w|g); and the trigram + 2-SLM + PLSA result is given by a linear combination of vertices p(w|w−2w−1), p(w|h−2h−1), and p(w|g). The trigram/PLSA + 2-SLM/PLSA result is given by a linear combination of vertices p(w|w−2w−1g) and p(w|h−2h−1g), and so on. The trigram/PLSA + trigram/2-SLM + 2-SLM/PLSA result is given by a linear combination of vertices p(w|w−2w−1g), p(w|w−2w−1h−2h−1), and p(w|h−2h−1g). The composite trigram/2-SLM/PLSA language model is more powerful and expressive than the linear combination of trigram, 2-SLM, and PLSA for two reasons. First, valuable relative frequency estimates such as f(w|w−2w−1h−2h−1g), f(w|w−2w−1h−2h−1), and so forth, are encoded into the composite language model, as seen from Figure 3. As long as there are events such as w−2w−1wh−2h−1g, and so on, that occur explicitly or implicitly in the training corpus, the composite trigram/2-SLM/PLSA will take them into account to improve the prediction power for test data, whereas a linear combination of trigram, 2-SLM, and PLSA just neglects a large amount of this valuable information. The second reason is that the weights used in a simple linear combination are context-independent, and thus more restricted. Similarly, the composite trigram/2-SLM/PLSA language model is more powerful and expressive than a linear combination of pairwise composite language models (e.g., trigram/2-SLM, trigram/PLSA, and 2-SLM/PLSA), since the composite trigram/2-SLM/PLSA can take advantage of the relative frequency estimates f(w|w−2w−1h−2h−1g), f(w|w−2w−1h−1g), and f(w|w−1h−2h−1g). The improvement in this case shrinks, however, because pairwise composite language models already use some valuable lower order relative frequency estimates such as f(w|w−2w−1g), and so forth. Stated another way, each vertex of the lattice in Figure 3 is an expert WORD-PREDICTOR that is proficient in making a prediction based on the context represented at that vertex; it predicts words based on the information provided by a committee consisting of experts from parent vertices as well as the relative frequency estimate it extracts. These experts are hierarchically organized, with the WORD-PREDICTOR of the composite trigram/2-SLM/PLSA (i.e., p(w|w−2w−1h−2h−1g)) overseeing all available information to make the most powerful prediction.
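The context-independence argument can be seen in a toy numerical example: a linear combination applies the same weights whether or not the joint context is informative, whereas a composite estimate conditioned on the joint context can react to it. All probabilities below are hypothetical.

```python
# Hypothetical component and joint-context estimates.
p_given_words = {("the", "bank"): {"loan": 0.6, "river": 0.4}}        # ~ p(w | w-2 w-1)
p_given_topic = {"finance":   {"loan": 0.8, "river": 0.2},
                 "geography": {"loan": 0.1, "river": 0.9}}            # ~ p(w | g)
p_joint = {(("the", "bank"), "finance"):   {"loan": 0.95, "river": 0.05},
           (("the", "bank"), "geography"): {"loan": 0.15, "river": 0.85}}  # ~ p(w | w-2 w-1, g)

def linear_combination(w, history, topic, lam=0.5):
    # Fixed weights, regardless of how informative each context is here.
    return lam * p_given_words[history][w] + (1 - lam) * p_given_topic[topic][w]

def composite(w, history, topic):
    # The joint relative-frequency estimate reacts to the full context.
    return p_joint[(history, topic)][w]

for topic in ("finance", "geography"):
    print(topic,
          round(linear_combination("river", ("the", "bank"), topic), 3),
          round(composite("river", ("the", "bank"), topic), 3))
```

Here the linear combination assigns river a probability of 0.65 under the geography topic, while the joint-context estimate can push it to 0.85; no single choice of fixed weights reproduces both topic-specific behaviors at once.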
Finally, we conducted experiments where we fixed the size of the training data and increased the complexity of our language models. Because the available resources are limited, preventing us from considering more complex language models trained on the 1.3 billion token corpus, we considered more complex language models trained on the 44 million token corpus instead. Table 9 shows the perplexity results. We can see that as we increase the order of the n-gram and m-SLM from n = 3 and m = 2 to n = 4 and m = 3, the composite language models become better and achieve up to 5% perplexity reductions; when we increase the order of the n-gram and m-SLM to n = 5 and m = 4, however, the composite language models become worse and slightly overfit the data even if we use linear interpolation smoothing, and there are no further perplexity reductions.
language model | 44M n = 3, m = 2 | reduction | 44M n = 4, m = 3 | reduction | 44M n = 5, m = 4 | reduction |
---|---|---|---|---|---|---|
baseline n-gram (linear) | 262 | | 258 | | 260 | |
n-gram (Kneser-Ney) | 244 | 6.9% | 235 | 8.9% | 235 | 9.6% |
m-SLM | 279 | −6.5% | 254 | 1.6% | 254 | 2.3% |
n-gram + m-SLM | 247 | 5.7% | 233 | 9.7% | 234 | 10.0% |
n-gram + PLSA | 235 | 10.3% | 230 | 10.9% | 231 | 11.2% |
n-gram + m-SLM + PLSA | 222 | 15.3% | 220 | 14.7% | 221 | 15.0% |
n-gram/m-SLM | 243 | 7.3% | 232 | 10.1% | 235 | 9.6% |
n-gram/PLSA | 196 | 25.2% | 189 | 26.7% | 193 | 25.8% |
m-SLM/PLSA | 198 | 24.4% | 190 | 26.4% | 192 | 26.2% |
n-gram/PLSA + m-SLM/PLSA | 183 | 30.2% | 179 | 30.6% | 178 | 31.5% |
n-gram/m-SLM + m-SLM/PLSA | 183 | 30.2% | 178 | 31.0% | 180 | 30.8% |
n-gram/m-SLM + n-gram/PLSA | 184 | 29.8% | 176 | 31.8% | 178 | 31.5% |
n-gram/m-SLM + n-gram/PLSA + m-SLM/PLSA | 180 | 31.3% | 173 | 33.0% | 173 | 33.5% |
n-gram/m-SLM/PLSA | 176 | 32.8% | 169 | 34.5% | 171 | 34.2% |
Let p denote the true (but unknown) distribution of natural language; its information projection onto the family of n-grams is the n-gram model with minimum Kullback-Leibler divergence from p (Amari and Nagaoka 2000; Wang, Greiner, and Wang 2009). Let pM, pB, and pT denote the empirical distributions of natural language for a million token corpus, a billion token corpus, and a trillion token corpus, respectively. The information projection of pM onto the trigram family is pM3, onto the 4-gram family is pM4, and onto the 5-gram family is pM5. The distance between p and its information projection onto n-grams is the approximation error incurred when using an n-gram to represent p, that is, the best the n-gram can do when abundant data are available. The distance between this projection and pMn is the estimation error when only the million token corpus is available. The Pythagorean theorem states that the distance between p and pMn is the sum of the approximation error and the estimation error (Barron and Sheu 1991; Amari and Nagaoka 2000; Wang, Greiner, and Wang 2009). In language modeling research, because p is unknown, the distance between p and pMn, n = 3, 4, is approximately computed by the perplexity result on test data. By the Glivenko-Cantelli theorem (Vapnik 1998), we know that the empirical distribution converges to the true distribution p; similarly, the information projection of the empirical distribution onto an n-gram converges to the information projection of the true distribution onto the n-gram (i.e., the estimation error shrinks to 0). In the same vein, we can define the information projection of p or of the empirical distributions onto the composite language models and the corresponding approximation error and estimation error, and so forth. In this case, the Pythagorean theorem breaks down due to the non-convexity of the set of composite language models. As noted by Dr. Ciprian Chelba in our private communication on March 20th, 2010, “When playing with large data, the model capacity is an important factor to language model performance: The supply of more data needs to be matched by demand on the model side. A simple way to achieve this in n-grams is to increase the order n as much as the data will allow. This of course implies that the computational aspects of storing and serving such models are solved and that it is not a constraint” (see also Chelba et al. 2010). This is also true for our composite language models, as justified by the results in Tables 5 and 9: The composite n-gram/m-SLM/PLSA language model has rich features and thus has a smaller approximation error than the n-gram, m-SLM, PLSA, any composite model of two of them, or their linear combinations. Table 5 shows that the information projection of the empirical distributions for the million and billion token corpora onto the composite n-gram/m-SLM/PLSA language model is closer to the true distribution p; this is reflected approximately by the perplexity results on test data.
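In symbols, writing pn for the information projection of the true distribution p onto n-grams and pMn for the projection of the million-token empirical distribution (notation introduced here for convenience), the decomposition referred to above is

\[
D\!\left(p \,\Vert\, p_{Mn}\right) \;=\; \underbrace{D\!\left(p \,\Vert\, p_{n}\right)}_{\text{approximation error}} \;+\; \underbrace{D\!\left(p_{n} \,\Vert\, p_{Mn}\right)}_{\text{estimation error}},
\]

where D(· ‖ ·) denotes the Kullback-Leibler divergence.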
6.3 Re-ranking Machine Translation Results
We have applied our composite 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 language model, trained on the 1.3 billion token corpus, to the task of re-ranking the N-best list in statistical MT. We used the same two 1,000-best lists that were used by Zhang and colleagues (Zhang, Hildebrand, and Vogel 2006; Zhang 2008; Zhang et al. 2011). The first list was generated on 919 sentences of 100 documents from the MT03 Chinese–English evaluation set, and the second was generated on 191 sentences of 20 documents from the MT04 Chinese–English evaluation set, both by Hiero (Chiang 2007), a state-of-the-art parsing-based translation model. Its decoder uses a trigram language model trained with modified Kneser-Ney smoothing (Jurafsky and Martin 2008) on a 200 million token corpus. Each translation has 11 features, and the language model is one of them. We substitute our language model for the baseline one and use MERT (Och 2003) to optimize the Bleu score (Papineni et al. 2002). We conduct two experiments on these two data sets. In the first experiment, we partition the first data set, which consists of 100 documents, into ten pieces; each piece consists of 10 documents, nine pieces are used as training data to optimize the Bleu score (Papineni et al. 2002) by MERT (Och 2003), and the remaining piece is used to re-rank the 1,000-best list and obtain the Bleu score. The cross-validation process is then repeated 10 times (the folds), with each of the 10 pieces used exactly once as the validation data. The 10 results from the folds can then be averaged (or otherwise combined) to produce a single estimate of the Bleu score. The mean and variance of the Bleu score are calculated for each different LM. We assume that the score follows Student's t-distribution and we compute the 95% confidence interval from the mean and variance. Table 10 shows the Bleu scores obtained through 10-fold cross-validation. The composite 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 language model gives a 1.57 percentage point Bleu score improvement over the baseline and a 0.79 percentage point Bleu score improvement over the 5-gram. We are not able to further improve the Bleu score when we use either the 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA2 or the 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA3. This is because there is not much diversity in the 1,000-best list; essentially only 20 ∼ 30 distinct sentences are in the 1,000-best list.
system model | mean (%) | 95% CI (%) |
---|---|---|
Baseline | 31.75 | 0.22 |
5-gram | 32.53 | 0.24 |
5-gram/2-SLM + 2-gram/4-SLM | 32.87 | 0.24 |
5-gram/PLSA1 | 33.01 | 0.24 |
5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 | 33.32 | 0.25 |
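The confidence intervals in Table 10 come from the fold-level means and variances described above; a minimal sketch of that computation, with hypothetical per-fold Bleu scores rather than the paper's raw values, is:

```python
import math

def t_confidence_interval(scores, t_crit=2.262):
    """95% CI half-width for k-fold scores assuming a Student's t distribution.

    t_crit = 2.262 is the two-sided 97.5% quantile for 9 degrees of freedom
    (10 folds), included here as an illustrative constant.
    """
    k = len(scores)
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / (k - 1)   # sample variance
    half_width = t_crit * math.sqrt(var / k)               # t * s / sqrt(k)
    return mean, half_width

# Hypothetical per-fold Bleu scores (percent).
folds = [33.1, 33.5, 33.2, 33.4, 33.0, 33.6, 33.3, 33.2, 33.5, 33.4]
print(t_confidence_interval(folds))
```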
In the second experiment, we used the first data set as training data to optimize the Bleu score by MERT, then the second data set is used to re-rank the 1,000-best list and obtain the Bleu score. To obtain the confidence interval of the Bleu score, we resort to the bootstrap resampling described by Koehn (2004). We randomly select 10 re-ranked documents from the 20 re-ranked documents in the second data set with replacement. We draw the translation results of the 10 documents and compute the Bleu score. We repeat this procedure 1,000 times. When we compute the 95% confidence interval, we drop the top 25 and bottom 25 Bleu scores, and only consider the range of 26th to 975th Bleu scores. Table 11 shows the Bleu scores. These statistics are computed with different language models, but on the same chosen test sets. The 5-gram gives 0.51 percentage point Bleu score improvement over the baseline. The composite 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 language model gives 1.19 percentage point Bleu score improvement over the baseline and 0.68 percentage point Bleu score improvement over the 5-gram.
system model | mean (%) | 95% CI (%) |
---|---|---|
Baseline | 27.59 | 0.31 |
5-gram | 28.10 | 0.32 |
5-gram/2-SLM + 2-gram/4-SLM | 28.34 | 0.32 |
5-gram/PLSA1 | 28.53 | 0.31 |
5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 | 28.78 | 0.31 |
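A minimal sketch of the bootstrap procedure described above, with hypothetical per-document Bleu scores and the simplification that the corpus-level score of a resample is the mean of per-document scores (real Bleu aggregates n-gram counts before scoring):

```python
import random

def bootstrap_bleu_ci(doc_scores, n_draw=10, n_boot=1000, seed=0):
    """Resample documents with replacement, recompute a score per replicate,
    and read the 95% interval off the sorted replicates (dropping the top 25
    and bottom 25 of 1,000), in the spirit of Koehn (2004)."""
    random.seed(seed)
    replicates = []
    for _ in range(n_boot):
        sample = [random.choice(doc_scores) for _ in range(n_draw)]
        replicates.append(sum(sample) / len(sample))
    replicates.sort()
    return replicates[25], replicates[-26]      # 26th and 975th values

# Hypothetical per-document Bleu scores for the 20 re-ranked documents.
doc_scores = [27.0, 28.5, 29.1, 26.4, 28.9, 27.7, 28.2, 29.5, 27.3, 28.0,
              26.8, 29.0, 28.4, 27.9, 28.6, 27.1, 28.8, 29.2, 26.9, 28.3]
print(bootstrap_bleu_ci(doc_scores))
```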
Chiang (2007) studied the performance of machine translation with Hiero: the Bleu score is 33.31% when an n-gram is used to re-rank the N-best list; the Bleu score becomes significantly higher (37.09%), however, when the n-gram is embedded directly into Hiero's one-pass decoder. This is because there is not much diversity in the N-best list. It is expected that putting our composite language model into a one-pass decoder should result in much improved Bleu scores.
Besides reporting the Bleu scores, we examine the “readability” of translations, similar to the study conducted by Charniak, Knight, and Yamada (2003). The translations are sorted by human judges into four groups: good/bad syntax crossed with good/bad meaning (see Table 12). With the composite language model, many more sentences are perfect, many more are grammatically correct, and many more are semantically correct. The syntactic language model of Charniak et al. (2003) only improves the grammaticality of translations and does not improve their preservation of meaning; the composite 5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 language model improves both significantly. Bear in mind that Charniak et al. (2003) integrated Charniak's language model with the syntax-based translation model proposed by Yamada and Knight (2001) to rescore a tree-to-string translation forest, whereas we use only our language model for N-best list re-ranking. Also, the same study (Charniak et al. 2003) found that the outputs produced using the n-gram received higher Bleu scores than those produced using the syntactic language model; this was not the case in our experiments. The difference between human judgments and Bleu scores indicates that closer agreement may be possible by incorporating syntactic structure and semantic information into the Bleu evaluation. For example, semantically similar words such as insure and ensure, discussed in the Bleu paper (Papineni et al. 2002), could be allowed to substitute for one another in the formula, and a weight could be introduced to measure the goodness of syntactic structure. Such a modification would lead to a better metric, and the required syntactic and semantic information can be provided by our composite language models.
Table 12: “Readability” of the 919 translations as judged by humans: P = perfect, S = only semantically correct, G = only grammatically correct, W = wrong.

system model | P | S | G | W
---|---|---|---|---
Baseline | 95 | 398 | 20 | 406
5-gram | 122 | 406 | 24 | 367
5-gram/2-SLM + 2-gram/4-SLM + 5-gram/PLSA1 | 151 | 425 | 33 | 310
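To make the proposed modification concrete, the sketch below illustrates one possible relaxation of Bleu's clipped unigram matching in which a candidate word may also be credited against an unused, semantically similar reference word. It is only a thought experiment, not a metric evaluated in this article, and the synonym test are_synonyms is a hypothetical placeholder (e.g., a WordNet-synset lookup).

```python
from collections import Counter

# Hypothetical synonym test, e.g., backed by WordNet synsets or a thesaurus;
# here it degenerates to exact string match.
def are_synonyms(w1, w2):
    return w1 == w2

def synonym_aware_unigram_precision(candidate, reference):
    """Clipped unigram precision in which a candidate word may be credited
    against any unused reference word judged semantically similar
    (e.g., "insure" vs. "ensure")."""
    unused = Counter(reference)   # remaining credit per reference word
    matched = 0
    for word in candidate:
        for ref_word in unused:
            if unused[ref_word] > 0 and are_synonyms(word, ref_word):
                unused[ref_word] -= 1
                matched += 1
                break
    return matched / max(len(candidate), 1)
```

A syntactic weight could be incorporated analogously, for example by interpolating such a precision with a parser-based grammaticality score.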
In Appendix B, we give examples of “perfect” sentences, “only semantically correct” sentences, and “only grammatically correct” sentences.
7. Conclusion and Future Work
We have built a powerful large-scale distributed composite language model that integrates the well-known n-gram, SLM, and PLSA models under the directed MRF paradigm. The composite language model has been trained by performing a convergent N-best list approximate EM algorithm and a follow-up EM algorithm to improve word prediction power on corpora of up to 1.3 billion tokens, stored on a supercomputer. We have achieved drastic perplexity reductions and significantly better translation quality, measured by the Bleu score and the “readability” of translations, in the task of re-ranking the N-best list from a state-of-the-art parsing-based MT system. As far as we know, this is the first work to build a complex large-scale distributed language model with a principled approach that simultaneously exploits syntactic, semantic, and lexical regularities and that remains more powerful than n-grams trained on the same very large corpus. It is reasonable to conjecture that composite language models can achieve drastic perplexity reductions and significantly better translation quality than n-grams when trained on Web-scale corpora of trillions of tokens.
As stated in Wang et al. (2010, p. 45), “Since Banko and Brill's pioneering work almost a decade ago (Banko and Brill 2001), it has been widely observed that the effectiveness of statistical natural language processing (NLP) techniques is highly susceptible to the data size used to develop them. As empirical studies have repeatedly shown that simple algorithms can often outperform their more complicated counterparts in wide varieties of NLP applications with large data sets, many have come to believe that it is the size of data, not the sophistication of the algorithms, that ultimately play the central role in modern NLP (Norvig 2008).” It is true that ‘the more the data, the better the result,’ a dictum recently reiterated in a somewhat stronger form by Halevy, Norvig, and Pereira (2009), but care needs to be taken here. As we explained in the last paragraph of Section 6.2, as we increase the size of the data we should also increase the complexity of the model in order to achieve the best results. For language modeling in particular, because the expressive power of simple n-grams is rather limited, it is worthwhile to exploit the latent semantic information and syntactic structure that constrain the generation of natural language; this usually involves designing sophisticated algorithms. Of course, this implies that a huge amount of computing resources is required. As cloud computing becomes the dominant platform for data management and information processing, delivered as utility computing, such computation will become feasible and affordable.
The development of the large-scale distributed composite language model is in its infancy; we plan to deepen this research and push it to its limit. Specifically, we plan to integrate more advanced topic language models such as LDA (Blei, Ng, and Jordan 2003), and to resort to hierarchical non-parametric Bayesian models (Teh 2006; Teh and Jordan 2010) to smooth the fractional counts due to latent variables and thus handle the sparse data problem in Kneser-Ney's sense in a principled manner, constructing a family of large-scale distributed composite lexical, syntactic, and semantic language models. Finally, we will put this family of composite language models into a phrase-based machine translation decoder (Koehn, Och, and Marcu 2003), which produces a lattice of alternative translations/transcriptions, or a syntax-based decoder (Chiang 2005, 2007), which produces a forest of alternatives (such integration would, in the exact case, reside in an extremely difficult complexity class, probably PSPACE-complete), to significantly improve the performance of state-of-the-art machine translation systems.
Appendix A: An Example of Sentence Probability
We chose a document from the LDC English Gigaword corpus to show how sentence probability varies when computed by the 5-gram, 5-gram/PLSA, and 5-gram/PLSA + 4-SLM/PLSA models. The document tag is 〈XIN_ENG_20041126_0168.story〉. This document's perplexity computed by the 5-gram, 5-gram + PLSA, 5-gram + 4-SLM + PLSA, 5-gram/PLSA, and 5-gram/PLSA + 4-SLM/PLSA models, each trained on the 1.3-billion-token corpus, is 97, 93, 83, 71, and 64, respectively. We show the first four sentences below.
〈s〉 cpc initiates education campaign to strengthen members ' wavering convictions 〈/s〉 〈s〉 by zhao lei 〈/s〉 〈s〉 beijing nov. 'nmbr xinhua the communist party of china cpc has decided to launch a mass internal educational campaign from january next year to prevent its members from wavering in their convictions 〈/s〉 〈s〉 the decision aiming to keep the nature of the party members intact was made at the meeting of the political bureau of the cpc central committee on this oct. 'nmbr the cpc 's top power organ 〈/s〉 ⋯⋯
We then list, for the fourth sentence, the natural-log conditional probability of each word given its document history. The first line is the fourth sentence itself; line (a) gives the values computed by the 5-gram, (b) by 5-gram + PLSA, (c) by 5-gram + 4-SLM + PLSA, (d) by 5-gram/PLSA, and (e) by 5-gram/PLSA + 4-SLM/PLSA.
The conditional probability of the word party and of the words political bureau given the document history, computed by 5-gram/PLSA or 5-gram/PLSA + 4-SLM/PLSA, is significantly boosted by the appearance of semantically related words such as cpc and communist party in the previous sentences; this clearly shows that the composite language models (5-gram/PLSA and 5-gram/PLSA + 4-SLM/PLSA) trigger long-span document-level discourse topics to influence word prediction. In contrast, there is no such effect with the linear combination models (i.e., 5-gram + PLSA and 5-gram + 4-SLM + PLSA). Similarly, the conditional probability of the words was made (or the word intact) given the document history, computed by 5-gram/PLSA + 4-SLM/PLSA, is significantly boosted by the appearance of the grammatical headword decision (or keep) in the same sentence; this clearly shows that the composite language model (5-gram/PLSA + 4-SLM/PLSA) exploits sentence-level syntactic structure to influence word prediction. To capture these dependencies, an n-gram would have to increase its order to 11 or 8, respectively. The linear combination model 5-gram + 4-SLM + PLSA is quite effective, although it has a negative impact on the prediction of function words such as of the after the words nature and political bureau.
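For reference, the perplexities quoted at the beginning of this appendix are obtained by exponentiating the negative average of exactly these per-word natural-log probabilities. The following minimal sketch (our own illustration, not the evaluation code) shows that aggregation, with log_prob a hypothetical interface to any of the component or composite models.

```python
import math

# Hypothetical interface: natural-log probability of `word` given the whole
# document history under some language model (5-gram, 5-gram/PLSA, ...).
def log_prob(model, word, history):
    raise NotImplementedError

def document_perplexity(model, sentences):
    """Perplexity = exp(-(1/N) * sum of natural-log word probabilities),
    where each word is conditioned on its full document history."""
    history, total, n_words = [], 0.0, 0
    for sentence in sentences:
        for word in sentence:
            total += log_prob(model, word, history)
            history.append(word)
            n_words += 1
    return math.exp(-total / n_words)
```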
Table 13 shows how often the n-grams are identical to the SLM's WORD-PREDICTOR in the most likely parse structure of each sentence in the training corpora. Whenever an n-gram is not the same as the SLM's WORD-PREDICTOR, the SLM component is able to furnish sentence-level long-range grammatical information.
Table 13: Percentage of n-grams that are identical to the SLM's WORD-PREDICTOR in the most likely parse structure, for each training corpus.

Corpus |  |  | 
---|---|---|---
44 M | 57% | 46% | 38%
230 M | 59% | 46% | 38%
1.3 B | 55% | 48% | 43%
This example and Table 13 clearly demonstrate that an n-gram alone is not able to achieve an effect similar to that of the SLM and PLSA, even when using Web-scale data, and that the directed MRF paradigm effectively synergizes the n-gram, m-SLM, and PLSA in a complementary, supplementary, and coherent way to form a powerful language model for word prediction in natural language.
Appendix B: Examples of Translation Results
In the following, we give examples of “perfect” sentences, “only semantically correct” sentences, and “only grammatically correct” sentences. The number preceding each example is the sentence number in the N-best list from Hiero; (a) denotes the reference sentence, (b) the result selected by the composite language model, and (c) the result selected by the 5-gram.
A few examples of “perfect” sentences provided by the composite language model:
—512—
a. Sri Lanka's Prime Minister Calls on the People to Work together for Permanent Peace
b. Sri Lanka prime minister called on national common efforts to achieve lasting peace
c. Sri Lanka prime minister called on the national common achieve lasting peace
—54—
a. Wilner said the maximum penalty for securities fraud is 10 years imprisonment. However, the sentence is expected to be “significantly shorter” under the plea deal.
b. wiener, said securities fraud charges could be sentenced to 10 years' imprisonment, according to pleaded guilty mitigation, the sentence is “shorten”.
c. wiener, sentenced to 10 years' imprisonment maximum securities fraud charges, according to pleaded guilty mitigation, the sentence is “shorten”.
—206—
a. He said at a press conference in Doha, capital of Qarta, that if the United States “attacks Iraq, it may trigger a global disaster.”
b. his press conference in doha, capital of qatar, said “if the united states attacks iraq, it will trigger a world disaster”.
c. his press conference in doha, capital of qatar, said that the united states attacks iraq, “if it will trigger a world disaster”.
—249—
a. Some Areas in Northwest Australia Face floods
b. floods in some areas in the northwest australia
c. australia northwest part of floods
A few examples of “only grammatically correct” sentences provided by the composite language model:
—458—
a. Sutiyoso said that gardens and flower beds would reduce the impression that the US embassy is a fort.
b. szudy about woven said that garden landscape could reduce the us embassy to a fortress.
c. szudy over so that garden landscape can reduce the u.s. embassy to a fortress.
—676—
a. He said that during last Christmas and the New Year, mainland tourists' spending accounted for 30
b. during christmas last year, he said, the mainland visitors spending will account for a three to four percent of the kaneyuki business and become the major consumer of the industry.
c. last year, he said, mainland visitors during the christmas spending for the kaneyuki 3 to 4 percent of the business, has become the major consumption.
A few examples of “only semantically correct” sentences provided by the composite language model:
—507—
a. The famous historic city of Cologne also narrowly escaped the disaster in the heavy rains.
b. cologne, a famous historical city also escaped unscathed in the heavy rain.
c. cologne, a famous historical city in heavy rain, escaped unscathed.
—416—
a. However, he insisted on the timetable laid down by Bush. That is UN only has “weeks but not months” to try to disarm Iraq peacefully and it would be military action thereafter.
b. however, he insists the bush timetable, the united nations is “weeks rather than months” to urge iraq to the peace disarm, then we will take military action.
c. however, he insists that the bush timetable, the only “weeks rather than months” to urge iraq to the peace disarm, she went on to take military action.
—787—
a. France circulated its proposals in the form of “a non-paper.”
b. france is to distribute their proposals in the form of “non - paper.”
c. france is the form of “non - paper” distribute their proposals.
—313—
a. In China, three-quarters of the 1.3 billion population were reported to have celebrated the New Year by watching television.
b. 1.3 billion population in china, according to reports, 3 / 4 is to watch tv celebrate lunar new year.
c. 1.3 billion population in china, according to reports, 3 / 4 is to celebrate televisions.
Acknowledgements
We would like to dedicate this work to the memory of Fred Jelinek, who passed away while we were finalizing this manuscript. Fred Jelinek laid the foundation for modern speech recognition and text translation technology, and his work has greatly influenced us. This research is supported by the National Science Foundation under grant IIS RI-small 0812483, a Google research award, and the Air Force Office of Scientific Research under grant FA9550-10-1-0335. We would like to thank the Ohio Supercomputer Center for an allocation of computing time that made this research possible; Ciprian Chelba for providing the SLM code, answering many questions regarding the SLM, and consulting on various aspects of the work; Ying Zhang and Philip Resnik for providing the 1,000-best lists from Hiero for re-ranking in machine translation; and Peng Xu for suggesting that we look at the conditional probability of a word given its document history, which makes the perplexity results much more convincing. Finally, we would like to thank the reviewers, who made a number of invaluable suggestions about the writing of the paper and pointed out many weaknesses in our original manuscript.
Author notes
Kno.e.sis Center and Department of Computer Science and Engineering, Wright State University, Dayton OH 45435. E-mail: [email protected].
Kno.e.sis Center and Department of Computer Science and Engineering, Wright State University, Dayton OH 45435. E-mail: [email protected].
Kno.e.sis Center, Wright State University, Dayton OH 45435. E-mail: [email protected].
Kno.e.sis Center and Department of Computer Science and Engineering, Wright State University, Dayton OH 45435. E-mail: [email protected].