All composite language models are first trained with the N-best-list approximate EM algorithm until convergence, followed by a second stage of EM re-estimation of the WORD-PREDICTOR and SEMANTIZER parameters, again until convergence. We fix the number of PLSA topics at 200 and prune to 5 in the experiments; the 5 retained topics generally account for about 70% of the probability mass of p(g|d). Table 5 reports perplexity results for a range of models, including the composite n-gram/m-SLM, n-gram/PLSA, and m-SLM/PLSA models and their linear combinations, where the SEMANTIZER parameters of each test document are re-estimated with on-line EM using a fixed learning rate. The m-SLM performs competitively with its n-gram counterpart (n = m + 1) on the large-scale corpora. Table 6 lists the number of types in the WORD-PREDICTOR of the m-SLMs on the three corpora; for the 230 million and 1.3 billion token corpora, fractional expected counts below a predefined threshold of 0.005 are cut off, which reduces the number of predictor types by about 70%.
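The test-time adaptation of the SEMANTIZER and the topic pruning can be illustrated with the minimal sketch below, assuming a fixed topic-word distribution, a uniform initialization of the document's topic mixture, and an illustrative learning rate; the array names and the value of `eta` are assumptions for exposition, not the exact implementation.

```python
import numpy as np

def online_em_topic_mixture(doc_words, p_w_given_g, num_topics=200,
                            eta=0.1, keep_top=5):
    """Sketch of on-line EM with a fixed learning rate for re-estimating
    the SEMANTIZER's topic mixture p(g|d) on a test document.

    doc_words   -- sequence of word ids in the test document
    p_w_given_g -- array of shape (num_topics, vocab_size) with topic-word
                   probabilities, held fixed during test-time adaptation
    eta         -- fixed learning rate for the stepwise update (assumed value)
    keep_top    -- number of topics retained after pruning (5 in the paper)
    """
    # Start from a uniform topic mixture for the document.
    p_g_given_d = np.full(num_topics, 1.0 / num_topics)

    for w in doc_words:
        # E-step for one word: posterior over topics given the word.
        joint = p_g_given_d * p_w_given_g[:, w]
        posterior = joint / joint.sum()

        # Stepwise M-step: move the mixture toward the posterior with a
        # fixed learning rate instead of a batch re-estimation pass.
        p_g_given_d = (1.0 - eta) * p_g_given_d + eta * posterior

    # Prune from num_topics (200) down to the keep_top (5) most probable
    # topics; these typically carry about 70% of the mass of p(g|d).
    top = np.argsort(p_g_given_d)[-keep_top:]
    pruned = np.zeros_like(p_g_given_d)
    pruned[top] = p_g_given_d[top]
    pruned /= pruned.sum()
    return pruned
```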
Perplexity results for various language models on the test corpora, where + denotes linear combination and / denotes a composite model; n is the order of the n-gram and m the order of the SLM; the topic nodes are pruned from 200 to 5. Perplexity reductions are relative to the baseline n-gram with linear interpolation smoothing.
| language model | 44M (n = 3, m = 2) | reduction | 230M (n = 4, m = 3) | reduction | 1.3B (n = 5, m = 4) | reduction |
|---|---|---|---|---|---|---|
| baseline n-gram (linear) | 262 | | 200 | | 138 | |
| n-gram (Kneser-Ney) | 244 | 6.9% | 183 | 8.5% | — | — |
| m-SLM | 279 | −6.5% | 190 | 5.0% | 137 | 0.0% |
| PLSA | 825 | −214.9% | 812 | −306.0% | 773 | −460.0% |
| n-gram + m-SLM | 247 | 5.7% | 184 | 8.0% | 129 | 6.5% |
| n-gram + PLSA | 235 | 10.3% | 179 | 10.5% | 128 | 7.2% |
| n-gram + m-SLM + PLSA | 222 | 15.3% | 175 | 12.5% | 123 | 10.9% |
| n-gram/m-SLM | 243 | 7.3% | 171 | 14.5% | (125) | 9.4% |
| n-gram/PLSA | 196 | 25.2% | 146 | 27.0% | 102 | 26.1% |
| m-SLM/PLSA | 198 | 24.4% | 140 | 30.0% | (103) | 25.4% |
| n-gram/PLSA + m-SLM/PLSA | 183 | 30.2% | 140 | 30.0% | (93) | 32.6% |
| n-gram/m-SLM + m-SLM/PLSA | 183 | 30.2% | 139 | 30.5% | (94) | 31.9% |
| n-gram/m-SLM + n-gram/PLSA | 184 | 29.8% | 137 | 31.5% | (91) | 34.1% |
| n-gram/m-SLM + n-gram/PLSA + m-SLM/PLSA | 180 | 31.3% | 130 | 35.0% | — | — |
| n-gram/m-SLM/PLSA | 176 | 32.8% | — | — | — | — |
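The rows marked with + combine the component models by linear interpolation of their next-word probabilities, and all entries are perplexities on held-out test text. The sketch below shows that combination and the perplexity computation under the usual definitions; the `model(history, word)` interface and the interpolation weights are hypothetical stand-ins for the n-gram, m-SLM, and PLSA components, whose weights would be tuned on held-out data.

```python
import math

def interpolate(probs, weights):
    """Linear combination ('+' in the table): a weighted mixture of the
    component models' next-word probabilities; `weights` must sum to 1."""
    return sum(lam * p for lam, p in zip(weights, probs))

def perplexity(test_words, component_models, weights):
    """Perplexity of a linearly interpolated model on a test word sequence.

    component_models -- list of callables model(history, word) -> probability
                        (hypothetical interface standing in for the n-gram,
                        m-SLM, and PLSA components)
    """
    log_prob, n = 0.0, 0
    history = []
    for w in test_words:
        p = interpolate([m(history, w) for m in component_models], weights)
        log_prob += math.log(p)
        history.append(w)
        n += 1
    return math.exp(-log_prob / n)
```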
Number of types in the predictor of the m-SLMs (m = 2, 3, 4) on the 44 million, 230 million, and 1.3 billion token corpora. For the 230 million and 1.3 billion token corpora, fractional expected counts below a threshold are pruned, which reduces the number of m-SLM (m = 3, 4) predictor types by about 70%.
| corpus | m = 2 | m = 3 | m = 4 |
|---|---|---|---|
| 44M | 189,002,525 | 269,685,833 | 318,174,025 |
| 230M | 267,507,672 | 1,154,020,346 | 1,417,977,184 |
| 1.3B | 946,683,807 | 1,342,323,444 | 1,849,882,215 |
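The count cutoff used to obtain these predictor sizes on the two larger corpora can be sketched as follows; the dictionary layout of the expected counts is an assumption for illustration, with only the 0.005 threshold taken from the text.

```python
def prune_expected_counts(expected_counts, threshold=0.005):
    """Sketch of the count cutoff applied on the 230M- and 1.3B-token
    corpora: WORD-PREDICTOR types whose fractional expected counts
    (accumulated during EM) fall below the threshold are dropped,
    removing roughly 70% of the predictor's types.

    expected_counts -- dict mapping predictor types, e.g. (context, word)
                       pairs, to fractional expected counts (hypothetical layout)
    """
    return {t: c for t, c in expected_counts.items() if c >= threshold}
```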