As mentioned in Section 3.1.1, we can keep only a small set of topics because of computational time and resource demands. Table 4 shows the perplexity results and computation times of composite n-gram/PLSA language models trained on the three corpora when the pre-defined total number of topics is 200 but different numbers of most-likely topics are kept for each document in PLSA; the rest are pruned. For the composite 5-gram/PLSA model trained on the 1.3 billion token corpus, 400 cores have to be used to keep the top five most likely topics. For the composite trigram/PLSA model trained on the 44M token corpus, keeping more topics increases the computation time drastically while improving perplexity by less than 5%. In the following experiments, therefore, we keep the top five topics for each document from a total of 200 topics; the other 195 topics are pruned (see the sketch after Table 4).
Table 4: Perplexity (ppl) results and computation time of the composite n-gram/PLSA language model trained on three corpora when different numbers of most-likely topics are kept for each document in PLSA.
| corpus | n | # of topics | ppl | time (hours) | # of servers | # of clients | # of n-gram/topic pair types |
|---|---|---|---|---|---|---|---|
| 44M | 3 | 5 | 196 | 0.5 | 40 | 100 | 120.1M |
| 44M | 3 | 10 | 194 | 1.0 | 40 | 100 | 218.6M |
| 44M | 3 | 20 | 190 | 2.7 | 80 | 100 | 537.8M |
| 44M | 3 | 50 | 189 | 6.3 | 80 | 100 | 1.123B |
| 44M | 3 | 100 | 189 | 11.2 | 80 | 100 | 1.616B |
| 44M | 3 | 200 | 188 | 19.3 | 80 | 100 | 2.280B |
| 230M | 4 | 5 | 146 | 25.6 | 280 | 100 | 0.681B |
| 1.3B | 5 | 2 | 111 | 26.5 | 400 | 100 | 1.790B |
| 1.3B | 5 | 5 | 102 | 75.0 | 400 | 100 | 4.391B |
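To make the pruning step concrete, the following is a minimal sketch, not taken from the paper, of keeping the top-k most likely topics per document from the PLSA document–topic posteriors and renormalizing the surviving mass. The function name `prune_topics` and the dense NumPy representation of p(g|d) are our assumptions for illustration.

```python
import numpy as np

def prune_topics(doc_topic_probs, k=5):
    """Keep the k most likely topics per document and renormalize.

    doc_topic_probs: (num_docs, num_topics) array of PLSA posteriors p(g|d).
    Returns the indices of the kept topics and their renormalized weights.
    (Hypothetical helper; the paper does not give code.)
    """
    # Indices of the top-k topics for each document (order within k unspecified).
    top_k = np.argpartition(doc_topic_probs, -k, axis=1)[:, -k:]
    # Gather their probabilities and renormalize so each row sums to 1.
    kept = np.take_along_axis(doc_topic_probs, top_k, axis=1)
    kept /= kept.sum(axis=1, keepdims=True)
    return top_k, kept

# Example: 200 topics per document, keep the top 5 (as in the experiments above).
p = np.random.dirichlet(np.ones(200), size=3)  # posteriors for 3 documents
topic_ids, weights = prune_topics(p, k=5)
```

With only the top five of 200 topics retained, the number of n-gram/topic pair types that must be stored and communicated shrinks accordingly, which is what keeps the training times in Table 4 manageable.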