
As we mentioned in Section 3.1.1, we can keep only a small set of topics because of computational time and resource constraints. Table 4 shows the perplexity results and computation time of composite n-gram/PLSA language models trained on the three corpora when the pre-defined total number of topics is 200 but different numbers of most-likely topics are kept for each document in PLSA; the rest are pruned. For the composite 5-gram/PLSA model trained on the 1.3-billion-token corpus, 400 cores have to be used to keep the top five most-likely topics. For the composite trigram/PLSA model trained on the 44M-token corpus, the computation time increases drastically as more topics are kept (from 0.5 hours for the top five topics to 19.3 hours for all 200), while perplexity improves by less than 5% (from 196 to 188). In the following experiments, therefore, we keep the top five topics for each document out of the 200 total topics; the remaining 195 topics are pruned.
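To make the pruning step concrete, the following sketch (in Python, with hypothetical names; the function prune_topics and the dense matrix representation of the document-topic posteriors P(z|d) are our illustrative assumptions, not the authors' implementation) keeps only the k most-likely topics for each document and renormalizes, zeroing the rest, analogous to keeping 5 of 200 topics above.

```python
import numpy as np

def prune_topics(doc_topic: np.ndarray, k: int = 5) -> np.ndarray:
    """Keep the k most likely topics per document (rows of P(z|d)),
    zero out the rest, and renormalize each row to sum to 1.

    doc_topic: (num_docs, num_topics) matrix of topic posteriors.
    """
    num_docs, num_topics = doc_topic.shape
    # Column indices of the k largest entries in each row (order irrelevant).
    top_k = np.argpartition(doc_topic, num_topics - k, axis=1)[:, -k:]
    pruned = np.zeros_like(doc_topic)
    rows = np.arange(num_docs)[:, None]
    pruned[rows, top_k] = doc_topic[rows, top_k]
    # Renormalize so each document's surviving topics form a distribution.
    pruned /= pruned.sum(axis=1, keepdims=True)
    return pruned

# Toy usage: 3 documents, 200 topics, keep the top 5 per document.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.full(200, 0.1), size=3)  # toy P(z|d) rows
pruned = prune_topics(p, k=5)
assert (np.count_nonzero(pruned, axis=1) == 5).all()
```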

Table 4 

Perplexity (ppl) results and training time of the composite n-gram/PLSA language model trained on three corpora when different numbers of most-likely topics are kept for each document in PLSA.

corpus   n   # of topics   ppl   time (hours)   # of servers   # of clients   # of types of …
44M      3        5        196        0.5             40             100          120.1M
44M      3       10        194        1.0             40             100          218.6M
44M      3       20        190        2.7             80             100          537.8M
44M      3       50        189        6.3             80             100          1.123B
44M      3      100        189       11.2             80             100          1.616B
44M      3      200        188       19.3             80             100          2.280B
230M     3        5        146       25.6            280             100          0.681B
1.3B     4        5        111       26.5            400             100          1.790B
1.3B     5        5        102       75.0            400             100          4.391B
