Table 5

Comparison of character 5-gram models derived from either the original corpus or a word trigram model. Model size is given as the number of character n-grams, the numbers of states and transitions in the automaton representation, and the file size in MB. The two corpus-estimated models share the same topology and hence the same size, as do the two models estimated from the word trigram model.

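| Source | n-grams (x1000) | states (x1000) | transitions (x1000) | MB | Estimation | bits/char |
| --- | --- | --- | --- | --- | --- | --- |
| Corpus | 336 | 60 | 381 | 6.5 | Kneser-Ney | 2.04 |
| Corpus | 336 | 60 | 381 | 6.5 | Witten-Bell (WB) | 2.01 |
| Word trigram | 277 | 56 | 322 | 5.6 | Sampled (WB) | 2.36 |
| Word trigram | 277 | 56 | 322 | 5.6 | KL min | 1.99 |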