
The out-of-vocabulary (OOV) rates on the 44 million, 230 million, and 1.3 billion token training corpora are 0.6%, 0.9%, and 1.2%, respectively. The OOV rates on the 1.7 million and 13.7 million token check corpora are 0.6% and 1.3%, respectively, and the OOV rate on the 354k token test corpus is 2.0%. Table 2 lists the number of n-gram types in the three training corpora.

Table 2 

Statistics about the number of types of n-grams (n = 3, 4, 5) on the 44 million, 230 million, and 1.3 billion token corpora.


Corpus   n = 3         n = 4         n = 5
44 M     14,302,355    23,833,023    29,068,173
230 M    51,115,539    94,617,433    120,978,281
1.3 B    224,767,319   481,645,099   660,599,586
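
As a rough illustration of how figures like these are typically obtained, the sketch below computes a token-level OOV rate against a fixed vocabulary and counts distinct n-gram types in a token sequence. The function names (oov_rate, count_ngram_types), the toy corpus, and the toy vocabulary are hypothetical. The text does not state how the vocabulary was built or whether n-grams were allowed to cross sentence boundaries, so both are simplifying assumptions here.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens that do not appear in a fixed vocabulary.

    Assumption: the rate is measured per token (not per type).
    """
    if not tokens:
        return 0.0
    oov_count = sum(1 for token in tokens if token not in vocabulary)
    return oov_count / len(tokens)


def count_ngram_types(tokens, n):
    """Number of distinct n-grams (types) in a token sequence.

    Assumption: the corpus is treated as one continuous sequence,
    so n-grams may span sentence boundaries.
    """
    ngram_types = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngram_types)


# Toy data for illustration only; not taken from the corpora above.
tokens = "the cat sat on the mat the cat sat".split()
vocabulary = {"the", "cat", "sat", "on"}

print(f"OOV rate: {oov_rate(tokens, vocabulary):.1%}")
for n in (3, 4, 5):
    print(f"n = {n}: {count_ngram_types(tokens, n)} types")
```

For corpora at the scale reported in Table 2, holding every n-gram in an in-memory set is unlikely to be practical, so a real pipeline would probably count types with sorted, disk-backed, or sharded storage instead.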

