The out-of-vocabulary (OOV) rates on the 44 million, 230 million, and 1.3 billion token training corpora are 0.6%, 0.9%, and 1.2%, respectively. The OOV rates on the 1.7 million and 13.7 million token check corpora are 0.6% and 1.3%, respectively. The OOV rate on the 354k token test corpus is 2.0%. Table 2 lists the number of n-gram types in the three training corpora.
Table 2. Number of n-gram types (n = 3, 4, 5) in the 44 million, 230 million, and 1.3 billion token corpora.
Corpus size (tokens) | n = 3 | n = 4 | n = 5 |
---|---|---|---|
44 M | 14,302,355 | 23,833,023 | 29,068,173 |
230 M | 51,115,539 | 94,617,433 | 120,978,281 |
1.3 B | 224,767,319 | 481,645,099 | 660,599,586 |
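
As a rough illustration of the two quantities reported here, the sketch below computes a token-level OOV rate and the number of distinct n-gram types on a toy corpus. The function names, the toy data, and the token-level definition of the OOV rate (OOV tokens divided by running tokens, against a fixed vocabulary) are assumptions made for illustration; the text does not specify how the vocabulary was built or how the counts were obtained.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of running tokens that are not in the vocabulary.

    Assumes a token-level definition: OOV tokens / total tokens.
    """
    if not tokens:
        return 0.0
    oov_count = sum(1 for token in tokens if token not in vocabulary)
    return oov_count / len(tokens)


def count_ngram_types(tokens, n):
    """Number of distinct n-grams (types, not occurrences) in a token list."""
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams)


# Toy data, purely hypothetical.
toy_corpus = "the cat sat on the mat the cat sat on the rug".split()
toy_vocabulary = {"the", "cat", "sat", "on", "mat"}

print(f"OOV rate: {oov_rate(toy_corpus, toy_vocabulary):.1%}")
for n in (3, 4, 5):
    print(f"n = {n}: {count_ngram_types(toy_corpus, n)} types")
```

Holding every n-gram in an in-memory set is only practical for small corpora; at the scale of Table 2 (hundreds of millions of types), counting would typically need sorting on disk or a distributed job.
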