Table 5

Gross corpus statistics for the pre-processed corpora used to train and evaluate our models. We compare to the WSJ section of the PTB: train (Sections 02–21); dev. (Section 22); test (Section 23). Because of its flat annotation style, the FTB has fewer constituents per sentence. In the ATB, morphological variation accounts for the high proportion of word types to sentences.



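| Split | Statistic      | ATB     | FTB     | WSJ     |
|-------|----------------|---------|---------|---------|
| Train | #sentences     | 18,818  | 13,448  | 39,832  |
| Train | #tokens        | 597,933 | 397,917 | 950,028 |
| Train | #word types    | 37,188  | 26,536  | 44,389  |
| Train | #POS types     | 32      | 30      | 45      |
| Train | #phrasal types | 31      | 24      | 27      |
| Train | avg. length    | 31.8    | 29.6    | 23.9    |
| Dev.  | #sentences     | 2,318   | 1,235   | 1,700   |
| Dev.  | #tokens        | 70,656  | 38,298  | 40,117  |
| Dev.  | #word types    | 12,358  | 6,794   | 6,840   |
| Dev.  | avg. length    | 30.5    | 31.0    | 23.6    |
| Dev.  | OOV rate       | 15.6%   | 17.8%   | 12.8%   |
| Test  | #sentences     | 2,313   | 1,235   | 2,416   |
| Test  | #tokens        | 70,065  | 37,961  | 56,684  |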
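
As a sanity check on the table, the avg. length rows can be recomputed from the sentence and token counts. The sketch below assumes that "avg. length" means tokens divided by sentences, rounded to one decimal; the source does not state the formula, and the names `TABLE5_COUNTS` and `avg_length` are illustrative only.

```python
# Sentence and token counts copied from Table 5 (train and dev. splits).
# Layout: (split, corpus) -> (num_sentences, num_tokens)
TABLE5_COUNTS = {
    ("train", "ATB"): (18_818, 597_933),
    ("train", "FTB"): (13_448, 397_917),
    ("train", "WSJ"): (39_832, 950_028),
    ("dev", "ATB"): (2_318, 70_656),
    ("dev", "FTB"): (1_235, 38_298),
    ("dev", "WSJ"): (1_700, 40_117),
}


def avg_length(num_sentences: int, num_tokens: int) -> float:
    """Mean tokens per sentence, rounded to one decimal as in the table.

    Assumption: "avg. length" is tokens divided by sentences.
    """
    return round(num_tokens / num_sentences, 1)


if __name__ == "__main__":
    for (split, corpus), (n_sent, n_tok) in TABLE5_COUNTS.items():
        print(f"{split:5s} {corpus}: {avg_length(n_sent, n_tok)}")
```

Under that assumption, the six recomputed values match the avg. length entries reported for the train and dev. splits.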

