Gross corpus statistics for the pre-processed corpora used to train and evaluate our models. For comparison, we include the WSJ section of the PTB: train (Sections 02–21); dev. (Section 22); test (Section 23). Because of its flat annotation style, the FTB has fewer constituents per sentence. In the ATB, morphological variation accounts for the high ratio of word types to sentences.
| Split | Statistic | ATB | FTB | WSJ |
|---|---|---|---|---|
| Train | #sentences | 18,818 | 13,448 | 39,832 |
| | #tokens | 597,933 | 397,917 | 950,028 |
| | #word types | 37,188 | 26,536 | 44,389 |
| | #POS types | 32 | 30 | 45 |
| | #phrasal types | 31 | 24 | 27 |
| | avg. length | 31.8 | 29.6 | 23.9 |
| Dev. | #sentences | 2,318 | 1,235 | 1,700 |
| | #tokens | 70,656 | 38,298 | 40,117 |
| | #word types | 12,358 | 6,794 | 6,840 |
| | avg. length | 30.5 | 31.0 | 23.6 |
| | OOV rate | 15.6% | 17.8% | 12.8% |
| Test | #sentences | 2,313 | 1,235 | 2,416 |
| | #tokens | 70,065 | 37,961 | 56,684 |
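
As a rough check on the table, the average sentence lengths and the ratio of word types to sentences can be recomputed from the raw training counts. The following is a minimal sketch that uses only the training-split numbers reported above; the names `TRAIN` and `derived_stats` are illustrative assumptions and not part of the original work.

```python
# Training-split counts copied from the table (ATB, FTB, WSJ).
TRAIN = {
    "ATB": {"sentences": 18_818, "tokens": 597_933, "word_types": 37_188},
    "FTB": {"sentences": 13_448, "tokens": 397_917, "word_types": 26_536},
    "WSJ": {"sentences": 39_832, "tokens": 950_028, "word_types": 44_389},
}


def derived_stats(counts):
    """Return (avg. sentence length, word types per sentence) for one corpus."""
    avg_length = counts["tokens"] / counts["sentences"]
    types_per_sentence = counts["word_types"] / counts["sentences"]
    # Round to the precision used in the table (one decimal for lengths).
    return round(avg_length, 1), round(types_per_sentence, 2)


for name, counts in TRAIN.items():
    avg_length, types_per_sentence = derived_stats(counts)
    print(f"{name}: avg. length {avg_length}, "
          f"word types per sentence {types_per_sentence}")
```

The recomputed average lengths should match the "avg. length" row for the training split, which suggests that row is simply tokens divided by sentences.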