Table 1: Statistics of Wikipedia data sets; en and zh are shorthand for English and Chinese, respectively. Synthetic documents and sentences are used in our automatic evaluation experiments discussed in Section 7.
                           Wiki-en      Wiki-zh
All Documents               31,562       26,280
Training Documents          25,562       22,280
Development Documents        3,000        2,000
Test Documents               3,000        2,000
Multilabel Ratio            10.18%       29.73%
Average #Words            1,152.08       615.85
Vocabulary Size            175,555      169,179

Synthetic Documents            200          200
Synthetic Sentences         18,922       18,312