Table 2: Statistics of our datasets. “Treebank” is the UD treebank identifier, “#Token” is the number of tokens, “%Punct” is the percentage of punctuation tokens, “#Omit” is the small number of sentences containing non-leaf punctuation tokens (see footnote 19), and “#Type” is the number of punctuation types after preprocessing. (Recall from §4 that preprocessing distinguishes between left and right quotation mark types, and between abbreviation dot and period dot types.)
Language  Treebank   #Token  %Punct  #Omit  #Type
Arabic    ar         282K    7.9     255    18
Chinese   zh         123K    13.8           23
English   en         255K    11.7    40     35
          en_esl     97.7K   9.8     2      16
Hindi     hi         352K    6.7     21     15
Spanish   es_ancora  560K    11.7    25     16