Table 1: Typological features of the 11 languages in TyDi QA. We use + to indicate that a phenomenon occurs, ++ to indicate that it occurs frequently, and +++ to indicate that it occurs very frequently.
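| Language | Latin script (a) | White space tokens | Sentence boundaries | Word formation (b) | Gender (c) | Pro-drop |
|---|---|---|---|---|---|---|
| English | + | + | + | + | + (d) | — |
| Arabic | — | + | + | ++ | + | + |
| Bengali | — | + | + | + | — | + |
| Finnish | + | + | + | +++ | — | — |
| Indonesian | + | + | + | + | — | + |
| Japanese | — | — | + | + | — | + |
| Kiswahili | + | + | + | +++ | (e) | + |
| Korean | — | + (f) | + | +++ | — | + |
| Russian | — | + | + | ++ | + | + |
| Telugu | — | + | + | +++ | + | + |
| Thai | — | — | — | + | + | + |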
a. ‘—’ indicates that Latin script is not the conventional writing system. Intermixing of Latin script should still be expected.

b. We include inflectional and derivational phenomena in our notion of word formation.

c. We limit the gender feature to sex-based gender systems associated with coreferential gendered personal pronouns.

d. English has grammatical gender only in third-person personal and possessive pronouns.

e. Kiswahili has morphological noun classes (Corbett, 1991), but here we note sex-based gender systems.

f. In Korean, tokens are often separated by white space, but prescriptive spacing conventions are commonly flouted.
