The 101 languages in Flores-101. For each language, we list the ISO 639-3 code, the language family, and the script. Next to each language family, we give a more fine-grained subgrouping. We also report the amount of data available in OPUS (bitext with English) and cc100 (monolingual data) at the time this report was written. The parallel datasets were used to train the baseline described in §5; the monolingual datasets were used only to train the SentencePiece model (see §5).
ISO 639-3 | Language | Family | Subgrouping | Script | Bitext w/ En | Mono Data |
---|---|---|---|---|---|---|
afr | Afrikaans | Indo-European | Germanic | Latin | 570K | 26.1M |
amh | Amharic | Afro-Asiatic | Afro-Asiatic | Ge’ez | 339K | 3.02M |
ara | Arabic | Afro-Asiatic | Afro-Asiatic | Arabic | 25.2M | 126M |
hye | Armenian | Indo-European | Other IE | Armenian | 977K | 25.4M |
asm | Assamese | Indo-European | Indo-Aryan | Bengali | 43.7K | 738K |
ast | Asturian | Indo-European | Romance | Latin | 124K | – |
azj | Azerbaijani | Turkic | Turkic | Latin | 867K | 41.4M |
bel | Belarusian | Indo-European | Balto-Slavic | Cyrillic | 42.4K | 24M |
ben | Bengali | Indo-European | Indo-Aryan | Bengali | 2.16M | 57.9M |
bos | Bosnian | Indo-European | Balto-Slavic | Latin | 187K | 15.9M |
bul | Bulgarian | Indo-European | Balto-Slavic | Cyrillic | 10.3M | 235M |
mya | Burmese | Sino-Tibetan | Sino-Tibetan+Kra-Dai | Myanmar | 283K | 2.66M |
cat | Catalan | Indo-European | Romance | Latin | 5.77M | 77.7M |
ceb | Cebuano | Austronesian | Austronesian | Latin | 484K | 4.11M |
zho | Chinese (Simpl) | Sino-Tibetan | Sino-Tibetan+Kra-Dai | Han | 37.9M | 209M |
zho | Chinese (Trad) | Sino-Tibetan | Sino-Tibetan+Kra-Dai | Han | 37.9M | 85.2M |
hrv | Croatian | Indo-European | Balto-Slavic | Latin | 42.2K | 144M |
ces | Czech | Indo-European | Balto-Slavic | Latin | 23.2M | 124M |
dan | Danish | Indo-European | Germanic | Latin | 10.6M | 344M |
nld | Dutch | Indo-European | Germanic | Latin | 82.4M | 230M |
est | Estonian | Uralic | Uralic | Latin | 4.82M | 46M |
tgl | Filipino (Tagalog) | Austronesian | Austronesian | Latin | 70.6K | 107M |
fin | Finnish | Uralic | Uralic | Latin | 15.2M | 377M |
fra | French | Indo-European | Romance | Latin | 289M | 428M |
ful | Fula | Atlantic-Congo | Nilotic+Other AC | Latin | 71K | 531K |
glg | Galician | Indo-European | Romance | Latin | 1.13M | 4.22M |
lug | Ganda | Atlantic-Congo | Bantu | Latin | 14.4K | 537K |
kat | Georgian | Kartvelian | Other | Georgian | 1.23M | 31.7M |
deu | German | Indo-European | Germanic | Latin | 216M | 417M |
ell | Greek | Indo-European | Other IE | Greek | 23.7M | 201M |
guj | Gujarati | Indo-European | Indo-Aryan | Gujarati | 160K | 9.41M |
hau | Hausa | Afro-Asiatic | Afro-Asiatic | Latin | 335K | 5.87M |
heb | Hebrew | Afro-Asiatic | Afro-Asiatic | Hebrew | 6.64M | 208M |
hin | Hindi | Indo-European | Indo-Aryan | Devanagari | 3.3M | 104M |
hun | Hungarian | Uralic | Uralic | Latin | 16.3M | 385M |
isl | Icelandic | Indo-European | Germanic | Latin | 1.17M | 37.5M |
ibo | Igbo | Atlantic-Congo | Nilotic+Other AC | Latin | 145K | 693K |
ind | Indonesian | Austronesian | Austronesian | Latin | 39.1M | 1.05B |
gle | Irish | Indo-European | Other IE | Latin | 329K | 1.54M |
ita | Italian | Indo-European | Romance | Latin | 116M | 179M |
jpn | Japanese | Japonic | Other | Han, Hiragana, Katakana | 23.2M | 458M |
jav | Javanese | Austronesian | Austronesian | Latin | 1.49M | 24.4M |
kea | Kabuverdianu | Indo-European | Romance | Latin | 5.46K | 178K |
kam | Kamba | Atlantic-Congo | Bantu | Latin | 50K | 181K |
kan | Kannada | Dravidian | Dravidian | Telugu-Kannada | 155K | 13.1M |
kaz | Kazakh | Turkic | Turkic | Cyrillic | 701K | 35.6M |
khm | Khmer | Austro-Asiatic | Austro-Asiatic | Khmer | 398K | 8.87M |
kor | Korean | Koreanic | Other | Hangul | 7.46M | 390M |
kir | Kyrgyz | Turkic | Turkic | Cyrillic | 566K | 2.02M |
lao | Lao | Kra-Dai | Sino-Tibetan+Kra-Dai | Lao | 153K | 2.47M |
lav | Latvian | Indo-European | Balto-Slavic | Latin | 4.8M | 68.4M |
lin | Lingala | Atlantic-Congo | Bantu | Latin | 21.1K | 336K |
lit | Lithuanian | Indo-European | Balto-Slavic | Latin | 6.69M | 111M |
luo | Luo | Nilo-Saharan | Nilotic+Other AC | Latin | 142K | 239K |
ltz | Luxembourgish | Indo-European | Germanic | Latin | 3.41M | – |
mkd | Macedonian | Indo-European | Balto-Slavic | Cyrillic | 1.13M | 28.8M |
msa | Malay | Austronesian | Austronesian | Latin | 968K | 77.5M |
mal | Malayalam | Dravidian | Dravidian | Malayalam | 497K | 24.8M |
mlt | Maltese | Afro-Asiatic | Afro-Asiatic | Latin | 5.82M | – |
mri | Māori | Austronesian | Austronesian | Latin | 196K | – |
mar | Marathi | Indo-European | Indo-Aryan | Devanagari | 109K | 14.4M |
mon | Mongolian | Mongolic | Other | Cyrillic | 555K | 20.4M |
npi | Nepali | Indo-European | Indo-Aryan | Devanagari | 19.6K | 17.9M |
nso | Northern Sotho | Atlantic-Congo | Bantu | Latin | 13.8K | 612K |
nob | Norwegian | Indo-European | Germanic | Latin | 10.9M | 338M |
nya | Nyanja | Atlantic-Congo | Bantu | Latin | 932K | – |
oci | Occitan | Indo-European | Romance | Latin | 5.11K | – |
ory | Oriya | Indo-European | Indo-Aryan | Oriya | 5K | 2.47M |
orm | Oromo | Afro-Asiatic | Afro-Asiatic | Latin | 162K | 752K |
pus | Pashto | Indo-European | Indo-Aryan | Perso-Arabic | 293K | 12M |
fas | Persian | Indo-European | Indo-Aryan | Perso-Arabic | 6.63M | 611M |
pol | Polish | Indo-European | Balto-Slavic | Latin | 40.9M | 256M |
por | Portuguese (Brazil) | Indo-European | Romance | Latin | 137M | 340M |
pan | Punjabi | Indo-European | Indo-Aryan | Gurmukhi | 142K | 5.02M |
ron | Romanian | Indo-European | Romance | Latin | 31.9M | 391M |
rus | Russian | Indo-European | Balto-Slavic | Cyrillic | 127M | 849M |
srp | Serbian | Indo-European | Balto-Slavic | Cyrillic | 7.01M | 35.7M |
sna | Shona | Atlantic-Congo | Bantu | Latin | 877K | – |
snd | Sindhi | Indo-European | Indo-Aryan | Perso-Arabic | 21.8K | 314K |
slk | Slovak | Indo-European | Balto-Slavic | Latin | 10.5M | 174M |
slv | Slovenian | Indo-European | Balto-Slavic | Latin | 5.42M | 74.7M |
som | Somali | Afro-Asiatic | Afro-Asiatic | Latin | 358K | 14.1M |
ckb | Sorani Kurdish | Indo-European | Indo-Aryan | Arabic | 305K | 7.98M |
spa | Spanish (Latin America) | Indo-European | Romance | Latin | 315M | 379M |
swh | Swahili | Atlantic-Congo | Bantu | Latin | 349K | 35.8M |
swe | Swedish | Indo-European | Germanic | Latin | 54.8M | 580M |
tgk | Tajik | Indo-European | Indo-Aryan | Cyrillic | 544K | – |
tam | Tamil | Dravidian | Dravidian | Tamil | 992K | 68.2M |
tel | Telugu | Dravidian | Dravidian | Telugu-Kannada | 381K | 17.2M |
tha | Thai | Kra-Dai | Sino-Tibetan+Kra-Dai | Thai | 10.6M | 319M |
tur | Turkish | Turkic | Turkic | Latin | 41.2M | 128M |
ukr | Ukrainian | Indo-European | Balto-Slavic | Cyrillic | 5.44M | 357M |
umb | Umbundu | Atlantic-Congo | Bantu | Latin | 217K | 142K |
urd | Urdu | Indo-European | Indo-Aryan | Perso-Arabic | 630K | 28M |
uzb | Uzbek | Turkic | Turkic | Latin | – | 7.54M |
vie | Vietnamese | Austro-Asiatic | Austro-Asiatic | Latin | 32.1M | 992M |
cym | Welsh | Indo-European | Other IE | Latin | 826K | 12.7M |
wol | Wolof | Atlantic-Congo | Nilotic+Other AC | Latin | 86.9K | 676K |
xho | Xhosa | Atlantic-Congo | Bantu | Latin | 130K | 995K |
yor | Yoruba | Atlantic-Congo | Nilotic+Other AC | Latin | 171K | 1.59M |
zul | Zulu | Atlantic-Congo | Bantu | Latin | 123K | 994K |
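
The resource columns use abbreviated counts (K, M, B) and a dash where no figure is listed, which makes them awkward to sort or compare directly. The following is a minimal sketch of how these cells could be converted to integers; the helper name `parse_count` and the small `rows` sample are illustrative assumptions, not part of the Flores-101 tooling.

```python
from typing import Optional

# Multipliers for the suffixes used in the table (assumed: K = thousand,
# M = million, B = billion).
_SUFFIX = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(cell: str) -> Optional[int]:
    """Hypothetical helper: turn a cell such as '570K' into an integer.

    Returns None for a dash or an empty cell (no figure listed).
    """
    cell = cell.strip()
    if cell in {"–", "-", ""}:
        return None
    suffix = cell[-1].upper()
    if suffix in _SUFFIX:
        # round() guards against floating-point error, e.g. 26.1 * 1e6.
        return round(float(cell[:-1]) * _SUFFIX[suffix])
    return int(cell)


# A few rows copied from the table: code -> (bitext w/ En, mono data).
rows = {
    "afr": ("570K", "26.1M"),
    "ind": ("39.1M", "1.05B"),
    "uzb": ("–", "7.54M"),
}

# Languages in the sample that have a listed bitext figure, smallest first.
with_bitext = sorted(
    (code for code, (bitext, _) in rows.items() if parse_count(bitext) is not None),
    key=lambda code: parse_count(rows[code][0]),
)
```

Returning `None` for a dash, rather than zero, keeps "no figure listed" distinct from "zero", since the table itself does not say which is meant.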