Table 10: Monolingual corpora, their sources, size, and number of sentences.

| Language | Source | Size (MB) | No. sentences |
|---|---|---|---|
| amh | CC-100 (Conneau et al., 2020) | 889.7 | 3,124,760 |
| hau | CC-100 | 318.4 | 3,182,277 |
| ibo | JW300 (Agić and Vulić, 2019), CC-100, CC-Aligned (El-Kishky et al., 2020), and IgboNLP (Ezeani et al., 2020) | 118.3 | 1,068,263 |
| kin | JW300, KIRNEWS (Niyongabo et al., 2020), and BBC Gahuza | 123.4 | 726,801 |
| lug | JW300, CC-100, and BUKEDDE News | 54.0 | 506,523 |
| luo | JW300 | 12.8 | 160,904 |
| pcm | JW300 and BBC Pidgin | 56.9 | 207,532 |
| swa | CC-100 | 1,800 | 12,664,787 |
| wol | OPUS (Tiedemann, 2012) (excl. CC-Aligned), Wolof Bible (MBS, 2020), and news corpora (Lu Defu Waxu, Saabal, and Wolof Online) | 3.8 | 42,621 |
| yor | JW300, Yoruba Embedding Corpus (Alabi et al., 2020), MENYO-20k (Adelani et al., 2021), CC-100, CC-Aligned, and news corpora (BBC Yoruba, Asejere, and Alaroye) | 117.6 | 910,628 |