Monolingual corpora, their sources, size, and number of sentences.
Language | Source | Size (MB) | No. sentences |
---|---|---|---|
amh | CC-100 (Conneau et al., 2020) | 889.7MB | 3,124,760 |
hau | CC-100 | 318.4MB | 3,182,277 |
ibo | JW300 (Agić and Vulić, 2019), CC-100, CC-Aligned (El-Kishky et al., 2020), and IgboNLP (Ezeani et al., 2020) | 118.3MB | 1,068,263 |
kin | JW300, KIRNEWS (Niyongabo et al., 2020), and BBC Gahuza | 123.4MB | 726,801 |
lug | JW300, CC-100, and BUKEDDE News | 54.0MB | 506,523 |
luo | JW300 | 12.8MB | 160,904 |
pcm | JW300 and BBC Pidgin | 56.9MB | 207,532 |
swa | CC-100 | 1,800MB | 12,664,787 |
wol | OPUS (Tiedemann, 2012) (excl. CC-Aligned), Wolof Bible (MBS, 2020), and news corpora (Lu Defu Waxu, Saabal, and Wolof Online) | 3.8MB | 42,621 |
yor | JW300, Yoruba Embedding Corpus (Alabi et al., 2020), MENYO-20k (Adelani et al., 2021), CC-100, CC-Aligned, and news corpora (BBC Yoruba, Asejere, and Alaroye) | 117.6MB | 910,628 |