MasakhaNER: Named Entity Recognition for African Languages

We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.


Introduction
Africa has over 2,000 spoken languages (Eberhard et al., 2020); however, these languages are scarcely represented in existing natural language processing (NLP) datasets, research, and tools (Martinus and Abbott, 2019). ∀ et al. (2020) investigate the reasons for these disparities by examining how NLP for low-resource languages is constrained by several societal factors. One of these factors is the geographical and language diversity of NLP researchers. For example, of the 2695 affiliations of authors whose works were published at the five major NLP conferences in 2019, only five were from African institutions (Caines, 2019). Conversely, many NLP tasks such as machine translation, text classification, part-of-speech tagging, and named entity recognition would benefit from the knowledge of native speakers who are involved in the development of datasets and models.
In this work, we focus on named entity recognition (NER), one of the most impactful tasks in NLP (Tjong Kim Sang and De Meulder, 2003; Lample et al., 2016). NER is an important information extraction task and an essential component of numerous products including spell checkers, localization of voice and dialogue systems, and conversational agents. It also enables identifying African names, places, and organizations for information retrieval. African languages are under-represented in this crucial task due to a lack of datasets, reproducible results, and researchers who understand the challenges that such languages present for NER.
In this paper, we take an initial step towards improving representation for African languages for the NER task, making the following contributions: (i) We bring together language speakers, dataset curators, NLP practitioners, and evaluation experts to address the challenges facing NER for African languages. Based on the availability of online news corpora and language annotators, we develop NER datasets, models, and evaluation covering ten widely spoken African languages.
(ii) We curate NER datasets from local sources to ensure relevance of future research for native speakers of the respective languages.
(iii) We train and evaluate multiple NER models for all ten languages. Our experiments provide insights into the transfer across languages, and highlight open challenges.
(iv) We release the datasets, code, and models to facilitate future research on the specific challenges raised by NER for African languages.

Related Work
African NER datasets NER is a well-studied sequence labeling task (Yadav and Bethard, 2018) and has been the subject of many shared tasks in different languages (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003; Sangal et al., 2008; Shaalan, 2014; Benikova et al., 2014). However, most of the available datasets are in high-resource languages. Although there have been efforts to create NER datasets for lower-resourced languages, such as the WikiAnn corpus (Pan et al., 2017) covering 282 languages, such datasets consist of "silver-standard" labels created by transferring annotations from English to other languages through cross-lingual links in knowledge bases. Because the WikiAnn corpus data comes from Wikipedia, it includes some African languages, though most have fewer than 10k tokens. Other NER datasets for African languages include SADiLaR (Eiselen, 2016) for ten South African languages based on government data, and small corpora of fewer than 2K sentences for Yorùbá (Alabi et al., 2020) and Hausa (Hedderich et al., 2020). Additionally, the LORELEI language packs (Strassel and Tracey, 2016) include some African languages (Yorùbá, Hausa, Amharic, Somali, Twi, Swahili, Wolof, Kinyarwanda, and Zulu), but are not publicly available.
NER models Popular sequence labeling models for NER include the CRF (Lafferty et al., 2001), CNN-BiLSTM (Chiu and Nichols, 2016), BiLSTM-CRF (Huang et al., 2015), and CNN-BiLSTM-CRF (Ma and Hovy, 2016). The traditional CRF makes use of hand-crafted features like part-of-speech tags, context words, and word capitalization. Neural NER models, on the other hand, are initialized with word embeddings like Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). More recently, pretrained language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and LUKE (Yamada et al., 2020) have been applied to produce state-of-the-art results for the NER task. Multilingual variants of these models like mBERT and XLM-RoBERTa (Conneau et al., 2020) make it possible to train NER models for several languages using transfer learning. Language-specific parameters and adaptation to unlabeled data of the target language have yielded further gains (Pfeiffer et al., 2020a,b).


Focus Languages
Table 1 provides an overview of the languages considered in this work, their language family, number of speakers, and the regions in Africa where they are spoken. We chose to focus on these languages due to the availability of online news corpora and annotators, and most importantly because they are widely spoken native African languages. Both region and language family might indicate a notion of proximity for NER, either because of linguistic features shared within that family, or because data sources cover a common set of locally relevant entities. We highlight language specifics for each language to illustrate the diversity of this selection of languages in Section 3.1, and then showcase the differences in named entities across these languages in Section 3.2.
Hausa (hau) has 23 to 25 consonants, depending on the dialect, and five short and five long vowels. Hausa has labialized phonemic consonants, as in /gw/, e.g. 'agwagwa'. As found in some African languages, implosive consonants also exist in Hausa, e.g. 'b, 'd, etc., as in 'barna'. Similarly, the Hausa approximant 'r' is realized in two distinct manners: roll and trill, as in 'rai' and 'ra'ayi', respectively.
Igbo (ibo) is an agglutinative language, with many frequent suffixes and prefixes (Emenanjo, 1978). A single stem can yield many word forms by addition of affixes that extend its original meaning (Onyenwe and Hepple, 2016). Igbo is also tonal, with two distinctive tones (high and low) and a downstepped high tone in some cases. The alphabet consists of 28 consonants and 8 vowels (A, E, I, Ị, O, Ọ, U, Ụ). In addition to the Latin letters (except c), Igbo contains the following digraphs: ch, gb, gh, gw, kp, kw, nw, ny, sh.
Kinyarwanda (kin) makes use of 24 Latin characters, with 5 vowels similar to English and 19 consonants excluding q and x. Moreover, Kinyarwanda has 74 additional complex consonants (such as mb, mpw, and njyw) (Government, 2014). It is a tonal language with three tones: low (no diacritic), high (signaled by "/"), and falling (signaled by "^"). The default word order is Subject-Verb-Object.
Luganda (lug) is a tonal language with subject-verb-object word order. The Luganda alphabet is composed of 24 letters that include 17 consonants (p, v, f, m, d, t, l, r, n, z, s, j, c, g), 5 vowel sounds represented by the five alphabetical symbols (a, e, i, o, u), and 2 semi-vowels (w, y). It also has a special consonant ng'.
Luo (luo) is a tonal language with 4 tones (high, low, falling, rising), although the tonality is not marked in the orthography. It has 26 Latin consonants without the Latin letters (c, q, v, x and z) and additional consonants (ch, dh, mb, nd, ng', ng, ny, nj, th, sh).

Wolof (wol) also includes the characters Ŋ ("ng", lowercase: ŋ) and Ñ ("gn" as in Spanish). Accents are present, but limited in number (À, É, Ë, Ó). However, unlike many other Niger-Congo languages, Wolof is not a tonal language.
Yorùbá (yor) has 25 Latin letters, without the Latin characters (c, q, v, x and z) and with the additional letters (ẹ, gb, ṣ, ọ). Yorùbá is a tonal language with three tones: low ("\"), middle ("−", optional) and high ("/"). The tonal marks and underdots are referred to as diacritics, and they are needed for the correct pronunciation of a word. Yorùbá is a highly isolating language and the sentence structure follows Subject-Verb-Object.

Named Entities
Most of the work on NER is centered around English, and it is unclear how well existing models can generalize to other languages in terms of sentence structure or surface forms. In Hu et al. (2020)'s evaluation of cross-lingual generalization for NER, only two African languages were considered, and transformer-based models were found to particularly struggle to generalize to named entities in Swahili. To highlight the differences across our focus languages, Table 2 shows an English example sentence (the original sentence is from BBC Pidgin: https://www.bbc.com/pidgin/tori-51702073), with color-coded PER, LOC, and DATE entities, and the corresponding translations. The following characteristics of the languages in our dataset could pose challenges for NER systems developed for English:
• Amharic shares no lexical overlap with the English source sentence.
• While "Zhang" is identical across all Latin script languages, "Kano" features accents in Wolof and Yorùbá due to its localization.
• The Fidel script has no capitalization, which could hinder transfer from other languages.
• Igbo, Wolof, and Yorùbá all use diacritics, which are not present in the English alphabet.
• The surface form of named entities (NE) is the same in English and Nigerian-Pidgin, but there exist lexical differences (e.g. in terms of how time is realized).
• Between the 10 African languages, "Nigeria" is spelled in 6 different ways.
• Numerical "18": Igbo, Wolof and Yorùbá write out their numbers, resulting in different numbers of tokens for the entity span.

Data and Annotation Methodology
Our data was obtained from local news sources, in order to ensure relevance of the dataset for native speakers from those regions. The dataset was annotated using the ELISA tool (Lin et al., 2018) by native speakers who come from the same regions as the news sources and volunteered through the Masakhane community. Annotators were not

A key objective of our annotation procedure was to create high-quality datasets by ensuring high annotator agreement. To achieve high agreement scores, we ran collaborative workshops for each language, which allowed annotators to discuss any disagreements. ELISA provides an entity-level F1 score and also an interface for annotators to correct their mistakes, making it easy to achieve inter-annotator agreement scores between 0.96 and 1.0 for all languages.
We report inter-annotator agreement scores in Table 4 using Fleiss' Kappa (Fleiss, 1971) at both the token and entity level. The latter considers each span an annotator proposed as an entity. As a result of our workshops, all our datasets have exceptionally high inter-annotator agreement. For Kinyarwanda, Luo, Swahili, and Wolof, we report perfect inter-annotator agreement scores (κ = 1). For each of these languages, two annotators annotated each token and were instructed to discuss and resolve conflicts among themselves. The Appendix provides a detailed entity-level confusion matrix in Table 11.
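To make the agreement computation concrete, the following is a minimal sketch of token-level Fleiss' Kappa scoring using the statsmodels implementation; the annotation file names and the one-tag-per-line format are illustrative assumptions, not the released tooling.

```python
# Minimal sketch: token-level Fleiss' Kappa between two annotators.
# File names and the one-BIO-tag-per-line format are illustrative assumptions.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


def read_tags(path):
    """Read one BIO tag per non-empty line (last whitespace-separated field)."""
    with open(path, encoding="utf-8") as f:
        return [line.split()[-1] for line in f if line.strip()]


# Each row holds the tags that the raters assigned to the same token.
ratings = list(zip(read_tags("annotator1.txt"), read_tags("annotator2.txt")))
table, _categories = aggregate_raters(ratings)  # tokens x tag-categories counts
print(f"Token-level Fleiss' Kappa: {fleiss_kappa(table):.3f}")
```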

NER baseline models
To evaluate baseline performance on our dataset, we experiment with three popular NER models: CNN-BiLSTM-CRF, multilingual BERT (mBERT), and XLM-RoBERTa (XLM-R). The latter two models are implemented using the HuggingFace transformers toolkit (Wolf et al., 2019). For each language, we train the models on the in-language training data and evaluate on its test data.
CNN-BiLSTM-CRF This architecture was proposed for NER by Ma and Hovy (2016). For each input sequence, we first compute the vector representation for each word by concatenating character-level encodings from a CNN and vector embeddings for each word. Following Rijhwani et al. (2020), we use randomly initialized word embeddings since we do not have high-quality pretrained embeddings for all the languages in our dataset. Our model is implemented using the DyNet toolkit (Neubig et al., 2017).
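As a rough illustration of this architecture (the paper's implementation uses DyNet), the sketch below shows the char-CNN and word-embedding concatenation feeding a BiLSTM with a CRF output layer in PyTorch; the CRF layer comes from the third-party pytorch-crf package, and all dimensions are illustrative defaults rather than the reported configuration.

```python
# Illustrative PyTorch sketch of a CNN-BiLSTM-CRF tagger (the paper uses DyNet).
# Dimensions are placeholder defaults; the CRF layer is from `pip install pytorch-crf`.
import torch
import torch.nn as nn
from torchcrf import CRF


class CharCNNBiLSTMCRF(nn.Module):
    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=100, char_dim=30, char_filters=30, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)   # randomly initialised
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(word_dim + char_filters, hidden,
                              batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def forward(self, words, chars, tags=None):
        # words: (batch, seq_len); chars: (batch, seq_len, max_word_len)
        b, s, c = chars.shape
        ce = self.char_emb(chars).view(b * s, c, -1).transpose(1, 2)
        char_feat = self.char_cnn(ce).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_emb(words), char_feat], dim=-1)
        emissions = self.proj(self.bilstm(x)[0])
        if tags is not None:
            return -self.crf(emissions, tags)   # negative log-likelihood loss
        return self.crf.decode(emissions)       # best tag sequence per sentence
```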
mBERT We fine-tune multilingual BERT (Devlin et al., 2019) on our NER corpus by adding a linear classification layer to the pretrained transformer model, and train it end-to-end. mBERT was trained on 104 languages including only two African languages: Swahili and Yorùbá. We use the mBERT-base cased model with 12 Transformer layers, a hidden size of 768, and 110M parameters.
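A minimal sketch of this fine-tuning setup with the HuggingFace transformers API is shown below; the label set follows the paper's entity types, while the toy batch and placeholder labels only stand in for the real NER corpus and training loop.

```python
# Minimal sketch: mBERT with a token-classification head, trained end-to-end.
# The toy batch and all-"O" placeholder labels stand in for the real NER corpus.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-DATE", "I-DATE"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS)
)  # adds a randomly initialised linear classification layer on top of mBERT

enc = tokenizer(["Zhang rode to Kano"], return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])  # placeholder gold label ids ("O")
loss = model(**enc, labels=labels).loss      # cross-entropy over tags
loss.backward()                              # one end-to-end training step
```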
XLM-R XLM-R (Conneau et al., 2020) was trained on 100 languages including Amharic, Hausa, and Swahili. The major differences between XLM-R and mBERT are (1) XLM-R was trained on Common Crawl while mBERT was trained on Wikipedia; (2) XLM-R is based on RoBERTa, which is trained with a masked language model (MLM) objective, while mBERT was additionally trained with a next sentence prediction objective. We make use of the XLM-R base and large models as baselines. The XLM-R-base model consists of 12 layers, with a hidden size of 768 and 270M parameters. The XLM-R-large model has 24 layers, with a hidden size of 1024 and 550M parameters.
MeanE-BiLSTM This is a simple BiLSTM model with an additional linear classifier. For each input sequence, we first extract a sentence embedding from the mBERT or XLM-R language model (LM) before passing it into the BiLSTM model. Following Reimers and Gurevych (2019), we make use of the mean of the 12-layer output embeddings of the LM (i.e., MeanE). This has been shown to provide better sentence representations than the embedding of the [CLS] token used for fine-tuning mBERT and XLM-R.
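The sketch below illustrates the MeanE feature extraction: the frozen LM's twelve transformer-layer outputs are averaged per token before being passed to the BiLSTM tagger (not shown); the model name and pooling follow the description above, and the example sentence is arbitrary.

```python
# Minimal sketch of MeanE: average the 12 transformer layer outputs of a frozen LM.
# The resulting per-token vectors would be fed to the BiLSTM classifier (not shown).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base", output_hidden_states=True)

with torch.no_grad():
    enc = tokenizer("Zhang rode to Kano", return_tensors="pt")
    hidden_states = encoder(**enc).hidden_states  # embeddings + 12 layer outputs
    # Skip the embedding layer and average the 12 transformer layers (MeanE).
    mean_embeddings = torch.stack(hidden_states[1:]).mean(dim=0)

print(mean_embeddings.shape)  # (1, num_subword_tokens, 768)
```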
Language BERT The mBERT and XLM-R models only support two and three of the languages under study, respectively. One effective approach to adapt pretrained transformer models to new domains is "domain-adaptive fine-tuning" (Howard and Ruder, 2018; Gururangan et al., 2020), i.e., fine-tuning on unlabeled data in the new domain, which also works very well when adapting to a new language (Pfeiffer et al., 2020a; Alabi et al., 2020). For each of the African languages, we performed language-adaptive fine-tuning on available unlabeled corpora, mostly from JW300 (Agić and Vulić, 2019), indigenous news sources, and the XLM-R Common Crawl corpora (Conneau et al., 2020). The Appendix provides the details of the unlabeled corpora in Table 10. This approach is especially useful for languages whose scripts are not supported by the multilingual transformer models, such as Amharic, where we replace the vocabulary of mBERT with an Amharic vocabulary before we perform language-adaptive fine-tuning, similar to Alabi et al. (2020).
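A minimal sketch of language-adaptive fine-tuning with the masked language modeling objective is shown below, assuming a plain-text file of unlabeled target-language sentences; the file name and training arguments are illustrative, and the adapted checkpoint would subsequently be fine-tuned on the NER data.

```python
# Minimal sketch of language-adaptive fine-tuning with the MLM objective.
# "wolof_unlabeled.txt" is a placeholder for the unlabeled target-language corpus.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="wolof_unlabeled.txt", block_size=256)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-wolof-adapted",
                           num_train_epochs=3, learning_rate=5e-5),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # the adapted checkpoint is then fine-tuned on the NER corpus
```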

Improving the Baseline Models
In this section, we consider techniques to improve the baseline models, such as utilizing gazetteers, transfer learning from other domains and languages, and aggregating NER datasets by region.
For these experiments, we focus on the PER, ORG, and LOC categories, because the gazetteers from Wikipedia do not contain DATE entities and some source domains and languages that we transfer from do not have DATE annotations. We apply these modifications to the XLM-R model because it generally outperforms mBERT in our experiments (see Section 6).

Gazetteers for NER
Gazetteers are lists of named entities collected from manually crafted resources such as GeoNames or Wikipedia. Before the widespread adoption of neural networks, NER methods used gazetteer-based features to improve performance (Ratinov and Roth, 2009). These features are created for each n-gram in the dataset and are typically binary-valued, indicating whether that n-gram is present in the gazetteer.
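As an illustration, the sketch below derives a simple binary gazetteer feature per token by checking whether any surrounding n-gram appears in an entity list; the per-token encoding and the toy gazetteer are illustrative assumptions rather than the exact feature scheme used in prior work.

```python
# Minimal sketch: binary gazetteer features from n-gram lookups.
# The per-token encoding and toy gazetteer entries are illustrative assumptions.
def gazetteer_features(tokens, gazetteer, max_ngram=3):
    """Flag each token that lies inside an n-gram found in the gazetteer."""
    flags = [0] * len(tokens)
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            if " ".join(tokens[i:i + n]).lower() in gazetteer:
                for j in range(i, i + n):
                    flags[j] = 1
    return flags


gazetteer = {"kano", "lagos state", "bbc news"}
print(gazetteer_features("Zhang rode to Kano".split(), gazetteer))  # [0, 0, 0, 1]
```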
Recently, Rijhwani et al. (2020) showed that augmenting the neural CNN-BiLSTM-CRF model with gazetteer features can improve NER performance for low-resource languages. We conduct similar experiments on the languages in our dataset, using entity lists from Wikipedia as gazetteers. For Luo and Nigerian-Pidgin, which do not have their own Wikipedia, we use entity lists from English Wikipedia.

Transfer Learning
Here, we focus on cross-domain transfer from Wikipedia to the news domain, and cross-lingual transfer from English and Swahili NER datasets to the other languages in our dataset.

Domain Adaptation from WikiAnn
We make use of the WikiAnn corpus (Pan et al., 2017), which is available for five of the languages in our dataset: Amharic, Igbo, Kinyarwanda, Swahili, and Yorùbá. For each language, the corpus contains 100 sentences in each of the training, development, and test splits, except for Swahili, which contains 1K sentences in each split. For each language, we train on the corresponding WikiAnn training set and either zero-shot transfer to our respective test set or additionally fine-tune on our training data.
Cross-lingual transfer For training the cross-lingual transfer models, we use the CoNLL-2003 NER dataset in English, with over 14K training sentences, and our annotated corpus. We use CoNLL-2003 because it is in the same news domain as our annotated corpus. (We also tried OntoNotes 5.0 by combining FAC & ORG as "ORG" and GPE & LOC as "LOC" and mapping the remaining types except "PER" to "O", but it gave lower performance in zero-shot transfer, 19.38 F1, while CoNLL-2003 gave 37.15 F1.) We also make use of the languages that are supported by the XLM-R model and are widely spoken in East and West Africa, namely Swahili and Hausa. The English corpus has been shown to transfer very well to low-resource languages (Hedderich et al., 2020; Lauscher et al., 2020). We first train on either the English CoNLL-2003 data or our training data in Swahili, Hausa, or Nigerian-Pidgin before testing on the target African languages.
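The sketch below illustrates the zero-shot evaluation step: a token-classification model fine-tuned on a source language is applied unchanged to a target-language test set and scored with seqeval; the checkpoint name, the first-subword labeling strategy, and the data variables are illustrative placeholders.

```python
# Minimal sketch of zero-shot cross-lingual evaluation with seqeval.
# "xlmr-ner-swa" is a hypothetical checkpoint fine-tuned on Swahili NER data.
import torch
from seqeval.metrics import f1_score
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("xlmr-ner-swa")
tokenizer = AutoTokenizer.from_pretrained("xlmr-ner-swa")
id2label = model.config.id2label


def predict_tags(tokens):
    """Predict one BIO tag per input word, using the first sub-word's logits."""
    enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]
    tags, seen = [], set()
    for pos, wid in enumerate(enc.word_ids()):
        if wid is not None and wid not in seen:
            seen.add(wid)
            tags.append(id2label[int(logits[pos].argmax())])
    return tags


# yor_test: list of (tokens, gold_tags) pairs for the target language (e.g. Yorùbá).
# predictions = [predict_tags(toks) for toks, _ in yor_test]
# print(f1_score([gold for _, gold in yor_test], predictions))
```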

Aggregating Languages by Regions
As previously illustrated in Table 2, several entities have the same form in different languages, while some entities may be more common in the region where the language is spoken. To study the performance of NER models across geographical areas, we combine languages based on the region of Africa where they are spoken (see Table 1): (1) East region with Kinyarwanda, Luganda, Luo, and Swahili; (2) West region with Hausa, Igbo, Nigerian-Pidgin, Wolof, and Yorùbá; (3) East and West regions: all languages except Amharic, because of its distinct writing system.


Results
Table 5 gives the F1 score obtained by the CNN-BiLSTM-CRF, mBERT, and XLM-R models on the test sets of the ten African languages when training on our in-language data. We additionally indicate in the table whether the language is supported by the pretrained language models. The percentage of out-of-vocabulary (OOV) entities (entities in the test set that are not present in the training set) is also reported alongside the results of the baseline models. In general, the datasets with greater numbers of OOV entities have lower performance with the CNN-BiLSTM-CRF model, while those with lower OOV rates (Hausa, Igbo, Swahili) have higher performance. We find that the CNN-BiLSTM-CRF model performs worse than fine-tuning the mBERT and XLM-R models end-to-end (FTune). We expect performance to be better (e.g., for Amharic and Nigerian-Pidgin, with over 18 F1 points of difference) when using pretrained word embeddings for the initialization of the BiLSTM model rather than random initialization (we leave this for future work, as discussed in Section 7).
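For concreteness, the sketch below computes the OOV entity rate referred to above as the share of test-set entity spans never seen in the training set; the CoNLL-style (token, BIO-tag) input format and the helper names are illustrative assumptions.

```python
# Minimal sketch: out-of-vocabulary (OOV) entity rate of a test set.
# Sentences are assumed to be lists of (token, BIO-tag) pairs; names are illustrative.
def entity_spans(sentences):
    """Yield the surface string of every entity span in BIO-tagged sentences."""
    for sent in sentences:
        current = []
        for token, tag in sent:
            if tag.startswith("B-"):
                if current:
                    yield " ".join(current)
                current = [token]
            elif tag.startswith("I-") and current:
                current.append(token)
            else:
                if current:
                    yield " ".join(current)
                current = []
        if current:
            yield " ".join(current)


def oov_entity_rate(train_sents, test_sents):
    """Percentage of test entities whose surface form never occurs in training."""
    train_entities = set(entity_spans(train_sents))
    test_entities = list(entity_spans(test_sents))
    oov = sum(e not in train_entities for e in test_entities)
    return 100.0 * oov / max(len(test_entities), 1)
```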

Baseline Models
Interestingly, the pretrained language models (PLMs) have reasonable performance even on languages they were not trained on, such as Igbo, Kinyarwanda, Luganda, Luo, and Wolof. However, languages supported by the PLM tend to have better performance overall. We observe that fine-tuned XLM-R-base models have significantly better performance on five languages; two of these languages (Amharic and Swahili) are supported by the pretrained XLM-R. Similarly, fine-tuning mBERT yields better performance for Yorùbá since the language is part of the PLM's training corpus. Although mBERT is trained on Swahili, XLM-R-base shows better performance. This observation is consistent with Hu et al. (2020) and could be because XLM-R is trained on more Swahili text (Common Crawl with 275M tokens) whereas mBERT is trained on a smaller corpus from Wikipedia (6M tokens).
Another observation is that mBERT tends to have better performance for the non-Bantu Niger-Congo languages, i.e., Igbo, Wolof, and Yorùbá, while XLM-R-base works better for the Afro-Asiatic languages (i.e., Amharic and Hausa), the Nilo-Saharan language (i.e., Luo), and Bantu languages like Kinyarwanda and Swahili. We also note that the writing script is one of the primary factors influencing the transfer of knowledge in PLMs with regard to the languages they were not trained on. For example, mBERT achieves an F1 score of 0.0 on Amharic because it has not encountered the script during pretraining. In general, we find the fine-tuned XLM-R-large (with 550M parameters) to be better than XLM-R-base (with 270M parameters) and mBERT (with 110M parameters) in almost all languages. However, mBERT models perform slightly better for Igbo, Luo, and Yorùbá despite having fewer parameters. We further analyze the transfer abilities of mBERT and XLM-R by extracting sentence embeddings from the LMs to train a BiLSTM model (MeanE-BiLSTM) instead of fine-tuning them end-to-end. Table 5 shows that languages that are not supported by mBERT or XLM-R generally perform worse than the CNN-BiLSTM-CRF model (despite it being randomly initialized), except for kin. Also, sentence embeddings extracted from mBERT often lead to better performance than those from XLM-R for languages they both do not support (like ibo, kin, lug, luo, and wol).
Lastly, we train NER models using language BERT models that have been adapted to each of the African languages via language-specific fine-tuning on unlabeled text. In all cases, fine-tuning language BERT and language XLM-R models achieves a 1-7% improvement in F1 score over fine-tuning mBERT-base and XLM-R-base, respectively. This approach is still effective for small pretraining corpora, provided they are of good quality. For example, the Wolof monolingual corpus, which contains fewer than 50K sentences (see Table 10 in the Appendix), still improves performance by over 4% F1. Further, we obtain over 60% improvement in performance for Amharic BERT because mBERT does not recognize the Amharic script.

On average, the model that uses gazetteer features performs better than the baseline. In general, languages with larger gazetteers, such as Swahili (16K entities in the gazetteer) and Nigerian-Pidgin (for which we use an English gazetteer with 2M entities), see more improvement in performance than those with fewer gazetteer entries, such as Amharic and Luganda (2K and 500 gazetteer entities, respectively). This indicates that having high-coverage gazetteers is important for the model to take advantage of the gazetteer features.

Table 7 shows the results for the different transfer learning approaches, which we discuss individually in the following sections. We make use of the XLM-R-base model for all the experiments in this subsection because the performance difference when using XLM-R-large is small (<2%), as shown in Table 5, and because it is faster to train.

Cross-domain Transfer
We evaluate cross-domain transfer from Wikipedia to the news domain for the five languages that are available in the WikiAnn (Pan et al., 2017) dataset.
In the zero-shot setting, the NER F1 score is low: less than 40 for all languages, with Kinyarwanda and Yorùbá below 10. This is likely due to the number of training sentences present in WikiAnn: there are only 100 sentences in the datasets of Amharic, Igbo, Kinyarwanda, and Yorùbá. Although the Swahili corpus has 1,000 sentences, the F1 score of 35 shows that transfer is not very effective. In general, cross-domain transfer is a challenging problem, and is even harder when the number of training examples from the source domain is small. Fine-tuning on the in-domain news NER data does not improve over the baseline (XLM-R-base).

Cross-Lingual Transfer
Zero-shot In the zero-shot setting, we evaluate NER models trained on the English engCoNLL03 dataset and on the Nigerian-Pidgin (pcm), Swahili (swa), and Hausa (hau) annotated corpora. We exclude the MISC entity in the engCoNLL03 corpus because it is absent from our target datasets. Table 7 shows the zero-shot transfer performance. We observe that the closer the source and target languages are geographically, the better the performance. The pcm model (trained on only 2K sentences) obtains similar transfer performance to the engCoNLL03 model (trained on 14K sentences). swa performs better than pcm and engCoNLL03, with an improvement of over 14 F1 on average. We found that, on average, transferring from Hausa provided the best F1, with improvements of over 16% and 1% compared to using the engCoNLL03 and swa data, respectively. Per-entity analysis in Table 8 shows that the largest improvements are obtained for ORG. The pcm data was more effective in transferring to LOC and ORG, while swa and hau performed better when transferring to PER. In general, zero-shot transfer is most effective when transferring from Hausa and Swahili.

Table 9: F1 score for two varieties of hard-to-identify entities: zero-frequency entities that do not appear in the training corpus, and longer entities of four or more words.
Fine-tuning We use the target-language corpus to fine-tune the NER models previously trained on engCoNLL03, pcm, and swa. On average, there is only a small improvement compared to the XLM-R-base model. In particular, we see significant improvements for Hausa, Igbo, Kinyarwanda, Nigerian-Pidgin, Wolof, and Yorùbá using either swa or hau as the source NER model.

Regional Influence on NER
We evaluate whether combining the training datasets of different languages by region affects the performance for individual languages. Table 7 shows that all languages spoken in West Africa (ibo, wol, pcm, yor) except hau have slightly better performance (0.1-2.6 F1) when we train on their combined training data. However, for the East African languages, the F1 score only improves (0.8-2.3 F1) for three languages (kin, lug, luo). Training the NER model on all nine languages leads to better performance on all languages except Swahili. On average over six languages (ibo, kin, lug, luo, wol, yor), the performance improves by 1.6 F1.

Error analysis
Finally, to better understand the types of entities that were successfully identified and those that were missed, we performed a fine-grained analysis of our baseline methods mBERT and XLM-R using the method of Fu et al. (2020), with results shown in Table 9. Specifically, we found that across all languages, entities that were not contained in the training data (zero-frequency entities) and entities consisting of more than three words (long entities) were particularly difficult: compared to the F1 score over all entities, the scores dropped by around 5 points when evaluated on zero-frequency entities, and by around 20 points when evaluated on long entities. Future work on low-resource NER or cross-lingual representation learning may further improve on these hard cases.
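A simplified sketch of this bucketed evaluation is given below: gold and predicted entity spans are restricted to a bucket (zero-frequency or long entities) before computing F1; the span representation and the bucket predicates are illustrative simplifications of the attribute-based analysis of Fu et al. (2020).

```python
# Simplified sketch of bucketed F1 over hard entity types.
# Spans are (sentence_id, start, end, type, surface) tuples; names are illustrative.
def bucket_f1(gold_spans, pred_spans, keep):
    """Micro F1 restricted to spans for which keep(span) is True."""
    gold_kept = {s for s in gold_spans if keep(s)}
    pred_kept = {s for s in pred_spans if keep(s)}
    tp = len(gold_kept & pred_kept)
    precision = tp / len(pred_kept) if pred_kept else 0.0
    recall = tp / len(gold_kept) if gold_kept else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


train_surfaces = {"Kano", "BBC News"}  # entity surface forms seen in training
is_zero_freq = lambda span: span[4] not in train_surfaces
is_long = lambda span: len(span[4].split()) >= 4
# bucket_f1(gold, pred, is_zero_freq); bucket_f1(gold, pred, is_long)
```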

Conclusion and Future Work
We address the NER task for African languages by bringing together a variety of stakeholders to create a high-quality NER dataset for ten African languages. We evaluate multiple state-of-the-art NER models and establish strong baselines. We have released one of our best models, which can recognize named entities in ten African languages, on the HuggingFace Model Hub. We also investigate cross-domain transfer with experiments on five languages with the WikiAnn dataset, along with cross-lingual transfer for low-resource NER using the English CoNLL-2003 dataset and other languages supported by XLM-R. In the future, we plan to use pretrained word embeddings such as GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) instead of random initialization for the CNN-BiLSTM-CRF, increase the number of annotated sentences per language, and expand the dataset to more African languages.

A.1 Annotator Agreement
To shed more light on the few cases where annotators disagreed, we provide entity-level confusion matrices across all ten languages in Table 11. The most common disagreement is between organizations and locations.

A.2 Model Hyperparameters for Reproducibility
For fine-tuning mBERT and XLM-R, we used the base and large models with a maximum sequence length of 164 for mBERT and 200 for XLM-R, a batch size of 32, a learning rate of 5e-5, and 50 training epochs. For the MeanE-BiLSTM model, the hyperparameters are similar to fine-tuning the LM, except for the learning rate, which we set to 5e-4. The BiLSTM hyperparameters are: an input dimension of 768 (since the embedding size from mBERT and XLM-R is 768), one hidden layer with a hidden size of 64 in each direction of the LSTM, and a dropout probability of 0.3 before the last linear layer. All the experiments were performed on a single GPU (Nvidia V100).

Table 10 shows the monolingual corpora we used for the language-adaptive fine-tuning. We provide the details of the source of the data and their sizes. For most of the languages, we make use of JW300 and CC100. In some cases CCAligned (El-Kishky et al., 2020) was used; in such cases, we removed duplicated sentences from CC100. For fine-tuning the language model, we make use of the HuggingFace (Wolf et al., 2019) code with a learning rate of 5e-5. However, for the Amharic BERT, we make use of a smaller learning rate of 5e-6, since the multilingual BERT vocabulary was replaced by an Amharic vocabulary, so that we can slowly adapt the mBERT LM to understand Amharic text. All language BERT models were pretrained for 3 epochs ("ibo", "kin", "lug", "luo", "pcm", "swa", "yor") or 10 epochs ("amh", "hau", "wol"), depending on their convergence. The models can be found on the HuggingFace Model Hub.