Abstract
This paper introduces mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 corpus. We detail the design and pretraining procedure. The models are evaluated intrinsically and extrinsically: language modeling in all languages, downstream evaluation on cross-lingual NLU datasets and benchmarks in 33 languages, and world knowledge probing in 23 languages. Their in-context learning abilities are on par with those of contemporaneous language models, while mGPT covers a larger number of languages, including underrepresented and low-resource languages of the Commonwealth of Independent States and of the indigenous peoples in Russia. The source code and the language models are publicly available under the MIT license.
1 Introduction
The advent of the Transformer architecture (Vaswani et al., 2017) has facilitated the development of various language models (LMs; Liu et al., 2020a). Although the well-established “pretrain & finetune” paradigm has led to rapid progress in NLP (Wang et al., 2019), it imposes several limitations. Finetuning relies on an extensive amount of labeled data. Collecting high-quality labeled data for new tasks and languages is expensive and resource-consuming (Wang et al., 2021). LMs can learn spurious correlations from finetuning data (Naik et al., 2018; Niven and Kao, 2019) and demonstrate inconsistent generalization, catastrophic forgetting, or brittleness to finetuning data order (McCoy et al., 2020; Dodge et al., 2020). Last but not least, finetuning requires additional computational resources and, therefore, aggravates the problem of a large carbon footprint (Bender et al., 2021).
The latest approaches address these limitations with zero-shot and few-shot learning, performing a task with LM scoring or conditioning on a few demonstration examples without parameter updates (Brown et al., 2020). Autoregressive LMs used via these paradigms have been widely applied in many NLP tasks (Schick and Schütze, 2021; Perez et al., 2021), notably in cross-lingual knowledge transfer (Winata et al., 2021) and low-resource language scenarios (Lin et al., 2022). However, model development for underrepresented, typologically distant, and low-resource languages (Wu and Dredze, 2020; Lauscher et al., 2020; Hedderich et al., 2021) and the cross-lingual generalization abilities of autoregressive LMs (Erdem et al., 2022) remain understudied.
This paper presents mGPT, a multilingual version of GPT-3 (Brown et al., 2020) available in 1.3B (mGPT1.3B) and 13B (mGPT13B) parameter variants. We aim (i) to develop a large-scale multilingual autoregressive LM that inherits GPT-3’s generalization benefits and (ii) to increase the linguistic diversity of multilingual LMs, making the first attempt to address languages of the Commonwealth of Independent States (CIS) and under-resourced languages of the indigenous peoples in Russia. We pretrain mGPT on 61 languages from 25 language families using Wikipedia and the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020). We analyze mGPT’s performance on various intrinsic and extrinsic tasks and compare it with contemporaneous generative LMs.
Key Findings
The analysis reveals that (i) mGPT1.3B is comparable to XGLM1.7B (Lin et al., 2022) while having fewer weights and covering a larger number of languages, (ii) mGPT shows confident performance on Austronesian, Austro-Asiatic, Japonic, Germanic, and Romance languages across multiple tasks and prominent language modeling abilities for the languages of the indigenous peoples in Russia, (iii) adding more demonstrations may result in performance degradation for both mGPT and XGLM, and (iv) hate speech detection is one of the most challenging tasks, with performance close to random guessing in the zero-shot and few-shot evaluation setups. External validation by the NLP community since the release1 shows that mGPT1.3B can outperform large-scale LMs on SuperGLUE tasks and promote strong solutions for multilingual clause-level morphology tasks. We release the model evaluation code,2 as well as the mGPT1.3B3 and mGPT13B4 models. We hope to facilitate research on the applicability of autoregressive LMs in non-English languages and to increase the linguistic inclusivity of low-resource languages.
2 Related Work
Multilingual Transformers
Recent years have featured the development of various monolingual and multilingual LMs initially designed for English. BERT (Devlin et al., 2019) has been replicated for other high-resource languages (Martin et al., 2020; Masala et al., 2020) and language families, e.g., Indian (Kakwani et al., 2020) and Balto-Slavic (Arkhipov et al., 2019). Massively multilingual LMs—mBERT, XLM-R (Conneau et al., 2020), RemBERT (Chung et al., 2021), mBART (Liu et al., 2020b), and mT5 (Xue et al., 2021)—have pushed state-of-the-art results on various NLP tasks in multiple languages (Kalyan et al., 2021). Such models support more than 100 languages and vary in architecture design and pretraining objectives. By contrast, our work presents one of the first multilingual autoregressive LMs, covering 61 languages.
GPT-based Language Models
Large-scale generative LMs (e.g., GPT-3; Brown et al., 2020) are triggering a shift from the “pretrain & finetune” paradigm to prompt-based learning (Liu et al., 2023a). The benefit of balancing the pretraining costs and performing standardized NLP tasks with a few demonstration examples has stimulated the development of open-source autoregressive LMs for English (e.g., Black et al., 2022; Biderman et al., 2023; Dey et al., 2023), Chinese (Zeng et al., 2021), and Russian (Zmitrovich et al., 2023). A few contemporaneous works extend the research on zero-shot and few-shot learning, evaluating the in-context abilities of GPT-based LMs in multilingual scenarios. Winata et al. (2021) report that English GPTs perform significantly better than random guessing with monolingual and multilingual prompts on typologically close languages, such as French, Spanish, and German. Lin et al. (2022) propose XGLM, a multilingual GPT-style LM covering 30 languages, and empirically show that it can outperform monolingual counterparts with a comparable number of parameters. We use XGLM as the main baseline in our experiments and also compare mGPT1.3B with other autoregressive LMs published after our release, such as BLOOM (Scao et al., 2023).
3 Method
3.1 Pretraining Data
Language Selection
Table 1 summarizes the list of languages by family. The pretraining corpus consists of a typologically weighted set of languages covered by cross-lingual benchmarks, such as XGLUE (Liang et al., 2020) and XTREME (Hu et al., 2020). The motivation behind the language choices is to narrow the gap between high-resource and low-resource languages (Ducel et al., 2022). To this end, we include 20 languages from the tail of the C4 language list, the list of underrepresented languages of Russia, and the official and resource-lean CIS languages (Orekhov et al., 2016).
| Language Family | Languages |
|---|---|
| Afro-Asiatic | Arabic (ar), Hebrew (he) |
| Austro-Asiatic | Vietnamese (vi) |
| Austronesian | Indonesian (id), Javanese (jv), Malay (ms), Tagalog (tl) |
| Baltic | Latvian (lv), Lithuanian (lt) |
| Basque | Basque (eu) |
| Dravidian | Malayalam (ml), Tamil (ta), Telugu (te) |
| Indo-European (Armenian) | Armenian (hy) |
| Indo-European (Indo-Aryan) | Bengali (bn), Marathi (mr), Hindi (hi), Urdu (ur) |
| Indo-European (Germanic) | Afrikaans (af), Danish (da), English (en), German (de), Swedish (sv) |
| Indo-European (Romance) | French (fr), Italian (it), Portuguese (pt), Romanian (ro), Spanish (es) |
| Indo-European (Greek) | Greek (el) |
| Indo-European (Iranian) | Ossetian (os), Tajik (tg), Persian (fa) |
| Japonic | Japanese (ja) |
| Kartvelian | Georgian (ka) |
| Koreanic | Korean (ko) |
| Kra-Dai | Thai (th) |
| Mongolic | Buryat (bxr), Kalmyk (xal), Mongolian (mn) |
| Niger-Congo | Swahili (sw), Yoruba (yo) |
| Slavic | Belarusian (be), Bulgarian (bg), Russian (ru), Ukrainian (uk), Polish (pl) |
| Sino-Tibetan | Burmese (my) |
| Turkic (Karluk) | Uzbek (uz) |
| Turkic (Kipchak) | Bashkir (ba), Kazakh (kk), Kyrgyz (ky), Tatar (tt) |
| Turkic (Oghuz) | Azerbaijani (az), Chuvash (cv), Turkish (tr), Turkmen (tk) |
| Turkic (Siberian) | Tuvan (tyv), Yakut (sah) |
| Uralic | Estonian (et), Finnish (fi), Hungarian (hu) |
Data Preparation Pipeline
Pretraining extensive LMs requires large volumes of high-quality data. Despite the explosive growth of web corpora resulting in the pretraining data volume of up to 6T tokens (Xue et al., 2021), the data quality is often unsatisfactory (Kreutzer et al., 2022). General approaches to maximizing the quality are based on manually curated heuristics (Yang et al., 2019b), the perplexity of LMs (Wenzek et al., 2020), and data quality classifiers (Brown et al., 2020). Our data preparation pipeline includes data collection, deduplication, and filtration.
Data Collection
Deduplication
Deduplication is performed by computing a 64-bit hash of each text in the pretraining corpus and keeping only the texts with unique hashes.
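A minimal sketch of this step; the paper does not name the hash function, so a truncated BLAKE2b digest is used here purely for illustration:

```python
import hashlib

def deduplicate(texts):
    """Keep only the first text seen for each 64-bit hash value."""
    seen, unique = set(), []
    for text in texts:
        # 64-bit digest of the raw text; the actual hash used is unspecified.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```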
Filtration
We follow Ortiz Suárez et al. (2019) for the C4 data filtration. We also filter documents based on their text compression rate using zlib.6 The most strongly and most weakly compressible deduplicated texts are discarded; the acceptable compression range is empirically defined as ×1.2 to ×8. Texts that compress by less than ×1.2 mostly contain code junk and entities, while those that compress by more than ×8 contain repetitive segments. The next step distinguishes between low- and high-quality documents with a binary classifier, trained with Vowpal Wabbit7 on Wikipedia documents as positive examples and filtered C4 documents as negative ones. The remainder is cleaned with a set of language-agnostic heuristics. The resulting pretraining corpus contains 46B UTF characters from Wikipedia and 442B from C4, amounting to 600GB in total. Figure 1 shows the total number of tokens for each language; the total number of documents in the pretraining corpus is presented in Figure 2.
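A sketch of the compression-rate filter, assuming the rate is the ratio of the raw UTF-8 byte length to the zlib-compressed length (the exact definition is not given in the text):

```python
import zlib

def passes_compression_filter(text, low=1.2, high=8.0):
    """Discard texts that compress too little (junk) or too much (repetitive)."""
    raw = text.encode("utf-8")
    if not raw:
        return False
    rate = len(raw) / len(zlib.compress(raw))
    return low <= rate <= high
```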
3.2 Tokenization
Tokenization Strategies
We considered five tokenization strategies incorporating specific representations of uppercase characters, numbers, punctuation marks, and whitespaces. Table 2 presents examples of the tokenization strategies.
default: BBPE (Wang et al., 2020);
case: Each uppercase character is replaced with a special token <case> followed by the corresponding lowercase character (a minimal sketch of this transform follows the list);
arithmetic: The case strategy combined with representing numbers and arithmetic operations as individual tokens;
combined: The arithmetic strategy combined with representing punctuation marks and whitespaces as individual tokens;
char: Character-level tokenization.
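For illustration, the case strategy above can be read as a reversible pre-tokenization transform applied before BBPE; this is our interpretation of the description, and the released preprocessing code may differ:

```python
def encode_case(text, marker="<case>"):
    """Replace each uppercase character with <case> plus its lowercase form."""
    return "".join(marker + ch.lower() if ch.isupper() else ch for ch in text)

def decode_case(text, marker="<case>"):
    """Invert the transform by re-capitalizing the character after each marker."""
    out, i = [], 0
    while i < len(text):
        if text.startswith(marker, i):
            i += len(marker)
            if i < len(text):
                out.append(text[i].upper())
                i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

assert decode_case(encode_case("Multilingual GPT")) == "Multilingual GPT"
```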
Pretraining Details
The models are pretrained on 16 V100 GPUs for 600k training steps with a set of fixed hyperparameters: vocabulary size of 100k, context window of 2048, learning rate of 2e−4, and batch size of 4.
Results
The experiment results are presented in Table 3. The default model achieves the best results, outperforming the rest of the models by up to 2.5 perplexity points. Based on this experiment, we select the default strategy to pretrain the mGPT1.3B and mGPT13B models.
3.3 Model Architecture
The mGPT architecture is based on GPT-3. We follow the architecture description by Brown et al. and rely on the GPT-2 code base (Radford et al., 2019) from HuggingFace (Wolf et al., 2020) and Megatron-LM (Shoeybi et al., 2020). Table 4 compares the GPT-2 and GPT-3 architectures of comparable sizes. With all other hyperparameters equal, GPT-3 has fewer layers (24 vs. 48) but a larger hidden size (dmodel: 2048 vs. 1600) than GPT-2. GPT-3 also alternates the classic dense and sparse attention layers (Child et al., 2019).
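A hedged sketch of an mGPT1.3B-like configuration built on the Hugging Face GPT-2 code base; values not stated in the paper (e.g., the number of attention heads) are assumptions, and the alternating sparse attention layers of GPT-3 are not reproduced here:

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=100_000,  # reported vocabulary size
    n_positions=2048,    # context window used in the tokenization ablation
    n_embd=2048,         # GPT-3-style hidden size at the ~1.3B scale
    n_layer=24,          # GPT-3-style depth at the ~1.3B scale
    n_head=16,           # assumption: not specified in the text
)
model = GPT2LMHeadModel(config)
print(f"~{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```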
3.4 Model Pretraining
The pretraining procedure mostly follows Brown et al. We utilize the DeepSpeed library (Rasley et al., 2020) and Megatron-LM (Shoeybi et al., 2020). We pretrain our LMs with a total batch size of 2048 and a context window of 512 tokens. The total number of the training steps is 600k, and the models have seen 400B tokens during pretraining. The pretraining took 14 days on a cluster of 256 V100 GPUs for mGPT1.3B and 22 days on 512 V100 GPUs for mGPT13B. We report the computational, energy, and carbon costs in §7.2.
4 Experiments
4.1 Language Modeling
Method
We estimate the language modeling performance on the held-out sets for each language. Perplexity is computed as described in §3.2, except that it is normalized by the length of the input text t in tokens, |t| (a minimal sketch of this computation follows the list below). We also run statistical tests to analyze the effect of linguistic, dataset, and model configuration criteria:
Language script: We divide the languages into two groups by their script—Latin and others (e.g., Cyrillic and Arabic)—and use the Mann-Whitney U test (Mann and Whitney, 1947) to compare the perplexity distributions in the two groups.
Pretraining corpus size: We calculate the Pearson correlation coefficient (Pearson, 1895) to analyze the correlation between the language perplexity and the number of documents in this language in the pretraining corpus.
Model size: We use the Mann-Whitney U test to analyze the effect of the model size.
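A minimal sketch of the length-normalized perplexity computation, assuming a Hugging Face causal LM (the released evaluation code may differ):

```python
import torch

def normalized_perplexity(model, tokenizer, text):
    """Exponential of the per-token cross-entropy, i.e., loss normalized by |t|."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, Hugging Face causal LMs return the cross-entropy
        # averaged over the predicted tokens, i.e., the length-normalized loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()
```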
Results by Language
Figure 3 presents the perplexity scores for each language on the held-out sets. The mGPT13B model achieves the best perplexities, within the 2-to-10 range for the majority of languages, including Dravidian (Malayalam, Tamil, Telugu), Indo-Aryan (Bengali, Hindi, Marathi), Slavic (Belarusian, Ukrainian, Russian, Bulgarian), Sino-Tibetan (Burmese), Kipchak (Bashkir, Kazakh), and others. Higher perplexities, up to 20, are observed for only seven languages from different families. The mGPT1.3B results follow a similar distribution but are consistently higher than those of mGPT13B.
Results by Language Family
Analyzing the results by language family (see Figure 4), we find that mGPT13B shows consistently lower perplexities than mGPT1.3B. In particular, mGPT1.3B underperforms mGPT13B on the Basque, Greek, Kartvelian, and Turkic families.
Correlation Analysis
We present the results in Table 5. We observe that the language modeling performance depends on the language script and the model size. In particular, the non-Latin-script languages receive lower scores on average, while mGPT13B performs better than mGPT1.3B in this setting. However, the positive correlation between the pretraining corpus size and perplexity for particular languages can be attributed to the low diversity of text domains in the monolingual pretraining corpora for the low-resource languages. Such corpora contain Wikipedia articles on a limited number of general topics; therefore, the model learns the distribution of the corpora without being able to generalize well. Overall, the results align with Scao et al. (2023), who report that the considered criteria can affect the knowledge acquired by BLOOM1B and BLOOM176B.
| Criterion | Model | Test | p-value |
|---|---|---|---|
| Language script | mGPT1.3B | M-W U test | 0.012 |
| | mGPT13B | M-W U test | 0.000 |
| Pretraining corpus size | mGPT1.3B | Pearson | 0.137 |
| | mGPT13B | Pearson | 0.307 |
| Model size | mGPT1.3B vs. mGPT13B | M-W U test | 0.0007 |
4.2 Downstream Evaluation
We conduct an extrinsic evaluation of mGPT and baselines on classification and sequence labeling tasks in zero-shot and few-shot settings. In the zero-shot setting, the model is shown a test example formatted as a prompt in natural language, while in the few-shot setting, the model is provided with k demonstrations from the training data specified via prompts. The prompt examples for each task are presented in Table 6.
| Task | Template | Output Candidates |
|---|---|---|
| XNLI | <s> {sentence 1}, right? {label} {sentence 2} </s> | Yes (Entailment); Also (Neutral); No (Contradiction) |
| PAWSX | <s> {sentence 1}, right? {label} {sentence 2} </s> | Yes; No |
| XWINO | <s> {sentence start} {candidate} {sentence end} </s> | ✗ |
| XCOPA | <s> {sentence} because {candidate answer} </s>; <s> {sentence} so {candidate answer} </s> | ✗ |
| Hate Speech | <s> The sentence is {label}. {sentence} </s> | sexist, racist, offensive, abusive, hateful (Positive); normal, common, ok, usual, acceptable (Negative) |
| NER | <s> lang: {lang}\n Tagged sentence: {sentence with tags} | I-LOC, I-MISC, I-ORG, I-PER, O |
| POS | <s> lang: {lang}\n Tagged sentence: {sentence with tags} | ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X |
4.2.1 Classification
Tasks
The classification tasks include commonsense reasoning (XCOPA; Ponti et al., 2020), natural language inference (XNLI; Conneau et al., 2018), the Winograd schema challenge (XWINO; Tikhonov and Ryabinin, 2021), paraphrase detection (PAWSX; Yang et al., 2019a), and hate speech detection (Davidson et al., 2017).
Method
mGPT uses a per-token cross-entropy loss, which reduces to the negative log probability of the target token due to the one-hot encoding of the tokens. We select the target label whose associated prompt yields the lowest sum of negative log probabilities over its tokens. The few-shot experiments are run five times with different random seeds, while the zero-shot experiments are run only once since the model loss is deterministic.
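A minimal sketch of this label-scoring procedure, assuming a Hugging Face causal LM and verbalized prompts as in Table 6; the helper name and the exact scoring granularity (full prompt vs. label tokens only) are our assumptions:

```python
import torch
import torch.nn.functional as F

def best_candidate(model, tokenizer, prompts):
    """Return the index of the prompt with the lowest sum of token-level NLLs."""
    losses = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Negative log probability of each next token given its left context.
        nll = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="sum")
        losses.append(nll.item())
    return min(range(len(prompts)), key=losses.__getitem__)
```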
Baselines
The XGLM1.7B and XGLM7.5B models are used as the baselines in the classification experiments. We reproduce the XGLM evaluation based on the methodology by Lin et al. (2022) and use the model weights and code available in the fairseq8 library (Ott et al., 2019). We select prompts according to the templates reported by Lin et al. Prompts for non-English languages are automatically translated with Google Translate.
Results
Table 7 presents the classification results averaged across languages. The “✗” tag marks k-shot settings not reported by Lin et al.; we omit these settings for reproducibility and fair comparison. The results by Lin et al. are reproduced in the zero-shot setup, and some scores are even slightly higher. However, not all results are reproduced, e.g., on PAWSX and XNLI; we attribute this to potential differences in the translated prompts.
| Model | k-shot | XWINO | PAWSX | XCOPA | XNLI | Hate Speech |
|---|---|---|---|---|---|---|
| mGPT1.3B | 0 | 56.2 | 53.1 | 55.5 | 40.6 | 50.0 |
| | 1 | 57.0 | 51.3 | 54.9 | 36.1 | ✗ |
| | 4 | 56.8 | 52.2 | 54.8 | 37.4 | 50.8 |
| | 16 | 54.5 | 52.2 | 54.8 | 37.9 | ✗ |
| mGPT13B | 0 | 59.3 | 51.5 | 58.2 | 42.6 | 53.1 |
| | 1 | 61.0 | 50.6 | 57.9 | 37.5 | ✗ |
| | 4 | 61.8 | 51.6 | 58.3 | 41.4 | 51.5 |
| | 16 | 59.2 | 55.1 | 57.3 | 33.3 | ✗ |
| XGLM1.7B | 0 | 54.2 | 50.3 | 55.5 | 42.6 | 50.1 |
| | 1 | 58.0 | 45.9 | 56.8 | 36.4 | ✗ |
| | 4 | 57.9 | 45.9 | 56.2 | 38.8 | 49.5 |
| | 16 | ✗ | 44.2 | 56.1 | 36.5 | ✗ |
| XGLM7.5B | 0 | 59.2 | 50.1 | 55.5 | 44.7 | 50.1 |
| | 1 | 63.7 | 46.4 | 60.6 | 36.9 | ✗ |
| | 4 | 64.2 | 45.3 | 61.4 | 40.1 | 51.8 |
| | 16 | ✗ | 44.9 | 62.5 | 40.0 | ✗ |
Overall, we observe that mGPT1.3B is comparable with XGLM1.7B while having fewer weights and being pretrained on twice as many languages. mGPT13B performs better than XGLM7.5B in the zero-shot setting on all tasks except XNLI. At the same time, it lags behind in the few-shot settings, outperforming XGLM7.5B only on the XNLI and PAWSX tasks. Comparing the performance across languages, we find that English receives the highest accuracy for all tasks. The mGPT1.3B and mGPT13B models show high accuracy for the Austronesian, Dravidian, Japonic, Germanic, and Romance language families; only the Afro-Asiatic family gets low accuracy. The mGPT models perform better than their XGLM counterparts for the Austronesian, Koreanic, and Romance languages.
Our results on hate speech detection are consistent with Lin et al.: the performance is slightly better across the five languages but still close to random guessing (see Table 8). The manual analysis shows that the model behavior is sensitive to the input prompts, most notably for Polish. Increasing the number of demonstrations can lead to performance degradation on some classification tasks for both mGPT and XGLM.
| Model | k-shot | en | es | pt | pl | it |
|---|---|---|---|---|---|---|
| mGPT1.3B | 0 | 55.1 | 52.1 | 42.3 | 50.0 | 50.2 |
| | 4 | 50.1 | 50.2 | 51.7 | 51.5 | 50.4 |
| mGPT13B | 0 | 59.0 | 55.2 | 46.9 | 50.0 | 54.6 |
| | 4 | 52.2 | 50.0 | 50.8 | 53.4 | 51.0 |
| XGLM1.7B | 0 | 54.8 | 51.8 | 52.3 | 50.0 | 54.5 |
| | 4 | 51.0 | 48.8 | 49.2 | 46.7 | 51.0 |
| XGLM7.5B | 0 | 61.7 | 52.4 | 52.3 | 50.0 | 49.0 |
| | 4 | 51.8 | 51.3 | 51.5 | 51.4 | 52.9 |
4.2.2 Sequence Labeling
Tasks
The sequence labeling tasks include named entity recognition (NER) and part-of-speech tagging (POS) from the XGLUE benchmark (Liang et al., 2020). To address other medium-resource and resource-lean languages, we use the Universal Dependencies treebanks (UD; Nivre et al., 2016) to evaluate POS-tagging in Armenian, Belarusian, Buryat, Kazakh, Tatar, Ukrainian, and Yakut.
Method
We use a modified approach for the sequence labeling tasks compared to §4.2.1. Given a sentence of n words, we iteratively predict the label for each word xi, using the preceding words x<i and their predicted labels l<i as the context: each preceding word is followed by its predicted label, and the current word is followed by a placeholder “_”. The only exception is the first word, for which no preceding context is available. At each step, the placeholder is filled with each possible target label l ∈ L, and we select the label with the lowest sum of per-token losses over the resulting string. The experiments are run in the zero-shot and 4-shot settings.9
Example
Consider an example for the POS-tagging task, “I [PRON] want [VERB] it [PART] . [PUNCT]”, which requires four steps. First, we combine the placeholder in the string “I _” with each possible POS tag and select the most probable candidate. Next, we repeat the procedure for “I [PRON] want _”, and so on.
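A minimal sketch of this greedy labeling loop; the prompt formatting here is simplified relative to the template in Table 6, and score_fn stands for a routine returning the summed per-token loss of a string (e.g., as sketched in §4.2.1):

```python
def label_sequence(score_fn, words, tag_set):
    """Greedily pick, for each word, the tag whose filled-in prefix scores best."""
    labels = []
    for i, word in enumerate(words):
        # Prefix: every already-labeled word followed by its predicted tag.
        prefix = " ".join(f"{w} [{l}]" for w, l in zip(words[:i], labels))
        candidates = [f"{prefix} {word} [{tag}]".strip() for tag in tag_set]
        scores = [score_fn(c) for c in candidates]
        labels.append(tag_set[scores.index(min(scores))])
    return labels

# e.g., label_sequence(score_fn, "I want it .".split(),
#                      ["PRON", "VERB", "PART", "PUNCT", "NOUN"])
```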
Baselines
We use the results reported by Liang et al. as the baselines: M-BERT, XLM-R, and Unicoder (Huang et al., 2019). Note that these baselines are finetuned on the corresponding training sets. The performance is evaluated with the F1-score (NER) and accuracy (POS-tagging)10 according to the XGLUE methodology.
NER Results
Table 9 shows, counterintuitively, that mGPT1.3B outperforms mGPT13B on all languages. In the 4-shot setting, both mGPT models fall behind the finetuned baselines but significantly outperform random guessing. Per-language analysis shows a large gap between English and the other languages (for mGPT13B, the F1-score on English is more than twice as high as for any other language), while both models perform worst on German; this pattern coincides with the baseline results. In addition, the mGPT1.3B F1-score exceeds 10 points for all languages, which is not the case for mGPT13B.
| Model | de | en | es | nl | Avg. |
|---|---|---|---|---|---|
| Random | 1.9 | 3.1 | 1.8 | 1.6 | 2.1 |
| mGPT1.3B | 12.2 | 22.1 | 12.7 | 13.1 | 15.0 |
| mGPT13B | 5.6 | 20.9 | 10.4 | 6.7 | 10.9 |
| M-BERTbase | 69.2 | 90.6 | 75.4 | 77.9 | 78.2 |
| XLM-Rbase | 70.4 | 90.9 | 75.2 | 79.5 | 79.0 |
| Unicoder | 71.8 | 91.1 | 74.4 | 81.6 | 79.7 |
POS-tagging Results
POS-tagging results for the XGLUE benchmark and the resource-lean languages are presented in Table 10. Similarly to the NER task, mGPT1.3B outperforms mGPT13B in practically all languages except Italian. On average, mGPT1.3B achieves an accuracy of 24.4, while mGPT13B scores only 20.9. These results are still far behind the finetuned models but significantly higher than random guessing. For the low-resource languages, the mGPT1.3B performance is comparable with its performance on XGLUE, while the mGPT13B scores are lower.
The columns ar through Avg. correspond to the XGLUE benchmark; the columns be through uk correspond to the CIS and low-resource UD treebanks.

| Model | ar | bg | de | el | en | es | fr | hi | it | nl | pl | pt | ru | th | tr | ur | vi | zh | Avg. | be | bxr | hy | kk | sah | tt | uk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 6.5 | 6.5 | 6.0 | 5.2 | 4.4 | 5.7 | 5.5 | 6.7 | 6.6 | 6.6 | 5.9 | 4.7 | 6.0 | 6.4 | 6.8 | 1.2 | 7.0 | 7.1 | 5.8 | 1.3 | 5.7 | 5.9 | 2.6 | 9.6 | 8.7 | 4.8 |
| mGPT1.3B | 16.5 | 24.5 | 30.6 | 20.9 | 40.0 | 24.3 | 27.0 | 16.2 | 25.4 | 28.8 | 28.3 | 24.6 | 29.4 | 12.9 | 30.4 | 15.0 | 25.6 | 19.5 | 24.4 | 21.5 | 28.4 | 14.7 | 22.8 | 19.9 | 21.4 | 22.5 |
| mGPT13B | 11.7 | 21.8 | 26.8 | 16.1 | 36.0 | 22.2 | 25.0 | 12.3 | 26.5 | 26.5 | 24.2 | 21.8 | 21.8 | 9.5 | 26.8 | 12.7 | 21.5 | 12.5 | 20.9 | 10.6 | 7.7 | 7.3 | 9.4 | 11.8 | 9.2 | 10.9 |
| M-BERTbase | 52.4 | 85.0 | 88.7 | 81.5 | 95.6 | 86.8 | 87.6 | 58.4 | 91.3 | 88.0 | 81.8 | 88.3 | 78.8 | 43.3 | 69.2 | 53.8 | 54.3 | 58.3 | 74.7 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| XLM-Rbase | 67.3 | 88.8 | 92.2 | 88.2 | 96.2 | 89.0 | 89.9 | 74.5 | 92.6 | 88.5 | 85.4 | 89.7 | 86.9 | 57.9 | 72.7 | 62.1 | 55.2 | 60.4 | 79.8 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Unicoder | 68.6 | 88.5 | 92.0 | 88.3 | 96.1 | 89.1 | 89.4 | 69.9 | 92.5 | 88.9 | 83.6 | 89.8 | 86.7 | 57.6 | 75.0 | 59.8 | 56.3 | 60.2 | 79.6 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
4.3 Knowledge Probing
Method
We probe our models for factual knowledge in 23 languages using the mLAMA dataset (Kassner et al., 2021). The task is to complete a knowledge triplet <subject, relation, object> converted to a template for querying LMs. Consider an example from the original LAMA (Petroni et al., 2019) for English, where <Dante, born-in, X> is converted to the template “Dante was born in [MASK]”. We follow Lin et al. in designing the probing task. Since each query contains hundreds of negative candidates on average, we limit the number of candidates to three: one ground-truth candidate and two candidates randomly sampled from the provided knowledge source. The probing performance is evaluated with precision@1 averaged over all relations per language.
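A minimal sketch of the three-candidate ranking; the placeholder token and score_fn (returning the summed negative log probability of a filled-in query) are our notation, not the mLAMA API:

```python
import random

def probe_query(score_fn, template, truth, negatives, seed=0):
    """Return 1 if the ground-truth object outscores two sampled negatives."""
    candidates = [truth] + random.Random(seed).sample(negatives, 2)
    # Lower summed negative log probability = more probable completion.
    best = min(candidates, key=lambda obj: score_fn(template.replace("[MASK]", obj)))
    return int(best == truth)  # contributes to precision@1 for this relation
```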
Results
Figure 5 outlines the results for mGPT1.3B and mGPT13B. The overall pattern is that the performance is equal to or above 0.6 for the Germanic, Romance, Austro-Asiatic, Japonic, and Chinese languages, whereas the Uralic, Slavic, Koreanic, and Afro-Asiatic languages receive scores below 0.5. We also find that scaling the number of model parameters usually boosts the performance for high-resource languages by up to 5 points, while no significant improvements are observed for the other languages. Comparing our results with Lin et al., we conclude that our models achieve lower performance than XGLM7.5B in almost all languages and perform on par with GPT3-Curie6.5B.
4.4 External Evaluation
General Language Understanding
Scao et al. (2023) compared the performance of BLOOM176B, mGPT1.3B, OPT175B (Zhang et al., 2022), GPT-J6B (Wang and Komatsuzaki, 2021), and T011B (Victor et al., 2022) on a subset of tasks from the SuperGLUE benchmark (Wang et al., 2019) in the zero-shot and one-shot settings. The results of evaluating the models with five prompts are presented in Figure 6. The mGPT1.3B model shows comparable performance despite having fewer weights. In the zero-shot setting, the performance of mGPT1.3B, BLOOM176B, OPT175B, and GPT-J6B on the considered tasks is above random guessing. We also observe strong performance of mGPT1.3B on the Winogender Schema diagnostics (AX-g). In the one-shot setting, mGPT1.3B performs on par with GPT-J6B, and the variability across prompts is significantly reduced.
Multilingual Clause-level Morphology
The first shared task on Multilingual Clause-level Morphology (Goldman et al., 2022) covers nine languages and includes three sub-tasks: (i) inflection (generating a word form given a lexeme and a set of morphosyntactic features), (ii) reinflection (reinflecting an input sentence according to a given set of morphosyntactic features), and (iii) analysis (detecting a root and its features in an input sentence). Acikgoz et al. (2022) develop a first-place solution based on mGPT1.3B and prefix-tuning, outperforming other solutions and baselines on the third sub-task.
4.5 Generation Evaluation
Method
We compute seven lexical diversity metrics from Gehrmann et al. (2021) on the mGPT outputs11 for 100 test set samples from the story generation task in five languages: English, French, German, Spanish, and Chinese (Chen et al., 2022). The diversity metrics include the Shannon entropy over unigrams (Entropy1), the mean segmented type-token ratio over segment lengths of 100 (MSTTR), the ratio of distinct unigrams to the total number of unigrams (Distinct1), and the count of unigrams that appear only once in the collection of generated outputs (Unique1).
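A minimal sketch of the unigram-level metrics, assuming whitespace tokenization (the implementation by Gehrmann et al. may differ, e.g., in segmentation for Chinese; MSTTR and average length are omitted here):

```python
import math
from collections import Counter

def unigram_diversity(texts):
    """Entropy1, Distinct1, Unique1, and vocabulary size over generated texts."""
    tokens = [tok for t in texts for tok in t.split()]
    counts = Counter(tokens)
    total = len(tokens)
    entropy1 = -sum(c / total * math.log2(c / total) for c in counts.values())
    return {
        "Entropy1": entropy1,
        "Distinct1": len(counts) / total,                      # distinct / all unigrams
        "Unique1": sum(1 for c in counts.values() if c == 1),  # unigrams seen once
        "Vocabulary size": len(counts),
    }
```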
Results
The results are presented in Table 11. The diversity metrics scores for Chinese are the highest, while the mean generated text length is the shortest. This is likely due to its logographic writing. The results for the Indo-European languages are similar (French, German, and Spanish), indicating that mGPT1.3B generates diverse texts in these languages. Surprisingly, the metrics are lower for English, with the average text length being longer. Our current natural language generation evaluation approach lacks downstream tasks, which we leave for future work.
| ISO | Avg. length | Distinct1 | Vocabulary size | Unique1 | Entropy1 | TTR | MSTTR |
|---|---|---|---|---|---|---|---|
| en | 39.13 ± 22.61 | 0.071 | 387 | 103 | 6.175 | 0.097 | 0.228 |
| fr | 23.53 ± 17.92 | 0.128 | 486 | 181 | 6.875 | 0.159 | 0.346 |
| de | 30.85 ± 17.33 | 0.113 | 453 | 159 | 6.850 | 0.151 | 0.340 |
| es | 12.71 ± 15.54 | 0.102 | 413 | 124 | 6.818 | 0.148 | 0.315 |
| zh | 3.157 ± 2.39 | 0.492 | 188 | 124 | 7.055 | 0.525 | 0.526 |
5 Discussion
Our key takeaways on pretraining and evaluating large-scale multilingual autoregressive LMs are summarized below.
5.1 Model Scaling
Empirical Results
The language modeling results for mGPT1.3B and mGPT13B suggest that model scaling improves the generation abilities for all given languages (see §4.1). However, it does not improve performance on the downstream and probing tasks (see §4.2; §4.3). Overall, the language modeling performance depends on the model size and the size of the pretraining corpus in a language, and smaller models may encode linguistic information better than larger ones. These findings align with Scao et al. (2023).
Takeaways
Our work was conducted a year before the Chinchilla scaling laws were introduced (Hoffmann et al., 2022). According to these more advanced methods of scaling LMs, our pretraining corpus could be substantially extended to improve the generalization abilities of the mGPT13B model. At the same time, the pretraining corpus design can lead to underfitting or overfitting on particular languages. We believe this can be accounted for by aggregating the language-specific cross-entropy losses and producing language weights, similar to Xie et al. (2023).
5.2 Lack of Data
Empirical Results
Another challenging factor is the lack of high-quality data for the low-resource languages. Although mGPT shows promising results on the language modeling and sequence labeling tasks for the underrepresented languages (see §4.1, §4.2), the scarcity of evaluation resources limits the analysis of the model generalization abilities. The correlation between the model performance and the amount of pretraining data in a language (see §4.1, and, e.g., Lauscher et al., 2020; Ahuja et al., 2022) further highlights the need for creating text corpora in such languages.
Takeaways
The question of addressing the discrepancy in data distribution across the world’s languages remains unresolved. Our data collection and filtration approach is identical for all considered languages. Extending the language-agnostic heuristics is constrained by the lack of linguistic expertise. However, we assume that experimenting with the training data for the text quality classifiers can improve the resulting quality of the corpora for the low-resource languages (e.g., training the classifiers on different mixtures of data in the medium-resource and high-resource languages).
As the follow-up work, we release 23 versions of the mGPT1.3B model continuously pretrained with language modeling objective on monolingual corpora for medium-resource and low-resource languages collected through collaboration with the NLP community. Table 12 summarizes the models by language and the language modeling performance on the held-out monolingual test sets. Examples of the corpora include Eastern Armenian National Corpus (Khurshudyan et al., 2022), OpenSubtitles (Lison and Tiedemann, 2016), and TED talks. Continued pretraining on additional data improves the language modeling performance.
5.3 Language Selection
Empirical Results
The results of mGPT1.3B on most of the classification tasks are on par with or better than those of XGLM1.7B, given that mGPT covers twice as many languages (see §4.2). However, mGPT underperforms the baselines on several multi-class classification and probing tasks.
Takeaways
We find that balancing the pretraining corpus by the language family helps improve the language modeling abilities for underrepresented languages due to their typological similarity with the medium and high-resource languages (see §4.1). However, increasing language diversity can lead to performance degradation because of the curse of multilinguality and a limited model capacity (Conneau et al., 2020).
5.4 Tokenization
Empirical Results
We conduct an ablation study to analyze the impact of the tokenization strategy on language modeling performance. We find that the considered strategies do not improve the model’s perplexity. However, the main drawback of the perplexity-based evaluation is that it only partially assesses the model generalization abilities.
Takeaways
The optimal tokenization method and vocabulary size remain an open question, particularly in the multilingual setup (Mielke et al., 2021). There are no established methods for defining the vocabulary size based on the amount of textual data in different languages. Our experiments are limited to a fixed vocabulary size, and we leave further investigation of the tokenization strategies and their configurations for future work.
5.5 Zero-shot and Few-shot Performance
Empirical Results
Increasing the number of demonstrations does not always lead to improvements but decreases the performance on some downstream tasks (see §4.2.1; §4.2.2). This observation aligns with Lin et al. (2022) and Brown et al. (2020).
The zero-shot and few-shot performance may not exceed the random guessing on particular tasks, which points to the failure of a model to follow the guidance in the demonstration examples (see §4.2.1; §4.2.2).
The prompting approach is unstable and hardly universal across languages, as indicated by the model sensitivity to the prompts.
The mGPT models can assign higher probabilities to the most frequent tag in the input for the sequence labeling tasks (see §4.2.2).
Takeaways
The stability of the models with respect to the prompts may be improved using prompt-tuning (Liu et al., 2023b) and contextual calibration (Zhao et al., 2021) as shown in §4.4.
The generalization capabilities of autoregressive LMs in sequence labeling tasks are an underexplored area. While our LMs achieve results above random guessing, the low performance can be attributed to probability distribution shifts between the pretraining corpora and the prompts. We leave the investigation of alternative prompt designs (Liu et al., 2023a) and structured prediction methods (Liu et al., 2022) for future work.
6 Conclusion
We introduce the mGPT1.3B and mGPT13B models, which cover 61 languages from 25 linguistically diverse language families. Our models are among the first autoregressive LMs for underrepresented CIS languages and other low-resource languages. The architecture design choices are based on preliminary tokenization experiments and their perplexity-based evaluation. The model evaluation experiments include language modeling, standardized cross-lingual NLU datasets and benchmarks, world knowledge probing, and social bias tasks. We evaluate the in-context learning abilities in zero-shot and few-shot settings via negative log-likelihood scoring. We present a detailed analysis of the model performance, limitations, and ethical considerations. Despite the room for further quality improvements and the highlighted limitations, the models show significant potential and can become the basis for developing generative pipelines for languages other than English, especially low-resource ones. This initiative has been extended to 23 diverse languages through collaboration with the NLP community. We hope to benefit cross-lingual knowledge transfer, annotation projection, and other potential applications for economically challenged and underrepresented languages, and to diversify the research field by shifting away from the Anglo-centric paradigm.
7 Ethical Statement and Social Impacts
7.1 Low-resource Languages
NLP for resource-lean scenarios is one of the leading research directions nowadays. The topic’s relevance has led to proactive research on low-resource languages. Our work falls under this scope, introducing one of the first autoregressive LMs covering 61 languages. To the best of our knowledge, we present one of the first attempts to address this problem for 20 languages of the Commonwealth of Independent States and the indigenous peoples in Russia.
7.2 Energy Efficiency and Usage
The power usage effectiveness (PUE) of our data centers is not more than 1.3, the energy consumed is 30.6k kWh (mGPT1.3B) and 91.3k kWh (mGPT13B), and the CO2 energy intensity (ICO2) in the region is 400 grams per kWh. The resulting CO2 emissions are 15.9k kg (mGPT1.3B) and 47.5k kg (mGPT13B), comparable with a single medium-range flight of a modern aircraft, which typically releases about 12k kg of CO2 per 1k km. Despite these costs, mGPT can be efficiently adapted to user needs via few-shot learning, reducing potential budget costs across applications in multiple languages, such as content generation, labeled-data augmentation, or news summarization. Multilingual pretraining saves on data annotation and energy consumption, alleviating the carbon footprint. Model compression techniques, e.g., pruning and distillation, can further reduce inference costs.
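Assuming the standard estimate of emissions as PUE × energy consumed × regional carbon intensity, which reproduces the reported figures:

```latex
\mathrm{CO_2} = \mathrm{PUE} \times E \times I_{\mathrm{CO_2}}
             = 1.3 \times 30{,}600~\mathrm{kWh} \times 0.4~\mathrm{kg/kWh} \approx 15{,}900~\mathrm{kg} \quad (\text{mGPT}_{1.3\mathrm{B}}),
\qquad
             = 1.3 \times 91{,}300~\mathrm{kWh} \times 0.4~\mathrm{kg/kWh} \approx 47{,}500~\mathrm{kg} \quad (\text{mGPT}_{13\mathrm{B}}).
```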
7.3 Social Risks of Harm
Stereotypes and unjust discrimination present in pretraining corpora lead to representation biases in LMs. LMs can reflect historical prejudices against disadvantaged social groups and reproduce harmful stereotypes about gender, race, religion, or sexual orientation (Weidinger et al., 2022). We have analyzed mGPT’s limitations on social risks of harm involving hate speech on the hate speech detection task. Our results are similar to Lin et al. (2022) in that the performance is close to random guessing. This may indicate a significant bias in the pretraining corpus, a mutual influence of languages during training, or methodological problems in the test set. We do not claim that our evaluation setup is exhaustive, and we assume that other biases can be revealed through a direct model application or an extended evaluation.
7.4 Potential Misuse
The misuse potential of LMs increases with their ability to generate high-quality texts. Malicious users can perform a socially harmful activity that involves generating texts, e.g., spreading propaganda and other targeted manipulation (Jawahar et al., 2020). We recognize that our models can be misused in all supported languages. However, adversarial defense and artificial text detection models can mitigate ethical and social risks of harm. Our primary purpose is to propose multilingual GPT-style LMs for research and development needs, and we hope to work on the misuse problem with other developers and experts in mitigation research in the future.
Notes
As of the time of writing this paper, mGPT1.3B was publicly available. Note that mGPT13B is also now released.
We report the results only in the 4-shot setting since the manual analysis reveals that the models have failed to capture the task, giving constant predictions without any additional examples.
We evaluate the sequence labeling tasks using the XGLUE code: github.com/microsoft/XGLUE.
We use the generation hyperparameters: temperature = 1, max_length = 100, top_k = 5, top_p = 0.9.
References
Author notes
Work done while at SaluteDevices.
Now at University of Oslo.
Action Editor: Miguel Ballesteros