mGPT: Few-Shot Learners Go Multilingual

This paper introduces mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 Corpus. We detail the design and pretraining procedure. The models undergo an intrinsic and extrinsic evaluation: language modeling in all languages, downstream evaluation on cross-lingual NLU datasets and benchmarks in 33 languages, and world knowledge probing in 23 languages. The in-context learning abilities are on par with the contemporaneous language models while covering a larger number of languages, including underrepresented and low-resource languages of the Commonwealth of Independent States and the indigenous peoples in Russia. The source code and the language models are publicly available under the MIT license.


Introduction
The advent of the Transformer architecture (Vaswani et al., 2017) has facilitated the development of various language models (LMs; Liu et al., 2020a). Although the well-established "pretrain & finetune" paradigm has led to rapid progress in NLP (Wang et al., 2019), it imposes several limitations. Finetuning relies on an extensive amount of labeled data. Collecting high-quality labeled data for new tasks and languages is expensive and resource-consuming (Wang et al., 2021). LMs can learn spurious correlations from finetuning data (Naik et al., 2018; Niven and Kao, 2019) and demonstrate inconsistent generalization, catastrophic forgetting, or brittleness to finetuning data order (McCoy et al., 2020; Dodge et al., 2020). Last but not least, finetuning requires additional computational resources and, therefore, aggravates the problem of a large carbon footprint (Bender et al., 2021).
The latest approaches address these limitations with zero-shot and few-shot learning, performing a task with LM scoring or conditioning on a few demonstration examples without parameter updates (Brown et al., 2020). Autoregressive LMs adopted via these paradigms have been widely applied in many NLP tasks (Schick and Schütze, 2021; Perez et al., 2021), notably in cross-lingual knowledge transfer (Winata et al., 2021) and low-resource language scenarios (Lin et al., 2022). However, model development for underrepresented typologically distant and low-resource languages (Wu and Dredze, 2020; Lauscher et al., 2020; Hedderich et al., 2021) and cross-lingual generalization abilities of autoregressive LMs (Erdem et al., 2022) have been left understudied.
This paper presents mGPT, a multilingual version of GPT-3 (Brown et al., 2020) available in two sizes: 1.3B (mGPT 1.3B) and 13B (mGPT 13B) parameters. We aim to (i) develop a large-scale multilingual autoregressive LM that inherits GPT-3's generalization benefits and (ii) increase the linguistic diversity of multilingual LMs, making the first attempt to address languages of the Commonwealth of Independent States (CIS) and under-resourced languages of the small peoples in Russia. We pretrain mGPT in 61 languages from 25 language families on Wikipedia and the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020). We analyze mGPT's performance on various intrinsic and extrinsic tasks and compare it with the contemporaneous generative LMs. [...] multiple tasks and prominent language modeling abilities on the languages of the small peoples in Russia, (iii) adding more demonstrations may result in performance degradation for both mGPT and XGLM, and (iv) hate speech detection is one of the most challenging tasks, receiving random guessing performance in the zero-shot and few-shot evaluation setups. External validation by the NLP community since the release shows that mGPT 1.3B can outperform large-scale LMs on SuperGLUE tasks and promote strong solutions for multilingual clause-level morphology tasks. We release the model evaluation code and the mGPT 1.3B and mGPT 13B models. We hope to facilitate research on the applicability of autoregressive LMs in non-English languages and increase the linguistic inclusivity of the low-resource languages.

Related Work
Multilingual Transformers Recent years have featured the development of various monolingual and multilingual LMs initially designed for English. BERT (Devlin et al., 2019) has been replicated in other high-resource languages (Martin et al., 2020; Masala et al., 2020) and language families, e.g., Indian (Kakwani et al., 2020) and Balto-Slavic (Arkhipov et al., 2019). Massively multilingual LMs, such as mBERT, XLM-R (Conneau et al., 2020), RemBERT (Chung et al., 2021), mBART (Liu et al., 2020b), and mT5 (Xue et al., 2021), have now pushed state-of-the-art results on various NLP tasks in multiple languages (Kalyan et al., 2021). Such models support more than 100 languages and vary in the architecture design and pretraining objectives. By contrast, our work presents one of the first multilingual autoregressive LMs covering 61 languages.
GPT-based Language Models Large-scale generative LMs (e.g., GPT-3; Brown et al., 2020) are triggering a shift from the "pretrain & finetune" paradigm to prompt-based learning (Liu et al., 2023a). The benefit of balancing the pretraining costs and performing standardized NLP tasks with a few demonstration examples has stimulated the development of open-source autoregressive LMs for English (e.g., Black et al., 2022; Biderman et al., 2023; Dey et al., 2023), Chinese (Zeng et al., 2021), and Russian (Zmitrovich et al., 2023). A few contemporaneous works extend the research on zero-shot and few-shot learning, evaluating the in-context abilities of GPT-based LMs in multilingual scenarios. Winata et al. (2021) report that English GPTs perform significantly better than random guessing with monolingual and multilingual prompts on typologically close languages, such as French, Spanish, and German. Lin et al. (2022) propose XGLM, a multilingual GPT-style LM in 30 languages, and empirically show that it can outperform its monolingual counterparts with a comparable number of parameters. We use XGLM as the main baseline in our experiments and analyze the results of comparing mGPT 1.3B with other autoregressive LMs published after our release, such as BLOOM (Scao et al., 2023).

Pretraining Data
Language Selection Table 1 summarizes the list of languages by their family. The pretraining corpus consists of a typologically weighted set of languages covered by cross-lingual benchmarks, such as XGLUE (Liang et al., 2020) and XTREME (Hu et al., 2020). The motivation behind the language choices is to narrow the gap between the high-resource and low-resource languages (Ducel et al., 2022). To this end, we include 20 languages from the tail of the C4 language list, the list of underrepresented languages of Russia, and the official and resource-lean CIS languages (Orekhov et al., 2016).

Data Preparation Pipeline
Pretraining extensive LMs requires large volumes of high-quality data. Despite the explosive growth of web corpora resulting in the pretraining data volume of up to 6T tokens (Xue et al., 2021), the data quality is often unsatisfactory (Kreutzer et al., 2022). General approaches to maximizing the quality are based on manually curated heuristics (Yang et al., 2019b), the perplexity of LMs (Wenzek et al., 2020), and data quality classifiers (Brown et al., 2020). Our data preparation pipeline includes data collection, deduplication, and filtration.

Data Collection
The pretraining corpus represents a collection of documents from Wikipedia and C4. The Wikipedia texts are extracted from the dumps (v. 20201101) with WikiExtractor (Attardi, 2015). The C4 data is downloaded using TensorFlow Datasets (tensorflow.org/datasets/catalog/c4; Paper, 2021).
Deduplication The text deduplication includes 64-bit hashing of each text in the pretraining corpus, keeping only texts with a unique hash.
Filtration We follow Ortiz Suárez et al. (2019) on the C4 data filtration. We also filter the documents based on their text compression rate using zlib. The most strongly and weakly compressing deduplicated texts are discarded. The compression range for an acceptable text is empirically defined as ×1.2 to ×8. The texts with a compression rate of less than 1.2 contain code junk and entities, while those with a rate of more than 8 contain repetitive segments. The next step includes distinguishing between low and high-quality documents with a binary classifier. The classifier is trained with Vowpal Wabbit on the Wikipedia documents as positive examples and the filtered C4 documents as negative ones. The remainder is cleaned by a set of language-agnostic heuristics. The size of the pretraining corpus is 46B (Wikipedia) and 442B (C4) UTF characters, resulting in 600GB. Figure 1 shows the total number of tokens for each language, and the total number of documents in the pretraining corpus is presented in Figure 2.
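The deduplication and compression-rate steps can be illustrated with a minimal sketch. It is not the pipeline code: the hash function (BLAKE2b truncated to 8 bytes), the function names, and the definition of the compression rate as raw bytes divided by zlib-compressed bytes are assumptions made for illustration; only the 64-bit hash width and the ×1.2 to ×8 range come from the text.

```python
import hashlib
import zlib

# Acceptable compression range quoted in the text.
MIN_RATIO, MAX_RATIO = 1.2, 8.0


def text_hash64(text: str) -> int:
    """64-bit hash of a text. BLAKE2b with an 8-byte digest is an assumed choice."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def deduplicate(texts):
    """Keep the first occurrence of every text, compared by its 64-bit hash."""
    seen, unique = set(), []
    for text in texts:
        h = text_hash64(text)
        if h not in seen:
            seen.add(h)
            unique.append(text)
    return unique


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size (assumed definition)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def is_acceptable(text: str) -> bool:
    """True if the text falls inside the empirically chosen compression range."""
    return MIN_RATIO <= compression_ratio(text) <= MAX_RATIO
```

Under this definition, a highly repetitive document such as "abc " repeated thousands of times compresses far beyond ×8 and is discarded, while a very short string falls below ×1.2 because of the fixed zlib overhead.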

Tokenization
The design of the tokenization method may have a significant impact on learning efficient representations, model memorization, and downstream performance (Mielke et al., 2021; Nogueira et al., 2021; Pfeiffer et al., 2021; Rust et al., 2021). We investigate the effect of the tokenization strategy on the model perplexity. We pretrain five strategy-specific versions of mGPT 163M on a Wikipedia subset of the pretraining corpus. The tokenization strategy is selected based on its perplexity on a held-out Wikipedia sample (approx. 10.7MB), which is inferred as Equation 1:
$$\mathrm{PPL}(t) = \exp\left(-\frac{1}{|c|} \sum_{i=1}^{|t|} \log p_{\theta}(x_i \mid x_{<i})\right) \quad (1)$$
where t is an input text, |t| is the length of the text in tokens, |c| is the length of the text in characters, and p_θ(x_i | x_{<i}) is the probability the model assigns to the i-th token given the preceding tokens. The perplexity is normalized over the number of characters since the tokenizers produce different numbers of tokens for t (Cotterell et al., 2018).
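To make the normalization concrete, the sketch below computes perplexity from per-token log-probabilities, dividing the summed log-probability either by the number of characters (the tokenizer comparison setting) or by the number of tokens. The function names and the input format (a list of natural-log token probabilities) are illustrative assumptions.

```python
import math


def char_normalized_perplexity(token_logprobs, text):
    """exp of the negative summed log-probability divided by the character count |c|."""
    if not text:
        raise ValueError("text must be non-empty")
    return math.exp(-sum(token_logprobs) / len(text))


def token_normalized_perplexity(token_logprobs):
    """Same quantity, but divided by the token count |t|."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

For example, four tokens with probability 0.5 each over an eight-character text give a token-normalized perplexity of 2 and a character-normalized perplexity of √2, which shows why only the character-normalized value is comparable across tokenizers that split the same text differently.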
Tokenization Strategies We considered five tokenization strategies incorporating specific representations of uppercase characters, numbers, punctuation marks, and whitespaces. Table 2 presents examples of the tokenization strategies.
• DEFAULT: BBPE (Wang et al., 2020);
• CASE: Each uppercase character is replaced with a special token <case> followed by the corresponding lowercase character;
• ARITHMETIC: The CASE strategy combined with representing numbers and arithmetic operations as individual tokens;
• COMBINED: The ARITHMETIC strategy combined with representing punctuation marks and whitespaces as individual tokens;
• CHAR: Character-level tokenization.
Pretraining Details The models are pretrained on 16 V100 GPUs for 600k training steps with a set of fixed hyperparameters: vocabulary size of 100k, context window of 2048, learning rate of 2e-4, and batch size of 4.
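As an illustration of the CASE strategy, the following sketch rewrites a text before subword tokenization. It covers only the text transformation described in the list; the function names are hypothetical, and the real preprocessing may differ in details such as how the <case> token is registered in the vocabulary.

```python
CASE_TOKEN = "<case>"


def apply_case_strategy(text: str) -> str:
    """Replace every uppercase character with <case> plus its lowercase form."""
    pieces = []
    for ch in text:
        if ch.isupper():
            pieces.append(CASE_TOKEN + ch.lower())
        else:
            pieces.append(ch)
    return "".join(pieces)


def undo_case_strategy(text: str) -> str:
    """Inverse transform: uppercase the character that follows each <case> token."""
    parts = text.split(CASE_TOKEN)
    restored = parts[0]
    for part in parts[1:]:
        restored += part[:1].upper() + part[1:]
    return restored
```

The round trip is exact for characters with one-to-one case mappings, which covers the Latin and Cyrillic examples tested here; characters whose lowercase form expands to several code points would need special handling.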

Results
The experiment results are presented in Table 3. The DEFAULT model achieves the best results, outperforming the rest of the models by up to 2.5 perplexity points. Based on this experiment, we select the DEFAULT strategy to pretrain the mGPT 1.3B and mGPT 13B models.

Model Architecture
The mGPT architecture is based on GPT-3. We use the architecture description by Brown et al. (2020), the GPT-2 code base (Radford et al., 2019) from HuggingFace (Wolf et al., 2020), and Megatron-LM (Shoeybi et al., 2020). Table 4 presents the description of the GPT-2 and GPT-3 architectures of comparable sizes. With all the other hyperparameters equal, GPT-3 has fewer layers than GPT-2 (24 vs. 48) but a larger hidden size (d_model: 2048 vs. 1600). GPT-3 also alternates the classic dense and sparse attention layers (Child et al., 2019).

Model Pretraining
The pretraining procedure mostly follows Brown et al. (2020). We utilize the DeepSpeed library (Rasley et al., 2020) and Megatron-LM (Shoeybi et al., 2020). We pretrain our LMs with a total batch size of 2048 and a context window of 512 tokens. The total number of the training steps is 600k, and the models have seen 400B tokens during pretraining. The pretraining took 14 days on a cluster of 256 V100 GPUs for mGPT 1.3B and 22 days on 512 V100 GPUs for mGPT 13B. We report the computational, energy, and carbon costs in §7.2.

Language Modeling
Method We estimate the language modeling performance on the held-out sets for each language.
Here, perplexity is computed as described in §3.2, except that perplexity is normalized over the length of the input text t in tokens |t|. We also run statistical tests to analyze the effect of linguistic, dataset, and model configuration criteria:
• Language script: we divide the languages into two groups by their script, Latin and others (e.g., Cyrillic and Arabic), and use the Mann-Whitney U test (Mann and Whitney, 1947) to analyze the perplexity distributions in the groups.
• Pretraining corpus size: we calculate the Pearson correlation coefficient (Pearson, 1895) to analyze the correlation between the language perplexity and the number of documents in this language in the pretraining corpus.
• Model size: we use the Mann-Whitney U test to analyze the effect of the model size.
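The two statistics used in these tests can be sketched in a few lines of dependency-free Python. This is an illustration of the quantities involved rather than the analysis code: it computes the U statistic and the correlation coefficient only, without p-values, and the function names are assumptions. In practice, `scipy.stats.mannwhitneyu` and `scipy.stats.pearsonr` provide both the statistics and the p-values.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """U statistic of sample_a: pairs (a, b) with a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Here `sample_a` and `sample_b` would hold per-language perplexities for the two script groups (or the two model sizes), and `xs`, `ys` the per-language document counts and perplexities.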
Results by Language Figure 3 presents the perplexity scores for each language on the held-out sets. The mGPT 13B model achieves the best perplexities within the 2-to-10 score range for the majority of languages, including Dravidian (Malayalam, Tamil, Telugu), Indo-Aryan (Bengali, Hindi, Marathi), Slavic (Belarusian, Ukrainian, Russian, Bulgarian), Sino-Tibetan (Burmese), Kipchak (Bashkir, Kazakh), and others. Higher perplexities of up to 20 are observed for only seven languages from different families. The mGPT 1.3B results have a similar distribution but are consistently higher than those of mGPT 13B.
Results by Language Family Analyzing results by the language family (see Figure 4), we find that mGPT 13B shows consistently lower perplexities as opposed to mGPT 1.3B. Specifically, mGPT 1.3B underperforms mGPT 13B on Basque, Greek, Kartvelian, and Turkic families.
Correlation Analysis We present the results in Table 5. We observe that the language modeling performance depends on the language script and model size. In particular, the non-Latin languages receive lower scores on average, while mGPT 13B performs better than mGPT 1.3B in this setting. However, the positive correlation between the pretraining corpus size and perplexity in particular languages can be attributed to the low diversity of the text domains in the pretraining monolingual corpora for the low-resource languages. Such corpora contain Wikipedia articles on a limited number of general topics; therefore, the model learns the distribution in the corpora without being able to generalize well. In general, the results align with Scao et al. (2023), who report that the considered criteria can affect the knowledge acquired by BLOOM 1B and BLOOM 176B.

Downstream Evaluation
We conduct an extrinsic evaluation of mGPT and baselines on classification and sequence labeling tasks in zero-shot and few-shot settings. In the zero-shot setting, the model is shown a test example formatted as a prompt in natural language, while in the few-shot setting, the model is provided with k demonstrations from the training data specified via prompts. The prompt examples for each task are presented in Table 6.
Method mGPT utilizes per-token cross-entropy loss, which is reduced to negative log probability due to one-hot encoding of the tokens. We select the target label associated with the prompt that results in the lowest sum of negative log probabilities for its tokens. The few-shot experiments are run five times with different random seeds, while the zero-shot experiments are run only once since the model loss is deterministic.
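The selection rule can be sketched as follows. The sketch assumes that the per-token log-probabilities of each label-specific prompt have already been obtained from the model; the function name, the input format, and the numbers in the usage note are hypothetical.

```python
def select_label(prompt_logprobs):
    """Pick the label whose prompt has the lowest summed negative log-probability.

    prompt_logprobs maps each candidate label to the list of token
    log-probabilities of the prompt filled in with that label.
    """
    losses = {label: -sum(lps) for label, lps in prompt_logprobs.items()}
    best_label = min(losses, key=losses.get)
    return best_label, losses
```

For instance, `select_label({"entailment": [-0.1, -0.2], "contradiction": [-0.5, -1.0]})` returns "entailment", since its summed loss (0.3) is lower than that of "contradiction" (1.5). Because the sum is not length-normalized, labels that add more tokens to the prompt accumulate more loss, which is one reason results can be sensitive to the prompt wording.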
Baselines The XGLM 1.7B and XGLM 7.5B models are used as the baselines in the classification experiments. We reproduce the XGLM evaluation based on the methodology by Lin et al. (2022) and use the model weights and code available in the fairseq library (Ott et al., 2019). We select prompts according to the templates reported by Lin et al. (2022). Prompts for non-English languages are automatically translated with Google Translate.
Results Table 7 presents the classification results averaged across languages. The "✗" tag marks k-shot settings not reported by Lin et al. (2022). We do not run these settings, for reproducibility purposes and fair comparison. The results by Lin et al. (2022) are reproduced in the zero-shot setup, and some scores are even slightly higher. However, not all results are reproduced, e.g., PAWSX and XNLI. We attribute this to potential differences in the translated prompts.
Overall, we observe that mGPT 1.3B is comparable with XGLM 1.7B while having fewer weights and being pretrained on twice as many languages. mGPT 13B performs better than XGLM 7.5B in the zero-shot setting on all tasks except XNLI. At the same time, it lags behind in the few-shot settings, outperforming XGLM 7.5B only on the XNLI and PAWSX tasks. Comparing the performance across languages, we find that English receives the highest accuracy for all tasks. The mGPT 1.3B and mGPT 13B models show high accuracy for the Austronesian, Dravidian, Japonic, Germanic, and Romance language families. Only the Afro-Asiatic family gets low accuracy. The mGPT models perform better than the XGLM counterparts for Austronesian, Koreanic, and Romance languages.
Our results on hate speech detection are consistent with Lin et al. (2022). The performance is slightly better across the five languages but still close to random guessing (see Table 8). The manual analysis shows that the behavior is sensitive to the input prompts, most notably for Polish. Increasing the number of demonstrations can lead to performance degradation on some classification tasks for both mGPT and XGLM.

Sequence Labeling
Tasks The sequence labeling tasks include named entity recognition (NER) and part-of-speech tagging (POS) from the XGLUE benchmark (Liang et al., 2020). To address other medium-resource and resource-lean languages, we use the Universal Dependencies treebanks (UD; Nivre et al., 2016) to evaluate POS-tagging in Armenian, Belarusian, Buryat, Kazakh, Tatar, Ukrainian, and Yakut.
Method We use a modified approach to the sequence labeling tasks compared to §4.2.1. Given a sentence of n words, we iteratively predict the label for each word x_i using the preceding words x_{<i} and their predicted labels l_{<i} as the context, following the template "x_{<i} l_{<i} x_i _", where i is the current token index and "_" is a placeholder. The only exception is the first token, where the context consists of the token itself. The placeholder is filled with each possible target label l ∈ L at each step. We select the label with the lowest sum of losses per token in the resulting string. The experiments are run in the zero-shot and 4-shot settings.

Example Consider an example for the POS-tagging task "I [PRON] WANT [VERB] IT [PART] . [PUNCT]", which requires four steps. First, we combine the placeholder in the string "I _" with each possible POS tag and select the most probable candidate. Next, we repeat the procedure for "I l_1 WANT _", where l_1 is the tag selected for "I", and so on.
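The iterative procedure can be sketched with a pluggable scoring function. In the sketch, `score_fn` stands in for the LM loss of a candidate string; the toy scorer, its lexicon, and the bracketed tag format are hypothetical stand-ins that only serve to make the example runnable.

```python
def tag_sentence(words, labels, score_fn):
    """Greedy left-to-right labeling: fill the placeholder with each label, keep the best."""
    context = ""
    predicted = []
    for word in words:
        prefix = f"{context} {word}".strip()
        # Lower score means lower summed loss for the candidate string.
        best = min(labels, key=lambda label: score_fn(f"{prefix} [{label}]"))
        predicted.append(best)
        context = f"{prefix} [{best}]"
    return predicted


# Toy stand-in for the LM: zero loss when the last word/tag pair matches a lexicon.
TOY_LEXICON = {"I": "PRON", "WANT": "VERB", "IT": "PART", ".": "PUNCT"}


def toy_score(candidate: str) -> float:
    *_, word, tag = candidate.split()
    return 0.0 if tag == f"[{TOY_LEXICON.get(word)}]" else 1.0
```

Calling `tag_sentence(["I", "WANT", "IT", "."], ["PRON", "VERB", "PART", "PUNCT"], toy_score)` walks through the four steps of the example: each step scores one candidate string per label, so a sentence of n words with |L| labels costs n × |L| model evaluations, and an early mistake propagates into the context of all later steps.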
Baselines We use results reported in Liang et al. (2020) as the baselines: M-BERT, XLM-R, and Unicoder (Huang et al., 2019). Note that the baselines [...]
Table 10: Accuracy scores (%) for XGLUE and Universal Dependencies POS-tagging by language. mGPT models are evaluated in the 4-shot setting. The best score is put in bold, the second best is underlined.
XGLUE
Model    ar   bg   de   el   en   es   fr   hi   it   nl   pl   pt   ru   th   tr   ur   vi   zh   Avg.
Random   6.5  6.5  6.0  5.2  4.4  5.7  5.5  6.7  6.6  6.6  5.9  4.7  6.0  6.4  6.8  1.2  7.0  7.1  5.8
CIS & Low-Resource UD
Model    be   bxr  hy   kk   sah  tt   uk
Random   1.3  5.7  5.9  [...]

NER Results
Counterintuitively, Table 9 shows that mGPT 1.3B outperforms mGPT 13B on all languages. In the 4-shot setting, both mGPT models fall behind the finetuned models but significantly outperform random guessing. Per-language analysis shows a large gap between English and the other languages (for mGPT 13B, the F1-score on English is more than twice as high as for any other language), while both models perform the worst on German. This pattern coincides with the baseline results. In addition, while the F1-score of mGPT 1.3B exceeds the 10 percent threshold for all languages, this is not the case for mGPT 13B.
POS-tagging Results POS-tagging results for the XGLUE benchmark and resource-lean languages are presented in Table 10. Similarly to the NER task, mGPT 1.3B outperforms mGPT 13B in practically all languages except for Italian. On average, mGPT 1.3B achieves an accuracy score of 0.24, while mGPT 13B only scores 0.21. These results are still far behind fine-tuned models; however, they are significantly higher than random guessing. Analyzing the results for the low-resource languages, it can be seen that mGPT 1.3B performance is comparable with its performance on XGLUE, while the mGPT 13B scores are lower. We evaluate the sequence labeling tasks using the XGLUE code (github.com/microsoft/XGLUE).

Knowledge Probing
Method We probe our models for factual knowledge in 23 languages using the mLAMA dataset (Kassner et al., 2021). The task is to complete a knowledge triplet <subject, relation, object> converted to templates for querying LMs. Consider an example from the original LAMA (Petroni et al., 2019) for English, where <Dante, born-in, X> is converted to the template "Dante was born in [MASK]". We follow Lin et al. (2022) to design the probing task. As each such query contains hundreds of negative candidates on average, we limit the number of candidates to three, i.e., one is the ground truth candidate and the other two candidates are randomly sampled from the provided knowledge source. The probing performance is evaluated with precision@1 averaged over all relations per language.
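A minimal sketch of the candidate sampling and the metric is given below. It assumes that the candidate pool for a query is a list of unique strings and that model predictions are already available; the function names and the toy pool in the usage note are hypothetical.

```python
import random


def build_candidates(truth, pool, rng, n_negatives=2):
    """Return the ground truth plus n_negatives randomly sampled wrong candidates."""
    negatives = rng.sample([c for c in pool if c != truth], n_negatives)
    candidates = negatives + [truth]
    rng.shuffle(candidates)
    return candidates


def precision_at_1(predictions, truths):
    """Share of queries whose top-ranked candidate is the ground truth."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)


def precision_at_1_per_language(per_relation):
    """Average precision@1 over relations; per_relation maps relation -> (predictions, truths)."""
    scores = [precision_at_1(p, t) for p, t in per_relation.values()]
    return sum(scores) / len(scores)
```

For the query "Dante was born in [MASK]", `build_candidates("Florence", ["Florence", "Paris", "Rome", "Berlin"], random.Random(0))` yields three candidates that always include "Florence". With three candidates per query, random guessing scores about 0.33, which is the reference point for the reported precision@1 values.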
Results Figure 5 outlines the results for mGPT 1.3B and mGPT 13B. The overall pattern is that the performance is equal to or above 0.6 for Germanic, Romance, Austro-Asiatic, Japonic, and Chinese languages. However, Uralic, Slavic, Koreanic, and Afro-Asiatic languages receive scores lower than 0.5. We also find that scaling the number of model parameters usually boosts the performance for high-resource languages by up to 5 points, while no significant improvements are observed in the other languages. Comparing our results with Lin et al. (2022), we conclude that our models achieve lower performance than XGLM 7.5B in almost all languages and perform on par with GPT3-Curie 6.5B.

External Evaluation
General Language Understanding Scao et al. (2023) compared the performance of BLOOM 176B, mGPT 1.3B, OPT 175B (Zhang et al., 2022), GPT-J 6B (Wang and Komatsuzaki, 2021), and T0 11B (Victor et al., 2022) on a subset of tasks from the SuperGLUE benchmark (Wang et al., 2019) in the zero-shot and one-shot settings. The results of evaluating the models using five prompts are presented in Figure 6. The mGPT 1.3B model has comparable performance despite having fewer weights. In the zero-shot setting, the performance of mGPT 1.3B, BLOOM 176B, OPT 175B, and GPT-J 6B on the considered tasks is above random guessing. We also observe the strong performance of mGPT 1.3B on the Winogender Schema Diagnostics (Ax-g). In the one-shot setting, mGPT 1.3B performs on par with GPT-J 6B, and the resulting variability is significantly reduced across all prompts.
Multilingual Clause-level Morphology The first shared task on Multilingual Clause-level Morphology (Goldman et al., 2022) covers nine languages and includes three sub-tasks: (i) inflection (generating a word form given a lexeme and a set of morphosyntactic features), (ii) reinflection (reinflecting an input sentence according to a given set of morphosyntactic features), and (iii) analysis (detecting a root and its features in an input sentence). Acikgoz et al. (2022) develop a first-place solution based on mGPT 1.3B and the prefix-tuning method, outperforming other solutions and baselines on the third task.

Generation Evaluation
Method We compute seven lexical diversity metrics from Gehrmann et al. (2021) using the mGPT outputs (footnote 11) on 100 test set samples from the story generation task in five languages: English, French, German, Spanish, and Chinese (Chen et al., 2022). The diversity metrics include the Shannon Entropy over unigrams (Entropy 1), the mean segmented type-token ratio over segment lengths of 100 (MSTTR), the ratio of distinct unigrams over the total number of unigrams (Distinct 1), and the counter of unigrams that appear once in the collection of generated outputs (Unique 1).
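The four metrics named above can be sketched over a list of unigram tokens. This is an illustrative reimplementation, not the evaluation code: the base-2 logarithm for the entropy, the dropping of the trailing incomplete segment in MSTTR, and the function names are assumptions, and tokenization of the generated texts is left to the caller.

```python
import math
from collections import Counter


def entropy_1(tokens):
    """Shannon entropy (in bits) of the unigram distribution."""
    counts, total = Counter(tokens), len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distinct_1(tokens):
    """Number of distinct unigrams divided by the total number of unigrams."""
    return len(set(tokens)) / len(tokens)


def unique_1(tokens):
    """Number of unigrams that occur exactly once in the collection."""
    return sum(1 for c in Counter(tokens).values() if c == 1)


def msttr(tokens, segment_len=100):
    """Mean type-token ratio over consecutive full segments of segment_len tokens."""
    starts = range(0, len(tokens) - segment_len + 1, segment_len)
    segments = [tokens[i:i + segment_len] for i in starts]
    if not segments:
        raise ValueError("need at least segment_len tokens")
    return sum(len(set(s)) / segment_len for s in segments) / len(segments)
```

For the toy sequence `["a", "b", "a", "c"]`, the entropy is 1.5 bits, Distinct 1 is 0.75, and Unique 1 is 2. Since all four metrics operate on unigrams, they depend on how a text is split into units, which matters when comparing a logographic script with alphabetic ones.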

Results
The results are presented in Table 11.
The diversity metrics scores for Chinese are the highest, while the mean generated text length is the shortest. This is likely due to its logographic writing. The results for French, German, and Spanish are similar, indicating that mGPT 1.3B generates diverse texts in these languages. Surprisingly, the metrics are lower for English, with the average text length being longer.
Our current natural language generation evaluation approach lacks downstream tasks, which we leave for future work.

Discussion
Our key takeaways on pretraining and evaluating large-scale multilingual autoregressive LMs are summarized below.

Model Scaling Empirical Results
The language modeling results for mGPT 1.3B and mGPT 13B suggest that the model scaling improves its generation abilities for all given languages (see §4.1). However, it does not improve performance on the downstream and probing tasks (see §4.2; §4.3). Overall, the language modeling performance depends on the model size and the pretraining corpus size in a language, and smaller models may better encode linguistic information than larger ones. These findings align with Scao et al. (2023).
Takeaways Our work had been conducted a year before the Chinchilla scaling laws were introduced (Hoffmann et al., 2022). According to the advanced methods of scaling LMs, our pretraining corpus can be sufficiently extended to improve the generalization abilities of the mGPT 13B model. At the same time, the pretraining corpus design can promote the model underfitting and overfitting on particular languages. We believe it can be accounted for by aggregating the language-specific cross-entropy loss and producing language weights similar to Xie et al. (2023).
Footnote 11: We use the generation hyperparameters: temperature = 1, max length = 100, top k = 5, top p = 0.9.

Lack of Data
Empirical Results Another challenging factor is the lack of high-quality data for the low-resource languages. Although mGPT shows promising results on the language modeling and sequence labeling tasks for the underrepresented languages (see §4.1, §4.2), the low amount of evaluation resources limits the scope of analyzing the model generalization abilities. The correlation between the model performance and the amount of pretraining data in a language (see §4.1, and, e.g., Lauscher et al., 2020; Ahuja et al., 2022) further highlights the need for creating text corpora in such languages.
Takeaways The question of addressing the discrepancy in data distribution across the world's languages remains unresolved.Our data collection and filtration approach is equivalent for all considered languages.Extending the language-agnostic heuristics is restrained due to the lack of linguistic expertise.However, we assume that experimenting with the training data for the text quality classifiers can improve the resulting quality of the corpora for the low-resource languages (e.g., training the classifiers on different mixtures of data in the medium and high-resource languages).
As the follow-up work, we release 23 versions of the mGPT 1.3B model continuously pretrained with language modeling objective on monolingual corpora for medium-resource and low-resource languages collected through collaboration with the NLP community.Table 12 summarizes the models by language and the language modeling performance on the held-out monolingual test sets.Examples of the corpora include Eastern Armenian National Corpus (Khurshudyan et al., 2022), OpenSubtitles (Lison and Tiedemann, 2016), and TED talks.Continued pretraining on additional data improves the language modeling performance.

Language Selection
Empirical Results Results of mGPT 1.3B on most of the classification tasks are on par with or better than the results of XGLM 1.7B, given that mGPT covers twice as many languages (see §4.2). However, mGPT underperforms the baselines on several multi-class classification and probing tasks.
Takeaways We find that balancing the pretraining corpus by the language family helps improve the language modeling abilities for underrepresented languages due to their typological similarity with the medium and high-resource languages (see §4.1).However, increasing language diversity can lead to performance degradation because of the curse of multilinguality and a limited model capacity (Conneau et al., 2020).

Tokenization
Empirical Results We conduct an ablation study to analyze the impact of the tokenization strategy on language modeling performance. We find that the considered strategies do not improve the model's perplexity. However, the main drawback of the perplexity-based evaluation is that it only partially assesses the model generalization abilities.
Takeaways The optimal tokenization method and vocabulary size remain an open question, particularly in the multilingual setup (Mielke et al., 2021).There are no established methods for defining the vocabulary size based on the amount of textual data in different languages.Our experiments are limited to a fixed vocabulary size, and we leave further investigation of the tokenization strategies and their configurations for future work.

Empirical results
• Increasing the number of demonstrations does not always lead to improvements but decreases the performance on some downstream tasks (see §4.2.1; §4.2.2). This observation aligns with Lin et al. (2022) and Brown et al. (2020).
• The zero-shot and few-shot performance may not exceed the random guessing on particular tasks, which points to the failure of a model to follow the guidance in the demonstration examples (see §4.2.1; §4.2.2).
• The prompting approach is unstable and hardly universal across languages, as indicated by the model sensitivity to the prompts.
• The mGPT models can assign higher probabilities to the most frequent tag in the input for the sequence labeling tasks (see §4.2.2).

Takeaways
• The stability of the models with respect to the prompts may be improved using prompt-tuning (Liu et al., 2023b) and contextual calibration (Zhao et al., 2021) as shown in §4.4.
• The generalization capabilities of the autoregressive LMs in sequence labeling tasks are an underexplored area. While our LMs achieve results higher than random guessing, the low performance can be attributed to the probability distribution shifts between the pretraining corpora and the prompts. We leave the investigation of the alternative prompt design (Liu et al., 2023a) and structured prediction methods (Liu et al., 2022) for future work.

Conclusion
We introduce the mGPT 1.3B and mGPT 13B models, which cover 61 languages from 25 linguistically diverse language families. Our model is one of the first autoregressive LMs for economically endangered and underrepresented CIS and low-resource languages. The architecture design choices are based on the preliminary tokenization experiments and their perplexity-based evaluation. The model evaluation experiments include language modeling, standardized cross-lingual NLU datasets and benchmarks, world knowledge probing, and social bias tasks. We evaluate the in-context learning abilities in zero-shot and few-shot settings using negative log-likelihood scoring. We present a detailed analysis of the model performance, limitations, and ethical considerations. Despite the room for further quality growth and for addressing the highlighted limitations, the model shows significant potential and can become the basis for developing generative pipelines for languages other than English, especially the low-resource ones. This initiative has been developed for 23 diverse languages through collaboration with the NLP community. We hope to benefit cross-lingual knowledge transfer, annotation projection, and other potential applications for economically challenged and underrepresented languages and diversify the research field by shifting from the Anglo-centric paradigm.

Ethical Statement and Social Impacts
To the best of our knowledge, we present one of the first attempts to address this problem for 20 languages of the Commonwealth of Independent States and the indigenous peoples in Russia.

Energy Efficiency and Usage
Pretraining large-scale LMs requires many computational resources, which is energy-intensive and expensive. To address this issue, we used the sparse attention approach suggested by Brown et al. (2020) and reduced the computational resources required to achieve the desired performance. The CO2 emission of pretraining the mGPT models is computed as Equation 2 (Strubell et al., 2019):
$$\mathrm{CO_2} = \mathrm{PUE} \times \mathrm{kWh} \times I_{\mathrm{CO_2}} \quad (2)$$
The power usage effectiveness (PUE) of our data centers is not more than 1.3, the spent power is 30.6k kWh (mGPT 1.3B) and 91.3k kWh (mGPT 13B), and the CO2 energy intensity (I_CO2) in the region is 400 grams per kWh. The resulting CO2 emission is 15.9k kg (mGPT 1.3B) and 47.5k kg (mGPT 13B). The emission is comparable with a single medium-range flight of a modern aircraft, which usually releases about 12k kg of CO2 per 1k km. Despite the costs, mGPT can be efficiently adapted to user needs via few-shot learning, bringing down potential budget costs in applications across multiple languages, such as generating content, augmenting labeled data, or summarizing news.
The multilingual pretraining saves on data annotation and energy consumption, alleviating the carbon footprint. Model compression techniques, e.g., pruning and distillation, can reduce inference costs.
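
The reported emissions can be checked with a short calculation. The sketch below multiplies the power usage effectiveness, the consumed energy, and the regional CO2 intensity, using the values stated above. It reads the mGPT 13B energy figure as 91.3k kWh, which is the value consistent with the reported 47.5k kg. The function and parameter names are illustrative assumptions, not part of the released code.

```python
def co2_emission_kg(power_kwh, pue=1.3, intensity_g_per_kwh=400.0):
    """Estimate CO2 emission in kilograms as PUE * kWh * I_CO2.

    power_kwh: energy spent on pretraining, in kWh.
    pue: power usage effectiveness of the data center (upper bound of 1.3 here).
    intensity_g_per_kwh: regional CO2 intensity in grams per kWh.
    """
    grams = pue * power_kwh * intensity_g_per_kwh
    return grams / 1000.0  # convert grams to kilograms


# Energy figures from the text, in kWh (13B value assumed to be 91.3k kWh).
emission_1_3b = co2_emission_kg(30_600)
emission_13b = co2_emission_kg(91_300)
print(f"mGPT 1.3B: {emission_1_3b / 1000:.1f}k kg CO2")
print(f"mGPT 13B:  {emission_13b / 1000:.1f}k kg CO2")
```

Because the PUE is stated as an upper bound, the results are best read as upper estimates of the emissions.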

Social Risks of Harm
Stereotypes and unjust discrimination present in pretraining corpora lead to representation biases in LMs. LMs can reflect historical prejudices against disadvantaged social groups and reproduce harmful stereotypes about gender, race, religion, or sexual orientation (Weidinger et al., 2022). We have analyzed mGPT's limitations regarding social risks of harm on the hate speech detection task. Our results are similar to those of Lin et al. (2022) in that the performance is close to random guessing. This may indicate a significant bias in the pretraining corpus, a mutual influence of languages during training, or methodological problems in the test set. We do not claim that our evaluation setup is exhaustive, and we assume that other biases can be revealed through a direct model application or an extended evaluation.

Potential Misuse
The misuse potential of LMs increases with their ability to generate high-quality texts. Malicious users can engage in socially harmful activities that involve generating texts, e.g., spreading propaganda and other targeted manipulation (Jawahar et al., 2020). We recognize that our models can be misused in all supported languages. However, adversarial defense and artificial text detection models can mitigate ethical and social risks of harm. Our primary purpose is to propose multilingual GPT-style LMs for research and development needs, and we hope to work on the misuse problem with other developers and experts in mitigation research in the future.

Figure 1: Number of tokens for each language in the pretraining corpus on a logarithmic scale.

Figure 2: Number of documents for each language in the pretraining corpus on a logarithmic scale.

Figure 4: Family-wise perplexity results. The scores are averaged over the number of languages within each family.

Figure 5: Knowledge probing results for 23 languages. The performance of a random baseline is 0.33.

Figure 6: The SuperGLUE evaluation results in the zero-shot and one-shot settings (Scao et al., 2023).

Low-resource Languages
NLP for resource-lean scenarios is one of the leading research directions nowadays. The topic's relevance has led to proactive research on low-resource languages. Our work falls under this scope, introducing the first autoregressive LM for 61 languages.

Table 4: Comparison of GPT-2 and GPT-3. The mGPT architecture replicates the parameters of GPT-3 1.3B and GPT-3 13B, and uses sparse attention in alternating dense and sparse layers.

Table 6: Prompt examples for each downstream task. The examples are in English for illustration purposes.

Table 8: Accuracy scores (%) on hate speech detection by language. The best score is in bold; the second best is underlined.

Table 9: F1-scores for NER by language. The mGPT models are evaluated in the 4-shot setting. The best score is in bold; the second best is underlined.

Table 11: The results for lexical diversity of generated texts on the GEM story generation task.

Table 12: A list of the mGPT 1.3B models continuously pretrained on monolingual corpora for 23 languages.