The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low quality because they are constructed using semi-automatic procedures. In this work, we introduce the Flores-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are fully aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.


Introduction
Machine translation (MT) is one of the most successful applications in natural language processing, as exemplified by its numerous practical applications and the number of contributions on this topic at major machine learning and natural language processing venues. Despite recent advances in translation quality for a handful of language pairs and domains, MT systems still perform poorly on low-resource languages, i.e. languages without a lot of training data. In fact, many low-resource languages are not even supported by most popular translation engines. Yet, the majority of the world's population speaks low-resource languages and would benefit from improvements in translation quality on their native languages. As a result, the field has been shifting focus towards low-resource languages.
Over the past decade, the research community has made substantial progress on models for low-resource machine translation. Approaches like iterative backtranslation (Sennrich et al., 2015), multilingual machine translation (Johnson et al., 2016; Tang et al., 2020; Fan et al., 2020), and even unsupervised machine translation (Lample et al., 2018; Artetxe et al., 2018) have shown promising results. Beyond modeling, a major challenge for research in low-resource machine translation is evaluation. Low-resource evaluation is critical to the scientific progress of the field, because evaluation enables proper comparison of approaches and, ultimately, a better understanding of what needs further investigation and improvement. Unfortunately, finding high-quality data suitable for the evaluation process is even more difficult in low-resource scenarios.
At present, there are very few benchmarks on low-resource languages. These often have very low coverage of low-resource languages (Riza et al., 2016; Thu et al., 2016; Guzmán et al., 2019; Barrault et al., 2020b; ∀ et al., 2020; Ebrahimi et al., 2021; Kuwanto et al., 2021), limiting our understanding of how well methods generalize and scale to a larger number of languages with a diversity of linguistic features. There are some benchmarks that have high coverage, but these are often in specific domains, like COVID-19 (Anastasopoulos et al., 2020) or religious texts (Christodouloupoulos and Steedman, 2015; Malaviya et al., 2017; Tiedemann, 2018; Agić and Vulić, 2019); or have low quality because they are built using automatic approaches (Zhang et al., 2020; Schwenk et al., 2019, 2021). As a result, it is difficult to draw firm conclusions about research efforts on low-resource MT. In particular, there are even fewer benchmarks that are suitable for the evaluation of many-to-many multilingual translation, as these require multilingual alignment (i.e. having the translation of the same sentence in multiple languages), which hampers the progress of the field despite all the recent excitement on this research direction. As an additional challenge, there are no established practices for how to build such benchmarks. Working with professional translators in low-resource languages is difficult because of their scarce availability, and because it is non-trivial to check the quality of their work (Guzmán et al., 2019).
We present the FLORES-101 benchmark, consisting of 3001 sentences sampled from English Wikipedia and professionally translated into 101 languages. With this dataset, we make several contributions. First, we provide the community with a high-quality benchmark that has much larger breadth of topics and coverage of low-resource languages than any other existing dataset (§4). Second, FLORES-101 is suitable for many-to-many evaluation, meaning that it enables seamless evaluation of 10,100 language pairs. This enables the evaluation of popular multilingual MT systems as well as the evaluation of regionally-relevant language pairs like Spanish-Aymara and Vietnamese-Thai, for example. Third, we thoroughly document the annotation process we followed (§3), helping the community build institutional knowledge about how to construct MT datasets. Fourth, we release not only sentences with their translations but also rich meta-data that will support other kinds of evaluations and tasks, such as document-level translation, multimodal translation, and text classification. Fifth, we propose a BLEU metric based on SentencePiece tokenization (Kudo and Richardson, 2018) (§5) that enables evaluation of all languages in the set in a unified and extensible framework. Finally, we publicly release both the data and the baselines used in our experiments (§6), to foster research in low-resource machine translation and related areas. This paper is organized as follows: In Section 2, we describe related work on constructing evaluation benchmarks in machine translation. In Section 3, we detail the construction process of FLORES-101, from sourcing sentences to translate to defining the translation workflow. Section 4 gives a detailed overview of the sentences, languages, and quality of FLORES-101. In Section 5, we describe our proposed SentencePiece BLEU metric, which unifies and simplifies evaluation. Section 6 uses FLORES-101 to evaluate various public translation models, and breaks down model performance by amount of training data, domain, sentence length, and language family. We present our conclusions in Section 7.

Related Work
A major challenge in machine translation, particularly as the field shifts its focus to low-resource languages, is the lack of availability of evaluation benchmarks. Much recent work has focused on the creation of training corpora (Auguste Tapo et al., 2021; Ali et al., 2021; Adelani et al., 2021; Gezmu et al., 2021; Nyoni and Bassett, 2021; Chauhan et al., 2021) and the development of models (Koneru et al., 2021; Nagoudi et al., 2021; Aulamo et al., 2021), but evaluation is critical to being able to assess and improve translation quality.
Traditionally, the yearly Workshop on Machine Translation (WMT) and its associated shared tasks have provided standardized benchmarks and metrics to the community, fostering progress by providing means of fair comparison among various approaches. Over recent years, the main translation task at WMT has challenged participants with low-resource languages, but the evaluation has been limited to a handful of languages: for example, Latvian in 2017 (Bojar et al., 2017), Kazakh in 2018 (Bojar et al., 2018), Gujarati and Lithuanian in 2019 (Barrault et al., 2019), and Inuktitut, Khmer, Pashto, and Tamil in 2020 (Barrault et al., 2020a). Moreover, these tasks have considered translation to and from English only, while the field has recently been focusing on large-scale multilingual models (Johnson et al., 2016; Aharoni et al., 2019; Freitag and Firat, 2020; Fan et al., 2020).
To date, the largest resource of parallel data which can also be used for evaluation purposes is OPUS (Tiedemann, 2012), which is itself a collection of publicly available parallel datasets. While OPUS has by far the largest coverage of languages, particularly to and from English, it consists of a mixture of manually translated and mined data, which results in a large variety of datasets and domains with varying levels of quality. For instance, OPUS contains parallel data translated by humans coming from operating system handbooks like Ubuntu, or parallel data from religious documents (Liu et al., 2021) like Jehovah's Witness magazines (Agić and Vulić, 2019) and the Bible. These have recently been expanded to include more languages (Nicolai et al., 2021) as well. OPUS also contains a variety of other automatically-aligned datasets, such as various versions of TED talks, which are usually of lower quality (Ye et al., 2018; Zhang et al., 2020; Fan et al., 2020). Similarly, OPUS contains large parallel datasets generated via automatic filtering and alignment methods, such as WikiMatrix (Schwenk et al., 2021), ccMatrix (Schwenk et al., 2019), ccAligned (El-Kishky et al., 2020), and ParaCrawl (Esplà et al., 2019), which contain noisy translations. While these may be utilized for training, they are clearly unsuitable for evaluation purposes due to automatic alignment.

Figure 1: Overview of the FLORES-101 construction process: (1) sourcing sentences to translate from English Wikipedia, (2) designing pilot studies to define efficient and effective translation and evaluation processes, (3) launching the actual translation across all languages. The last stage is iterative, as translations may go through additional rounds of re-translation if the evaluation indicates that quality is insufficient; see Fig. 2 for further details.
There are other datasets for evaluation purposes, such as Flores v1.0 (Guzmán et al., 2019), LORELEI (Strassel and Tracey, 2016), ALT (Thu et al., 2016; Riza et al., 2016; Ding et al., 2016) and TICO-19 (Anastasopoulos et al., 2020), as well as datasets for specific languages such as Igbo (Ezeani et al., 2020) and Fon (Dossou and Emezue, 2020). These are similar to FLORES-101 because they focus on low-resource languages. However, the language coverage of these datasets is much smaller. Among these, only TICO-19 is suitable for multilingual machine translation, but its content is centered around COVID-19, unlike the much broader coverage of topics offered by FLORES-101.
Lastly, the current literature in low-resource translation provides very scarce guidance in terms of best practices and methodology to construct parallel datasets and perform quality assurance. The much smaller pool of available translators for low-resource languages is problematic because it makes the annotation process much more susceptible to variance in the proficiency level of the annotators. In Flores v1.0 (Guzmán et al., 2019), a mixture of human and automatic checks was used to filter and rework problematic translations. In TICO-19 (Anastasopoulos et al., 2020), a two-step translation and quality assurance process was followed. Despite its technical complexity, the annotation process for benchmark sets is in fact seldom documented in technical reports. This is still largely uncharted territory. However, there are many practical questions related to setting up and ensuring the quality of large-scale translation campaigns targeting low-resource languages, which may have very few annotators. For example: What guidelines should be considered for translators and evaluators? What workflow is most efficient and effective? What automatic checks should be put in place to minimize human intervention? When can a dataset be declared to have reached a sufficient level of quality to be released? In this study, we document our choices and processes in the hope of building and consolidating best practices of dataset construction for the machine translation community.

Dataset Construction
The construction of FLORES-101 is intended to accomplish several goals: (i) to enable the evaluation of many-to-many multilingual models, meaning the evaluation of translations from any language to any other language including very long-tail languages; (ii) to enable other kinds of evaluation beyond machine translation, such as document-level translation, multi-modal translation, multilingual classification, and so on; (iii) most importantly, to build a high-quality evaluation benchmark.
To achieve the above goals, the overall construction process consisted of three phases, as outlined in Fig. 1. First, we extracted sentences to translate from English Wikipedia. Second, we designed and ran pilot experiments to determine the translation process, and finally we launched the actual translation workflow for over 100 languages. In this section, we describe the process in detail. The reader who is more curious about general statistics can safely skip this section and go to Section 4.

Sourcing Sentences
We describe how the domains and sentences in FLORES-101 were selected. A high-level summary of the dataset can be found in Table 1.
Original Source. All source sentences were extracted from multiple Wikimedia sources, as this is a repository of text that is public and freely available under permissive licensing, and covers a broad range of topics. Although Wikipedia is currently supported in more than 260 languages[1], several low-resource languages have relatively few articles containing well-structured sentences. Moreover, translating a few hundred sentences for several thousand different language pairs would be infeasible, at the very least because of the lack of qualified professional translators who could read both the source and target side.
Instead, we opted to source all sentences from English Wikipedia, while considering a broad set of topics that could be of general interest regardless of the native language of the reader. In particular, we collected a third of the sentences from Wikinews[2], which is a collection of international news articles, a third from Wikijunior[3], which is a collection of age-appropriate nonfiction books for children from birth to age 12, and a third from WikiVoyage[4], which is a travel guide with a collection of articles about travel tips, food, and destinations around the globe. By translating the same set of English sentences into more than a hundred languages, we enable evaluation of multilingual MT, with the only caveat that source sentences not in English are produced by human translators. While translationese (overly literal or awkward translations) has known idiosyncrasies (Zhang and Toral, 2019), we conjecture that these effects are rather marginal when evaluating models in low-resource languages, where current MT systems produce many severe mistakes. We believe the benefits of many-to-many evaluation, which supports the measurement of traditionally neglected regionally-relevant pairs such as Xhosa-Zulu, Vietnamese-Thai, and Spanish-Aymara, largely outweigh the risk of evaluating translationese.
[1] https://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics
[2] https://en.wikinews.org/wiki/Main_Page
[3] https://en.wikibooks.org/wiki/Wikijunior
[4] https://en.wikivoyage.org/wiki/Main_Page

Sentence Selection. The sentence selection process consisted of selecting an article at random from each source, and then manually selecting a few (typically between 3 and 5) contiguous sentences from each article, avoiding segments with very short or malformed sentences. To avoid bias coming from the document structure, we carefully selected one paragraph per document, from either the beginning, middle, or end of the article. We balanced the location selection to be equally distributed across the whole corpus: roughly one third of paragraphs were sampled from the beginning of the article, one third from the middle, and one third from the end. For each sentence, we also extracted the Wikipedia URL and topic, and noted boolean flags to indicate whether the sentence contained entities linked to other Wikipedia pages and images. The selection process was performed by 10 different annotators in our lab (6 male and 4 female), with different roles in research (researchers, i.e. scientists and engineers, and program/project managers) and originally coming from different regions of the world: East Asia, South Asia, Southern Europe, Latin America, and North America. We manually labeled all sentences with a more detailed sub-topic, one of 10 possibilities: crime, disasters, entertainment, geography, health, nature, politics, science, sports, and travel. Table 1 reports basic statistics of the originating English sentences.
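The balanced selection of paragraph positions described above can be sketched as a simple greedy sampler. This is only an illustration: the paper reports that the final distribution was balanced, so the greedy scheme and the function name are assumptions.

```python
import random

def pick_paragraph_position(counts):
    """Greedily pick beginning / middle / end so selections stay balanced.

    counts: dict mapping each position to how many paragraphs were already
    taken from it. Sampling among the least-used positions keeps the three
    counts within one selection of each other at all times.
    """
    least = min(counts.values())
    candidates = [pos for pos, c in counts.items() if c == least]
    choice = random.choice(candidates)
    counts[choice] += 1
    return choice

counts = {"beginning": 0, "middle": 0, "end": 0}
for _ in range(99):
    pick_paragraph_position(counts)
# after 99 selections, each position has been used exactly 33 times
```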
Since several contiguous sentences are extracted from the same article, and since we also provide the corresponding URL, we support evaluation of machine translation at the document level. With the additional meta-data, we also enable evaluation of multimodal machine translation.

Pilot Experiments
Obtaining high translation quality in low-resource languages is difficult because the translation job relies on the skill of a small set of translators. If one translator is not perfectly fluent or uses a different local convention for that language, this could render the quality of the dataset insufficient or inconsistent for that language. Here, we describe the process we followed to define an efficient and high-quality translation workflow. To this end, we report two pilot experiments we used to determine how we should proceed with the creation of this large-scale evaluation dataset.

Providers and Workflows
To ensure the best possible level of quality for our translations, we designed two pilots aimed at determining the best workflow to follow for translating hundreds of languages. The first pilot experiment was meant to select the best translation providers for each language, and the second to determine the best translation quality assurance workflow.
Language Service Providers. As a starting point, let us assume that each language can be translated by K different Language Service Providers (LSPs) and that they all charge the same price for translating a sentence. We randomly selected 100 sentences and 8 language pairs, and assigned each language to at least two LSPs. We then used another LSP to evaluate all translations.
Based on human assessment of translation quality, we selected the two LSPs that produced the highest quality translations. We chose two translation LSPs so that our translation process would not rely entirely on a single party, while reducing the communication overhead created by working with too many external parties.
Translation and Quality Assurance Workflow. Despite having reliable translation LSPs, we need to ensure that each translation conforms to the highest level of quality required by a benchmark. Therefore, we split our workflow into two parts: translation (which includes editing), performed by an initial LSP, and quality assurance (QA), performed by an independent LSP. After the QA process, a translation might need re-translation or minimal editing to improve its quality. Here, we explored the best workflow for when re-translation is needed. Assuming there are two translation LSPs, A and B, and a separate QA LSP C, we can have two possible workflows: (i) A-C-B, where B re-translates translations produced by A and flagged by C; (ii) A-C-A, an alternative and simpler workflow where the same LSP takes care of both translation and re-translation for a given language, and each translation LSP handles half of the languages.
The advantage of workflow (i) is that the re-translation process is the least biased, particularly for low-resource languages, where the re-translator and the translator could otherwise be the same person. On the other hand, the re-translator has less context, and the workflow has higher complexity because data comes in and out of LSPs at different times, making the whole process more error-prone.
We tested both workflows and observed negligible differences between the two. We therefore chose workflow (ii), i.e. A-C-A, with the same LSP taking care of both translation and re-translation, as it is operationally simpler.

Automatic Translation Quality
The second pilot experiment aimed to investigate how to assess translation quality automatically.
Implemented Checks. We implemented several checks to ensure the first round of translations was of acceptable quality: (i) language identification; (ii) checking whether the translation is a copy of the source sentence; (iii) checking whether the translation has a significantly different length; (iv) checking translation fluency according to a language model; and (v) checking whether the translation is a copy of the translation produced by publicly available translation engines. Among all checks, we found that (v) was the most significant issue, despite formulating clear guidelines forbidding the use of translation engines. This is important, as we want our translations to be as unbiased as possible. Relying on verbatim or post-edited translations from online engines would be misleading, and would give those engines an unfair advantage when using a reference-based automatic metric for comparison.
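The two simplest checks, (ii) and (iii), can be sketched as follows. The length-ratio threshold and the function name are illustrative assumptions; the paper does not publish its exact parameters. Checks (i), (iv), and (v) additionally require a language identifier, a language model, and access to commercial engine outputs, respectively.

```python
def basic_checks(src, translation, max_len_ratio=2.0):
    """Flag a translation that is a verbatim copy of the source (check ii)
    or whose length differs drastically from the source (check iii).

    Returns a list of issue labels; an empty list means both checks pass.
    """
    issues = []
    # (ii) translation is a copy of the source sentence
    if translation.strip() == src.strip():
        issues.append("copy_of_source")
    # (iii) translation length differs significantly from the source
    ratio = len(translation.split()) / max(len(src.split()), 1)
    if ratio > max_len_ratio or ratio < 1.0 / max_len_ratio:
        issues.append("length_mismatch")
    return issues
```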
To address this issue, we proposed a heuristic to detect and reject professional translations that are likely copies from translation engines, based on their sentence-level similarity. Moreover, when the translations of two different engines are available, we check whether a sentence is very similar to the output of one specific translation engine while being different from the outputs produced by other translation engines.
The heuristic is as follows: let x be the translation produced by a translation LSP, y_A and y_B be the translations produced by translation engines A and B, and let spBLEU(x, y) be the sentence-level SentencePiece BLEU (cf. Sec. 5) between sentence x and the reference y. Then, we declare a translation to be a copy if spBLEU(x, y_A) - spBLEU(x, y_B) > 20 and spBLEU(x, y_A) > 50, when the language in question is supported by both engines A and B; and if spBLEU(x, y_A) > 50, when the language is only supported by translation engine A.

Figure 3: Distribution of sentence-level spBLEU against commercial translation engines. In gray, we show the sentence-level spBLEU of a language that displays indications of copying from a commercial translation engine: a large number of sentences have very high BLEU scores, mainly in the 80 to 100 BLEU bucket. In contrast, the bars in blue indicate a language that does not experience this issue. We discuss the spBLEU metric in greater detail in Section 5.

Table 2: Translation process statistics. To ensure high quality, our translation workflow includes translation and re-translation steps. We break down the amount of re-translation required; completing one language takes on average two months.
The cutoff values of 50 and 20 were based on an analysis of the distribution of scores for translations in tens of languages, where we used clustering techniques to determine the right thresholds.
Moreover, we established that any set of translations with more than 10% of sentences violating the above criteria would need to be re-translated prior to performing any subsequent human evaluation. We show in Figure 3 an example of a language that passed this test and one that did not. Thanks to these automatic checks, we reduced the amount of copying from popular translation engines, streamlined the translation workflow before human evaluators assessed quality, and fully automated the process of translation, evaluation, and re-translation. This is described in the next section.
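The copy-detection rule and the 10% re-translation trigger described above can be sketched as follows. The function names are illustrative; the spBLEU values would come from the metric described in Section 5.

```python
def is_engine_copy(spbleu_a, spbleu_b=None, margin=20, floor=50):
    """Flag a professional translation as a likely copy of engine A's output.

    spbleu_a, spbleu_b: sentence-level spBLEU of the LSP translation against
    the outputs of engines A and B. Pass spbleu_b=None when only engine A
    supports the language.
    """
    if spbleu_b is None:
        # language supported by engine A only
        return spbleu_a > floor
    # supported by both engines: suspiciously close to A but far from B
    return spbleu_a - spbleu_b > margin and spbleu_a > floor

def needs_retranslation(copy_flags, max_fraction=0.10):
    """Re-translate the whole set if more than 10% of sentences are flagged."""
    return sum(copy_flags) / len(copy_flags) > max_fraction
```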

Translation and Evaluation
We describe the final workflow for collecting data for all languages in FLORES-101. We start with how we decide when a language is ready to be included in FLORES-101: the final translation quality score. Then, we detail the full translation process, including automatic and human quality checks.
Translation Quality Score. How do we know if the translations are good enough to include in FLORES-101, and how do we know when a language has completed translation? Before we summarize the workflow used to produce translations, we briefly discuss how we measure translation quality. We assess translation quality through a Translation Quality Score, calculated per language on a 0 to 100 scale. The translation quality score is determined based on the number of errors identified by the evaluation LSPs. The following error types are examined: grammar, punctuation, spelling, capitalization, addition or omission of information, mistranslation, unnatural translation, untranslated text, and register. Each error is also assigned a severity level of minor, major, or critical. The overall final score is determined by tallying these different error types. We encouraged evaluators to pay particularly high attention to unnatural translation errors. Based on our pilot experiments, we set the acceptable translation quality score to 90%.
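One plausible way to turn such error annotations into a 0 to 100 score is a severity-weighted tally, sketched below. The paper states that the score is derived by tallying errors by type and severity but does not publish the exact formula, so the weights and the normalization here are illustrative assumptions, not the actual scheme.

```python
# Hypothetical severity weights; the paper does not publish its formula.
SEVERITY_WEIGHT = {"minor": 1, "major": 3, "critical": 10}

def translation_quality_score(errors, num_sentences):
    """Tally (error_type, severity) annotations into a 0-100 score.

    errors: list of (error_type, severity) pairs reported by the
    evaluation LSP, e.g. ("mistranslation", "critical").
    """
    penalty = sum(SEVERITY_WEIGHT[severity] for _, severity in errors)
    # normalize by dataset size so longer jobs are not penalized more
    return max(0.0, 100.0 * (1 - penalty / num_sentences))
```

Under this scheme, a 1000-sentence job with 40 minor and 20 major errors would score 100 * (1 - 100/1000) = 90, right at the acceptance threshold.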
Translation Workflow. The overall translation workflow is depicted in Figure 2. For each language, all source sentences are sent to one of the translation LSPs. Once sentences are translated, the data is sent to different translators within the LSP for editing, and then moves on to automated quality control steps. An additional verification step is added to this specific workflow, comparing the translated data to translations from commercial engines, as previously mentioned. If any of the checks fail, the LSP has to re-translate until all verifications pass. Afterwards, translations are sent to an evaluation LSP that performs quality assessment, providing a translation quality score and constructive linguistic feedback at both the sentence and language levels. If the score is below the accepted threshold, the translations, together with the assessment report, are sent back to the translation LSP for re-translation. If the initial score is below another, lower threshold (associated with good translation quality), the re-translated data is evaluated by humans one more time. We summarize in Table 2 the overall statistics of the translation process. We include the guidelines used for quality evaluation in the Appendix.

FLORES-101 At a Glance
In this section, we analyze FLORES-101. We provide a high-level comparison of FLORES-101 with existing benchmarks, then discuss the sentences, languages, and translation quality in detail.

Comparison with Existing Benchmarks
We compare FLORES-101 with several existing benchmarks, summarized in Table 3. FLORES-101 combines large language coverage with topic diversity, support for many-to-many evaluation, and high-quality human translations (i.e. produced with no automatic alignment). Further, FLORES-101 adds document-level evaluation and supports multimodal translation evaluation.

Sentences in FLORES-101
Table 1 provides an overview of FLORES-101. The total dataset translates 3001 sentences into 101 languages. On average, sentences contain around 20 words. These sentences originate from 1,175 different articles in three domains: WikiNews, WikiJunior, and WikiVoyage. On average, 3 sentences are selected from each document, and documents are divided into dev, devtest, and test sets. The articles are rich in metadata: 40% of articles contain hyperlinks to other pages, and 66% of articles contain images. We manually classify the content of the sentences into one of 10 broader topics and display the distribution. Overall, most sentences are about world travel (sourced from WikiVoyage), though there are also a large number of sentences about science, politics, and crime.

Languages in FLORES-101
We summarize all 101 languages in FLORES-101, along with their scripts and language families, in Table 4. We note that language classification is a complex task with different classification hierarchies. We chose language families at a reasonable level of detail, i.e. fine enough that languages can be grouped with a few other languages, but not so fine that each language is in its own group. Overall, our selected languages cover a large percentage of people all over the world, with a large diversity of scripts and families. Most of these languages are spoken by millions of people, despite being considered low-resource in the research community.

Benchmark / number of languages:
(Ebrahimi et al., 2021): 10
ALT (Riza et al., 2016): 13
Europarl (Koehn, 2005): 21
TICO-19 (Anastasopoulos et al., 2020): 36
OPUS-100 (Zhang et al., 2020): 100
M2M (Fan et al., 2020): 100
FLORES-101: 101

Table 3: Comparison of Various Evaluation Benchmarks. We compare FLORES-101 to a variety of popular, existing translation benchmarks, indicating language coverage, topic diversity, whether many-to-many translation is supported, whether the translations are created by humans, and whether the tasks of document-level translation or multimodal translation are supported.
Figure 4: Translation Quality Score across Languages. We require the final translation quality score to be above 90% before a language's translations are considered of sufficient quality to include in FLORES-101. We depict the score distribution for all languages in FLORES-101.
In Table 4, we also indicate the resource level of each FLORES-101 language. The amount of data available for a language is difficult to denote accurately, in part because quality is very important and thus the amount of data does not necessarily reflect its usefulness. Further, some data may be proprietary, and new datasets for new languages are actively being created by the research community. Thus, we report the amount of data to/from English and the amount of monolingual data available in OPUS, a public repository for multilingual data. OPUS is a heavily used resource, and is itself a collection of a large number of research datasets produced by the community over decades. The majority of languages have both bilingual data through English and monolingual data, though a number of languages have less than 100K sentences through English. Many of those also have no monolingual data available, making them truly low-resource. Examples include Shona and Nyanja.

Translation Quality
The translation quality score across all languages is depicted in Figure 4. All 101 languages in FLORES-101 meet our threshold of 90% quality based on human evaluation. Note that several languages beyond our set of 101 were considered, but were unable to meet the bar even after rounds of re-translation. Overall, about 50% of languages have fairly high quality (above 95%), with only a few near the 90% threshold boundary. Even low-resource languages like Lao and Zulu can score well on the quality metric.
We break down the main translation errors observed based on the quality assessments and re-translations. The largest error category across all languages was mistranslation, a broad category that generally denotes that the source text was not translated faithfully and the translation renders an incorrect meaning in the target language. Examples of mistranslation include (but are not limited to) incorrect interpretation of the source text, overly literal translations, mistranslations of phrasal verbs, and lack of disambiguation of ambiguous terms. For example, writing ...recommends hand washing with hand sanitizer rubs instead of ...recommends hand washing over hand sanitizer rubs would represent a mistranslation. Error categories with few errors include register, grammar, and punctuation.
We also examined whether certain domains are more difficult to translate than others. Within a language, we did identify variation in the percentage of errors contributed by domain (often one domain could contribute up to 10% more errors than the others), but across languages, there was no clear trend. Overall, it appears that all domains are challenging to translate for human translators.

Table 4: Languages in FLORES-101. We include the ISO 639-3 code, the language family, and the script. Next to each language family, we include more fine-grained subgrouping information. We also include the amount of resources available in OPUS at the time this report was written. The parallel datasets were used to train the baseline models; the monolingual datasets were only used to train the SentencePiece model (see §5).
Metric: SentencePiece BLEU

How do we evaluate the performance of translation models at the scale of 101 languages? In this section, we propose the SentencePiece BLEU metric and analyze its performance across languages compared to various alternatives.

Motivation
Automatic evaluation of translation quality is an active field. Each year, the WMT Metrics shared task seeks to determine the automatic metric that best correlates with human evaluations (Mathur et al., 2020). While many metrics have been proposed over the years, the analysis has only included a handful of low-resource languages. Further, despite the progress in automatic metrics, the common practice is to use BLEU (Papineni et al., 2002) when reporting results. Unfortunately, using BLEU directly is suboptimal, as it relies on n-gram overlap, which is heavily dependent on the particular tokenization used: tokenizing more aggressively can artificially raise the score and make it difficult to compare across reported results.
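To see why tokenization matters so much, consider a toy version of BLEU's modified n-gram precision. This is an illustrative sketch, not the full BLEU computation (no brevity penalty, no geometric mean over n-gram orders), and the example strings are our own; it shows how a more aggressive (here, character-level) segmentation inflates overlap for the same hypothesis/reference pair.

```python
from collections import Counter

def ngram_precision(hyp_tokens, ref_tokens, n=1):
    """Fraction of hypothesis n-grams that also appear in the reference,
    with clipped counts, as in BLEU's modified n-gram precision."""
    hyp = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return overlap / total if total else 0.0

hyp = "the cat's toy"
ref = "the cat's toys"

# Whitespace tokenization: "toy" vs "toys" is a full token mismatch.
p_word = ngram_precision(hyp.split(), ref.split())
# Character-level segmentation recovers far more overlap for the same pair.
p_char = ngram_precision(list(hyp), list(ref))

print(p_word, p_char)
```

The word-level precision penalizes the morphological difference as a whole-token miss, while the character-level score does not; neither is "wrong", which is exactly why scores computed under different tokenizations are not comparable.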
Making BLEU scores comparable through equivalent tokenization schemes has been a long-standing challenge for the translation community, and has been partially addressed by sacrebleu (Post, 2018). Previous standards usually leverage the mosestokenizer, which is the default tool in sacrebleu. However, for many languages, these existing tools and tokenizers are not sufficient.

For example, mosestokenizer supports a limited number of languages (often activated with the -tok flag). While its default tokenization rules may operate reasonably for European languages, they do not extend to global support. For example, white-space tokenization is insufficient for languages like Burmese or Khmer, which do not segment words with white space. Other languages, like Arabic, are morphologically rich, which has incentivized the creation of BLEU variants (Bouamor et al., 2014). To further complicate matters, some languages like Hindi and Japanese already have custom tokenizers that are used when computing BLEU, although these appear scattered in various publication footnotes, while for others, no such special tokenizers have been developed yet. Further, developing tokenizers for each language of interest is a challenging effort (Dossou and Emezue, 2021; Li et al., 2021) that is difficult to scale.

Ideally, we would like an automatic evaluation process that is robust and simple, and that can be applied to any language without the need to specify a particular tokenizer, as this makes it easier for researchers to compare against each other. We would also like our automatic evaluation to support future languages: as translation quality continues to improve, the community will naturally produce models for more and more languages.

SentencePiece BLEU
Towards this goal, we trained a SentencePiece (SPM) tokenizer (Kudo and Richardson, 2018) with 256,000 tokens using monolingual data (Conneau et al., 2020; Wenzek et al., 2019) from all the FLORES-101 languages. SPM learns subword units from training data and does not require pre-tokenization: its logic is language-independent, as it treats all sentences as sequences of Unicode characters. Given the large amount of multilingual data and the large number of languages, this essentially provides a universal tokenizer that can operate on any language.
Training SPM. One challenge is that the amount of monolingual data available differs greatly across languages, an effect that is extreme when considering low-resource languages. Languages with small quantities of data may not reach the same level of coverage in subword units, or may have too few sentences to represent a diverse enough set of content. To address this, we train our SPM model with temperature upsampling, similar to Conneau et al. (2020), so that low-resource languages are well represented. If a new language is added to FLORES-101 in the future and this tokenizer does not support its script, we can easily add new tokens to encode it.

Table 5: Spearman correlation of spBLEU, BLEU, and char-BLEU. We evaluate three sets of languages (En-XX). The models evaluated are derived from our baselines (discussed in Section 6). In the top section, we evaluate languages that typically use the standard mosestokenizer. In the bottom section, we evaluate languages that have their own custom tokenization.
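Temperature upsampling can be sketched as follows. This is an illustrative implementation of the general idea, not the exact recipe used for the FLORES-101 tokenizer: sampling probabilities are flattened as p_i ∝ q_i^(1/T), where q_i is a language's share of the data (Conneau et al. express the same family of schemes as an exponent α, with T = 1/α); the temperature value and the sentence counts below are our own illustrative choices.

```python
def upsampling_probs(sizes, T=5.0):
    """Temperature-based sampling probabilities: p_i ∝ q_i**(1/T), where
    q_i is each language's share of the monolingual data. T=1 reproduces
    the raw data distribution; larger T flattens it, boosting the chance
    that low-resource languages contribute subword units."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical monolingual corpus sizes, in sentences.
sizes = {"en": 1_000_000_000, "fr": 100_000_000, "zu": 1_000_000}
probs = upsampling_probs(sizes, T=5.0)
print(probs)
```

With T=5, the Zulu sampling probability rises far above its raw data share, while English is sampled less often than its share, which is precisely the effect needed for low-resource coverage in the shared vocabulary.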
Computing spBLEU. Given this SPM tokenizer, we compute BLEU by tokenizing both the system output and the reference, and then calculating BLEU in the space of sentence pieces. We dub this metric sentence-piece BLEU, denoted spBLEU. It is integrated into sacrebleu for ease of use as the spm tokenizer.

Experiments and Analysis
spBLEU correlates with BLEU on languages supported by mosestokenizer. We first validate the spBLEU metric on languages that mosestokenizer handles well: Spanish, Italian, and French. As shown in Table 5 (top), spBLEU correlates very well with BLEU on these languages.
spBLEU is better than char-BLEU when custom tokenization is needed. Next, we examine the performance of spBLEU on languages where custom tokenizers are often used, or special tokenization rules are written. We look at three languages: Chinese, Hindi, and Tamil. Chinese is supported by mosestokenizer with special rules. Hindi and Tamil have a popularly used community tokenizer from IndicNLP. While these language-specific tokenizers are excellent, the challenges of scale and comparability remain: there are often several competing tokenizers, and tokenizers need to be developed for each language we want to evaluate. A possible alternative is to eliminate tokenization altogether and evaluate characters directly, computing character-level BLEU (char-BLEU).

We examine whether spBLEU is a better alternative to char-BLEU for languages that currently use special tokenizers. As shown in Table 5 (bottom), spBLEU correlates more strongly with custom-tokenizer BLEU than char-BLEU does. While developing custom tokenizers for specific languages produces more accurate tokenization, spBLEU is a good alternative for comparability and scalability across a large number of languages.
spBLEU has similar performance to BLEU for model selection. Next, we verify that spBLEU can be used to compare the quality of models for model selection purposes. This is important, as automatic metrics are often used in an outer loop of the training process to select various hyper-parameters, such as model size, dropout rate, and learning rate. Thus, we replicate the selection of various models using spBLEU instead of BLEU, experimenting on three language directions: English to Pashto, English to Russian, and English to Chinese. We choose these languages because they were part of the WMT 2020 human evaluations, and thus we know the ground-truth ranking. We evaluate five models for Pashto, eight for Russian, and seven for Chinese. We focus on directions out of English, as mosestokenizer works well on English.

For each of the three language directions, we compute spBLEU and BLEU (with language-specific tokenizers) between the output of each system and the reference translation. Overall results are shown in Table 6. We first check whether the ranking of systems produced by spBLEU matches the rankings by human evaluation and by BLEU. We calculate exact-match ranking accuracy using the Kendall τ coefficient, which ranges between -1 and 1. The ranking of systems by spBLEU matches human evaluation and BLEU perfectly for Pashto, and correlates strongly with both for Chinese and Russian.

However, the exact ranking may not be the most important property of a metric. Often, we want automatic metrics to tell us which model improvement is most effective: for example, which model should we submit to WMT? Thus, it is important to check whether the best-scoring model according to spBLEU matches the best-scoring model according to BLEU. We find that spBLEU and BLEU indeed select the same best model for all three languages. Note that BLEU itself has no guarantee of selecting the same model as human evaluators.
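The two checks above, ranking agreement via Kendall τ and best-model agreement, can be sketched in a few lines. The system scores below are hypothetical, chosen only to exercise both checks; τ is computed in its simple form without tie correction.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score lists over the same
    systems: 1.0 means identical ranking, -1.0 fully reversed.
    Simple (tie-free) variant."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        s = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five MT systems under two metrics.
spbleu = [12.1, 15.4, 9.8, 14.0, 16.2]
bleu   = [9.9, 14.8, 10.1, 13.5, 15.9]

tau = kendall_tau(spbleu, bleu)
# Best-model agreement: do both metrics pick the same winner?
same_best = spbleu.index(max(spbleu)) == bleu.index(max(bleu))
print(tau, same_best)
```

Here one system pair is ranked differently by the two metrics (τ = 0.8 rather than 1.0), yet both metrics still agree on the best model, illustrating why the two checks are reported separately.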
Takeaway. Overall, we conclude that spBLEU functions fairly similarly to BLEU, especially on languages that usually default to the mosestokenizer. On languages that use custom tokenization, spBLEU correlates more strongly with BLEU than other alternatives, such as char-BLEU. Further, spBLEU often produces a very similar ranking of models and selects the same best model as BLEU and human evaluation.

For the vast majority of languages without custom tokenizers, spBLEU provides the ability to quantify performance in the same way, with one model. We believe that having a single tokenization model will help the research community make progress on low-resource research, while opening the door to improved versions of spBLEU that treat low-resource languages more fairly. In the rest of this work, we use spBLEU to evaluate model performance.

Table 7: Many-to-Many performance by available bitext data through English. We show spBLEU on the devtest of FLORES-101 for the M2M-124 615M-parameter model. We group languages into 4 bins. spBLEU is worse for low-resource languages than for high-resource languages, and translating into a low-resource language is harder than translating out of one.

Figure 5: spBLEU for directions in and out of English. We compare performance against the amount of available bitext data. For the same amount of data, translation into English is often stronger.

Evaluating Baselines on FLORES-101
In this section, we present the evaluation of various models on FLORES-101. We describe the dev, devtest, and test splits and how we intend them to be used. We then analyze the performance of a many-to-many model based on Fan et al. (2020), breaking down performance by resource level, sentence length, and language family. Finally, we compare several model variants.

Data Splits
FLORES-101 is divided into three splits: dev, devtest, and test. Unless otherwise stated, we report results on the devtest portion of FLORES-101. The dev set is meant to be used for hyper-parameter tuning. The devtest set is meant to be used for testing during the development phase. The test set will not be released, but will be available via a publicly accessible evaluation server, while the dev and devtest sets are publicly downloadable. Through the evaluation server, the test set can be used by various evaluation campaigns, such as the WMT 2021 Large-Scale Multilingual Task. The primary motivation for keeping the test set available only through an evaluation server is to guarantee equivalent assessment of models and to reduce overfitting to the test set. Further, as the dataset is many-to-many, releasing the source sentences would also release the target sentences.
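Because every sentence is translated into every language and all translations are aligned, any ordered pair of languages forms a valid evaluation direction. A quick sanity check of the direction count (with placeholder language codes) follows:

```python
from itertools import permutations

# FLORES-101 is fully aligned: sentence i in every language file is a
# translation of sentence i in every other language, so every ordered
# pair of languages is a valid evaluation direction.
langs = [f"lang{i:03d}" for i in range(101)]  # placeholder codes
directions = list(permutations(langs, 2))
print(len(directions))  # 101 * 100 = 10,100 directions
```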

Baselines
We evaluate several models to provide baselines for researchers interested in the performance of particular directions, and to understand which languages and directions need substantial research improvement. We also analyze whether sentence length has an effect on performance across directions in and out of English, Chinese, Spanish, Hindi, and Arabic, and when averaging across all language directions; we find that length does not have a strong effect on performance.
M2M-124. Our main baseline extends M2M-100 (Fan et al., 2020), a many-to-many model trained on mined parallel data to translate directly between language pairs, including language pairs not going through English. The original M2M-100 model does not have full coverage of the languages in FLORES-101. We therefore extended its mined data with data from OPUS for the FLORES-101 languages not present in the mined data, extending coverage to 124 languages in total. Note that for the additional languages, OPUS does not contain a large quantity of data, and the OPUS data is rather noisy; see Table 4 for further details. On this parallel data, we trained models of two different sizes, one with 615M and one with 175M parameters. Unless otherwise stated, we report results using the 615M-parameter model; this is our default throughout the rest of this paper.

Generation
We generate from all models with beam size 5, setting the max generation length to 200.Given the large number of directions covered in FLORES-101, we do not tune the beam size, length penalty, or minimum/maximum generation length.

Results
In this section, we report the results of the evaluation of the baseline approaches described above on the FLORES-101 devtest using spBLEU.

Findings From Evaluation on All Directions

All Directions. We evaluated our M2M-124 model with 615M parameters on all language pairs and report spBLEU scores in Figure 8. In Figure 8, the languages are organized alphabetically by language code, while in Figure 9, the rows and columns have been reorganized via spectral clustering, using the spBLEU scores as affinity scores between each pair of languages. This produces clusters ordered by ease of translation.
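The idea behind this reordering can be sketched with a numpy-only toy example. This is not the exact clustering pipeline used for Figure 9; it is the one-dimensional analogue, ordering languages by the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue) of a symmetrized spBLEU affinity matrix, which places mutually easy-to-translate languages next to each other.

```python
import numpy as np

def spectral_order(affinity):
    """Order items by the Fiedler vector of the graph Laplacian of a
    symmetric affinity matrix: items connected by high affinity receive
    nearby coordinates and end up adjacent in the ordering."""
    A = (affinity + affinity.T) / 2.0   # symmetrize X->Y and Y->X scores
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]             # eigenvector of 2nd-smallest eigenvalue
    return np.argsort(fiedler)

# Toy spBLEU matrix: two blocks of mutually easy-to-translate languages
# (high scores within a block, low scores across blocks).
spbleu = np.array([
    [ 0., 30., 30.,  2.,  2.],
    [30.,  0., 30.,  2.,  2.],
    [30., 30.,  0.,  2.,  2.],
    [ 2.,  2.,  2.,  0., 25.],
    [ 2.,  2.,  2., 25.,  0.],
])
order = spectral_order(spbleu)
print(order)
```

On this toy matrix, the ordering keeps languages 0-2 together and languages 3-4 together, mirroring how Figure 9's reordering surfaces regional blocks.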
English-Centric Translation. Across the board, performance of translation into English is strong, with only a few languages scoring below 10 spBLEU. Performance out of English is much worse. We display this graphically in Figure 5, which shows that translation into English (circle markers) has higher spBLEU than translation out of English (triangle markers). Further, performance is heavily correlated with the amount of training data, which we discuss in greater detail later.

Table 9: Many-to-Many performance by domain. We show spBLEU on three partitions of the FLORES-101 devtest according to the originating domains. We compute the corpus spBLEU for each language in each domain, and then average across languages in that direction. We compute the performance into and out of English, Chinese, Spanish, Hindi, and Arabic, as well as the average across all many-to-many directions. Overall, the News domain has slightly better performance, but the domain effect is not strong.
Many-to-Many Translation. Across non-English-centric directions, performance requires improvement: translation in and out of most African languages, for example, struggles to reach 5 spBLEU. In contrast, translation into many European languages, even low-resource ones such as Occitan, performs much better (over 10 spBLEU for many directions). This result highlights the importance of both the amount of data and transfer from related languages. For instance, translation to and from Occitan can naturally borrow from related high-resource languages like French, Italian, and Spanish. The same cannot be said of most African languages, whose related languages are also low-resource and difficult to translate.
Performance by Resource Level. A challenge in analyzing the performance of various language families is that performance is often closely tied to the amount (and quality) of available training data, and certain language families have much less data. For example, almost every African language is a low-resource translation direction.

Thus, we next evaluate performance by resource level. We classify languages into four bins based on the amount of bitext with English: high-resource, with more than 100M sentences of training data; mid-resource, with between 1M and 100M sentences; low-resource, with between 100K and 1M sentences; and very low-resource, with fewer than 100K sentences.

Our results are summarized in Table 7. As hypothesized, performance increases with the quantity of training data in a clear pattern: spBLEU increases moving from left to right, as well as from top to bottom. Translation between mid- and high-resource languages produces spBLEU scores around 20, whereas translating between very low- and low-resource languages yields spBLEU scores below 5. Even translation between high-resource and low-resource languages remains quite weak, indicating that lack of training data strongly limits the performance of current MT systems.
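The binning can be expressed as a small helper. The thresholds follow the text above, while the treatment of exact boundary values and the example sentence counts are our own illustrative choices.

```python
def resource_bin(n_bitext_sentences):
    """Bin a language by the amount of bitext paired with English.
    Boundary handling (>= at each threshold) is an illustrative choice."""
    if n_bitext_sentences >= 100_000_000:
        return "high"
    if n_bitext_sentences >= 1_000_000:
        return "mid"
    if n_bitext_sentences >= 100_000:
        return "low"
    return "very low"

# Hypothetical bitext sizes for four languages.
for lang, n in {"fr": 300_000_000, "hi": 5_000_000,
                "zu": 250_000, "lo": 40_000}.items():
    print(lang, resource_bin(n))
```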
Performance by Sentence Length. In the previous paragraphs we found that translation quality is affected by the amount of training data and the properties of the language. Next, we examine whether translation quality is also affected by a property of the sentences themselves. In particular, we test whether sentence length affects model performance, based on the hypothesis that longer sentences may be more complex and difficult to translate (Sutskever et al., 2014). Sentence length is determined by the number of tokens in the original English sentence. The short bucket collects all sentences with up to 15 tokens (in English), the medium bucket sentences with 16 to 25 tokens, and the long bucket sentences with more than 25 tokens. The results are shown in Table 8.

Table 10: Many-to-Many performance on family groups. We display spBLEU on the devtest of FLORES-101 for the M2M-124 615M-parameter model. We group languages into 11 families. Each cell represents the average performance when translating from all the languages in the source group (row) into each language of the target group (column). We highlight in gray the cells that correspond to within-group evaluation. In bold we show the best performance per target group, and we underline the best performance per source group.
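The bucketing described above can be sketched as a small helper; counting tokens by whitespace splitting of the English source is our simplifying assumption here.

```python
def length_bucket(english_sentence):
    """Assign a sentence to the short/medium/long bucket based on the
    whitespace token count of the original English sentence."""
    n = len(english_sentence.split())
    if n <= 15:
        return "short"
    if n <= 25:
        return "medium"
    return "long"

print(length_bucket("The quick brown fox jumps over the lazy dog"))  # 9 tokens
```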
Table 8 shows that spBLEU is rather constant with respect to sentence length; in fact, it slightly increases with the length of the sentence, contrary to our initial conjecture.
Performance by Domain. We analyze whether model performance is affected by domain, to check whether certain domains are more difficult to translate than others. FLORES-101 contains three domains: WikiNews, WikiJunior, and WikiVoyage. We report results for translating in and out of five languages, namely English, Chinese, Spanish, Hindi, and Arabic, as well as the average across all 10,100 possible directions.

The results in Table 9 demonstrate that the factor that most affects translation quality is the language being translated in and out of (with Arabic being the most challenging and English having the highest scores), rather than the domain. WikiNews is the easiest domain, with slightly higher spBLEU, and WikiVoyage is the hardest, with an average spBLEU about one point lower than WikiNews. We hypothesize that news content is often written in a fairly consistent journalistic style, which could ease translation, while WikiVoyage may be a little harder because it contains more named entities from local regions of the world, which can be harder to translate correctly. Overall, however, there are no large differences in performance between domains.
Performance by Language Family. We also group languages into eleven families based on general language groupings and report in Table 10 the average spBLEU for translating from and into each family. Our results indicate that Bantu, Dravidian, Indo-Aryan, and Nilotic are the language families where M2M-124 struggles the most, attaining an average spBLEU below 5 points. In fact, even translation within these language families (see the values on the diagonal) works very poorly. For these languages, translating to and from Germanic and Romance languages works better. In general, Germanic, Romance, and Balto-Slavic are the language families that yield the largest spBLEU scores (above 10 spBLEU points on average). For these latter families, translation within the language family works best; in this case, M2M-124 obtains an spBLEU score above 20. Overall, translation between all languages in a many-to-many fashion requires improvement, as evidenced by the quite low average scores.

Figure 7: Full results of M2M-124 models on several one-to-many and many-to-one translation tasks using five languages: English, Chinese, Spanish, Hindi, and Arabic. In each case, spBLEU is averaged over all languages in the set (all remaining 100 languages of FLORES-101).

Comparison of Various Systems.
We end by comparing various baseline systems, to understand the performance of some existing models on FLORES-101.

Comparison to OPUS-100. We evaluate OPUS-100 (Zhang et al., 2020) with 254M parameters and the two versions of M2M-124 (Fan et al., 2020) with 175M and 615M parameters. We calculate spBLEU in and out of five languages: English, Chinese, Spanish, Hindi, and Arabic.

Results are shown in Figure 6. Note that OPUS-100 covers only 80 of the FLORES-101 languages, so this figure reports results on the subset of 80 languages covered by all models, for comparability. Overall, we see a consistent trend across models and directions: the larger M2M-124 has the best performance, followed by the smaller M2M-124 and OPUS-100. For all systems we evaluated, translation to and from English works best, while translation to and from Chinese and Arabic struggles the most. In general, spBLEU scores are relatively low, suggesting ample room for improvement and the need for further research in this area.

We next report results for M2M-124 with 175M and 615M parameters on the full set of FLORES-101 languages, shown in Figure 7. Compared with Figure 6, the average performance in these language groupings decreases, indicating that the additional languages in FLORES-101 are likely very difficult. We see the same consistent trend that the larger M2M-124 model performs better.

Comparison with Selected Masakhane Models.
The comparison with OPUS-100 evaluates M2M-124 against another multilingual model. However, various researchers in the low-resource translation community have developed models for specific languages, many of them created by people who speak those languages. Further, focusing on specific directions of interest rather than on multilingual models could produce specialized models of potentially higher quality. Masakhane is a participatory research effort that focuses on African NLP. We end by comparing our M2M-124 model with several publicly available models from the Masakhane-MT repository. We evaluate models from English into the following languages: Yoruba, Zulu, Swahili, Shona, Nyanja, and Luo. The Masakhane models are trained on the JW300 dataset.

Results are shown in Table 11. We observe that for two languages, Zulu and Luo, Masakhane's open-sourced models perform better on FLORES-101 than the M2M-124 model. For the remaining languages we assess, the Masakhane models perform similarly to or worse than M2M-124. Overall, all languages besides Swahili require significant improvement. Note that in many African countries, a large number of local and regional languages are spoken. We hope that FLORES-101 can be used to develop non-English-centric models that translate directly between African languages.
Conclusion
FLORES-101 supports many-to-many evaluation, meaning any of 10,100 language directions can be evaluated.With rich metadata, it also supports multimodal translation via images, and document-level translation.Unlike many other multilingual datasets, FLORES-101 is fully translated by humans using a detailed process with numerous quality control checks, including human evaluation during dataset creation.
Beyond translation, FLORES-101 can be used to evaluate tasks such as sentence and document classification, language identification, and multilingual domain adaptation.We hope that the release of this dataset and our baseline M2M models will be useful for the community.
We hope to continue to expand the number of languages covered in FLORES-101 and make the test set available to various community efforts to improve translation systems in shared tasks such as those from the Workshop on Machine Translation.

Acknowledgments
We'd like to thank Michael Auli and Sergey Edunov for enlightening discussions and advice. We'd like to thank Mona Diab and Denise Diaz for consulting on specific languages and providing invaluable guidance on translation quality. We'd like to thank Xian Li and Yuqing Tang for helping select the original sentences for translation as part of FLORES. Finally, we'd like to thank Brian Bui for helping with the organization of the data collection effort. We thank all of the translators and human evaluators, as well as the translation and quality assurance agencies we worked with, for helping create FLORES-101.

Figure 8: spBLEU of the M2M MMT model on all language pairs of the FLORES-101 devtest set. Cell (i,j) reports spBLEU for translating from language i to language j. Therefore, each column shows spBLEU for translating into the same target language from various source languages; vice versa, each row shows spBLEU for translating into various target languages from the same source language.

A Translation Quality Guidelines
Severities:

• Critical Errors are issues that render the content unfit for use. An error is only critical if it seriously distorts the meaning of the source, in such a way that it becomes completely incomprehensible or that the essence of the message is lost.
• Major Errors may confuse or mislead the user or hinder proper use of the product/service due to significant change in meaning or appear in a visible or important part of the content.
• Minor Errors don't lead to loss of meaning and wouldn't confuse or mislead the user but would be noticed, would decrease stylistic quality, fluency or clarity, or would make the content less appealing.
Error Categories:

1. Grammar - Noncompliance with the target language's grammar rules. Grammar errors may be at the word or sentence level. Types of grammar errors may include:
• Incorrect person, number or case: the person, number or case in the translation does not match the person, number or case of the source text.
• Incorrect tense: the tense used in the translation does not correspond to the tense used in the source.
• Incorrect subject/verb agreement: the subject and verb of a sentence must agree with one another in number, whether they are singular or plural. If the subject of the sentence is singular, its verb must also be singular; if the subject is plural, the verb must also be plural.
• Incorrect use of singular or plural: if a noun in the source text is plural, the corresponding noun and its qualifiers must be plural in the translation.
• Incorrect word order: the word order of the translation is non-standard in the target language, or the translator has made a preferential change to the word order of the source.
2. Punctuation -Punctuation is missing, nonstandard in the target language or inconsistent with the source punctuation.
3. Spelling - Incorrect spelling in the target language. Types of spelling errors may include:
• Use of the wrong homophone for the context, e.g. 'bare with me'
• Typos
• Incorrect use of accents

4. Capitalization - Noncompliance with target-language rules, e.g. not capitalizing the start of a sentence or a proper noun.
5. Addition/Omission -An essential element from the source text is missing in the translation or unnecessary/superfluous elements are present in the translation but were not originally present in the source text.
6. Mistranslation - Source text has not been translated faithfully. Types of mistranslation errors may include:
• Incorrect interpretation of the source text
• Literal translations and mistranslations of phrasal verbs, rendering incorrect meaning in the target
• Lack of disambiguation of ambiguous terms
• Using a subpar word

7. Unnatural Translation - Translation does not sound natural to a native speaker of the target language. The source text is translated word for word, rendering the translation unnatural, or the translation is grammatically correct but unnatural to a native speaker.
8. Untranslated Text - Words are left untranslated from the source text, i.e. words from the source language are present in the translation that should have been translated into the target language.
9. Register -The style or register of the translation is inconsistent with the source and context.

B Additional Results
Details of Spectral Clusters. The list of clusters formed by spectral clustering on spBLEU scores is shown in Table 12.
Comparison of Many-to-Many with English-Centric Pivoting. We examine the need for evaluation in a truly many-to-many sense. Instead of creating multilingual models that can translate directly between any pair of languages, pivoting through English is also possible: pivoting works by first translating from language X into English, and then from English into language Y, instead of translating from X to Y directly. FLORES-101 supports the evaluation and comparison of both strategies. Unlike previous work such as Fan et al. (2020), which was unable to evaluate all directions of its many-to-many model, FLORES-101 enables evaluation of all 101 x 101 pairs. In Figure 10, we compare direct translation with English-centric pivoting for 10 Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Nepali, Oriya, Punjabi, Sinhala, and Urdu. The spBLEU difference between direct translation and English pivoting is displayed in the heatmap. Overall, we see gains in 80% of the directions by translating directly in a many-to-many fashion. Some directions gain more than 3 spBLEU, while in most of the remaining directions, the loss from direct translation relative to pivoting is less than 1 spBLEU.

tSNE of Model Embeddings. We examine the similarity of various languages by visualizing a tSNE of the language embeddings of the trained M2M-124 615M model. Unlike the spectral clustering, this examination reflects the model embeddings rather than the spBLEU scores. Figure 11 shows that languages belonging to the same language family are often grouped together, clustered next to each other.

Table 12: Language clusters after applying spectral clustering on the full spBLEU matrix. For example, one cluster contains Afrikaans, Amharic, Arabic, Welsh, Danish, English, Farsi, Armenian, Hebrew, Georgian, Kurdish, and Maltese. Interestingly, the spectral clustering identifies several clusters that are reminiscent of world regions where these languages are often spoken together.
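The direct-versus-pivot comparison summarized in Figure 10 can be sketched with a small helper. The spBLEU scores below are hypothetical values for a few Indic directions, used only to show the per-direction delta and the fraction of directions where direct translation wins.

```python
def pivot_gains(direct, via_english):
    """Compare direct X->Y spBLEU against English-pivoted X->en->Y spBLEU.
    Returns the per-direction delta (positive = direct wins) and the
    fraction of directions where direct translation is better."""
    deltas = {pair: direct[pair] - via_english[pair] for pair in direct}
    frac_direct_better = sum(d > 0 for d in deltas.values()) / len(deltas)
    return deltas, frac_direct_better

# Hypothetical spBLEU scores for a few Indic directions.
direct = {("hi", "bn"): 12.3, ("bn", "hi"): 13.1,
          ("hi", "ta"): 8.0, ("ta", "hi"): 7.2}
pivot = {("hi", "bn"): 11.0, ("bn", "hi"): 12.5,
         ("hi", "ta"): 8.4, ("ta", "hi"): 6.9}

deltas, frac = pivot_gains(direct, pivot)
print(frac)
```

Positive deltas correspond to the directions where the many-to-many model beats the English-centric pivot, which is the quantity displayed in the Figure 10 heatmap.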

Figure 1 :
Figure 1: Dataset Construction. The workflow used to construct FLORES-101 has three phases: (1) sourcing sentences to translate from English Wikipedia, (2) designing pilot studies to define efficient and effective translation and evaluation processes, and (3) launching the actual translation across all languages. The last stage is iterative, as translations may go through additional rounds of re-translation if the evaluation indicates that quality is insufficient; see Fig. 2 for further details.

Figure 2 :
Figure 2: Depiction of the Overall Translation Workflow. For each target language, sentences are translated by a translation Language Service Provider (LSP). The resulting translations are automatically checked. If these checks fail, translations are sent back for re-translation. If the automatic checks pass, translations are sent to another LSP for human evaluation. If the quality is not sufficient, translations are sent back to the original translation LSP for re-translation. Depending on the human score, the process can repeat for multiple rounds of human evaluation.

Figure 3 :
Figure 3: Importance of Automatic Checks. In gray, we show the sentence-level spBLEU of a language that shows indications of copying from a commercial translation engine: a large number of sentences have very high BLEU scores, mainly in the 80 to 100 BLEU bucket. In contrast, the bars in blue indicate a language that does not exhibit this issue. We discuss the spBLEU metric in greater detail in Section 5.

Figure 6 :
Figure 6: Comparison between OPUS-100 and M2M-124 on several one-to-many and many-to-one translation tasks using five languages: English, Chinese, Spanish, Hindi, and Arabic. In each case, spBLEU is averaged over all languages in the set. Since the open-source OPUS-100 model covers only 80 languages of FLORES-101, we restrict the evaluation to these languages in order to make a fair comparison.

Figure 9 :
Figure 9: This is the same data as in Fig. 8, except that rows and columns have been reorganized according to spectral clustering with 8 clusters (identified by the colors in the first row and first column).

Figure 10 :
Figure 10: Performance of many-to-many direct translation versus English-centric pivoting. We compare the difference in spBLEU (positive indicates that direct translation performs better) for 10 Indic languages. The results are computed using the M2M-124 615M model.

Figure 11 :
Figure 11: tSNE plot of language embeddings. We embed the data of various languages with our model and examine them by language subgrouping. Oftentimes, languages in the same subgrouping cluster together.

Table 1 :
Statistics of FLORES-101. FLORES-101 contains 3001 sentences selected from 842 articles, divided into three splits: dev, devtest, and test. The articles are sourced from three domains, breaking down into 10 sub-topic classifications.

Table 6 :
spBLEU compared to human evaluation and BLEU ranking. We analyze translation into Pashto, Russian, and Chinese. We indicate the Kendall τ between spBLEU and Human Eval, spBLEU and BLEU, and Human Eval and BLEU, as well as whether the different metrics select the same best model.

Table 8 :
Many-to-Many performance by sentence length. We show spBLEU on the devtest of FLORES-101 for the M2M-124 615M-parameter model.