Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Curating such datasets relies on the websites giving clues about the language of their contents (e.g. a language identifier in the URL) and on automatic language classification (LangID). It is commonly known that these automatically crawled and filtered datasets tend to have overall lower quality than hand-curated collections (Koehn et al., 2020), but their quality is rarely measured directly, and is rather judged through the improvements they bring to downstream applications (Schwenk et al., 2021).
Building NLP technologies with automatically crawled datasets is promising. This is especially true for low-resource languages, because data scarcity is one of the major bottlenecks for deep learning approaches. However, there is a problem: There exists very little research on evaluating both data collections and automatic crawling and filtering tools for low-resource languages. As a result, although many low-resource languages are covered by the latest multilingual crawl data releases, their quality and thus usability is unknown.
To shed light on the quality of data crawls for the lowest resource languages, we perform a manual data audit for 230 per-language subsets of five major crawled multilingual datasets: CCAligned (El-Kishky et al., 2020), ParaCrawl (Esplà et al., 2019; Bañón et al., 2020), WikiMatrix (Schwenk et al., 2021), OSCAR (Ortiz Suárez et al., 2019; Ortiz Suárez et al., 2020) and mC4 (Xue et al., 2021). We propose solutions for effective, low-effort data auditing (Section 4), including an error taxonomy. Our quantitative analysis reveals surprisingly low amounts of valid in-language data, and identifies systematic issues across datasets and languages. In addition, we find that a large number of datasets is labeled with nontransparent or incorrect language codes (Section 5). This leads us to reflect on the potential harm of low-quality data releases for low-resource languages (Section 6), and provide a set of recommendations for future multilingual data releases (Section 7).
As the scale of ML research grows, it becomes increasingly difficult to validate automatically collected and curated datasets (Biderman and Scheirer, 2020; Birhane and Prabhu, 2021; Bender et al., 2021). Several works have focused on advancing methodologies and best practices to address these challenges. Bender and Friedman (2018) introduced data statements, a documentary framework for NLP datasets that seeks to provide a universal minimum bar for dataset description. Similar work has been done on systematizing documentation in other areas in data science and machine learning, including work focusing on online news (Kevin et al., 2018), data ethics (Sun et al., 2019), and data exploration (Holland et al., 2018), as well as generalist work such as Gebru et al. (2018). Data quality is also implicitly documented by successes of filtering methods. There is a large literature on filtering data for various NLP tasks, e.g. Axelrod et al. (2011); Moore and Lewis (2010); Rarrick et al. (2011); Wang et al. (2018); Kamholz et al. (2014); Junczys-Dowmunt (2018); Caswell et al. (2020).
Closest to our work is the analysis of a highly multilingual (non-publicly available) web-crawl and LangID-related quality issues by Caswell et al. (2020). They perform a brief analysis of the quality of OSCAR with the focus only on the presence of in-language content. Dodge et al. (2021) automatically documented and analyzed the contents and sources of C4 (Raffel et al., 2020), the English counterpart of mC4, which surfaced the presence of machine-translated contents and NLP benchmark data.

Multilingual Corpora
Table 1 provides an overview of the corpora of interest in this work. We selected the corpora for their multilinguality and the inclusion of understudied languages in NLP. With the exception of WikiMatrix and ParaCrawl, all corpora are derived from CommonCrawl (CC).
Table 1: Comparison of parallel and monolingual corpora extracted from web documents, including their downstream evaluation tasks. All parallel corpora are evaluated for machine translation (BLEU). TED-6: da, cr, sl, sk, lt, et; TED-45: 45-language subset of (Qi et al., 2018); WMT-5: cs, de, fi, lv, ro. POS/DEP-5: part-of-speech labeling and dependency parsing for bg, ca, da, fi, id.
CCAligned (El-Kishky et al., 2020) is a parallel dataset built off 68 CC snapshots. Documents are aligned if they are in the same language according to FastText LangID (Joulin et al., 2016, 2017), and have the same URL but for a differing language code. These alignments are refined with cross-lingual LASER embeddings (Artetxe and Schwenk, 2019). For sentence-level data, they split on newlines and align with LASER, but perform no further filtering. Human annotators evaluated the quality of document alignments for six languages (de, zh, ar, ro, et, my) selected for their different scripts and amount of retrieved documents, reporting precision of over 90%. The quality of the extracted parallel sentences was evaluated in a machine translation (MT) task on six European (da, cr, sl, sk, lt, et) languages of the TED corpus (Qi et al., 2018), where it compared favorably to systems built on crawled sentences from WikiMatrix and ParaCrawl v6.
Multilingual C4 (mC4) (Xue et al., 2021) is a document-level dataset used for training the mT5 language model. It consists of monolingual text in 101 languages and is generated from 71 CC snapshots. It filters out pages that contain less than three lines of at least 200 characters and pages that contain bad words. Since this is a document-level dataset, we split it by sentence and deduplicate it before rating. For language identification, it uses CLD3 (Botha et al., 2017), a small feed-forward neural network that was trained to detect 107 languages. The mT5 model pre-trained on mC4 is evaluated on 6 tasks of the XTREME benchmark (Hu et al., 2020) covering a variety of languages and outperforms other multilingual pretrained language models such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020).
OSCAR (Ortiz Suárez et al., 2019; Ortiz Suárez et al., 2020) is a set of monolingual corpora extracted from CC snapshots, specifically from the plain text WET format distributed by CC which removes all the HTML tags and converts the text to UTF-8. It is deduplicated and follows the approach by Grave et al. (2018) of using FastText LangID (Joulin et al., 2016, 2017) on a line-level. No other filtering was applied. For five languages (bg, ca, da, fi, id) OSCAR was used by its original authors to train language models which were then evaluated on parsing and POS tagging (Ortiz Suárez et al., 2020). OSCAR has also been used in independent studies to train monolingual or multilingual language models (ar, as, bn, de, el, fr, gu, he, hi, kn, ml, mr, nl, or, pa, ro, ta, te) and subsequently evaluate them on various downstream tasks (Antoun et al., 2021; Kakwani et al., 2020; Wilie et al., 2020; Chan et al., 2020; Koutsikakis et al., 2020; Martin et al., 2020; Chriqui and Yahav, 2021; Seker et al., 2021; Delobelle et al., 2020; Dumitrescu et al., 2020; Masala et al., 2020).
ParaCrawl v7.1 is a parallel dataset with 41 language pairs primarily aligned with English (39 out of 41) and mined using the parallel-data-crawling tool Bitextor (Esplà et al., 2019; Bañón et al., 2020) which includes downloading documents, preprocessing and normalization, aligning documents and segments, and filtering noisy data via Bicleaner. ParaCrawl focuses on European languages, but also includes 9 lower-resource, non-European language pairs in v7.1. Sentence alignment and sentence pair filtering choices were optimized for five languages (mt, et, hu, cs, de) by training and evaluating MT models on the resulting parallel sentences. An earlier version (v5) was shown to improve translation quality on WMT benchmarks for cs, de, fi, lv, ro.
WikiMatrix (Schwenk et al., 2021) is a public dataset containing 135M parallel sentences in 1620 language pairs (85 languages) mined from Wikipedia. Out of the 135M parallel sentences, 34M are aligned with English. The text is extracted from Wikipedia pages, split into sentences, and duplicate sentences are removed. FastText LangID is used before identifying bitext with LASER's distance-based mining approach. The margin threshold is optimized by training and evaluating downstream MT models on four WMT benchmarks (de-en, de-fr, cs-de, cs-fr).
The final dataset is used to train translation models that are then evaluated by automatically measuring the quality of their translations against human translations of TED talks in 45 languages, with highest quality for translations between English and e.g. pt, es, da, and lowest for sr, ja, mr, zh_TW. In the audit we focus on language pairs with English on one side.

Auditing Data Quality
None of the above datasets has been evaluated for quality on the sentence level (exception: several languages in ParaCrawl v3), and downstream evaluations are centered around a small fraction of higher-resource languages. This is insufficient for drawing conclusions about the quality of individual or aligned sentences, and about the entirety of languages. In addition, publication bias might prevent negative results obtained with the lower-quality corpora from being published.
To close this gap, we conduct a human data quality audit focused on the lowest-resource and most under-evaluated languages, but also covering mid- and high-resource languages for comparison.

Auditing Process
Participants. We recruited 51 volunteers from the NLP community, covering about 70 languages with proficient language skills. Each sentence is annotated by one rater. To verify our hypothesis that those annotations can largely be done by non-native speakers, we repeat a set of language expert annotations by a non-expert, and measure the accuracy of the non-expert.
Sample selection. For each language in each dataset, we took a random sample of 100 lines, which may be anywhere from single words to short paragraphs depending on segmentation. We manually annotated them according to the error taxonomy described below. For WikiMatrix and CCAligned, we selected those languages that are paired with English, and for ParaCrawl, we also included those paired with Spanish ("total" counts in Table 3). We did not annotate all languages, but focused on the ones with the least number of sentences in each dataset (at least the smallest 10) and languages for which we found proficient speakers. Since we annotate the same maximum number of sentences across all chosen languages regardless of their total number of sentences, the annotated samples are not an unbiased sample from the whole dataset.
Non-expert labeling strategies. Although many of the volunteers were familiar with the languages in question or spoke related languages, in cases where no speaker of a relevant language could be found, volunteers used dictionaries and internet search to form educated guesses. We discuss this deeper in Appendix C to highlight how much of this low-resource focused evaluation can actually be done by non-proficient speakers with relatively low effort. In general, we aim to find an upper bound on quality, so we encouraged annotators to be forgiving of translation mistakes when the overall meaning of the sentence or large parts thereof are conveyed, or when most of the sentence is in the correct language.
Effort. The individual effort was dependent on the quality and complexity of the data, and on the annotator's knowledge of the language(s). For example, it took less than two minutes for an English native speaker to pass through 100 well-formed English sentences (or similarly to annotate languages with 0% in-language sentences), but up to two hours of "detective work" for well-formed content in languages unfamiliar to the annotator.
Table 2: Annotation codes for parallel data with sentence pair examples. The language code before each sentence indicates the language it is supposed to be in.
Taxonomy. In order to quantify errors, we developed a simple error taxonomy. Sentences and sentence pairs were annotated according to a simple rubric with error classes of Incorrect Translation (X, excluded for monolingual data), Wrong Language (WL), and Non-Linguistic Content (NL).
Of correct sentences (C), we further mark single words or phrases (CS) and boilerplate contents (CB). In addition, we asked annotators to flag offensive or pornographic content. Table 2 provides examples for parallel data, and Appendix B contains detailed annotation instructions.

Interpretation of Results
For each language, we compute the percentage of each label within the 100 audited sentences. Then, we either aggregate the labels across languages with equal weights (macro-average), or weight them according to their presence in the overall dataset (micro-average). Results are shown in Table 3. The statistics for the correct codes (CC, CB, CS) are combined as C. The number of languages, the numbers of sentences per language and the choice of languages differ across datasets, both in the original release and in the selection for our audit, so the comparison of numbers across datasets has to be taken with a grain of salt. Since the numbers are based on a small sample of sentences that were partially annotated by non-experts, the error statistics are only rough estimates. Our audit captures a decent ratio of languages (25-55%, 2nd row in Table 3), but only a tiny fraction of the overall number of sentences (0.00004-0.002%). When we speak of "low-" and "high"-resource languages, we mean languages with smaller or larger representation in the datasets at hand. When reporting language-specific results we use the original language identifiers of the datasets.
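The difference between the two aggregation schemes described above can be made concrete with a small sketch. The function name `macro_micro` and the toy numbers are illustrative assumptions, not values from the audit; the micro-average here weights each language's %C by its total sentence count in the dataset.

```python
def macro_micro(pct_correct, corpus_size):
    """Aggregate per-language %C two ways.

    pct_correct: language -> percentage of audited sentences labeled C
    corpus_size: language -> total number of sentences in the dataset
    """
    langs = list(pct_correct)
    # Macro-average: every language counts equally.
    macro = sum(pct_correct[l] for l in langs) / len(langs)
    # Micro-average: each language is weighted by its share of the dataset.
    total = sum(corpus_size[l] for l in langs)
    micro = sum(pct_correct[l] * corpus_size[l] for l in langs) / total
    return macro, micro


# Hypothetical two-language dataset (not audit data).
pct = {"en": 100.0, "xx": 0.0}
size = {"en": 900, "xx": 100}
macro, micro = macro_micro(pct, size)
print(macro, micro)  # 50.0 90.0
```

In this toy case a single large, clean language lifts the micro-average to 90% even though half of the languages contain no usable text, which is the effect the macro-average is meant to expose.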

Which datasets have quality issues?
The macro-averaged results show that the ratio of correct samples (C) ranges from 24% to 87%, with a large variance across the five audited datasets. Particularly severe problems were found in CCAligned and WikiMatrix, with 44 of the 65 languages that we audited for CCAligned containing under 50% correct sentences, and 19 of the 20 in WikiMatrix. In total, 15 of the 205 language-specific samples (7.3%) contained not a single correct sentence. For the parallel datasets we are also interested in the quantity of misaligned/mistranslated sentences (X). For WikiMatrix, two-thirds of the audited samples were on average misaligned. We noticed that sentences were often similar in structure, but described different facts (see Table 6). This might originate from the nature of the underlying Wikipedia articles, since they are often comparable rather than parallel (Schwenk et al., 2021).
Table 3: Averages of sentence-level annotations across datasets and selected languages. Macro-avg: Each language is weighted equally in the aggregation, regardless of its size. Micro-avg: Each label is weighted by the fraction of sentences for that language in the overall annotated corpus, i.e., the annotations for higher-represented languages are upweighted, and annotations for lower-represented languages are downweighted. The bottom rows contain the number of languages that have 0% labeled C etc. Note that these are not true expectations since the languages audited were not randomly sampled.
Figure 1 illustrates per-corpus correctness more completely, showing for each dataset what percent of audited corpora are under each possible threshold of correctness.
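A cumulative view of this kind can be derived directly from the per-corpus %C values. The sketch below is only an illustration: `share_below` is a hypothetical helper name and the input list is made up rather than taken from the audit.

```python
def share_below(pct_correct_by_corpus, threshold):
    """Fraction of audited corpora whose %C is strictly below `threshold`."""
    values = list(pct_correct_by_corpus)
    return sum(v < threshold for v in values) / len(values)


# Hypothetical %C values for five audited corpora of one dataset.
example = [0, 10, 45, 60, 100]
curve = {t: share_below(example, t) for t in (1, 50, 100)}
print(curve[50])  # 0.6
```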
Why haven't these problems been reported before? The findings above are averaged on a per-language basis (i.e. macro-average), and therefore give low and high-resource languages equal weight. If we instead estimate the quality on a per-sentence basis, i.e. down-weight lower-resource languages in the computation of the average, the numbers paint a more optimistic picture ("micro" block in Table 3). This is especially relevant for the monolingual datasets because they contain audits for English, which makes up 43% of all sentences in OSCAR and 36% in mC4. To illustrate the effect of this imbalance: A random sample from the entire mC4 dataset with over 63% chance will be from one of the 8 largest languages (en, ru, es, de, fr, it, pt, pl, >100M sentences each), of which all have near perfect quality. Analogously, evaluation and tuning of web mining pipelines and resulting corpora in downstream applications focused largely on higher-resource languages (Section 3), so the low quality of underrepresented languages might go unnoticed if there is no dedicated evaluation, or no proficient speakers are involved in the curation (∀ et al., 2020).
How much content is nonlinguistic or in the wrong language? Nonlinguistic content is a more common problem than wrong-language content. Among the parallel datasets, CCAligned contains the highest percentage of nonlinguistic content, at 31.42% on average across all rated corpora, and also the highest percent of wrong-language content, at 9.44%. Among the monolingual datasets, mC4 contains the highest ratio both of sentences in incorrect languages (15.98% average) and nonlinguistic content (11.40% average), with 4 of the 48 audited languages having more than 50% contents in other languages. The low amount of wrong language in ParaCrawl shows the benefits of selecting domains by the amount of in-language text, but the dataset also covers the smallest number of languages. The low ratio of wrong language samples in OSCAR may reflect the success of line-level LangID filtering. These numbers provide evidence that more research in LangID could improve the overall quality, especially with respect to nonlinguistic content.
Which languages got confused? The languages that were confused were frequently related higher-resource languages. However, there were also a significant number of "out-of-model cousin" cases, where languages not supported by the LangID model ended up in a similar-seeming language. For instance in mC4, much of the Shona (sn, Bantu language spoken in Zimbabwe and Mozambique) corpus is actually Kinyarwanda (rw, Bantu language spoken mostly in Rwanda and Uganda) and, peculiarly, much of the Hawaiian (haw, Polynesian language spoken in Hawaii) is actually Twi (tw/ak, Central Tano language spoken mostly in Ghana).
Do low-resource languages have lower quality? Low-resource datasets tend to have lower human-judged quality. The Spearman rank correlation between quality (%C) and size is positive in all cases. The trend is strongest for mC4 (r = 0.66), and gradually declines for CCAligned (r = 0.53), WikiMatrix (r = 0.49), ParaCrawl (r = 0.43), and OSCAR (r = 0.37). Figure 2 compares the number of sentences for each language against the proportion of correct sentences: Not all higher-resource languages (> 10^6 sentences) have high quality, in particular for CCAligned (e.g. Javanese (en-jv_ID) with 5%C, or Tagalog (en-tl_XX) with 13%C). For mid-resource languages (10^4-10^6 sentences) the picture is inconclusive, with some languages having high quality, and others having extremely low quality, even within the same datasets, e.g. Urdu in CCAligned en-ur_PK has 100%C vs. its romanized counterpart en-ur_PK_rom 0.5%C. For individual error codes trends are less clear (not depicted).
Which languages have the lowest quality? Across datasets we observe that the quality is particularly poor for languages that are included in romanized script (_rom/_latn), but are more commonly written in other scripts, e.g., Urdu (ur), Japanese (ja), Arabic (ar). These are not transliterations of other scripts, but mostly contain nonlinguistic material or wrong languages (e.g. the romanized Japanese corpus in mC4 (ja_latn) contains Spanish, French, English, Portuguese, amongst others). In terms of geography, the poorest quality is found for African languages (Bambara (bm), Fula (ff), Kikongo (kg), Luganda (lg), Lingala (ln), Northern Sotho (nso), Oromo (om), Shona (sn), Somali (so), Tswana (tn), Wolof (wo)), minority languages in Europe and the Middle East that are closely related to higher-resource languages (Azerbaijani (az-IR), North Frisian (frr), Neapolitan (nap), Silesian (szl), Zaza (zza)), lesser spoken Chinese languages sharing a script with Mandarin (Yue (yue), Wu (wuu)), four major Austronesian languages (Central Bikol (bcl), Chavacano (cbk), Javanese (jv), Sundanese (su)), and some South-Asian languages, in particular Sinhala (si). Appendix D contains the detailed per-language statistics for all corpora.
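The Spearman rank correlation between corpus size and %C used in this analysis is the Pearson correlation of the two rank vectors. The following is a minimal dependency-free sketch, assuming average ranks for ties; the helper names and the toy data are illustrative and not taken from the audit.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical corpora: number of sentences and audited %C.
sizes = [1e3, 1e4, 1e5, 1e6]
pct_c = [5, 40, 30, 95]
print(round(spearman(sizes, pct_c), 2))  # 0.8
```

Because only ranks enter the computation, the result does not depend on whether sizes are compared on a linear or logarithmic scale.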
What is the incidence of offensive and pornographic content? Overall, the sampled sentences did not contain a large amount of offensive contents. However, there were notable amounts of pornographic content (> 10%) found in CCAligned for 11 languages.
Annotation quality. For a subset of audited languages from CCAligned and OSCAR we measure the accuracy (Acc) of the labels assigned by non-proficient speakers against the labels assigned by proficient speakers for all audited sentences. This can be understood as a directed measure of annotator agreement for the special case where one rater is an expert and the other is not. Results for varying label granularity are reported in Tables 4 and 5. For n = 6 all classes of the taxonomy were distinguished, for n = 4 the C subclasses were combined, and for n = 2 it is a binary decision between C and the rest of the error classes. With the full 6-class taxonomy (Acc-6) we find a mean accuracy of 0.66 for CCAligned audits, and 0.98 for OSCAR audits. With a binary taxonomy (Acc-2) distinguishing C from the rest, the accuracy further increases to 0.79 for CCAligned. This provides strong evidence that good quality annotations are not limited to those proficient in a language.
However, the significant drop of accuracy for finer-grained labels suggests that our taxonomy can be further improved, especially for parallel sentences. The error taxonomy lacks at least one category of error, namely "correct/in-language but unnatural". Similarly, the definitions of "correct-short" and "correct-boilerplate" were not understood equally by all annotators and the concept of "correct-short" has potential issues for agglutinative languages like Turkish. Finally, it was unclear what to do with related dialects, e.g. when a sentence is "almost correct but wrong dialect" or when it is unclear which dialect a sentence belongs to. We recommend including these categories for future audits.
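The three accuracy granularities (Acc-6, Acc-4, Acc-2) differ only in how labels are collapsed before comparing the non-expert to the expert. A minimal sketch follows, assuming the six labels CC, CB, CS, X, WL, NL; the mapping names, the `ERR` placeholder and the example annotations are hypothetical.

```python
# Collapse the C subclasses into a single C class (4-way taxonomy).
TO_4 = {"CC": "C", "CB": "C", "CS": "C", "X": "X", "WL": "WL", "NL": "NL"}


def to_6(label):
    return label


def to_4(label):
    return TO_4[label]


def to_2(label):
    # Binary decision: correct vs. any error class.
    return "C" if TO_4[label] == "C" else "ERR"


def accuracy(expert, non_expert, mapping):
    """Share of sentences where the non-expert matches the expert label."""
    if len(expert) != len(non_expert):
        raise ValueError("annotation lists must have equal length")
    hits = sum(mapping(a) == mapping(b) for a, b in zip(expert, non_expert))
    return hits / len(expert)


# Hypothetical annotations for five sentence pairs.
expert = ["CC", "CB", "X", "WL", "NL"]
novice = ["CC", "CS", "WL", "WL", "NL"]
print(accuracy(expert, novice, to_6))  # 0.6
print(accuracy(expert, novice, to_4))  # 0.8
print(accuracy(expert, novice, to_2))  # 1.0
```

Accuracy can only stay equal or rise as classes are merged, since every agreement at a finer granularity is also an agreement at a coarser one.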

Automatic Filtering
Given the frequency of WL and NL annotations, it might be tempting to use open-source LangID models to post-filter data on a per-sentence(-pair) level, as OSCAR does. Unfortunately, this turns out to have its own issues.
Sentence-level n-gram LangID filtering. We classify all sentence pairs of CCAligned with CLD3, an n-gram based LangID model. By comparing its predictions to the audit labels, we evaluate its quality on the subset of annotated samples: the classifier should detect both correct languages when the pair is annotated as C or X, and should detect incorrect languages in the pair when it is annotated as WL or NL. On this task, the CLD3 classifier achieves an average precision of only 40.6%.
Sentence-level Transformer LangID filtering. N-gram LangID models like CLD3 have known problems. However, Caswell et al. (2020) demonstrate that semi-supervised Transformer-based LangID models strongly out-perform them. We train a comparable Transformer-based LangID model and apply it to our annotated CCAligned data. We find that filtering noisy corpora (< 50% correct) on LangID for both source and target leads to gains in median precision, rising from 13.8% pre-filter to 43.9% post-filter. However, this comes at a steep cost of 77.5% loss in recall. The biggest winners were Lingala, whose precision climbs from 8% to 80%, and Oromo, which soars from 2% to 33% in-language. Both of these, however, come at the cost of losing 50% of the correct in-language sentences, being reduced from 22k sentences to 3k and 1k sentences respectively, which would likely be too small for building downstream models. The moral is that, at least at the current stage, there is no one-size-fits-all approach for sentence-level LangID filtering.
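The precision/recall trade-off described above can be measured on any audited sample once each sentence has an audit label and a keep/drop decision from the filter. The sketch below is a generic illustration under that assumption; `filter_stats` and the toy inputs are hypothetical and do not reproduce the LangID models used in the audit.

```python
def filter_stats(is_correct, kept):
    """Return (precision before filtering, precision after, recall).

    is_correct: per-sentence booleans from the human audit (label C or not)
    kept:       per-sentence booleans, True if the filter keeps the sentence
    """
    total = len(is_correct)
    n_correct = sum(is_correct)
    n_kept = sum(kept)
    kept_correct = sum(c and k for c, k in zip(is_correct, kept))
    pre = n_correct / total
    post = kept_correct / n_kept if n_kept else 0.0
    recall = kept_correct / n_correct if n_correct else 0.0
    return pre, post, recall


# Hypothetical sample of 10 sentences: 2 are correct, the filter keeps 2.
is_correct = [True, True] + [False] * 8
kept = [True, False, True] + [False] * 7
pre, post, recall = filter_stats(is_correct, kept)
print(pre, post, recall)  # 0.2 0.5 0.5
```

In this toy case precision rises from 20% to 50%, but half of the correct sentences are discarded, mirroring the kind of trade-off reported for the noisy corpora.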

Dataset Mis-labeling
Standardized and unambiguous representations of language codes are important for practical data use and exchange. The standard used by most academic and industry applications is BCP-47 (Phillips and Davis, 2005), which builds off the two-letter ISO 639-1 codes and three-letter ISO 639-3 codes, but also allows adding subtags for scripts (e.g. Hindi in Latin script: hi-Latn) or regional varieties (e.g. French spoken in Canada: fr-CA). It would enhance transparency and interoperability if adopted consistently, especially with growing language diversity in NLP.
We find a variety of errors and inconsistencies in language code usage, ranging from serious mislabelings to small transgressions against standard conventions. For this analysis, we also include the JW300 (Agić and Vulić, 2019) dataset, a multilingual dataset crawled from jw.org. In summary, we find 8 nonstandard codes in CCAligned, 3 in OSCAR, 1 in mC4, 1 in WikiMatrix, and 70 in JW300, for 83 in total. This does not include the 59 codes affected by superset issues. Full details are given in Appendix A.
Inconsistent Language Codes. One common issue is simply using nonstandard or invented codes. For example, CCAligned uses only two-letter codes, so when the BCP-47 code for a language is three letters it is either shortened (e.g. zza → zz) or invented (shn → qa). Similarly, OSCAR contains data labeled as als (BCP-47 for Tosk Albanian) that is actually in gsw (Alemannic). 22 additional language codes in JW300 have similar issues, including 12 codes that start with jw_ but are not Javanese.
False Sign Languages. 12% (48/417) of the JW300 corpora carry language codes for sign languages. Instead of sign language transcripts they are texts in another high resource language, mostly English or Spanish. For example, the en-zsl (Zambian sign language) data is actually English-English parallel data (copies); details are given in Appendix A. This was likely caused by videos with sign language interpretation embedded on the crawled websites.
Mysterious supersets. When datasets contain language codes that are supersets of other language codes, it is difficult to determine which particular language the text contains. WikiMatrix has Serbian (sr), Croatian (hr), Bosnian (bs), and Serbo-Croatian (sh), their superset. The issue of codes that are supersets of others is common enough to include a small table dedicated to it (Appendix Table 7). In some cases this may not be an issue, as with Arabic, where ar conventionally refers to Modern Standard Arabic, even though the code technically encompasses all dialects. In many cases, the nature of the data in the superset code remains a mystery.
Deprecated codes. Finally, there are several deprecated codes that are used: sh in WikiMatrix, iw in mC4, sh and eml in OSCAR, and daf in JW300.
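Some of the code issues in this section can be caught mechanically before release. The sketch below is a deliberately simplified shape check (language, optional script, optional region, hyphen-separated, in the conventional casing), not a full BCP-47 validator, and it does not consult the subtag registry. The deprecated set contains only the codes named in the text above; `check_code` is a hypothetical helper.

```python
import re

# Simplified tag shape: "xx" or "xxx", optional "-Ssss" script, optional
# "-RR" or "-999" region. Real BCP-47 allows more (variants, extensions)
# and is case-insensitive; this is an illustrative approximation.
SHAPE = re.compile(r"^[a-z]{2,3}(-[A-Z][a-z]{3})?(-(?:[A-Z]{2}|[0-9]{3}))?$")

# Deprecated codes mentioned in the text above.
DEPRECATED = {"sh", "iw", "eml", "daf"}


def check_code(code):
    """Return a list of issues found for one language code (empty if none)."""
    issues = []
    if not SHAPE.match(code):
        issues.append("nonstandard shape")
    primary = re.split(r"[-_]", code)[0].lower()
    if primary in DEPRECATED:
        issues.append("deprecated primary subtag")
    return issues


for code in ["hi-Latn", "fr-CA", "ur_PK_rom", "iw"]:
    print(code, check_code(code))
```

A check like this flags underscore-separated or invented suffixes such as `ur_PK_rom`, but it cannot detect mislabeled content (for example text stored under a sign-language code), which still requires looking at the data.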

Risks of Low-Quality Data
Low quality in downstream applications. Text corpora today are building blocks for many downstream NLP applications like question answering and text summarization; for instance, a common approach is to first train translation models on such data and then automatically translate training data for downstream models (Conneau et al., 2018). If the data used for the original systems is flawed, derived technology may fail for those languages far down the line without the causes being known. This risk of undesired downstream effects calls for future studies with a careful treatment of intertwined effects such as data size and domain, language-specific phenomena, evaluation data and metric biases. To give the reader a brief glimpse of the impact of data quality for the example of translation, we compare the C% metric from our audit with the translation quality (sentencepiece-BLEU, spBLEU) of the multilingual translation model M2M124 for 124 languages (Goyal et al., 2021). It was trained on WikiMatrix and CCAligned, and similar data collected with the same tools, which we expect to show similar biases. Translation quality is evaluated on the trusted, human-translated FloReS benchmark (Goyal et al., 2021). For the 21 languages present in both the audit and the FloReS benchmark, we found a positive correlation (Spearman) between the data quality scores and spBLEU of ρ = 0.44 (p = 0.041). This is not as large as the correlation with data size (ρ = 0.66, p = 0.00078), but it nonetheless helps to explain translation quality: the correlation between the product of C% and data size (in other words, the expected total number of good sentences in the dataset) is the highest yet, with a value of ρ = 0.73 (p = 0.00013).
Representation washing. Since there are datasets which contain many low-resource languages, the community may feel a sense of progress and growing equity, despite the actual quality of the resources for these languages. Similarly, if low-quality datasets are used as benchmarks they may exaggerate model performance, making low-resource NLP appear more solved than it is. Conversely, if models perform poorly when trained with such data, it may be wrongly assumed that the task of learning models for these languages is harder than it actually is or infeasible given current resources. These effects could result in productive effort being redirected away from these tasks and languages.
Table 6: Examples of "parallel" data where the translation has a different meaning than the source, but the form looks the same. (We added translations of the non-English side.) Such data may encourage hallucinations of fake "facts".
Trust in incorrect "facts". We found many instances of parallel-looking sentences that are structurally and semantically similar, but not factually correct translations (Table 6). They can cause models to produce plausible "translations" that are factually wrong, but users may still trust them (algorithmic trust) without verifying the information. Similarly, automation bias (Skitka et al., 1999), referring to humans favoring decisions made by automated systems over decisions made by humans, might amplify the issues of inaccurate translations caused by misalignments.

Future Work and Recommendations
Across the five multilingual corpora evaluated, we consistently found severe issues with quality, especially in the lower-resource languages. We rated samples of 205 languages, and found that 87 of them had under 50% usable data, with a full 15 languages at 0% in-language. We furthermore found consistent issues with mislabeled data and nonstandard language codes, particularly in the JW300 dataset, and identified 83 affected corpora, at least 48 of which were entirely spurious (Section 5). While there might have been anecdotal evidence of insufficient quality for some of the datasets, the majority of these quality issues had not been reported, nor been investigated in depth.
These issues might go unnoticed for languages that are not represented in the evaluation of the crawling methods, and cause harm in downstream applications (Khayrallah and Koehn, 2018).
There are a variety of ways to improve both the ease and accuracy of human evaluation, as well a few classes of issues we ignored in this paper, like close dialects.Ideally we would like to build a standard suite of automatic metrics for datasets, but more research is necessary to determine what the appropriate metrics would be.One important area missing from our analyses however is the estimated portion of a dataset which has been generated by MT (Rarrick et al., 2011), LM systems, or bots/templates, as for example in the analysis of C4 (Dodge et al., 2021).The information captured in machine-generated content might still be useful for modeling, but might falsely overrepresent typical generation patterns and introduce linguistic errors or unnatural artifacts.
We therefore strongly recommend looking at samples of any dataset before using it or releasing it to the public. As we have shown, one does not need to be proficient in a language to see when there are serious quality issues, and a quick scan of 100 sentences can be sufficient to detect major problems. Moreover, going through and annotating a small sample of data can bring actionable insights about new ways to filter or use it.
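As a minimal sketch of the "quick scan" recommended above, the snippet below draws up to 100 sentences from a corpus for manual inspection. The function name `sample_for_audit`, the fixed seed, and the toy corpus are illustrative assumptions, not part of the audit tooling used in this paper; the only behavior taken from the text is the sample size of 100 and the fallback of auditing everything when fewer sentences are available.

```python
import random


def sample_for_audit(sentences, k=100, seed=0):
    """Return up to k sentences chosen uniformly at random without replacement.

    If the corpus has k or fewer sentences, all of them are returned.
    The seed is fixed (an assumption) so that the same sample can be re-drawn.
    """
    sentences = list(sentences)
    if len(sentences) <= k:
        return sentences
    rng = random.Random(seed)
    return rng.sample(sentences, k)


# Hypothetical toy corpora, for illustration only.
large_corpus = [f"sentence {i}" for i in range(250)]
small_corpus = [f"sentence {i}" for i in range(7)]

large_sample = sample_for_audit(large_corpus)
small_sample = sample_for_audit(small_corpus)
```

Fixing the seed is a design choice that lets a second annotator audit the same sentences, for example when measuring agreement.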
If data quality issues are found, a wide variety of techniques can be explored, ranging from filtering on length-ratio, LangID, TF-IDF wordlists (Caswell et al., 2020) or dictionaries (Kamholz et al., 2014) to neural approaches like LM scoring (Axelrod et al., 2011; Moore and Lewis, 2010; Wang et al., 2018). Unfortunately, none of these provides a quick and easy fix, especially for low-resource languages: data cleaning is no trivial task! Noisy datasets are by no means useless, at least if they contain some desirable content. Therefore an alternative to filtering can be documentation (Bender et al., 2021). This can take the form of a per-language quality score and notes about known issues, a datasheet (Gebru et al., 2018) or nutrition label (Holland et al., 2018). However, we suggest researchers not release corpora with near-zero in-language content, as this may give the mistaken impression of usable resources.
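Of the filtering techniques listed above, the length-ratio heuristic is the simplest to illustrate. The following is a rough sketch only: the threshold of 2.0, the use of character (rather than token) length, and the function names are assumptions chosen for the example, and the sentence pairs are made-up toy data. A real pipeline would likely tune the threshold per language pair.

```python
def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Return True if the longer side is at most max_ratio times the shorter side.

    Lengths are measured in characters after stripping whitespace.
    Pairs with an empty side are rejected.
    """
    len_src, len_tgt = len(src.strip()), len(tgt.strip())
    if len_src == 0 or len_tgt == 0:
        return False
    ratio = max(len_src, len_tgt) / min(len_src, len_tgt)
    return ratio <= max_ratio


def filter_pairs(pairs, max_ratio: float = 2.0):
    """Keep only the (source, target) pairs that pass the length-ratio check."""
    return [(s, t) for s, t in pairs if length_ratio_ok(s, t, max_ratio)]


# Made-up toy pairs, for illustration only.
pairs = [
    ("The highway was completed in 1938.", "De snelweg werd in 1938 voltooid."),
    ("Home", "Klik hier om naar de startpagina van onze website te gaan."),
    ("", "Lege bron."),
]
kept = filter_pairs(pairs)
```

Such a filter only catches grossly mismatched pairs; it cannot detect the fluent but factually wrong "translations" discussed earlier, which is one reason no single heuristic is a quick fix.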
Finally, we encourage the community to continue conducting evaluations and audits of public datasets, similar to system comparison papers.

A Details on Language Code Issues
Table 7 provides a complete list of the corpora where one code is defined as a superset of the other by the ISO standard, and in Table 8 we provide a complete list of the language codes in JW300 which purport to be sign language but are actually unrelated high-resource languages. Special attention needs to be given to the JW300 dataset, which, in addition to the sign languages and superset code issues, has a variety of other peculiarities. These problems seem to originate in the codes used by jw.org, which were apparently not checked in the creation of the JW300 dataset. An overview is provided in Table 9, and the following paragraphs give specifics.
Twelve languages in JW300 have codes starting in jw_, suggesting they are varieties of Javanese (ISO639-1 jw), but are instead attempts to represent language dialects for which there are no BCP-47 codes. These codes seem to have been updated in jw.org to appropriate BCP-47 private-use extensions in the form <supercode>_x_<tag>, which are provided in Table 9.

Table 8: There are 48 languages in the JW300 corpus with language codes that correspond to sign languages, but in reality are unrelated high-resource languages (usually the most spoken language in the country of origin of the sign language). This table shows the actual language of the data corresponding to each sign language code.

Twelve languages have codes starting in jw_, suggesting they are varieties of Javanese, but are instead mis-parsed private-use extensions. Three codes appear in addition to equivalent ISO codes, making it unclear which languages they are. One language uses a deprecated ISO code. Four languages use the ISO639-3 code instead of the ISO639-2 code, and therefore are not BCP-47.
In addition to the jw_ tags, there are two other mis-used private subtags: hy_arevmda, which in addition to lacking the mandatory _x_ appears to represent standard Western Armenian (hyw); and rmy_AR, which, rather than being Romany from Argentina, is Kalderash Romany.
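The code problems described above lend themselves to a simple automatic triage before manual review. The sketch below is an assumption-laden illustration, not the procedure used in this audit: it treats the private-use form <supercode>_x_<tag> as a 2-3 letter lowercase supercode followed by `_x_` and an alphanumeric tag, flags codes with a `jw_` prefix, and sends any other code containing an underscore to manual review. The codes `abc_x_def` and `jw_abc` are hypothetical; `hy_arevmda`, `rmy_AR`, `kmr_latn` and `nya` are taken from the text.

```python
import re

# Assumed shape of the private-use form <supercode>_x_<tag>:
# a 2-3 letter lowercase supercode, the literal "_x_", and an alphanumeric tag.
PRIVATE_USE = re.compile(r"^[a-z]{2,3}_x_[A-Za-z0-9]+$")


def triage_code(code: str) -> str:
    """Sort a language code into a coarse bucket for manual follow-up."""
    if PRIVATE_USE.match(code):
        return "private-use"
    if code.startswith("jw_"):
        return "suspect: jw_ prefix"
    if "_" in code:
        # Could be a script or region subtag, or a private tag missing _x_.
        return "review: subtag without _x_"
    return "plain"


codes = ["abc_x_def", "jw_abc", "hy_arevmda", "rmy_AR", "kmr_latn", "nya"]
report = {code: triage_code(code) for code in codes}
```

A check like this only surfaces candidates; deciding what language the data is actually in still requires looking at the sentences.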
There are also a few anomalies where private use extensions should have been used but other methods were found to convey the distinctions. Three codes appear in addition to equivalent ISO codes, making it unclear which languages they are. Two of these are equivalencies between ISO639-2 and ISO639-3 (nya and ny are both Chichewa, qu and que are both Quechua), and one is a script equivalency (kmr and kmr_latn are both in Latin script). In these three cases the two codes do represent different languages, so a private use extension would have been appropriate.

Table 9: Language code issues in the JW300 datasets for 22 language varieties not covered by Tables 7 and 8. Private use extensions are given as they appear in jw.org, and specified as '?' if they are absent from jw.org.
Finally, there is the more minor issue that three languages use the ISO639-3 code instead of the ISO639-2 code, and therefore are not BCP-47.
In addition to the JW300-specific errors, Table 10 summarizes miscellaneous errors in CCAligned and OSCAR that were detailed in Section 5.

B Complete Error Taxonomy and Instructions
In addition to the examples given in Table 2, raters were provided with the following verbal notes on the error codes:
• CC: Correct translation, natural sentence: It's OK if it's a sentence fragment instead of a whole sentence, as long as it is not too short (about 5 words or greater). The translation does not have to be perfect.
• CS: Correct translation, but single word or short phrase: Also includes highly repeated short phrases, like "the cat the cat the cat the cat the cat ..."
• CB: Correct translation, but boilerplate: This can be auto-generated or formulaic content, or content that one deems "technically correct but generally not very useful to NLP models". Unfortunately, it's often not clear what should be counted as boilerplate... do your best.
• X: Incorrect translation [for parallel sentences]: Both source and target are in the correct language, but they are not adequate translations.
• WL: Wrong language: For short sentences, especially with proper nouns, there is often a fine line between "Wrong language" and "Not language". Do your best.
• NL: Not language: At least one of source and target is not linguistic content. Any sentence consisting only of a proper noun (e.g. "Tyrone Ping") should be marked as NL.
• U: Unknown: For sentences that need verification by a native speaker. This is an auxiliary label that is resolved in most cases.

C Methodological Notes
A surprising amount of work can be done without being an expert in the languages involved. The easiest approach is simply to search the internet for the sentence, which usually results in finding the exact page the sentence came from, which in turn frequently contains clues like language codes in the URL, or a headline like "News in X language", sometimes with references to a translated version of the same page. However, for the cases where this is insufficient, here are a few tips, tricks, and observations.
No Skills Required: Things that do not require knowledge of the language(s) in question.
1. "Not language" can usually be identified by anyone who can read the script, though there are tricky cases with proper nouns.
2. Frequently, "parallel" sentences contain different numbers in the source and target (especially autogenerated content), and are easy to disqualify.
3. Errors tend to repeat. If a word is mistranslated once, it will often be mistranslated many more times throughout a corpus, making it easy to spot.
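The second observation in the list above (differing numbers in source and target) can be approximated automatically. The sketch below compares the multisets of digit sequences on both sides; the function name `numbers_match` is an assumption for this example, and the first sentence pair is the en/bar example that appears with Table 6. It does not normalize spelled-out numbers, differing number formats, or non-Western digit systems, so a mismatch is a hint for a human rater rather than proof of misalignment.

```python
import re
from collections import Counter

# Matches maximal runs of decimal digits, e.g. "1938" or "86".
NUMBER = re.compile(r"\d+")


def numbers_match(src: str, tgt: str) -> bool:
    """Return True if both sides contain the same digit sequences, equally often."""
    return Counter(NUMBER.findall(src)) == Counter(NUMBER.findall(tgt))


# Pair taken from the misaligned en/bar example shown with Table 6.
misaligned = numbers_match(
    "In 1932 the highway was extended north to LA.",
    "1938 is de Autobahn bei Inglstod fertig gstellt.",
)

# Made-up pair whose numbers agree, for contrast.
aligned = numbers_match("Room 12, floor 3", "Zimmer 12, Etage 3")
```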
Basic Research Required: Things that do not require knowledge of the language(s) in question but can be done with basic research.
1. If it's written in the wrong script it's considered wrong language. (Sometimes the writing system is indicated in the published corpus, e.g. bg-Latn, but usually the language has a "default" script defined by ISO.)
2. Some types of texts come with inherent labels or markers, such as enumerators or verse numbers.
3. When all else fails, search the internet for the whole sentence or n-grams thereof! If the whole sentence can be found, frequently the language is betrayed by the web page (the language's autonym is useful in this case).
For each annotation label, we report the ratio of the annotated sentences (of max 100 sentences) that were assigned that label by the primary annotator. Repeated annotations done for agreement measurement are not included. The C column aggregates all correct sub-codes (CC, CS, CB). We also report the total number of sentences that each dataset contains for each language and the average sentence length for the audited sentences to illustrate differences across languages. The original language codes as they are published with the datasets are maintained for the sake of consistency (but should be handled with care in future work, see Section 5), and those with less than 20% correct sentences are highlighted.
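The per-label ratios and the aggregated C column described above can be computed as in the following sketch. The function name `label_ratios` and the example label counts are made up for illustration and do not correspond to any corpus in the audit; only the label set and the definition of C as the sum of CC, CS and CB come from the text.

```python
from collections import Counter

ALL_CODES = ["CC", "CS", "CB", "X", "WL", "NL", "U"]
CORRECT_CODES = ["CC", "CS", "CB"]  # sub-codes aggregated into the C column


def label_ratios(labels):
    """Return the fraction of audited sentences per label, plus the aggregate C."""
    total = len(labels)
    if total == 0:
        raise ValueError("no annotations to summarize")
    counts = Counter(labels)
    ratios = {code: counts[code] / total for code in ALL_CODES}
    ratios["C"] = sum(counts[code] for code in CORRECT_CODES) / total
    return ratios


# Made-up annotations for one hypothetical corpus of 100 audited sentences.
labels = ["CC"] * 50 + ["CB"] * 10 + ["WL"] * 25 + ["NL"] * 15
ratios = label_ratios(labels)
below_threshold = ratios["C"] < 0.20  # the highlighting criterion used in the tables
```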

Figure 1: Fraction of languages in each dataset below a given quality threshold (percent correct).

Figure 2: Percentage of sentences labeled as correct vs. log N sentences for all audited languages.

en: The highway near Inglstod was completed in 1938.

Table 4 :
Rater evaluation for a subset of audits from CCAligned (translated from English), measured by the accuracy (Acc-n) of annotations by non-proficient speakers against annotations by proficient speakers.

Table 5 :
Rater evaluation for a subset of audits from OSCAR, measured by the accuracy (Acc-n) of annotations by non-proficient speakers against annotations by proficient speakers.

en: The prime minister of the UK is Boris Johnson.
nl: De minister-president van Nederland is Mark Rutte.
en: The prime minister of the Netherlands is Mark Rutte.
en: The local time in Miami is 86 minutes.
en: In 1932 the highway was extended north to LA.
bar: 1938 is de Autobahn bei Inglstod fertig gstellt.

Table 7 :
Situations where two language codes are represented, but one is a superset of another by the ISO standard, making it unclear what data the supercode dataset contains.
*The als dataset is actually in gsw.

Table 10 :
Miscellaneous errors in language codes.

Table 11 :
Audit results for a sample of 100 sentences from CCAligned for each language pair, compared to the number of sentences available in the dataset. If fewer than 100 sentences were available, all sentences were audited. Language codes are as originally published. The length is measured in number of characters and averaged across the audited portion of each corpus. Languages with less than 20% correct sentences are boldfaced.

Table 12 :
Audit results for a sample of 100 sentences from WikiMatrix for each language pair, compared to the number of sentences available in the dataset. Language codes are as originally published. The length is measured in number of characters and averaged across the audited portion of each corpus. Languages with less than 20% correct sentences are boldfaced.

Table 13 :
Audit results for a sample of 100 sentences from ParaCrawl for each language pair, compared to the number of sentences available in the dataset. Language codes are as originally published. The length is measured in number of characters and averaged across the audited portion of each corpus.

Table 14 :
Audit results for a sample of 100 sentences from mC4 for each language, compared to the number of sentences available in the dataset. Language codes are as originally published. The length is measured in number of characters and averaged across the audited portion of each corpus. Languages with less than 20% correct sentences are boldfaced.

Table 15 :
Audit results for a sample of 100 sentences from OSCAR for each language, compared to the number of sentences available in the dataset. If fewer than 100 sentences were available, all sentences were audited. Language codes are as originally published. Length is measured in number of characters. Languages with less than 20% correct sentences are boldfaced.