Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch ...

Publisher: Journals Gateway

Transactions of the Association for Computational Linguistics (2022) 10: 50–72.

DOI: https://doi.org/10.1162/tacl_a_00447

Published: 31 January 2022

Abstract

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
