"Garbage In, Garbage Out"Revisited: What Do Machine Learning Application Papers Report About Human-Labeled Training Data?

Supervised machine learning, in which models are automatically derived from labeled training data, is only as good as the quality of that data. This study builds on prior work that investigated to what extent 'best practices' around labeling training data were followed in applied ML publications within a single domain (social media platforms). In this paper, we expand on that work by studying publications that apply supervised ML in a far broader spectrum of disciplines, focusing on human-labeled data. We report to what extent a random sample of ML application papers across disciplines give specific details about whether best practices were followed, while acknowledging that a greater range of application fields necessarily produces greater diversity of labeling and annotation methods. Because much of machine learning research and education only focuses on what is done once a "ground truth" or "gold standard" of training data is available, it is especially relevant to discuss issues around the equally important aspect of whether such data is reliable in the first place. This determination becomes increasingly complex when applied to a variety of specialized fields, as labeling can range from a task requiring little-to-no background knowledge to one that must be performed by someone with career expertise.


Introduction
Supervised machine learning (ML) is now widely used in many fields to produce models and classifiers from training data, which allows for automation of tasks such as: diagnosing medical conditions (Ye et al., 2003;Shipp et al., 2002), identifying astronomical phenomena (Ball and Brunner, 2010;Fluke and Jacobs, 2020), classifying environmental zones (Lary et al., 2016;Ma et al., 2017), or distinguishing positive versus negative sentiment in documents (Prabowo and Thelwall, 2009;Thelwall et al., 2010;Ravi and Ravi, 2015). Applying supervised ML requires labeled training data for a set of entities with known properties (called a "ground truth" or "gold standard"), which is used to create a classifier that will make predictions about new entities of the same type.
"Garbage In, Garbage Out" is a classic saying in computing about how problematic input data or instructions will produce problematic outputs (Mellin, 1957;Babbage, 1864), which is especially "Garbage In, Garbage Out" Revisited by Geiger et al. Quantitative Science Studies, Volume 2, Issue 2 (2021) relevant in ML. Yet data quality is often less of a concern in ML research and education, with these issues often passed over in major textbooks (e.g. Friedman et al., 2009;James et al., 2013;Goodfellow et al., 2016). Instead, the focus is typically on the domain-independent mathematical foundations of ML, with ML education and research often using clean, tidy, and pre-labeled "toy" datasets. While this may be useful for theoretically-oriented basic ML research, those applying ML in any given domain must also understand how low-quality or biased training data threatens the validity of the model (Buolamwini and Gebru, 2018;Dastin, 2018;Obermeyer et al., 2019;Geiger et al., 2020).
In this paper, we empirically investigate and discuss a wide range of issues and concerns around the production and use of training data in applied ML. Our team of seven labelers systematically examined published papers that applied supervised machine learning to a particular domain, sampling from three sets of academic fields: life and biomedical sciences; physical and environmental sciences; and social sciences and humanities. For each paper, we asked up to fifteen questions about how the authors reported using supervised ML and how they reported obtaining the labeled training data used to produce the model or classifier. We particularly focus on human-labeled or human-annotated training data, in which one or more individuals make discrete assessments of items. Given that many issues and biases can emerge around human labeling, we examine whether papers reported following best practices in human labeling.
Our project is based on the methodology of structured content analysis, which seeks to systematically turn qualitative phenomena into categorical and quantitative data (Riff et al., 2013). We draw on and situate our study within the growing efforts to bridge the fields of qualitative and quantitative science studies (Leydesdorff et al., 2020;Bowker, 2020;Cambrosio et al., 2020;Kang and Evans, 2020). Quantitative science studies often examines the outputs of science, such as analyzing bibliometrics and other already-quantitative trace data to understand how scientists' final products have been received within science and other institutions. In contrast, qualitative science studies often examines the research process "in action" (Latour, 1987) to investigate case studies of how science is produced, such as using more ethnographic or historical methods. This project is in between these two traditions: our method involves systematically quantifying information from qualitative sources, rather than using already-quantitative trace data; we examined a broad set of publications from across domains, rather than more in-depth case studies; and we analyzed and quantified information about research practices, rather than how publications are cited.
As our research project was a human-labeling project studying other human-labeling projects, we took care in our own practices. Before the research project began, we detailed all questions and valid responses, developed instructions with examples, and had a discussion-based process of reconciling disagreements. Another key issue in data labeling is construct validity and operationalization (Jacobs and Wallach, 2019): is the labeling process actually capturing the theoretical construct that the authors are claiming to capture? In our study, we only have access to the paper reporting about the study and not the actual study or dataset itself. This means our fundamental unit of analysis must be what the papers report, even though our broader intent is to understand what the study's authors and labelers actually did. Many papers either do not discuss such details at all or discuss them without sufficient detail to make a determination. For example, many papers did note that the study involved the creation of an original human-labeled dataset, but did not specify who labeled it. For some of our items, one of the most common labels we gave was "no information." This is a concerning issue, given how crucial such information is in understanding the validity of the training dataset, and by extension, the validity of the classifier.

The problem with low-quality and biased training data
Curating high-quality training datasets for machine learning involves skill, expertise, and care, especially when items are individually labeled by humans. There can be disastrous results if training datasets are taken as a gold standard when they should not be. Supervised ML models are typically evaluated exclusively using a held-out subset of the original training dataset, making systematic flaws in a training dataset difficult to identify or audit within the traditional paradigm of ML. These concerns are particularly pressing when ML is used for deeply subjective and politicized decisions, like in finance, hiring, welfare, and criminal justice. Many ML training datasets have been found to be systematically biased along various axes, including race and gender, which impacts the accuracy of those ML models (e.g. Buolamwini and Gebru, 2018). In other cases, more subtle issues arise around labels, such as a paper claiming to have produced an ML classifier distinguishing criminals from non-criminals using only facial images, with allegedly overwhelmingly high accuracy (Wu and Zhang, 2016). As Bergstrom and West (2020) critique, their labels were problematically derived from the source of the photos: criminals were taken from prison mugshots, while non-criminals were taken from professional social network profiles. Because people generally do not smile in mugshots but do smile in profile photos, Bergstrom and West argue that the original team effectively built a smile classifier, but claimed it was a criminality classifier.
In another domain, an exposé (Dastin, 2018) reported that Amazon built an internal ML system for hiring that was later scrapped after it was determined to have substantial gender biases. The training dataset used was based on hiring managers' past decisions, where resumes from those hired were given one label and those not hired were given another label. The classifier was thus trained to approximate years of past decisions, and given that Amazon has had significant gender gaps in their workforce (like many tech companies), this meant such systematic biases were reinforced and rationalized through ML. This is the case even though gender was intentionally excluded as a feature in the model, as the classifier used other features that were a proxy for gender to more closely approximate the biases in the training data. Had the training data been a new dataset labeled by a diverse team of trained HR professionals tasked with evaluating resumes with a focus on non-discrimination, this may have produced a quite different classifier.
Machine learning in the field of medicine is poised for explosive growth, although critics raise similar concerns about training data. Medical privacy risks arise for patients whose health care records may be used in formulating a training dataset (Vayena et al., 2018). Furthermore, there is evidence of biases in health care applications of ML, and in some instances, the consequences of biases may directly impact patients' survival. One study in the U.S. labeled patients' medical records with their severity of illness, using a proxy variable that ostensibly required little human judgment: the cost of the patient's healthcare. Yet when this data was used to train a classifier, it caused significant bias against African American patients, who historically have had differential access to medical care (Obermeyer et al., 2019). The medical field itself is encountering new questions surrounding human labeling and annotation. For example, one widely used application is the interpretation of medical imaging. The human who labels MRI images as cancerous or not cancerous must have specific expertise, unlike someone who labels product reviews as positive or negative. Meta-research in radiology has found practicing radiologists have about a 3-5% error rate (Brady, 2016), which raises the question of whether radiology training datasets should be independently labeled by multiple experts to ensure data quality. Finally, as with many fields, the introduction of ML using pre-existing data from a particular environment and setting has the distinct potential to reproduce and perpetuate existing systemic biases, especially when that classifier is deployed to a different environment and setting (DeCamp and Lindvall, 2020).

"Garbage in, garbage out" version 1
This project is heavily based on a prior study (Geiger et al., 2020), which similarly had a team of labelers examine issues around training data in a random sample of published papers. That study examined a narrow subset of peer-reviewed and preprint papers in a specific field: applied ML papers trained on Twitter data. They looked for 13 pieces of information in each paper, which they argued were important to understanding the validity of the training data labeling process. This included if the data was human or machine labeled, who the labelers were, how many labelers rated each item, and rates of inter-rater reliability (if multiple labelers rated each item). The study found a wide divergence both in the level of information reported and in adherence to best practices in human labeling. For example, of papers reporting a new human-labeled training dataset, about 75% gave some information about who the labelers were, 55% specified the number of labelers, 11% released the training dataset itself, and 0% reported how much crowdworkers were paid for their work.
We expanded on Geiger et al's study, drawing heavily from their published questions and protocols. We followed the same general process of having labelers rate each item independently, then reconciled disagreements through a discussion led by the team leader. We made some small modifications and extensions to the questions, which were recommended by the original authors for future work or were better suited to the expanded scope. We added questions about the field/domain of the paper and about the reconciliation process when multiple labelers labeled each item. We also rewrote some of the labeling instructions, label categories, and provided examples, often to clarify ambiguities. Geiger et al (2020) gives a substantial review of existing work around human labeling of training data, including an extensive discussion of best practices in this work. They argue that much of labeling work for ML is a form of structured content analysis, which is a methodology long used in the humanities and social sciences to turn qualitative or unstructured data into categorical or quantitative data. This involves teams of "coders" (also called "annotators", "labelers", or "reviewers") who "code", "annotate", or "label" items individually. (Note that we use label/labeler in this paper, although we began with using annotate/annotator, which is still present in some of our data and protocols.) One textbook describes content analysis as a "systematic and replicable" (Riff et al., 2013, p. 19) method with established best practices, as Geiger et al summarizes:

Best practices in human labeling of training data
A "coding scheme" is defined, which is a set of labels, annotations, or codes that items in the corpus may have. Schemes include formal definitions or procedures, and often include examples, particularly for borderline cases. Next, coders are trained with the coding scheme, which typically involves interactive feedback. Training sometimes results in changes to the coding scheme, in which the first round becomes a pilot test. Then, labelers independently review at least a portion of the same items throughout the entire process, with a calculation of "inter-rater reliability" (IRR) or "inter-annotator agreement" (IAA). Finally, there is a process of "reconciliation" for disagreements, which is sometimes by majority vote without discussion and other times discussion-based. (Geiger et al., 2020, p. 2-3) Structured content analysis is a difficult task, requiring both domain-specific expertise about the phenomenon to be labeled and domain-independent expertise to manage teams of labelers. Historically, undergraduate students have often performed such work for academic researchers. With the rise of crowdwork platforms like Amazon Mechanical Turk, crowdworkers are often used for data labeling tasks. New software platforms have been developed to support more micro-level labeling and annotation or labeling at scale, including in citizen science (Chang et al., 2017;Pérez-Pérez et al., 2015;Bontcheva et al., 2013;Nakayama et al., 2018). For example, the Zooniverse (Simpson et al., 2014) provides a common platform for citizen science projects across domains, where volunteers label data under scientists' direction.

Meta-research and methods papers in linguistics and NLP
We also draw inspiration from meta-research and standardization efforts in Linguistics and Natural Language Processing (NLP) (Bender and Friedman, 2018;McDonald et al., 2019). These fields have developed extensive literatures on standardization and reliability of linguistic labels, including best practices for corpus annotation (e.g. Hovy and Lavid, 2010; Doddington et al., 2004; Linguistic Data Consortium, 2008). In Geiger et al's 2020 study, the publisher with the highest information score was the Association for Computational Linguistics. There has been much work in linguistics and NLP around these issues, such as Sap et al's study of racial bias among labelers (Sap et al., 2019). Blodgett et al conducted a content analysis of how 146 NLP papers discuss "bias" and found that while this has become a prominent topic in NLP, papers' discussions of motivations and methods around bias "are often vague, inconsistent, and lacking in normative reasoning" (Blodgett et al., 2020, p. 5454). There is also related work in methods papers focused on identifying or preventing "low-effort" responses from crowdworkers (Mozetič et al., 2016;Soberón et al., 2013;Raykar and Yu, 2012), which raise issues around fair labor practices and compensation (Silberman et al., 2018).

The open science, reproducibility, and research integrity movements
Two related movements in computationally-supported knowledge production have surfaced issues around documentation. First, open science is focused on broader availability of the products of research and research infrastructure, including open access to publications, software tools, datasets, and analysis code (Fecher and Friesike, 2014). The related reproducibility movement calls for researchers to make protocols, datasets, and analysis code public, often focusing on what others need to replicate the original study (Wilson et al., 2017;Kitzes et al., 2018). Such requirements have long been voluntary, with few incentives to be a first mover, but funding agencies and publications are increasingly establishing such requirements (Goodman et al., 2014;Gil et al., 2016).
One notable effort is around formally specifying what each author of a paper actually did, which has long been standard in medical journals (Rennie et al., 2000). Author role documentation has gained popularity with the more recent Contributor Roles Taxonomy Project (or CRediT) (Brand et al., 2015). CRediT declarations are increasingly required by journals, which has led to novel quantitative science studies research (Larivière et al., 2020). We also draw inspiration from work about capturing information in ML data flows and supply chains (Singh et al., 2019;Schelter et al., 2017;Gharibi et al., 2019) and developing tools to support data cleaning (Schelter et al., 2018;Krishnan et al., 2016). We note that this work has long been part of library and information science, particularly in Research Data Management (Schreier et al., 2006;Borgman, 2012;Medeiros and Ball, 2017;Sallans and Donnelly, 2012). There is much more work to be done on quantitatively studying issues around research integrity (Zuckerman, 2020;Silberman et al., 2018), which institutionally has often been limited to more egregious and blatant cases of plagiarism and fabrication.

Fairness, Accountability, and Transparency in Machine Learning
Within the field of machine learning, there is a growing movement in the Fairness, Accountability, and Transparency (or FAccT) sub-field, with many recent papers proposing training data documentation in the context of ML. Various approaches and metaphors have been taken in this area, including "datasheets for datasets" , "model cards" (Mitchell et al., 2019), "data statements" (Bender and Friedman, 2018), "nutrition labels" (Holland et al., 2018), a "bill of materials" (Barclay et al., 2019), "data labels" (Beretta et al., 2018), and "supplier declarations of conformity" (Hind et al., 2018). Many go far beyond the concerns we have raised around human-labeled training data, as some are also (or primarily) concerned with documenting other forms of training data, model performance and accuracy, bias, considerations of ethics and potential impacts, and more. Our work is strongly aligned with this movement, as we seek to include data labeling within these areas of concern. However, as we discuss in our conclusion, a single one-size-fits-all standard may be necessary but not sufficient to address concerns of fairness and bias.
We also call attention to those developing methods for "de-biasing" machine learning, which is a fast-moving and contentious research area (for surveys and comparative work, see (Mehrabi et  However, other work has approached these issues more as a problem of dataset pre-processing (Calmon et al., 2017) or database repair (Salimi et al., 2020). Critics note that domain-independent approaches may fall into what Selbst et al identify as "abstraction traps" (Selbst et al., 2019, p. 60), such as failing to account for the particularities of different kinds and qualities of discrimination in a given social context, a critique Hanna et al (2020) make of fairness research that treats race as a single fixed attribute. We did not ask any questions about how papers discuss de-biasing or data cleaning due to the large number of questions we were already asking and the novelty of such approaches, but these concerns are deeply related.

Data: machine learning papers performing classification tasks
Our goal was to find a corpus of papers using supervised ML across disciplines and application domains, including papers producing an original labeled dataset using human labeling. We used the Scopus bibliographic database (Baas et al., 2020), which contains about 40,000 publications that a review board has verified for various qualities, including being peer reviewed, regularly published for at least 2 years, and governed by a named editorial board of experts. We searched for journal articles and conference proceedings from 2013 to 2018 where the title, abstract, or keywords included "machine learning" and either "classif*" or "supervi*" (case insensitive). We ran three stratified samples across Scopus's Subject Area classifications 1 : Physical Sciences (which includes engineering and earth/ecological sciences); Social Sciences & Humanities (a single category); and Life Sciences & Health Sciences (two categories, which we combined).
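The search criteria above can be approximated locally as a filter over bibliographic records. The following is a minimal sketch and not the query used in this study: the record fields (`title`, `abstract`, `keywords`, `year`) and the example records are hypothetical, and the regular expressions are only an assumed stand-in for the database's own wildcard matching.

```python
import re

# Case-insensitive patterns mirroring the described criteria:
# the phrase "machine learning" AND a word starting with "classif" or "supervi".
ML_PHRASE = re.compile(r"machine learning", re.IGNORECASE)
TASK_STEM = re.compile(r"\b(?:classif|supervi)\w*", re.IGNORECASE)


def matches_search(record):
    """Return True if the record's title, abstract, or keywords meet the criteria."""
    text = " ".join([
        record.get("title", ""),
        record.get("abstract", ""),
        " ".join(record.get("keywords", [])),
    ])
    in_years = 2013 <= record.get("year", 0) <= 2018
    return in_years and bool(ML_PHRASE.search(text)) and bool(TASK_STEM.search(text))


# Hypothetical records, for illustration only.
records = [
    {"title": "Supervised machine learning for tumor detection", "year": 2015},
    {"title": "Machine learning survey", "abstract": "We review clustering methods.", "year": 2016},
    {"title": "Deep classifiers", "abstract": "A machine learning approach.", "year": 2010},
    {"title": "Land cover mapping", "keywords": ["Machine Learning", "Classification"], "year": 2018},
]
selected = [r["title"] for r in records if matches_search(r)]
```

In this sketch the second record is excluded because it contains neither stem, and the third because it falls outside the 2013 to 2018 window.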

Labeling team, training, and workflow
Our labeling team included one research scientist who led the project (RSG) and undergraduate research assistants, who worked 6-10 hours per week for course credit as part of a university-sponsored research experience program (DC, JI, ML, AS, JW, and RT). The project began with six students for one semester, five of whom continued on the project for the second semester. All students had some coursework in computer science and/or data science, with a range of prior experience in machine learning in both a classroom and applied setting. Students' majors and minors included Electrical Engineering & Computer Science, Data Science, Statistics, Economics, Linguistics, and Biology. For the first four weeks, the team leader trained the students in both broader ML concepts and the specific questions to be answered for this project. The team first labeled and discussed a practice set of 40 papers sampled from across the three corpora, which were not included in the final dataset. In these initial weeks, the team learned the coding schema and the reconciliation process, which were further refined. Following this training, the labeling workflow was that each week, a set of papers was randomly sampled from one corpus, typically 10-15 papers. The students independently reviewed and labeled the same papers, using different web-based spreadsheets to record labels. The team leader synthesized labels and identified disagreement. The team met in person or by videochat to discuss the week's cases of disagreement. The team leader explained various issues in question and built a consensus about the proper label (as opposed to purely majority vote). The team leader had the final say when a consensus could not be reached.
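The weekly workflow described above (sample a batch, collect independent labels, then flag disagreements for discussion) can be sketched roughly as follows. This is an illustrative sketch under assumed data structures, not the project's actual scripts: the function names, paper identifiers, question names, and the dictionary layout of each labeler's sheet are all hypothetical.

```python
import random


def sample_weekly_batch(remaining_ids, batch_size, rng):
    """Draw a batch without replacement; return (batch, ids still unlabeled)."""
    batch = rng.sample(sorted(remaining_ids), batch_size)
    still_remaining = set(remaining_ids) - set(batch)
    return batch, still_remaining


def find_disagreements(labels_by_labeler):
    """Return {(paper_id, question): {labeler: label}} for items whose labels differ."""
    merged = {}
    for labeler, sheet in labels_by_labeler.items():
        for key, label in sheet.items():
            merged.setdefault(key, {})[labeler] = label
    return {key: votes for key, votes in merged.items()
            if len(set(votes.values())) > 1}


# Hypothetical corpus of 50 paper ids and a batch of 12.
rng = random.Random(0)
batch, remaining = sample_weekly_batch({f"paper_{i}" for i in range(50)}, 12, rng)

# Hypothetical independent labels from two labelers' separate sheets.
labels_by_labeler = {
    "labeler_a": {("paper_1", "label_source"): "human", ("paper_1", "irr_reported"): "no"},
    "labeler_b": {("paper_1", "label_source"): "human", ("paper_1", "irr_reported"): "yes"},
}
to_discuss = find_disagreements(labels_by_labeler)
```

Only the flagged items would go to the discussion meeting; the sketch deliberately stops short of resolving them, since reconciliation here was consensus-based rather than computed.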
All 200 papers were labeled by at least four labelers; one labeled 137 items and another labeled 100 items. Following the first round of labeling and reconciliation, we conducted a second round of verification. Where there was any initial disagreement on labels in the first round, each paper was re-examined and discussed by at least two labelers and the team leader. The second round began multiple months after the first round, meaning that there was at least one month between when each paper was examined and re-examined. If there was still disagreement, the final decision was made by the team leader. The team leader did a final check to review every label for all 200 papers.

Inter-rater reliability and labeled data quality
All human labeling projects that involve multiple labelers should evaluate the intersubjective reliability of the labeling process (Tinsley and Weiss, 1975). We present three inter-rater reliability (IRR) metrics. For all metrics, we re-coded "unsure" and blank responses to both be blank (NaN), but treated "N/A" answers as a distinct judgment. First, we calculated mean total agreement, or the proportion of items where all labelers initially gave the same label before reconciliation, not counting blank & unsure responses. As Table 2 shows, this is a more stringent metric: all non-blank/unsure responses must be the same for an item to have a score of 1; otherwise the score is 0. Second, we present the mean percent correct rate, which is the proportion of labelers who initially gave the label that was arrived at after discussion & reconciliation, also not counting blank & unsure responses. As Table 2 shows, this is a more forgiving metric: if 5 out of 6 labelers initially give the final correct label, the score is 0.83 for that item. For these two metrics, we calculated per-question scores by taking the mean of the scores for all items. We also present the widely-used Krippendorff's alpha (Krippendorff, 1970) metric, although we strongly advise against relying on it. Our data does not meet the statistical assumptions of either Fleiss's kappa or Krippendorff's alpha, which are popular because they support missing labels for 3+ labelers and take into account the possibility that raters made decisions based on random chance. However, this requires assuming a uniform prior possibility of such a random distribution, which generally only applies if each possible response by raters is equally likely. Rates can be dramatically lower when there is a highly skewed distribution of response categories (Quarfoot and Levine, 2016;Oleinik et al., 2014).
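To illustrate how a skewed label distribution can depress a chance-corrected metric, the sketch below implements Krippendorff's alpha for nominal labels from its coincidence-matrix definition and compares two made-up datasets with identical raw agreement. It is a simplified illustration, not the simpledorff-based calculation used for our reported scores; the function name and both toy datasets are hypothetical.

```python
from collections import Counter
from itertools import permutations


def nominal_alpha(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists; each inner list holds the labels given to one
    item, with None marking a missing (blank/unsure) label.
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels are not pairable
        # Every ordered pair of labels from different raters, weighted by 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        return float("nan")  # not enough pairable labels

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # only one category was ever used; alpha is undefined
    return 1 - observed / expected


# Two raters, 100 items, 98 agreements in both toy datasets.
skewed = [["no", "no"]] * 98 + [["yes", "no"], ["no", "yes"]]
balanced = ([["yes", "yes"]] * 49 + [["no", "no"]] * 49
            + [["yes", "no"], ["no", "yes"]])

raw_agreement = sum(a == b for a, b in skewed) / len(skewed)
alpha_skewed = nominal_alpha(skewed)
alpha_balanced = nominal_alpha(balanced)
```

Both toy datasets have 98% raw agreement, yet alpha is close to zero for the skewed one and high for the balanced one, because nearly all expected agreement in the skewed case is attributed to chance.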
Our dataset has highly skewed distributions, especially for many of the more specialized questions, which leads to minuscule scores for some questions with especially skewed distributions (e.g. prescreening for crowdwork; reported inter-rater reliability). Table 3 presents both our custom metrics and Krippendorff's alpha for all questions. Mean total agreement rates ranged from 37.5% to 66%, with an average of 48.0% across all questions. Mean percent correct rates ranged from 65.4% to 85.8%, with an average of 73.1% across all questions. Some questions that had lower rates (especially for mean total agreement) were due to a labeler making an incorrect assessment on an earlier question, which determines whether they answer subsequent questions or mark them as 'N/A'.
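The two custom metrics can be sketched in a few lines. The following is a minimal illustration of the definitions given above, assuming blank and "unsure" responses have already been recoded to `None` and that "N/A" is kept as an ordinary label; the function names and the three example items are hypothetical, and this is not the analysis notebook itself.

```python
def total_agreement(initial_labels):
    """1.0 if all non-blank initial labels match, else 0.0 (None = blank/unsure)."""
    given = [label for label in initial_labels if label is not None]
    if not given:
        return None  # nothing to score for this item
    return 1.0 if len(set(given)) == 1 else 0.0


def percent_correct(initial_labels, final_label):
    """Share of non-blank initial labels that equal the reconciled final label."""
    given = [label for label in initial_labels if label is not None]
    if not given:
        return None
    return sum(label == final_label for label in given) / len(given)


def mean_score(scores):
    """Per-question score: mean over items, skipping items that could not be scored."""
    usable = [s for s in scores if s is not None]
    return sum(usable) / len(usable)


# Hypothetical items for one question: (initial labels, final reconciled label).
items = [
    (["yes", "yes", "yes", "yes", "yes", "no"], "yes"),  # 5 of 6 match the final label
    (["no", "no", None, "no"], "no"),                    # blank response is ignored
    (["N/A", "yes", "N/A", "N/A"], "N/A"),               # "N/A" counts as a judgment
]
mean_total_agreement = mean_score([total_agreement(labels) for labels, _ in items])
mean_percent_correct = mean_score([percent_correct(labels, final) for labels, final in items])
```

The first item reproduces the 5-out-of-6 case from the text: it scores 0 on total agreement but roughly 0.83 on percent correct, which is why the second metric is the more forgiving of the two.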
In interpreting these metrics, we note that the standard approach of human labeling checked by IRR metrics treats individual humans as scientific instruments that turn complex phenomena into discrete structured data. If there is a high degree of inter-rater reliability, then reconciliation can easily take place through a majority vote process involving no discussion, or if rates are quite high, then many researchers assume they can use just one of those human labelers per item in future work. These rates were not high enough for us to have confidence that we could have a purely quantitative / majority-vote reconciliation process, much less a process of only using one labeler per item. However, these rates are sufficient to show there is enough agreement to proceed to a discussion-based reconciliation process and a final check of all items by the team leader. As McDonald et al (2019) discuss, standardized IRR metrics like Krippendorff's alpha are useful in highly structured labeling projects that do not have a discussion-based reconciliation process, as they only evaluate the agreement of independent initial labels. Such metrics would be more essential to the validity of our study if we were conducting a quantitative, majority-rule reconciliation process or if only a subset of items were reviewed by multiple labelers. We included mean percent correct rates to partially include the reconciliation and verification process. Furthermore, our approach was largely focused on identifying the presence or absence of various kinds of information within long-form publications. This is a different kind of human judgment than is typically involved in common tasks using human labeling for ML (e.g. labeling a single social media post for positive/negative sentiment) or traditional social science and humanities content analysis (e.g. categorizing newspaper articles by topic).
Our items were full research publications with many pages of detail, which followed many different field-specific conventions and genres. Our labelers were looking for up to 15 different kinds of information per paper, each of which could be found anywhere in the paper. We reflected that in our reconciliation process, most of the time when labelers disagreed, it was because some had caught a piece of information in the paper that others had not seen. Once that information was brought to the group, it was most often the case that some labelers said that they had missed that information and changed their response. It was less common for our team to have disagreements arising from two labelers differently interpreting the same text, especially after the first few weeks. For such reasons, we are relatively confident that if, after our process, no individual member of our team has identified the presence of such information, then it is quite likely it is not present in the paper.

Software, datasets, and research materials
We used Google Sheets to enter labels. For computational analysis and scripting for corpus collection, data management, and data analysis, we used Python 3.7 (van Rossum, 1995), using the following libraries: Pandas dataframes (McKinney, 2010) for data parsing and transformation; SciPy (Jones et al., 2001) and NumPy (van der Walt et al., 2011) for quantitative computations; Matplotlib (Hunter, 2007) and Seaborn (Waskom et al., 2018) for visualization; and simpledorff (Perry, 2020) for IRR calculations. Analysis was conducted in Jupyter Notebooks (Kluyver et al., 2016) using the IPython (Pérez and Granger, 2007) kernel.
Datasets, analysis scripts, labeling instructions, and other supplementary information can be downloaded from GitHub 2 and Zenodo. 3 Datasets include all labels from all labelers for the first round of independent labeling and the consolidated set of final labels and scores for all items. Paper URLs/DOIs have been anonymized with a unique salted hash. Analysis scripts are in Jupyter Notebooks and can be explored and modified in any modern web browser using the cloud-based MyBinder.org (Project Jupyter et al., 2018). 4
Figure 1 shows a summary of results. For this figure, we recoded (or consolidated) some questions with many answers to reflect whether the paper reported an answer to that question. For example, for "original human labeling source," any answer that specified a source is "yes," while "no information" is "no." This is also how we calculated paper information scores in section 5. Figure 1 illustrates how we asked more detailed questions for papers based on answers to prior questions. For example, 103 papers used labels from human judgment, answered either "yes" or "unsure (assumed yes)," and the next three questions were answered for those 103 papers. The remaining 10 questions were answered for the 45 papers that could be assumed to use original human labeling, with

Original ML classification task
The first question was whether the paper was conducting an original classification task using supervised machine learning. Our keyword-based process of generating the corpus included some papers that used ML keywords but were not actually presenting a new ML classifier. However, defining the boundaries of supervised ML and classification tasks is difficult, particularly for papers that are long, complex, and ambiguously worded. We defined machine learning broadly: any automated process that does not exclusively rely on explicit rules, in which the performance of a task increases with additional data (Mitchell, 1997, p.2). We decided this can include simple linear regressions, although there is much debate about if and when simple linear regressions are a form of ML. However, as we were also looking for classification tasks, linear regressions were only included if they were used to make a prediction in a set of defined classes. We defined an "original" classifier to mean a classifier the authors made based on new or old data, which excludes the exclusive use of pre-trained classifiers or models. We found that some papers claimed to be using ML, but when we examined the details, these did not fall under our definition. As table 4 shows, the majority of papers in our dataset were involved in an original classification task. We placed 10 papers in the "N/A" category, meaning they did not give enough detail for us to determine, were not in English, were not able to be accessed, or were complex boundary cases.

Classifier area/domain
The next question categorized the paper into one of eight fields/areas of study. We had sampled three broad disciplinary categories (Social Sciences & Humanities, Biomedical & Life Sciences, and Physical & Environmental Sciences), which are determined by Scopus on a per-journal/conference level. We made these area/domain determinations based on the paper's content, without consulting the Scopus-provided category. As table 5 shows, our data set contained a wide variety of ML application fields. Medical papers were the plurality, followed by Linguistic papers; the Biological, Physical, Soft/hardware, and Geo/ecological categories were of similar size.

Labels from human judgment
While all approaches to curating training data involve some kind of human judgment, this question focused on cases where humans made discrete judgments about a set of specific items, which were then turned into labels for training data. More than a quarter of the papers in our corpora used some form of automation, scripting, or quantitative thresholds to label items. For example, one boundary case used medical records to label patients with or without high blood pressure (hypertension). We decided that if a medical practitioner made a diagnosis that researchers used as the label, it was human-labeled. If the researchers set a quantitative threshold for high blood pressure, then parsed medical records for blood pressure readings with a script, it was not human-labeled. In addition, individual human labeling could be done for all of the paper's training data (the typical case) or only a portion. For example, some authors reported using scripts or thresholds to label some items (e.g. the 'easy' cases), then labeled the remaining items manually.
"Garbage In, Garbage Out" Revisited by Geiger et al.
In some instances, we determined the answer could be an "implicit yes" if ample evidence indicated a particular labeling method that most likely used humans at some point, but it was not explicitly stated by the authors. For example, many medical papers reported using diagnoses from a patient's medical records as labels. Some of these papers gave substantial detail about who originally made the diagnosis and even what diagnostic criteria were used, while others generated labels based on medical records and did not explicitly state that a human (e.g. a medical practitioner) made the diagnosis. If we could reasonably assume a human was involved in the original diagnosis, we generally labeled the second type of papers as "no information (implicit yes)." One paper was far less clear about the source of the data than other "implicit yes" papers, such that we labeled it "Unsure." However, we included the paper in subsequent questions because we felt we could still answer them, as the paper re-used externally-obtained data for labeling.
As table 6 shows, the second most common response was for papers that do not clearly state whether their labeling was performed by a human or a machine, but contain enough contextual details for us to be reasonably confident in assuming that human labeling was used. Note that this question was originally titled "Labels from human annotation" throughout the labeling and reconciliation process, but was renamed in the analysis stage to better reflect the instructions.

Human labeling for training versus evaluation
This question and all subsequent questions were only applicable to papers that involved human labeling, which had "yes" or "implicit" designations to the previous question. This allowed for further specification of human labeled data usage within each publication. Human labeling for training data is the typical case, where labels are created and then used to train the classifier. Often part of this data is held out as a test set to evaluate the classifier. Human labeling for evaluation only is when the authors of the paper train the classifier using non-human-labeled data, but use humans to either evaluate the validity of that dataset or the classifier. The overwhelming majority of papers took the more standard approach of using labels as training data, but a few did have human evaluation of classifiers trained with machine-labeled data. This question had lower rates of "unsure," where the paper did not give enough information to make a determination.

Original and/or external human-labeled data
Our next question was about whether papers that used human labeling used original human labeling, which we defined as a process in which the paper's authors obtained new labels from human judgments for items. This is in contrast to externally-obtained data, which involves re-using existing private or public datasets of human judgments. Most of the papers in our corpus that used labels from human judgment were re-using externally-labeled data. Our assumption behind this question is that papers which rely on existing datasets may have less of a burden to discuss the details around the labeling process in the paper itself, as readers could review the cited paper for such details. In some cases, external and original human labeling were combined, such as if authors re-used an existing labeled dataset and then further labeled it for additional information.
Like the prior question, this question had lower rates of "unsure / no information" where the paper did not give enough information to make a determination. We note that for all of the papers we labeled as "unsure / no information" we had enough contextual or implicit information to assume that it was not a re-used / externally-labeled dataset. This means that the total number of papers we assume to include at least some original human labeling is 45.

Summary of ML papers' approaches to training data
We synthesized responses to the prior questions to summarize the general breakdown of applied ML publications' approach to their data. Out of the 141 papers in our sample that presented an original ML classifier, 27% used machine-labeled data (either machine-labeled by the authors or from a re-used dataset), 41% used an existing human-labeled dataset, 27% produced a novel human-labeled dataset, and 5% did not provide enough information for us to answer. Table 9 and Figure 2 present these results by corpus, which show few differences at this level.

Original human labeling source
Our next question asked who the labelers were, for the 45 papers that used original human labeling. As table 10 shows, we found a diversity of approaches to the recruitment of human labelers. The plurality of papers gave no information about who performed their labeling task. The "survey/self-reported" category refers to papers that have individuals label data they generated, which included surveys as well as studies like those using motion tracking, where subjects were recorded performing different physical gestures. In contrast to Geiger et al.'s prior findings about papers that used Twitter data, none of the papers in our dataset reported using crowdworking platforms. We did not consider volunteer citizen science crowdsourcing platforms to be crowdworking.

Labeler compensation
The next question asked whether labelers were compensated for their work and, if so, how. Our labels for compensation included money or gift cards, class credit, paper authorship, other compensation, explicitly stating no compensation was given (or volunteers), and no information. Most publications did not provide this information, so the majority of papers received the label "no information" for this question.

Training for human labelers & formal instructions
The next two questions focused on how labelers were prepared for their work. We defined training as practicing the labeling task with interactive feedback (e.g. being told what they got right or wrong, or being able to ask questions) prior to starting the main labeling work for the study. Formal instructions are documents or videos containing guidelines, definitions, and examples that the labelers could reference as an aid. In two cases, the paper gave enough detail for us to know that no definitions or instructions were given to labelers beyond the text of the question, but about half of papers did not give enough information to make a determination.

Multiple labeler overlap
Our next three questions were all about using multiple labelers to review the same items. Having multiple independent labelers is typically a foundational best practice in structured content analysis, so that the integrity of the labels and the schema can be evaluated (although see McDonald et al., 2019). For multiple labeler overlap, our definitions required that papers state whether all or some of the items were labeled by multiple labelers; otherwise "no information" was recorded. We can reasonably assume that papers which did not mention whether multiple labelers were used for each item did not engage in this more intensive process, although we cannot be certain. As table 14 shows, very few papers mentioned using multiple labelers per item, with the overwhelming majority not giving any indication.

Synthesis of labeler overlap & reported inter-rater reliability
The next two questions built on the previous question and were only answered if the paper had been given the label of "yes for all items" or "yes for some items." For these papers that had multiple labeler overlap, we examined the method by which labeler disagreement was reconciled and whether any inter-rater reliability (IRR) or inter-annotator agreement (IAA) metric was reported. We did not record what kind of IRR/IAA metric was used, such as Cohen's kappa or Krippendorff's alpha, but many different metrics were used. We also did not record what the exact statistic was, although we did notice a wide variation in what was considered an acceptable score.
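To make concrete what an IRR metric measures, the following is a minimal sketch of Cohen's kappa for two labelers, one of the metrics named above. It is an illustration only: the function name and the example labels are hypothetical, and it is not the procedure used in this study or in any of the papers reviewed.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who each labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labelers must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items where both labelers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label given each
    # labeler's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two labelers on eight items (invented for illustration).
labeler_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
labeler_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
print(cohens_kappa(labeler_1, labeler_2))  # 0.5
```

Here the labelers agree on 6 of 8 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5. Whether such a value counts as acceptable is exactly the kind of threshold that varied across the papers we reviewed.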

Total and median number of human labelers
We then asked two final questions regarding how many individuals completed a paper's labeling task. Because this information can be presented differently based on the labeling process, we divided this into two questions. The total number of human labelers referred to all human labelers involved in the project at any time. The median number of human labelers per item referred to how many labelers evaluated each item in a publication's dataset, which was greater than one in the case of papers that had multiple labelers per item. Eight papers specified that there was only one labeler per item, which matches the data in the first question about multiple labeler overlap. The majority of the papers did not provide enough information to answer the question.

Link to dataset available
Our final question was about whether the paper contained a link to the dataset containing the original human-labeled training dataset. Note that this question was only answered for papers involving some kind of original or novel human labeling, and papers that were exclusively re-using an existing open or public dataset were left blank to avoid double-counting. We did not follow such links or verify that such data was actually available. As table 19 shows, the overwhelming majority of papers did not include such a link, with 5 papers (11.11%) using original human-labeled training datasets linking to such data. Given the time, labor, expertise, and funding in creating original human labeled datasets, authors may be hesitant to release such data until they feel they have published as many papers as they can, especially junior scholars. Data sharing also requires specific expertise in data formats, documentation, and platforms, which may not be equally distributed across academic disciplines.

Paper information scores
After finalizing the labels, we quantified the information each paper provided about training data, based on how many questions we could answer for each paper. We developed a total and normalized information score, as different studies demanded different levels of information. For example, our questions about whether inter-rater reliability metrics and reconciliation methods were reported are only applicable for papers involving multiple labelers per item. However, all other questions are relevant for any project involving original human labeling. As such, papers involving original human labeling without multiple labelers per item had a maximum of 11 points, while those with multiple labelers per item had a maximum of 13 points. The normalized score is the total score divided by the maximum score possible. Figure 3 shows histograms for total and normalized information scores, which show that scores varied substantially. As Geiger et al. (2020) also found, this roughly suggests two overlapping distributions and thus populations of publications: one centered around total scores of 3-5 and normalized scores of 0.3 and another centered around total scores of 9 and normalized scores of 0.7. The normalized information score ranged from 0 to 1, with 1 paper having a normalized score of 0 and 3 papers with a full score of 1. The total information score ranged from 0 to 11, with no paper receiving a full score of 13, which would have required a study involving multiple labeler overlap that gave answers to all questions, including IRR metrics and reconciliation method. Overall, the mean total score was 5.4, with a median of 5 and a standard deviation of 3.2. The mean normalized information score was 0.472, with a median of 0.455 and a standard deviation of 0.268. This is quite similar to the findings by Geiger et al. (2020) for their normalized scores, which had a mean of 0.441, a median of 0.429, and a standard deviation of 0.261.
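The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis script from our repository: the function name `information_scores` and the list-of-booleans input format are assumptions made for this example. Only the maximum scores (11 and 13) and the division by the maximum come from the text.

```python
def information_scores(answered, multiple_labelers):
    """Return (total, normalized) information scores for one paper.

    answered: list of booleans, one per applicable question, True if the
        paper gave enough information to answer it (hypothetical format).
    multiple_labelers: True if the paper reported multiple labelers per item,
        which makes the two extra questions (IRR metric and reconciliation
        method) applicable.
    """
    max_score = 13 if multiple_labelers else 11
    if len(answered) != max_score:
        raise ValueError(f"expected {max_score} answers, got {len(answered)}")
    total = sum(answered)
    return total, total / max_score


# A hypothetical paper without labeler overlap that answers 5 of 11 questions.
total, normalized = information_scores([True] * 5 + [False] * 6, multiple_labelers=False)
print(total, round(normalized, 3))  # 5 0.455
```

A paper at the median total score of 5 with no labeler overlap therefore has a normalized score of 5/11, or about 0.455, which matches the median normalized score reported above.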

Information scores by corpus and application areas
We analyzed information scores by corpus for all papers using original human labeling. Figure 4 is a boxplot illustrating the distribution of normalized information scores by corpus. 5 There was a lower median score ( [...] papers had a lower mean (0.428) than social science & humanities papers (0.482) and life science & biomedical papers (0.519). We ran a 1-way analysis of variance (ANOVA) of normalized information scores by corpus. No statistically significant difference was found (p = 0.65, F = 0.43). Because we run 3 statistical tests in this paper, we apply a Bonferroni correction to address the multiple comparisons problem (Dunn, 1961) [...]

Figure 4 (boxplot of normalized information scores by corpus). The middle red line is the median, the black is the mean. The outer whiskers are the highest and lowest data points in a range of 1.5 times the IQR from the median. Grey diamonds are outliers beyond 1.5 times the IQR from the median.

[...] Figure 5. These were not stratified random samples, and we ended up with far more papers in some categories than others, with only 1 item for Physical and Other. The small sample size makes formal statistical tests difficult to interpret, and the assumption of homoscedasticity is not fulfilled due to the wide range in standard deviations between these groups (e.g. 0.13 for Geo/ecological to 0.39 for Demographic). We recommend against running formal statistical tests or drawing generalizations based on this analysis, but we report these scores to inform future work. Most groups' mean and median scores were between 0.4 and 0.6, with papers in the Linguistic category having lower medians (0.318). The most common categories (Linguistic, Medical, and Biological) also had much wider distributions and IQRs, but similar means. Activities & actions was the highest scoring category in terms of the mean, median, and upper and lower IQR.
In these studies, it is generally the case that the data are recordings of a person performing an activity, and each label is the activity they are asked to perform. This research design may lead authors to more concretely detail such methods.
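For readers less familiar with the one-way ANOVA used for the comparison by corpus, the sketch below computes the F statistic from first principles. It is an illustration under stated assumptions: the function name is hypothetical, the scores are invented and are not the study's data, and the code is not taken from our analysis notebooks.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists of numeric scores, one inner list per group.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical normalized information scores for three corpora.
# These numbers are invented for illustration and are NOT the study's data.
hypothetical_scores = [
    [0.27, 0.45, 0.55, 0.64, 0.73],
    [0.18, 0.36, 0.55, 0.73, 0.82],
    [0.09, 0.27, 0.45, 0.55, 0.64],
]
f_stat, df_between, df_within = one_way_anova_f(hypothetical_scores)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

The p-value is the upper tail of the F distribution with these degrees of freedom. One option for obtaining both values directly is SciPy's `scipy.stats.f_oneway`, which returns the statistic and the p-value. An F statistic well below 1, as in the result by corpus, indicates that the group means differ less than would be expected from the variation within groups.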

Normalized information scores by document type
For the 45 papers using original human labeling, 33 were journal articles and 12 were conference papers. We conducted an analysis of normalized information scores by document type, which showed larger differences. As figure 6 shows, articles have a higher mean (0.53 vs 0.31) and median (0.45 vs 0.27). We ran a 2-tailed Welch's unequal variances t-test (Welch, 1947) (variances differed by 0.024) and found a statistically significant difference (p = 0.0086, t = 2.86). We applied a Bonferroni correction to the p-value threshold to address the problem of multiple comparisons, and the p-value is still well below our adjusted target of 0.0166. This means that in our sample, we can assume that articles generally provide more information about training data than conference papers.

Figure 6. Boxplot of normalized information scores for papers using original labeling, by document type.
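The Welch's t-test and the Bonferroni-adjusted threshold can be sketched as follows. This is a minimal illustration, not our analysis code: `welch_t` and the sample scores are hypothetical, and the unadjusted threshold of 0.05 is an assumption consistent with the adjusted target of 0.0166 for 3 tests. The two p-values checked at the end are the ones reported in this paper for the comparisons by document type.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variances t-test."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, df


# Hypothetical scores, invented for illustration only (NOT the study's data).
articles = [0.36, 0.45, 0.55, 0.64, 0.82]
conference_papers = [0.18, 0.27, 0.27, 0.36, 0.45]
t_stat, df = welch_t(articles, conference_papers)
print(f"t = {t_stat:.2f}, df = {df:.1f}")

ALPHA = 0.05   # conventional unadjusted threshold (assumed)
N_TESTS = 3    # the paper reports running 3 statistical tests
bonferroni_threshold = ALPHA / N_TESTS  # about 0.0167

# p-values as reported in the paper for the two document-type comparisons.
reported_p_values = {
    "normalized information score": 0.0086,
    "label source reporting": 0.038,
}
for name, p in reported_p_values.items():
    verdict = "significant" if p < bonferroni_threshold else "not significant"
    print(f"{name}: p = {p} -> {verdict}")
```

Converting t and the degrees of freedom into a p-value requires the t distribution, which the Python standard library does not provide. One option is `scipy.stats.ttest_ind(a, b, equal_var=False)`, which performs Welch's test and returns the p-value directly.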

Label source information scores
Finally, because of the relatively small number of papers involving original human labeling (n=45), which leads to low statistical power for paper information scores, we examined all papers that presented an original ML classifier (n=141) based on whether they gave information sufficient to determine if their dataset's labels were derived from original human labeling. As discussed in section 4.3, we gave many papers the answer "no information (implicit yes)," which means we could reasonably assume labels were made by humans, but the paper never explicitly said humans were involved. Papers with answers "Yes for all items," "Yes for some items," and "No / machine-labeled" were scored 1. Papers with answers "No information (implicit yes)" and "Unsure (but assumed yes)" were scored 0. N/A papers that did not present an original classifier were excluded. Figure 7 shows the label source reporting rates by corpus, which shows strikingly similar rates. Social Science & Humanities papers had a rate of 72.7%, compared to rates of 71.1% for the other two corpora. Figure 8 shows the label source reporting rates by application area, which shows a much wider range. Activities & actions also has the highest rate at 100% (likely for the same reasons hypothesized earlier), with the lowest rate being Geo/ecological at 46.1%. We also note the differences in these results and the overall paper information scores, which were inversely-ranked for the larger categories of Linguistic, Medical, and Biological. While Linguistic papers had lower median information scores, they had far higher rates of label source reporting (79.2%), compared to Medical (69.8%) and especially Biological (58.8%) papers. Figure 9 shows label source reporting rates by journal article versus conference paper, which shows a much higher rate for articles. We ran a 2-tailed Welch's unequal variances t-test (variances differed by 0.082) and a statistically significant difference was not found (p = 0.038, t = 2.35). We must apply a Bonferroni correction to the p-value threshold to address the problem of multiple comparisons, and the p-value is above our adjusted target of 0.0166.
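The label source scoring rule described above can be sketched as follows. The answer strings are the ones listed in the text, but the function names and the example list of answers are hypothetical and serve only to show how a reporting rate is derived.

```python
# Answers where the paper explicitly stated how labels were produced.
REPORTED = {"Yes for all items", "Yes for some items", "No / machine-labeled"}
# Answers where we had to infer that humans were involved.
NOT_REPORTED = {"No information (implicit yes)", "Unsure (but assumed yes)"}


def label_source_score(answer):
    """Return 1 if the label source was explicit, 0 if inferred, None if excluded."""
    if answer in REPORTED:
        return 1
    if answer in NOT_REPORTED:
        return 0
    return None  # e.g. N/A papers that did not present an original classifier


def reporting_rate(answers):
    """Share of scored papers that explicitly reported their label source."""
    scores = [s for s in map(label_source_score, answers) if s is not None]
    if not scores:
        raise ValueError("no scorable papers")
    return sum(scores) / len(scores)


# Hypothetical answers for five papers (invented for illustration).
example_answers = [
    "Yes for all items",
    "No / machine-labeled",
    "No information (implicit yes)",
    "N/A",
    "Yes for some items",
]
print(reporting_rate(example_answers))  # 0.75
```

In this example the N/A paper is dropped, leaving three explicit answers out of four scored papers. Note that a machine-labeled paper counts as having reported its label source, since the score measures explicitness rather than the use of human labelers.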

Conclusion to information score results
In conclusion, our quantitative metrics show widely varying ranges and distributions of information scores, which supports the claim that there is substantial variation in the practices around human labeling, training data curation, and research documentation. The ranges of the boxplots of normalized information scores are substantial, both for IQRs (25th & 75th percentile) and the whiskers at 1.5 * IQRs. Ranges are larger when sampling by corpus, but still substantial for the application areas with more papers (e.g. Medical, Biological, Linguistic).
We specifically call for more investigation into applied ML geo/ecological research, which often classifies land use from aerial photos or photos of geological samples. These had the lowest rates of label source specification and the lowest mean normalized information scores (excluding the categories that only had 1 paper). However, from our experience, some papers with lower scores did give excellent levels of detail about how they were using an existing land use classification schema they cited (e.g. the widely-used USGS guide by Anderson et al. (1976)), but did not give any details about who applied that schema to the aerial photos. We can also hypothesize that in fields with widely-established and shared methodological standards, researchers could have far higher rates of adherence to methodological best practices around data labeling, but have lower rates of reporting that they actually followed those practices in papers.
Finally, we draw attention to the different rates when we grouped by corpus versus application area. In our sampling, the corpus was the Scopus-provided metadata field, which is determined at the publication level when a journal is added to Scopus. 6 At this level, we saw fewer differences in quantitative scores. In contrast, our application area field is determined for each paper based on the content, independent of the journal or venue in which it was published. Scores varied far more when broken out by application area, which is likely due in part to noise in the smaller samples. However, this may also indicate that methodological reporting rates vary even more within subfields or types of research objects. For future work, we recommend that researchers pay specific attention to differences between fields or objects of study, rather than solely group papers in the high-level way we did with our three corpora.

6 https://web.archive.org/web/20210531200329/https://service.elsevier.com/app/answers/detail/a_id/14882/

Figure 9. Label source reporting rates for papers presenting an original classifier, by document type (Conference Paper versus Article). Axis: percent of papers explicitly specifying label source (e.g. human or machine-labeled).

Findings
First, our study shows that contemporary applications of supervised machine learning across disciplines often rely on training datasets in ways that either reuse existing human-labeled datasets or label items with some kind of automated process. Of the papers in our dataset that presented an original ML classifier, only 26.7% produced a new human-labeled dataset as part of their study, a rate that did not substantially vary among our three corpora from the biomedical & life sciences, the physical & environmental sciences, and the humanities & social sciences. Second, of the applied ML publications that did produce a new human-labeled training dataset, there was significant divergence in reporting methodological details and following best practices in human labeling. A small number of publications received top information scores, but approximately two-thirds of publications involving original human labeling did not provide enough information for us to answer more than half of the subsequent questions we asked about the labeling process. This cross-disciplinary trend is cause for concern, given that high-quality training data is essential to the validity of machine learning classifiers and human judgment is notoriously difficult to standardize. When comparing across our three broad corpora of social science & humanities, biomedical & life sciences, and physical & environmental sciences papers, we only see marginal differences in the level of information papers provide. We do see more robust evidence that journal articles have higher rates of reporting information about training data than conference papers, which may relate to conference papers being shorter and only involving a single cycle of peer review. 7

Implications

The black-boxing of training data
Machine learning is increasingly used across disciplines and application domains, but supervised ML classifiers are only as good as the data used to train them.
Based on our findings, we argue for more attention to be placed on the specific details of how that training data is labeled. There is a recent wave of work that interrogates ML models once they are trained, as well as considerations about "automation bias" (Skitka et al., 1999), in which people often treat trained models as a "black box," with their outputs unquestioned and taken as given. These concerns must also extend to the labeling and curation of training datasets, some of which become widely re-used without being examined.
For example, Crawford and Paglen (2019) have called attention to problematic racial labels of images in the popular ImageNet training dataset, which has been a standard benchmark dataset in image recognition for over a decade. Birhane and Prabhu (2021) found thousands of images in the 80 Million Tiny Images dataset that were labeled with offensive racial and gender-based slurs. The careful curation of datasets has long been a central tenet in the institutions of science, although standards and practices can change dramatically over time and across contexts. Historians of science like Bowker (2005) and Gitelman (2013) remind us that data is never "raw," as data is always produced and used within a messy assemblage of partially-overlapping human institutions, each of which has its own practices, values, and assumptions. To this end, we call for applied ML researchers and practitioners who are re-using human-labeled datasets to exercise as much caution and care around the decisions to re-use a labeled dataset as they would if they were labeling the data themselves.
Finally, we have not asked any questions about how papers discuss data cleaning, although we could have asked another dozen questions on that topic alone. For future work, we encourage researchers to study what applied ML papers report about how the often-backgrounded work of data cleaning and preprocessing is performed, managed, and documented. We also see much future work in studying to what extent applied ML papers report efforts at de-biasing datasets and models.

Institutional change around data documentation
We call on the institutions of science (publications, funders, disciplinary societies, and educators) to play a major role in working out solutions to these issues of data quality and research documentation. We see this work as part of the open science and reproducibility movement, specifically the movement for open access to research datasets, materials, protocols, and analysis code. However, even advocates of this movement have long discussed how individual researchers do not have incentives to be first-movers in being more open than usual about the messiness in all research, because it leaves their work more open to rebuttal (Smaldino, 2016; Ali-Khan et al., 2017; Zimring, 2019). In our own experience, we have certainly felt the temptation to not report certain details that would lead others to have less confidence in our study, such as our inter-rater reliability metrics.
In looking towards solutions, we see a parallel to issues in open access to publications, which often requires individual researchers to choose if they want to pay for open access out of their own funding. While some first-movers paid for this out of their own budgets, open access is currently being far more effectively tackled at the institutional level in ways that will not require individuals (and especially first-movers) to pay the costs. So too do we see institutional solutions to the issue of methodological detail, where a common floor could be established that is equally applicable to all researchers. We also see resonance with the various proposed efforts at standardizing documentation about machine learning models and datasets (Mitchell et al., 2019; Bender and Friedman, 2018; Holland et al., 2018; Barclay et al., 2019; Beretta et al., 2018; Hind et al., 2018; Raji and Yang, 2019) and urge that human labeling details be included in such efforts.
On the publication process, we note that research publications are limited by length restrictions, which can leave little space for details. We can hypothesize that having a dedicated and visible space for methodology and dataset documentation would make these concerns more central for authors, reviewers, editors, and readers, although we can only speculate as to the best way for this to be implemented. For example, Nature has far shorter word limits for a main research article (2,000 to 2,500 words), which means methodological and dataset documentation is often fully detailed in appendices, which can be of any length. Does this approach more easily lead to readers and/or reviewers ignoring such details and focusing more on results? However, Nature also requires that authors fill out a peer-reviewed checklist form that asks general and domain-specific questions about statistical details (e.g. "a description of all covariates tested") and about the dataset (e.g. for behavioral science, "State the research sample ... provide relevant demographic information ... and indicate whether the sample is representative"). 8 Do these kinds of mandatory structured disclosure forms make these concerns more central to authors and reviewers, even if they are not as accessible to readers?
We also note that peer reviewers and editors play a major role in deciding what details are considered extraneous. First, we urge reviewers to make space for what some may see as "boring" methodological details. More importantly, we call on editorial boards to openly signal in author and reviewer guidelines that they invite or even require extended discussion of methodological details. To this end, one recent trend is the growth of multi-stakeholder groups that have collectively released formal guidelines or best practices statements on research reporting, such as the CONSORT guidelines for reporting randomized clinical trials (Schulz et al., 2010), the COREQ guidelines for reporting qualitative research (Tong et al., 2007), or the PRISMA guidelines on reporting meta-analyses and systematic reviews (Moher et al., 2009).
For example, PRISMA guidelines on reporting meta-analyses and systematic reviews have been mandated in the author guidelines of many journals (including The Lancet, 9 PLoS ONE, 10 and Systematic Reviews 11), which require authors to fill out the 27-item PRISMA checklist. 12 One interesting trend with such multi-stakeholder best practices statements in medicine is the proliferation of subdomain-specific "extensions" that further specify methodological reporting standards. For example, the EQUATOR network tracks 32 extensions to the CONSORT guidelines, 13 including guidelines for reporting randomized clinical trials in pain management (Gewandter et al., 2019), orthodontics (Pandis et al., 2015), and psycho-social interventions (Montgomery et al., 2018).
However, there have been disagreements over the impact and efficacy of these more structured approaches. Page and Moher's 14 meta-analysis of 57 papers studying uptake of the PRISMA guidelines (Page and Moher, 2017) found that while more papers have reported the details in the PRISMA guidelines since their release in 2009, reporting rates for some details remain low even among papers claiming to adhere to the guidelines. For example, for 9 of the PRISMA items, fewer than 67% of papers actually reported the information in question. Fleming et al. (2014) found that following the widespread uptake of the PRISMA guidelines by certain publications, more meta-analysis articles reported methodological details, but disproportionately those in the PRISMA guidelines. The authors of that study raise concerns that PRISMA has overdetermined the peer review process: authors who are fully compliant with PRISMA are no longer reporting other methodological details that Fleming et al. claim are also relevant in such work and that appeared in other competing meta-analysis guidelines that ultimately lost to PRISMA.

6.2.3. Are there universal best practices for the labeling of training data?
The efforts around methodological standards in medicine raise an important question about the wisdom of seeking a single one-size-fits-all set of best practices for any application of supervised ML. However, contemporary efforts around "fairness" or "transparency" in machine learning often work towards more universal or domain-independent approaches, which are applied to a wide range of application areas (e.g. finance, social services, policing, hiring, medicine). Yet in our work examining publications from quite different academic fields, we found ourselves needing to pay close attention to the various kinds of specialized expertise that are required to label a training dataset for a particular purpose. As Bowker and Star (1999) and Goodwin (1994) discuss, all classification systems rely on a shared cultural context, which can be exceedingly difficult to formally specify and often falls apart at the edges. It can be difficult to know beforehand what level of shared cultural context and expertise will be involved.

8 https://www.nature.com/documents/nr-reporting-summary-flat.pdf
9 https://els-jbs-prod-cdn.jbs.elsevierhealth.com/pb/assets/raw/Lancet/authors/tl-info-for-authors.pdf
10 https://journals.plos.org/plosone/s/submission-guidelines#loc-systematic-reviews-and-meta-analyses
11 https://systematicreviewsjournal.biomedcentral.com/submission-guidelines/preparing-your-manuscript/research
12 http://prisma-statement.org/documents/PRISMA%202009%20checklist.pdf
13 https://www.equator-network.org/?post_type=eq_guidelines&eq_guidelines_clinical_specialty=0&s=+CONSORT+extension
14 Moher is the lead author of the PRISMA statement.
Some of the papers we analyzed described in great detail how the people who labeled their dataset were chosen for their expertise, from seasoned medical practitioners diagnosing diseases to youth familiar with social media slang in multiple languages. That said, not all labeling tasks require years of specialized expertise, such as more straightforward tasks we saw, like distinguishing positive versus negative business reviews or identifying different hand gestures. Even projects in the same domain can require different levels of expertise, such as a dataset of animal photos labeled just for the presence of cats and dogs, versus labeling the same photos for the specific breed of cats and dogs. Furthermore, we found that some labeling tasks are well-suited to semiautomated labeling where labelers are assisted with rule-based approaches, while others are not. Finally, even the more seemingly-straightforward classification tasks can still have substantial room for ambiguity and error for the inevitable edge cases, which require training and verification processes to ensure a standardized dataset.
The labeling protocol and schema we developed and used in this paper, which is based on and extends prior work (Geiger et al., 2020), is an effort at creating a cross-disciplinary standard for any given research project that uses human-labeled training data. While we believe that any peer reviewer or reader can ask these same questions of any ML application paper, they are only a starting point. Issues of validity, consistency, reliability, reproducibility, and accountability require further investigation. The kind of domain-independent criteria we used should be seen as necessary but not sufficient criteria for having confidence in a labeled dataset. We do not advocate for a single, universal, one-size-fits-all solution, but instead seek to spur conversations within and across disciplines about better approaches to bring the work of data labeling into the foreground. We see a role for the classic principle of reproducibility, but for data labeling: does the paper provide enough detail so that another researcher could hypothetically recruit a similar team of labelers, give them the same instructions and training, reconcile disagreements similarly, and have them produce a similarly-labeled dataset?
Data publications, which are stand-alone peer-reviewed publications that do not answer a research question but instead spend the entire paper describing the creation of a dataset in rich detail (Costello, 2009; Smith, 2009; Chavan and Penev, 2011; Candela et al., 2015), could also play a major role in this issue. In seeking to bring the work of data labeling from the background to the foreground, our work is also aligned with scholars who have focused on the often under-compensated labor of crowdworkers and have called for researchers to detail how much they pay for data labeling (Silberman et al., 2018).

Limitations
To conclude, we reflect that our study also has the same kinds of limitations that many human labeling projects have. For example, given the concerns we raise about domain-specific expertise, our team may have missed or misinterpreted crucial details when examining papers. The second issue is around the reliability and reproducibility of our team's labeling process. In conducting this study, we have become quite familiar with the difficulties of getting a medium-sized team to build a consensus around reducing complex objects into quantifiable data. We specifically chose to have a more detailed and time-intensive process in which disagreements were discussed, which traded off with the total number of items we were able to label. We believe this trade-off was the right decision, given our focus on methodological rigor, but it does mean our samples are smaller than we would like. The lower sample size means that we have less confidence in the statistical generalizability of our sample to the population of all applied ML publications. However, we see a wide range of future work that can be done to extend these efforts, such as with teams of domain-specific experts that examine applied ML fields in their area of expertise.
Finally, we only have access to what each publication reported about the work they did, and not the research project itself, which means our unit of analysis is methodological reporting. For example, researchers could have far higher rates of following methodological best practices around data labeling, but have lower rates of reporting that they actually followed those practices in papers.
We could even hypothesize an inverse relationship between a field's overall adherence to methodological best practices and researchers' rates of reporting adherence to those practices, if such practices become so routine and mundane that they are left implicit in publications. For these reasons, we strongly advise against interpreting our quantitative scores as an unproblematic proxy for methodological rigor, especially for the scores by discipline and area. However, given our interest in how labeling practices impact the validity of ML models and classifiers, future work could extend this work through other methods, such as surveys and ethnographic studies of ML researchers.
Flag complicated cases for discussion, use the last notes column. Some papers might have weird cases that require us to redefine our instructions, which is also OK. Don't spend too much time agonizing over small decisions.
"Coding" can mean manual annotation, this is a legacy from early 20th century linguistics.
If they have a link or citation to more info about their annotation, follow that. If it is an existing dataset from another project, that is out of scope. Don't try to search deeply for it if it isn't referenced in the paper.
If the paper is over 50 pages, skip it and flag for discussion.
If two rounds/stages of human annotation are involved, use the first one.
If there are multiple independent classifiers, some using human labels and others using machine labels, focus on the human-labeled classifier and ignore the machine-labeled one.

B.2. Questions
B.2.1. Original ML classification task: Is the paper presenting its own original classifier that is trying to predict something? "Original" means a new classifier they made based on new or old data, not anything about the novelty or innovation in the problem area.
• Yes
• No
• Unsure / no information

Classification involves predicting cases on a defined set of categories. Prediction is required, but not enough. Linear regressions might be included if the regression is used to make a classification, but making predictions for a linear variable is not. Predicting income or age brackets is classification, predicting raw income or age is not.
Example: if the paper is vague about if they actually built a classifier, choose unsure (link removed)
Example: any prediction on a linear/scalar value (not binned categories) is not classification (link removed)
Example: recommender systems are typically not classification (link removed)
Example: analyzing statistics about the kinds of words people use on social media is not a classification task
Example: predicting location is a classification task if it is from a set of locations (from work, school, home, or other) but not if it is an infinite/undefined number of locations. (link removed)
Example: This paper (link removed) was framed as not an original classification task (more algorithm performance), but they did create an original classifier. This can also be an "unsure", which is 100% OK to answer.
Example: Literature review papers that include classification papers aren't in this, if they didn't actually build a classifier.
Example: if there is a supervised classification task that is part of a broader process, this counts, focus on that.
If no, skip the following questions.
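
As a minimal sketch of the bracket-versus-raw-value distinction drawn under this question, the snippet below turns a raw age (a scalar target, which would not count as classification) into one of a defined set of brackets (which would). The bracket edges, the bracket names, and the function `age_bracket` are hypothetical and chosen only for illustration; they do not come from any paper in our sample.

```python
from bisect import bisect_right

# Hypothetical bracket edges (in years) and names, for illustration only.
AGE_EDGES = [18, 35, 50, 65]
AGE_BRACKETS = ["under 18", "18-34", "35-49", "50-64", "65+"]


def age_bracket(age: float) -> str:
    """Map a raw age onto one of a fixed, defined set of categories."""
    # bisect_right counts how many edges are <= age, which is the bracket index.
    return AGE_BRACKETS[bisect_right(AGE_EDGES, age)]


if __name__ == "__main__":
    for raw_age in (17, 18, 42, 70):
        print(raw_age, "->", age_bracket(raw_age))
```

A model trained to predict the output of `age_bracket` would be answering a classification question in the sense used here, whereas a model trained to predict the raw age would not.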

B.2.2. Classification outcome:
What is the general type of problem or outcome that the classifier is trying to predict? This will be the label, typically. What is the end result, not how did they get there.
If multiple apply, put both and separate by comma.

Is the classifier at least in part trained on labeled data that involves human(s) who make individual judgments for specific items? This requires a human to make a judgement about individual cases.
• Yes for all - This is the typical case, where every item in the dataset used to train the classifier had a human make a judgment about the item
• Yes for some, which applies if *either*:
  - Human annotation was used to evaluate the classifier, but not train it; or,
  - There was some process of humans making judgments about some items, but then some kind of automated / mechanical way of scaling this to all items.
• No / machine-labeled - This includes fully-automated / machine-labeled ways of extracting labels, where a human is not involved in making each individual judgment
• Implicit yes - We know based on the subject matter that it had to be human labeled (e.g. patient medical data: 10.1109/CITSM.2017.8089245)
• No information - When we know for sure that there is no info in the paper and the context doesn't necessarily imply human labels (e.g. https://doi.org/10.1016/j.jtbi.2016.05.011)
• Unsure - This is when we are so confused by the paper that we don't even know if we can or can't answer it

If a human is told to do something that is recorded, where the recordings are the data and what they were told to do is the label, this counts as labels from human annotation. (e.g. (link removed) and (link removed))
This includes re-using existing data from human judgements, if it was for the same purpose as the classifier. This does not include clever re-using of metadata or human labeled data for a new purpose.
Setting a threshold for quantitative data is not human-labeled.
Do a quick CTRL-F for "manual", "annot", and "label" if you don't see anything, just to be sure.
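
The quick search suggested above can also be run programmatically once a paper's text has been extracted. The following is only a rough sketch under that assumption: the helper `find_stem_hits` is hypothetical and was not part of our labeling process, and a hit still requires a human to read the surrounding passage.

```python
import re

# Stems from the instruction above; "annot" also matches "annotated", "annotators", etc.
STEMS = ("manual", "annot", "label")


def find_stem_hits(text: str, stems=STEMS) -> dict:
    """Return, for each stem, the character offsets where it occurs (case-insensitive)."""
    hits = {}
    for stem in stems:
        pattern = re.compile(re.escape(stem), re.IGNORECASE)
        hits[stem] = [match.start() for match in pattern.finditer(text)]
    return hits


if __name__ == "__main__":
    sample = "Tweets were Manually annotated by three labelers."
    for stem, offsets in find_stem_hits(sample).items():
        print(stem, len(offsets), "hit(s)")
```

An empty result for all three stems is a reason to double-check, not proof that no human labeling took place, since papers may use other terms such as "coded" or "rated".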
Example (link removed): this is yes for some, but very implicit. Unsupervised clustering used to re-label.
Example: In medicine, if they were classifying for high blood pressure, and they use existing patient records with a cutoff of some number, this is not human annotation, it is "no / machine labeled". If they extract a physician's judgment from medical records, it is.
Example: labels were length of stay, not human annotation: (link removed)
BUT: psych diagnostic tests that require judgement in scoring beyond skill are human judgement, although this is borderline.
Example: If the paper is using an external dataset that we know implicitly would have to use human annotation, but they don't say anything, put "Implicit yes." (link removed)
Example: paper on political stances was labels from human annotation, just not original. They took the labels from elsewhere and filled in the gaps (more on that in next Q).
Example: Buying followers and seeing who follows (link removed) is not human annotation.
Example: Generating (smart) simulated datasets from metadata is not human annotation.
Example: (link removed) not annotation when looking up political affiliation of politicians, even though it is manual work. No judgement is involved. "No."
Example: (link removed) identified hashtags that they believe universally correspond to certain political stances. This would be a kind of "self-annotation" by the tweet's author and therefore "yes for all"
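
To make the blood pressure example from this list concrete, the sketch below contrasts a label derived from a numeric cutoff with a label taken from a recorded physician judgment. Everything in it is an assumption made for illustration: the cutoff value, the field names `systolic_mmhg` and `physician_diagnosis`, and the helper functions are hypothetical, and the cutoff is not a clinical recommendation.

```python
# Hypothetical cutoff and field names, chosen only for illustration.
SYSTOLIC_CUTOFF_MMHG = 140


def threshold_label(systolic_mmhg: float) -> str:
    """Derive a label mechanically from a measurement; no human judges the case."""
    return "high" if systolic_mmhg >= SYSTOLIC_CUTOFF_MMHG else "not high"


def label_origin(record: dict) -> str:
    """Say how a record's label would be categorized under the rules above."""
    if record.get("physician_diagnosis") is not None:
        # A physician judged this individual case, so it counts as human annotation.
        return "human annotation"
    # Otherwise the label can only come from the cutoff rule.
    return "no / machine-labeled"


if __name__ == "__main__":
    records = [
        {"systolic_mmhg": 150},
        {"systolic_mmhg": 150, "physician_diagnosis": "hypertension"},
    ]
    for record in records:
        print(threshold_label(record["systolic_mmhg"]), "|", label_origin(record))
```

Both records receive the same threshold label, but only the second involves a human making a judgment about the individual case, which is the distinction this question is meant to capture.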