Abstract
Supervised machine learning, in which models are automatically derived from labeled training data, is only as good as the quality of that data. This study builds on prior work that investigated to what extent “best practices” around labeling training data were followed in applied ML publications within a single domain (social media platforms). In this paper, we expand by studying publications that apply supervised ML in a far broader spectrum of disciplines, focusing on human-labeled data. We report to what extent a random sample of ML application papers across disciplines give specific details about whether best practices were followed, while acknowledging that a greater range of application fields necessarily produces greater diversity of labeling and annotation methods. Because much of machine learning research and education only focuses on what is done once a “ground truth” or “gold standard” of training data is available, it is especially relevant to discuss issues around the equally important aspect of whether such data is reliable in the first place. This determination becomes increasingly complex when applied to a variety of specialized fields, as labeling can range from a task requiring little-to-no background knowledge to one that must be performed by someone with career expertise.
1. INTRODUCTION
Supervised machine learning (ML) is now widely used in many fields to produce models and classifiers from training data, which allows for automation of tasks such as diagnosing medical conditions (Shipp, Ross et al., 2002; Ye, Qin et al., 2003), identifying astronomical phenomena (Ball & Brunner, 2010; Fluke & Jacobs, 2020), classifying environmental zones (Lary, Alavi et al., 2016; Ma, Li et al., 2017), or distinguishing positive versus negative sentiment in documents (Prabowo & Thelwall, 2009; Ravi & Ravi, 2015; Thelwall, Buckley et al., 2010). Applying supervised ML requires labeled training data for a set of entities with known properties (called a “ground truth” or “gold standard”), which is used to create a classifier that will make predictions about new entities of the same type.
“Garbage in, garbage out” is a classic saying in computing about how problematic input data or instructions will produce problematic outputs (Babbage, 1864; Mellin, 1957), which is especially relevant in ML. Yet data quality is often less of a concern in ML research and education, with these issues often passed over in major textbooks (e.g., Friedman, Hastie, & Tibshirani, 2009; Goodfellow, Bengio, & Courville, 2016; James, Witten et al., 2013). Instead, the focus is typically on the domain-independent mathematical foundations of ML, with ML education and research often using clean, tidy, and prelabeled “toy” data sets. While this may be useful for theoretically oriented basic ML research, those applying ML in any given domain must also understand how low-quality or biased training data threatens the validity of the model (Buolamwini & Gebru, 2018; Dastin, 2018; Geiger, Yu et al., 2020; Obermeyer, Powers et al., 2019).
In this paper, we empirically investigate and discuss a wide range of issues and concerns around the production and use of training data in applied ML. Our team of seven labelers systematically examined published papers that applied supervised ML to a particular domain, sampling from three sets of academic fields: life and biomedical sciences; physical and environmental sciences; and social sciences and humanities. For each paper, we asked up to 15 questions about how the authors reported using supervised ML and how they reported obtaining the labeled training data used to produce the model or classifier. We particularly focus on human-labeled or human-annotated training data, in which one or more individuals make discrete assessments of items. Given that many issues and biases can emerge around human labeling, we examine whether papers reported following best practices in human labeling.
Our project is based on the methodology of structured content analysis, which seeks to systematically turn qualitative phenomena into categorical and quantitative data (Riff, Lacy, & Fico, 2013). We draw on and situate our study within the growing efforts to bridge the fields of qualitative and quantitative science studies (Leydesdorff, Ràfols, & Milojević, 2020; Bowker, 2020; Cambrosio, Cointet, & Abdo, 2020; Kang & Evans, 2020). Quantitative science studies often examines the outputs of science, such as analyzing bibliometrics and other already quantitative trace data to understand how scientists’ final products have been received within science and other institutions. In contrast, qualitative science studies often examines the research process “in action” (Latour, 1987) to investigate how science is produced, such as using more ethnographic or historical methods. This project is in between these two traditions: Our method involves systematically quantifying information from qualitative sources, rather than using already-quantitative trace data; we examined a broad set of publications from across domains, rather than more in-depth case studies; and we analyzed and quantified information about research practices, rather than how publications are cited.
As our research project was a human-labeling project studying other human-labeling projects, we took care in our own practices. Before the research project began, we detailed all questions and valid responses, developed instructions with examples, and had a discussion-based process of reconciling disagreements. Another key issue in data labeling is that of construct validity and operationalization (Jacobs & Wallach, 2019): Is the labeling process actually capturing the theoretical construct that the authors are claiming to capture? In our study, we only have access to the paper reporting about the study and not the actual study or data set itself. This means that our fundamental unit of analysis must be what the papers report, even though our broader intent is to understand what the study’s authors and labelers actually did. Many papers either do not discuss such details at all or lack sufficient detail to make a determination. For example, many papers did note that the study involved the creation of an original human-labeled data set, but did not specify who labeled it. For some of our items, one of the most common labels we gave was “no information.” This is a concerning issue, given how crucial such information is in understanding the validity of the training data set, and by extension, the validity of the classifier.
2. LITERATURE REVIEW AND MOTIVATION
2.1. The Problem with Low-Quality and Biased Training Data
Curating high-quality training data sets for ML involves skill, expertise, and care, especially when items are individually labeled by humans. There can be disastrous results if training data sets are taken as a gold standard when they should not be. Supervised ML models are typically evaluated exclusively using a held-out subset of the original training data set, making systematic flaws in a training data set difficult to identify or audit within the traditional paradigm of ML. These concerns are particularly pressing when ML is used for deeply subjective and politicized decisions, such as in finance, hiring, welfare, and criminal justice. Many ML training data sets have been found to be systematically biased along various axes, including race and gender, which impacts the accuracy of those ML models (e.g., Buolamwini & Gebru, 2018). In other cases, more subtle issues arise around labels, such as a paper claiming to have produced an ML classifier distinguishing criminals from noncriminals using only facial images, with allegedly overwhelmingly high accuracy (Wu & Zhang, 2016). As Bergstrom and West (2020) critique, their labels were problematically derived from the source of the photos: Criminals were taken from prison mug shots, while noncriminals were taken from professional social network profiles. Because people generally do not smile in mug shots but do smile in profile photos, Bergstrom and West argue that the original team effectively built a smile classifier, but claimed it was a criminality classifier.
In another domain, an exposé (Dastin, 2018) reported that Amazon built an internal ML system for hiring that was later scrapped after it was determined to have substantial gender biases. The training data set used was based on hiring managers’ past decisions, where resumes from those hired were given one label and those not hired were given another label. The classifier was thus trained to approximate years of past decisions, and given that Amazon has had significant gender gaps in their workforce (like many tech companies), this meant that such systematic biases were reinforced and rationalized through ML. This is the case even though gender was intentionally excluded as a feature in the model, as the classifier used other features that were a proxy for gender to more closely approximate the biases in the training data. Had the training data been a new data set labeled by a diverse team of trained HR professionals tasked with evaluating resumes with a focus on nondiscrimination, this might have produced a quite different classifier.
ML in the field of medicine is poised for explosive growth, although critics raise similar concerns about training data. Medical privacy risks arise for patients whose health care records may be used in formulating a training data set (Vayena, Blasimme, & Cohen, 2018). Furthermore, there is evidence of biases in health care applications of ML, and in some instances, the consequences of biases may directly impact patients’ survival. One study in the United States labeled patients’ medical records with their severity of illness, using a proxy variable that ostensibly required little human judgment: the cost of the patient’s health care. Yet when this data was used to train a classifier, it caused significant bias against African American patients, who historically have had differential access to medical care (Obermeyer et al., 2019). The medical field itself is encountering new questions surrounding human labeling and annotation. For example, one widely used application is the interpretation of medical imaging. The human who labels MRI images as cancerous or not-cancerous must have specific expertise compared to someone who labels product reviews as positive or negative. Meta-research in radiology has found that practicing radiologists have about a 3–5% error rate (Brady, 2016), which raises the question of whether radiology training data sets should be independently labeled by multiple experts to ensure data quality. Finally, as with many fields, the introduction of ML using pre-existing data from a particular environment and setting has the distinct potential to reproduce and perpetuate existing systemic biases, especially when that classifier is deployed to a different environment and setting (DeCamp & Lindvall, 2020).
2.2. “Garbage in, Garbage out” Version 1
This project is heavily based on a prior study (Geiger et al., 2020), which similarly had a team of labelers examine issues around training data in a random sample of published papers. That study examined a narrow subset of peer-reviewed and preprint papers in a specific field: applied ML papers trained on Twitter data. They looked for 13 pieces of information in each paper, which they argued were important to understanding the validity of the training data labeling process. This included if the data was human or machine labeled, who the labelers were, how many labelers rated each item, and rates of interrater reliability (if multiple labelers rated each item). The study found a wide divergence both in the level of information reported and in adherence to best practices in human labeling. For example, of papers reporting a new human-labeled training data set, about 75% gave some information about who the labelers were, 55% specified the number of labelers, 11% released the training data set itself, and 0% reported how much crowdworkers were paid for their work.
We expanded on Geiger et al.’s study, drawing heavily from their published questions and protocols. We followed the same general process of having labelers rate each item independently, then reconciling disagreements through a discussion led by the team leader. We made some small modifications and extensions to the questions, which were recommended by the original authors for future work or were better suited to the expanded scope. We added questions about the field/domain of the paper and about the reconciliation process when multiple labelers labeled each item. We also rewrote some of the labeling instructions and label categories and provided examples, often to clarify ambiguities.
2.3. Best Practices in Human Labeling of Training Data
Geiger et al. (2020) give a substantial review of existing work around human labeling of training data, including an extensive discussion of best practices in this work. They argue that much of the labeling work for ML is a form of structured content analysis, which is a methodology long used in the humanities and social sciences to turn qualitative or unstructured data into categorical or quantitative data. This involves teams of “coders” (also called “annotators,” “labelers,” or “reviewers”) who “code,” “annotate,” or “label” items individually. (Note that we use “label/labeler” in this paper, although we began with using “annotate/annotator,” which is still present in some of our data and protocols.) One textbook describes content analysis as a “systematic and replicable” (Riff et al., 2013, p. 19) method with established best practices, as Geiger et al. summarize:
A “coding scheme” is defined, which is a set of labels, annotations, or codes that items in the corpus may have. Schemes include formal definitions or procedures, and often include examples, particularly for borderline cases. Next, coders are trained with the coding scheme, which typically involves interactive feedback. Training sometimes results in changes to the coding scheme, in which the first round becomes a pilot test. Then, labelers independently review at least a portion of the same items throughout the entire process, with a calculation of “inter-rater reliability” (IRR) or “inter-annotator agreement” (IAA). Finally, there is a process of “reconciliation” for disagreements, which is sometimes by majority vote without discussion and other times discussion-based. (Geiger et al., 2020, p. 2–3)
Structured content analysis is a difficult task, requiring both domain-specific expertise about the phenomenon to be labeled and domain-independent expertise to manage teams of labelers. Historically, undergraduate students have often performed such work for academic researchers. With the rise of crowdwork platforms such as Amazon Mechanical Turk, crowdworkers are often used for data labeling tasks. New software platforms have been developed to support more microlevel labeling and annotation or labeling at scale, including in citizen science (Bontcheva, Cunningham et al., 2013; Chang, Amershi, & Kamar, 2017; Nakayama, Kubo et al., 2018; Pérez-Pérez, Glez-Peña et al., 2015). For example, the Zooniverse (Simpson, Page, & De Roure, 2014) provides a common platform for citizen science projects across domains, where volunteers label data under scientists’ direction.
2.4. Meta-Research and Methods Papers in Linguistics and NLP
We also draw inspiration from meta-research and standardization efforts in Linguistics and Natural Language Processing (NLP) (Bender & Friedman, 2018; McDonald, Schoenebeck, & Forte, 2019). These fields have developed extensive literatures on the standardization and reliability of linguistic labels, including best practices for corpus annotation (e.g., Doddington, Mitchell et al., 2004; Hovy & Lavid, 2010; Linguistic Data Consortium, 2008). In Geiger et al.’s (2020) study, the publisher with the highest information score was the Association for Computational Linguistics. There has been much work in linguistics and NLP around these issues, such as Sap et al.’s study of racial bias among labelers (Sap, Card et al., 2019). Blodgett et al. conducted a content analysis of how 146 NLP papers discuss “bias” and found that while this has become a prominent topic in NLP, papers’ discussions of motivations and methods around bias “are often vague, inconsistent, and lacking in normative reasoning” (Blodgett, Barocas et al., 2020, p. 5454). There is also related work in methods papers focused on identifying or preventing “low-effort” responses from crowdworkers (Mozetič, Grčar, & Smailović, 2016; Raykar & Yu, 2012; Soberón, Aroyo et al., 2013), raising issues around fair labor practices and compensation (Silberman, Tomlinson et al., 2018).
2.5. The Open Science, Reproducibility, and Research Integrity Movements
Two related movements in computationally supported knowledge production have surfaced issues around documentation. First, open science is focused on broader availability to the products of research and research infrastructure, including open access to publications, software tools, data sets, and analysis code (Fecher & Friesike, 2014). The related reproducibility movement calls for researchers to make protocols, data sets, and analysis code public, often focusing on what others need to replicate the original study (Kitzes, Turek, & Deniz, 2018; Wilson, Bryan et al., 2017). Such requirements have long been voluntary, with few incentives to be a first mover, but funding agencies and publications are increasingly establishing such requirements (Gil, David et al., 2016; Goodman, Pepe et al., 2014).
One notable effort is around formally specifying what each author of a paper actually did, which has long been standard in medical journals (Rennie, Flanagin, & Yank, 2000). Author role documentation has gained popularity with the more recent Contributor Roles Taxonomy Project (or CRediT) (Brand, Allen et al., 2015). CRediT declarations are increasingly required by journals, which has led to novel quantitative science studies research (Larivière, Pontille, & Sugimoto, 2020). We also draw inspiration from work about capturing information in ML data flows and supply chains (Gharibi, Walunj et al., 2019; Schelter, Böse et al., 2017; Singh, Cobbe, & Norval, 2019) and developing tools to support data cleaning (Krishnan, Franklin et al., 2016; Schelter, Lange et al., 2018). We note that this work has long been part of library and information science, particularly in Research Data Management (Borgman, 2012; Medeiros & Ball, 2017; Sallans & Donnelly, 2012; Schreier, Wilson, & Resnik, 2006). There is much more work to be done on quantitatively studying issues around research integrity (Silberman et al., 2018; Zuckerman, 2020), which institutionally has often been limited to more egregious and blatant cases of plagiarism and fabrication.
2.6. Fairness, Accountability, and Transparency in ML
Within the field of ML, there is a growing movement in the Fairness, Accountability, and Transparency (or FAccT) subfield, with many recent papers proposing training data documentation in the context of ML. Various approaches and metaphors have been taken in this area, including “datasheets for datasets” (Gebru, Morgenstern et al., 2018), “model cards” (Mitchell, Wu et al., 2019), “data statements” (Bender & Friedman, 2018), “nutrition labels” (Holland, Hosny et al., 2018), a “bill of materials” (Barclay, Preece et al., 2019), “data labels” (Beretta, Vetrò et al., 2018), and “supplier declarations of conformity” (Hind, Mehta et al., 2018). Many go far beyond the concerns we have raised around human-labeled training data, as some are also (or primarily) concerned with documenting other forms of training data, model performance and accuracy, bias, considerations of ethics and potential impacts, and more. Our work is strongly aligned with this movement, as we seek to include data labeling within these areas of concern. However, as we discuss in our conclusion, a single one-size-fits-all standard may be necessary but not sufficient to address concerns of fairness and bias.
We also call attention to those developing methods for “de-biasing” ML, which is a fast-moving and contentious research area (for surveys and comparative work, see Mehrabi, Morstatter et al., 2019 and Friedler, Scheidegger et al., 2019). Much of this work is in developing domain-independent fairness metrics for evaluating trained models (e.g., Hardt, Price, & Srebro, 2016; Zafar, Valera et al., 2017), which are used to modify trained models or predictions (e.g., Amini, Soleimany et al., 2019; Karimi Mahabadi, Belinkov, & Henderson, 2020). However, other work has approached these issues more as a problem of data set preprocessing (Calmon, Wei et al., 2017) or database repair (Salimi, Howe, & Suciu, 2020). Critics note that domain-independent approaches may fall into what Selbst, Boyd et al. (2019, p. 60) identify as “abstraction traps,” such as failing to account for the particularities of different kinds and qualities of discrimination in a given social context—a critique Hanna, Denton et al. (2020) make of fairness research that treats race as a single fixed attribute. We did not ask any questions about how papers discuss de-biasing or data cleaning due to the large number of questions we were already asking and the novelty of such approaches, but these concerns are deeply related.
3. DATA AND METHODS
3.1. Data: ML Papers Performing Classification Tasks
Our goal was to find a corpus of papers using supervised ML across disciplines and application domains, including papers producing an original labeled data set using human labeling. We used the Scopus bibliographic database (Baas, Schotten et al., 2020), which contains about 40,000 publications that a review board has verified for various qualities, including being peer reviewed, regularly published for at least 2 years, and governed by a named editorial board of experts. We searched for journal articles and conference proceedings from 2013 to 2018 where the title, abstract, or keywords included “machine learning” and either “classif*” or “supervi*” (case-insensitive). We ran three stratified samples across Scopus’s Subject Area classifications1: Physical Sciences (which includes engineering and earth/ecological sciences); Social Sciences & Humanities (a single category); and Life Sciences & Health Sciences (two categories, which we combined). Table 1 describes our sampling. More details about the corpora are in the appendix, which is available as supplementary materials and in our data repository (see Section 3.4).
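The paper’s exact query string is not reproduced above, so the following is a hedged reconstruction in Scopus Advanced Search syntax, kept as a plain Python string (the field codes TITLE-ABS-KEY, PUBYEAR, and DOCTYPE are Scopus’s; the exact phrasing and document-type restriction are our reading of the description above):

```python
# Hedged reconstruction (not the authors' exact query) of the Scopus search
# expression described in the text: "machine learning" plus classif*/supervi*
# in title, abstract, or keywords; journal articles and conference papers;
# published 2013-2018.
SCOPUS_QUERY = (
    'TITLE-ABS-KEY ( "machine learning" ) '
    "AND TITLE-ABS-KEY ( classif* OR supervi* ) "
    "AND PUBYEAR > 2012 AND PUBYEAR < 2019 "
    "AND DOCTYPE ( ar OR cp )"
)
```

The stratification into the three corpora was then done using Scopus’s Subject Area classifications, as described above.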
Table 1. Summary of sampling across all three corpora

| Corpus | Papers in corpus | # randomly sampled | % sampled |
| --- | --- | --- | --- |
| Social Sciences & Humanities | 5,346 | 70 | 1.30 |
| Life & Biomedical Sciences | 9,507 | 60 | 0.63 |
| Physical & Environmental Sciences | 11,030 | 70 | 0.63 |
| Total | 25,883 | 200 | 0.77 |
3.2. Labeling Team, Training, and Workflow
Our labeling team included one research scientist who led the project (RSG) and undergraduate research assistants, who worked 6–10 hours per week for course credit as part of a university-sponsored research experience program (DC, JI, ML, AS, JW, and RT). The project began with six students for one semester, five of whom continued on the project for the second semester. All students had some coursework in computer science and/or data science, with a range of prior experience in ML in both a classroom and applied setting. Students’ majors and minors included Electrical Engineering & Computer Science, Data Science, Statistics, Economics, Linguistics, and Biology. For the first four weeks, the team leader trained the students in both broader ML concepts and the specific questions to be answered for this project. The team first labeled and discussed a practice set of 40 papers sampled from across the three corpora, which were not included in the final data set. In these initial weeks, the team learned the coding schema and the reconciliation process, which were further refined.
Following this training, the labeling workflow was as follows: Each week, a set of papers (typically 10 to 15) was randomly sampled from one corpus. The students independently reviewed and labeled the same papers, each recording labels in a separate web-based spreadsheet. The team leader synthesized the labels and identified disagreements. The team then met in person or by videochat to discuss the week’s cases of disagreement. The team leader explained various issues in question and built a consensus about the proper label (as opposed to a purely majority vote). The team leader had the final say when a consensus could not be reached.
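As a rough illustration of the synthesis step (a sketch under assumed file and column names, not the project’s actual tooling), identifying disagreements amounts to flagging any paper/question pair where the labelers’ nonblank answers differ:

```python
# Sketch of the weekly synthesis step: combine each labeler's spreadsheet
# export and flag paper/question pairs with more than one distinct nonblank
# label. File names and column names here are illustrative only.
import pandas as pd

labeler_files = {"labeler_1": "week05_labeler1.csv",
                 "labeler_2": "week05_labeler2.csv",
                 "labeler_3": "week05_labeler3.csv"}

frames = []
for labeler, path in labeler_files.items():
    df = pd.read_csv(path)          # expected columns: paper_id, question, label
    df["labeler"] = labeler
    frames.append(df)
labels = pd.concat(frames, ignore_index=True)

# Blank and "unsure" responses should not count as disagreements.
labels["label"] = labels["label"].replace({"unsure": None, "": None})

distinct = (labels.dropna(subset=["label"])
                  .groupby(["paper_id", "question"])["label"]
                  .nunique())
to_discuss = distinct[distinct > 1].reset_index()[["paper_id", "question"]]
print(to_discuss)  # the agenda for the weekly reconciliation meeting
```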
All 200 papers were labeled by at least four labelers; one labeler covered 137 of the items and another covered 100. Following the first round of labeling and reconciliation, we conducted a second round of verification. Wherever there was any initial disagreement on labels in the first round, the paper was re-examined and discussed by at least two labelers and the team leader. The second round began multiple months after the first round, meaning that there was at least one month between when each paper was examined and re-examined. If there was still disagreement, the final decision was made by the team leader. The team leader did a final check to review every label for all 200 papers.
3.3. IRR and Labeled Data Quality
All human labeling projects that involve multiple labelers should evaluate the intersubjective reliability of the labeling process (Tinsley & Weiss, 1975). We present three IRR metrics. For all metrics, we recoded “unsure” and blank responses to both be blank (NaN), but treated “N/A” answers as a distinct judgment. First, we calculated mean total agreement, the proportion of items where all labelers initially gave the same label before reconciliation, not counting blank and unsure responses. As Table 2 shows, this is a more stringent metric: All nonblank/unsure responses must be the same for an item to score 1; otherwise the score is 0. Second, we present the mean percentage correct rate, the proportion of labelers whose initial label matched the final label arrived at after discussion and reconciliation, again not counting blank and unsure responses. As Table 2 shows, this is a more forgiving metric: If five out of six labelers give the same final correct label, the score is 0.83 for that item. For these two metrics, we calculated per-question scores by taking the mean of the per-item scores across all items.
Table 2. Example of IRR calculations for sample rows

| # | Labeler #1 | Labeler #2 | Labeler #3 | Labeler #4 | Labeler #5 | Labeler #6 | Final/correct label | Total agreement | Mean percentage correct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | yes | unsure | yes | yes | yes | [blank] | yes | 1 | 1 |
| 2 | yes | yes | yes | yes | no | yes | yes | 0 | 0.83 |
| 3 | no | n/a | yes | no | yes | [blank] | yes | 0 | 0.4 |
| 4 | yes | no | yes | [blank] | yes | unsure | no | 0 | 0.25 |
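To make these two metrics concrete, the following minimal sketch (our illustration, not the released analysis code; the function names are ours) reproduces the four example rows in Table 2, treating blank and “unsure” responses as missing and “n/a” as a real judgment:

```python
# Minimal sketch of the two custom IRR metrics described above, applied to the
# four example rows in Table 2. Function names are illustrative.
def total_agreement(labels):
    """1 if all nonblank/non-'unsure' labels are identical, else 0."""
    votes = [l for l in labels if l not in (None, "", "unsure")]
    return int(len(set(votes)) == 1)

def mean_percentage_correct(labels, final_label):
    """Proportion of nonblank/non-'unsure' labels matching the reconciled label."""
    votes = [l for l in labels if l not in (None, "", "unsure")]
    return sum(v == final_label for v in votes) / len(votes)

# (labels from six labelers, final/correct label) for rows 1-4 of Table 2
rows = [
    (["yes", "unsure", "yes", "yes", "yes", None], "yes"),
    (["yes", "yes", "yes", "yes", "no", "yes"], "yes"),
    (["no", "n/a", "yes", "no", "yes", None], "yes"),
    (["yes", "no", "yes", None, "yes", "unsure"], "no"),
]
for labels, final in rows:
    print(total_agreement(labels), round(mean_percentage_correct(labels, final), 2))
# prints: 1 1.0 / 0 0.83 / 0 0.4 / 0 0.25
```

The per-question scores in Table 3 are then the means of these per-item scores.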
We also present the widely used Krippendorff’s alpha (Krippendorff, 1970), although we strongly advise against relying on it here. Our data does not meet the statistical assumptions of either Fleiss’s kappa or Krippendorff’s alpha, which are popular because they support missing labels for three or more labelers and take into account the possibility that raters made decisions based on random chance. However, this requires assuming a uniform prior probability of such a random distribution, which generally only applies if each possible response by raters is equally likely. Scores can be dramatically lower when there is a highly skewed distribution of response categories (Quarfoot & Levine, 2016; Oleinik, Popova et al., 2014). Our data set has highly skewed distributions, especially for many of the more specialized questions, which leads to minuscule alpha scores for those questions (e.g., prescreening for crowdwork; reported IRR).
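As an illustration of this problem (our own toy example, not drawn from the sampled papers), the following hand-rolled Krippendorff’s alpha for nominal data, separate from the SimpleDorff library we used for the reported scores, shows how a heavily skewed label distribution deflates alpha even when raw agreement is high:

```python
# Hand-rolled Krippendorff's alpha for nominal labels with no missing values,
# used only to illustrate the effect of a skewed label distribution.
from collections import Counter
from itertools import permutations

def nominal_alpha(units):
    """units: list of per-item label lists (each with >= 2 labels, none missing)."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        for a, b in permutations(range(m), 2):   # ordered pairs within one item
            coincidences[(labels[a], labels[b])] += 1 / (m - 1)
    n_c = Counter()
    for (c, _k), v in coincidences.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_o = sum(v for (c, k), v in coincidences.items() if c != k)   # observed disagreement
    d_e = sum(n_c[c] * n_c[k] / (n - 1) for c in n_c for k in n_c if c != k)  # expected
    return 1 - d_o / d_e

# 20 items, 2 labelers each: 19 unanimous "no" labels and a single disagreement.
units = [["no", "no"]] * 19 + [["no", "yes"]]
agreement = sum(len(set(u)) == 1 for u in units) / len(units)
print(f"raw percent agreement = {agreement:.2f}")             # 0.95
print(f"Krippendorff's alpha  = {nominal_alpha(units):.2f}")  # 0.00
```

This mirrors questions in our data such as prescreening for crowdwork, where nearly every paper received the same label.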
Table 3 presents both our custom metrics and Krippendorff’s alpha for all questions. Mean total agreement rates ranged from 34.5% to 66%, with an average of 48.0% across all questions. Mean percentage correct rates ranged from 65.4% to 85.8%, with an average of 73.1% across all questions. Lower rates on some questions (especially for mean total agreement) were often due to a labeler making an incorrect assessment on an earlier question, which determines whether they answer subsequent questions or mark them as “N/A.”
Table 3. Interrater reliability metrics per question

| Question | Mean total agreement | Mean percentage correct | Krippendorff’s alpha |
| --- | --- | --- | --- |
| Original classification task | 66.0% | 84.8% | 0.670 |
| Classifier area/domain | 34.5% | 65.4% | 0.520 |
| Labels from human judgment | 37.5% | 68.2% | 0.517 |
| Human labeling for training data | 46.5% | 77.3% | 0.517 |
| Used original human labeling | 43.5% | 71.0% | 0.498 |
| Original human labeling source | 43.5% | 71.1% | 0.330 |
| Prescreening for crowdwork | 58.5% | 84.2% | 0.097 |
| Labeler compensation | 46.0% | 68.0% | 0.343 |
| Training for human labelers | 48.0% | 70.0% | 0.364 |
| Formal instructions | 47.5% | 66.8% | 0.337 |
| Multiple labeler overlap | 48.5% | 69.3% | 0.370 |
| Synthesis of labeler overlap | 53.0% | 83.4% | 0.146 |
| Reported interrater reliability | 55.5% | 85.8% | 0.121 |
| Total number of human labelers | 50.5% | 69.3% | 0.281 |
| Median number of labelers per item | 48.5% | 69.3% | 0.261 |
| Link to data set available | 41.0% | 66.1% | 0.322 |
| Average across all questions | 48.0% | 73.1% | 0.356 |
| Median across all questions | 48.0% | 70.0% | 0.343 |
In interpreting these metrics, we note that the standard approach of human labeling checked by IRR metrics treats individual humans as scientific instruments that turn complex phenomena into discrete structured data. If there is a high degree of IRR, then reconciliation can easily take place through a majority vote process involving no discussion, or if rates are quite high, then many researchers assume they can use just one of those human labelers per item in future work. These rates were not high enough for us to have confidence that we could have a purely quantitative/majority-vote reconciliation process, much less a process of only using one labeler per item. However, these rates are sufficient to show there is enough agreement to proceed to a discussion-based reconciliation process and a final check of all items by the team leader. As McDonald et al. (2019) discuss, standardized IRR metrics such as Krippendorff’s alpha are useful in highly structured labeling projects that do not have a discussion-based reconciliation process, as they only evaluate the agreement of independent initial labels. Such metrics would be more essential to the validity of our study if we were conducting a quantitative, majority-rule reconciliation process or if only a subset of items were reviewed by multiple labelers. We included mean percentage correct rates to partially account for the reconciliation and verification process.
Furthermore, our approach was largely focused on identifying the presence or absence of various kinds of information within long-form publications. This is a different kind of human judgment than is typically involved in common tasks using human labeling for ML (e.g., labeling a single social media post for positive/negative sentiment) or traditional social science and humanities content analysis (e.g., categorizing newspaper articles by topic). Our items were full research publications with many pages of detail, which followed many different field-specific conventions and genres. Our labelers were looking for up to 15 different kinds of information per paper, each of which could be found anywhere in the paper. This was reflected in our reconciliation process: Most of the time when labelers disagreed, it was because some had caught a piece of information in the paper that others had not seen. Once that information was brought to the group, it was most often the case that some labelers said that they had missed that information and changed their response. It was less common for our team to have disagreements arising from two labelers differently interpreting the same text, especially after the first few weeks. For such reasons, we are relatively confident that if, after our process, no individual member of our team has identified the presence of such information, then it is quite likely not present in the paper.
3.4. Software, Data Sets, and Research Materials
We used Google Sheets to enter labels. For computational analysis and scripting for corpus collection, data management, and data analysis, we used Python 3.7 (van Rossum, 1995), using the following libraries: Pandas dataframes (McKinney, 2010) for data parsing and transformation; SciPy (Jones, Oliphant et al., 2001) and NumPy (van der Walt, Colbert, & Varoquaux, 2011) for quantitative computations; Matplotlib (Hunter, 2007) and Seaborn (Waskom, Botvinnik et al., 2018) for visualization; and SimpleDorff (Perry, 2020) for IRR calculations. Analysis was conducted in Jupyter Notebooks (Kluyver, Ragan-Kelley et al., 2016) using the IPython (Pérez & Granger, 2007) kernel.
Data sets, analysis scripts, labeling instructions, and other supplementary information can be downloaded from GitHub2 and Zenodo3. Data sets include all labels from all labelers for the first round of independent labeling and the consolidated set of final labels and scores for all items. Paper URLs/DOIs have been anonymized with a unique salted hash. Analysis scripts are in Jupyter Notebooks and can be explored and modified in any modern web browser using the cloud-based MyBinder.org (Project Jupyter, Bussonnier et al., 2018)4.
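As a small illustration of the anonymization step (a sketch under assumptions; the actual salt, hash function, and truncation used for the released data are not specified here), a salted hash maps each DOI or URL to a stable pseudonymous identifier:

```python
# Minimal sketch of salted-hash anonymization of paper identifiers. The salt
# value, hash choice, and truncation length are placeholders.
import hashlib

SALT = "replace-with-a-long-random-secret"   # kept private, never published

def anonymize(identifier: str) -> str:
    """Return a stable pseudonymous ID for a DOI or URL without revealing it."""
    return hashlib.sha256((SALT + identifier).encode("utf-8")).hexdigest()[:16]

print(anonymize("https://doi.org/10.1234/example"))  # same input + salt -> same ID
```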
4. FINDINGS
Figure 1 shows a summary of results. For this figure, we recoded (or consolidated) some questions with many answers to reflect whether the paper reported an answer to that question. For example, for “original human labeling source,” any answer that specified a source is “yes,” while “no information” is “no.” This is also how we calculated paper information scores in Section 5. Figure 1 illustrates how we asked more detailed questions for papers based on answers to prior questions. For example, 103 papers used labels from human judgment, either “yes” or “unsure (assumed yes),” and the next three questions were answered for those 103 papers. The remaining 10 questions were answered for the 45 papers that could be assumed to use original human labeling, with two of those questions only answered for the six papers involving multiple labeler overlap.
Figure 1. Summary of results. Note that some questions have been recoded to show the presence or absence of information.
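As an illustration of this recoding (a hypothetical sketch; the column and label names are ours, not those of the released data set), any substantive answer is collapsed to “reported,” while “no information” and blank responses are not:

```python
# Hypothetical sketch of collapsing detailed answers into presence/absence of
# information, as used for Figure 1 and the information scores in Section 5.
# Column names and label strings are illustrative.
import pandas as pd

final_labels = pd.DataFrame({
    "paper_id": ["p01", "p02", "p03"],
    "original_human_labeling_source": [
        "paper's authors", "no information", "other with claim of expertise"],
})

NOT_REPORTED = ["no information", "unsure", ""]

final_labels["source_reported"] = ~final_labels[
    "original_human_labeling_source"].isin(NOT_REPORTED)
print(final_labels[["paper_id", "source_reported"]])
# p01 True, p02 False, p03 True
```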
4.1. Original ML Classification Task
The first question was whether the paper was conducting an original classification task using supervised ML. Our keyword-based process of generating the corpus included some papers that used ML keywords but were not actually presenting a new ML classifier. However, defining the boundaries of supervised ML and classification tasks is difficult, particularly for papers that are long, complex, and ambiguously worded. We defined ML broadly: any automated process that does not exclusively rely on explicit rules, in which the performance of a task increases with additional data (Mitchell, 1997, p. 2). We decided that this can include simple linear regressions, although there is much debate about if and when simple linear regressions are a form of ML. However, as we were also looking for classification tasks, linear regressions were only included if they were used to make a prediction in a set of defined classes. We defined an “original” classifier to mean a classifier that the authors made based on new or old data, which excludes the exclusive use of pretrained classifiers or models. We found that some papers claimed to be using ML, but when we examined the details, these did not fall under our definition.
As Table 4 shows, the majority of papers in our data set were involved in an original classification task. We placed 10 papers in the “N/A” category—meaning they did not give enough detail for us to determine, were not in English, were not able to be accessed, or were complex boundary cases.
Table 4. Is the paper presenting an original/newly created ML classifier?

|  | Count | Proportion |
| --- | --- | --- |
| Yes | 141 | 70.50% |
| No | 49 | 24.50% |
| N/A (paper ineligible or inaccessible) | 10 | 5.00% |
| Total | 200 | 100.00% |
4.2. Classifier Area/Domain
The next question categorized the paper into one of eight fields/areas of study (or “other”). We had sampled three broad disciplinary categories (Social Sciences & Humanities, Biomedical & Life Sciences, and Physical & Environmental Sciences), which are determined by Scopus on a per-journal/conference level. We made these area/domain determinations based on the paper’s content, without consulting the Scopus-provided category. As Table 5 shows, our data set contained a wide variety of ML application fields. Medical papers made up the plurality, followed by Linguistic papers; the Biological, Physical, Soft/hardware, and Geo/ecological categories were of similar size.
Table 5. Classifier area/domain

|  | Count | Proportion |
| --- | --- | --- |
| Medical | 43 | 30.50% |
| Linguistic | 24 | 17.02% |
| Biological (nonmedical) | 17 | 12.06% |
| Physical | 14 | 9.93% |
| Soft/hardware | 14 | 9.93% |
| Geo/ecological | 13 | 9.22% |
| Activities and actions | 7 | 4.96% |
| Demographic | 5 | 3.55% |
| Other | 4 | 2.84% |
| Total of applicable papers (presenting original ML classifier) | 141 | 100.00% |
| Nonapplicable papers | 59 | – |
4.3. Labels from Human Judgment
While all approaches to curating training data involve some kind of human judgment, this question focused on cases where humans made discrete judgments about a set of specific items, which were then turned into labels for training data. More than a quarter of the papers in our corpora used some form of automation, scripting, or quantitative thresholds to label items. For example, one boundary case used medical records to label patients with or without high blood pressure (hypertension). We decided that if a medical practitioner made a diagnosis that researchers used as the label, it was human labeled. If the researchers set a quantitative threshold for high blood pressure, then parsed medical records for blood pressure readings with a script, it was not human labeled. In addition, individual human labeling could be done for all of the paper’s training data (the typical case) or only a portion. For example, some authors reported using scripts or thresholds to label some items (e.g., the “easy” cases) then labeled the remaining items manually.
In some instances, we determined that the answer could be an “implicit yes” if ample evidence indicated a particular labeling method that most likely used humans at some point, even though this was not explicitly stated by the authors. For example, many medical papers reported using diagnoses from a patient’s medical records as labels. Some of these papers gave substantial detail about who originally made the diagnosis and even what diagnostic criteria were used, while others generated labels based on medical records and did not explicitly state that a human (e.g., a medical practitioner) made the diagnosis. If we could reasonably assume a human was involved in the original diagnosis, we generally labeled the second type of papers as “no information (implicit yes).” One paper, which reused externally obtained data for labeling, was far less clear about the source of the data than the other “implicit yes” papers, so we labeled it “unsure.” However, we included it in the subsequent questions because we felt we could still answer them for that paper.
As Table 6 shows, the second most common response was for papers that did not clearly state whether their labeling was performed by a human or a machine, but that contained enough contextual details for us to be reasonably confident in assuming that human labeling was used. Note that this question was originally titled “Labels from human annotation” throughout the labeling and reconciliation process, but was renamed in the analysis stage to better reflect the instructions.
Table 6. Were labels derived from humans making discrete judgments of items?

|  | Count | Proportion |
| --- | --- | --- |
| No/Machine-labeled | 38 | 26.95% |
| Yes for all items | 53 | 37.59% |
| Yes for some items | 10 | 7.09% |
| No information (implicit yes) | 39 | 27.66% |
| Unsure (but assumed yes) | 1 | 0.71% |
| Subtotal: papers assumed to use human-labeled data | 103 | 73.05% |
| Total of applicable papers (presenting original ML classifier) | 141 | 100.00% |
| Nonapplicable papers | 59 | – |
4.4. Human Labeling for Training Versus Evaluation
This question and all subsequent questions were only applicable to papers that involved human labeling, which had “yes” or “implicit” designations to the previous question. This allowed for further specification of human labeled data usage within each publication. As Table 7 shows, human labeling for training data is the typical case, where labels are created and then used to train the classifier. Often part of this data is held out as a test set to evaluate the classifier. Human labeling for evaluation only is when the authors of the paper train the classifier using nonhuman-labeled data, but use humans to evaluate the validity of either that data set or the classifier. The overwhelming majority of papers took the more standard approach of using labels as training data, but a few did have human evaluation of classifiers trained with machine-labeled data. This question had lower rates of “unsure,” where the paper did not give enough information to make a determination.
Table 7. Was human-labeled data used for training data or to evaluate a classifier trained on nonhuman-labeled training data?

|  | Count | Proportion |
| --- | --- | --- |
| Human labeling for training data | 94 | 91.26% |
| Human labeling for evaluation only | 6 | 5.83% |
| Unsure | 3 | 2.91% |
| Total of applicable papers (assumed to use human-labeled training data) | 103 | 100.00% |
| Nonapplicable papers | 97 | – |
4.5. Original and/or External Human-Labeled Data
Our next question was about whether papers that used human labeling used original human labeling, which we defined as a process in which the paper’s authors obtained new labels from human judgments for items. This is in contrast to externally obtained data, which involves reusing existing private or public data sets of human judgments. Table 8 shows that most of the papers in our corpus that used labels from human judgment were reusing externally labeled data. Our assumption behind this question is that papers that rely on existing data sets may have less of a burden to discuss the details around the labeling process in the paper itself, as readers could review the cited paper for such details. In some cases, external and original human labeling were combined, such as when authors reused an existing labeled data set and then further labeled it for additional information.
Table 8. Did authors reuse an existing human-labeled data set (external), create a new human-labeled data set (original), or both?

|  | Count | Proportion |
| --- | --- | --- |
| Only external | 58 | 56.31% |
| Only original | 29 | 28.16% |
| Original and external | 9 | 8.74% |
| Unsure/no information (but can assume original) | 7 | 6.80% |
| Subtotal: assumed to include some original human labeling | 45 | 43.69% |
| Total of applicable papers (assumed to use human-labeled training data) | 103 | 100.00% |
| Nonapplicable papers | 97 | – |
Like the prior question, this question had lower rates of “unsure/no information,” where the paper did not give enough information to make a determination. We note that for all of the papers we labeled as “unsure/no information,” we had enough contextual or implicit information to assume that the data was not a reused/externally labeled data set. This means that the total number of papers we assume to include at least some original human labeling is 45.
4.6. Summary of ML Papers’ Approaches to Training Data
We synthesized responses to the prior questions to summarize the general breakdown of applied ML publications’ approach to their data. Out of the 141 papers in our sample that presented an original ML classifier, 27% used machine-labeled data (either by the authors or from a reused data set), 41% used an existing human-labeled data set, 27% produced a novel human labeled data set, and 5% did not provide enough information for us to answer. Table 9 and Figure 2 present these results by corpus, which show few differences at this level.
Table 9. Approach to training data by corpus: count (proportion). Totals may not equal 100% due to rounding.

|  | Life Sciences & Biomedical | Physical & Environmental Sciences | Social Sciences & Humanities | All corpora |
| --- | --- | --- | --- | --- |
| Original human-labeled data | 12 (26.7%) | 13 (25.0%) | 13 (29.5%) | 38 (26.95%) |
| External human-labeled data | 20 (44.4%) | 20 (38.5%) | 18 (40.9%) | 58 (41.1%) |
| Machine-labeled data | 12 (26.7%) | 15 (28.8%) | 11 (25.0%) | 38 (26.95%) |
| Unsure | 1 (2.2%) | 4 (7.7%) | 2 (4.5%) | 7 (5.0%) |
| Subtotal: ML classifier papers | 45 (100%) | 52 (100%) | 44 (100%) | 141 (100%) |
| (No ML classifier/NA) | 15 | 18 | 26 | 59 |
| Grand total | 60 | 70 | 70 | 200 |
4.7. Original Human Labeling Source
Our next question asked who the labelers were for the 45 papers that used original human labeling. As Table 10 shows, we found a diversity of approaches to the recruitment of human labelers. The plurality of papers gave no information about who performed their labeling task. The “survey/self-reported” category refers to papers that have individuals label data they generated, which included surveys as well as studies such as those using motion tracking, in which subjects were recorded performing different physical gestures. In contrast to Geiger et al.’s prior findings about papers that used Twitter data, none of the papers in our data set reported using crowdworking platforms. We did not consider volunteer citizen science crowdsourcing platforms to be crowdworking.
Table 10. Who were the humans doing the labeling work?

|  | Count | Proportion |
| --- | --- | --- |
| Paper’s authors | 10 | 22.22% |
| No information | 19 | 42.22% |
| Other with claim of expertise | 9 | 20.00% |
| Other no claim of expertise | 2 | 4.44% |
| Survey/self-reported | 5 | 11.11% |
| Total of applicable papers (involving original human labeling) | 45 | 100.00% |
| Nonapplicable papers | 155 | – |
4.8. Labeler Compensation
The next question asked whether and what type of compensation was offered to labelers for their work. Our labels for compensation included money or gift cards, class credit, paper authorship, other compensation, explicitly stating that no compensation was given (or that labelers were volunteers), and no information. As Table 11 shows, most publications did not provide this information, so the majority of papers received the label of “no information” for this question.
Table 11. How were labelers compensated, if at all?

|  | Count | Proportion |
| --- | --- | --- |
| Paper authorship | 10 | 22.22% |
| Volunteer/explicit no compensation | 4 | 8.89% |
| Other compensation specified | 0 | 0.00% |
| No information | 31 | 68.89% |
| Total of applicable papers (involving original human labeling) | 45 | 100.00% |
| Nonapplicable papers | 155 | – |
4.9. Training for Human Labelers and Formal Instructions
The next two questions (see Tables 12 and 13) focused on how labelers were prepared for their work. We defined training as practicing the labeling task with interactive feedback (e.g., being told what they got right or wrong, or being able to ask questions) prior to starting the main labeling work for the study. Formal instructions are documents or videos containing guidelines, definitions, and examples that the labelers could reference as an aid. In two cases, the paper gave enough detail for us to know that no definitions or instructions were given to labelers beyond the text of the question, but about half of papers did not give enough information to make a determination.
Table 12. Were any details specified about how labelers were trained?

|  | Count | Proportion |
| --- | --- | --- |
| Some training details | 7 | 15.56% |
| No information | 38 | 84.44% |
| Total of applicable papers (involving original human labeling) | 45 | 100.00% |
| Nonapplicable papers | 155 | – |
Table 13. What kind of formal instructions and/or examples were given to labelers?

|  | Count | Proportion |
| --- | --- | --- |
| Instructions with formal definitions or examples | 21 | 46.67% |
| No instructions beyond question text | 2 | 4.44% |
| No information | 22 | 48.89% |
| Total of applicable papers (involving original human labeling) | 45 | 100.00% |
| Nonapplicable papers | 155 | – |
4.10. Multiple Labeler Overlap
Our next three questions were all about using multiple labelers to review the same items. Having multiple independent labelers is typically a foundational best practice in structured content analysis, so that the integrity of the labels and the schema can be evaluated (although see McDonald et al., 2019). For multiple labeler overlap, our definitions required that papers state whether all or some of the items were labeled by multiple labelers, otherwise “no information” was recorded. We can reasonably assume that papers that did not mention whether multiple labelers were used for each item did not engage in this more intensive process, although we cannot be certain. As Table 14 shows, very few papers mentioned using multiple labelers per item, with the overwhelming majority not giving any indication.
Table 14. Multiple labeler overlap

|  | Count | Proportion |
| --- | --- | --- |
| No | 8 | 17.78% |
| Yes for all items | 6 | 13.33% |
| Yes for some items | 0 | 0.00% |
| No information | 31 | 68.89% |
| Total of applicable papers (involving original human labeling) | 45 | 100.00% |
| Nonapplicable papers | 155 | – |
4.11. Synthesis of Labeler Overlap and Reported IRR
The next two questions (see Tables 15 and 16), which were only answered if the paper had been given the label of “yes for all items” or “yes for some items,” built on the previous question. For these papers that had multiple labeler overlap, we examined the method by which labeler disagreement was reconciled and whether any IRR or IAA metric was reported. We did not record what kind of IRR/IAA metric was used, such as Cohen’s kappa or Krippendorff’s alpha, but we observed that many different metrics were used. We also did not record what the exact statistic was, although we did notice a wide variation in what was considered an acceptable score.
Table 15. How were disagreements between labelers reconciled?

|  | Count | Proportion |
| --- | --- | --- |
| Qualitative/discussion | 3 | 50.00% |
| Quantitative/no discussion | 2 | 33.33% |
| No information | 1 | 16.67% |
| Total of applicable papers (involving multiple overlap) | 6 | 100.00% |
| Nonapplicable papers | 194 | – |
Did the paper report an interrater reliability metric?
| Response | Count | Proportion |
| --- | --- | --- |
| Yes | 3 | 50.00% |
| No | 3 | 50.00% |
| Total of applicable papers (involving multiple overlap) | 6 | 100.00% |
| Nonapplicable papers | 194 | – |
4.12. Total and Median Number of Human Labelers
We then asked two final questions regarding how many individuals completed a paper's labeling task. Because this information can be presented differently depending on the labeling process, we divided it into two questions. The total number of human labelers refers to all human labelers involved in the project at any time (see Table 17). The median number of human labelers per item refers to how many labelers evaluated each item in a publication's data set, which was greater than one for papers with multiple labelers per item (see Table 18). Eight papers specified that there was only one labeler per item, which matches the data in the first question about multiple labeler overlap. The majority of papers did not provide enough information to answer either question.
Total number of labelers in the project
| Total labelers | Count | Proportion |
| --- | --- | --- |
| 1 | 2 | 4.44% |
| 2 | 6 | 13.33% |
| 3 | 2 | 4.44% |
| 5 | 1 | 2.22% |
| 7 | 1 | 2.22% |
| 10 | 1 | 2.22% |
| 30 | 2 | 4.44% |
| 659 | 1 | 2.22% |
| No information | 29 | 64.44% |
| Total of applicable papers (involving original human labeling) | 45 | 100.00% |
| Nonapplicable papers | 155 | – |
Median number of labelers per item
| Labelers per item | Count | Proportion |
| --- | --- | --- |
| 1 | 8 | 17.78% |
| 2 | 5 | 11.11% |
| 3 | 1 | 2.22% |
| No information | 31 | 68.89% |
| Total of applicable papers (involving original human labeling) | 45 | 100.00% |
| Nonapplicable papers | 155 | – |
4.13. Link to Data Set Available
Our final question was about whether the paper contained a link to the data set of original human-labeled training data. Note that this question was only answered for papers involving some kind of original or novel human labeling; papers that exclusively reused an existing open or public data set were left blank to avoid double-counting. We did not follow such links or verify that the data was actually available. As Table 19 shows, the overwhelming majority of papers did not include such a link: Only five papers (11.11%) using original human-labeled training data linked to that data. Given the time, labor, expertise, and funding required to create original human-labeled data sets, authors (especially junior scholars) may be hesitant to release such data until they feel they have published as many papers from it as they can. Data sharing also requires specific expertise in data formats, documentation, and platforms, which may not be equally distributed across academic disciplines.
Link to data set available
| Response | Count | Proportion |
| --- | --- | --- |
| No | 40 | 88.89% |
| Yes | 5 | 11.11% |
| Total of applicable papers (involving original human labeling) | 45 | 100.00% |
| Nonapplicable papers | 155 | – |
5. PAPER INFORMATION SCORES
After finalizing the labels, we quantified the information that each paper provided about training data, based on how many of our questions we could answer for each paper. We developed a total and a normalized information score, as different studies demanded different levels of information. For example, our questions about whether IRR metrics and reconciliation methods were reported are only applicable to papers involving multiple labelers per item, whereas all other questions are relevant to any project involving original human labeling. As such, papers involving original human labeling without multiple labelers per item had a maximum of 11 points, while those with multiple labelers per item had a maximum of 13 points. The normalized score is the total score divided by the maximum score possible.
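As a minimal illustration of this scoring scheme (not the exact coding instrument we used), the following Python sketch computes a total and normalized score for a hypothetical paper, assuming that answers of "no information" score zero and that the maximum is 11 or 13 points depending on labeler overlap.

```python
# A minimal sketch of how total and normalized information scores could be
# computed. The question keys and answer strings below are hypothetical
# illustrations, not the exact coding scheme used in the study.

def information_scores(answers: dict[str, str], multiple_overlap: bool) -> tuple[int, float]:
    """Score one paper: one point per question answered with substantive information."""
    total = sum(1 for value in answers.values() if value.lower() != "no information")
    max_score = 13 if multiple_overlap else 11  # IRR and reconciliation only apply with overlap
    return total, total / max_score

# Hypothetical paper with multiple labelers per item, answering 9 of 13 questions.
example = {f"q{i}": "some detail" for i in range(9)} | {f"q{i}": "no information" for i in range(9, 13)}
total, normalized = information_scores(example, multiple_overlap=True)
print(total, round(normalized, 3))  # 9 0.692
```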
5.1. Overall Distributions of Information Scores
Figure 3 shows histograms for total and normalized information scores, which show that scores varied substantially. As Geiger et al. (2020) also found, this roughly suggests two overlapping distributions and thus populations of publications: one centered around total scores of 3–5 and normalized scores of 0.3 and another centered around total scores of 9 and normalized scores of 0.7. The normalized information score ranged from 0 to 1, with one paper having a normalized score of 0 and three papers with a full score of 1. The total information score ranged from 0 to 11, with no paper receiving a full score of 13, which would have required a study involving multiple labeler overlap that gave answers to all questions, including IRR metrics and reconciliation method. Overall, the mean total score was 5.4, with a median of 5 and a standard deviation of 3.2. The mean normalized information score was 0.472, with a median of 0.455 and a standard deviation of 0.268. This is quite similar to the findings by Geiger et al. (2020) for their normalized scores, which had a mean of 0.441, a median of 0.429, and a standard deviation of 0.261.
Figure 3. Histograms of total and normalized information scores for all papers involving original human labeling.
5.2. Information Scores by Corpus and Application Areas
We analyzed information scores by corpus for all papers using original human labeling. Figure 4 is a boxplot illustrating the distribution of normalized information scores by corpus. The median score (red lines in the boxplots) was lower for Social Science & Humanities papers (0.364) than for Life Science & Biomedical papers (0.455) and Physical & Environmental Science papers (0.455). However, when comparing means between groups, the Physical & Environmental Science papers had a lower mean (0.428) than Social Science & Humanities papers (0.482) and Life Science & Biomedical papers (0.519). We ran a one-way analysis of variance (ANOVA) of normalized information scores by corpus and found no statistically significant difference (p = 0.65, F = 0.43). Because we run three statistical tests in this paper, we apply a Bonferroni correction to address the multiple comparisons problem (Dunn, 1961), moving our p-value target from 0.05 to 0.0166.
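The sketch below illustrates this kind of test with SciPy, using invented score lists rather than our data; the Bonferroni-adjusted threshold of 0.05 / 3 corresponds to the target described above.

```python
# A sketch of the test described above: a one-way ANOVA of normalized
# information scores across three corpora, with a Bonferroni-adjusted threshold.
# The score lists are made-up placeholders, not the study's data.
from scipy.stats import f_oneway

life_biomed = [0.45, 0.64, 0.27, 0.73, 0.55]
phys_env = [0.36, 0.45, 0.18, 0.64, 0.55]
soc_hum = [0.27, 0.36, 0.45, 0.82, 0.18]

f_stat, p_value = f_oneway(life_biomed, phys_env, soc_hum)
alpha = 0.05 / 3  # Bonferroni correction for three planned tests (~0.0167)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}, significant: {p_value < alpha}")
```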
Figure 4. Boxplots of normalized information scores for papers using original human labeling, by corpus.
Next, we conducted a similar analysis by classification area/domain. A boxplot showing the distribution of normalized information scores is shown in Figure 5. These were not stratified random samples, and we ended up with far more papers in some categories than others, with only one paper each in the Physical and Other categories. The small sample sizes make formal statistical tests difficult to interpret, and the assumption of homoscedasticity is not fulfilled because of the wide range in standard deviations between these groups (e.g., 0.13 for Geo/ecological versus 0.39 for Demographic). We therefore recommend against drawing generalizable statistical conclusions from this analysis, but we report these scores to inform future work. Most groups' mean and median scores were between 0.4 and 0.6, with papers in the Linguistic category having a lower median (0.318). The most common categories (Linguistic, Medical, and Biological) also had much wider distributions and IQRs, but similar means. Activities & actions was the highest scoring category in terms of mean, median, and upper and lower IQR. In these studies, the data are generally recordings of a person performing an activity, and each label is the activity they were asked to perform. This research design may lead authors to describe such methods more concretely.
Figure 5. Boxplots of normalized information scores for all papers involving original human labeling, by application area/domain. Physical and Other only had one paper each. Activities & actions does not have whiskers because no items had scores between 1 and 1.5 * IQR, but it did have two outliers outside the 1.5 * IQR range.
5.3. Normalized Information Scores by Document Type
For the 45 papers using original human labeling, 33 were journal articles and 12 were conference papers. We conducted an analysis of normalized information scores by document type, which showed larger differences. As Figure 6 shows, articles had a higher mean (0.53 vs. 0.31) and median (0.45 vs. 0.27). We ran a two-tailed Welch's unequal variances t test (Welch, 1947) (variances differed by 0.024) and found a statistically significant difference (p = 0.0086, t = 2.86). We applied a Bonferroni correction to the p-value threshold to address the problem of multiple comparisons, and the p-value is well below our adjusted target of 0.0166. This suggests that, in our sample, journal articles generally provided more information about training data than conference papers.
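A sketch of this kind of comparison with SciPy follows; the score lists are invented placeholders, and passing equal_var=False gives Welch's unequal variances t test.

```python
# A sketch of a two-tailed Welch's unequal variances t test comparing
# normalized information scores for journal articles vs. conference papers.
# The score lists are illustrative placeholders, not the study's data.
from scipy.stats import ttest_ind

article_scores = [0.55, 0.64, 0.45, 0.73, 0.36, 0.82]
conference_scores = [0.27, 0.36, 0.18, 0.45]

t_stat, p_value = ttest_ind(article_scores, conference_scores, equal_var=False)  # Welch's test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, below adjusted target: {p_value < 0.05 / 3}")
```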
Figure 6. Boxplot of normalized information scores for papers using original labeling, by document type.
5.4. Label Source Information Scores
Finally, because the relatively small number of papers involving original human labeling (n = 45) leads to low statistical power for the paper information scores, we examined all papers that presented an original ML classifier (n = 141) based on whether they gave sufficient information to determine if their data set's labels were derived from original human labeling. As discussed in Section 4.3, we gave many papers the answer "no information (implicit yes)," which means we could reasonably assume that labels were made by humans, but the paper never explicitly said humans were involved. Papers with answers "Yes for all items," "Yes for some items," and "No/machine-labeled" were scored 1. Papers with answers "No information (implicit yes)" and "Unsure (but assumed yes)" were scored 0. N/A papers that did not present an original classifier were excluded.
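The following sketch illustrates this binary scoring and the per-group reporting rates; the answer strings mirror the categories described above, while the sample records and grouping are invented for illustration.

```python
# A sketch of the binary label-source score described above: explicit answers
# count as 1, implicit or unsure answers as 0, and N/A papers are excluded.
# The sample records below are invented, not drawn from the study's data.
EXPLICIT = {"Yes for all items", "Yes for some items", "No/machine-labeled"}
IMPLICIT = {"No information (implicit yes)", "Unsure (but assumed yes)"}

def label_source_score(answer):
    if answer in EXPLICIT:
        return 1
    if answer in IMPLICIT:
        return 0
    return None  # N/A: paper did not present an original classifier

papers = [("Linguistic", "Yes for all items"), ("Medical", "No information (implicit yes)"),
          ("Biological", "No/machine-labeled"), ("Linguistic", "Unsure (but assumed yes)")]

scores = [(area, label_source_score(ans)) for area, ans in papers
          if label_source_score(ans) is not None]
for area in sorted({a for a, _ in scores}):
    group = [s for a, s in scores if a == area]
    print(area, f"{sum(group) / len(group):.0%}")  # reporting rate per area
```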
Figure 7 shows the label source reporting rates by corpus, which are strikingly similar. Social Science & Humanities papers had a rate of 72.7%, compared to rates of 71.1% for the other two corpora. Figure 8 shows the label source reporting rates by application area, which span a much wider range. Activities & actions also has the highest rate at 100% (likely for the same reasons hypothesized earlier), and the lowest rate is Geo/ecological at 46.1%. We also note the differences between these results and the overall paper information scores, which were inversely ranked for the larger categories of Linguistic, Medical, and Biological. While Linguistic papers had lower median information scores, they had far higher rates of label source reporting (79.2%), compared to Medical (69.8%) and especially Biological (58.8%) papers.
Figure 7. Label source reporting rates for papers presenting an original classifier, by corpus.
Figure 8. Label source reporting rates for papers presenting an original classifier, by area.
Figure 9 shows label source reporting rates by journal article versus conference paper, with a much higher rate for articles. We ran a two-tailed Welch's unequal variances t test (variances differed by 0.082), but the difference was not statistically significant (p = 0.038, t = 2.35): after applying a Bonferroni correction to the p-value threshold to address the problem of multiple comparisons, the p-value is above our adjusted target of 0.0166.
Figure 9. Label source reporting rates for papers presenting an original classifier, by document type.
5.5. Conclusion to Information Score Results
In conclusion, our quantitative metrics show widely varying ranges and distributions of information scores, which provides evidence for the claim that there is substantial variation in practices around human labeling, training data curation, and research documentation. The ranges of the boxplots of normalized information scores are substantial, both for the IQRs (25th to 75th percentile) and for the whiskers at 1.5 * IQR. Ranges are larger when sampling by corpus, but still substantial for the application areas with more papers (e.g., Medical, Biological, Linguistic).
We specifically call for more investigation into applied ML geo/ecological research, which often classifies land use from aerial photos or photos of geological samples. These had the lowest rates of label source specification and the lowest mean normalized information scores (excluding the categories that only had one paper). However, from our experience, some papers with lower scores did give excellent levels of detail about how they were using an existing land use classification schema they cited (e.g., the widely used USGS guide by Anderson, Hardy et al. [1976]), but did not give any details about who applied that schema to the aerial photos. We can also hypothesize that in fields with widely established and shared methodological standards, researchers could have far higher rates of adherence to methodological best practices around data labeling, but have lower rates of reporting that they actually followed those practices in papers.
Finally, we draw attention to the different rates when we grouped by corpus versus application area. In our sampling, the corpus was the Scopus-provided metadata field, which is determined at the publication level when a journal is added to Scopus. At this level, we saw fewer differences in quantitative scores. In contrast, our application area field was determined for each paper based on its content, independent of the journal or venue in which it was published. Scores varied far more when broken out by application area, which is likely due in part to noise in the smaller samples. However, this may also indicate that methodological reporting rates vary even more within subfields or across types of research objects. For future work, we recommend that researchers pay specific attention to differences between fields or objects of study, rather than solely grouping papers in the high-level way we did with our three corpora.
6. CONCLUDING DISCUSSION
6.1. Findings
First, our study shows that contemporary applications of supervised ML across disciplines often rely either on existing human-labeled data sets or on items labeled through some kind of automated process. Of the papers in our data set that presented an original ML classifier, only 26.7% produced a new human-labeled data set as part of their study, a rate that did not substantially vary among our three corpora from the biomedical & life sciences, the physical & environmental sciences, and the humanities & social sciences. Second, among the applied ML publications that did produce a new human-labeled training data set, there was substantial divergence in reporting methodological details and following best practices in human labeling. A small number of publications received top information scores, but approximately two-thirds of publications involving original human labeling did not provide enough information for us to answer more than half of the subsequent questions we asked about the labeling process.
This cross-disciplinary trend is cause for concern, given that high-quality training data is essential to the validity of ML classifiers and human judgment is notoriously difficult to standardize. When comparing across our three broad corpora of social science & humanities, biomedical & life sciences, and physical & environmental sciences papers, we only see marginal differences in the level of information papers provide. We do see more robust evidence that journal articles have higher rates of reporting information about training data than conference papers, which may relate to conference papers being shorter and only involving a single cycle of peer review.
6.2. Implications
6.2.1. The black-boxing of training data
ML is increasingly used across disciplines and application domains, but the quality of supervised ML classifiers is only as good as the data used to train them. Based on our findings, we argue for more attention to be placed on the specific details of how that training data is labeled. There is a recent wave of work that interrogates ML models once they are trained, as well as considerations about "automation bias" (Skitka, Mosier, & Burdick, 1999): the tendency to treat trained models as a "black box," with their outputs unquestioned and taken as given. These concerns must also extend to the labeling and curation of training data sets, some of which become widely reused without being examined.
For example, Crawford and Paglen (2019) have called attention to problematic racial labels of images in the popular ImageNet training data set, which has been a standard benchmark data set in image recognition for over a decade. Birhane and Prabhu (2021) found thousands of images in the 80 Million Tiny Images data set that were labeled with offensive racial and gender-based slurs. The careful curation of data sets has long been a central tenet in the institutions of science, although standards and practices can change dramatically over time and across contexts. Historians of science such as Bowker (2005) and Gitelman (2013) remind us that data is never “raw,” as data always is produced and used within a messy assemblage of partially overlapping human institutions, each of which have their own practices, values, and assumptions. To this end, we call for applied ML researchers and practitioners who are re-using human-labeled data sets to exercise as much caution and care around the decisions to reuse a labeled data set as they would if they were labeling the data themselves.
Finally, we did not ask any questions about how papers discuss data cleaning, but we encourage more investigation into how this often-backgrounded work is performed, managed, and documented. We could have asked another dozen questions about whether and how papers discussed cleaning their data, and for future work we encourage researchers to study what applied ML papers report about how they cleaned and preprocessed their data. We also see much future work in studying to what extent applied ML papers report efforts at de-biasing data sets and models.
6.2.2. Institutional change around data documentation
We call on the institutions of science—publications, funders, disciplinary societies, and educators—to play a major role in working out solutions to these issues of data quality and research documentation. We see this work as part of the open science and reproducibility movement, specifically the movement for open access to research data sets, materials, protocols, and analysis code. However, even advocates of this movement have long discussed how individual researchers do not have incentives to be first-movers in being more open than usual about the messiness in all research, because it leaves their work more open to rebuttal (Ali-Khan, Harris, & Gold, 2017; Smaldino, 2016; Zimring, 2019). In our own experience, we have certainly felt the temptation to not report certain details that would lead others to have less confidence in our study, such as our IRR metrics.
In looking towards solutions, we see a parallel with issues in open access to publications, which often requires individual researchers to choose if they want to pay for open access out of their own funding. While some first-movers paid for this out of their own budgets, open access is currently being far more effectively tackled at the institutional level in ways that will not require individuals (and especially first-movers) to pay the costs. So too do we see institutional solutions to the issue of methodological detail, where a common floor could be established that is equally applicable to all researchers. We also see resonance with the various proposed efforts at standardizing documentation about ML models and data sets (Barclay et al., 2019; Bender & Friedman, 2018; Beretta et al., 2018; Gebru et al., 2018; Hind et al., 2018; Holland et al., 2018; Mitchell et al., 2019; Raji & Yang, 2019) and urge that human labeling details be included in such efforts.
On the publication process, we note that research publications are limited by length restrictions, which can leave little space for details. We hypothesize that having a dedicated and visible space for methodology and data set documentation would make these concerns more central for authors, reviewers, editors, and readers, although we can only speculate as to the best way to implement this. For example, Nature has far shorter word limits for a main research article (2,000 to 2,500 words), which means methodological and data set documentation is often fully detailed in appendices, which can be of any length. Does this approach more easily lead readers and/or reviewers to ignore such details and focus more on results? However, Nature also requires that authors fill out a peer-reviewed checklist form that asks general and domain-specific questions about statistical details (e.g., "a description of all covariates tested") and about the data set (e.g., for behavioral science, "State the research sample … provide relevant demographic information … and indicate whether the sample is representative"). Do these kinds of mandatory structured disclosure forms make these concerns more central to authors and reviewers, even if they are not as accessible to readers?
We also note that peer reviewers and editors play a major role in deciding what details are considered extraneous. First, we urge reviewers to make space for what some may see as “boring” methodological details. More importantly, we call on editorial boards to openly signal in author and reviewer guidelines that they invite or even require extended discussion of methodological details. To this end, one recent trend is the growth of multistakeholder groups that have collectively released formal guidelines or best practices statements on research reporting, such as the CONSORT guidelines for reporting randomized clinical trials (Schulz, Altman et al., 2010), the COREQ guidelines for reporting qualitative research (Tong, Sainsbury, & Craig, 2007), or the PRISMA guidelines on reporting meta-analyses and systematic reviews (Moher, Liberati et al., 2009).
For example, the PRISMA guidelines on reporting meta-analyses and systematic reviews have been mandated in the author guidelines of many journals (including The Lancet, PLOS ONE, and Systematic Reviews), which require authors to fill out the 27-item PRISMA checklist. One interesting trend with such multistakeholder best practices statements in medicine is the proliferation of subdomain-specific "extensions" that further specify methodological reporting standards. For example, the EQUATOR network tracks 32 extensions to the CONSORT guidelines, including guidelines for reporting randomized clinical trials in pain management (Gewandter, Eisenach et al., 2019), orthodontics (Pandis, Fleming et al., 2015), and psycho-social interventions (Montgomery, Grant et al., 2018).
However, there have been disagreements over the impact and efficacy of these more structured approaches. Page and Moher's meta-analysis of 57 papers studying uptake of the PRISMA guidelines (Page & Moher, 2017) found that while more papers have reported the details specified in the PRISMA guidelines since their release in 2009, reporting rates for some details remain low even among papers claiming to adhere to the guidelines. For example, for nine of the PRISMA items, fewer than 67% of papers actually reported the information in question. Fleming, Koletsi, and Pandis (2014) found that following the widespread uptake of the PRISMA guidelines by certain publications, more meta-analysis articles reported methodological details, but disproportionately those details listed in the PRISMA guidelines. The authors raise concerns that PRISMA has overdetermined the peer review process: Authors who are fully compliant with PRISMA no longer report other methodological details that Fleming et al. argue are also relevant to such work and that appeared in competing meta-analysis guidelines that ultimately lost out to PRISMA.
6.2.3. Are there universal best practices for the labeling of training data?
The efforts around methodological standards in medicine raise an important question about the wisdom of seeking a single one-size-fits-all set of best practices for any application of supervised ML. However, contemporary efforts around “fairness” or “transparency” in ML often work towards more universal or domain-independent approaches, which are applied to a wide range of application areas (e.g., finance, social services, policing, hiring, medicine). Yet in our work examining publications from quite different academic fields, we found ourselves needing to pay close attention to the various kinds of specialized expertise that are required to label a training data set for a particular purpose. As Bowker and Star (1999) and Goodwin (1994) discuss, all classification systems rely on a shared cultural context, which can be exceedingly difficult to formally specify and often falls apart at the edges. It can be difficult to know beforehand what level of shared cultural context and expertise will be involved.
Some of the papers we analyzed described in great detail how the people who labeled their data set were chosen for their expertise, from seasoned medical practitioners diagnosing diseases to youth familiar with social media slang in multiple languages. That said, not all labeling tasks require years of specialized expertise, such as more straightforward tasks we saw, including distinguishing positive versus negative business reviews or identifying different hand gestures. Even projects in the same domain can require different levels of expertise, such as a data set of animal photos labeled just for the presence of cats and dogs, versus labeling the same photos for the specific breed of cats and dogs. Furthermore, we found that some labeling tasks are well suited to semiautomated labeling where labelers are assisted with rule-based approaches, while others are not. Finally, even the more seemingly straightforward classification tasks can still have substantial room for ambiguity and error for the inevitable edge cases, which require training and verification processes to ensure a standardized data set.
The labeling protocol and schema we developed and used in this paper, which is based on and extends prior work (Geiger et al., 2020), is an effort at creating a cross-disciplinary standard for any research project that uses human-labeled training data. While we believe that any peer reviewer or reader can ask these same questions of any ML application paper, they are only a starting point. Issues of validity, consistency, reliability, reproducibility, and accountability require further investigation. The kind of domain-independent criteria we used should be seen as necessary but not sufficient criteria for having confidence in a labeled data set. We do not advocate for a single, universal, one-size-fits-all solution, but instead seek to spur conversations within and across disciplines about better approaches to bring the work of data labeling into the foreground. We see a role for the classic principle of reproducibility, but applied to data labeling: Does the paper provide enough detail that another researcher could hypothetically recruit a similar team of labelers, give them the same instructions and training, reconcile disagreements similarly, and have them produce a similarly labeled data set?
Data publications, which are standalone peer-reviewed publications that do not answer a research question but instead devote the entire paper to describing the creation of a data set in rich detail, could also play a major role in this issue (Candela, Castelli et al., 2015; Chavan & Penev, 2011; Costello, 2009; Smith, 2009). In seeking to bring the work of data labeling from the background to the foreground, our work is also aligned with scholars who have focused on the often undercompensated labor of crowdworkers and have called for researchers to detail how much they pay for data labeling (Silberman et al., 2018).
6.3. Limitations
To conclude, we reflect that our study has the same kinds of limitations that many human labeling projects have. For example, given the concerns we raise about domain-specific expertise, our team may have missed or misinterpreted crucial details when examining papers. The second issue is the reliability and reproducibility of our team's labeling process. In conducting this study, we have become quite familiar with the difficulties of getting a medium-sized team to build consensus around reducing complex objects into quantifiable data. We specifically chose a more detailed and time-intensive process in which disagreements were discussed, which traded off against the total number of items we were able to label. We believe this trade-off was the right decision, given our focus on methodological rigor, but it does mean our samples are smaller than we would like. The smaller sample size means that we have less confidence in the statistical generalizability of our sample to the population of all applied ML publications. However, we see a wide range of future work that can extend these efforts, such as teams of domain-specific experts examining applied ML in their areas of expertise.
Finally, we only have access to what each publication reported about the work they did, and not the research project itself, which means our unit of analysis is methodological reporting. For example, researchers could have far higher rates of following methodological best practices around data labeling, but have lower rates of reporting that they actually followed those practices in papers. We could even hypothesize an inverse relationship between a field’s overall adherence to methodological best practices and researchers’ rates of reporting adherence to those practices, if such practices become so routine and mundane that they are left implicit in publications. For these reasons, we strongly advise against interpreting our quantitative scores as an unproblematic proxy for methodological rigor, especially for the scores by discipline and area. However, given our interest in how labeling practices impact the validity of ML models and classifiers, future work could extend this work through other methods, such as surveys and ethnographic studies of ML researchers.
ACKNOWLEDGMENTS
We would like to acknowledge support for this project from the UC-Berkeley Undergraduate Research Apprenticeship Program (URAP) and thank Stacey Dorton for administrative and logistical support. We thank the peer reviewers for their feedback and suggestions.
AUTHOR CONTRIBUTIONS
R. Stuart Geiger: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing—original draft, Writing—review & editing. Dominique Cope: Investigation, Validation, Writing—original draft, Writing—review & editing. Jamie Ip: Data curation, Investigation, Software, Validation. Marsha Lotosh: Investigation, Validation, Visualization. Aayush Shah: Investigation, Validation. Jenny Weng: Investigation, Validation, Writing—review & editing. Rebekah Tang: Investigation.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
This work was funded in part by the Gordon & Betty Moore Foundation (Grant GBMF3834) and the Alfred P. Sloan Foundation (Grant 2013-10-27), as part of the Moore-Sloan Data Science Environments grant to UC-Berkeley. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
DATA AVAILABILITY
Notes
For this and all other boxplots in this paper: The main box is the interquartile range (IQR), or the 25th to 75th percentiles. The middle red line is the median, and the black marker is the mean. The outer whiskers are the highest and lowest data points within a range of 1.5 times the IQR from the median. Grey diamonds are outliers beyond 1.5 times the IQR from the median.
Not all conference papers are peer reviewed, but all conference proceedings indexed by Scopus are peer reviewed.
Moher is the lead author of the PRISMA statement.
REFERENCES
Author notes
The majority of the work on this paper was conducted when this author was affiliated with the University of California, Berkeley.
Handling Editor: Ludo Waltman