Abstract
Reproducibility has become an increasingly debated topic in NLP and ML over recent years, but so far, no commonly accepted definitions of even basic terms or concepts have emerged. The different definitions proposed within NLP/ML not only disagree with each other, they are also not aligned with standard scientific definitions. This article examines the standard definitions of repeatability and reproducibility provided by the meta-science of metrology, explores what they imply about how to assess reproducibility, and considers what adopting them would mean for reproducibility assessment in NLP/ML. It turns out that the standard definitions lead directly to a method for assessing reproducibility in quantified terms, one that renders results from reproduction studies comparable across multiple reproductions of the same original study, as well as across reproductions of different original studies. The article also considers where this method sits in relation to other aspects of NLP work one might wish to assess in the context of reproducibility.
1 Introduction
Reproducibility of results is coming under increasing scrutiny in the machine learning (ML) and natural language processing (NLP) fields, against the background of a perceived reproducibility crisis in science more widely (Baker 2016), and NLP/ML specifically (Mieskes et al. 2019). There have been workshops and checklist initiatives, conferences promoting reproducibility via calls, chairs’ blogs and special themes, and the first shared tasks, including REPROLANG’20 (Branco et al. 2020) and ReproGen’21 (Belz et al. 2021b).
Despite this growing body of research on reproducibility, it has been observed that no standard terms and definitions have emerged (Cohen et al. 2018). For the Association for Computing Machinery (ACM 2020), results have been reproduced if obtained in a different study by a different team using artifacts supplied in part by the original authors, and replicated if obtained in a different study by a different team using artifacts not supplied by the original authors. For Drummond (2009), replicability is the ability to re-run an experiment exactly, whereas reproducibility is the ability to obtain the same result by different means. For Rougier et al. (2017), “[r]eproducing the result of a computation means running the same software on the same input data and obtaining the same results. [...]. Replicating a published result means writing and then running new software based on the description of a computational model or method.” Wieling, Rawee, and van Noord (2018) tie reproducibility to “the same data and methods,” and Whitaker (2017), followed by Schloss (2018), ties definitions of reproducibility, replicability, robustness, and generalizability to different combinations of same vs. different data and code. For Cohen et al. (2018), reproducibility is “a property of the outcomes of an experiment: arriving—or not—at the same conclusions, findings, or values,” whereas they “exclude issues related to the ability to repeat the experiments reported in a paper [which is] replicability or repeatability.”
No two of the above definitions are fully in agreement with each other, and none are entirely aligned with the standard scientific definitions of repeatability and reproducibility used in other fields and codified in the International Vocabulary of Metrology (VIM) (JCGM 2012). In fact, the US National Information Standards Organization had to ask the ACM to “harmonize its terminology and definitions with those used in the broader scientific research community” (ACM 2020), which prompted the latter to switch its definitions of replicability and reproducibility, without however achieving full alignment with the standard definitions of the two terms.
The remainder of this article examines the VIM definitions and the kind of approach to assessing reproducibility that their adoption for NLP/ML would result in. Section 2 presents and discusses the VIM definitions (JCGM 2012); Section 3 lays out what reproducibility assessment looks like if wholly based on them, including practical steps and an example application in NLP; and Section 4 discusses the limits of the proposed approach and a variety of questions that have come up in discussions and reviews of this work. The article concludes (Section 5) with a consideration of what would need to change for Quantified Reproducibility Assessment (QRA) to become part of NLP workflows, and what incentive the field might have to accept the implied overhead.
2 Definitions from VIM and Implications for Reproducibility Assessment
The International Vocabulary of Metrology (VIM) (JCGM 2012) defines repeatability and reproducibility as follows (defined terms in bold, see VIM for subsidiary defined terms):
- 2.21
measurement repeatability (or repeatability, for short) is measurement precision under a set of repeatability conditions of measurement.
- 2.20
a repeatability condition of measurement (repeatability condition) is a condition of measurement, out of a set of conditions that includes the same measurement procedure, same operators, same measuring system, same operating conditions and same location, and replicate measurements on the same or similar objects over a short period of time.
- 2.25
measurement reproducibility (reproducibility) is measurement precision under reproducibility conditions of measurement.
- 2.24
a reproducibility condition of measurement (reproducibility condition) is a condition of measurement, out of a set of conditions that includes different locations, operators, measuring systems, etc. A specification should give the conditions changed and unchanged, to the extent practical.
In other words, repeatability and reproducibility are properties not of objects, scores, results, or conclusions, but of measurements M that are parameterized by measurand m, object O, time t, and conditions of measurement C and return a measured quantity value v. They are defined as measurement precision, that is, quantified by calculating the precision of a set of measured quantity values, relative to a set of conditions of measurement, which have to be known and specified for assessment of repeatability and reproducibility to be meaningful.
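One way to make this parameterization explicit is the following notation (assumed here for illustration; it is not VIM's own):

```latex
% A measurement M of measurand m, performed on object O at time t under
% conditions of measurement C, returns a measured quantity value v:
\[
  v = M(m, O, t, C)
\]
% Repeatability and reproducibility are then the precision of a set of values
% v_1, ..., v_n obtained from measurements sharing m and O, with the
% conditions C_i identical (repeatability) or differing in specified,
% known ways (reproducibility).
```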
3 Assessing Reproduction Results in Practice
Generally, when carrying out a reproduction study of some original work in ML/NLP, we consider the system and how it was created and trained, the scores that were obtained for it, and the claims made on the basis of those scores. In existing work, the corresponding questions addressed in assessing the results of a reproduction are of three broad types: how easily can the system be recreated; how similar are the scores; and can the same claims be supported. Only the second of these is about reproducibility in the general scientific sense, and it is the only one that metrology directly helps to answer. In order to have a complete approach to answering it in practice, we need three more components in addition to the definitions from Section 2: (i) a method for computing precision; (ii) specified conditions of measurement; and (iii) a procedure for carrying out reproducibility assessments. Each is addressed in turn below, followed by an example application.
Computing Precision
In other fields, measurement precision is typically reported in terms of the coefficient of variation (CV), defined as the standard deviation over the mean (of a sample of measured quantity values). Other reported values can include the mean alongside the standard deviation with 95% confidence intervals, and the percentage of values within n standard deviations. CV serves well as the “headline” result because it is a general measure that is not expressed in the unit of the measurements (unlike the mean and standard deviation), providing a quantification of degree of reproducibility that is comparable across studies (Ahmed 1995, page 57). This also holds for the percentage of values within n standard deviations, but the latter is less widely recognized and less intuitive.
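As a minimal sketch of the computation (plain CV only; the CV* used in the example application below additionally applies a small-sample correction, as described in the referenced papers, so the values reported there differ slightly):

```python
import numpy as np

def coefficient_of_variation(values):
    """Plain coefficient of variation as a percentage:
    sample standard deviation over the mean."""
    values = np.asarray(values, dtype=float)
    # ddof=1: the measured quantity values are a sample of possible measurements.
    return 100.0 * values.std(ddof=1) / values.mean()

# The eight mult-dep− wF1 scores from Table 1: plain CV is about 4.2, slightly
# below the CV* of 4.5 reported there, which includes the small-sample correction.
scores = [0.703, 0.660, 0.650, 0.651, 0.699, 0.711, 0.651, 0.710]
print(round(coefficient_of_variation(scores), 2))
```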
Conditions of Measurement
Individual conditions are specified by name and value, and their role is to capture those attributes of a measurement where different values may cause differences in measured quantity values. For repeatability assessment, conditions need to capture all sources of variation in results (although not necessarily each with a separate condition). It so happens that much of the reproducibility work in ML and NLP has so far focused on what standard conditions of measurement (information about system, data, dependencies, computing environment, etc.) for metric measurements need to be specified in order to enable repeatability assessment, even if it hasn’t been couched in these terms.
Reproducibility checklists such as those provided by Pineau (2020) and the ACL for metric evaluation, and evaluation datasheets like the one proposed by Shimorina and Belz (2021) for human evaluation, are lists of types of information (attributes) for which authors are asked to provide information (values), and these can directly be construed as conditions of measurement. In the example in the following section, we use the conditions of measurement shown in the column headings of Table 1, based on the above checklists and described in more detail in Belz, Popovic, and Mille (2022): two object conditions (code by; compiled/trained by), two measurement method conditions (method; implemented by), and three measurement procedure conditions (procedure; test set; performed by).
The values for each condition characterize how measurements differ with respect to that condition; in reporting results from the QRA tests below, we use paper and method identifiers as shorthand for distinct condition values (full details in each case being available from the referenced papers). The intention here is not to propose a definitive set of conditions for general use; that is beyond the scope of this article, and at any rate a standard set should evolve by consensus over time.
Reproducibility Assessment Procedure
We now have all the components needed for quantified reproducibility assessment, but how do we go about carrying it out in practice? From metrology we know we need to compute the precision of two or more measured quantity values obtained in measurements of the same object and measurand with specified conditions of measurement, identical in the case of repeatability and differing in the case of reproducibility. There are two possible starting positions: (a) the original and (some) reproduction studies have been carried out; and (b) the original study has been carried out, but no reproductions.
In case (a), the conditions of measurement that can be used are limited to the information that can be gleaned from publications, code/data repositories, and the authors. This rules out repeatability assessment in most cases. In such cases, reproducibility assessment involves the steps indicated in the bottom half of Figure 1, termed 1-Phase QRA; the example QRA below is such a case. If the set of measurements includes subsets of measurements that have more conditions in common, then it makes sense to report precision separately for these subsets.
In case (b), repeatability assessment can be carried out, and arguably should be carried out before any reproducibility assessment: for a true estimate of the variation resulting from given differences in condition values, the baseline variation present when all condition values are the same needs to be known. For example, if the coefficient of variation is CV0 under identical conditions C0, and CVi under varied conditions Ci, then it is the difference between CV0 and CVi that estimates the effect of varying the conditions. The steps in this 2-phase reproducibility assessment are shown at the top of Figure 1.
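To illustrate the 1-phase case, the sketch below assumes a simple record format (illustrative only, not taken from the cited papers): each measurement is stored as its condition values plus the measured quantity value, measurements are grouped by the conditions they share, and precision is reported separately per group.

```python
from collections import defaultdict
from statistics import mean, stdev

# Each measurement: condition values plus the measured quantity value.
# Condition names follow Table 1; the record format itself is illustrative.
measurements = [
    {"object": "mult-base", "measurand": "wF1", "test_set": "d1",
     "performed_by": "Va.&Ra.", "value": 0.428},
    {"object": "mult-base", "measurand": "wF1", "test_set": "d1",
     "performed_by": "Bestgen", "value": 0.574},
    {"object": "mult-base", "measurand": "wF1", "test_set": "d1",
     "performed_by": "Bestgen", "value": 0.579},
    {"object": "mult-base", "measurand": "wF1", "test_set": "d2",
     "performed_by": "Hub.&Colt.", "value": 0.493},
]

def cv(values):
    """Plain coefficient of variation as a percentage."""
    return 100.0 * stdev(values) / mean(values)

def precision_by_shared_conditions(measurements, conditions):
    """Group measurements that agree on the given conditions and report
    CV separately for each group containing two or more values."""
    groups = defaultdict(list)
    for m in measurements:
        groups[tuple(m[c] for c in conditions)].append(m["value"])
    return {key: round(cv(vals), 2) for key, vals in groups.items() if len(vals) > 1}

# Precision over all mult-base measurements vs. only those sharing the
# original test set split (d1):
print(precision_by_shared_conditions(measurements, ["object", "measurand"]))
print(precision_by_shared_conditions(measurements, ["object", "measurand", "test_set"]))
```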
Example Application
As an example of how QRA can be used in practice, we apply the approach outlined above, in its 1-phase version with the conditions of measurement and CV* as given above, to a set of reproductions of the essay scoring system evaluations reported by Vajjala and Rama (2018), which were carried out as part of REPROLANG (Branco et al. 2020). More specifically, we look at five of the 11 multilingual essay scoring system variants and the weighted F1 scores (wF1) computed for them. Table 1 shows object (system variant) and measurand (wF1) in the first two columns. The baseline classifier (mult-base) uses document length (number of words) as its only feature. For the other variants, +/− indicates that the multilingual classifier was / was not given information about which language the input was in; mult-dep uses n-grams over (dependency relation, dependent POS, head POS) triples; and mult-emb uses word and character embeddings. The mult-base model is a logistic regressor; the others are random forests.
Table 1. Object conditions: Code by, Compiled/trained by. Measurement method conditions: Method, Implemented by. Measurement procedure conditions: Procedure, Test set, Performed by.

| Object | Measurand | Code by | Compiled/trained by | Method | Implemented by | Procedure | Test set | Performed by | Measured quantity value | CV* |
|---|---|---|---|---|---|---|---|---|---|---|
| mult-base | wF1 | Va.&Ra./s1 | Va.&Ra. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Va.&Ra. | 0.428 | 14.63 |
| | | Va.&Ra./s2 | Hub.&Colt. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d2 | Hub.&Colt. | 0.493 | |
| | | Va.&Ra./s? | Arhiliuc&al. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Arhiliuc&al. | 0.426 | |
| | | Va.&Ra./s1 | Va.&Ra./env1 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.574 | |
| | | Va.&Ra./s1 | Va.&Ra./env2 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.579 | |
| | | Va.&Ra./s2 | Va.&Ra./env2 | wF1(o,t) | ≈Va.&Ra. | OTE | Va.&Ra./d3 | Bestgen | 0.590 | |
| | | Va.&Ra./s1 | Va.&Ra./env3 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Cai.&But. | 0.574 | |
| | | Cai.&But. | Cai.&But. | wF1(o,t) | Cai.&But. | OTE | Va.&Ra./d4 | Cai.&But. | 0.600 | |
| mult-dep− | wF1 | Va.&Ra./s1 | Va.&Ra. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Va.&Ra. | 0.703 | 4.5 |
| | | Va.&Ra./s2 | Hub.&Colt. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d2 | Hub.&Colt. | 0.660 | |
| | | Va.&Ra./s? | Arhiliuc&al. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Arhiliuc&al. | 0.650 | |
| | | Va.&Ra./s1 | Va.&Ra./env1 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.651 | |
| | | Va.&Ra./s1 | Va.&Ra./env2 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.699 | |
| | | Va.&Ra./s2 | Va.&Ra./env2 | wF1(o,t) | ≈Va.&Ra. | OTE | Va.&Ra./d3 | Bestgen | 0.711 | |
| | | Va.&Ra./s1 | Va.&Ra./env3 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Cai.&But. | 0.651 | |
| | | Cai.&But. | Cai.&But. | wF1(o,t) | Cai.&But. | OTE | Va.&Ra./d4 | Cai.&But. | 0.710 | |
| mult-dep+ | wF1 | Va.&Ra./s1 | Va.&Ra. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Va.&Ra. | 0.693 | 4.39 |
| | | Va.&Ra./s2 | Hub.&Colt. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d2 | Hub.&Colt. | 0.661 | |
| | | Va.&Ra./s? | Arhiliuc&al. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Arhiliuc&al. | 0.652 | |
| | | Va.&Ra./s1 | Va.&Ra./env1 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.653 | |
| | | Va.&Ra./s1 | Va.&Ra./env2 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.699 | |
| | | Va.&Ra./s2 | Va.&Ra./env2 | wF1(o,t) | ≈Va.&Ra. | OTE | Va.&Ra./d3 | Bestgen | 0.712 | |
| | | Va.&Ra./s1 | Va.&Ra./env3 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Cai.&But. | 0.653 | |
| | | Cai.&But. | Cai.&But. | wF1(o,t) | Cai.&But. | OTE | Va.&Ra./d4 | Cai.&But. | 0.716 | |
| mult-emb− | wF1 | Va.&Ra./s1 | Va.&Ra. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Va.&Ra. | 0.693 | 17.03 |
| | | Va.&Ra./s2 | Hub.&Colt. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d2 | Hub.&Colt. | 0.658 | |
| | | Va.&Ra./s? | Arhiliuc&al. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Arhiliuc&al. | 0.683 | |
| | | Va.&Ra./s1 | Va.&Ra./env1 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.668 | |
| | | Va.&Ra./s1 | Va.&Ra./env2 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.692 | |
| | | Va.&Ra./s2 | Va.&Ra./env2 | wF1(o,t) | ≈Va.&Ra. | OTE | Va.&Ra./d3 | Bestgen | 0.689 | |
| | | Va.&Ra./s1 | Va.&Ra./env3 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Cai.&But. | 0.659 | |
| | | Cai.&But. | Cai.&But. | wF1(o,t) | Cai.&But. | OTE | Va.&Ra./d4 | Cai.&But. | 0.391 | |
| mult-emb+ | wF1 | Va.&Ra./s1 | Va.&Ra. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Va.&Ra. | 0.689 | 16.23 |
| | | Va.&Ra./s2 | Hub.&Colt. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d2 | Hub.&Colt. | 0.662 | |
| | | Va.&Ra./s? | Arhiliuc&al. | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Arhiliuc&al. | 0.681 | |
| | | Va.&Ra./s1 | Va.&Ra./env1 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.659 | |
| | | Va.&Ra./s1 | Va.&Ra./env2 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Bestgen | 0.681 | |
| | | Va.&Ra./s2 | Va.&Ra./env2 | wF1(o,t) | ≈Va.&Ra. | OTE | Va.&Ra./d3 | Bestgen | 0.684 | |
| | | Va.&Ra./s1 | Va.&Ra./env3 | wF1(o,t) | Va.&Ra. | OTE | Va.&Ra./d1 | Cai.&But. | 0.657 | |
| | | Cai.&But. | Cai.&But. | wF1(o,t) | Cai.&But. | OTE | Va.&Ra./d4 | Cai.&But. | 0.401 | |
As can be seen from the Performed by column and the measured quantity values (wF1 scores) in the second-to-last column, for each system variant we have the original score reported by Vajjala and Rama and seven reproduction scores from four different papers (Huber and Çöltekin 2020; Arhiliuc, Mitrović, and Granitzer 2020; Bestgen 2020; Caines and Buttery 2020). The conditions of measurement described above are given in columns 3–9; for space reasons, condition values are given as paper and method IDs; for example, Test set = Va.&Ra./d2 means that the original data was used, but with a different test set split. Finally, CV* scores are given in the last column.
The results show that system variant pairs that differ only in whether they use language information have very similar CV* scores. For example, mult-dep− (without language information) and mult-dep+ (with language information) have CV* scores of 4.5 and 4.39, respectively. This tendency holds for all such pairs (including the ones not shown here), indicating that using language information makes next to no difference to reproducibility. The results also show that the syntactic information is obtained/used in a way that is particularly reproducible, whereas the word embeddings are obtained/used in a way that is particularly hard to reproduce. Overall, the random forest models using syntactic features have the best reproducibility; the logistic regressors using domain-specific features (the mult-dom systems, not included here) have the worst.
These insights can be used in different ways. The substantial differences in CV* seen here might prompt further code checking (and indeed there seem to have been issues with the mult-base code according to three of the reproduction papers). Knowing that the method for obtaining word embeddings is associated with particularly poor reproducibility might prompt an attempt to improve it. Automated CV* checking can be built into further development of the system to ensure good reproducibility of baseline variants.
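For instance, such automated checking could be as simple as a threshold test in a regression-test suite; the sketch below uses assumed names and an illustrative threshold (expected CV levels would need to be established per task and evaluation method):

```python
from statistics import mean, stdev

def check_reproducibility(scores, max_cv=5.0):
    """Fail if the coefficient of variation of repeated/reproduced scores
    exceeds an agreed threshold (the threshold value here is illustrative)."""
    observed_cv = 100.0 * stdev(scores) / mean(scores)
    assert observed_cv <= max_cv, (
        f"CV {observed_cv:.2f} exceeds threshold {max_cv}: "
        "investigate code, data splits, or environment differences.")

# The mult-dep− wF1 scores from Table 1: plain CV is about 4.2, so this passes.
check_reproducibility([0.703, 0.660, 0.650, 0.651, 0.699, 0.711, 0.651, 0.710])
```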
4 Discussion
Is QRA all we need?
QRA directly addresses the second of the questions (how similar are scores) mentioned at the start of Section 3 by computing degree of reproducibility as a function of scores. It also contributes to answering the first (how recreatable is the system) by providing pointers to which conditions of measurement (which aspect of system, training, compiling, evaluation, etc.) may be causing poor reproducibility. And it contributes to answering the third question (can the same conclusions be drawn) by providing a quantitative basis for deciding what is otherwise a subjective judgment. That is not to say that QRA does everything; in particular, conducting study-level comparisons (e.g., correlation between sets of scores, system rankings, and significant differences) provides additional insights (Fokkens et al. 2013).
What counts as a good level of reproducibility?
This differs substantially between disciplines and contexts. In bio-science assays, precision (CV) ranges from <10 for enzyme assays, to 20–50 for in vivo and cell-based assays, to >300 for virus titre assays (AAH/USFWS n.d.). For NLP, typical CV ranges would have to be established over time, but it seems clear that we would expect them to be much lower (better) for metric-based measurements than for human assessments. For wF1 measurements, the >15 CV* scores above seem very high.
Does it matter how similar scores are?
CV* provides a quantitative basis for deciding whether the same conclusions can be drawn, as discussed above, but looking at the degree of similarity of scores relative to the similarity of conditions of measurement can also reveal important information, as exemplified by the systematic patterns in CV* found in the example application in the last section. Knowing the expected degrees of reproducibility for a type of system or evaluation method helps pinpoint problems when those expectations are not met.
If score differences between reproductions of the same system are larger than score differences reported as a new state of the art for the same task elsewhere, then that is a problem, because it becomes unclear whether the higher scores are in fact due to methodological improvements. Moreover, if reproduction scores were, as would seem reasonable to assume, normally distributed around the original scores, then overall, half of all reproduction scores should be better than the original scores, and half worse. But that is not the case: a recent survey (Belz et al. 2021a) showed that 60% of all reproductions that differ from the original are in fact worse. In this way, examining differences between scores can facilitate higher-level insights, in this case, an apparent tendency to cherry-pick better results.
Differences in quality between studies
In the course of a reproduction study, it can happen that the reproducing authors discover what they consider errors in code or mismatches between the published description of the code and the code itself. In such cases, some judgment is involved: there is no point in reproducing an erroneous original study; if the code is different from the description, but not otherwise deemed problematic, a reproduction with corrected description might still make sense. Beyond actual errors, conditions of measurement capture the similarities and differences (some of which may be deemed a matter of quality) between different studies.
QRA in NLP workflows
In order to put metrological principles and quantified reproducibility assessment into practice, two things are needed at the field level: (i) recording conditions of measurement for all evaluations, using a standard template; and (ii) routinely reporting CV* and the conditions of measurement for reproduction studies. Ideally, a third element would be conducting repeatability assessments as part of developing new evaluation methods, and usefully even as part of new applications of existing methods.
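As a concrete (and purely illustrative) example of what such a standard template could look like, the sketch below records one metric measurement together with its conditions of measurement; the field names follow the conditions used in Table 1 and are not a proposed standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class MeasurementRecord:
    """Illustrative template for reporting a single metric measurement
    together with its conditions of measurement (field names follow the
    conditions used in Table 1; this is not a proposed standard)."""
    object_id: str            # system/variant evaluated
    measurand: str            # e.g., "wF1"
    code_by: str              # object condition
    compiled_trained_by: str  # object condition
    method: str               # measurement method condition
    implemented_by: str       # measurement method condition
    procedure: str            # measurement procedure condition
    test_set: str             # measurement procedure condition
    performed_by: str         # measurement procedure condition
    measured_value: float

record = MeasurementRecord(
    object_id="mult-base", measurand="wF1", code_by="Va.&Ra./s1",
    compiled_trained_by="Va.&Ra.", method="wF1(o,t)", implemented_by="Va.&Ra.",
    procedure="OTE", test_set="Va.&Ra./d1", performed_by="Va.&Ra.",
    measured_value=0.428)
print(asdict(record))
```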
5 Conclusion
The reproducibility debate in NLP/ML has been framed in terms of exactly what information we need to share so that others are guaranteed to obtain the same metric scores, and initially the expectation was that “[r]eproducibility would be quite easy to achieve in machine learning simply by sharing the full code used for experiments” (Sonnenburg et al. 2007, page 2450). What is becoming clear, however, is that no matter how much of our code, data, and ancillary information we share, residual amounts of variation remain that are stubbornly resistant to being eliminated. A recent survey (Belz et al. 2021a) found that just 14% of the 513 original/reproduction score pairs analyzed were exactly the same. Judging the remainder simply “not reproduced” is of limited usefulness, as some are much closer to being the same than others. On the other hand, judging whether the same conclusions can be drawn on the basis of possibly different scores without a quantitative basis is prone to subjectivity and low agreement.
QRA offers an alternative by quantifying the closeness of results in a way that is comparable across different studies and can be used to establish, over time, expected levels of closeness for different types of systems and evaluation methods. Such expected CV* levels and the individual CV* results for given reproductions moreover provide a quantitative basis for deciding whether the same conclusions can be drawn, as well as providing information about the recreatability of systems and evaluation methods.
Reproducibility is one of the cornerstones of scientific research: inability to reproduce results within accepted limits is, with few exceptions, seen as casting doubt on their validity. Building QRA into NLP workflows comes with an overhead, both in terms of recording information about systems and evaluations in a standardized form, and in terms of carrying out reproducibility assessments. It can often seem in NLP that the performance-improvement paradigm and the pressure to maximize publications do not allow for any additional work not directly contributing to increasing performance or publications. In medicine and the physical sciences more generally, it is inconceivable that reproducibility assessment could be viewed as an optional extra. As NLP/ML matures as a science, perhaps it is time that we, too, insist on our results being verifiably reliable, including in the important sense of being reproducible.
Notes
In physical measurements objects can change simply as a function of time.
Code and data are available here: https://github.com/asbelz/coeff-var.
Otherwise CV* reflects differences solely due to different minimum values.
For definition of ‘measurement method’, see VIM 2.5.
For definition of ‘measurement procedure’, see VIM 2.6.
Author notes
This article makes the general case for (i) adopting general scientific definitions of reproducibility and repeatability, and (ii) computing precision as the quantified measure of the degree of reproducibility of scores, as in other scientific disciplines. The QRA approach itself is described and discussed in more detail in Belz (2021), and tested on a variety of reproduction scenarios in Belz, Popovic, and Mille (2022). The code for computing CV* and several example score sets from the above papers can be found here: github.com/asbelz/coeff-var.