Are disruption index indicators convergently valid? The comparison of several indicator variants with assessments by peers

Recently, Wu, Wang, and Evans (2019) proposed a new family of indicators that measure whether a scientific publication is disruptive to a field or tradition of research. Disruptive influence is characterized by citations to a focal paper that do not also cite its cited references. In this study, we are interested in the question of convergent validity. We used external criteria of newness to examine convergent validity: In the postpublication peer review system of F1000Prime, experts assess whether papers fulfill certain criteria (e.g., report new findings). This study is based on 120,179 papers from F1000Prime published between 2000 and 2016. In the first part of the study we discuss the indicators. Based on the insights from the discussion, we propose alternative variants of disruption indicators. In the second part, we investigate the convergent validity of the original indicators and the (possibly) improved variants. Although the results of a factor analysis show that the different variants measure similar dimensions, the results of regression analyses reveal that one variant (DI_5) performs slightly better than the others.


INTRODUCTION
Citation analyses often focus on counting the number of citations to a focal paper (FP). To assess the academic impact of the FP, its citation count is compared with the citation count of a similar paper (SP) that has been published in the same research field and year. If the FP receives significantly more citations than the SP, its impact is noteworthy: The FP seems to be more useful or interesting for other researchers than the SP. However, the simple counting and comparing of citations does not reveal what the reasons for the impact of publications might be. As the overviews by Bornmann and Daniel (2008) and Tahamtan and Bornmann (2019) show, there are various reasons why publications are (highly) cited. Especially for research evaluation purposes, it is very interesting to know whether certain publications have impact because they report novel or revolutionary results. These are the results from which science and society profit most.
In this paper, we focus on a new type of indicator family measuring the impact of publications by examining not only the number of citations received but also the references cited in publications. Recently, Funk and Owen-Smith (2017) proposed a new family of indicators that measure the disruptive potential of patents. Wu, Wang, and Evans (2019) transferred the conceptual idea to publication data by measuring whether an FP disrupts a field or tradition of research (see also a similar proposal in Bu, Waltman, & Huang, 2019). Azoulay (2019) describes the so-called disruption index proposed by Wu et al. (2019) as follows: "When the papers that cite a given article also reference a substantial proportion of that article's references, then the article can be seen as consolidating its scientific domain. When the converse is true - that is, when future citations to the article do not also acknowledge the article's own intellectual forebears - the article can be seen as disrupting its domain" (p. 331).
We are interested in the question of whether disruption indicators are convergently valid with assessments by peers. The current study has two parts: In the first part, we discuss the indicators introduced by Wu et al. (2019) and Bu et al. (2019) and identify possible limitations. Based on the insights from the discussion, we propose alternative variants of disruption indicators. In the second part, we investigate the convergent validity of the indicators proposed by Wu et al. (2019) and Bu et al. (2019) and the (possibly) improved variants. We used an external criterion of newness, which is available at the paper level for a large paper set: tags (e.g., "new finding") assigned to papers by peers expressing newness.
Convergent validity asks "to what extent does a bibliometric exercise exhibit externally convergent and discriminant qualities? In other words, does the indicator satisfy the condition that it is positively associated with the construct that it is supposed to be measuring? The criteria for convergent validity would not be satisfied in a bibliometric experiment that found little or no correlation between, say, peer review grades and citation measures" (Rowlands, 2018). The analyses are intended to identify the indicator (variant) that is more strongly related to assessments by peers (concerning newness) than other indicators.

INDICATORS MEASURING DISRUPTION
The new family of indicators measuring disruption has been developed based on the previous introduction of another indicator family measuring novelty. Research on the novelty indicator family is based on the view of research as a "problem solving process involving various combinatorial aspects so that novelty comes from making unusual combinations of preexisting components" (Wang, Lee, & Walsh, 2018, p. 1074). Uzzi, Mukherjee, et al. (2013) analyzed cited references and investigated whether the journal pairs referenced in papers are atypical. Papers with many atypical journal pairs were denoted as papers with high novelty potential. The authors argue that highly cited papers are not only highly novel but also very conventionally oriented. In a related study, Boyack and Klavans (2014) reported strong disciplinary and journal effects in inferring novelty.
In recent years, Lee, Walsh, and Wang (2015) proposed an adapted version of the novelty measure introduced by Uzzi et al. (2013). Wang, Veugelers, and Stephan (2017) introduced a novelty measure focusing on publications with great potential of being novel by identifying new journal pairs (instead of atypical pairs). A different approach is used by Boudreau, Guinan, et al. (2016) and Carayol, Lahatte, and Llopis (2017), who used unusual combinations of keywords to measure novelty. Other studies in the area of measuring novelty have been published by Foster, Rzhetsky, and Evans (2015), Mairesse and Pezzoni (2018), Bradley, Devarakonda, et al. (2020), and Wagner, Whetsell, and Mukherjee (2019), each with a different focus. According to the conclusion by Wang et al. (2018), "prior work suggests that coding for rare combinations of prior knowledge in the publication produces a useful a priori measure of the novelty of a scientific publication" (p. 1074).
Novelty indicators have been developed against the backdrop of the desire to identify and measure creativity. How is creativity defined? According to Hemlin, Allwood, et al. (2013), "creativity is held to involve the production of high-quality, original, and elegant solutions to complex, novel, ill-defined, or poorly structured problems" (p. 10). Puccio, Mance, and Zacko-Smith (2013) claim that "many of today's creativity scholars now define creativity as the ability to produce original ideas that serve some value or solve some problem" (p. 291). The connection between the indicators measuring novelty and the creativity of research is made by the stream of research viewing creativity "as an evolutionary search process across a combinatorial space and sees creativity as the novel recombination of elements" (Lee et al., 2015, p. 685). For Estes and Ward (2002), "creative ideas are often the result of attempting to determine how two otherwise separate concepts may be understood together" (p. 149), whereby the concepts may refer to different research traditions or disciplines. Similar statements on the roots of creativity can be found in the literature from other authors, as the overview by Wagner et al. (2019) shows. Bibliometric novelty indicators try to capture the combinatorial dynamic of papers (and thus the creative potential of papers) by investigating lists of cited references or keywords for new or unexpected combinations (Wagner et al., 2019).
A recent study investigated two novelty indicators and tested whether they exhibit convergent validity. Using a design similar to that of the present study, it found that only one of the indicators is convergently valid.
In this context of measuring creativity, not only the development of indicators measuring novelty but also the introduction of indicators identifying disruptive research have occurred. These indicators target exceptional research that turns knowledge formation in a field around. The family of disruption indicators proposed especially by Wu et al. (2019) and Bu et al. (2019) seizes on the concept of Kuhn (1962), who differentiated between phases of normal science and scientific revolutions. Normal science is characterized by paradigmatic thinking, which is rooted in traditions and consensus orientation; scientific revolutions follow divergent thinking and openness (Foster et al., 2015). Whereas normal science means linear accumulation of research results in a field (Petrovich, 2018), scientific revolutions are dramatic changes with an overthrow of established thinking (Casadevall & Fang, 2016). Preconditions for scientific revolutions are creative knowledge claims that disrupt linear accumulation processes in field-specific research (Kuukkanen, 2007). Bu et al. (2019) see the development of disruption indicators in the context of a multidimensional perspective on citation impact. In contrast to simple citation counting under the umbrella of a one-dimensional perspective, the multidimensional perspective considers breadth and depth through the cited references of an FP and the cited references of its citing papers (see also Marx & Bornmann, 2016). In contrast to the family of novelty indicators, which are based exclusively on cited references, disruption indicators combine the cited references of citing papers with the cited references data of FPs. The disruptiveness of an FP is measured based on the extent to which the cited references of the papers citing the FP also refer to the cited references of the FP. According to this idea, many citing papers not referring to the FP's cited references indicate disruptiveness.
In this case, the FP is the basis for new work that does not depend on the context of the FP (i.e., the FP gives rise to new research).
Disruptiveness was first described by Funk and Owen-Smith (2017) and Wu et al. (2019) and presented as a weighted index DI_1 (see Figure 1), calculated for an FP by dividing the difference between the number of publications that cite the FP without citing any of its cited references (N_i) and the number of publications that cite both the FP and at least one of its cited references (N_j^1) by the sum of N_i, N_j^1, and N_k (the number of publications that cite at least one of the FP's cited references without citing the FP itself). Simply put, this is the ratio

DI_1 = (N_i − N_j^1) / (N_i + N_j^1 + N_k).

DI_1 corresponds to a certain notion of disruptiveness, according to which only a few papers are disruptive (while most papers are not), and a paper needs to have a large citation impact to score high on DI_1. DI_1 detects only a few papers as disruptive due to the term N_k, which is often very large compared to the other terms in the formula (Wu & Yan, 2019). A large N_k produces disruption values of small magnitude, as N_k occurs only in the denominator of the formula. As a result, the disruption index is very similar for many papers, and only a few papers get high disruption values. However, Funk and Owen-Smith (2017), who originally defined the formula for the disruption index, designed the indicator to measure disruptiveness on a continuous scale from −1 to 1: "the measure is continuous, able to capture degrees of consolidation and destabilization" (p. 793). This raises the question of whether different nuances of disruption can be adequately captured by DI_1, or if the term N_k is too dominant for this purpose.
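As a minimal illustration, DI_1 can be computed from two sets of citing papers. The set-based representation and the function name are our own, not code from the cited works:

```python
def disruption_index(citers_of_fp, citers_of_refs):
    """Sketch of DI_1 = (N_i - N_j^1) / (N_i + N_j^1 + N_k).

    citers_of_fp: set of ids of papers citing the focal paper (FP).
    citers_of_refs: set of ids of papers citing at least one of the
        FP's cited references.
    """
    n_i = len(citers_of_fp - citers_of_refs)   # cite FP, none of its references
    n_j = len(citers_of_fp & citers_of_refs)   # cite FP and >= 1 of its references
    n_k = len(citers_of_refs - citers_of_fp)   # cite references but not the FP
    denom = n_i + n_j + n_k
    return (n_i - n_j) / denom if denom else 0.0
```

A large set of papers citing only the FP's references (n_k) pushes the value toward zero, which is exactly the dominance issue discussed above.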
Including N_k in the formula can also be questioned with regard to its function for assessing the disruptiveness of a paper. The basic idea shared by all disruption indicators is to distinguish between citing papers of an FP that indicate disruptiveness and citing papers that indicate consolidation. N_k does not refer to this distinction. Instead, it captures the citation impact of an FP compared to other papers in a similar context (all papers citing the same papers as the FP). Assuming a notion of disruptiveness that aims at detecting papers with a large-scale disruptive effect, this idea seems reasonable. However, this form of considering citation impact can be problematic, as it strongly depends on the citation behavior of the FP, and small changes in the FP's cited references can have a large effect on the disruption score. A more general issue regarding the function of N_k is whether citation impact should be considered at all when measuring disruptiveness, which depends on the underlying notion of disruptiveness. Bu et al. (2019) suggest separating disruptiveness (depth of citation impact in their terminology) and citation impact in terms of the number of forward citations. This perspective assumes that disruptiveness is a quality of papers that can also be observed at a small impact level. From this perspective, N_k should not be included in the formula for measuring disruptiveness, especially as it is the dominant factor in most cases. Consequently, an alternative indicator would simply drop the term N_k, which corresponds to DI_1^nok according to the formula in Figure 1. This variant of the disruption indicator has been proposed by Wu and Yan (2019). With N_i / (N_i + N_j^1), a very similar approach for calculating a paper's disruption has been proposed by Bu et al. (2019). This indicator can be defined as a function of DI_1^nok, such that differences between papers just change by the factor 0.5, so that both variants allow identical conclusions.
In our analyses, we will consider DI_1^nok because it has the same range of output values as the original disruption index DI_1.
In contrast to the aforementioned variants of indicators measuring disruptiveness, Bu et al. (2019) also proposed an indicator that considers how many cited references of the FP are cited by an FP's citing paper. This approach takes into account how strongly the FP's citing papers rely on the cited references of the FP, instead of just considering if this relationship exists (in the form of at least one citation of a cited reference of the FP). The corresponding indicator proposed by Bu et al. (2019) (denoted as DeIn in Figure 1) is defined as the average number of cited references of the FP that its citing papers cite. In contrast to the other indicators mentioned earlier, DeIn is supposed to decrease with the disruptiveness of a paper, because it measures the dependency of the paper on earlier work (as opposed to disruptiveness). Another difference to the other indicators is that the range of DeIn has no upper bound, because the average number of citation links from a paper citing the FP to the FP's cited references is not limited. This makes it more difficult to compare the results of DeIn and the other indicators.
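The dependency indicator DeIn can be sketched as follows, assuming each paper is represented by the set of its cited-reference identifiers (the representation and naming are our own):

```python
def dependence_indicator(fp_refs, citing_papers_refs):
    """Sketch of DeIn (Bu et al., 2019): the average number of the FP's
    cited references that the FP's citing papers also cite.

    fp_refs: set of the FP's cited-reference ids.
    citing_papers_refs: list of reference sets, one per paper citing the FP.
    """
    if not citing_papers_refs:
        return 0.0
    overlaps = [len(fp_refs & refs) for refs in citing_papers_refs]
    return sum(overlaps) / len(citing_papers_refs)
```

Because the average overlap is unbounded above, DeIn values are not directly comparable to the [−1, 1] range of the other variants, as noted in the text.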
By considering only those citing papers of the FP that cite at least l (l > 1) of the FP's cited references, it becomes possible to follow the idea of taking into account how strongly the FP's citing papers rely on the cited references of the FP, while also obtaining values that are more comparable to the other indicators. The probability that a citing paper of the FP cites a highly cited reference of the FP is higher than it is for a less frequently cited reference of the FP. Therefore, the fact that a paper cites both the FP and at least one of its cited references is not equally indicative of a developmental FP in all cases. Considering only those of the FP's citing papers that also cite at least a certain number of the FP's cited references mitigates this problem, because the focus is on the most reliable cases of citing papers indicating a developmental FP. This is formalized in the formulae in Figure 1, where the subscripts of DI_l and DI_l^nok correspond to the threshold for the number of cited references of the FP that a citing paper must cite to be considered. With a threshold of l = 1 (i.e., without any restriction on the number of the FP's cited references that the FP's citing papers must cite), the indicator is identical to the indicator originally proposed by Wu et al. (2019). To analyze how well these different strategies measure the disruptiveness of a paper, we compare the following indicators in our analyses: DI_1, DI_5, DI_1^nok, DI_5^nok, and DeIn. The subscript in four variants indicates the minimum number of the FP's cited references that a citing paper must cite along with the FP. The superscript "nok" in two variants indicates that N_k is excluded from the calculation.
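The thresholded variants can be sketched in the same spirit. Note that our reading of the formulae in Figure 1 is an assumption: citing papers that cite between 1 and l − 1 of the FP's references count toward neither N_i nor N_j^l, and setting N_k to zero yields the "nok" variants:

```python
def disruption_index_variant(citing_refs, fp_refs, n_k=0, l=1):
    """Sketch of DI_l and DI_l^nok.

    citing_refs: list of reference sets, one per paper citing the FP.
    fp_refs: set of the FP's cited-reference ids.
    n_k: number of papers citing >= 1 FP reference but not the FP
         (pass 0 for the "nok" variants).
    l: minimum number of the FP's references a citing paper must cite
       to count toward N_j^l.
    """
    n_i = sum(1 for refs in citing_refs if not (refs & fp_refs))
    n_j = sum(1 for refs in citing_refs if len(refs & fp_refs) >= l)
    denom = n_i + n_j + n_k
    return (n_i - n_j) / denom if denom else 0.0
```

With l = 1 and n_k taken from the citation network, this reduces to the original DI_1.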

F1000Prime
F1000Prime is a database including important papers from biological and medical research (see https://f1000.com/prime/home). The database is based on a postpublication peer review system: Peer-nominated faculty members (FMs) select the best papers in their specialties and assess these papers for inclusion in the F1000Prime database. FMs write brief reviews explaining the importance of papers and rate them as "good" (1 star), "very good" (2 stars) or "exceptional" (3 stars). Many papers in the database are assessed by more than one FM. To rank the papers in the F1000Prime database, the individual scores are summed up to a total score for each paper.
FMs also assign the following tags to the papers, if appropriate:
- Confirmation: article validates published data or hypotheses
- Controversial: article challenges established dogma
- Good for teaching: key article in a field and/or is well written
- Hypothesis: article presents an interesting hypothesis
- Negative/null results: article has null or negative findings
- New finding: article presents original data, models, or hypotheses
- Novel drug target: article suggests new targets for drug discovery
- Refutation: article disproves published data or hypotheses
- Technical advance: article introduces a new practical/theoretical technique, or novel use of an existing technique
Four of these tags (hypothesis, new finding, novel drug target, and technical advance) reflect aspects of novelty in research. As disruptive research should include elements of novelty, we expect these tags to be positively related to the disruption indicator scores. For instance, we assume that a paper receiving many "new finding" tags from FMs will have a higher disruption index score than a paper receiving only a few such tags (or none at all). The remaining tags are not related to newness (e.g., confirmation of published hypotheses), so for these tags we expect zero or negative correlations with disruption index scores. Among the tags likely to be inversely correlated with disruptiveness, the most plausible is the "confirmation" tag. For the "controversial" tag, it is not clear whether it reflects novelty or not. FMs further assign the tags "clinical trial," "systematic review/meta-analysis," and "review/commentary"; these are not relevant for this study and thus not used.
We interpret the empirical results in Section 4.3 against the backdrop of the above assumptions. In these interpretations, however, it should be considered that the allocation of tags by the FMs involves subjective decisions associated with (more or less) different meanings. In other words, the tag data are affected by noise (uncertainties) covering the signal (clear-cut judgments). Another point that should be considered in the interpretation of the empirical results is that the above assumptions could be formulated differently. For example, we anticipate that "good for teaching" tags would be inversely correlated with disruptiveness. The opposite could be true as well: Papers that introduce new topics, perspectives, and ways of thinking (papers that shift the conversation) might be most useful for teaching. Many factors play a role in the interpretation of the "good for teaching" tag: How complex is the paper assessed by the FMs? Is it a landmark paper published decades ago or a recent research front paper? Is the paper intended for the teaching of bachelor's, master's, or doctoral students?
Many other studies have used data from the F1000Prime database and correlated them with metrics. Most of these studies are interested in the relationship between quantitative (metrics-based) and qualitative (human-based) assessments of research. The analysis of Anon (2005) shows that "papers from high-profile journals tended to be rated more highly by the faculty; there was a tight correlation (R^2 = .93) between average score and the 2003 impact factor of the journal" (see also Jennings, 2006). One study correlated several bibliometric indicators with F1000Prime recommendations and found that the "percentile in subject area achieves the highest correlation with F1000 ratings" (p. 286). Waltman and Costas (2014) report "a clear correlation between F1000 recommendations and citations. However, the correlation is relatively weak" (p. 433). Similar results were published by Mohammadi and Thelwall (2013). Bornmann (2015) investigated the convergent validity of F1000Prime assessments. He found that "the proportion of highly cited papers among those selected by the FMs is significantly higher than expected. In addition, better recommendation scores are also associated with higher performing papers" (p. 2415). The most recent study, by Du, Tang, and Wu (2016), shows that "(a) nonprimary research or evidence-based research are more highly cited but not highly recommended, while (b) translational research or transformative research are more highly recommended but have fewer citations" (p. 3008).

Data Set Used and Variables
The study is based on a data set from F1000Prime including 207,542 assessments of papers. These assessments refer to 157,020 papers (excluding papers with duplicate DOIs, missing DOIs, missing expert assessments, etc.). The bibliometric data for these papers are from an in-house database (Korobskiy, Davey, et al., 2019), which utilizes Scopus data (Elsevier Inc.). To increase the validity of the indicators included in this study, we considered only papers with at least 10 cited references and at least 10 citations. Furthermore, we included only papers from 2000 to 2016 to have reliable data (some publications are from 1970 or earlier) and a citation window for the papers of at least 3 years (since publication until the end of 2018). The reduced paper set consists of 120,179 papers published between 2000 and 2016 (see Table 1).
We included several variables in the empirical part of this study: the disruption index proposed by Wu et al. (2019) (DI_1) and the dependence indicator proposed by Bu et al. (2019) (DeIn). The alternative disruption indicators described in Section 2 were also considered: DI_5, DI_1^nok, and DI_5^nok. For the comparison with the indicators reflecting disruption, we included the sum (ReSc.sum) and the average (ReSc.avg) of reviewer scores (i.e., scores from FMs). Besides the qualitative assessments of research, quantitative citation impact scores are also considered: the number of citations until the end of 2018 (Citations) and percentile impact scores (Percentiles).
As publication and citation cultures differ between fields, it is standard in bibliometrics to field- and time-normalize citation counts (Hicks, Wouters, et al., 2015). Percentiles are field- and time-normalized citation impact scores (Bornmann, Leydesdorff, & Mutz, 2013) that lie between 0 and 100 (higher scores reflect more citation impact). For the calculation of percentiles, the papers published in a certain subject category and publication year are ranked by their citation counts. Then the formula (i − 0.5)/n × 100 (Hazen, 1914) is used to calculate percentiles (i is the rank of a paper and n the number of papers in the subject category; ranks are assigned such that higher percentiles reflect higher citation impact). Impact percentiles of papers published in different fields can be directly compared (despite possibly differing publication and citation cultures). Table 2 shows the key figures for citation impact scores, reviewer scores, and the variants measuring disruption. As the percentiles reveal, the paper set consists largely of papers with considerable citation impact. Table 3 lists the papers that received the maximum scores in Table 2. The maximum DI_1 value of 0.677 was reached by the paper entitled "Cancer statistics, 2010" published by Jemal, Siegel, et al. (2010). This publication is part of an annual series on incidence, mortality, and survival rates for cancer, and its high score may be an artifact of the DI_1 formula, because the report is likely cited much more often than its cited references. In fact, this publication may make the case for the DI_5 formulation. Similarly, the 2013 edition of the cancer statistics report was found to have the maximum percentile. The maximum number of citations was seen for the well-known review article by Hanahan and Weinberg (2011), "Hallmarks of cancer: The next generation."
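A minimal sketch of the Hazen percentile computation for one subject category and year. Two details are our assumptions: ranks are assigned in ascending order of citations (so that higher percentiles mean higher impact), and tied papers receive their average rank:

```python
def hazen_percentiles(citations):
    """Hazen (1914) percentiles: (i - 0.5) / n * 100 for rank i of n papers.

    Ranks ascend with citation counts; tied values share their average rank.
    """
    n = len(citations)
    order = sorted(range(n), key=lambda idx: citations[idx])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # find the end of the current tie group
        j = i
        while j + 1 < n and citations[order[j + 1]] == citations[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    return [(r - 0.5) / n * 100 for r in ranks]
```

For four papers with 0, 10, 5, and 10 citations, the two tied papers share rank 3.5 and thus the same percentile.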

Statistics Applied
The statistical analyses in this study have three steps:

1. We investigated the correlations between citation impact scores, reviewer scores, and the scores of the indicators measuring disruption. None of the variables is normally distributed, and all are affected by outliers. To tackle this problem, we logarithmized the scores using the formula log(x + 1). This logarithmic transformation brings the distributions closer to normal distributions. As perfectly normally distributed variables cannot be achieved with the transformation, Spearman rank correlations were calculated (instead of Pearson correlations). We interpret the correlation coefficients against the backdrop of the guidelines proposed by Cohen (1988) and Kraemer, Morgan, et al. (2003): small effect = 0.1, medium effect = 0.3, large effect = 0.5, and very large effect = 0.7.

2. We performed an exploratory factor analysis (FA) to analyze the variables. FA is a statistical method for data reduction (Gaskin & Happell, 2014); it is an exploratory technique to identify latent dimensions in the data and to investigate how the variables are related to these dimensions (Baldwin, 2019). We expected three dimensions, because we have variables with citation impact scores, reviewer scores, and disruption indicator scores. As the (logarithmized) variables do not perfectly follow the normal distribution, we performed the FA using the robust covariance matrix following Verardi and McCathie (2012). Thus, the results of the FA are based not on the variables themselves but on a covariance matrix. The robust covariance matrix was transformed into a correlation matrix (StataCorp, 2017), which was analyzed by the principal component factor method (the communalities are assumed to be 1). We interpreted the factor loadings for the orthogonal varimax rotation; the factor loadings were adjusted "by dividing each of them by the communality of the corresponding variable. This adjustment is known as the Kaiser normalization" (Afifi, May, & Clark, 2012, p. 392). In the interpretation of the results, we focused on factor loadings with values greater than 0.5.

3. We investigated the relationship between the dimensions (identified in the FA) and the F1000Prime tags (as proxies for newness). We expected a close relationship between the dimension reflecting disruption and the tags reflecting newness. The tags are count variables comprising the sum of the tag assignments from F1000Prime FMs for single papers. To calculate the relationship between dimensions and tags, we performed robust Poisson regressions (Hilbe, 2014; Long & Freese, 2014). The Poisson model is recommended when the dependent variable consists of count data. Robust methods are recommended when the distributional assumptions of the model are not completely met (Hilbe, 2014). Because we are interested in identifying indicators for measuring disruption that might perform better than the other variants, we tested the correlation between each variant and the tag assignments using several robust Poisson regressions. Citations, disruptiveness, and tag assignments are dependent on time. Thus, we included the number of years between 2018 and the publication year as exposure time in the models (Long & Freese, 2014, pp. 504-506).
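Step 1 can be sketched with stdlib-only Python (function names are our own). Note that Spearman's rho depends only on ranks, so it is unchanged by the monotone log(x + 1) transform:

```python
import math

def log1p_scores(xs):
    """log(x + 1) transform used to reduce the skew of the raw scores."""
    return [math.log(x + 1) for x in xs]

def spearman_rho(xs, ys):
    """Spearman rank correlation; tied values receive their average rank."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for t in range(i, j + 1):
                r[order[t]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In practice one would use a statistics package; this sketch only makes the transformation and rank-correlation step concrete.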

Correlations Between Citation Impact Scores, Reviewer Scores, and Variants Measuring Disruption

Figure 2 shows the matrix of coefficients for the correlations between reviewer scores, citation impact indicators, and the variants measuring disruption. DI_1 correlates at a medium level with the other indicators measuring disruption, whereas these other indicators correlate among themselves at a very high level. Very high positive correlations are visible between citations and percentiles and between the average and sum of reviewer scores.
The correlation between DI_1 and citation impact (citations and percentiles) is at least at the medium level, but it is negative (r = −.46, r = −.37). Thus, the original DI_1 seems to measure a different dimension than citation impact. This result is in agreement with previously reported results (see Figure 2a). However, the situation changes with the other indicators measuring disruption: Their correlations with citation impact are small and positive (negative in the case of DeIn).

[Rows of Table 3 (papers with maximum scores) recovered from the text: Kourtis, Nikoletopoulou, and Tavernarakis (2012), "Small heat-shock proteins protect from heat-stroke-associated neurodegeneration" (indicator label lost); ReSc.sum: Lolle, Victor, et al. (2005), "Genome-wide non-Mendelian inheritance of extra-genomic information in Arabidopsis"; ReSc.avg: McEniery, Yasmin, et al. (2005), "Normal vascular aging: Differential effects on wave reflection and aortic pulse wave velocity: The Anglo-Cardiff Collaborative Trial (ACCT)"; Citations: Hanahan and Weinberg (2011), "Hallmarks of cancer: The next generation"; Percentiles: Siegel, Naishadham, and Jemal (2013), "Cancer statistics, 2013".]

Factor Analysis to Identify Latent Dimensions
We calculated an FA including reviewer scores, citation impact indicators, and variants measuring disruption to investigate the underlying dimensions (latent variables). Most of the results shown in Table 4 agree with expectations: We found three dimensions, which we labeled disruption (factor 1), citations (factor 2), and reviewers (factor 3). However, contrary to what was expected, DI_1 loads negatively on the citation dimension, revealing that (a) high DI_1 scores are related to low citation impact scores (see above) and (b) all other indicators measuring disruption are independent of DI_1. Thus, the other indicators (or at least one of them) seem to be promising developments compared to the originally proposed indicator DI_1.

Relationship Between Tag Mentions and FA Dimensions
Using Poisson regression models including the tags, we calculated correlations between the tags and the three FA dimensions (disruption, citations, and reviewers). We are especially interested in the correlation between the tags (measuring newness of research or not) and the disruption dimension from the FA. We also included the citation impact and reviewer dimensions in the analyses to see the corresponding results for comparison. In the analyses, we considered the FA scores for the three dimensions (predicted values from the FA, which are uncorrelated by definition) as independent variables and the various tags (the sum of the tag assignments) as dependent variables in nine Poisson regressions (one regression model for each tag).
The results of the regression analyses are shown in Table 5. We do not focus on the statistical significance of the results, because it is more or less meaningless against the backdrop of the high case numbers. The most important information in the table is the signs of the coefficients and the percentage change coefficients. The percentage change coefficients are counterparts to the odds ratios in logistic regression models; they measure the percentage change in the dependent variable if the independent variable (FA score) increases by one standard deviation (Deschacht & Engels, 2014; Hilbe, 2014; Long & Freese, 2014). The percentage change coefficient for the model based on the "technical advance" tag and the disruption dimension can be interpreted as follows: For a standard deviation increase in the scores for disruption, a paper's expected number of technical advance tags increases by 10.93%, holding the other variables in the regression analysis constant. This increase is as expected and substantial. However, the coefficients for the other tags expressing newness have negative signs, which is against expectations.
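The percentage change coefficients follow the standard interpretation of Poisson coefficients: a change of s units in a predictor with coefficient beta multiplies the expected count by exp(beta * s). A minimal helper (our own naming) makes this concrete:

```python
import math

def percent_change(beta, sd=1.0):
    """Percentage change in the expected count of a Poisson model for an
    sd-unit increase in a predictor with coefficient beta:
    100 * (exp(beta * sd) - 1)."""
    return 100.0 * (math.exp(beta * sd) - 1.0)
```

For example, a reported change of 10.93% corresponds to beta * sd = ln(1.1093), approximately 0.104.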
The percentage change coefficients for the citation dimension are considerably higher than those for the disruption dimension (especially for the new finding tag) and positive. This result is against our expectations, because the disruption variants should measure newness better than citations do. However, one should bear in mind when interpreting the results that DI1 correlates negatively with the citation indicators. Thus, the citation dimension also measures disruptiveness (as originally proposed) to some degree. If we interpret the results for this dimension against this backdrop, at least the results for the tags not representing newness seem to accord with our expectations. The results for the reviewer dimension are similar to those for the citations dimension. The consistently positive coefficients for the citations and reviewers dimensions in Table 5 might result from the fact that the tags come from the same FMs as the recommendations, and the FMs probably use citations to find relevant papers for reading, assessing, and including in the F1000Prime database.

Table 6 reports the results of some additional regression analyses. Because we are interested not only in correlations between the dimensions (reflecting disruptiveness) and the tags (the sum of the tag assignments) but also in correlations between the various variants measuring disruption and the tags, we calculated 45 additional regression models. We are interested in which variant measuring disruption reflects newness best: Are the different variants differently or similarly related to newness, as expressed by the tags? Because of the great number of models, Table 6 only shows the percentage change coefficients (see above) from the regression models. In other words, the table lists percentage changes in the expected counts (of the tags) for a standard deviation increase in the variant measuring disruption. For example, a standard deviation increase in DI1 on average increases a paper's expected number of technical advance tags by 6.72%. This result agrees with expectations, because the technical advance tag reflects newness. It seems that DI5 best reflects the assessments by FMs; the lowest number of results in agreement with expectations is visible for DeIn.

DISCUSSION
For many years, scientometrics research has focused on improving ways of field-normalizing citation counts or on developing improved variants of the h-index. However, this research is rooted in a relatively one-dimensional way of measuring impact. With the introduction of the new family of disruption indicators, the one-dimensional method of impact measurement may now give way to multidimensional approaches. Disruption indicators consider not only times-cited information but also cited references data (of FPs and citing papers). High indicator values should point to published research that disrupts traditional lines of research. Disruptive papers catch the attention of citing authors (at the expense of the attention devoted to previous research); disruptive research enters unknown territory that is scarcely consistent with the known territory covered in previous papers (and cited by disruptive papers). Thus, citing authors focus exclusively on the disruptive papers (by citing them) and do not reference the previous papers cited in the disruptive papers.
Starting from the basic approach of comparing the cited references of citing papers with the cited references of FPs, different variants of measuring disruptiveness have recently been proposed. An overview of many possible variants can be found in Wu and Yan (2019). In this study, we included some variants that sounded reasonable and/or followed different approaches. For example, DeIn, proposed by Bu et al. (2019), is based on the number of citation links from an FP's citing papers to the FP's cited references (without considering N_k). We were interested in the convergent validity of these new indicators, following the basic analysis approach by . Convergent validity can be tested by using an external criterion measuring a similar dimension. Although we did not have an external criterion at hand that measures disruptiveness specifically, we used tags from the F1000Prime database reflecting newness. FMs assess papers using certain tags, and some tags reflect newness. We assumed that disruptive research is assessed as new. Based on the F1000Prime data, we investigated whether the tags assigned by the FMs to the papers correspond with indicator values measuring disruptiveness.
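To make the comparison of variants concrete, here is a minimal Python sketch of DI1 and two of the variants discussed. The set-based data layout, the function name, and the exact thresholding rule for DI5 are our illustrative assumptions, not the authors' implementation:

```python
def disruption_variants(citing_refs, fp_refs, n_k, l=5):
    """citing_refs: for each paper citing the focal paper (FP), the set of
    its cited references; fp_refs: the FP's cited references; n_k: number
    of papers citing at least one of fp_refs but not the FP itself."""
    # links[c] = how many of the FP's references citing paper c also cites
    links = [len(refs & fp_refs) for refs in citing_refs]

    n_i = sum(1 for k in links if k == 0)  # cite the FP only
    n_j = sum(1 for k in links if k > 0)   # cite the FP and its references

    # Original disruption index (Wu et al., 2019).
    di1 = (n_i - n_j) / (n_i + n_j + n_k)

    # DI_l (e.g., DI5): a citing paper counts as j-type only if it cites
    # at least l of the FP's references; otherwise it counts as i-type.
    n_j_l = sum(1 for k in links if k >= l)
    di_l = ((n_i + n_j - n_j_l) - n_j_l) / (n_i + n_j + n_k)

    # DI1^nok: as DI1, but without N_k in the denominator.
    # DeIn (Bu et al., 2019) would instead aggregate the citation links
    # themselves (e.g., sum(links)), ignoring N_k; we omit it here because
    # its exact definition is not spelled out in this section.
    di1_nok = (n_i - n_j) / (n_i + n_j)

    return di1, di_l, di1_nok


# Toy example (hypothetical papers): the FP has six references; three
# papers cite the FP, citing 0, 1, and 5 of those references; two further
# papers cite the FP's references but not the FP itself (N_k = 2).
fp_refs = {"r1", "r2", "r3", "r4", "r5", "r6"}
citing_refs = [set(), {"r1"}, {"r1", "r2", "r3", "r4", "r5"}]
di1, di5, di1_nok = disruption_variants(citing_refs, fp_refs, n_k=2)
```

In this toy example the thresholding matters: under DI1 two of the three citing papers count as j-type, whereas under DI5 only the paper citing five of the FP's references does, so the two variants can even disagree in sign.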
In the first step of the statistical analyses, we calculated an FA to inspect whether the various indicators measuring disruptiveness load on a single "disruptiveness" dimension. As the results reveal, this was partly the case: All variants of DI1 (the original disruption index proposed by Wu et al., 2019) loaded on one dimension, the "disruptiveness" dimension. However, the original disruption index itself loaded (negatively) on a dimension that reflects citation impact. These results might be interpreted as follows: The proposed disruption index variants measure the same construct, which might be interpreted as "disruptiveness." DI1 is related to citation impact, whereby negative values (the developmental manifestation of this indicator; see Section 2) correspond to high citation impact levels. As all variants of DI1 loaded on the same factor in the FA, the results do not show which variant should be preferred (if any). Thus, we included a second step of analyses in this study.
In this step, we tested the correlation between each variant (including the original) and the external "newness" criterion. The results showed that DI5 best reflects the FMs' assessments (corresponding with our expectations more frequently than the other indicators); the lowest number of results demonstrating an agreement between tag and indicator scores is visible for DeIn. The differences between the variants are not very large; however, the results can be used to guide the choice of a variant if disruptiveness is measured in a scientometric study. Although the authors of the paper introducing DI1 (Wu et al., 2019) performed analyses to validate the index (e.g., by calculating the indicator for Nobel-Prize-winning papers), they did not report on evaluating possible variants of the original, which might perform better.
We noted that while a single publication was the most highly disruptive for DI1 (0.6774) and for DI1^nok (0.9747), 703 and 3,816 publications, respectively, scored the maximum disruptiveness value of 1.0 for the variants DI5 and DI5^nok. We also reviewed examples of the most highly disruptive publications as measured by all four variants and observed that instances of an annual Cancer Statistics report published by the American Cancer Society received maximal disruptiveness scores for all four variants, presumably because this report is highly cited in each year of its publication without its references being cited. A publication in Global Environmental Change (https://doi.org/10.1016/j.gloenvcha.2008.10.009) was also noteworthy and may reflect the focus on climate change.
All empirical research has limitations, and this study is no exception. We assumed in this study that novelty is necessarily a (or the) defining feature of disruptiveness. There are plenty of existing bibliometric measures of novelty (e.g., new keyword or cited reference combinations; see .) Although novelty may be necessary for disruptiveness, it is not necessarily sufficient to make something disruptive. We cannot completely exclude the possibility that many nondisruptive discoveries are novel. "Normal science" (Kuhn, 1962) discoveries do not necessarily lack novelty; they make novel contributions (e.g., hypotheses, findings, or technical advances) in a way that builds on and enhances prior work (e.g., within the paradigm). A second limitation of the study might be that the F1000Prime data are affected by possible biases. We know from the extensive literature on (journal and grant) peer review processes that indications of possible biases in the assessments by peers exist (Bornmann, 2011). The third limitation refers to the F1000Prime tags that we used in this study. None of them directly captures disruptiveness (as we acknowledge above). Most of the tags capture novelty, which is related to, but conceptually different from, disruptiveness, and for which metrics already exist (see Section 2). Because disruption indicators propose measuring disruption (and not novelty), we cannot directly make claims about whether disruption indicators measure what they propose to measure.
It would be interesting for future studies to use mixed-methods approaches to evaluate the properties of the N_i, N_j, and N_k variants more systematically against additional gold-standard data sets. The F1000 data set certainly features its own biases (e.g., it is restricted to biomedicine and includes disproportionately many high-impact papers), and the variants we describe may exhibit different properties when evaluated against multiple data sets.