Informed peer review for publication assessments: Are improved impact measures worth the hassle?

In this work we ask whether, and to what extent, applying a predictor of publications' impact that is better than early citations affects the assessment of the research performance of individual scientists. Specifically, we measure the total impact of Italian professors in the sciences and economics in a period of time, valuing their publications first by early citations and then by a weighted combination of early citations and impact factor of the hosting journal. As expected, scores and ranks by the two indicators show a very strong correlation, but significant shifts also occur in many fields, mainly in Economics and statistics, and Mathematics and computer science. The higher the share of uncited professors in a field and the shorter the citation time window, the more advisable the recourse to the above combination.


Introduction
Evaluative scientometrics is mainly aimed at measuring and comparing research performance of entities. In general, a research entity is said to perform better than another if, all production factors being equal, its total output has higher impact. The question then is how to measure the impact of output. Citation-based indicators are more apt to assess scholarly impact than social impact, although it is reasonable to expect that a certain correlation between scholarly and social impact exists (Abramo, 2018).
As far as scholarly impact is concerned, three approaches are available to assess the impact of publications: human judgment (peer review); citation-based indicators (bibliometrics); or a combination of both, whereby bibliometrics informs peer-review judgment (informed peer review).
Although scientometricians say, for short, that they "measure" scholarly impact, what they actually do is "predict" impact. The reason is that, to serve its purpose, any research assessment aimed at informing policy and management decisions cannot wait for the publications' life-cycle to be completed (i.e. for the publications to stop being cited), which may take decades (van Raan, 2004; Teixeira, Vieira, & Abreu, 2017; Song, Situ, Zhu, & Lei, 2018).
As a consequence, scientometricians count early citations, not overall citations. The question then is how long the citation time window should be for early citations to be considered an accurate and robust proxy of overall scholarly impact. The longer the citation time window, the more accurate the prediction. In the end, the answer is subjective, because of the embedded tradeoff: the appropriate choice of citation time window is a compromise between the two objectives of accuracy and timeliness in measurement, and the relative solutions differ from one discipline to another. The topic has been extensively examined in the literature (Rousseau, 1988; Glänzel, Schlemmer, & Thijs, 2003; Adams, 2005; Stringer, Sales-Pardo, & Nunes Amaral, 2008; Abramo, Cicero, & D'Angelo, 2011; Nederhof, Van Leeuwen, & Clancy, 2012; Wang, 2013; Onodera, 2016).
Most studies in evaluative scientometrics focus on providing new creative solutions to the problem of how to best support the measurement of research performance. An extraordinary number of performance indicators continue to be proposed. Suffice it to say that at the recent 17th International Society of Scientometrics and Informetrics Conference (ISSI 2019), a special plenary session and five parallel sessions, including 25 contributions altogether (leaving aside poster presentations), were devoted to "novel bibliometric indicators".
Far fewer studies have tackled the problem of how to improve the impact prediction power of early citations, given inevitably short citation time windows. A number of scholars proposed to combine citation counts with other independent variables related to the publication. Whatever the combination, there is a common awareness that it cannot be the same across disciplines, because the citation accumulation speed and distribution curves vary across disciplines (Garfield, 1972; Mingers, 2008; Wang, 2013; Baumgartner & Leydesdorff, 2014).
It has been shown that in mathematics (and with weaker evidence in biology and earth sciences), for citation windows of two years or less, the journal's two-year impact factor (IF) predicts long-term impact better than early citations (Abramo, D'Angelo, & Di Costa, 2010). In all disciplines of the sciences but mathematics, for citation windows of zero or one year only, a combination of IF and citations was recommended (Levitt & Thelwall, 2011; Bornmann, Leydesdorff, & Wang, 2014). The same seems to be valid in the social sciences as well (Stern, 2014). A model based on IF and citations to predict long-term citations was proposed by Stegehuis, Litvak, and Waltman (2015). The weighted combination of citations and journal metric percentiles adopted in the Italian research assessment exercise, VQR 2011-2014 (Anfossi, Ciolfi, Costa, Parisi, & Benedetto, 2016), proved to be a worse predictor of future impact than citations only (Abramo & D'Angelo, 2016).
To provide practitioners and decision makers with a better predictor of overall impact, and awareness of how the predicting power varies with the citation time window, Abramo, D'Angelo, and Felici (2019) made available, in each of the 170 subject categories (SCs) in the sciences and economics, with more than 100 Italian 2004-2006 publications: i) the weighted combinations of two-year IF and citations, as a function of the citation time window, which best predict overall impact; and ii) the predictive power of each combination.
It emerged that the IF has a non-negligible role only with very short citation time windows (0 to 2 years); for longer ones, the weight of early citations is dominant and the IF is not informative in explaining the difference between long-term and short-term citations.
The calibration of the weights by citation time window and SC, and the measurement of the impact indicator, are not as straightforward as the simple measurement of normalized citations.
In this study, we want to find out whether all this hassle about improving the predictive power of early citations is worthwhile. We ask whether, and to what extent, applying a predictor of overall impact that is more accurate than early citations affects the research performance ranks of individuals. In this specific case, as a performance indicator we resort to the total impact of individuals. This indicator is particularly appropriate when one needs to identify the top experts in a particular field, for consultancy work or the like. Relying on an author name disambiguation algorithm for Italian academics, we measure the total impact of Italian professors (assistant, associate, full) in the sciences and in economics in a period of time, valuing their publications first by the early citations and then by the weighted citation-IF combination provided by Abramo, D'Angelo, and Felici (2019). At this point, we can analyze the extent of variations in rank of individuals, in each discipline and field where they are classified.
The rest of the manuscript is organized as follows. In Section 2, we present the data and method. In Section 3, we report the comparison of the rankings by the two methods of valuing overall impact, at field and discipline level. The discussion of results in Section 4 will conclude the work.

Data and methods
For the purpose of this study, we are interested in how a different measure of impact affects the ranking of Italian professors by total impact, in the period of 2015-2017.
Data on the faculty at each university were extracted from the database on Italian university personnel, maintained by the Ministry of Universities and Research, MUR. For each professor this database provides information on their gender, affiliation, field classification and academic rank, at the end of each year. In the Italian university system all academics are classified in one and only one field, named scientific disciplinary sector (SDS), 370 in all. SDSs are grouped into disciplines, named university disciplinary areas (UDAs), 14 in all.
Data on output and relevant citations are extracted from the Italian Observatory of Public Research, a database developed and maintained by Abramo and D'Angelo, and derived under license from the Clarivate Analytics Web of Science Core Collection (WoS). Beginning from the raw data of the WoS, and applying a complex algorithm to reconcile the authors' affiliations and disambiguate their true identity, each publication (article, letter, review and conference proceeding) is attributed to the university professor who produced it. Thanks to this algorithm, we can produce rankings by total impact at the individual level, on a national scale. Based on the value of total impact we obtain a ranking list expressed on a percentile scale of 0-100 (worst to best) of all Italian academics of the same academic rank and SDS.
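The percentile scale described above can be illustrated with a minimal Python sketch. The exact percentile formula is not given in the text, so the one used here (share of peers with a strictly lower score, rescaled to 0-100) is an assumption, as are the function name `percentile_ranks` and the record keys `id`, `sds`, `academic_rank` and `ti`.

```python
from collections import defaultdict


def percentile_ranks(professors):
    """Map each professor id to a 0-100 percentile (worst to best).

    `professors` is a list of dicts with the hypothetical keys
    'id', 'sds', 'academic_rank' and 'ti' (total impact score).
    Peers are compared only within the same SDS and academic rank.
    Assumed formula: 100 * (peers with strictly lower score) / (n - 1),
    so tied professors share the same percentile.
    """
    groups = defaultdict(list)
    for prof in professors:
        groups[(prof["sds"], prof["academic_rank"])].append(prof)

    result = {}
    for members in groups.values():
        n = len(members)
        for prof in members:
            if n == 1:
                # A single professor has no peers; place them at the top.
                result[prof["id"]] = 100.0
                continue
            below = sum(1 for other in members if other["ti"] < prof["ti"])
            result[prof["id"]] = 100.0 * below / (n - 1)
    return result


# Illustrative (made-up) data: three full professors in one SDS,
# plus one associate professor ranked in a separate group.
sample = [
    {"id": "a", "sds": "MAT/04", "academic_rank": "full", "ti": 0.0},
    {"id": "b", "sds": "MAT/04", "academic_rank": "full", "ti": 1.2},
    {"id": "c", "sds": "MAT/04", "academic_rank": "full", "ti": 3.4},
    {"id": "d", "sds": "MAT/04", "academic_rank": "associate", "ti": 0.5},
]
ranks = percentile_ranks(sample)
```

With this convention, professors with identical scores (for instance, all those with a nil score) receive the same percentile, which is why an indicator that separates previously tied professors can move ranks even when scores change little.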
We limit our field of analysis to the sciences and economics, where the WoS coverage is acceptable for bibliometric assessment. The dataset thus formed consists of 38,456 professors from 11 UDAs (mathematics and computer sciences, physics, chemistry, earth sciences, biology, medicine, agricultural and veterinary sciences, civil engineering, industrial and information engineering, psychology, economics and statistics) and 218 SDSs, as shown in Table 1. Of these professors, 9.3% are unproductive (0 publications); as a consequence, their scores are the same under the two indicators, although their ranks are not necessarily so. The scores and ranks of uncited productive professors (4.2% in all), instead, will change (because IF is always above 0). Measuring the latter's impact by citations only, their score (0) and rank would be the same as for unproductive professors. This is not the case when impact is measured by the weighted combination of normalized citations and IF.
As for impact, we measure it in two ways: one values publications by early citations only; the other by the weighted combinations of citations and IF, as a function of the citation time window and field of research, which best predict future impact (Abramo, D'Angelo, & Felici, 2019). Because citation behavior varies across fields, we standardize the citations for each publication with respect to the average of the distribution of citations for all publications indexed in the same year and the same SC. We apply the same procedure to the IF. Furthermore, research projects frequently involve a team of scientists, which is registered in the co-authorship of publications. In this case, we account for the fractional contributions of scientists to outputs, which is sometimes further signaled by the position of the authors in the list of authors. The yearly total impact of a professor, termed TI, is then obtained from the standardized value of each publication and the professor's fractional contribution to it.
The fractional contribution equals the inverse of the number of authors in those fields where the practice is to place the authors in simple alphabetical order, but assumes different weights in other cases. For the life sciences, widespread practice in Italy is for the authors to indicate the various contributions to the published research by the order of the names in the listing of the authors. For the life sciences, we therefore give different weights to each co-author according to their position in the list of authors and the character of the co-authorship (intra-mural or extra-mural).
For reasons of significance, the analysis is limited to those professors who held formal faculty roles for at least two years over the 2015-2017 period.
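The two ways of valuing publications can be illustrated with a minimal Python sketch. It is not the authors' implementation: the linear form of the citation-IF combination, the division by years on staff to obtain a yearly value, the weights, and all names (`Publication`, `total_impact`, `w_cit`, `w_if`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    citations: float       # early citations received by the publication
    impact_factor: float   # IF of the hosting journal
    mean_citations: float  # average citations, same year and same SC
    mean_if: float         # average IF, same year and same SC
    w_cit: float           # hypothetical weight of normalized citations
    w_if: float            # hypothetical weight of normalized IF
    fraction: float        # professor's fractional contribution


def total_impact(publications, years_on_staff, weighted):
    """Yearly total impact under the stated assumptions.

    weighted=False values each publication by normalized citations only;
    weighted=True uses an assumed linear combination of normalized
    citations and normalized IF.
    """
    total = 0.0
    for pub in publications:
        norm_cit = pub.citations / pub.mean_citations
        if weighted:
            norm_if = pub.impact_factor / pub.mean_if
            value = pub.w_cit * norm_cit + pub.w_if * norm_if
        else:
            value = norm_cit
        total += value * pub.fraction
    return total / years_on_staff


# Made-up example: one uncited publication with two co-authors
# listed alphabetically (fraction = 1/2), two years on staff.
uncited = Publication(citations=0, impact_factor=2.0, mean_citations=4.0,
                      mean_if=1.0, w_cit=0.75, w_if=0.25, fraction=0.5)
ti_citations_only = total_impact([uncited], years_on_staff=2, weighted=False)
ti_weighted = total_impact([uncited], years_on_staff=2, weighted=True)
```

In this example the citation-only score is nil, exactly as for an unproductive professor, while the weighted score is positive because the IF term is above 0. This is the mechanism by which the combined indicator separates uncited productive professors from unproductive ones.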
Citations are observed at 31 December 2018, implying citation time windows ranging from one to four years.

Results
In the following, we present the score and rank of performance by total impact of Italian professors, by SDS and UDA, as measured respectively by:
• Early citations (TIC);
• The weighted combination of citations and IF of the hosting journal (TIWC).
As already said, no variations will occur for professors with no publications in the period under observation. We expect instead significant variations in score and rank for professors with uncited publications. In fact, while their TIC is nil, their TIWC is going to be above 0.
As an example, Table 2 shows the scores and ranks by TIC and TIWC, for the 26 Italian professors in the SDS Aerospace propulsion. The score variation is nil for the two unproductive professors at the bottom of the list, while it is maximum for the uncited productive professors (ID 49113 and 2592). Twelve professors experience no shift, among them the top five in ranking. A few pairs swap positions, e.g. ID 78162 and ID 49106. The maximum shift is 3 positions.
The SDS Industrial chemistry consists of 114 professors, mostly productive and cited. Figure 1 shows the dispersion of their impact. The very strong correlations of scores (Pearson ρ = 0.999) and ranks (Spearman ρ = 0.998) by TIC and TIWC are as expected.

Figure 1: Score dispersion by TIC and TIWC of the 114 Italian professors in the SDS Industrial chemistry
Higher dispersion (Figure 2) occurs instead for the 73 professors in the SDS Complementary mathematics, where about two thirds (50) of professors present nil TIC, and 20% (15), while productive, are uncited (TIWC above 0). As a matter of fact, noticeable shifts in relative scores occur for high performers too (top-right side of the diagram), notwithstanding a very strong score correlation (Pearson ρ = 0.988). The ability of TIWC to discriminate the impact of uncited publications, and therefore the relevant performance of uncited professors, explains the lower rank correlation (Spearman ρ = 0.915). Although variations in score are not that noticeable, those in rank are. To better show that, Figure 3 reports the share of professors experiencing a rank shift in both SDSs. In Complementary mathematics, above 60% of professors do not change rank (50% could not, as unproductive). The remaining 40%, though, present shifts, which are in some cases quite noticeable: five professors improve their rank by no less than 10 positions. Rank shifts are less evident in Industrial chemistry: the average shift is 1.47 positions, as compared to 1.89 in Complementary mathematics. In the former SDS, because of the lower number of unproductive professors, shifts concern a higher share of the population, namely 70%.
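The comparison of scores and ranks under the two indicators can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are hypothetical, tied scores are given the mean of their positions (an assumption, since the treatment of ties is not stated in the text), and the data are made up.

```python
from math import sqrt


def average_ranks(values):
    """Rank values in ascending order (1 = lowest score).

    Tied values receive the mean of the positions they occupy.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation, computed as Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_abs_rank_shift(x, y):
    """Average absolute difference in rank between the two indicators."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    return sum(abs(a - b) for a, b in zip(rank_x, rank_y)) / len(x)


# Made-up scores for four professors: the first is unproductive,
# the second is productive but uncited (nil TIC, positive TIWC).
tic = [0.0, 0.0, 1.0, 3.0]
tiwc = [0.0, 0.2, 1.1, 2.9]
score_corr = pearson(tic, tiwc)
rank_corr = spearman(tic, tiwc)
avg_shift = mean_abs_rank_shift(tic, tiwc)
```

In this toy case the only rank changes come from breaking the tie between the unproductive and the uncited professor, which mirrors the pattern reported for Complementary mathematics: a very high score correlation alongside a lower rank correlation.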
For a better appreciation of the rank variations in the whole SDS spectrum, Figure 4 shows the box plots of the average percentile shifts in the SDSs of each UDA, while Table 3 presents some relevant descriptive statistics.
Economics and statistics is the UDA with the highest average percentile shift (6.5), the highest dispersion among SDSs (3.6 standard deviation), and the widest range of percentile shift, from 1.4 of SECS-P/13 (Commodity science) to 15.9 of SECS-P/04 (History of economic thought). It is followed by Mathematics and computer science, whose percentile shift ranges between 2.0 of MAT/08 (Numerical analysis) and 12.6 of MAT/04 (Complementary mathematics). On the contrary, UDAs 4 (Earth sciences) and 5 (Biology) show the lowest dispersion among SDSs (0.3 standard deviation) and quite low average percentile shifts. In UDA Medicine, a peculiar case occurs: in SDS MED/47 (Nursing and midwifery) the two ranking lists are exactly the same. The same occurs also in two SDSs of Industrial and information engineering: ING-IND/29 (Raw materials engineering) and ING-IND/30 (Hydrocarburants and fluids of the subsoil). Overall, in 17 out of 218 SDSs, the average percentile shift is not below five percentiles.
In general, the correlation between TIC and TIWC is very strong. Table 4 presents some descriptive statistics of both Pearson ρ (score) and Spearman ρ (rank) for the SDSs of each UDA. As for the scores, the minimum correlation (0.957) occurs in an SDS of Medicine (MED/02 - History of medicine). As for the ranks, it occurs (0.884) in an SDS of Economics and statistics, SECS-P/04 (History of economic thought), which also stands out for the maximum average percentile shift among all SDSs (Table 3). It is a relatively small SDS, 35 professors in all, two thirds of whom have nil TIC.
UDA codes: 1, Mathematics and computer science; 2, Physics; 3, Chemistry; 4, Earth sciences; 5, Biology; 6, Medicine; 7, Agricultural and veterinary sciences; 8, Civil engineering; 9, Industrial and information engineering; 10, Psychology; 11, Economics and statistics.
The rank variations in general appear strongly correlated with the share of productive professors with nil TIC, i.e. with only uncited publications.
The correlation between the two variables is shown in Figure 5 (Pearson ρ = 0.791).

Figure 5: Field dispersion per share of uncited professors and average rank shift by TIC and TIWC
A typical way to report performance is by quartile ranking. We then analyse the performance quartile shifts by the two indicators of impact. Table 6 shows the shift distributions by UDA. Economics and statistics presents the highest share of professors (17.1%) shifting quartile, followed by Mathematics (9%). In the remaining UDAs shares range between 4% and 7%. It must be noted that 0.4% of professors (154) experience two quartile shifts, and all but two shift from bottom to above the median. They are mainly in Economics and statistics, and in Mathematics.
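The quartile-shift analysis can be sketched as follows, starting from the percentile ranks (0-100, worst to best) under each indicator. The quartile boundaries at 25, 50 and 75, the labelling of Q1 as the top quartile and Q4 as the bottom one, and the function names are assumptions for illustration; the percentile values are made up.

```python
from collections import Counter


def quartile(percentile):
    """Return 1 (top quartile) to 4 (bottom quartile).

    Assumed boundaries: >= 75 is Q1, >= 50 is Q2, >= 25 is Q3, else Q4.
    """
    if percentile >= 75:
        return 1
    if percentile >= 50:
        return 2
    if percentile >= 25:
        return 3
    return 4


def quartile_shift_distribution(pct_tic, pct_tiwc):
    """Count professors by quartile shift when moving from TIC to TIWC.

    The shift is (quartile by TIC) - (quartile by TIWC), so a positive
    value means the professor moves up, e.g. +2 for a move from Q4 to Q2.
    """
    return Counter(quartile(a) - quartile(b)
                   for a, b in zip(pct_tic, pct_tiwc))


# Made-up percentile ranks for five professors under each indicator.
pct_tic = [10, 30, 60, 90, 20]
pct_tiwc = [55, 30, 45, 90, 24]
shifts = quartile_shift_distribution(pct_tic, pct_tiwc)
share_shifting = 1 - shifts[0] / len(pct_tic)
```

Here the first professor moves from Q4 to Q2, i.e. from the bottom quartile to above the median, which corresponds to the two-quartile shifts discussed in the text.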

Conclusions
Evaluative scientometrics is mainly aimed at measuring and comparing research performance of individuals and organizations. A critical issue in the process is the accurate prediction of the scholarly impact of publications when only short citation time windows are available. This is often the case when the evaluation is geared to informing decision making.
Better impact prediction accuracy often involves complex, costly and time-consuming measurements. Pragmatism calls for an analysis of the effects of improved indicators on the performance ranking of the subjects under evaluation. This study follows up the work by the same authors (Abramo, D'Angelo & Felici, 2019), which demonstrated that, especially with very short time windows (0 to 2 years), the IF can be combined with early citations as a powerful covariate for predicting long-term impact.
Using the outcomes of that work, i.e. the weighted combinations of IF and citations (as a function of the citation time window) which best predict the overall impact of single publications in each SC, we have been able to measure the 2015-2017 total impact of all Italian professors in the sciences and economics, and to analyze the extent of variations in performance ranks with respect to using early citations only.
As expected, scores and ranks by the two indicators show a very strong correlation. Nevertheless, in 7% of SDSs, the average shift is not below 5 percentiles, and it is 15.6 and 12.9 on average in the SDSs, respectively, of Economics and statistics, and Mathematics and computer science.
In terms of quartile shifts, almost 7% of professors undergo them. In Economics and statistics, 3% of professors shift from Q4 to above the median.
A strong correlation is observed between the rate of shifts in rank and the share of uncited professors in the SDS. The total impact of uncited professors is in fact nil by TIC, but above 0 by TIWC. In short, TIWC can better discriminate the performance of professors in the left tail of the distribution. The higher the share of uncited professors in an SDS, the more advisable the recourse to TIWC. Furthermore, the shorter the citation time window, the heavier the relative weight of IF in predicting the long-term impact. TIWC is then highly recommendable when citation time windows are short and the rate of uncited professors is high.
In the case of national research assessment exercises based on informed peer-review or on bibliometrics only, the weighted combination of normalized citations and IF to rank publications might be adopted, as the weights can be made available by the authors, for all SCs and citation time windows up to six years.
Possible future investigations within this stream of research might concern the effect of the improved indicator of publications' impact on the performance score and rank of research organizations and research units.