Abstract
In recent decades, many countries have started funding academic institutions based on the evaluation of their scientific performance. Postpublication peer review is often used in this context. Bibliometric indicators have been suggested as an alternative to peer review. A recurrent question is whether peer review and metrics yield similar outcomes. In this paper, we study this question based on a sample of publications submitted to the national Italian research assessment exercise (2011–2014). In particular, we study the agreement between peer review and metrics at the institutional level, and compare this to the internal agreement of peer review. We base our analysis on a hierarchical Bayesian model using cross-validation. We find that the level of agreement is higher at the institutional level than at the publication level. Overall, the agreement between metrics and peer review is low, but on par with the similarly low internal agreement between two reviewers for certain fields of science. The low agreement between metrics and peer review is therefore no reason to reject the use of metrics for some fields in the Italian national research assessment exercise. Our results provide input to the broader discussion of research evaluation, in which other factors also play an important role.
1. INTRODUCTION
Since the 1980s, performance-based research funding systems (PBRFS) have been introduced in many countries in order to strengthen the accountability of research institutions and steer their behavior. PBRFS may vary considerably in how they function (Hicks, 2012; Zacharewicz, Lepori et al., 2019), but they have one element in common: the need to evaluate research. Peer review is often considered the principal method for evaluating scientific products. Indeed, some countries, such as the United Kingdom, have opted for research assessment that is primarily based on peer review. In large research assessment exercises, peer review may become costly. To facilitate the assessment, bibliometric indicators can be used to inform the judgement of peers. In Italy, the research assessment exercise, known as the Valutazione della Qualità della Ricerca (VQR), uses an informed peer review approach, where review by selected panellists and external peers is supported by bibliometrics in fields for which bibliometric indicators seem informative (see Ancaiani, Anfossi et al., 2015 for more details).
A recurrent question in this context is whether peer review and metrics tend to yield similar outcomes, or whether they differ substantially (Narin, 1976). This question has been repeatedly addressed in the context of the U.K. Research Excellence Framework (REF), culminating in a systematic large-scale comparison between peer review and metrics in the Metric Tide report (Wilsdon, Allen et al., 2015, Supplementary Report II). We believe that this report has two crucial shortcomings (Traag & Waltman, 2019): (a) The agreement between peer review and metrics was studied at the publication level, in contrast to the aggregate institutional level at which the REF outcomes are relevant; and (b) the internal agreement of peer review itself (i.e., the extent to which different reviewers or different peer review panels come to the same conclusion) was not considered. In the Italian context, the Agenzia Nazionale di Valutazione del sistema Universitario e della Ricerca (ANVUR), the agency tasked with the implementation of the VQR, collected data on peer review and quantified the internal agreement of peer review at the publication level. The results of the analysis showed that the agreement between metrics and peer review is similar to, or higher than, the agreement between two independent reviewers (Alfò, Benedetto et al., 2017; Ancaiani et al., 2015; Bertocchi, Gambardella et al., 2015), although this result has been subject to debate (Baccini & De Nicolao, 2016, 2017). In this paper, we use the ANVUR data set to study the internal peer review agreement at the institutional level. Similar to Traag and Waltman (2019), we find that the agreement between peer review and metrics tends to be higher at the institutional level than at the individual publication level. Going beyond Traag and Waltman (2019), we also quantify the internal peer review agreement at the institutional level, which is also higher than at the publication level. Most importantly, we find that the agreement between metrics and peer review is generally on par with the internal agreement between two reviewers for the fields included in our analysis.
In the next section, we provide a short background of PBRFS and a brief description of the Italian VQR exercise. We then present the collected data and outline our methodology, followed by a summary of the main results obtained in our analysis. Finally, in the conclusion, we discuss the connection with broader questions around evaluation.
2. PERFORMANCE-BASED RESEARCH FUNDING SYSTEMS
Since the early 1980s, public management has changed around the world. Reforms led to the redesign of the main public administration mechanisms at all levels and in all sectors, including higher education systems and public research organizations. Over recent decades, a considerable number of countries, particularly EU member states, have implemented PBRFS. According to the definition provided by Hicks (2012), and used in Zacharewicz et al. (2019), the main characteristics of PBRFS are the following:
- Research output and/or impact is evaluated ex post.
- The allocation of research funding depends (partly) on the outcome of the evaluation.
- The assessment and funding allocation take place at the organizational level.
- The system operates at the national or regional level.
PBRFS exclude any kind of degree program or teaching assessment. Grant-based funding, which is based on ex ante evaluation of grant proposals, is excluded, as are funding systems that assign funds only on the basis of the number of researchers or PhD students. Furthermore, PBRFS provide, directly or indirectly, tools and mechanisms to allocate research funds. They operate at the national level, not the local or institutional level, and they typically do not result in mere suggestions or recommendations to evaluated organizations, but affect the allocation of resources (Hicks, 2012; Zacharewicz et al., 2019).
Many countries have no research performance-based elements in their funding allocation at all (for a detailed examination see the work of Zacharewicz et al., 2019). Countries that do base their funding allocation partly on research performance do so in a variety of ways. This includes countries such as the United Kingdom and Italy, which have both implemented a PBRFS, but using a different approach. The rationales of both exercises may be summarized as follows: (a) steering public funds allocation on the basis of quality, excellence, or meritocratic criteria; (b) providing comparative information on institutions for benchmarking purposes; and (c) providing accountability regarding the effectiveness of research management and its impact in terms of public benefits (Abramo & D’Angelo, 2015; Franceschini & Maisano, 2017). The United Kingdom has developed the oldest and best-known assessment exercise, nowadays called the Research Excellence Framework (REF), which is a point of reference for many PBRFS. Italy introduced a PBRFS more recently, nowadays called the Valutazione della Qualità della Ricerca (VQR), which was partly inspired by the U.K. REF, although there are also marked differences.
Whereas the U.K. REF is believed to use metrics moderately and is mainly based on peer review assessments, the Italian VQR makes more extensive use of bibliometric indicators, especially in the STEM and Life Science sectors (Zacharewicz et al., 2019). These differences are not absolute, but gradual: Both rely partly on metrics and partly on peer review, but they do so in different ways. Peer review is often seen as a kind of gold standard for research evaluation (Wilsdon et al., 2015). The research community has long debated whether to use metrics or peer review for research evaluation: Both methods have strengths and weaknesses. There seems to be a growing consensus that metrics and peer review both provide useful input for research evaluation, with an understanding that metrics should support but not supplant peer review (Wilsdon et al., 2015). It is in this context that we study the agreement between peer review and metrics.
2.1. The VQR Exercises
Italy introduced a national research assessment exercise in 2006 as the Valutazione Triennale della Ricerca (VTR), which looked back at the period 2001–2003 (see Debackere, Arnold et al., 2018). The second assessment exercise, the so-called Valutazione della Qualità della Ricerca (VQR), the evaluation of research quality, looked back at the years 2004–2010, and its results were published in July 2013 by a new national agency, the Agenzia Nazionale per la Valutazione dell’Università e della Ricerca (ANVUR). The third research assessment exercise (VQR 2011–2014) started in 2015 with reference to the period 2011–2014, and its results were published in February 2017 by ANVUR. The fourth research assessment exercise (VQR 2015–2019) covered the years 2015–2019, and its results were published in July 2022 by ANVUR. VTR and VQR results have been used by the government to allocate a growing share of the Ordinary University Fund (FFO), starting from 2.2% of the total funding in 2009 and reaching almost 30% of the total funding in 2022.
The VQR evaluates the research outputs of all permanent scientific staff in 96 universities and 39 public research organizations. With reference to the period 2011–2014, these organizations submitted what they considered to be their best outputs for evaluation on behalf of 52,677 researchers (two outputs for each university researcher and three for each scientist employed in a public research organization). All in all, 118,036 outputs were submitted, of which 78% were journal articles. The remaining types of research outputs included a wide range of materials, such as book chapters, conference proceedings, and even works of art. Outputs were classified into 16 research areas, and ANVUR appointed a Gruppo di Esperti della Valutazione (GEV), a panel of experts, for each research area.
In the humanities and social sciences (except for Psychology and Economics & Statistics), a pure peer review system was employed, assisted by external (national and international) reviewers. Across all research areas, almost 17,000 reviewers were involved in VQR 2011–2014. External reviewers rated each publication with a score of 1–10 on three criteria: originality, methodological rigor, and impact.
Generally speaking, in the Science, Technology, Engineering, and Medicine (STEM) areas, the same review procedure was used, but in addition, bibliometric indicators were produced by ANVUR to inform the panels. In Mathematics and Economics & Statistics, bibliometric indicators played a central role, even though these two GEVs adopted slightly different bibliometric evaluation algorithms compared to the other STEM areas and Psychology (for details see the GEV1 and GEV13 Area Reports). In particular, in Economics & Statistics, no bibliometric scores were gathered by ANVUR, and these variables are therefore missing for all publications in Economics & Statistics. More specifically, peer evaluation was integrated with the use of bibliometric indicators concerning citations and journal impact, drawn from the major international databases (see Anfossi, Ciolfi et al., 2016, for a description of the bibliometric algorithm used in the exercise). The indicators used were the 5-Year Impact Factor and the Article Influence Score (AIS) for the WoS database, and the Scimago Journal Rank (SJR) and the Impact Per Publication (IPP) indicators for Scopus. On the basis of the bibliometric algorithm, when citations and journal impact indicators provided contrasting results, the paper was not assigned an evaluation class but was instead sent to external review for informed review (IR). More specifically, if a research product was published in a high-impact journal but received few citations, or (vice versa) was published in a low-impact journal but was cited frequently, the product was evaluated with peer review1.
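To make this decision logic more concrete, the following sketch illustrates the rule described above. It is a simplified, hypothetical reconstruction: the function name, the percentile inputs, and the thresholds are our own illustrative choices, not the calibrated, area-specific algorithm of Anfossi et al. (2016).

```python
def vqr_bibliometric_class(citation_percentile: float, journal_percentile: float,
                           high: float = 80.0, low: float = 50.0) -> str:
    """Illustrative sketch of the VQR decision logic (hypothetical thresholds).

    Concordant citation and journal signals yield a merit class directly;
    discordant signals send the publication to informed peer review (IR).
    """
    if citation_percentile >= high and journal_percentile >= high:
        return "high merit class"
    if citation_percentile < low and journal_percentile < low:
        return "low merit class"
    # Contrasting signals: e.g., a highly cited paper in a low-impact journal,
    # or a paper in a high-impact journal that received few citations.
    return "informed peer review (IR)"


# A well-cited paper published in a journal with modest impact is sent to review.
print(vqr_bibliometric_class(citation_percentile=90, journal_percentile=40))
```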
3. DATA AND INDICATORS
The analysis carried out in this paper is based on reviews collected previously by ANVUR of a sample extracted from the full data set of journal articles that were submitted for evaluation in VQR 2011–2014 (ANVUR, 2017, p. 17 and Appendix B). The sample was limited to the research areas that made use of metrics (i.e., all STEM areas, Psychology, and Economics & Statistics2). In total, 77,159 journal articles were submitted in these research areas and evaluated through bibliometrics in VQR 2011–2014. A random sample of 10% of these 77,159 journal articles was drawn, stratified by research area, resulting in 7,667 sampled journal articles3. See Table 1 for an overview of the research areas and the number of articles per research area. ANVUR sent out all journal articles in the sample to two independent peer reviewers, irrespective of their bibliometric indicators. ANVUR did not provide reviewers with the bibliometric indicators of the articles they were asked to review. The response rate of reviewers was high, and 7,164 articles were peer reviewed by two reviewers. Overall, the empirical sample could be considered a reasonable approximation of the population of reference (see also Alfò et al., 2017). Poststratification analysis suggests that the sample is also sufficiently representative at the institutional level (Figure 1).
Table 1. Overview of research areas, Gruppo di Esperti della Valutazione (GEV), the number of submitted articles, and the number of articles included in the analyzed sample
| ID | Name | Submitted articles | In sample |
| --- | --- | --- | --- |
| 1 | Mathematics & Computer Sciences | 4,631 | 468 |
| 2 | Physics | 10,182 | 1,018 |
| 3 | Chemistry | 6,625 | 662 |
| 4 | Earth Sciences | 3,953 | 394 |
| 5 | Biology | 10,423 | 1,037 |
| 6 | Medicine | 15,400 | 1,524 |
| 7 | Agricultural & Veterinary Sciences | 6,354 | 638 |
| 8b | Civil Engineering | 2,370 | 237 |
| 9 | Industrial & Information Engineering | 9,930 | 998 |
| 11b | Psychology | 1,801 | 180 |
| 13 | Economics & Statistics | 5,490 | 511 |
| Total | | 77,159 | 7,667 |
Figure 1. Distribution of the percentage of publications in the sample across institutions. The x-axis shows the percentage of an institution's submitted publications that are included in the sample used in our study. The y-axis shows the number of institutions with the indicated percentage of publications. The line shows the expected distribution of percentages based on 1,000 random stratified samples.
We matched all publications in the sample with the CWTS in-house version of Web of Science (WoS). In total, 6,337 publications could be matched to WoS, 6,001 of which were reviewed by two reviewers, comprising 7.8% of the reference population, from 110 Italian institutions. A few institutions that only had one or two publications in the sample had no match with WoS at all. It is worth noting that institutions were not always included in all areas.
As stated, publications in the sample were assessed by two independent reviewers. We randomly determined which reviewer was considered reviewer number 1 and which one was considered reviewer number 2. We summed the scores on the three criteria of originality, methodological rigor, and impact to obtain an overall score that ranged from 3 to 30.
As said, 7,164 publications were reviewed by two independent reviewers. There were 122 publications without any reviewer and 381 publications that were reviewed by only a single reviewer. Many of these publications with missing reviewer scores are concentrated in Medicine, Biology, and Industrial & Information Engineering (Figure 2). In our analysis, we include all publications, including those with missing reviewer scores or citation scores; we explain this in more detail in Section 4.
Figure 2. Number of publications in the sample with missing reviewer scores per GEV.
For each paper included in the sample, we calculated two indicators on the basis of WoS: one at the article level and one at the journal level. We calculated (a) the normalized citation score (NCS) for each paper, given by the number of citations divided by the average number of citations of all publications in the same field and the same year; and (b) the normalized journal score (NJS), which is the average NCS of all publications in a certain journal and a certain year. To be consistent with the timing of the VQR, we took into account citations up to (and including) 2015. We used the WoS journal subject categories for calculating normalized indicators. In the case of journals that were assigned to multiple subject categories, we applied a fractionalization approach to normalize citations (Waltman, van Eck et al., 2011). Publications within journals in the multidisciplinary category (e.g., Science, Nature, PLOS ONE) were fractionally reassigned to other subject categories based on their references. Besides the bibliometric information calculated from WoS, we also considered the indicators gathered by ANVUR during the VQR itself. Two different types of indicators were collected: one citation-based indicator and one journal-based indicator. Those indicators may come from various sources (e.g., Scopus, WoS, MathSciNet), and for different publications different journal indicators may be used, such as the 5-year Impact Factor, Article Influence Score, SJR, and IPP4. Institutions could choose the source that should be used for the indicators for each individual article that was submitted. All VQR scores were normalized as percentiles with respect to the field definitions as provided by the data source. This procedure allowed the VQR to gain a greater degree of flexibility in practice (Anfossi et al., 2016), but also made the data more heterogeneous, thereby complicating the interpretation of the results. Nonetheless, we also included the VQR indicators in our analysis in order to compare them to the bibliometric information obtained exclusively from WoS. Besides the two reviewers’ scores for each paper, we hence obtained the two VQR percentile metrics (a citation metric and a journal metric), and the two WoS metrics (a citation metric and a journal metric).
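As an illustration of how the two WoS-based indicators are defined, the following sketch computes NCS and NJS from a toy publication table. The column names and data are hypothetical, and the sketch ignores the fractional assignment of publications to multiple subject categories and the reassignment of publications in multidisciplinary journals described above.

```python
import pandas as pd

# Hypothetical publication-level table; column names and values are illustrative only.
pubs = pd.DataFrame({
    "field":     ["Physics", "Physics", "Physics", "Chemistry"],
    "year":      [2012, 2012, 2013, 2012],
    "journal":   ["J1", "J2", "J1", "J3"],
    "citations": [10, 2, 5, 8],
})

# NCS: citations divided by the mean number of citations of all publications
# in the same field and publication year.
pubs["ncs"] = pubs["citations"] / pubs.groupby(["field", "year"])["citations"].transform("mean")

# NJS: mean NCS of all publications in the same journal and year.
pubs["njs"] = pubs.groupby(["journal", "year"])["ncs"].transform("mean")

print(pubs)
```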
We want to compare the agreement between metrics and peer review fairly to the internal agreement of peer review. To do so, we consistently compare all scores and metrics to the overall score of reviewer 2. Internal peer review agreement is then quantified by the agreement of the score of reviewer 2 with the predicted score based on the score of reviewer 1. Likewise, for each metric, we calculate the agreement of reviewer 2 with the predicted score based on metrics. By performing the analysis in this way, the agreement between metrics and peer review can be compared fairly to the internal agreement of peer review. If we had chosen to compare each metric to the average score of reviewers 1 and 2, this would have already cancelled out some differences in the scores of the reviewers, and as a result, the agreement of the predicted scores based on metrics with the reviewer scores would not have been directly comparable to the internal peer review agreement.
At the level of institutions, there are two views of the aggregate scores: a size-dependent view, considering the total over a certain score, and a size-independent view, considering the average over a certain score. For the size-dependent view, we simply take the sum of each score, while for the size-independent view we take the average of each score. Note that at the aggregate level we still speak of the scores of reviewer 1 and reviewer 2, even though this refers to two sets of reviewers, not to two individual reviewers.
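In code, the two institutional views amount to different aggregations of the same publication-level scores. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical publication-level review scores with an institution identifier.
scores = pd.DataFrame({
    "institution":  ["A", "A", "A", "B", "B"],
    "review_score": [18, 24, 21, 27, 15],
})

# Size-dependent view: total score per institution.
size_dependent = scores.groupby("institution")["review_score"].sum()

# Size-independent view: average score per institution.
size_independent = scores.groupby("institution")["review_score"].mean()

print(size_dependent, size_independent, sep="\n")
```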
4. METHODOLOGY
Figure 3. Illustration of the hierarchical Bayesian model. In (a) we show the distribution of paper values for a fictitious institution, which is distributed as lognormal(0.8, 0.4), with the solid black line representing one particular paper value of ϕp = 1.5. In (b) we illustrate the distributions of the (nonzero) citation and review scores for the paper value ϕp = 1.5 with σreview = 0.6 and σcitation = 0.8. The top axis in (b) shows the distribution of the discrete review scores 3–30 corresponding with the continuous review scores shown at the bottom.
Our hierarchical Bayesian model naturally incorporates contextual information from an institution. For example, suppose we have two papers of an institution, only one of which has a review score. Based on the review score r1 of paper 1, a paper value ϕ1 is inferred, which hence also leads to an inferred institutional value λ. If we are then asked to predict the review score r2 of paper 2, we then sample a paper value from the distribution based on λ. Hence, if we observe a higher review score r1 of a paper from some institution, we are more likely to predict higher review scores r2 for some other paper. This is similar to the approach taken in Traag and Waltman (2019).
Our hierarchical Bayesian model naturally handles missing values. We use the observed review scores rp and citation scores cp to infer paper values. As said, the paper value is assumed to be drawn from a distribution of paper values that is specific to each institution. In the absence of some observed score, whether review or citation, any other score will still help infer paper values. If no scores are observed at all for a particular paper, the paper value is simply distributed according to the overall distribution of paper values at the institutional level.
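Although the full model specification is available from Traag (2023), a minimal generative sketch may help fix ideas. The sketch below simulates paper values, review scores, and citation scores for one institution, using the parameters of the fictitious example in Figure 3 (paper values distributed as lognormal(0.8, 0.4), σreview = 0.6, σcitation = 0.8). The lognormal noise terms, the mapping of continuous review scores to the discrete 3–30 scale, and the logistic model for nonzero citation scores are simplifying assumptions on our part, not the exact model specification.

```python
import numpy as np

rng = np.random.default_rng(42)

# Institution-level parameters, as in the fictitious example of Figure 3.
mu_inst, sigma_value = 0.8, 0.4          # paper values ~ lognormal(0.8, 0.4)
sigma_review, sigma_citation = 0.6, 0.8  # spread of review and citation scores
beta = 1.0                               # citation score roughly proportional to value
alpha0, beta0 = 2.0, 1.0                 # hypothetical zero-inflation parameters

n_papers = 6
phi = rng.lognormal(mean=mu_inst, sigma=sigma_value, size=n_papers)  # paper values

# Review scores: lognormal noise around the paper value, mapped to the
# discrete 3-30 scale (this particular mapping is a simplifying assumption).
latent_review = rng.lognormal(mean=np.log(phi), sigma=sigma_review)
review_scores = np.clip(np.round(3 + 27 * latent_review / (latent_review + 1)), 3, 30)

# Citation scores: a logistic model for the probability of a nonzero score,
# and lognormal noise around beta * log(paper value) for nonzero scores.
p_nonzero = 1.0 / (1.0 + np.exp(-(alpha0 + beta0 * np.log(phi))))
nonzero = rng.random(n_papers) < p_nonzero
citation_scores = np.where(
    nonzero, rng.lognormal(mean=beta * np.log(phi), sigma=sigma_citation), 0.0
)

print(review_scores)
print(citation_scores.round(2))
```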
We use k-fold cross-validation to separate our data into training and test sets, using k = 5 folds, for each metric and each research area (i.e., GEV) separately. All data are included as test data in exactly one fold and are used as training data in the remaining k − 1 = 4 folds. It is important to take the hierarchical structure of the model into account in how the folds are defined. In particular, an institution should be included in its entirety in either the training set or the test set, not partly in both. Otherwise, the model would already learn the institutional value from some publications in the training set, which would help the prediction in the test set; this is sometimes called leakage. That is, only the fitted parameters α, β, and σ help the prediction in the test set, and information from particular institutions λi or papers ϕp is not used in the prediction in the test set. For the training set we use data from citations and both reviewers simultaneously, while we only use data from either reviewer 1 or citations for the test set. This truthfully reflects how we could apply our hierarchical Bayesian model in practice.
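The institution-level grouping of folds can be expressed, for example, with scikit-learn's GroupKFold, which guarantees that no institution is split across training and test sets. The data below are placeholders, and this is only a sketch of the grouping idea; the actual cross-validation code is available from Traag (2023).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: one row per publication, grouped by institution.
institutions = np.array(["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"])
X = np.arange(len(institutions)).reshape(-1, 1)  # placeholder publication data

# GroupKFold keeps each institution entirely in either the training set or
# the test set of a fold, which avoids institution-level leakage.
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, groups=institutions)):
    train_inst = set(institutions[train_idx])
    test_inst = set(institutions[test_idx])
    assert train_inst.isdisjoint(test_inst)  # no institution appears in both
    print(f"fold {fold}: test institutions = {sorted(test_inst)}")
```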
We calculate two statistics at the institutional level: the mean absolute difference (MAD) for the size-independent view and the mean absolute percentage difference (MAPD) for the size-dependent view. These statistics are calculated for each GEV separately, using the data from the test sets. We calculate these statistics because we believe they are more intuitive to interpret than correlations, as also discussed by Traag and Waltman (2019).
We also calculate the MAD at the individual publication level. The MAPD does not make sense at the individual level, as there is no size-dependent view at that level.
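A plausible formalization consistent with this description (the notation is ours) is the following: for units u = 1, …, N (publications or institutions, depending on the level of analysis), with observed reviewer-2 score $y_u$ and predicted score $\hat{y}_u$,

$$
\mathrm{MAD} = \frac{1}{N} \sum_{u=1}^{N} \bigl| \hat{y}_u - y_u \bigr|,
\qquad
\mathrm{MAPD} = \frac{100\%}{N} \sum_{i=1}^{N} \frac{\bigl| \hat{Y}_i - Y_i \bigr|}{Y_i},
$$

where, for the MAPD, $Y_i$ and $\hat{Y}_i$ denote the observed and predicted institutional totals in the size-dependent view.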
Note that in a Bayesian model, all estimated parameters have a posterior distribution, not just a point estimate. That is, each estimate has some uncertainty, and this uncertainty is also reflected in any prediction, and hence also in the MA(P)D.
5. RESULTS
In Figures 4 and 5 we show the posterior distribution of the various parameter estimates based on the training data. Note that these estimates combine posterior distributions from across the five different folds, resulting in some slightly visible multimodality for some parameters (e.g., the estimate for β for NCS for GEV 2, Physics). As is clear from Figure 4, the β parameters are estimated more precisely than the parameters for predicting the nonzero probability, α0 and β0. Only about 10% of the NCS scores are equal to 0, and none of the other scores are. Hence, α0 is lower for NCS than for the other citation scores. As Economics & Statistics (GEV 13) did not list any metrics from the VQR exercise (i.e., the percentile journal and percentile citation scores), these parameter estimates simply reflect the priors that we used in the model. Overall, most β parameters are positive, or near 0, indicating that the citation scores are estimated to be roughly proportional to the paper value, or somewhat higher.
Figure 4. Parameter estimates of β, α0, and β0 for the various GEV and citation scores.
Figure 5. Parameter estimates of σvalue, σcitation, and σreview for the various GEV and citation scores.
As illustrated in Figure 5, the estimate of σvalue is quite consistent across the various citation scores. There is some variability across the various GEVs, but σvalue is typically in the range of 0.4–0.5. Similarly, σreview is quite consistent across citation scores, and is typically slightly higher than σvalue, at around roughly 0.6. There is also some consistency across different citation scores for σcitation, but the estimates for NJS differ quite distinctly from the estimates for the other citation scores. Interestingly, the estimates of σcitation are lower than the estimates of σreview for NJS, suggesting that NJS scores show less variability for any given paper value ϕp than review scores. Again, Economics & Statistics (GEV 13) essentially just shows the prior distribution for the percentile journal and percentile citation scores.
In Figure 6 we show the posterior distribution for one illustrative example paper in the training data, for both the citation score and the review scores. This shows that the distributions can be quite broad, as already suggested in our illustration in Figure 3. Typically, the observed scores are within the 95% credibility interval. In Figures S1 and S2 in the Supplementary material we show more comprehensive posterior predictive checks for the training data. We only show the mean predictions, and do not show the uncertainty of these predictions. Overall, the mean predictions fit the observations reasonably well, both for citation and review scores. There is some clear shrinkage towards the priors, especially for the percentile citation and journal scores. This means that papers with a high score will on average be predicted to score lower, and, vice versa, papers with a lower score will on average be predicted to score higher.
Figure 6. Posterior distributions for review scores and for the citation score (NCS) for one illustrative paper.
We now consider the results of the prediction in the test data. We show the mean predicted reviewer score based on NCS versus the observed score of reviewer 2 in a scatterplot (Figure 7) and the predicted reviewer score based on the score of reviewer 1 versus the observed score of reviewer 2 in a scatterplot (Figure 8). It is readily apparent that there are quite some differences, not only between the NCS and peer review but also between the two reviewers themselves. We can similarly show these results at the institutional level in Figures 9 and 10, respectively. Although there is less variability than at the individual level, we still see substantial differences. By calculating the MAD and MAPD, we quantify the level of differences for each of the scores.
Figure 7. Scatterplots of observed reviewer scores of reviewer 2 versus predicted reviewer scores based on NCS. The predicted reviewer score in this plot only depicts the mean of the posterior distribution, so the uncertainty in the prediction is not visible.
Figure 8. Scatterplots of observed reviewer scores of reviewer 2 versus predicted reviewer scores based on reviewer 1 at the individual level. The predicted reviewer score in this plot only depicts the mean of the posterior distribution, so the uncertainty in the prediction is not visible.
Figure 9. Scatterplots of observed reviewer scores of reviewer 2 versus predicted reviewer scores based on NCS at the institutional level, taking a size-independent view. The predicted reviewer score in this plot only depicts the mean of the posterior distribution, so the uncertainty in the prediction is not visible.
Figure 10. Scatterplots of observed reviewer scores of reviewer 2 versus predicted reviewer scores based on reviewer 1 at the institutional level, taking a size-independent view. The predicted reviewer score in this plot only depicts the mean of the posterior distribution, so the uncertainty in the prediction is not visible.
In Figure 11 we show the distribution of the absolute difference across papers. That is, for each paper, we calculate the absolute difference between the prediction and the observed score from reviewer 2, averaged across the posterior distribution of absolute differences. We then show the distribution of these average absolute differences over all papers. As is clear, there are quite some individual differences between papers. The absolute difference varies between roughly 4 and 7, but also reaches highs of 10 and above. Moreover, most distributions appear quite similar whether metrics or reviewer scores are used for prediction.
Figure 11. Distribution of the absolute differences across papers. We here calculate the absolute difference for each paper, averaged across the posterior distribution of absolute differences, and show the distribution of the average absolute differences.
In Figure 12 we report the MAD for individual publications. Here, we calculate the absolute difference and average over all papers; that is, we calculate the MAD and show the posterior distribution of the MAD. Overall, the agreement between metrics and review is comparable to the internal agreement between two reviewers. There are clearly some differences between research areas. For example, the agreement is generally relatively high in Physics, while the agreement is lower in Economics & Statistics5. These results also show that indicators based on citations (NCS and the citation percentile) show a similar agreement with peer review as indicators based on journal metrics (NJS and journal percentile). Citations and journal indicators could therefore both provide information about evaluation outcomes. The indicators based on WoS (NCS and NJS) also show a similar agreement with review scores as the more heterogeneous indicators that could be freely chosen from different data sources in the VQR (citation and journal percentiles). This suggests that the heterogeneity of using different data sources neither worsens nor improves the agreement with review.
Figure 12. Mean Absolute Difference (MAD) between the score of reviewer 2 and predictions of reviewer scores based on various bibliometric indicators and the score of reviewer 1. We here calculate the MAD at the individual level. The error bars report the 95% credibility interval of the posterior distribution of the MAD.
In Figure 13 we show the distribution of the absolute difference across institutions; that is, it is the counterpart of Figure 11, but at the institutional level. For each institution, we calculate the absolute difference between the prediction and the observed score from reviewer 2, averaged across the posterior distribution of absolute differences. We then show the distribution of these average absolute differences over all institutions. Again, most distributions appear quite similar whether metrics or reviewer scores are used for prediction. Overall, the level of differences is lower at the institutional level than at the individual level.
Figure 13. Distribution of the absolute differences across institutes. We here calculate the absolute difference for each institute, averaged across all draws from the posterior distribution, and show the distribution of the average absolute differences.
In Figures 14 and 15 we report the MAD and MAPD at the institutional level. Here, we calculate the absolute (percentage) difference and average over all institutions; that is, we calculate the MA(P)D and show the posterior distribution of the MA(P)D. As at the individual level, the agreement between metrics and reviews at the institutional level is again comparable to the internal agreement between two reviewers. When comparing the MAD results at the individual publication level with the results at the institutional level (Figure 12 vs. Figure 14), we see that the differences between research areas become less pronounced. Overall, the MAD at the individual publication level is roughly between 4 and 6 for all indicators, including peer review itself, while at the institutional level, the MAD is roughly between 3 and 4 (Figure 14). The MAD is generally higher at the level of individual publications, compared to the institutional level, showing that “errors” indeed tend to “cancel out” at the aggregate level. The MAPD paints a very similar picture (Figure 15) and shows an MAPD roughly between 10% and 20%.
Figure 14. Mean Absolute Difference (MAD) between the score of reviewer 2 and predictions of reviewer scores based on various bibliometric indicators and the score of reviewer 1. We here calculate the MAD at the institutional level by considering the average scores for an institution, thus taking a size-independent view. The error bars report the 95% credibility interval of the posterior distribution of the MAD.
Figure 15. Mean Absolute Percentage Difference (MAPD) between the score of reviewer 2 and predictions of reviewer scores based on various bibliometric indicators and the score of reviewer 1. We here calculate the MAPD at the institutional level by considering the total scores for an institution, thus taking a size-dependent view. The error bars report the 95% credibility interval of the posterior distribution of the MAPD.
6. DISCUSSION AND CONCLUSION
We analyzed the agreement between several bibliometric indicators and peer review based on a hierarchical Bayesian model using k-fold cross-validation. The contribution of our analysis is twofold: (a) we analyzed the agreement at the institutional level, instead of the individual publication level; and (b) we also quantified internal reviewer agreement at the institutional level. We found that the agreement between bibliometric indicators and peer review is on par with the internal agreement between two reviewers. The agreement between peer review and citation-based indicators was comparable to the agreement between peer review and journal-based indicators. These results were obtained while taking into account missing review scores and missing citation scores. Our findings are in line with the findings of Baccini, Barabesi, and De Nicolao (2020), who also address the issue of missing review scores. Finally, as expected, the agreement at the institutional level was higher than at the individual publication level.
Our results are relevant in the context of PBRFS. In this context, evaluations typically take place at the level of entire institutions. Our results suggest that using peer review or bibliometric indicators, whether citation-based or journal-based, would yield similar outcomes of the evaluation. The agreement between peer review and bibliometric indicators is not perfect, but the differences that arise are comparable to the differences between reviewers themselves. This suggests that similarly different results could have been obtained if other reviewers had evaluated the publications.
There are several reservations that we should make regarding our results. First, our results are based on using a single reviewer for evaluation. We may expect that using multiple reviewers will increase the internal agreement. In reality, most evaluations do use multiple reviewers. It is possible, therefore, that the agreement between bibliometric indicators and peer review is lower compared to the internal agreement when using multiple reviewers. Based on our hierarchical Bayesian model, we can provide some insight into the internal agreement of peer review when using multiple peer reviewers. In Figure 16 we illustrate what happens with the expected MAD for a single paper. While the expected MAD for a single paper (with these specific parameters) is almost 3, this decreases to about 2 for two reviewers and then to about 1.7 and 1.5 for additional reviewers. Using at least two reviewers therefore seems to bring a relatively large benefit, and would be recommended, while using more reviewers shows diminishing marginal improvements in terms of MAD. To bring down the expected MAD from 3 to 2 requires only a single additional reviewer, but bringing it down to an MAD of 1 requires an additional six reviewers. This is in line with findings in the context of peer review of grant applications, where Forscher, Brauer et al. (2019) suggested that as many as 12 researchers would be needed. Note, however, that this theoretical example is based on the assumption that there is a paper with a fixed value ϕp = 1.5. In reality, we would not know the paper value, and we would infer such a paper value based on the review score(s). The uncertainty in this distribution of the paper value adds to the overall uncertainty and MAD. For that reason, it would be of interest to conduct experiments with more than two reviewers to validate these theoretical expectations. In addition, we now operationalize the value of a paper using a single dimension, while this could perhaps best be thought of as a multidimensional concept. The review scores provided separate evaluations of originality, methodological rigor and impact, and teasing out the different relations with these various dimensions would be of interest.
Figure 16. Illustration of the distribution of the average reviewer score when using one or more reviewers. The distribution of peer review scores is based on the hierarchical Bayesian model introduced earlier. We use the same parameters as in Figure 3, that is, σreview = 0.6 and a paper value ϕp = 1.5. The distribution of peer review scores using multiple reviewers is a convolution of the distribution for a single reviewer. The expected MAD is the expected absolute difference with the expected reviewer score.
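The diminishing returns from adding reviewers described above can be checked with a small Monte Carlo simulation. The sketch below draws reviewer scores on the continuous (latent) scale as lognormal noise around the paper value, which is a simplifying assumption on our part; because Figure 16 uses the discrete 3–30 scale, the absolute numbers will not match the figure exactly, but the pattern of a rapidly shrinking expected MAD that then levels off is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, sigma_review = 1.5, 0.6   # paper value and review spread, as in Figure 3
n_sim = 200_000                # number of simulated papers per setting

expected_score = np.exp(np.log(phi) + sigma_review**2 / 2)  # mean of the lognormal

for n_reviewers in (1, 2, 3, 4, 7):
    # Average the scores of n independent reviewers for each simulated paper.
    scores = rng.lognormal(mean=np.log(phi), sigma=sigma_review,
                           size=(n_sim, n_reviewers))
    mean_score = scores.mean(axis=1)
    # Expected MAD: mean absolute difference with the expected reviewer score.
    mad = np.abs(mean_score - expected_score).mean()
    print(f"{n_reviewers} reviewer(s): expected MAD = {mad:.2f}")
```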
Second, bibliometric indicators and peer review may show certain biases. That is, even if the overall level of agreement is comparable, indicators and peer review may show particular differences. For example, reviewers may be biased towards institutions with a higher reputation, and bibliometric indicators may favor certain types of methodologies or topics over others.
Third, our results are obtained in the specific context of the national Italian research assessment exercise. The assessment exercise, the selection of reviewers, and the publications submitted for evaluation are all specific to this context. Although we expect that our results are generalizable, and are in line with earlier observations about a low internal agreement of peer reviewers, the results may of course be different in different contexts. Some recent evidence in the context of the U.K. REF quantified the internal agreement of duplicate submissions (Thelwall, Kousha et al., 2022), but it is not directly clear how this compares with our results, especially because this concerns panel assessment, not independent reviewers. It would be valuable if the internal agreement of postpublication peer review were studied more extensively in other contexts as well.
Additionally, there may be other effects of using indicators instead of peer review. de Rijcke, Wouters et al. (2016) show that researchers may try to improve their evaluation, and they may seek to augment their bibliometric indicators, leading them to “think with indicators” (Müller & de Rijcke, 2017), instead of pursuing scientific research as they see fit. For example, it has been suggested that Italian authors try to improve their citation statistics by citing each other (Baccini, De Nicolao, & Petrovich, 2019). In a seminal study, Butler (2003) found that the Australian PBRFS stimulated the production of low-impact articles because only productivity was rewarded. This result has been questioned recently by van den Besselaar, Heyman, and Sandström (2017), who found that impact actually improved after the introduction of the Australian PBRFS. Regardless of the exact findings, it remains challenging to attribute causality to the possible effects of policies in PBRFS (Aagaard & Schneider, 2017). Our study does not address these other possible effects.
A separate question concerns the causal mechanism that is responsible for the agreement between bibliometric indicators and peer review. This causal mechanism is not entirely clear. A recent study showed that citations are causally influenced by the journal (Traag, 2021). Possibly, citations, journals, and peer review are all influenced by common underlying characteristics of the publication. However, it is also possible that the journal affects the peer review outcome so that a publication would have been evaluated differently had it been published elsewhere (Wilsdon et al., 2015). Similarly, it is also possible that citations might affect the peer review outcome. If bibliometric indicators affect the outcomes of peer review, there are a few possibilities. On the one hand, we might want to minimize the influence of bibliometric indicators so that “peer review” really reflects the expert’s view. On the other hand, such an influence may be desirable and may simply reflect the fact that “informed peer review” has been practiced. However, if reviewers do not do much more than translating bibliometric indicators into peer review outcomes, there may also be little added benefit to peer review in this context.
In conclusion, our results show that the low agreement between a single peer reviewer and a bibliometric indicator is not necessarily worse than the internal agreement between two peer reviewers. The low agreement in itself is therefore no reason to reject the use of bibliometric indicators. There are other arguments against using bibliometric indicators, such as potential biases and potential undesirable effects of using bibliometric indicators, and our results do not apply to these arguments. In addition, when two or more independent reviewers evaluate a publication, we may expect a higher internal agreement of peer review. Finally, our results are only applicable to a context in which a large set of publications are individually evaluated. Other funding systems might take a different approach altogether, and forego the large-scale evaluation of publications.
ACKNOWLEDGMENTS
The authors thank Ludo Waltman for his earlier contributions.
AUTHOR CONTRIBUTIONS
V. A. Traag: Conceptualization, Formal analysis, Methodology, Writing—original draft, Writing—review & editing. M. Malgarini: Conceptualization, Writing—original draft, Writing—review & editing. S. Sarlo: Conceptualization, Data curation, Writing—original draft, Writing—review & editing.
COMPETING INTERESTS
The last two authors are affiliated with ANVUR, the agency tasked with executing the VQR.
FUNDING INFORMATION
No funding was received for this research.
CODE AVAILABILITY
All code necessary to replicate the results in this analysis is available from Traag (2023).
DATA AVAILABILITY
All data necessary to replicate the results in this analysis are available from Traag, Malgarini, and Sarlo (2023).
Notes
See the official document of GEV2—Physics for more details; similar procedures were adopted in the other scientific areas: https://www.anvur.it/wp-content/uploads/2016/02/Criteria%20GEV%2002_English.pdf
The excluded research areas were: (8a) Architecture; (10) Ancient History, Philology, Literature & Art History; (11a) History, Philosophy, Pedagogy; (12) Law; and (14) Political & Social Sciences.
Note that due to some earlier misclassifications of submissions, the sample is slightly less than 10%.
In Mathematics & Computer Science, the MCQ indicator extracted from the MathSciNet database was also used for a limited number of papers.
Note that for Economics & Statistics, the percentile journal and percentile citation score were absent, so these scores reflect very poorly informed predictions of reviewer scores, with essentially very broad posterior prediction distribution of review scores around a mean review score of about 17.5.
REFERENCES
Author notes
Handling Editor: Vincent Larivière