Abstract
We study the citation dynamics of the papers published in three scientific disciplines (Physics, Economics, and Mathematics) and four broad scientific categories (Medical, Natural, Social Sciences, and Arts & Humanities). We measure the uncitedness ratio, namely, the fraction of uncited papers in these data sets and its dependence on the time following publication. These measurements are compared with a model of citation dynamics that considers acquiring citations as an inhomogeneous Poisson process. The model captures the fraction of uncited papers in our collections fairly well, suggesting that uncitedness is an inevitable consequence of the Poisson statistics.
1. INTRODUCTION
The problem of uncited papers became prominent with the launch of the Science Citation Index in 1964. De Solla Price (1965) conjectured that about 10% of all papers remain uncited in the long term. This early estimate proved too optimistic: Sugimoto and Larivière (2018) showed that the fraction of uncited papers is higher and domain specific. In particular, for papers published in 1990 and a citation window of 27 years, the uncitedness ratio ranged from 12% for Medical Sciences to 70% for Arts & Humanities.
The proper assessment of uncitedness is important for research policies (Garfield, 1991; Seglen, 1992; van Noorden, 2017). Information scientists have made a large contribution to the empirical characterization of the number and composition of uncited papers, studying how uncitedness depends on the discipline, document kind, country, and year (Dorta-Gonzalez, Suarez-Vega, & Dorta-Gonzalez, 2020; Hou & Ye, 2020; Thelwall, 2016a; van Leeuwen & Moed, 2005; Wallace, Larivière, & Gingras, 2009). The measurements of uncitedness have been recently reviewed by Nicolaisen and Frandsen (2019) and a summary of the subject has been presented by van Noorden (2017) and Sugimoto and Larivière (2018). It turns out that the uncitedness ratio, namely, the fraction of papers in a collection that remain uncited after a certain period, depends strongly on the length of this period. It is not even clear whether the uncitedness ratio achieves some limiting value over the long term.
Existing empirical models successfully predict the uncitedness ratio for a collection of papers during the first couple of years after publication, but fail to account for it over the long term. In particular, Burrell (2013), Egghe (2013), and Hsu and Huang (2012) analyzed the factors that determine the uncitedness ratio in a collection of papers and claimed a direct relation to the mean number of citations for this collection; van Leeuwen and Moed (2005) related the journal uncitedness ratio to the journal's impact factor, which is determined by the mean number of citations per paper garnered in the first 1–2 years after publication; and Wallace et al. (2009) demonstrated that the uncitedness ratio is strongly affected by the annual growth in the number of publications and in their reference list length. In addition, a comprehensive study by Thelwall (2016a) showed a relation between the uncitedness ratio and the shape of the citation distribution for a given collection. Thus, although several factors that affect the uncitedness ratio have been properly identified (the mean number of citations, the growing number of publications, the reference list length, and the shape of the citation distribution), each existing model of uncitedness covers only one or a few of these factors and only a short time window of a couple of years after publication. A comprehensive model that includes all these factors and predicts the uncitedness ratio in the long run has been missing.
Why do we need a better understanding of uncitedness? To answer a burning question: whether uncited papers are a burden to science or constitute an inherent part of the scientific enterprise (MacRoberts & MacRoberts, 2010; van Noorden, 2017). In other words, do uncited papers exert some influence or not? Seglen (1992) argued that uncitedness is the consequence of the mismatch between the number of publications and the number of references (because citation distributions are highly skewed, the total number of references is insufficient to provide citations for all papers); Wallace et al. (2009) and Burrell (2013) suggested that uncitedness is related to the Poisson statistics of citations. Both approaches converge on the point that uncitedness is an inevitable ingredient of the normal citation process. Thus, a consistent model of citation dynamics will account for uncited papers as well. Our objective is to validate this statement. Indeed, our recently developed model (Golosovsky, 2019, 2021; Golosovsky & Solomon, 2017) captures many attributes of the citation dynamics of research papers, including citation trajectories and citation distributions. In this study, we demonstrate that this model quantitatively captures the uncitedness ratio for three single disciplines and four broad scientific categories.
2. THE MODEL OF CITATION DYNAMICS AND THE UNCITEDNESS RATIO
We present here a short summary of the model of citation dynamics, focusing on the phenomenon of uncitedness. Consider a paper j that belongs to some scientific community. An author of a new publication may cite this paper after finding it in databases or scientific journals, or by following recommendations from colleagues or news portals. We call this a direct citation of paper j. An author of another new publication can find paper j in the reference lists of papers they have already selected and cite it as well. If paper j was placed into the reference list of a new publication as a result of this copying strategy, we call it an indirect citation.¹
Thus, in the context of uncitedness, our model reduces to the combination of the fitness model of Caldarelli, Capocci et al. (2002) and the aging model of Wallace et al. (2009).
3. MEASUREMENTS AND COMPARISON WITH MODEL
We measured citation distributions and citation dynamics of papers belonging to several collections, determined the corresponding model parameters, and verified to what extent the model captures the fraction of uncited papers in these collections.
3.1. Single Disciplines
Mathematics, Economics, and Physics papers published in 1984 were retrieved using the Clarivate Web of Science (WoS) database. We considered only articles, letters, and notes written in English and excluded non-English and low-circulation journals which contain many papers whose citations are not covered by the WoS, because, according to our protocol, these papers would be considered uncited. We also excluded reviews, as their citation careers are very different from those of ordinary research papers. The publication year 1984 was chosen in such a way as to provide a long citation window for the cited papers and sufficient coverage for the citing papers.
We analyzed citation trajectories of papers and the structure of their reference lists. These were compared to our stochastic model of citation dynamics (Golosovsky, 2019). The corresponding model parameters and functions, such as the average reference list length R0, the sum of the growth exponents (α + β), the aging function A(t), the average fitness η0, and the parameters that define indirect citations, were estimated for each discipline. We found that the sum of the growth exponents (α + β) is broadly consistent with the direct measurements of Sugimoto and Larivière (2018), who report ≈3% annual growth in the reference list length and ≈4% growth in the number of publications. Although the average reference list length R0 that we found is smaller than the actual reference list length, it should be noted that R0 counts only those references that can cite the given paper and that are included in the citation database. For WoS, these include research papers and exclude books, conference proceedings, etc. The fraction of such excluded documents in the reference lists of Physics papers is small; hence R0 for Physics matches our independent measurements. However, the fraction of books and conference proceedings in the reference lists of Economics and Mathematics papers is rather large, which is why the effective R0 for these disciplines is so small.
Figure 1 shows the measured and numerically simulated citation distributions for the three disciplines. The measured and simulated distributions are virtually indistinguishable.
As the model relates the uncitedness ratio to the mean number of direct citations Mdir(t), we determined the latter using Eq. 6. Figure 2(a) shows that the Mdir(t) dependences do not saturate even after 25 years. It is also instructive to compare Mdir(t) to M(t), the mean number of all citations for the same collection of papers. To this end, we introduced the average fraction of direct citations D(t) = Mdir(t)/M(t) and plotted it in Figure 2(b). Although shortly after publication a paper's citations are mostly direct (D ≈ 1), after 10–20 years the overall fraction of direct citations drops to D = 0.3–0.45, depending on the discipline.
Figure 3 shows the uncitedness ratio f0 and its dependence on the time after publication. For a citation window of 25 years, the uncitedness ratios for the Physics, Economics, and Mathematics papers are 7.1%, 14.7%, and 27.3%, respectively. Yet, these percentages are not final, because f0 continuously decreases with time and does not saturate even after 25 years. This slow but persistent decay is driven not only by the time elapsed since publication, but also by the growth in the number of publications and in the average reference list length R0 (see Eqs. 5 and 6).
The continuous lines in Figure 3 show the results of the numerical simulation. As all the model parameters were found from the analysis of cited papers, the very fact that the same model captures the fraction of uncited papers is significant and indicates that, at least for these disciplines, the cited and uncited papers are two sides of the same coin.
3.2. Broad Scientific Categories
We consider here the measurements of uncitedness for the papers published in four broad scientific categories in 1990 (Sugimoto & Larivière, 2018). To compare these data to Eqs. 5 and 6, we need to find the corresponding model parameters. In principle, this can be done by measuring the citation trajectories of papers and comparing them to the full stochastic model of citation dynamics (Golosovsky, 2019; Golosovsky & Solomon, 2017). Here, we did not perform this tedious task: For each category, we only measured the annual mean number of citations M(t) as a function of time and the citation distribution after 27 years. By fitting these two functions we determined all the model parameters.
Although M(t) can diverge with time, R(t) converges to R0 in the long run, namely, ∫_0^∞ r(t) dt = 1. Using this constraint, we found R0 and (α + β) from the measured M(t) dependences and Eq. 8. Figure 4(b) shows the corresponding r(t) dependences. They are remarkably similar, although not identical. To find the remaining parameters of citation dynamics, we used our full stochastic model and the previously found aging function A(t) (Golosovsky, 2021) and fitted the citation distributions for each category in the year 2017. We assumed a log-normal reduced fitness distribution and used its width σ as a fitting parameter. Figure 5 shows that the model accounts for the measured citation distributions,⁴ and Figure 6 demonstrates that the same model captures the uncitedness ratio as well.
3.3. Fitness Distribution
Figure 7 explores the sensitivity of the uncitedness ratio f0(t) to the functional shape of the fitness distribution. It shows the f0(Mdir) dependences, where the time after publication is an implicit parameter and Mdir(t) has been calculated according to Eq. 6. Equation 5 states that the function f0(Mdir) is simply the Laplace transform of the reduced fitness distribution ρ(η̃), where η̃ = η/η0 denotes the fitness normalized by its average. Figure 7(a) shows that the f0(Mdir) dependences for single disciplines correspond to a log-normal distribution with the shape factor σ ≈ 1.13, and Figure 7(b) shows that the data for the Medical, Natural, and Social Sciences correspond to log-normal distributions with σ = 1.3–1.4. Notably, for all disciplines and categories, the fitness distributions derived from the time dependence of the uncitedness ratio are the same as those found in the analysis of citation distributions (Figures 1 and 5).
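The Laplace-transform relation can be checked numerically. The sketch below is illustrative only and is not the authors' simulation code. It assumes that Eq. 5 has the form f0(Mdir) = ∫ ρ(η̃) exp(−η̃ Mdir) dη̃, that the reduced fitness η̃ is log-normal with unit mean (so that μ = −σ²/2), and that each paper's number of direct citations is Poisson distributed with mean η̃ Mdir. The function names, the sample size, and the value Mdir = 3 in the usage lines are hypothetical choices.

```python
import numpy as np


def uncited_fraction_lognormal(m_dir, sigma, n_nodes=80):
    """Laplace transform E[exp(-eta * m_dir)] of a unit-mean log-normal fitness.

    Evaluated by Gauss-Hermite quadrature; returns one value per entry of m_dir.
    """
    # Nodes and weights for integrals of the form  int exp(-x^2) g(x) dx
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    mu = -0.5 * sigma**2                      # assumption: E[eta] = 1
    eta = np.exp(mu + np.sqrt(2.0) * sigma * x)
    m = np.atleast_1d(np.asarray(m_dir, dtype=float))
    # Each row: Poisson zero-probability exp(-eta * m) at every quadrature node
    return np.exp(-np.outer(m, eta)) @ w / np.sqrt(np.pi)


def uncited_fraction_simulated(m_dir, sigma, n_papers=200_000, seed=0):
    """Monte Carlo check: draw fitness, draw Poisson citations, count zeros."""
    rng = np.random.default_rng(seed)
    eta = rng.lognormal(mean=-0.5 * sigma**2, sigma=sigma, size=n_papers)
    citations = rng.poisson(eta * m_dir)
    return float(np.mean(citations == 0))


if __name__ == "__main__":
    sigma = 1.13                              # shape factor quoted for single disciplines
    print(uncited_fraction_lognormal([1.0, 3.0, 10.0], sigma))
    print(uncited_fraction_simulated(3.0, sigma))
```

Under these assumptions the quadrature and the simulation should agree to within sampling noise, which is the sense in which the uncited fraction follows from Poisson statistics once the fitness distribution is fixed.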
4. COMPARISON TO EXISTING MODELS
In contrast to existing models, we assume a much more realistic scenario of the citation process, which takes into account that authors fill the reference lists of their papers by combining two strategies: random search (direct references) and “copying” from the reference lists of preselected papers (indirect references). When the perspective is shifted to cited papers, these strategies yield direct and indirect citations, respectively. Whereas Burrell (2013), Egghe (2013), and Hsu and Huang (2012) related the uncitedness ratio in a collection of papers published in one year to the mean number of all citations M(t), our model relates it to Mdir(t), the mean number of direct citations.
In summary, our approach to the problem of uncitedness builds on previous theoretical speculations but uses a more realistic scenario of the citation process. First, we replaced M(t) by Mdir(t). Second, we did not postulate any specific shape of the fitness distribution but determined it from the measured citation distributions. Figure 8 shows that the actual fitness distributions are very different from the exponential distribution that was postulated in previous theoretical studies (Burrell, 2013; Egghe, 2013) on an ad hoc basis.
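As a point of reference for the exponential case, the Laplace transform can be written in closed form. The worked equation below is a sketch under two assumptions that are not stated in this section: that Eq. 5 has the form f0(Mdir) = ∫ ρ(η̃) exp(−η̃ Mdir) dη̃, and that the exponential reduced-fitness distribution is normalized to unit mean, ρ(η̃) = exp(−η̃).

```latex
f_0(M_{\mathrm{dir}})
  = \int_0^{\infty} e^{-\tilde\eta}\, e^{-\tilde\eta M_{\mathrm{dir}}}\, d\tilde\eta
  = \frac{1}{1 + M_{\mathrm{dir}}}
```

A log-normal distribution has no comparable closed form, so its Laplace transform has to be evaluated numerically.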
5. DISCUSSION
Having achieved a quantitative understanding of the time-dependent uncitedness ratio, we now analyze it. A first question is why, given the same citation window, this ratio is discipline-specific. To answer this question, we note that Eq. 5 expresses the uncitedness ratio f0 through the reduced fitness distribution ρ(η̃) and the mean number of direct citations Mdir(t). Equation 6 shows that the latter is determined by the aging function A(t), the sum of the growth exponents (α + β), and the average reference list length R0. Of these factors and functions, the one that has the largest variability between the disciplines is R0. In particular, the disciplines with a long reference list, such as Medical and Natural Sciences, tend to have a relatively low uncitedness ratio, whereas the disciplines with a short reference list (Mathematics and Arts & Humanities) have a high uncitedness ratio. Equations 5 and 6 also explain the overall decline of uncitedness during the last century, as documented by Wallace et al. (2009). This can be attributed to the gradual increase of R0 and slower decay of A(t), as has been recently reported by Sinatra et al. (2015) for Physics.
This brings us to the question of why the fraction of uncited papers for some disciplines is so high. Our findings suggest that this is the consequence of the highly skewed fitness distribution, as has been previously conjectured by Seglen (1992). We found that this distribution can be approximated by a log-normal distribution. This distribution frequently occurs in nature as the result of a multiplicative random process and is usually associated with some hierarchical structure. The scientific publication network displays a strong hierarchy: There are breakthrough papers that set a new direction of research and initiate a cascade of follow-up papers, and the latter develop subdirections of this new research and generate new cascades of papers that deal with specialized topics, close the gaps, and tie up the loose ends. The breakthrough papers have high citation potential (fitness) because they are of great interest to a broad audience. The follow-up papers, which deal with more specific research questions, have lower fitness, not because of their quality but because they address a narrower circle of researchers. Thus it is quite natural that such a hierarchical structure of scientific publications, which results from the cascades of papers, is characterized by a log-normal fitness distribution. The width of this distribution hardly varies between the disciplines, because the research style is more or less uniform (a professor usually spends ∼10–15 years pursuing some research direction and this corresponds to two to three generations of graduate students).
The last question is whether there are papers with zero fitness, η = 0. To estimate the number of such uncitable papers, Thelwall (2016b) suggested using a zero-inflated log-normal distribution instead of a conventional log-normal distribution. We tried to fit our data using the zero-inflated fitness distribution (this requires an additional fitting parameter, the fraction of zero-fitness papers) and were unable to improve an already very good fit. We conclude that the overwhelming majority of uncited papers in our collections are characterized by finite fitness and have some chance to be cited, provided that there is enough time and that the decay of attention to old papers is sufficiently slow. Equation 4 indicates that this decay is captured by the factor A(t)e^{(α+β)t}. The aging function A(t) decays very slowly, so that the exponential factor e^{(α+β)t} can compensate for this decay. Indeed, Figure 3 demonstrates that although the uncitedness ratio for scientific papers decreases with time, it does not saturate even after 25 years. This lack of saturation implies that, in principle, f0 may reach a very small value over an extremely long term. This is the situation with patent citations, as the authors of a new patent are required to cite all patents relevant to their invention, however old they are. Thus, the aging function for patent citations decays more slowly than that for scientific papers. It is no surprise that for many categories of US patents the uncitedness ratio, in the long run, is only 2–4% (Gandal, Shur-Ofry et al., 2021), which is significantly lower than the uncitedness ratio for scientific papers.
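The difference between a conventional and a zero-inflated fitness distribution shows up in the long-term limit, and a short numerical sketch makes this visible. It is an illustration, not the fitting procedure used in this study. It assumes that f0 is the Laplace transform of the reduced fitness distribution evaluated at Mdir, that a fraction `p_zero` of papers has zero fitness, and that the remaining papers follow a log-normal distribution rescaled so that the overall mean reduced fitness stays equal to 1. The function name, the parameter `p_zero = 0.05`, and the Mdir values are hypothetical.

```python
import numpy as np


def uncited_fraction_zero_inflated(m_dir, sigma, p_zero,
                                   n_samples=100_000, seed=0):
    """Sample-average estimate of f0 for a zero-inflated log-normal fitness.

    p_zero is the assumed fraction of papers that can never be cited.
    """
    if not 0.0 <= p_zero < 1.0:
        raise ValueError("p_zero must lie in [0, 1)")
    rng = np.random.default_rng(seed)
    # Assumption: citable papers are rescaled so the overall mean fitness is 1
    mean_citable = 1.0 / (1.0 - p_zero)
    mu = np.log(mean_citable) - 0.5 * sigma**2
    eta = rng.lognormal(mean=mu, sigma=sigma, size=n_samples)
    m = np.atleast_1d(np.asarray(m_dir, dtype=float))
    # Poisson probability of zero citations, averaged over the citable papers
    laplace = np.exp(-np.outer(m, eta)).mean(axis=1)
    # Zero-fitness papers are uncited with certainty
    return p_zero + (1.0 - p_zero) * laplace


if __name__ == "__main__":
    m_values = [1.0, 10.0, 100.0, 1e4]
    print(uncited_fraction_zero_inflated(m_values, sigma=1.13, p_zero=0.0))
    print(uncited_fraction_zero_inflated(m_values, sigma=1.13, p_zero=0.05))
```

In this sketch the conventional log-normal curve keeps falling toward zero as Mdir grows, whereas the zero-inflated curve levels off at `p_zero`. A measured f0(t) that shows no sign of saturation is therefore compatible with a negligible fraction of zero-fitness papers.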
AUTHOR CONTRIBUTIONS
Michael Golosovsky: Conceptualization, Data Curation, Formal Analysis, Methodology, Writing. Vincent Larivière: Data Curation, Formal Analysis, Writing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
This research was not funded.
DATA AVAILABILITY
Citation distributions, the mean number of citations, and the fraction of uncited papers and its time dependence are available at https://doi.org/10.5281/zenodo.5014627.
Notes
1. A direct reference is an entry in the reference list of a publication that is not cited by any other reference there, while an indirect reference is an entry that is cited by one or more references in this list. When the perspective is shifted to the cited paper, these definitions correspond to direct and indirect citations.
2. The factor ηjA(t) corresponds to βI in Wallace et al. (2009).
3. The function r(t) is remarkably stable over time and does not vary much from discipline to discipline (Glanzel, 2004; Golosovsky & Solomon, 2017; Roth, Wu, & Lozano, 2012; Sinatra, Deville et al., 2015).
4. A broad scientific category aggregates several disciplines with dissimilar citing habits, namely, those with different average reference list lengths R0 and different average fitness η0. Because the reduced fitness η̃, the average fitness η0, and the average reference list length R0 do not appear in Eqs. 5 and 6 separately but as a product η0R0, our model can still be applied to such aggregated data sets. However, the reduced fitness distribution ρ(η̃) will be replaced with the joint distribution ρ(η̃, η0R0). It is no surprise that the latter turns out to be log-normal. For such a distribution, the variance of the factor η0R0 among the disciplines belonging to one category and the variance of η̃ are added together, so that ρ(η̃, η0R0) is broadened. This is why the fitness distribution for categories is somewhat broader than that found in our analysis of single disciplines.
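The broadening described in this note can be written out explicitly. The relation below is a sketch that assumes the factor η0R0 is itself log-normally distributed across the disciplines of a category and is independent of η̃. The symbols σ_cat, σ_η̃, and σ_{η0R0} are introduced here for illustration and denote the standard deviations of the logarithms.

```latex
\ln\!\left(\tilde\eta\,\eta_0 R_0\right) = \ln\tilde\eta + \ln\!\left(\eta_0 R_0\right)
\quad\Longrightarrow\quad
\sigma_{\mathrm{cat}}^{2} = \sigma_{\tilde\eta}^{2} + \sigma_{\eta_0 R_0}^{2}
```

Purely as an arithmetic illustration, if the within-discipline width were σ_η̃ ≈ 1.13 and the category width σ_cat = 1.3–1.4, the implied spread would be σ_{η0R0} = √(1.3² − 1.13²) ≈ 0.64 to √(1.4² − 1.13²) ≈ 0.83. These numbers follow only from the stated assumptions and were not measured in this study.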
REFERENCES
Author notes
Handling Editor: Ludo Waltman