## Abstract

Scholarly publications offer at least two benefits for the study of the scientific community as a social group. First, they attest to some form of relation between scientists (collaborations, mentoring, heritage, …), which is useful to determine and analyze social subgroups. Second, most of them are recorded in large databases that are easily accessible and include a lot of pertinent information, easing the quantitative and qualitative study of the scientific community. Understanding the underlying dynamics driving the creation of knowledge in general, and of scientific publication in particular, can contribute to maintaining a high level of research, by identifying good and bad practices in science. In this article, we aim to advance this understanding by a statistical analysis of publication within peer-reviewed journals. Namely, we show that the distribution of the number of papers published by an author in a given journal is heavy-tailed, but has a lighter tail than a power law. Interestingly, we demonstrate (both analytically and numerically) that such distributions match the result of a modified preferential attachment process, where, on top of a Barabási-Albert process, we take the finite career span of scientists into account.


## 1. INTRODUCTION

One of the core mechanisms in the practice of science is the self-examination of a field of research. The validation of a scientific result is always collective, in the sense that it has been scrutinized, criticized, and (hopefully) validated by a sufficient number of peers. Furthermore, any scientific result is permanently subject to new evaluation and might be replaced by more accurate work. At the level of a community, scientists are thus used to criticizing the work of colleagues and to having their own work criticized in return. It is therefore not surprising that some scientists started to study (and thus somehow critically assess) the scientific community itself (Price, 1963).

The quantitative study of the scientific community, sometimes referred to as *Science of Science* (Fortunato, Bergstrom et al., 2018; Narin, 1976; Price, 1976; van Raan, 2019), is a key step to unravel the underlying behaviors of its composing agents (authors, journals, institutions, etc.). Pioneered by the early works of Lotka (1926), the science of science gained a lot of momentum in the second half of the 20th century, with the creation of the first databases of scientific publications (Garfield, 1955; Merton, 1968; Price, 1965). More recently, scientometric investigations have been significantly eased by the emergence of large online databases of scientific publications (Web of Science, PubMed, arXiv, …) and the ever-increasing computational power of modern computers. These improvements have allowed the analysis of scientometric indicators on a larger scale (Frandsen & Nicolaisen, 2017; Wang & Waltman, 2016) and with finer resolution in terms of publication units (considering single articles instead of whole journals; e.g., Waltman & van Eck, 2012) and time (Newman, 2001; Egghe & Rousseau, 2000). For a clear historical overview of scientometrics, we refer to van Raan (2019).

The science of science has the potential to help maintain the quality of research, and is thus a good use of public funding. There is nowadays an increasing number of scientific papers (Bornmann & Mutz, 2015; Price, 1965), combined with the ubiquitous presence of *predatory journals*, which publish the papers they receive and charge publication fees, but without performing the fundamental editorial work that guarantees the papers’ quality (e.g., quality and pertinence check, referee process; Bohannon, 2013; Sorokowski, Kulczycki et al., 2017). In such a context, distinguishing bad practices from honest work in scientific publishing becomes more and more challenging. Understanding the underlying dynamics of scientific publication will be instrumental in this endeavor.

The fight against predatory publishing has benefited from the effort of many dedicated citizens, whose initiatives have shown their efficacy (Butler, 2013; Grudniewicz, Moher et al., 2019), as well as their limits (Beall, 2017). With regard to the proliferation of predatory journals, the task of identifying all of them unequivocally is overwhelming. In such a context, the ability to perform a preliminary data-based sanity check of a given journal would allow resources to be focused on the more problematic venues. However, such an approach requires an accurate understanding of the quantitative and qualitative characteristics of scientific journals, which is still scarce.

The quality of a scientist’s work is commonly quantified by two different, but related, measures, namely, their number of papers and the number of citations thereof (summarized in the *h*-index [Hirsch, 2005; Siudem, Żogała-Siudem et al., 2020]). The vast majority of investigations about the scientific publication process are focused on the citation side. These analyses mostly aim to describe how the citation network impacts the number of citations a given paper is (and therefore its authors are) likely to receive. In particular, evidence suggests that citations follow a *cumulative advantage* or *preferential attachment* process, where the more citations a scientist has, the more likely they are to get new citations (Price, 1976). This process leads to a power law (PL) distribution of citations (Eom & Fortunato, 2011; Waltman, van Eck, & van Raan, 2012) or other heavy-tailed distributions (Thelwall, 2016). Indeed, preferential attachment has been proven to lead to heavy-tailed distributions (Krapivsky, Redner, & Leyvraz, 2000), with some refinements to account for the lifetime of a paper (Parolo, Pan et al., 2015).

As early as 1926, Lotka showed that, in the field of chemistry, the number of scientists having published *N* papers is proportional to *N*^{−2} (Lotka, 1926). In other words, he showed that the distribution of the number of papers published by scientists follows a PL. Later on, the same analysis was extended to other fields of science (e.g., Barrios, Borrego et al., 2008; Gupta & Karisiddappa, 1996; Huber & Wagner-Döbler, 2001a, 2001b; Newby, Greenberg, & Jones, 2003; Pal, 2015; Sutter & Kocher, 2001; Wagner-Döbler & Berg, 1999) and refined to more elaborate distributions, such as the *power law with cutoff* (*PLwC*) (Kretschmer & Rousseau, 2001; Saam & Reiter, 1999; Smolinsky, 2017) or the *stretched exponential* distribution (Laherrère & Sornette, 1998). Despite this early start, the number of papers published by a scientist has been less investigated than the number of citations that a paper or a scientist gets.

With the objective of refining these past analyses, in this article we focus on the distribution of the number of papers published by scientists within a given peer-reviewed journal. The distribution of the number of papers is both easily accessible (through any scientific publication database) and informative. Indeed, various characteristics of the publication dynamics within a journal can be extracted from the aforementioned distribution. We illustrate this claim in the striking examples of *Physical Review Letters* and *Physical Review D*, shown in Figure 1, where the analysis of the distribution emphasizes an underlying *preferential attachment* dynamics; the finiteness of scientific careers; and the presence of (very) large groups of scientists in the related fields of physics (see the caption of Figure 1 for a detailed discussion).

As interestingly pointed out by Sekara, Deville et al. (2018), publishing in a peer-reviewed journal (especially in high-impact ones) is more likely if one author of the manuscript has already published in the same journal. Such a process can be interpreted as *preferential attachment*, and an expected outcome of such an observation is a high representation of a few authors in a given journal (Krapivsky et al., 2000). Furthermore, a scientist whose field of research is well aligned with a journal topic is likely to publish a large proportion of their work in this journal, leading again to high representation of a few specialized authors in a given journal.

The heavy-tailedness of the distribution of the number of papers is striking in the histograms (see Figures 1 and 2). Indeed, the tail of the histogram is stronger than the best exponential fit to the data (gray dotted line). However, as we show below, the famous *PL* is not a good fit to the data either, and the actual distribution lies somewhere between an exponential and a PL. In addition to our analysis of the distribution, we propose an adaptation of the preferential attachment law that models the evolution of the number of papers of a set of authors within a journal.

## 2. EMPIRICAL AND FITTED DISTRIBUTIONS

We consider an arbitrary selection of 14 peer-reviewed journals (Table 1), whose data are available in the Web of Science database (WoS, www.webofscience.com). The selected journals vary in age (from a few decades to more than a century) but are not too young, so that sufficiently many papers are available, and all of them are still publishing. Although the choice of journals is arbitrary and limited, we tried to cover a diversity of disciplines of the natural sciences and various time spans. The limited sample of journals does not allow us to claim any universality in our results, but we argue that it demonstrates the pertinence of our approach in the quantitative analysis of the scientific publication process.

**Table 1.**

Label | Journal name (reduction year) | # authors (reduction) |
---|---|---|
NAT | Nature* (1950) | 63,791 (3,374) |
PNA | Proceedings of the National Academy of Sciences of the USA** (1950) | 55,849 (2,495) |
SCI | Science* (1940) | 48,928 (4,788) |
LAN | The Lancet* (1910) | 33,416 (3,015) |
NEM | New England Journal of Medicine* (1950) | 27,078 (3,842) |
PLC | Plant Cell (2000) | 20,649 (4,712) |
ACS | Journal of the American Chemical Society* (1930) | 82,223 (5,301) |
TAC | IEEE Transactions on Automatic Control (2000) | 8,911 (3,603) |
ENE | Energy (2005) | 28,920 (4,491) |
CHA | Chaos | 7,409 |
SIA | SIAM Journal on Applied Mathematics | 6,106 |
AMA | Annals of Mathematics | 3,679 |
PRD | Physical Review D | 64,922 |
PRL | Physical Review Letters* | 90,993 |


We denote by 𝒥 = {NAT, PNA, …, PRL} the set of journals considered (see Table 1 for the list of labels). Within each journal *J* ∈ 𝒥, we index authors by an integer *i* = 1, …, $A_J^{\rm tot}$, where $A_J^{\rm tot}$ is the number of authors who published in journal *J*. Then, for each author *i* = 1, …, $A_J^{\rm tot}$, we count the number $n_i^J$ of papers published by author *i* in journal *J* up to year 2017 in the whole WoS database (that is, from year 1900 or the year of the journal’s creation, whichever is later). This process yields the data set 𝒟_{J} = {$n_i^J$ : *i* = 1, …, $A_J^{\rm tot}$}, which is a set of $A_J^{\rm tot}$ integers. We restrict our investigation to papers labeled as “*Article*” in the WoS database, to focus on peer-reviewed papers.

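From the data set 𝒟_{J}, we can compute the number and proportion of authors who published *n* papers,

$$A_J(n) = \#\{i : n_i^J = n\}, \qquad a_J(n) = \frac{A_J(n)}{A_J^{\rm tot}}, \tag{1}$$

which satisfies ∑_{n} *a*_{J}(*n*) = 1. The proportion *a*_{J} is represented on logarithmic scales in Figures 1, 2, and A.1, each panel corresponding to a different journal.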
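As a minimal sketch of how the number and proportion of authors with *n* papers can be tabulated, the snippet below assumes the data set is available as a plain list of per-author paper counts. The function name `paper_count_distribution` and the toy data are illustrative assumptions, not taken from the article.

```python
from collections import Counter


def paper_count_distribution(papers_per_author):
    """Return (number, proportion) of authors for each paper count n.

    `papers_per_author` is assumed to be a list with one integer per
    author, i.e., the data set D_J described in the text.
    """
    n_tot = len(papers_per_author)                 # total number of authors
    number = Counter(papers_per_author)            # authors with exactly n papers
    proportion = {n: c / n_tot for n, c in number.items()}
    return dict(number), proportion


# Toy data set (hypothetical): six authors with 1, 1, 1, 2, 2, and 5 papers.
number, proportion = paper_count_distribution([1, 1, 1, 2, 2, 5])
print(number)
print(proportion)
```

Plotting `proportion` against *n* on logarithmic axes would give a histogram of the kind described for Figures 1, 2, and A.1.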

**Remark.** Note that we did not take into account the fact that the different papers are co-signed by multiple authors. Consequently, different papers have different “weights” in the data set. We are mostly interested in the number of papers from the point of view of the authors; it is then adequate to count, for each author, the number of papers they signed, independently of the number of coauthors. Refining the analysis and taking into account the number of coauthors on each paper would be the purpose of future work.

Note also that we do not take into account papers published anonymously, which represent a large number of papers in medical journals in particular.

Finally, for some journals, the number of authors is too large to be downloaded from the WoS database. As a consequence, authors who have published only one or two papers in these journals have to be removed from the data (e.g., NAT, PNA, or SCI, indicated by asterisks in Table 1).

### 2.1. Distribution Fitting

Because of the apparent heavy-tailedness of the distribution, it is tempting to fit a PL. However, as pointed out by Clauset, Shalizi, and Newman (2009), such fitting should be done with care in order to avoid spurious conclusions (Broido & Clauset, 2019). We therefore fit three heavy-tailed distributions and assess the goodness-of-fit of our fitting following Clauset et al. (2009), which is encoded in a *p*-value. Numerical results are summarized in Table 2.

**Table 2.**

Journal | PL: α | PL: p (%) | PLwC: β | PLwC: γ | PLwC: p (%) | Y-S: ρ | Y-S: p (%) |
---|---|---|---|---|---|---|---|
NAT | 2.58 | 0.0 | 2.11 | 0.07 | 0.0 | 3.10 | 0.0 |
PNA | 2.53 | 0.0 | 2.30 | 0.02 | 0.0 | 2.83 | 0.0 |
SCI | 2.68 | 0.0 | 2.30 | 0.06 | 16.64 | 3.28 | 0.02 |
LAN | 2.47 | 0.0 | 2.09 | 0.05 | 0.18 | 2.90 | 0.0 |
NEM | 2.76 | 0.0 | 2.36 | 0.07 | 0.2 | 3.43 | 8.82 |
PLC | 2.30 | 0.0 | 1.92 | 0.10 | 13.42 | 3.01 | 0.92 |
ACS | 2.11 | 0.0 | 1.95 | 0.01 | 0.0 | 2.32 | 0.0 |
TAC | 2.08 | 0.0 | 1.84 | 0.04 | 0.0 | 2.51 | 0.02 |
ENE | 2.36 | 0.0 | 2.12 | 0.06 | 0.12 | 3.15 | 0.0 |
CHA | 2.47 | 0.0 | 2.28 | 0.05 | 80.84 | 3.43 | 0.0 |
SIA | 2.49 | 0.0 | 2.20 | 0.08 | 2.24 | 3.49 | 9.06 |
AMA | 2.26 | 0.0 | 1.72 | 0.14 | 0.18 | 2.95 | 0.0 |
PRD | 1.49 | 0.0 | 1.24 | 0.005 | 0.02 | 1.55 | 0.0 |
PRL | 1.73 | 0.0 | 1.52 | 0.005 | 0.12 | 1.80 | 0.0 |


For each empirical distribution of the number of papers published by an author *i* in journal *J*, we fit an exponential distribution (gray dotted lines in Figures 1 and 2) to emphasize its heavy-tailed behavior. The three heavy-tailed distributions that we fit are

- A *PL distribution* (black dashed lines in the figures),

  $$P_{\rm pl}(n_i^J = n \mid \alpha) = C_\alpha\, n^{-\alpha}, \tag{2}$$

  with *α* > 1 and *C*_{α} ∈ ℝ normalizing the distribution;
- A *PLwC* (black dash-dotted lines in the figures),

  $$P_{\rm plc}(n_i^J = n \mid \beta, \gamma) = C_{\beta,\gamma}\, n^{-\beta} e^{-\gamma n}, \tag{3}$$

  with *β* > 1, *γ* > 0, and normalizing constant *C*_{β,γ} ∈ ℝ; and
- A *Yule-Simon distribution* (black dotted lines in the figures),

  $$P_{\rm ys}(n_i^J = n \mid \rho) = C_\rho\, (\rho - 1)\, \mathrm{B}(n, \rho), \tag{4}$$

  with *ρ* > 0, where *C*_{ρ} ∈ ℝ is the normalizing constant and B(*x*, *y*) is the *Euler beta function*.

We perform the distribution fitting by optimizing the parameters *α*, *β*, *γ*, and *ρ* with a Maximum Likelihood Estimator (MLE; Clauset et al., 2009). The curves of the fitted distributions are plotted in Figures 1, 2, and A.1, and the fitted parameters are given in Table 2. Other distributions (such as log-normal, Lévy, Weibull) were tested and discarded because they were far from matching the data.
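To make the maximum likelihood step concrete, here is a minimal sketch for the PL case only, written with the Python standard library. It is not the fitting code used in the article: the function names, the golden-section search, the search interval for *α*, and the truncated sampler used for the demonstration are all assumptions made for illustration. The normalizing constant is taken as the inverse of the zeta-type sum over *n* ≥ `n_min`.

```python
import math
import random


def hurwitz_zeta(s, n_min=1, n_terms=1000):
    """Approximate sum_{n >= n_min} n**(-s) for s > 1.

    Direct sum of the first terms plus an Euler-Maclaurin tail correction.
    """
    m = n_min + n_terms
    head = sum(n ** (-s) for n in range(n_min, m))
    tail = m ** (1 - s) / (s - 1) + 0.5 * m ** (-s) + s * m ** (-s - 1) / 12
    return head + tail


def fit_power_law(data, n_min=1, lo=1.01, hi=6.0, tol=1e-6):
    """MLE of alpha for P(n) = C n**(-alpha), n >= n_min (golden-section search)."""
    size = len(data)
    sum_log = sum(math.log(x) for x in data)

    def neg_log_lik(alpha):
        # Minus log-likelihood; convex in alpha, so it has a single minimum.
        return size * math.log(hurwitz_zeta(alpha, n_min)) + alpha * sum_log

    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if neg_log_lik(c) < neg_log_lik(d):
            b = d
        else:
            a = c
    return (a + b) / 2


def sample_power_law(alpha, size, n_max=10_000, seed=0):
    """Draw integers from a PL truncated at n_max (demo helper, an assumption)."""
    rng = random.Random(seed)
    support = range(1, n_max + 1)
    weights = [n ** (-alpha) for n in support]
    return rng.choices(support, weights=weights, k=size)


# Recover a known exponent from synthetic data.
data = sample_power_law(2.5, 20_000)
print("estimated alpha:", fit_power_law(data))
```

The PLwC and Yule-Simon fits would follow the same pattern with their own likelihoods; the PLwC has two parameters and therefore needs a two-dimensional optimizer instead of the one-dimensional search used here.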

### 2.2. Goodness of Fit

To evaluate the goodness of our fits, we again follow Clauset et al. (2009), to which we refer for an in-depth discussion of heavy-tailed distribution fitting. The whole goodness-of-fit estimation is summarized in Figure 3.

Let us denote by *θ*_{J} the parameters of the distribution *P*(*X*; *θ*) (e.g., *θ*_{J} = *α* for the PL distribution), fitted to the data set 𝒟_{J}. We generate 5,000 sets of synthetic data $\tilde{\mathcal{D}}_i$, *i* = 1, …, 5,000, each of them composed of $A_J^{\rm tot}$ = |𝒟_{J}| integers, drawn randomly from the probability distribution *P*_{J} = *P*(*X*; *θ*_{J}). For each of these synthetic data sets $\tilde{\mathcal{D}}_i$, we again perform an MLE to fit the same distribution *P*(*X*; *θ*), yielding parameters $\tilde{\theta}_i$ and the distribution *P*_{i} = *P*(*X*; $\tilde{\theta}_i$).

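We then quantify how well *F*^{e}, the empirical cumulative distribution function (ECDF) for a given set of data, matches *F*^{t}, the theoretical cumulative distribution function (TCDF) of its fitted distribution. We define

$$d_i = d_{\rm KS}(F_i^{e}, F_i^{t}), \qquad d_J = d_{\rm KS}(F_J^{e}, F_J^{t}), \tag{5}$$

where $F_i^{e}$ and $F_i^{t}$ are the ECDF of $\tilde{\mathcal{D}}_i$ and the TCDF of *P*_{i}, and $F_J^{e}$ and $F_J^{t}$ are the ECDF of 𝒟_{J} and the TCDF of *P*_{J}.

The *p*-value of the goodness-of-fit is then given by

$$p = \frac{\#\{i : d_i > d_J\}}{5{,}000}, \tag{6}$$

where the *Kolmogorov-Smirnov distance* between two cumulative distribution functions *F*_{1} and *F*_{2} is defined as the maximum difference between them:

$$d_{\rm KS}(F_1, F_2) = \max_x |F_1(x) - F_2(x)|. \tag{7}$$

In other words, *p* is the proportion of synthetic data sets that are further from the theoretical distribution (in the Kolmogorov-Smirnov sense) than the analyzed data set. The fit is rejected if *p* < 5%, and considered as *good* otherwise (see Clauset et al. (2009) for more details).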
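The goodness-of-fit procedure described above can be sketched generically as follows. The routine takes the fitting, CDF, and sampling functions as arguments, so it does not depend on the distribution family. For brevity, the demonstration uses a geometric distribution, which has a closed-form MLE but is not one of the families fitted in the article. All function names, the default of 200 synthetic data sets (instead of 5,000), and the toy data are assumptions made for illustration.

```python
import math
import random


def ks_distance(data, cdf):
    """Maximum |ECDF - CDF| over the integers 1..max(data)."""
    size = len(data)
    counts = {}
    for x in data:
        counts[x] = counts.get(x, 0) + 1
    dist, cum = 0.0, 0
    for x in range(1, max(data) + 1):
        cum += counts.get(x, 0)
        dist = max(dist, abs(cum / size - cdf(x)))
    return dist


def bootstrap_p_value(data, fit, make_cdf, sample, n_synth=200, seed=0):
    """Proportion of synthetic data sets further from their own fit than the data."""
    rng = random.Random(seed)
    theta = fit(data)
    d_data = ks_distance(data, make_cdf(theta))
    further = 0
    for _ in range(n_synth):
        synth = sample(theta, len(data), rng)      # draw from the fitted law
        theta_i = fit(synth)                       # refit on the synthetic set
        if ks_distance(synth, make_cdf(theta_i)) > d_data:
            further += 1
    return further / n_synth


# Stand-in family for the demo: geometric distribution on {1, 2, ...}.
def fit_geometric(data):
    return len(data) / sum(data)                   # MLE: 1 / mean


def geometric_cdf(p):
    return lambda x: 1.0 - (1.0 - p) ** x


def sample_geometric(p, size, rng):
    log_q = math.log(1.0 - p)
    return [max(1, math.ceil(math.log(1.0 - rng.random()) / log_q))
            for _ in range(size)]


# Toy data that are clearly not geometric: 900 authors with 1 paper, 100 with 50.
toy = [1] * 900 + [50] * 100
p_value = bootstrap_p_value(toy, fit_geometric, geometric_cdf, sample_geometric)
print("p-value:", p_value, "-> rejected" if p_value < 0.05 else "-> not rejected")
```

Passing a PL, PLwC, or Yule-Simon fitter, CDF, and sampler instead of the geometric ones would reproduce the structure of the test summarized in Table 2.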

This goodness-of-fit estimation is performed for each journal *J* ∈ 𝒥 and each distribution listed above (PL, PLwC, and Yule-Simon). The results are presented in Table 2 and the resulting distributions together with the data are shown in Figures 1, 2, and A.1.

As can be seen in Figures 1, 2, and A.1, the PL distribution is a poor fit for all data, its *p*-value being zero for all journals. Indeed, for most of the journals, the tail of the data set is lighter than the tail of its PL fit (black dashed lines). For three journals (SCI, PLC, CHA), the *p*-value of the PLwC is larger than 5% and it seems to be a rather good fit, and for two others (NEM and SIA), the Yule-Simon distribution cannot be excluded.

## 3. GENERAL DYNAMICS

We argue that the heavy-tailedness observed in the previous section is likely to be a consequence of a *preferential attachment* or *cumulative advantage* process. Many social processes are ruled by so-called preferential attachment (Jeong, Néda, & Barabási, 2003), also called *cumulative advantage*. Scientific coauthorship (Barabási, Jeong et al., 2002), citations (Eom & Fortunato, 2011; Price, 1976), and performance of scientific institutions (van Raan, 2007) are apparently no exception to the rule. For instance, according to Eom and Fortunato (2011), the probability that a paper will get a new citation at time *t* is proportional to the number of citations this paper already has at time *t*.

Such processes naturally lead to PLs in the relations between characteristics of the systems of interest. For instance, Katz (1999) showed that the number of citations a scientific community gets is a PL of the number of publications in this community, with positive exponent (≈ 1.27). More recently, Bettencourt, Lobo et al. (2010) illustrated that the *Gross Metropolitan Product* of a city is a PL of its population, with positive exponent (≈ 1.126). In a similar spirit, Barabási and Albert (1999) showed that the empirical probability that a web page is targeted by *k* other pages follows a PL with negative exponent (≈ −2.1).

It is reasonable to expect that the evolution of the number of papers published by an author in a given journal is described by a similar preferential attachment process. We support the hypothesis of a preferential attachment or cumulative advantage process with two distinct but similar analyses of publication data.

**Remark.** Notice that even though we refer to the two analyses below as *preferential attachment* and *cumulative advantage*, respectively, these two denominations fundamentally refer to the same general process (Perc, 2014). The main reason for us to use these two denominations is to distinguish the two analyses. Furthermore, the line of reasoning underlying each of our analyses is inspired by the definition of the corresponding notion (“preferential attachment” or “cumulative advantage”).

### 3.1. Preferential Attachment

Heuristically, our first argument is that if an author published a lot of papers in a journal, it means (a) that they write a lot of papers and (b) that their research topic is well aligned with the scope of the journal (for specialized journals), or that the scientific impact of this author’s research matches the standards of the journal (for interdisciplinary journals). Assumptions (a) and (b) together imply that this author is likely to publish again in this journal. We refer to this process as *preferential attachment*.

The above heuristic can be made more rigorous. For a given journal and for *k*, *t* ∈ ℤ_{≥0}, we define

- 𝒮(*k*, *t*): the set of all authors who have published *k* papers on December 31 of year *t* − 1;
- *A*_{k}(*t*) = #𝒮(*k*, *t*): the number of authors in the set 𝒮(*k*, *t*);
- *N*_{k}(*t*): the number of papers published during year *t* by all the authors in the set 𝒮(*k*, *t*); and
- *ρ*_{k}(*t*) = *N*_{k}(*t*)/*A*_{k}(*t*) ∈ ℝ: the average number of papers published during year *t* by the authors in the set 𝒮(*k*, *t*).

Figure 4 shows the values of *ρ*_{k}(*t*) with respect to the number of papers *k* for years *t* ∈ {1999, …, 2008} for SCI, LAN, and PRL (each point corresponds to one year *t* and one number of papers *k*). For each of the three journals, these values have a linear correlation coefficient larger than 0.7, supporting a fairly good linear dependence,

$$\rho_k(t) \propto k. \tag{8}$$

The empirical probability that a new paper is signed by an author with *k* papers is then close to being proportional to *k*. Krapivsky et al. (2000) rigorously proved that, if the relation in Eq. 8 were exactly proportional, then after a long enough time, the distribution of the number of papers over the set of authors would be a PL with exponent −*α* ≤ −2 (i.e., *α* ≥ 2 in the notation of Eq. 2). The fact that the relation in Eq. 8 is not exactly proportional, but close to it, probably explains why the observed distributions have tails that are heavy, but lighter than the PL, as suggested in Figures 1 and 2.
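The average publication rate per group of authors, *ρ*_{k}(*t*), is straightforward to compute once the publication records are available. The sketch below assumes the records are given as a list of `(author, year)` pairs, one per authorship; this input format and the function name are assumptions, not the data format of the WoS export. Authors with no paper before year *t* are left out, which is a choice made here for simplicity.

```python
from collections import Counter, defaultdict


def average_rate_by_count(records, year):
    """Return {k: rho_k(year)} from (author, year) authorship records.

    For each k >= 1, rho_k is the number of papers published during `year`
    by authors who had exactly k papers at the end of `year - 1`, divided
    by the number of such authors.
    """
    before = Counter(a for a, y in records if y < year)    # k for each author
    during = Counter(a for a, y in records if y == year)   # papers in `year`
    n_authors = Counter(before.values())                   # A_k(year)
    n_papers = defaultdict(int)                            # N_k(year)
    for author, k in before.items():
        n_papers[k] += during[author]
    return {k: n_papers[k] / n_authors[k] for k in n_authors}


# Toy records (hypothetical authors "a", "b", "c").
records = [("a", 2000), ("a", 2000), ("b", 2000),
           ("a", 2001), ("c", 2001), ("b", 2001), ("b", 2001)]
print(average_rate_by_count(records, 2001))
```

Collecting these values over several years and regressing them on *k* gives the kind of scatter plot and correlation coefficient discussed above.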

### 3.2. Cumulative Advantage

The concept of *cumulative advantage*, which is directly related to preferential attachment, has been derived from the seminal work of Merton (1968, 1988) and Price (1976), and the follow-up by Katz (1999). Cumulative advantage emphasizes that an initial advantage leads to a disproportionate advantage in the future. For instance, it has been shown that, if author *i* has twice as many publications as author *j*, then they are likely to get more than twice as many citations (Katz, 1999).

Assume that author *i* and author *j* have respectively *n*_{i}(*t*_{0}) and *n*_{j}(*t*_{0}) papers in a journal at time *t*_{0}, with a ratio *η*_{ij}(*t*_{0}) = *n*_{i}(*t*_{0})/*n*_{j}(*t*_{0}) > 1. Then cumulative advantage means that, at a later time *t*_{1} > *t*_{0}, the ratio *η*_{ij}(*t*_{1}) ≥ *η*_{ij}(*t*_{0}), implying that author *i* gains a disproportional advantage over time. Mathematically speaking, cumulative advantage implies the following equivalences:

$$\eta_{ij}(t_1) \geq \eta_{ij}(t_0) \iff \frac{n_i(t_1)}{n_j(t_1)} \geq \frac{n_i(t_0)}{n_j(t_0)} \iff \xi_i(t_0, t_1) \geq \xi_j(t_0, t_1), \tag{9}$$

where *ξ*_{i}(*t*, *s*) = *n*_{i}(*s*)/*n*_{i}(*t*), and where equalities hold if the relation in Eq. 8 is exact.

To support the presence of a cumulative advantage in the publication within the journals SCI, LAN, and PRL, we computed *ξ*_{i}(1999, 2008) for each author who published between 1999 and 2008. The statistics of *ξ*_{i} are shown in Figure 5 as a function of the initial number of papers *n*_{i}(1999). Even though the data are not perfectly conclusive, we observe an increasing trend of *ξ*_{i} as a function of *n*_{i}, suggesting that the relation of Eq. 9 may be satisfied. This observation supports (at least partly) a cumulative advantage process, and hence the presence of a PL.
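The growth ratio *ξ*_{i}(*t*_{0}, *t*_{1}) = *n*_{i}(*t*_{1})/*n*_{i}(*t*_{0}) can be tabulated against the initial number of papers with a few lines of code. The sketch below again assumes `(author, year)` authorship records, takes *n*_{i}(*t*) as the number of papers published up to and including year *t*, and keeps only authors with at least one paper at *t*_{0}. These conventions and the function name are assumptions for illustration and may differ from the selection used for Figure 5.

```python
from collections import Counter, defaultdict


def mean_growth_by_initial_count(records, t0, t1):
    """Return {n_i(t0): mean of xi_i(t0, t1)} over authors with n_i(t0) >= 1."""
    n0 = Counter(a for a, y in records if y <= t0)   # papers up to t0
    n1 = Counter(a for a, y in records if y <= t1)   # papers up to t1
    ratios = defaultdict(list)
    for author, start in n0.items():
        ratios[start].append(n1[author] / start)     # xi_i(t0, t1)
    return {k: sum(v) / len(v) for k, v in ratios.items()}


# Toy records (hypothetical): "a" goes from 2 to 6 papers, "b" from 1 to 2,
# "c" stays at 1, and "d" starts after t0 and is therefore ignored.
records = [("a", 1998), ("a", 1999), ("a", 2001), ("a", 2003), ("a", 2005),
           ("a", 2008), ("b", 1999), ("b", 2004), ("c", 1997), ("d", 2002)]
print(mean_growth_by_initial_count(records, 1999, 2008))
```

An increasing mean ratio with the initial count, as in this toy example, is the signature of cumulative advantage discussed in the text.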

The increasing trends in Figure 5 even suggest a superlinear cumulative advantage (Krapivsky & Krioukov, 2008; Zhou, Wang et al., 2007). Indeed, as mentioned above, if the relation in Eq. 8 were exact, *ξ*_{i}(*t*_{0}, *t*_{1}) would be constant with respect to *n*_{i}(*t*_{0}). In the superlinear case, the heavy-tailed distribution observed in Figures 1, 2, and A.1 would be the transient state of the distribution discussed by Krapivsky and Krioukov (2008). A more in-depth analysis of the possibility of a superlinear cumulative advantage could be done, following the calibration approach proposed by Zadorozhnyi and Yudin (2015), but it goes beyond the purpose of this article and will be treated in future work.

## 4. KEY PLAYERS

The general distribution of the number of papers per author is quite clear in our analysis: It seems to be somewhere between an exponential distribution and a PL. Because the PL has the heaviest tail of the three distributions considered (PL, PLwC, and Yule-Simon), we use it to estimate an upper bound on the number of papers published by an author for each journal. Assuming that the data are well described by the PL distribution in Eq. 2, one can compute the number of authors with *n* papers in journal *J*, $A_n \approx A_J^{\rm tot} C_\alpha n^{-\alpha}$. Setting this number to *A*_{n} = 1, the maximum number of papers is given by $n_{\max} \approx (A_J^{\rm tot} C_\alpha)^{1/\alpha}$, determining a theoretical upper bound on the number of papers published by an author for each journal, shown as the vertical dashed lines in Figures 1, 2, and A.1.
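As a rough numerical illustration of this bound, the sketch below evaluates the maximum number of papers for one journal. It assumes that the PL is normalized over all *n* ≥ 1, so that the normalizing constant is the inverse of the Riemann zeta function at *α*; this is an assumption that would not hold for the journals whose low-count authors were removed. The function names are illustrative, and the input values for CHA are read from Tables 1 and 2.

```python
def zeta(s, n_terms=1000):
    """Approximate the Riemann zeta function for s > 1.

    Direct sum of the first terms plus an Euler-Maclaurin tail correction.
    """
    m = 1 + n_terms
    head = sum(n ** (-s) for n in range(1, m))
    tail = m ** (1 - s) / (s - 1) + 0.5 * m ** (-s) + s * m ** (-s - 1) / 12
    return head + tail


def max_papers_upper_bound(n_authors, alpha):
    """Solve n_authors * C_alpha * n**(-alpha) = 1 for n, with C_alpha = 1/zeta(alpha)."""
    c_alpha = 1.0 / zeta(alpha)
    return (n_authors * c_alpha) ** (1.0 / alpha)


# CHA: 7,409 authors (Table 1) and fitted exponent alpha = 2.47 (Table 2).
print("upper bound for CHA:", max_papers_upper_bound(7409, 2.47))
```

Authors whose paper count lies well above this value are the ones that stand out to the right of the vertical dashed lines in the figures.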

In some journals (see, e.g., PNA, CHA, SIA, and AMA in Figure 2, and NEM and ACS in Figure A.1), it appears that some authors, whom we refer to as *key players*, publish significantly more papers in a journal than the PL would predict. Note that we checked that these key players are not artifacts due to multiple authors having the same name, who would be counted as the same person.

To make the data of different journals more comparable, we restricted our investigation to the early years between 1900 (earliest possible in WoS) and the year in parentheses in the second column of Table 1 for the first nine journals in the table. This yields a number of authors comparable to the three following journals in Table 1 (CHA, SIA, and AMA). The reduced number of authors is given in parentheses in the third column of Table 1. The resulting distributions are depicted in Figure 6 and in Figure A.2, and the fitted parameters are detailed in Table 3. It appears from Figures 6 and A.2 that for such a reduced number of authors, the overshoot of some authors is more systematic, suggesting that in the early years of scientific journals, there are usually a few very prolific authors publishing in them at a rather high rate.

**Table 3.**

Journal | PL: α | PL: p (%) | PLwC: β | PLwC: γ | PLwC: p (%) | Y-S: ρ | Y-S: p (%) |
---|---|---|---|---|---|---|---|
NAT | 2.32 | 29.4 | 2.23 | 0.016 | 6.0 | 2.98 | 0.0 |
PNA | 2.10 | 0.1 | 1.96 | 0.02 | 15.0 | 2.55 | 6.3 |
SCI | 2.44 | 0.0 | 2.13 | 0.09 | 72.0 | 3.37 | 4.7 |
LAN | 2.25 | 0.0 | 1.81 | 0.11 | 30.2 | 2.91 | 2.5 |
NEM | 2.27 | 0.9 | 2.06 | 0.04 | 4.4 | 2.91 | 0.0 |
PLC | 2.59 | 0.0 | 2.12 | 0.16 | 0.3 | 3.82 | 54.7 |
ACS | 2.06 | 0.0 | 1.89 | 0.02 | 0.1 | 2.46 | 64.0 |
TAC | 2.32 | 0.0 | 2.06 | 0.06 | 23.7 | 3.04 | 0.1 |
ENE | 2.69 | 0.8 | 2.50 | 0.06 | 94.5 | 4.06 | 0.0 |


Considering the results of the fitting in Table 3, we observe better agreement than for the full data sets. This probably indicates that the sample size is not large enough to accurately fit heavy-tailed distributions, which need large samples. The fact that NAT and PNA are each well fitted by two distributions also indicates that the reduced data sets are not large enough to be conclusive.

## 5. MODELING

We observe in Figures 1, 2, and A.1 that for old journals where a lot of papers are published, the tail of the histogram has a rather fast decay after a heavy-tailed regime (this is particularly striking in PRL and PRD, Figure 1). We explain this observation by the fact that the number of publications of a given author depends on two parameters: their publication rate and the length of their career. Both these quantities are bounded in practice, and even if it is possible to publish a very large number of papers in a given journal, there is a practical limit to this number. We hypothesize that the decay in the histograms of long-living journals comes from the finiteness of publication rates and career lengths.

To support our hypothesis, we propose a model to generate data sets that mimic the distributions observed above. As discussed, this model is built on two main dynamics. Fundamentally, it is a *preferential attachment* process, where the likelihood that a researcher is in the author list of a new paper is proportional to the number of papers this researcher already has in this journal. In addition, it is refined with a *limited career span*, requiring that after some time, the likelihood that a researcher publishes a new paper decreases and reaches zero when they retire.

The model is based on five parameters:

- *N*_{y} ∈ ℤ_{≥0}: The number of years (i.e., number of iterations) over which the model is run;
- *N*_{p} ∈ ℤ_{≥0}: The number of papers that are published every year in the synthetic journal;
- *ρ*_{0} ∈ [0, 1]: The proportion of papers that are authored by new researchers who have not yet published in the synthetic journal; and
- *T*_{min}, *T*_{max} ∈ ℤ_{≥0}: The likelihood that an author publishes a new paper decreases linearly after their *T*_{min}th year of activity, until reaching zero at their *T*_{max}th year of activity. We illustrate this likelihood in Figure 7.

The model is arbitrarily initialized with some number of authors each with a few papers in the synthetic journal, gathered in the data set 𝒟(0) = {*n*_{1}(0), *n*_{2}(0), …, *n*_{A(0)}(0)}. Then for each year *t* ∈ {1, …, *N*_{y}} where the model is run, *N*_{p} papers are attributed randomly either to new authors (i.e., who have not yet published) with probability *ρ*_{0}, or to an existing author with probability 1 − *ρ*_{0}. If it is attributed to an existing author, the probability that it is attributed to author *i* is:

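- proportional to *n*_{i}(*t*), the number of papers published by *i* at year *t*; and
- linearly decreasing for *T*_{i}(*t*) ∈ [*T*_{min}, *T*_{max}], where *T*_{i}(*t*) is the “*academic age*” of *i*, which is the number of iterations between *t* and the first publication year of *i*.

The probability that a paper is attributed to author *i* at year *t* is given by

$$p_i(t) = \frac{n_i(t)}{Z(t)} \times \begin{cases} 1, & T_i(t) \leq T_{\min}, \\ \dfrac{T_{\max} - T_i(t)}{T_{\max} - T_{\min}}, & T_{\min} < T_i(t) < T_{\max}, \\ 0, & T_i(t) \geq T_{\max}, \end{cases} \tag{10}$$

where *Z*(*t*) is the appropriate normalizing factor. The actual implementation of this model is available online (Delabays, 2022).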
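To give a feel for how such a model can be run, here is a minimal standard-library sketch. It is not the implementation referenced above: the function names, the initialization with ten authors holding one paper each, the choice to freeze the attachment weights at the start of each year, and the parameter values in the demonstration are all assumptions made for illustration.

```python
import random


def career_factor(age, t_min, t_max):
    """Likelihood multiplier: 1 up to t_min, then linear decay to 0 at t_max."""
    if age <= t_min:
        return 1.0
    if age >= t_max:
        return 0.0
    return (t_max - age) / (t_max - t_min)


def simulate_journal(n_years, n_papers, rho0, t_min, t_max, n_init=10, seed=0):
    """Return the list of paper counts per author after n_years iterations."""
    rng = random.Random(seed)
    counts = [1] * n_init        # papers per author (assumed initial state)
    first_year = [0] * n_init    # year of first publication per author
    for t in range(1, n_years + 1):
        # Assumption: weights are computed once, at the start of the year.
        weights = [n * career_factor(t - y0, t_min, t_max)
                   for n, y0 in zip(counts, first_year)]
        n_existing = len(counts)
        if sum(weights) > 0:
            # Each paper goes to a new author with probability rho0.
            n_new = sum(rng.random() < rho0 for _ in range(n_papers))
        else:
            n_new = n_papers     # every existing author has retired
        n_old = n_papers - n_new
        if n_old > 0:
            # Preferential attachment among active authors.
            for i in rng.choices(range(n_existing), weights=weights, k=n_old):
                counts[i] += 1
        counts.extend([1] * n_new)
        first_year.extend([t] * n_new)
    return counts


# Illustrative run (parameter values are hypothetical).
counts = simulate_journal(n_years=50, n_papers=100, rho0=0.5, t_min=20, t_max=40)
print("authors:", len(counts), "max papers by one author:", max(counts))
```

A histogram of `counts` on logarithmic axes can then be compared qualitatively with the empirical distributions, for instance by varying `n_years` as in the synthetic experiments.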

Histograms of the outcome of this model are illustrated in Figure 8 and the fitted parameters are in Table 4. We observe a clear similarity between the histograms for synthetic and real data. Namely, for a short lifetime (*N*_{y} = 50), some authors beat the PL, exceeding the number of papers that would be expected under it, as is observed in Figure 2 for CHA, SIA, and AMA. For a longer lifetime (*N*_{y} = 150), the tail of the distribution decays and loses its heaviness, similar to PRL and PRD in Figure 1.

**Table 4.**

| | PL: α | PL: p (%) | PLwC: β | PLwC: γ | PLwC: p (%) | Y-S: ρ | Y-S: p (%) |
|---|---|---|---|---|---|---|---|
| N_{y} = 50 | 2.05 | 0.0 | 1.94 | 0.013 | 0.0 | 2.44 | 0.2 |
| N_{y} = 100 | 2.12 | 0.0 | 2.03 | 0.01 | 0.0 | 2.58 | 0.0 |
| N_{y} = 150 | 2.12 | 0.0 | 2.01 | 0.02 | 0.0 | 2.58 | 0.06 |

These observations support the hypothesis that the two main ingredients in the evolution of authorship within journals are *preferential attachment* and *finiteness of careers*.

## 6. DISCUSSION

The main observation of our article is the heavy-tailed shape of the distribution of papers, which we explain by a preferential attachment or cumulative advantage process. Heavy-tailedness in distributions related to scientific publications, especially in citation or collaboration networks, has been widely documented (Eom & Fortunato, 2011; Price, 1976). We showed that heavy-tailedness is preserved when restricting the analysis to a single journal.

Interestingly, our analysis suggests that the distribution does not follow a PL, but has a slightly lighter tail. Although we have not been able to unequivocally identify a canonical distribution, we demonstrated that a PLwC or a Yule-Simon distribution seems to be a better fit to the data than the PL.
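To make the notion of a lighter tail concrete, the following sketch compares the three candidate distributions far in the tail, using the parameters fitted to the synthetic data for *N*_{y} = 50 in Table 4. The functional forms are assumptions: they are the standard discrete ones (PL proportional to k^{-α}, PLwC proportional to k^{-β}e^{-γk}, Yule-Simon equal to ρB(k, ρ + 1)) and may be parametrized differently from the fits in this article. The PL and PLwC are normalized numerically over a truncated support, and the cutoff `K_MAX` is an arbitrary choice.

```python
import math

K_MAX = 100_000  # truncation of the support, chosen for illustration


def normalized(weight, k_max=K_MAX):
    """Turn an unnormalized weight function into a pmf on {1, ..., k_max}."""
    norm = sum(weight(k) for k in range(1, k_max + 1))

    def pmf(k):
        return weight(k) / norm

    return pmf


def yule_simon_pmf(k, rho):
    """Yule-Simon pmf rho * B(k, rho + 1), evaluated through log-gamma."""
    log_beta = math.lgamma(k) + math.lgamma(rho + 1) - math.lgamma(k + rho + 1)
    return rho * math.exp(log_beta)


# Parameters from the N_y = 50 row of Table 4.
ALPHA, BETA, GAMMA, RHO = 2.05, 1.94, 0.013, 2.44

power_law = normalized(lambda k: k ** -ALPHA)
power_law_cutoff = normalized(lambda k: k ** -BETA * math.exp(-GAMMA * k))


def yule_simon(k):
    return yule_simon_pmf(k, RHO)


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(k, power_law(k), power_law_cutoff(k), yule_simon(k))
```

Under these assumed forms, both alternatives assign much less probability than the PL to very large numbers of papers (for instance at k = 1000), which is the sense in which their tails are lighter.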

We argue that the observed heavy-tailedness of the distribution follows from a preferential attachment process through three pieces of evidence. First, we showed that the probability that an author gets a new paper in a given journal at time *t* is approximately proportional to the number of papers they already have in the very same journal. According to Krapivsky et al. (2000), exact proportionality would lead to a PL. Therefore, it is likely that an approximate proportionality leads to a heavy-tailed distribution.

Second, we emphasized an approximate cumulative advantage process, which also leads to PL behaviors. Although what we refer to as preferential attachment and cumulative advantage are closely related, they represent two distinct underlying mechanisms explaining the heavy-tailedness of the distributions.

Finally, we provided a mathematical model for generating synthetic data on the number of papers in a given journal, where preferential attachment plays a crucial role. The similarity between the obtained distribution and the observed distributions also supports the claim that the heavy tails are driven by preferential attachment.

Even though there seems to be a pattern in the data analyzed in this article, standard distributions (e.g., PLwC, Yule-Simon) do not perfectly fit the data. More advanced fitting techniques could identify a common distribution for all journals, provided that one exists. A more refined explanation of the approximate preferential attachment taking place in scientific publishing could unravel with more certainty the source of the distributions observed in this article. Even though the preferential attachment has been emphasized in the past, the underlying reasons for this bias are intricate. Disentangling the impact of scientific factors (quality and novelty of the research) and more social ones (rank and reputation of the authors) in the publication process will be a key step towards a fair and square evaluation of scientists and their work.

## AUTHOR CONTRIBUTIONS

Robin Delabays: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing—original draft, Writing—review & editing. Melvyn Tyloo: Conceptualization, Methodology, Writing—review & editing.

## FUNDING INFORMATION

Both authors were partly supported by the Swiss National Science Foundation under grant number 200020_182050. RD was supported by the Swiss National Science Foundation under grant number P400P2_194359.

## COMPETING INTERESTS

The authors have no competing interests.

## DATA AVAILABILITY

The data were extracted from www.webofscience.com and cannot be shared openly. The code for synthetic data generation is available online (Delabays, 2022).

## REFERENCES

### APPENDIX

**Figure A.1.**

**Figure A.2.**

## Author notes

Handling Editor: Ludo Waltman