## Abstract

Cocitation measurements can reveal the extent to which a concept representing a novel combination of existing ideas evolves towards a specialty. The strength of cocitation is represented by its frequency, which accumulates over time. Of interest is whether underlying features associated with the strength of cocitation can be identified. We use the proximal citation network for a given pair of articles (*x*, *y*) to compute *θ*, an a priori estimate of the probability of cocitation between x and y, prior to their first cocitation. Thus, low values for *θ* reflect pairs of articles for which cocitation is presumed less likely. We observe that cocitation frequencies are a composite of power-law and lognormal distributions, and that very high cocitation frequencies are more likely to be composed of pairs with low values of *θ*, reflecting the impact of a novel combination of ideas. Furthermore, we note that the occurrence of a direct citation between two members of a cocited pair increases with cocitation frequency. Finally, we identify cases of frequently cocited publications that accumulate cocitations after an extended period of dormancy.

## 1. INTRODUCTION

Cocitation, “the frequency with which two documents from the earlier literature are cited together in the later literature,” was first described in 1973 (Marshakova-Shaikevich, 1973; Small, 1973). As noted by Small (1973), cocitation patterns differ from bibliographic coupling patterns (Kessler, 1963) but align with patterns of direct citation and frequently cocited publications must have high individual citations.

Cocitation has been the subject of further study and characterization, for example, comparisons to bibliographic coupling and direct citation (Boyack & Klavans, 2010), the study of invisible colleges (Gmür, 2003; Noma, 1984), construction of networks by cocitation (Small & Sweeney, 1985; Small, Sweeney, & Greenlee, 1985), evaluation of clusters in combination with textual analysis (Braam, Moed, & van Raan, 1991), textual similarity at the article and other levels (Colavizza, Boyack, et al., 2018), and the fractal nature of publications aggregated by cocitations (van Raan, 1990).

Cocitations provide details of the relationship between key (highly cited) ideas, and changes in cocitation patterns over time may provide insight into the mechanism with which new schools of thought develop. Implicit in the definition of cocitation are novel combinations of existing ideas, but only some frequently cocited article pairs reflect surprising combinations. For example, two publications presenting the leading methods for the same computational problem may be highly cocited, but this does not reflect a novel combination of ideas. Similarly, two publications describing methods that often constitute part of the same workflow may be highly cocited, but these cocitations are also not surprising. On the other hand, for two articles in different fields, frequent cocitation is generally unexpected.

Novel, atypical, or otherwise unusual combinations of cocited articles have been explored at the journal level (Boyack & Klavans, 2014; Bradley, Devarakonda, et al., 2020; Uzzi, Mukherjee, et al., 2013; Wang, Veugelers, & Stephan, 2017). However, journal-level classifications have limited resolution relative to article-level studies, which may better represent the actual structure and aggregations of the scientific literature (Gómez, Bordons, et al., 1996; Klavans & Boyack, 2017; Milojevic, 2019; Shu, Julien, et al., 2019; Waltman & van Eck, 2012). Accordingly, we sought to discover measurable characteristics of frequently cocited publications from an article-level perspective.

To study frequently cocited articles, we have developed a novel graph-theoretic approach that reflects the citation neighborhood of a given pair of articles. In seeking to determine the degree to which a cocited pair of papers represented a surprising combination, we wished to avoid journal-based field classifications, which present challenges. Instead, we attempted to use citation history to produce an estimate of the probability that a given pair of publications (*x*, *y*) would be cocited. As we focus on the activity before they are first cocited, the “probability” of cocitation is zero, by definition, because there are no cocitations yet. Hence, we approximated cocitation probabilities: We treat an article that cites one member of a cocited pair and also cites at least one article that cites the other member as a proxy for cocitation. Specifically, given a pair of publications *x*, *y*, we construct a directed bipartite graph whose vertex set contains all publications that cite either *x* or *y* previous to their first cocitation. We then compute *θ*, a normalized count of such proxies, and use it to predict the probability of cocitation between *x* and *y*. This approach enables an evaluation that is specific to the given pair of articles, and does so without substantial computational cost, while avoiding definitions of disciplines derived from journals or having to measure disciplinary distances.

To support our analysis, we constructed a data set of articles from Scopus (Elsevier BV, 2019) that were published in the 11-year period, 1985–1995, and extracted the cited references in these articles. Recognizing that frequently cocited publications must derive from highly cited publications (Small, 1973), we identified those reference pairs (33.6 million pairs) for each article in the data set that are drawn from the top 1% most cited articles in Scopus and measured their frequency of cocitation.

To investigate which statistical distributions might best describe the cocitation frequencies in these 33.6 million cocited pairs, we reviewed prior work on distributions of citation frequency (Eom & Fortunato, 2011; Newman, 2003; Price, 1965, 1976; Radicchi, Fortunato, & Castellano, 2008; Redner, 2005; Stringer, Sales-Pardo, & Amaral, 2008, 2010; Wang, Song, & Barabási, 2013). This research has fit the frequency distribution of citation strength sometimes to a power law distribution and other times to a lognormal distribution. A graph of the analogous cocitation data suggests that power law or lognormal distributions are candidates for describing cocitation strength as well and so we, accordingly, investigated that conjecture. Interestingly, Mitzenmacher (2003) notes that the debate between the appropriateness of power law versus lognormal distributions is not confined to bibliometrics, but has been at issue in many disciplines and contexts.

To study how the best-fit distributional function and parameters for cocitation might vary with *θ*, we stratified cocitation frequency data. We also measured whether a direct link exists between two members of a cocited pair (i.e., whether one member of a pair cites the other) and how this property is related to cocitation frequencies. We find that the distribution of cocitation frequencies varies with *θ* and that a power law distribution fits cocitation frequencies more often when *θ* is small, whereas a lognormal distribution fits more often for large *θ*.

A pertinent aspect of cocitation is the rate at which frequencies accumulate. While the citation dynamics of individual publications have been fairly well studied by others, for example, Eom and Fortunato (2011) and Wallace, Larivière, and Gingras (2009), the dynamics of cocited articles are less well studied. Our interest was the special case analogous to the Sleeping Beauty phenomenon (Ke, Ferrara, et al., 2015; van Raan, 2004), which may reflect delayed recognition of scientific discovery and the causes attributed to it (Barber, 1961; Cole, 1970; Garfield, 1970, 1980; Glänzel & Garfield, 2004; Merton, 1963). Thus, we also identified cocited pairs that featured a period of dormancy before accumulating cocitations.

## 2. MATERIALS AND METHODS

### 2.1. Data

Citation counts were computed for all Scopus articles (88,639,980 records) updated through December 2019, as implemented in the ERNIE project (Korobskiy, Davey, et al., 2019). Records with corrupted or missing publication years or classified as “dummy” by the vendor were then removed, resulting in a data set of 76,572,284 publications. Hazen percentiles of citation counts, grouped by year of publication, were calculated for the these data (Bornmann, Leydesdorff, & Mutz, 2013). The top 1% of highly cited publications from each year were combined into a set of highly cited publications consisting of 768,993 publications.

Publications of type “article,” each containing at least five cited references and published in the 11-year period from 1985–1995, were subsetted from Scopus to form a data set of 3,394,799 publications and 51,801,106 references (8,397,935 unique). For each of these publications, all possible reference pairs were generated and then restricted to those pairs where both members were in the set of highly cited publications (above).

For example, the data for 1985 consisted of 223,485 articles after processing as described above. Computing all reference pairs (that were also members of the highly cited publication set of 768,993) from these 223,485 articles gave rise to 2,600,101 reference pairs (Table 1) that ranged in cocitation frequency from 1 to 874 within the 1985 data set; from 1 to 11,949 across the 11-year period 1985–1995; and from 1 to 35,755 across all of Scopus. Collectively, the publications in our 1985–1995 data set generated 33,641,395 unique cocitation pairs, for which we computed cocitation frequencies across all of Scopus (Figure 1).

Year . | Articles . | References . | Cocited pairs . |
---|---|---|---|

1985 | 223,485 | 1,796,502 | 2,600,101 |

1986 | 238,096 | 1,920,225 | 2,840,557 |

1987 | 250,575 | 2,037,654 | 3,180,261 |

1988 | 269,219 | 2,182,571 | 3,406,902 |

1989 | 285,873 | 2,303,481 | 3,793,986 |

1990 | 305,010 | 2,490,909 | 4,546,915 |

1991 | 325,782 | 2,662,005 | 5,039,334 |

1992 | 343,239 | 2,846,607 | 5,622,164 |

1993 | 360,916 | 3,006,374 | 6,121,147 |

1994 | 387,062 | 3,228,240 | 7,022,499 |

1995 | 405,503 | 3,432,228 | 7,626,684 |

Year . | Articles . | References . | Cocited pairs . |
---|---|---|---|

1985 | 223,485 | 1,796,502 | 2,600,101 |

1986 | 238,096 | 1,920,225 | 2,840,557 |

1987 | 250,575 | 2,037,654 | 3,180,261 |

1988 | 269,219 | 2,182,571 | 3,406,902 |

1989 | 285,873 | 2,303,481 | 3,793,986 |

1990 | 305,010 | 2,490,909 | 4,546,915 |

1991 | 325,782 | 2,662,005 | 5,039,334 |

1992 | 343,239 | 2,846,607 | 5,622,164 |

1993 | 360,916 | 3,006,374 | 6,121,147 |

1994 | 387,062 | 3,228,240 | 7,022,499 |

1995 | 405,503 | 3,432,228 | 7,626,684 |

### 2.2. Derivation of *θ*

We now show how we define our prior on the probability of *x* and *y* being cocited, based on the citation graph restricted to publications that cite either *x* or *y* (but not both) up to the year of their first cocitation. Recall that we defined a proxy cocitation of *x* and *y* to be an article that cites one member of the cocited pair (*x*, *y*) and also cites at least one article that cites the other member. The idea behind this definition is that we consider papers that cite *x* as proxies for *x*, and papers that cite *y* as proxies for *y*. Thus, if a paper *a* cites both *x* and *y*′ (where *y*′ is a proxy for *y*), then it is a proxy for a cocitation of *x* and *y*. Similarly, if a paper *b* cites both *y* and *x*′ (where *x*′ is a proxy for *x*), it is also a proxy for a cocitation of *x* and *y*. This motivates the graph-theoretic formulation, which we now formally present.

We fix the pair *x*, *y* and we define *N*(*x*) to be the set of all publications that cite *x* (but do not also cite *y*), and are published no later than the year of the first cocitation of *x* and *y*. We similarly define *N*(*y*). We define a directed bipartite graph with vertex set *N*(*x*) ∪ *N*(*y*). Note that if *x* cites *y* then *x* ∈ *N*(*y*), and similarly for the case where *y* cites *x*. Note also that because we have restricted *N*(*x*) and *N*(*y*) that *N*(*x*) ∩ *N*(*y*) = ∅. We now describe how the directed edge set *E*(*x*, *y*) is constructed. For any pair of articles *a*, *b* where *a* ∈ *N*(*x*) and *b* ∈ *N*(*y*), if *a* cites *b* then we include the directed edge *a* → *b* in *E*(*x*, *y*). Similarly, we include edge *b* → *a* if *b* cites *a*. Finally, if a pair of articles both cite each other, then the graph has parallel edges. By construction, this graph is *bipartite*, which means that all the edges go between the two sets *N*(*x*) and *N*(*y*) (i.e., no edges exist between two vertices in *N*(*x*), nor between two vertices in *N*(*y*)).

Note that by the definition, every edge in *E*(*x*, *y*) arises because of a proxy cocitation, so that the number of proxy cocitations is the number of directed edges in *E*(*x*, *y*). Consider the situation where a publication *a* cites *x* (so that *a* ∈ *N*(*x*)) and also cites *b*_{1}, *b*_{2}, *b*_{3} in *N*(*y*): this defines three directed edges from *a* to nodes of *N*(*y*). We count this as three proxy cocitations, not as one proxy cocitation. Similarly, if we have a publication *b* that cites *y* and also cites *a*_{1}, *a*_{2}, *a*_{3}, *a*_{4} in *N*(*x*), then there are four directed edges that go from *b* to nodes in *N*(*x*) and we will count each of those directed edges as a different proxy cocitation.

*X*| denote the cardinality of a set

*X*, we note |

*E*(

*x*,

*y*)|, (i.e., the number of directed edges that go between

*N*(

*x*) and

*N*(

*y*)), is the number of proxy cocitations between

*x*and

*y*. If no parallel edges are permitted, the maximum number of possible proxy cocitations is |

*N*(

*x*)| × |

*N*(

*y*)|. Under the assumption that both

*N*(

*x*) and

*N*(

*y*) each have at least one article, we define

*θ*(

*x*,

*y*), our prior on the probability of

*x*and

*y*being cocited, as follows:

Note that if parallel edges do not occur in the graph, then *θ*(*x*, *y*) ≤ 1, but that otherwise the value can be greater than 1. Note also that *θ*(*x*, *y*) = 0 if *E*(*x*, *y*) = ∅ (i.e., if there are no proxy cocitations) and that *θ*(*x*, *y*) = 1 if every possible proxy cocitation occurs.

To efficiently calculate *θ*, we used the following pipeline. We copied Scopus data from a relational schema in PostgreSQL into a citation graph from Scopus into the Neo4j 3.5 graph database using an automated Extract Transform Load (ETL) pipeline that combined Postgres CSV export and the Neo4j Bulk Import tool. The graph vertex set is all publications, each with a publication year attribute, and the edge set is all citations between the publications. A Cypher index was created on the publication year. We developed Cypher queries to calculate *θ* and tuned performance by splitting input publication pairs into small batches and processing them in parallel, using parallelization in Bash and GNU Parallel. Batch size, the number of parallel job slots, and other parameters were tuned for performance, with best results achieved on batch sizes varying from 20 to 100 pairs. The results of *θ* calculations were cross-checked using SQL calculations. In the small number of cases where *θ* computed to >1 (above) it was set to 1 for the purpose of this study.

### 2.3. Statistical Calculations

*N*is the total number of pairs of articles and $xio$ is the observed frequency of the

*i*th pair of papers being cocited. Note that this is in general a multiset, as different pairs of articles can have the same cocitation frequency. Let

*n*(

*x*) be the number of times that

*x*appears in

*X*

^{o}(equivalently,

*n*(

*x*) is the number of pairs of articles that are cocited

*x*times), and let

*N*(

*x*) = $\u2211y=x\u221e$

*n*(

*y*) denote the total number of pairs of articles that are cocited at least

*x*times. Then

*x*is a parameter we use to analyze the distribution’s right tail starting at varying frequencies. We describe in this subsection (a) the statistical computations for fitting log-normal and power law distributions to right tails of the observed cocitation frequency distributions as defined by Eq. 1 for various

*x*and (b) how we assessed the quality of those fits. Further, we performed such analyses for various slices of the data, stratifying by

*θ*and other parameters, as is described in Section 3.

*f*(·), following Stringer et al. (2008) and Stringer et al. (2010), while appropriately normalizing for our conditional assessment of the right tail commencing at

*x*:

*μ*and

*σ*are the mean and standard deviation, respectively, of the underlying normal distribution. These probabilities can be computed with the cumulative normal distribution,

We fit distributions to the cocitation frequency data for various extremities of the right tail, as parameterized by *x*, using a maximum (log) likelihood estimator (MLE). We solved for the best-fit distributional parameters for the lognormal distribution, *μ* and *σ*, by modifying a multi-dimensional interval search algorithm from Press, Teukolsky, et al. (2007) and following Stringer et al. (2010). A compiled version of this code using the C++ header file amoeba.h is available on our Github site (Korobskiy et al., 2019).

*x*, which was normalized for our conditional observations of the right tail:

*ζ*(

*α*, 1), as is needed for analysis of the right tail.

*α*,

*X*

^{o}(

*x*) = {

*x*∈

*X*

^{o}:

*x*≥

*x*}, are the observed cocitations with frequencies at least as great as

*x*and

*N*(

*x*) is the number of such cocitations. We solved Eq. 4 to find

*α*using a bisection algorithm.

We used the *χ*^{2} goodness of fit (*χ*^{2}) and the Kolmogorov-Smirnov (K-S) tests to assess the null hypothesis that the distribution of the observed cocitation frequencies and the best-fit lognormal distribution are the same, and similarly for the best-fit power law distribution. We also computed the Kullback-Leibler Divergence (K-L) between the observed data and the best-fit distributions.

Both the *χ*^{2} and K-S tests employed the null hypothesis that the observed cocitation frequencies, *n*(*x*) for *x* ∈ [*x*, ∞), were sampled from the best-fit lognormal or power law distributions, which we denote by *f*_{d}(·|*x*) for *d* ∈ {*LN*, *PL*}, while suppressing the parameters specific to each of the distributions.

*χ*

^{2}statistic was computed by, first, grouping each of the observed cocitation frequencies into

*k*bins, denoted by

*b*

_{i}for

*i*∈ {1, …,

*k*}, and then computing

*O*

_{i}is the observed number of cocitations having frequencies associated with the

*i*th bin,

*E*

_{i}is the expected number of observations for frequencies in bin

*i*, if the null hypothesis was true, in a sample with size equal to the number of observed data points,

*N*(

*x*):

*O*

_{i}and

*E*

_{i}to be approximately equal, with deviations owing to variability due to sampling.

Constructing the bins *b*_{i} requires only that *E*_{i} ≥ 5 for every *i* = 1, …, *k*. Test outcomes are sometimes sensitive to the minimum *E*_{i} permitted, which we will denote by *E*, so we tested with multiple thresholds, including 10, 20, 50, and 70. Furthermore, statistical tests are stochastic: These multiple tests permitted a reduction in the probability of erroneously rejecting or accepting the null hypothesis based on a single test. The distribution of observed cocitation frequencies was skewed right with a long tail, so that aggregating bins to satisfy *E*_{i} ≥ *E* was most critical in the right tail. This motivated a bin construction algorithm that aggregated frequencies in reverse order, starting with the extreme right tail. Algorithm 1 requires a set of the unique observed cocitation frequencies, $X\u02c6$^{o}, which includes the elements of the multiset *X*^{o} without repetition. While Algorithm 1 does not guarantee in general that all bins satisfy *E*_{i} ≥ *E*, that criterion was satisfied for the observed data.

1: i ← 1 |

2: b_{1} = {} |

3: while |$X\u02c6$^{o}| > 0 do |

4: b_{i} ← b_{i} ∪ {max ($X\u02c6$^{o})} |

5: $X\u02c6$^{o} ← $X\u02c6$^{o} \ max ($X\u02c6$^{o}) |

6: ifE_{i} ≥ Ethen |

7: i ← i + 1 |

8: b_{i} ← {} |

9: end if |

10: end while |

1: i ← 1 |

2: b_{1} = {} |

3: while |$X\u02c6$^{o}| > 0 do |

4: b_{i} ← b_{i} ∪ {max ($X\u02c6$^{o})} |

5: $X\u02c6$^{o} ← $X\u02c6$^{o} \ max ($X\u02c6$^{o}) |

6: ifE_{i} ≥ Ethen |

7: i ← i + 1 |

8: b_{i} ← {} |

9: end if |

10: end while |

*F*

^{o}(

*x*|

*x*) = $\u2211i=x\xafx$

*f*

^{o}(

*i*|

*x*), and the best-fit cumulative distribution by

*F*

_{d}(

*x*|

*x*) = $\u2211i=x\xafx$

*f*

_{d}(

*i*|

*x*). The K-S test involves testing the maximum absolute difference between the observed and theorized cumulative distributions,

*n*is the number of observations giving rise to

*F*

^{o}(

*x*|

*x*), against the distribution of such differences between samples from the theorized distribution with the same number of observations,

*n*,

_{d,j}(

*x*|

*x*) is the empirical distribution of sample

*j*of size

*n*(notation suppressed) drawn from

*F*

_{d}(

*x*|

*x*). We generated 100 such random variables $D\u02dc$

_{n}for each test. We reject the null hypothesis if

*D*

_{n}is larger than substantially all of the $D\u02dc$

_{n}, say all but 5%, for equivalence with a

*p*-value of 0.05. The number of $D\u02dc$

_{n}samples drawn yields a

*p*-value with a resolution of 1%.

*θ*using a

*χ*

^{2}test, using the null hypothesis that the cocitation frequency distribution was independent of

*θ*. We initially created a contingency table on

*θ*and cocitation frequency using these bins for

*θ*, {[0.0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1.0]}, and logarithmic bins for frequency to accommodate the skewed distributions:

*θ*and frequency increased by having just two intervals for frequency: {[10, 100), [100, 100000]}.

### 2.4. Kinetics of Cocitation

We extended prior work on delayed recognition and the Sleeping Beauty phenomemon (Glänzel & Garfield, 2004; Ke et al., 2015; Li & Ye, 2016; van Raan, 2004) towards cocitation. We have modified the beauty coefficient (B) of Ke et al. (2015) to address cocitations by (a) counting citations to a pair of publications (cocitations) rather than citations to individual papers, (b) setting *t*_{0} (age zero) to the first year in which a pair of publications could be cocited (i.e., the publication year of the more recently published member of a cocited pair), and (c) setting *C*_{0} to the number of cocitations occurring in year *t*_{0}. Rather than calculate awakening time as in Ke et al. (2015), we opted to measure the simpler length of time between *t*_{0} and the first year in which a cocitation was recorded; we label this measurement as the time lag *t*_{l}, so that *t*_{l} = 0 if a cocitation was recorded in *t*_{0}.

## 3. RESULTS AND DISCUSSION

Our base data set, described in Table 1, consists of the 33,641,395 cocited reference pairs (33.6 million pairs) and their cocitation frequencies, gathered from Scopus during the 11-year period from 1985–1995 (Section 2). A striking distribution of cocitation frequencies with a long right tail is observed with a minimum cocitation of 1, a median of 2, and a maximum cocitation frequency of 51,567 (Figure 2). Approximately 33.3 of 33.6 million pairs (99% of observations) have cocitation frequencies ranging from 1–67 and the remaining 1% have cocitation frequencies ranging from 68–51,567. As the focus of our study was cocitations of frequently cited publications, we further restricted this data set to those pairs with a cocitation frequency of at least 10, which resulted in a smaller data set of 4,119,324 cocited pairs (4.1 million pairs) with minimum cocitation frequency of 10, median of 18, and a maximum cocitation frequency of 51,567. To focus on cocitations derived from highly cited publications, *θ* was calculated for all pairs with a cocitation frequency of at least 10. We also note whether one article in a cocitation pair cites the other (connectedness), reporting a pair as “connected” when such a citation occurs, else as “not connected.”

Influenced by the use of linked cocitations for clustering (Small & Sweeney, 1985), we also examined the extent to which members of a cocited pair were also found in other cocited pairs. We found that 205,543 articles contributed to 4.12 million cocited pairs. The highest frequency observed in our data set, 51,567 cocitations, was for a pair of articles from the field of physical chemistry: Becke (1993) and Lee, Yang, and Parr (1988). The members of this pair are not connected and are found in 1,504 cocited pairs with frequencies ranging from 10 to 51,567. The second highest frequency, 28,407 cocitations, was for another pair of articles from the field of biochemistry: Bradford (1976) and Laemmli (1970). Members of this pair are not connected and are found in a staggering 41,909 cocited pairs, 24,558 for the Laemmli gel electrophoresis article and 17,352 for the Bradford protein estimation article. For the latter pair, both articles describe methods heavily used in biochemistry and molecular biology, an area with strong referencing activity, so this result is not entirely surprising.

Having developed *θ*(*x*, *y*) as a prediction of the probability that articles *x* and *y* would be cocited, we first tested whether the distribution of cocitation frequencies was independent of *θ* (Section 2). The null hypothesis that the cocitation frequency distribution was independent of *θ* was rejected with a very small *p*-value: The statistical software indicated a *p*-value with no significant nonzero digits. We next investigated what distribution functions might fit the frequencies of cocitation as *θ* varied.

Based on the long tails of citation frequencies, prior research has assessed the fit of log-normal and power law distributions (Radicchi et al., 2008; Stringer et al., 2008, 2010). We noted long right tails in cocitation frequencies, which, similarly, motivated us to assess the fit of lognormal and power law distributions to cocitation data. Further, we stratified the data according to (a) the minimum frequency for the right tail *x*, (b) *θ*, and (c) whether the two members of each cocitation pair were connected. Figure 3 shows which distribution, if either, fits the data in each slice, based on tests of statistical significance. Note that there were no circumstances where both distributions fit: If one fit, then the other did not.

Statistical tests were not possible for some slices due to an insufficient number of data points. This was the case for certain combinations of large *x*, large *θ*, and cocitations that were not connected. The number of data points obviously decreases as *x* increases, and we found the decrease in the number of data points to be more precipitous when *θ* was large and cocitations were unconnected due to the lighter right tails for these parameter combinations. The graph in the right panel of Figure 4, which has a logarithmic *y*-axis, shows that the number of data points per *θ* interval analyzed decreases most often by more than an order of magnitude from one interval to the next as *θ* increases. Most pairs of publications that are cocited at least 10 times, therefore, have small values of *θ*.

Figure 3 indicates when the null hypothesis of a best-fit lognormal or power law fitting the observed data cannot be rejected. We computed two types of statistics for evaluating the null hypothesis (*χ*^{2} and K-S) and, moreover, we computed the *χ*^{2} statistic for four binning strategies. Figure 3 indicates a distributional fit, specifically, if either the K-S *p*-value is greater than 0.05 or if two or more of the *χ*^{2} statistics are greater than 0.05. While we computed the K-L Divergence (see supplementary material), we did not use these computations for formal statements of distributional fit because they are neither a norm nor determine statistical significance. These K-L computations did, however, support the findings based on formal tests of statistical significance.

Power law distributions fit most often when cocitations are connected (Figure 3), when more extreme right tails are considered, and when cocitations have small values of *θ*. Log-normal distributions fit, conversely, in some circumstances, when a greater portion of the right tail is considered. These observations support the existence of heavy tails for *θ* small, even if a lognormal distribution fits the observed data more broadly. This observation is consistent with our observations of the most frequent cocitations having small *θ* values, as shown in the scatter plot in the left panel of Figure 4.

Mitzenmacher (2003) shows a close relationship between the power law and lognormal distributions vis-à-vis subtle variations in generative mechanisms that determine whether the resulting distribution is a power law or lognormal. The stratified layers in Figure 3, where a lognormal distribution fits for some portion of the right tail and, in the same instance, a power law describes the more extreme tail, may, therefore, be due to a generative mechanism whose parameters are close to those for a power law distribution as well as those for a lognormal distribution.

Table 2 shows the exponents of the best-fit power law distributions when statistical tests indicated that a power law was a good fit and where comparisons were possible among the intervals of *θ*: These were possible for *θ* intervals of [0.0, 0.2) and [0.2, 0.4), for connected cocitations, and right tails commencing at *x* ∈ {200, 250, 300}. The power law exponent *α* in these comparisons was less for *θ* ∈ [0.0, 0.2) than for *θ* ∈ [0.2, 0.4), indicating heavier tails for *θ* small and, therefore, a greater chance of extreme cocitation frequency. Figure 5 shows a log-log plot of the number of cocitations (*y*-axis) exhibiting the counts on the *x*-axis, for *θ* in the interval [0.0, 0.2) (note that both axes employ log scaling). The pattern for points below the 99th percentile clearly indicates that the number of cocitations referenced at a given frequency decreases greatly as the frequency increases. Also, the broadening of the scatter where fewer cocitations are cited more frequently is indicative of a long right tail, as has been observed in other research where lognormal or power law distributions have been fit to data, as in Montebruno, Bennett, et al. (2019).

Right tail cutoff (x)
. | θ
. | Power law exponent (α)
. |
---|---|---|

200 | [0.0, 0.2) | 3.26 |

200 | [0.2, 0.4) | 3.37 |

250 | [0.0, 0.2) | 3.27 |

250 | [0.2, 0.4) | 3.37 |

300 | [0.0, 0.2) | 3.22 |

300 | [0.2, 0.4) | 3.35 |

Right tail cutoff (x)
. | θ
. | Power law exponent (α)
. |
---|---|---|

200 | [0.0, 0.2) | 3.26 |

200 | [0.2, 0.4) | 3.37 |

250 | [0.0, 0.2) | 3.27 |

250 | [0.2, 0.4) | 3.37 |

300 | [0.0, 0.2) | 3.22 |

300 | [0.2, 0.4) | 3.35 |

Perline (2005) warns against fitting a power law function to truncated data. Informally, a portion of the entire data set can appear linear on a log-log plot, while the entire data set would not. He cites instances where researchers have mistakenly characterized an entire data set as following a power law due to an analysis of only a portion of the data, when a lognormal distribution might provide a better fit to the entire data set. Indeed, the scatter plot in Figure 5 is not linear and so, as Figure 3 shows, a power law does not fit the entire data set. This is what Perline calls a weak power law, where a power law distribution function fits the tail but not the entire distribution. Our concern, however, is not with characterizing the distributional function for the entire data set, but with characterizing the features of high frequency cocitations, which by definition means we are concerned with the right tail of the distribution. Moreover, the results avoid confusion between lognormal and power law distribution functions because we have shown not only that a power law provides a statistically significant fit but also that a lognormal distribution function does not fit.

Our analysis found particularly heavy tails that were well fit by power law distributions for small *θ*, in the intervals [0.0, 0.2) and [0.2, 0.4), and for cocitations whose constituents are connected, as shown in Figure 3. The closely related Matthew Effect (Merton, 1968), cumulative advantage (Price, 1976), and the preferential attachment class of models (Albert & Barabási, 2002) provide possible explanations for citation frequencies following a power law distribution for some sufficiently extreme portion of the right tail. For greater values of *θ*, insufficient data in the right tails precludes a definitive assessment in this regard, although one might argue that the lack of observations in the tails is counter to the existence of a power law relationship. It is also noteworthy that the exponents we found for cocitations (Table 2) are close in value to those reported for citations by Price (1976) and Radicchi et al. (2008).

### 3.1. Delayed Cocitations

The delayed onset of citations to a well-cited publication, also referred to as *Delayed Recognition* and *Sleeping Beauty*, has been studied by Garfield, van Raan, and others (Bornmann, Ye, & Ye, 2018; Garfield, 1970; Glänzel & Garfield, 2004; Ke et al., 2015; Li & Ye, 2016; van Raan, 2004; van Raan & Winnink, 2019). We sought to extend this concept to frequently cocited articles (Figure 6). As an initial step, we calculated two parameters (Section 2): (a) the beauty coefficient (Ke et al., 2015) modified for cocited articles and (b) timelag *t*_{l}, the length of time between first possible year of cocitation and the first year in which a cocitation was recorded. We further focused our consideration of delayed cocitations to the 95th percentile or greater of cocitation frequencies in our data set of 4.1 million cocited pairs. Within the bounds of this restriction, 24 cocited pairs have a beauty coefficient of 1,000 or greater and all 24 are in the 99th percentile of cocitation frequencies. Thus, very high beauty coefficients are associated with high cocitation frequencies.

We also examined the relationship of *t*_{l} with cocitation frequencies (Figure 7) and observed that high *t*_{l} values were associated with lower cocitation frequencies. These data appear to be consistent with a report from van Raan and Winnink (van Raan & Winnink, 2019), who conclude that “probability of awakening after a period of deep sleep is becoming rapidly smaller for longer sleeping periods.” Further, when two articles are connected, they tend to have smaller *t*_{l} values compared to pairs that are not connected in the same frequency range.

## 4. CONCLUSIONS

In this article, we report on our exploration of features that impact the frequency of cocitations. In particular, we wished to examine article pairs with high cocitation frequencies with respect to whether they originated from the same school(s) of thought or represented novel combinations of existing ideas. However, defining a discipline is challenging, and determining the discipline(s) relevant to specific publications remains a challenging problem. Journal-level classifications of disciplines have known limitations and while article-level approaches offer some advantages, they are not free from their own limitations (Milojevic, 2019).

Consequently, we designed *θ*, a statistic that examines the citation neighborhood of a pair of articles *x* and *y* to estimate the probability that they would be cocited. Our approach has advantages compared to alternate approaches: It avoids the challenges of journal-level analyses, it does not require a definition of “discipline” (or “disciplinary distance”), it does not require assignment of disciplines to articles, it is computationally feasible, and, most importantly, it enables an evaluation that is specific to a given pair of articles.

We note that when *x* and *y* are from the same subfield, then *θ* may be very large, and conversely, when *x* and *y* are from very different fields, it might be reasonable to expect that *θ* will be small. Thus, in a sense, *θ* may correlate with disciplinary similarity, with large values for *θ* reflecting conditions where the two publications are in the same (or very close) subdisciplines, and small values for *θ* reflecting that the disciplines for the two publications are very distantly related. We also comment that in this initial study, we have not considered second-degree information, that is, publications that cite publications that cite an article of interest.

Our data indicate that the most frequent cocitations occur when cocitations have small values of *θ*, as shown in Figure 4. Our study considered the hypothesis that the frequency distribution is independent of *θ*, but our statistical tests rejected this hypothesis, and showed instead that the frequency distribution is best characterized by a power law for small values of *θ* and connected publications, and in many other regions is best characterized by a lognormal distribution.

The observation that power laws are consistent with small values of *θ* and connected cocitations is consistent with the theory of preferential attachment for these parameter settings. To the extent that preferential attachment is the mechanism giving rise to a power law, this suggests that preferential attachment is, at least, stronger for small *θ* values and connected cocitations than for other parameter combinations, or that preferential attachment is not applicable to other parameter values.

Observing power laws, heavy tails, and pairs with extreme cocitation strength for small values of *θ* (i.e., pairs that have small a priori probabilities of being cocited) may seem, on its face, paradoxical. One possible explanation for the pairs in the extreme right tail with both small *θ* and large cocitation strength is that those pairs represent novel combinations of ideas that, when recognized within the research community, catalyze an increased citation rate, consistent with preferential attachment coupled to time-dependent initial attractiveness (Eom & Fortunato, 2011) as an underlying generative mechanism. However, small values of *θ* do not guarantee a high cocitation count: Indeed, even for small values of *θ*, cocitations with a power law predominantly have relatively low cocitation strength.

We also note the increasing proportion of connected pairs as the percentile for cocitation frequency increases (Figure 2); this pair of parameters appears to be associated with a fertile environment where extremely high cocitation frequencies are possible. This observation raises the question of whether small values of *θ* and connected cocitations are associated with preferential attachment and, if a causal relationship exists, then how do *θ* and cocitation connection provide an environment supporting preferential attachment? A possibility is that one article in a cocited pair citing the other makes the potential significance of the combination of their ideas apparent to researchers. The clear pattern of the highest frequency cocited pairs typically having low *θ* values suggests that these pairs are highly cited and hence impactful because of the novelty in the ideas or fields that are combined (as reflected in low *θ*). However, other factors should be considered, such as the prominence of authors and prestige of a journal (Garfield, 1980) where the first cocitation appears.

We did not apply field normalization techniques when assembling the parent pool of 768,993 highly cited articles consisting of the top 1% of highly cited articles from each year in the Scopus bibliography. Thus, the highly cocited pairs we observe are biased towards high-referencing areas such as biomedicine and parts of the physical sciences (Small & Greenlee, 1980). However, the data set we analyzed has a lower bound of 10 on cocitation frequencies and includes pairs from fields other than those that are high referencing. For example, the maximum *t*_{l} we observed in the data set of 4.1 million pairs was 149 years, and is associated with a pair of articles independently published in 1840, establishing their eponymous Staudt-Clausen theorem (Clausen, 1840; von Staudt, 1840); this pair of articles has apparently been cocited 10 times since their publication. A second pair of articles concerning electron theory of metals (Drude, 1900a, 1900b),was first cocited in 1994, 109 times, with *t*_{l} observed of 94 years. Both cases are drawn from mathematics and physics rather than the medical literature. They are also consistent with the suggestion that the probability of awakening is smaller after a period of deep sleep (van Raan & Winnink, 2019). As we have defined *t*_{l}, with its heavy penalty for early citation, we create additional sensitivity to coverage and data quality especially for pairs with low citation numbers. Indeed, for the Staudt-Clausen pair, a manual search of other sources revealed an article (Carlitz, 1961) in which they are cocited. Both these articles were originally published in German and it is possible that additional cocitations were not captured. Thus, big data approaches that serve to identify trends should be accompanied by more meticulous case studies, where possible. Other approaches for examining depth of sleep and awakening time should certainly be considered (Ke et al., 2015; van Raan, 2004; van Raan & Winnink, 2019). Lastly, using our approach to revisit invisible colleges (Crane, 1972; Price & Beaver, 1966; Small & Sweeney, 1985) seems warranted, as it seems likely that the upper bound of a hundred members predicted by Price and Beaver (1966) is likely to have increased in a global scientific enterprise with electronic publishing and social media.

Finally, we view these results as a first step towards further investigation of cocitation behavior, and we introduce a new technique based on exploring first-degree neighbors of cocited publications; we are hopeful that this graph-theoretic study will stimulate new approaches that will provide additional insights, and prove complementary to other article-level approaches.

## AUTHOR CONTRIBUTIONS

Sitaram Devarakonda: Conceptualization, Methodology, Investigation, Writing—Review & Editing. James Bradley: Conceptualization, Methodology, Investigation, Writing—Original Draft; Writing—Review & Editing. Dmitriy Korobskiy: Methodology, Writing—Review & Editing, Resources. Tandy Warnow: Conceptualization, Methodology, Writing—Original Draft, Writing—Review & Editing. George Chacko: Conceptualization, Methodology, Investigation, Writing—Original Draft, Writing—Review & Editing, Funding Acquisition, Resources, Supervision.

## COMPETING INTERESTS

The authors have no competing interests. Scopus data used in this study was available through a collaborative agreement with Elsevier on the ERNIE project. Elsevier personnel played no role in conceptualization, experimental design, review of results, or conclusions presented. The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or Elsevier. Sitaram Devarakonda’s present affiliation is Randstad USA. His contributions to this article were made while he was a full-time employee of NET ESolutions Corporation.

## SUPPORTING INFORMATION

Supplementary material on K-L calculations is available on our Github site (Korobskiy et al., 2019).

## FUNDING INFORMATION

Research and development reported in this publication was partially supported by federal funds from the National Institute on Drug Abuse (NIDA), National Institutes of Health, U.S. Department of Health and Human Services, under Contract Nos. HHSN271201700053C (N43DA-17-1216) and HHSN271201800040C (N44DA-18-1216). Tandy Warnow receives funding from the Grainger Foundation.

## DATA AVAILABILITY

Access to the bibliographic data analyzed in this study requires a license from Elsevier. Code generated for this study is freely available from our Github site (Korobskiy et al., 2019).

## ACKNOWLEDGMENTS

We thank two anonymous reviewers for their helpful and constructive critique. In addition to support through federal funding, the ERNIE project features a collaboration with Elsevier. We thank our colleagues from Elsevier for their support of the collaboration.

## REFERENCES

## Author notes

Handling Editor: Ludo Waltman