Frequently Co-cited Publications: Features and Kinetics

Co-citation measurements can reveal the extent to which a concept representing a novel combination of existing ideas evolves towards a specialty. The strength of co-citation is represented by its frequency, which accumulates over time. Of interest is whether underlying features associated with the strength of co-citation can be identified. We use the proximal citation network for a given pair of articles (x, y) to compute theta, an a priori estimate of the probability of co-citation between x and y, prior to their first co-citation. Thus, low values for theta reflect pairs of articles for which co-citation is presumed less likely. We observe that co-citation frequencies are a composite of power-law and lognormal distributions, and that very high co-citation frequencies are more likely to be composed of pairs with low values of theta, reflecting the impact of a novel combination of ideas. Furthermore, we note that the occurrence of a direct citation between two members of a co-cited pair increases with co-citation frequency. Finally, we identify cases of frequently co-cited publications that accumulate co-citations after an extended period of dormancy.


Introduction
Co-citation, "the frequency with which two documents from the earlier literature are cited together in the later literature", was first described in 1973 [34,50]. As noted by [50], co-citation patterns differ from bibliographic coupling patterns [28] but align with patterns of direct citation, and frequently co-cited publications must themselves be highly cited.
Co-citation has been the subject of further study and characterization, for example, comparisons to bibliographic coupling and direct citation [6], the study of invisible colleges [24,41], construction of networks by co-citation [52,53], evaluation of clusters in combination with textual analysis [8], textual similarity at the article and other levels [14], and the fractal nature of publications aggregated by co-citations [58].
Co-citations provide details of the relationship between key (highly cited) ideas, and changes in co-citation patterns over time may provide insight into the mechanism with which new schools of thought develop. Implicit in the definition of co-citation are novel combinations of existing ideas, but only some frequently co-cited article pairs reflect surprising combinations. For example, two publications presenting the leading methods for the same computational problem may be highly co-cited, but this does not reflect a novel combination of ideas. Similarly, two publications describing methods that often constitute part of the same workflow may be highly co-cited, but these co-citations are also not surprising. On the other hand, for two articles in different fields, frequent co-citation is generally unexpected. Novel, atypical, or otherwise unusual combinations of co-cited articles have been explored at the journal level [65,10,7,57]. However, journal-level classifications have limited resolution relative to article-level studies, which may better represent the actual structure and aggregations of the scientific literature [49,29,63,37,26]. Accordingly, we sought to discover measurable characteristics of frequently co-cited publications from an article-level perspective.
To study frequently co-cited articles, we have developed a novel graph-theoretic approach that reflects the citation neighborhood of a given pair of articles. In seeking to determine the degree to which a co-cited pair of papers represented a surprising combination, we wished to avoid journal-based field classifications, which present challenges. Instead, we attempted to use citation history to produce an estimate of the probability that a given pair of publications (x, y) would be co-cited. Because we focus on activity before x and y are first co-cited, the observed frequency of co-citation in that period is zero by definition. Hence, we approximated co-citation probabilities: we treat an article that cites one member of a co-cited pair and also cites at least one article that cites the other member as a proxy for co-citation. Specifically, given a pair of publications x, y, we construct a directed bipartite graph whose vertex set contains all publications that cite either x or y prior to their first co-citation. We then compute θ, a normalized count of such proxies, and use it to predict the probability of co-citation between x and y. This approach enables an evaluation that is specific to the given pair of articles, and does so without substantial computational cost, while avoiding definitions of disciplines derived from journals or having to measure disciplinary distances.
To support our analysis, we constructed a dataset of articles from Scopus [19] that were published in the eleven year period, 1985-1995, and extracted the cited references in these articles. Recognizing that frequently co-cited publications must derive from highly-cited publications [50], we identified those reference pairs (33.6 million pairs) for each article in the dataset that are drawn from the top 1% most cited articles in Scopus and measured their frequency of co-citation.
To investigate which statistical distributions might best describe the co-citation frequencies in these 33.6 million co-cited pairs, we reviewed prior work on distributions of citation frequency [47,20,44,45,40,64,55,56,48]. This research has fit the frequency distribution of citation strength sometimes to a power law distribution and other times to a lognormal distribution. A graph of the analogous co-citation data suggests that power law or lognormal distributions are candidates for describing co-citation strength as well and so we, accordingly, investigated that conjecture. Interestingly, [38] notes the debate between the appropriateness of power law versus lognormal distributions is not confined to bibliometrics, but has been at issue in many disciplines and contexts.
To study how the best-fit distributional function and parameters for co-citation might vary with θ, we stratified co-citation frequency data. We also measured whether a direct link exists between two members of a co-cited pair (i.e., whether one member of a pair cites the other) and how this property is related to co-citation frequencies. We find that the distribution of co-citation frequencies varies with θ and that a power law distribution fits co-citation frequencies more often when θ is small, whereas a lognormal distribution fits more often for large θ.
A pertinent aspect of co-citation is the rate at which frequencies accumulate. While citation dynamics of individual publications have been fairly well studied by others, for example, [62,20], the dynamics of co-cited articles are less well studied. Our interest was the special case analogous to the Sleeping Beauty phenomenon [59,27], which may reflect delayed recognition of scientific discovery and the causes attributed to it [35,21,22,15,2,23]. Thus, we also identified co-cited pairs that featured a period of dormancy before accumulating co-citations.

Materials & Methods
Data. Citation counts were computed for all Scopus articles (88,639,980 records) updated through December 2019, as implemented in the ERNIE project [30]. Records with corrupted or missing publication years or classified as 'dummy' by the vendor were then removed, resulting in a dataset of 76,572,284 publications. Hazen percentiles of citation counts, grouped by year of publication, were calculated for these data [4]. The top 1% of highly cited publications from each year were combined into a set of highly cited publications consisting of 768,993 publications.
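As a minimal sketch of the percentile step, the following assumes the standard Hazen plotting position, 100 × (r − 0.5)/n for a value of rank r among n publications from the same year, with tied values sharing their mean rank. The function name, the tie-handling rule, and the toy counts are illustrative assumptions and are not taken from the ERNIE implementation.

```python
from collections import defaultdict

def hazen_percentiles(citation_counts):
    """Map each citation count to its Hazen percentile within one publication year."""
    n = len(citation_counts)
    ordered = sorted(citation_counts)
    # Collect the 1-based ranks of every distinct value so ties share a mean rank.
    ranks = defaultdict(list)
    for position, value in enumerate(ordered, start=1):
        ranks[value].append(position)
    return {
        value: 100.0 * (sum(r) / len(r) - 0.5) / n
        for value, r in ranks.items()
    }

counts_one_year = [0, 1, 1, 3, 250]  # hypothetical citation counts for one year
percentiles = hazen_percentiles(counts_one_year)
```

Under this convention, a publication would enter the highly cited set when its percentile within its own publication year is at least 99.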
Publications of type 'article', each containing at least five cited references and published in the 11 year period from 1985-1995, were subset from Scopus to form a dataset of 3,394,799 publications and 51,801,106 references (8,397,935 unique). For each of these publications, all possible reference pairs were generated and then restricted to those pairs where both members were in the set of highly cited publications (above).
For example, the data for 1985 consisted of 223,485 articles after processing as described above. Computing all reference pairs (that were also members of the highly cited publication set of 768,993) from these 223,485 articles gave rise to 2,600,101 reference pairs (Table 1) that ranged in co-citation frequency from 1 to 874 within the 1985 dataset; from 1 to 11,949 across the 11 year period 1985-1995; and from 1 to 35,755 across all of Scopus. Collectively, the publications in our 1985-1995 dataset generated 33,641,395 unique co-citation pairs, for which we computed co-citation frequencies across all of Scopus.
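The pair-generation step can be sketched as follows. This is an illustrative in-memory version under the assumption that each citing article is available as a list of reference identifiers; the function name and toy identifiers are hypothetical, and the study itself worked against Scopus at a much larger scale.

```python
from collections import Counter
from itertools import combinations

def cocitation_frequencies(reference_lists, highly_cited):
    """Count how many citing articles contain each pair of highly cited references."""
    frequencies = Counter()
    for references in reference_lists:
        # Keep only references in the highly cited set; sorting makes (x, y)
        # and (y, x) the same key.
        kept = sorted(set(references) & highly_cited)
        frequencies.update(combinations(kept, 2))
    return frequencies

# Hypothetical toy data: three citing articles and their cited references.
reference_lists = [["a", "b", "c"], ["a", "b"], ["b", "c", "z"]]
highly_cited = {"a", "b", "c"}
freq = cocitation_frequencies(reference_lists, highly_cited)
```

Here the reference "z" is dropped because it is not in the highly cited set, so it contributes to no pair.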
Derivation of θ. We now show how we define our prior on the probability of x and y being co-cited, based on the citation graph restricted to publications that cite either x or y (but not both) up to the year of their first co-citation. Recall that we defined a proxy co-citation of x and y to be an article that cites one member of the co-cited pair (x, y) and also cites at least one article that cites the other member. The idea behind this definition is that we consider papers that cite x as proxies for x, and papers that cite y as proxies for y. Thus, if a paper a cites both x and y′ (where y′ is a proxy for y), then it is a proxy for a co-citation of x and y. Similarly, if a paper b cites both y and x′ (where x′ is a proxy for x), it is also a proxy for a co-citation of x and y. This motivates the graph-theoretic formulation, which we now formally present.
We fix the pair x, y and we define N(x) to be the set of all publications that cite x (but do not also cite y), and are published no later than the year of the first co-citation of x and y. We similarly define N(y). We define a directed bipartite graph with vertex set N(x) ∪ N(y). Note that if x cites y then x ∈ N(y), and similarly for the case where y cites x. Note also that, since we have restricted N(x) and N(y) in this way, N(x) ∩ N(y) = ∅. We now describe how the directed edge set E(x, y) is constructed. For any pair of articles a, b where a ∈ N(x) and b ∈ N(y), if a cites b then we include the directed edge a → b in E(x, y). Similarly, we include edge b → a if b cites a. Finally, if a pair of articles both cite each other, then the graph has parallel edges. By construction, this graph is bipartite, which means that all the edges go between the two sets N(x) and N(y) (i.e., no edges exist between two vertices in N(x), nor between two vertices in N(y)).
Note that by the definition, every edge in E(x, y) arises because of a proxy co-citation, so that the number of proxy co-citations is the number of directed edges in E(x, y). Consider the situation where a publication a cites x (so that a ∈ N(x)) and also cites b_1, b_2, b_3 in N(y): this defines three directed edges from a to nodes of N(y). We count this as three proxy co-citations, not as one proxy co-citation. Similarly, if we have a publication b that cites y and also cites a_1, a_2, a_3, a_4 in N(x), then there are four directed edges that go from b to nodes in N(x) and we will count each of those directed edges as a different proxy co-citation.
Accordingly, letting |X| denote the cardinality of a set X, we note |E(x, y)|, i.e., the number of directed edges that go between N(x) and N(y), is the number of proxy co-citations between x and y. If no parallel edges are permitted, the maximum number of possible proxy co-citations is |N(x)| × |N(y)|. Under the assumption that N(x) and N(y) each have at least one article, we define θ(x, y), our prior on the probability of x and y being co-cited, as follows:
$$\theta(x, y) = \frac{|E(x, y)|}{|N(x)| \times |N(y)|}.$$
Note that if parallel edges do not occur in the graph, then θ(x, y) ≤ 1, but that otherwise the value can be greater than 1. Note also that θ(x, y) = 0 if E(x, y) = ∅ (i.e., if there are no proxy co-citations) and that θ(x, y) = 1 if every possible proxy co-citation occurs.
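The construction of N(x), N(y), and E(x, y) can be illustrated with a small in-memory sketch. It assumes the citation graph is held as a dictionary from each publication to the set of publications it cites, which is a simplification for illustration only; the function and argument names are hypothetical. The cap at 1 mirrors the treatment of values above 1 described later in this section.

```python
def theta(x, y, cites, year, first_cocitation_year):
    """
    cites: dict mapping each publication to the set of publications it cites.
    year:  dict mapping each publication to its publication year.
    """
    eligible = [p for p in cites if year[p] <= first_cocitation_year]
    # N(x): cites x but not y; N(y): cites y but not x.
    n_x = {p for p in eligible if x in cites[p] and y not in cites[p]}
    n_y = {p for p in eligible if y in cites[p] and x not in cites[p]}
    if not n_x or not n_y:
        raise ValueError("theta is defined only when N(x) and N(y) are non-empty")
    # Directed edges between the two sides; a mutual citation counts twice.
    edges = sum(1 for a in n_x for b in cites[a] if b in n_y)
    edges += sum(1 for b in n_y for a in cites[b] if a in n_x)
    return min(1.0, edges / (len(n_x) * len(n_y)))

# Hypothetical toy graph: a1, a2 cite x; b1, b2 cite y; a1 -> b1 and b2 -> a2.
cites = {
    "x": set(), "y": set(),
    "a1": {"x", "b1"}, "a2": {"x"},
    "b1": {"y"}, "b2": {"y", "a2"},
}
year = {p: 1990 for p in cites}
value = theta("x", "y", cites, year, first_cocitation_year=1995)
```

In the toy graph there are two proxy co-citations out of four possible, so the sketch yields θ = 0.5.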
To efficiently calculate θ, we used the following pipeline. We copied Scopus data from a relational schema in PostgreSQL into a citation graph in the Neo4j 3.5 graph database using an automated Extract Transform Load (ETL) pipeline that combined Postgres CSV export and the Neo4j Bulk Import tool. The graph vertex set is all publications, each with a publication year attribute, and the edge set is all citations between the publications. A Cypher index was created on the publication year. We developed Cypher queries to calculate θ and tuned performance by splitting input publication pairs into small batches and processing them in parallel, using parallelization in Bash and GNU Parallel. Batch size, the number of parallel job slots, and other parameters were tuned for performance, with best results achieved on batch sizes varying from 20 to 100 pairs. The results of θ calculations were cross-checked using SQL calculations. In the small number of cases where θ computed to > 1 (see above), it was set to 1 for the purpose of this study.

Statistical Calculations
We denote the observed co-citation frequency data by the multiset
$$X^o = \{x^o_i : i \in \{1, \ldots, N\}\},$$
where N is the total number of pairs of articles and x^o_i is the observed frequency of the i-th pair of papers being co-cited. Note that this is in general a multiset, as different pairs of articles can have the same co-citation frequency. Let n(x) be the number of times that x appears in X^o (equivalently, n(x) is the number of pairs of articles that are co-cited x times), and let
$$N(x) = \sum_{y=x}^{\infty} n(y)$$
denote the total number of pairs of articles that are co-cited at least x times. Then
$$f^o(x \mid \underline{x}) = \frac{n(x)}{N(\underline{x})}, \qquad x \ge \underline{x}, \tag{1}$$
where x̲ is a parameter we use to analyze the distribution's right tail starting at varying frequencies. We describe in this subsection (i) the statistical computations for fitting lognormal and power law distributions to right tails of the observed co-citation frequency distributions as defined by (1) for various x̲ and (ii) how we assessed the quality of those fits. Further, we performed such analyses for various slices of the data, stratifying by θ and other parameters, as is described in the Results section.
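A short sketch of the conditional right-tail distribution may help fix the notation. It assumes the observed frequencies are available as a plain list; the function name and the toy data are illustrative only.

```python
from collections import Counter

def tail_distribution(frequencies, x_min):
    """Empirical distribution of co-citation frequency, conditional on frequency >= x_min."""
    n = Counter(f for f in frequencies if f >= x_min)  # n(x) restricted to the tail
    total = sum(n.values())                            # number of pairs in the tail
    return {x: count / total for x, count in sorted(n.items())}

observed = [10, 10, 12, 15, 15, 15, 40]  # hypothetical co-citation frequencies
f_obs = tail_distribution(observed, x_min=12)
```

Raising `x_min` discards the lower frequencies and renormalizes, which is how the analysis moves to more extreme portions of the right tail.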
We used a discrete version of a lognormal distribution to represent integer co-citation frequencies, f(·), following [55] and [56], while appropriately normalizing for our conditional assessment of the right tail commencing at x̲; here µ and σ are the mean and standard deviation, respectively, of the underlying normal distribution. These probabilities can be computed with the cumulative normal distribution, using the well-known error function.
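For illustration only, the sketch below shows one common way to discretize a lognormal distribution and condition it on a right tail: the mass assigned to an integer x is the normal probability that ln of the underlying variable falls between ln(x − 0.5) and ln(x + 0.5), divided by the mass above x_min − 0.5. This half-integer convention is an assumption made here and may differ from the discretization used in [55] and [56]; the function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def discrete_lognormal_pmf(x, mu, sigma, x_min):
    """Tail-conditional mass at integer x >= x_min >= 1 (assumed half-integer bins)."""
    upper = normal_cdf((math.log(x + 0.5) - mu) / sigma)
    lower = normal_cdf((math.log(x - 0.5) - mu) / sigma)
    tail_mass = 1.0 - normal_cdf((math.log(x_min - 0.5) - mu) / sigma)
    return (upper - lower) / tail_mass

# Sanity check: the conditional masses should sum to approximately one.
total = sum(discrete_lognormal_pmf(x, 1.0, 1.0, 10) for x in range(10, 5000))
```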
We fit distributions to the co-citation frequency data for various extremities of the right tail, as parameterized by the tail threshold x̲, using a maximum (log) likelihood estimator (MLE). We solved for the best-fit distributional parameters for the lognormal distribution, µ and σ, by modifying a multi-dimensional interval search algorithm from [43] and following [56]. A compiled version of this code using the C++ header file, amoeba.h, is available on our GitHub site [30].
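We fit a discrete power law distribution to the data for various values of the tail threshold x̲, normalized for our conditional observations of the right tail:
$$f_{PL}(x \mid \underline{x}) = \frac{x^{-\alpha}}{\zeta(\alpha, \underline{x})}, \qquad x \ge \underline{x}, \tag{3}$$
where the Hurwitz zeta function,
$$\zeta(\alpha, \underline{x}) = \sum_{i=\underline{x}}^{\infty} i^{-\alpha},$$
is a generalization of the Riemann zeta function, ζ(α, 1), as is needed for analysis of the right tail.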
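We solved first-order conditions for the (log) MLE to find the best-fit distributional exponent α,
$$\frac{\zeta'(\alpha, \underline{x})}{\zeta(\alpha, \underline{x})} = -\frac{1}{N(\underline{x})} \sum_{x \in X^o(\underline{x})} \ln x, \tag{4}$$
as described in [13] and [25], where ζ′ denotes the derivative of ζ with respect to α, X^o(x̲) = {x ∈ X^o : x ≥ x̲} are the observed co-citations with frequencies at least as great as x̲, and N(x̲) is the number of such co-citations. We solved (4) to find α using a bisection algorithm.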
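A self-contained sketch of a bisection solve for the power law exponent follows. It is not the authors' code: the Hurwitz zeta function is approximated here by a partial sum plus an Euler-Maclaurin tail correction, its logarithmic derivative is taken by central differences, and the bracket [1.05, 8] is an assumed search range. All function names and the synthetic data are hypothetical.

```python
import math

def hurwitz_zeta(alpha, q, terms=1000):
    """Approximate sum over k >= 0 of (q + k)^(-alpha) for alpha > 1."""
    head = sum((q + k) ** -alpha for k in range(terms))
    m = q + terms
    # Integral of the remaining tail plus the half-term correction.
    return head + m ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * m ** -alpha

def fit_power_law(counts, x_min, lo=1.05, hi=8.0, iterations=60):
    """counts: dict mapping a frequency to the number of pairs with that frequency."""
    tail = {x: c for x, c in counts.items() if x >= x_min}
    n = sum(tail.values())
    mean_log = sum(c * math.log(x) for x, c in tail.items()) / n

    def score(alpha, h=1e-6):
        # Derivative of the per-observation log-likelihood:
        # -zeta'(alpha, x_min) / zeta(alpha, x_min) - mean(ln x)
        d_log_zeta = (math.log(hurwitz_zeta(alpha + h, x_min))
                      - math.log(hurwitz_zeta(alpha - h, x_min))) / (2.0 * h)
        return -d_log_zeta - mean_log

    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:   # likelihood still increasing: root lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Synthetic tail whose counts follow x^(-2.5), so the fit should recover about 2.5.
counts = {x: round(1e9 * x ** -2.5) for x in range(10, 5001)}
alpha_hat = fit_power_law(counts, x_min=10)
```

Bisection is applicable because the score decreases monotonically in α: a larger exponent always lowers the expected value of ln x under the model.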
We used the χ² goodness-of-fit (χ²) and the Kolmogorov-Smirnov (K-S) tests to assess the null hypothesis that the distribution of the observed co-citation frequencies and the best-fit lognormal distribution are the same, and similarly for the best-fit power law distribution. We also computed the Kullback-Leibler Divergence (K-L) between the observed data and the best-fit distributions.
Both the χ² and K-S tests employed the null hypothesis that the observed co-citation frequencies, n(x) for x ∈ [x̲, ∞), were sampled from the best-fit lognormal or power law distributions, which we denote by f_d(· | x̲) for d ∈ {LN, PL}, while suppressing the parameters specific to each of the distributions.
The usual χ² statistic was computed by first grouping the observed co-citation frequencies into k bins, denoted by b_i for i ∈ {1, ..., k}, and then computing
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},$$
where O_i is the observed number of co-citations having frequencies associated with the i-th bin, and E_i is the expected number of observations for frequencies in bin i, if the null hypothesis was true, in a sample with size equal to the number of observed data points, N(x̲):
$$E_i = N(\underline{x}) \sum_{x \in b_i} f_d(x \mid \underline{x}).$$
If the null hypothesis was true, then we would expect O_i and E_i to be approximately equal, with deviations owing to variability due to sampling.
Constructing the bins b_i requires only that E_i ≥ 5 for every i = 1, ..., k. Test outcomes are sometimes sensitive to the minimum E_i permitted, which we will denote by E, and so we tested with multiple thresholds, including 10, 20, 50, and 70. Furthermore, statistical tests are stochastic: these multiple tests permitted a reduction in the probability of erroneously rejecting or accepting the null hypothesis based on a single test. The distribution of observed co-citation frequencies was skewed right with a long tail, so that aggregating bins to satisfy E_i ≥ E was most critical in the right tail. This motivated a bin construction algorithm that aggregated frequencies in reverse order, starting with the extreme right tail. Algorithm 1 requires a set of the unique observed co-citation frequencies, X̃^o, which includes the elements of the multiset X^o without repetition. While Algorithm 1 does not guarantee in general that all bins satisfy E_i ≥ E, that criterion was satisfied for the observed data.
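The reverse-order aggregation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than a transcription of Algorithm 1: bins cover only the unique observed frequencies, a bin is closed as soon as its expected count reaches the threshold, and any leftover low frequencies form a final bin that may fall short of the threshold. Function and variable names are hypothetical.

```python
def build_bins(unique_frequencies, expected_count, min_expected):
    """
    Aggregate frequencies into bins, starting from the extreme right tail.
    expected_count: function giving the expected number of observations at x.
    """
    bins, current, accumulated = [], [], 0.0
    for x in sorted(unique_frequencies, reverse=True):
        current.append(x)
        accumulated += expected_count(x)
        if accumulated >= min_expected:
            bins.append(current)
            current, accumulated = [], 0.0
    if current:
        # Leftover low frequencies; this last bin may be below min_expected.
        bins.append(current)
    return bins

def chi_square(bins, observed_count, expected_count):
    """Sum of (O_i - E_i)^2 / E_i over the bins."""
    stat = 0.0
    for b in bins:
        o = sum(observed_count(x) for x in b)
        e = sum(expected_count(x) for x in b)
        stat += (o - e) ** 2 / e
    return stat

# Hypothetical expected and observed counts by frequency.
expected = {10: 8.0, 11: 6.0, 12: 3.0, 13: 2.0, 14: 1.0}
observed = {10: 10, 11: 4, 12: 3, 13: 2, 14: 1}
bins = build_bins(expected, expected.get, min_expected=5)
stat = chi_square(bins, observed.get, expected.get)
```

In the toy data the three sparsest frequencies (14, 13, 12) are merged into one bin with an expected count of 6, while 11 and 10 each form their own bin.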

Algorithm 1 Frequency Bin Construction
We implemented a K-S test using simulation to generate a sampling distribution to account for the discrete frequency observations [54]. We denote the cumulative distribution of observed co-citation frequencies by
$$F^o(x \mid \underline{x}) = \sum_{i=\underline{x}}^{x} f^o(i \mid \underline{x}),$$
and the best-fit cumulative distribution by
$$F_d(x \mid \underline{x}) = \sum_{i=\underline{x}}^{x} f_d(i \mid \underline{x}).$$
The K-S test involves testing the maximum absolute difference between the observed and theorized cumulative distributions,
$$D_n = \max_{x \ge \underline{x}} \left| F^o(x \mid \underline{x}) - F_d(x \mid \underline{x}) \right|,$$
where n is the number of observations giving rise to F^o(x | x̲), against the distribution of such differences between samples from the theorized distribution with the same number of observations, n,
$$\tilde{D}_n = \max_{x \ge \underline{x}} \left| \tilde{F}_j(x \mid \underline{x}) - F_d(x \mid \underline{x}) \right|,$$
where F̃_j is the empirical distribution of sample j of size n (notation suppressed) drawn from F_d(x | x̲). We generated 100 such random variables D̃_n for each test. We reject the null hypothesis if D_n is larger than substantially all of the D̃_n, say all but 5%, for equivalence with a p-value of 0.05. The number of D̃_n samples drawn yields a p-value with a resolution of 1%.
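A minimal sketch of a simulation-based K-S test for discrete data is given below. It assumes the fitted model is supplied as a probability mass function over a finite support (that is, a truncated tail, which is a simplification made here), and it uses Python's standard `random.choices` for sampling. The function names, the seed, and the toy model are hypothetical.

```python
import random
from collections import Counter
from itertools import accumulate

def ks_distance(sample, support, model_cdf):
    """Maximum absolute gap between the sample's empirical CDF and the model CDF."""
    n = len(sample)
    counts = Counter(sample)
    empirical_cdf = accumulate(counts.get(x, 0) / n for x in support)
    return max(abs(e - m) for e, m in zip(empirical_cdf, model_cdf))

def simulated_ks_p_value(observed, model_pmf, draws=100, seed=0):
    """Share of simulated distances that are at least as large as the observed one."""
    rng = random.Random(seed)
    support = sorted(model_pmf)
    weights = [model_pmf[x] for x in support]
    total = sum(weights)
    model_cdf = list(accumulate(w / total for w in weights))
    d_obs = ks_distance(observed, support, model_cdf)
    exceed = 0
    for _ in range(draws):
        sample = rng.choices(support, weights=weights, k=len(observed))
        if ks_distance(sample, support, model_cdf) >= d_obs:
            exceed += 1
    return exceed / draws

# Hypothetical model: equal mass on frequencies 10..19.
model = {x: 1.0 for x in range(10, 20)}
p_mismatch = simulated_ks_p_value([10] * 500, model)
p_match = simulated_ks_p_value([x for x in range(10, 20) for _ in range(50)], model)
```

With 100 draws the returned value moves in steps of 0.01, matching the 1% resolution noted above; the null hypothesis would be rejected when the value falls below 0.05.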
We computed the K-L Divergence two ways due to its asymmetry:
$$D_{KL}(f^o \,\|\, f_d) = \sum_{x \ge \underline{x}} f^o(x \mid \underline{x}) \log \frac{f^o(x \mid \underline{x})}{f_d(x \mid \underline{x})} \quad \text{and} \quad D_{KL}(f_d \,\|\, f^o) = \sum_{x \ge \underline{x}} f_d(x \mid \underline{x}) \log \frac{f_d(x \mid \underline{x})}{f^o(x \mid \underline{x})}.$$
Separate from the tests above, we tested whether the distribution of co-citation frequencies was independent of θ using a χ² test, with the null hypothesis that the co-citation frequency distribution was independent of θ. We initially created a contingency table on binned values of θ and co-citation frequency. We subsequently aggregated these bins so that the expected number of co-citations in each bin was at least 5, which accounts for the decreasing number of observations as θ and frequency increase; this left just two intervals for frequency: {[10, 100), [100, 100000]}.
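The asymmetry of the K-L Divergence is easy to see in a small sketch. The version below assumes natural logarithms and distributions stored as dictionaries over a shared discrete support; both choices, and the toy values, are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """D(p || q); returns infinity if q has no mass where p does."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # terms with zero mass in p contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

observed = {10: 0.5, 11: 0.5}   # hypothetical observed distribution
model = {10: 0.25, 11: 0.75}    # hypothetical best-fit distribution
forward = kl_divergence(observed, model)
reverse = kl_divergence(model, observed)
```

The two directions give different values for the same pair of distributions, which is why both are reported.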
Kinetics of Co-citation. We extended prior work on delayed recognition and the Sleeping Beauty phenomenon [27,59,33,23] towards co-citation. We have modified the beauty coefficient (B) of [27] to address co-citations by: (i) counting citations to a pair of publications (co-citations) rather than citations to individual papers, (ii) setting t_0 (age zero) to the first year in which a pair of publications could be co-cited (i.e., the publication year of the more recently published member of a co-cited pair), and (iii) setting C_0 to the number of co-citations occurring in year t_0. Rather than calculate awakening time as in [27], we opted to measure the simpler length of time between t_0 and the first year in which a co-citation was recorded; we label this measurement as the timelag t_l, so that t_l = 0 if a co-citation was recorded in t_0.
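As a sketch of these two measurements, the code below assumes the beauty coefficient takes the form given in [27], B = Σ_{t=0}^{t_m} [ (c_{t_m} − c_0)/t_m · t + c_0 − c_t ] / max(1, c_t), where c_t is the count at age t and t_m is the age of peak count, applied here to yearly co-citation counts indexed from t_0. The function names and the toy series are hypothetical.

```python
def beauty_coefficient(yearly_cocitations):
    """yearly_cocitations[t] is the number of co-citations at age t (t = 0 at t_0)."""
    c = yearly_cocitations
    t_m = max(range(len(c)), key=lambda t: c[t])  # earliest year of peak co-citation
    if t_m == 0:
        return 0.0  # peak in the first year: no dormancy to measure
    slope = (c[t_m] - c[0]) / t_m
    # Gap between the reference line and the observed count, scaled by the count.
    return sum((slope * t + c[0] - c[t]) / max(1, c[t]) for t in range(t_m + 1))

def timelag(yearly_cocitations):
    """Years from t_0 to the first recorded co-citation (raises if there is none)."""
    return next(t for t, count in enumerate(yearly_cocitations) if count > 0)

series = [0, 0, 0, 0, 10, 6]  # hypothetical pair: dormant four years, then co-cited
b = beauty_coefficient(series)
t_l = timelag(series)
```

A pair whose co-citations grow linearly from t_0 scores zero, whereas a long dormant stretch followed by a sharp peak yields a large coefficient.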

Results and Discussion
Our base dataset, described in Table 1, consists of the 33,641,395 co-cited reference pairs (33.6 million pairs) and their co-citation frequencies, gathered from Scopus during the 11-year period from 1985-1995 (Materials and Methods). The distribution of co-citation frequencies has a strikingly long right tail, with a minimum co-citation frequency of 1, a median of 2, and a maximum co-citation frequency of 51,567 (Figure 2). Approximately 33.3 of 33.6 million pairs (99% of observations) have co-citation frequencies ranging from 1-67 and the remaining 1% have co-citation frequencies ranging from 68-51,567. Since the focus of our study was co-citations of frequently cited publications, we further restricted this dataset to those pairs with a co-citation frequency of at least 10, which resulted in a smaller dataset of 4,119,324 co-cited pairs (4.1 million pairs) with a minimum co-citation frequency of 10, a median of 18, and a maximum co-citation frequency of 51,567. In order to focus on co-citations derived from highly cited publications, θ was calculated for all pairs with a co-citation frequency of at least 10. We also noted whether one article in a co-citation pair cites the other (connectedness).
Influenced by the use of linked co-citations for clustering [52], we also examined the extent to which members of a co-cited pair were also found in other co-cited pairs. We found that 205,543 articles contributed to 4.12 million co-cited pairs. The highest frequency observed in our dataset, 51,567 co-citations, was for a pair of articles from the field of physical chemistry: Becke (1993) [3] and Lee, Yang, and Parr (1988) [32]. The members of this pair are not connected and are found in a total of 1,504 co-cited pairs with frequencies ranging from 10 to 51,567. The second highest frequency, 28,407 co-citations, was for a pair of articles from the field of biochemistry: Laemmli [31] and Bradford [9]. Members of this pair are not connected and are found in 41,909 co-cited pairs, 24,558 for the Laemmli gel electrophoresis article and 17,352 for the Bradford protein estimation article. Both articles in this second pair describe methods heavily used in biochemistry and molecular biology, an area with strong referencing activity, so this result is not entirely surprising.
Having developed θ(x, y) as a prediction of the probability that articles x and y would be co-cited, we first tested whether the distribution of co-citation frequencies was independent of θ (Materials and Methods). The null hypothesis that the co-citation frequency distribution was independent of θ was rejected with a very small p-value: the statistical software indicated a p-value with no significant non-zero digits. We next investigated what distribution functions might fit the frequencies of co-citation as θ varied.
Based on the long tails of citation frequencies, prior research has assessed the fit of lognormal and power law distributions [55,47,56]. We noted long right tails in co-citation frequencies, which, similarly, motivated us to assess the fit of lognormal and power law distributions to co-citation data. Further, we stratified the data according to (i) the minimum frequency for the right tail x̲, (ii) θ, and (iii) whether the two members of each co-citation pair were connected. Figure 3 shows which distribution, if either, fits the data in each slice, based on tests of statistical significance. Note that there were no circumstances where both distributions fit: if one fit, then the other did not.
Statistical tests were not possible for some slices due to an insufficient number of data points. This was the case for certain combinations of large x̲, large θ, and co-citations that were not connected. The number of data points necessarily decreases as x̲ increases, and we found the decrease in the number of data points to be more precipitous when θ was large and co-citations were unconnected due to the lighter right tails for these parameter combinations. The graph in the right panel of Figure 4, which has a logarithmic y-axis, shows that the number of data points per θ interval analyzed decreases most often by more than an order of magnitude from one interval to the next as θ increases. Most pairs of publications that are co-cited at least ten times, therefore, have small values of θ.
Figure 3 indicates when the null hypothesis of a best-fit lognormal or power law fitting the observed data cannot be rejected. We computed two types of statistics for evaluating the null hypothesis (χ² and K-S) and, moreover, we computed the χ² statistic for four binning strategies. Figure 3 indicates a distributional fit, specifically, if either the K-S p-value is greater than 0.05 or if two or more of the χ² p-values are greater than 0.05. While we computed the K-L Divergence (see supplementary material), we did not use these computations for formal statements of distributional fit because the K-L Divergence is not a norm and does not determine statistical significance. These K-L computations did, however, support the findings based on formal tests of statistical significance.
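The decision rule used for Figure 3 can be written compactly. The sketch below simply encodes the rule as stated, with a K-S p-value and the χ² p-values from the four binning strategies as inputs; the function name and the example p-values are hypothetical.

```python
def distribution_fits(ks_p_value, chi_square_p_values, alpha=0.05):
    """A fit is declared if the K-S test passes or at least two chi-square tests pass."""
    chi_passes = sum(1 for p in chi_square_p_values if p > alpha)
    return ks_p_value > alpha or chi_passes >= 2

fits_by_chi = distribution_fits(0.01, [0.20, 0.30, 0.01, 0.00])   # two chi-square passes
fits_by_ks = distribution_fits(0.50, [0.00, 0.00, 0.00, 0.00])    # K-S pass only
no_fit = distribution_fits(0.01, [0.20, 0.01, 0.00, 0.00])        # one chi-square pass
```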
Power law distributions fit most often when co-citations are connected (Fig. 3), when more extreme right tails are considered, and when co-citations have small values of θ. Lognormal distributions fit, conversely, in some circumstances, when a greater portion of the right tail is considered. These observations support the existence of heavy tails for θ small, even if a lognormal distribution fits the observed data more broadly. This observation is consistent with our observations of the most frequent co-citations having small θ values, as shown in the scatter plot in the left panel of Figure 4.
Mitzenmacher [38] shows a close relationship between the power law and lognormal distributions vis-à-vis subtle variations in generative mechanisms that determine whether the resulting distribution is power law or lognormal. The stratified layers in Figure 3 where a lognormal distribution fits for some portion of the right tail and, in the same instance, a power law describes the more extreme tail, may, therefore, be due to a generative mechanism whose parameters are close to those for a power law distribution as well as those for a lognormal distribution.
Table 2 shows the exponents of the best-fit power law distributions when statistical tests indicated that a power law was a good fit and where comparisons were possible among the intervals of θ: these were possible for θ intervals of [0.0, 0.2) and [0.2, 0.4), for connected co-citations, and right tails commencing at x̲ ∈ {200, 250, 300}. The power law exponent α in these comparisons was smaller for θ ∈ [0.0, 0.2) than for θ ∈ [0.2, 0.4), indicating heavier tails for small θ and, therefore, a greater chance of extreme co-citation frequency.
Figure 5 shows a log-log plot of the number of co-citations (y-axis) exhibiting the counts on the x-axis, for θ in the interval [0.0, 0.2) (note that both axes employ log scaling). The pattern for points below the 99th percentile clearly indicates that the number of co-citations referenced at a given frequency decreases greatly as the frequency increases. Also, the broadening of the scatter where fewer co-citations are cited more frequently is indicative of a long right tail, as has been observed in other research where lognormal or power law distributions have been fit to data, as in [39].
Perline [42] warns against fitting a power law function to truncated data. Informally, a portion of the entire data set can appear linear on a log-log plot, while the entire data set would not. He cites instances where researchers have mistakenly characterized an entire data set as following a power law due to an analysis of only a portion of the data, when a lognormal distribution might provide a better fit to the entire data set. Indeed, the scatter plot in Figure 5 is not linear and so, as Figure 3 shows, a power law does not fit the entire data set. This is what Perline calls a weak power law where a power law distribution function fits the tail, but not the entire distribution. Our concern, however, is not with characterizing the distributional function for the entire data set, but with characterizing the features of high frequency co-citations, which by definition means we were concerned with the right tail of the distribution. Moreover, the results avoid confusion between lognormal and power law distribution functions because we have shown not only that a power law provides a statistically significant fit, but also that a lognormal distribution function does not fit.
Our analysis found particularly heavy tails that were well fit by power law distributions for small θ, in the intervals [0.0, 0.2) and [0.2, 0.4), and for co-citations whose constituents are connected, as shown in Fig. 3. The closely related Matthew Effect [36], cumulative advantage [45], and the preferential attachment class of models [1] provide a possible explanation for citation frequencies following a power law distribution for some sufficiently extreme portion of the right tail. For greater values of θ, insufficient data in the right tails precludes a definitive assessment in this regard, although one might argue that the lack of observations in the tails is counter to the existence of a power law relationship. It is also noteworthy that the exponents we found for co-citations (Table 2) are close in value to those reported for citations by [45] and [47].
Delayed Co-citations. The delayed onset of citations to a well cited publication, also referred to as 'Delayed Recognition' and 'Sleeping Beauty', has been studied by Garfield, van Raan, and others [21,60,27,59,33,23,5]. We sought to extend this concept to frequently co-cited articles. As an initial step, we calculated two parameters (Materials and Methods): (1) the beauty coefficient [27] modified for co-cited articles and (2) timelag t_l, the length of time between the first possible year of co-citation and the first year in which a co-citation was recorded. We further restricted our consideration of delayed co-citations to the 95th percentile or greater of co-citation frequencies in our dataset of 4.1 million co-cited pairs. Within the bounds of this restriction, 24 co-cited pairs have a beauty coefficient of 1,000 or greater and all 24 are in the 99th percentile of co-citation frequencies. Thus, very high beauty coefficients are associated with high co-citation frequencies.
We also examined the relationship of t_l with co-citation frequencies (Fig. 6) and observed that high t_l values were associated with lower co-citation frequencies. These data appear to be consistent with a report from van Raan and Winnink [60], who conclude that 'probability of awakening after a period of deep sleep is becoming rapidly smaller for longer sleeping periods'. Further, when two articles are connected, they tend to have smaller t_l values compared to pairs that are not connected in the same frequency range.
(Figure caption, partially recovered.) The y-axis shows the number of co-cited pairs observed having the citation counts plotted along the x-axis. The tightly clustered plot below the 99th percentile demonstrates a clear pattern of decreasing number of co-cited pairs having an increasing number of citation counts. The scatter plot for the tail above the 99th percentile broadens, indicating a long tail of relatively few co-cited pairs that were cited with extreme frequency.

Conclusions
In this article, we report on our exploration of features that impact the frequency of co-citations. In particular, we wished to examine article pairs with high co-citation frequencies with respect to whether they originated from the same school(s) of thought or represented novel combinations of existing ideas. However, defining a discipline is difficult, and determining the discipline(s) relevant to specific publications remains a challenging problem. Journal-level classifications of disciplines have known limitations, and while article-level approaches offer some advantages, they are not free of their own limitations [37].
Consequently, we designed θ, a statistic that examines the citation neighborhood of a pair of articles x and y to estimate the probability that they would be co-cited. Our approach has advantages compared to alternate approaches: it avoids the challenges of journal-level analyses, it does not require a definition of "discipline" (or "disciplinary distance"), it does not require assignment of disciplines to articles, it is computationally feasible, and, most importantly, it enables an evaluation that is specific to a given pair of articles.
We note that when x and y are from the same sub-field, θ may be very large; conversely, when x and y are from very different fields, it might be reasonable to expect that θ will be small. Thus, in a sense, θ may correlate with disciplinary similarity, with large values of θ reflecting conditions where the two publications are in the same (or very close) sub-disciplines, and small values of θ reflecting that the disciplines of the two publications are very distantly related. We also comment that in this initial study, we have not considered second-degree information, that is, publications that cite publications that cite an article of interest.
Our data indicate that the most frequent co-citations occur when co-citations have small values of θ, as shown in Figure 4. Our study considered the hypothesis that the frequency distribution is independent of θ, but our statistical tests rejected this hypothesis, and showed instead that the frequency distribution is best characterized by a power law for small values of θ and connected publications, and in many other regions is best characterized by a lognormal distribution.
The observation of power laws for small values of θ and connected co-citations is consistent with the theory of preferential attachment for these parameter settings. To the extent that preferential attachment is the mechanism giving rise to a power law, this suggests that preferential attachment is, at least, stronger for small θ values and connected co-citations than for other parameter combinations, or that preferential attachment is not applicable to other parameter values.
Observing power laws, heavy tails, and pairs with extreme co-citation strength for small values of θ (i.e., pairs that have small a priori probabilities of being co-cited) may seem, on its face, paradoxical. One possible explanation of the pairs in the extreme right tail with both small θ and large co-citation strength is that those pairs represent novel combinations of ideas that, when recognized within the research community, catalyze an increased citation rate, consistent with preferential attachment coupled to time-dependent initial attractiveness [20] as an underlying generative mechanism. However, small values of θ do not guarantee a high co-citation count: indeed, even for small values of θ, co-citations with a power law predominantly have relatively low co-citation strength.
We also note the increasing proportion of connected pairs as the percentile for co-citation frequency increases (Fig. 2); this pair of parameters appears to be associated with a fertile environment where extremely high co-citation frequencies are possible. This observation raises the question of whether small values of θ and connected co-citations are associated with preferential attachment and, if a causal relationship exists, then how do θ and co-citation connection provide an environment supporting preferential attachment? A possibility is that one article in a co-cited pair citing the other makes the potential significance of the combination of their ideas apparent to researchers. The clear pattern of the highest frequency co-cited pairs typically having low θ values suggests that these pairs are highly cited and hence impactful because of the novelty in the ideas or fields that are combined (as reflected in low θ). However, other factors should be considered, such as the prominence of authors and prestige of a journal [22] where the first co-citation appears.
We did not apply field-normalization techniques when assembling the parent pool of 768,993 highly cited articles consisting of the top 1% of highly cited articles from each year in the Scopus bibliography. Thus, the highly co-cited pairs we observe are biased towards high-referencing areas such as biomedicine and parts of the physical sciences [51]. However, the dataset we analyzed has a lower bound of 10 on co-citation frequencies and includes pairs from fields other than those that are high referencing. For example, the maximum t_l we observed in the dataset of 4.1 million pairs was 149 years, and is associated with a pair of articles independently published in 1840, establishing their eponymous Staudt-Clausen theorem [12,61]; this pair of articles was apparently co-cited 10 times since their publication. A second pair of articles concerning electron theory of metals [17,18] was first co-cited in 1994 and has been co-cited 109 times in total, with an observed t_l of 94 years. Both cases are drawn from mathematics and physics rather than the medical literature. They are also consistent with the suggestion that the probability of awakening is smaller after a period of deep sleep [60]. As we have defined t_l, with its heavy penalty for early citation, we create additional sensitivity to coverage and data quality, especially for pairs with low citation numbers. Indeed, for the Staudt-Clausen pair, a manual search of other sources revealed an article [11] in which they are co-cited. Both of these articles were originally published in German, and it is possible that additional co-citations were not captured. Thus, big data approaches that serve to identify trends should be accompanied by more meticulous case studies, where possible. Other approaches for examining depth of sleep and awakening time should certainly be considered [59,27].
In addition, using our approach to revisit invisible colleges [46,16,52] seems warranted, since the upper bound of a hundred members predicted by [46] is likely to have increased in a global scientific enterprise with electronic publishing and social media.
Finally, we view these results as a first step towards further investigation of co-citation behavior, and we introduce a new technique based on exploring first-degree neighbors of co-cited publications. We are hopeful that this graph-theoretic study will stimulate new approaches that provide additional insights and prove complementary to other article-level approaches.