## Abstract

We analyze the effects on clustering solution accuracy of enhancing direct citations, with respect to publication–publication relatedness measurement, by indirect citation relations (bibliographic coupling, cocitation, and extended direct citations) and by text relations. For comparison, we include each approach that is involved in the enhancement of direct citations. In total, we investigate the relative performance of seven approaches. To evaluate the approaches, we use a methodology proposed by earlier research. However, the evaluation criterion used is based on MeSH, one of the most sophisticated publication-level classification schemes available. We also introduce an approach, based on interpolated accuracy values, by which overall relative clustering solution accuracy can be studied. The results show that the cocitation approach has the worst performance, and that the direct citations approach is outperformed by the other five investigated approaches. The extended direct citations approach has the best performance, followed by an approach in which direct citations are enhanced by the BM25 textual relatedness measure. An approach that combines direct citations with bibliographic coupling and cocitation performs slightly better than the bibliographic coupling approach, which in turn has a better performance than the BM25 approach.

## 1. INTRODUCTION

Community detection in citation networks, which is the topic of this paper, can be performed in order to analyze both the obvious and the more subtle relations between scientific publications, as well as to identify subfields of science (e.g., Chen & Redner, 2010; Klavans & Boyack, 2017; Waltman & Van Eck, 2012). In the context of networks, communities are clusters of closely connected nodes within a network. Communities of this kind are found not only in citation networks, but also in many other networks, such as biological networks, the World Wide Web, social networks, and collaboration networks (Girvan & Newman, 2002).

Citation networks originate from the relationships between citing and cited publications. Community structure can often be observed in these networks, because publications dealing with a given topic tend to cite topically similar publications. Communities in a citation network thereby contain similar publications regarding a single topic or a set of related topics. For a given field, community detection in a citation network can be used to uncover related publications. The detected subfields, and the interrelations between them, might then be useful for researchers and policy makers, because they provide an at-a-glance overview of the structure of the field.

Although several studies on community detection in citation networks have been performed in recent years, we have not found many such studies that discriminate, based on some notion of importance, between citation relations. However, Small (1997) explored the idea of combining direct citation information with indirect citation information. Persson (2010) used weighted direct citations, where the citations were weighted by shared references and cocitations in order to decompose a citation network. Persson investigated the field of library and information science and obtained meaningful subfields by removing direct citations with weights below a certain threshold and by removal of less frequently cited publications. The study by Fujita, Kajikawa, et al. (2014) constitutes another example of a study using weighted direct citations. Different types of weighted citation networks were studied with regard to detection of emerging research fields, where the weights were based, for instance, on reference lists and keyword similarity. Chen, Fengxia, and Wang (2013) proposed a community discovery algorithm to uncover semantic communities in a citation semantic link network. In that study, direct citations were weighted on the basis of common keywords. A fifth example of a study that discriminates between direct citation relations is the work by Chen, Xiao, Deng, and Zhang (2017). These authors used two publication data sets and modularity-based clustering of publications, and compared clustering solutions obtained on the basis of four approaches, where the main difference between these approaches is how the relatedness of two publications is defined. One of the approaches is based on direct citations, whereas the other three weight the direct citations in three different ways. All of the latter three approaches use textual similarities as weights, and two of them take term position information into account. The study by Chen et al. (2017) inspired us to perform another study, in which we investigated the relative clustering solution accuracy of nine publication–publication relatedness measures (Ahlgren, Chen, et al., 2019).

One can distinguish between two types of methods used for citation network community detection. One type consists of methods based only on the topological structure of the network, that is, the arrangement of publications (nodes) and citation relations (links) (e.g., Boyack & Klavans, 2014; Chen & Redner, 2010; Haunschild, Schier, et al., 2018; Kajikawa, Yoshikawa, et al., 2008; Klavans & Boyack, 2017; Kusumastuti, Derks, et al., 2016; Ruiz-Castillo & Waltman, 2015; Sjögårde & Ahlgren, 2018, 2020; Subelj, Van Eck, & Waltman, 2016; Waltman & Van Eck, 2012; Yudhoatmojo & Samuar, 2017), whereas the other type consists of methods that also use publication content, represented by text. To take both topological structure and content into account in an analysis of citation networks might be fruitful. This has been done, as we have seen, in community detection analyses and with regard to direct citations (Chen, Fengxia, & Wang, 2013; Chen et al., 2017; Fujita et al., 2014), but it has also been done in studies in which bibliographic coupling or cocitation has been used as the citation relation (e.g., Ahlgren & Colliander, 2009; Glänzel & Thijs, 2017; Meyer-Brötz, Schiebel, & Brecht, 2017; Yu, Wang, et al., 2017). However, taking both topological structure and content into account has also been done in studies not involving community detection. Cohn and Hofmann (2001) described a joint probabilistic model for modeling the contents and interconnectivity of publication collections such as sets of research publications, and Hamedani, Kim, and Kim (2016) presented a novel method called SimCC that considers both citations and content in the calculation of publication–publication similarity.

Even though the last two papers referred to in the preceding paragraph did not involve community detection in citation networks, they provide ideas that can be used for community detection in such networks. Indeed, in this study we use both topological structure and content information in citation networks to detect communities. We build on the earlier work by Chen et al. (2017) on the weighting of citation relations, as well as on the work by Waltman, Boyack, et al. (2017, 2019) on a principled methodology for evaluating the accuracy of clustering solutions using different relatedness measures. In this study, which is an extension of the study performed by Ahlgren et al. (2019), we analyze the effects on clustering accuracy of enhancing direct citations, with respect to publication–publication relatedness measurement, by indirect citation relations and text relations. In total, we investigate seven approaches, compared to six in Ahlgren et al. (2019). In one of these, direct citations are enhanced by both bibliographic coupling and cocitation, whereas in another, direct citations are enhanced by text relations. We also include an approach, based on extended direct citations, that takes direct citation relations within an extended set of publications into account. For comparison, we include each approach that is involved in the enhancement of direct citations. Finally, we introduce a methodology by which overall relative clustering solution accuracy can be studied; this methodology was not used in Ahlgren et al. (2019).

Compared to the study by Chen et al. (2017), a considerably larger publication set is used in our study, as well as a more sophisticated evaluation methodology, in which an external subject classification scheme, Medical Subject Headings (MeSH), is used. MeSH is one of the most sophisticated publication-level classification schemes available. Moreover, in contrast to the earlier work, we use a different approach regarding the combination of direct citations and text relations. Unlike our study, Waltman et al. (2017, 2019) did not evaluate hybrid relatedness approaches (approaches combining citation and text relations). Further, in their analysis, citation-only approaches were only compared to other such approaches, and the same was the case for text-only approaches. An advantage of our study is that comparisons across such approach groups could be made, owing to the use of MeSH as an independent evaluation criterion.

The remainder of the paper is organized as follows. In the next section, we deal with data and methods, whereas the results of the study are reported in the third section. In the final section, we provide a discussion as well as conclusions.

## 2. DATA AND METHODS

Because direct citations are used in the study, we needed a sufficiently long publication period. We decided to use a five-year period, namely 2013–2017. Initially, a set of 4,260,452 publications was retrieved from MEDLINE, the largest subset of PubMed, where the query included a reference to the publication period. The following query was used: *MEDLINE[SB] AND (“2013/01/01”[PDat] : “2017/12/31”[PDat])*. From the initially retrieved set, we retained those publications with a print year in the interval 2013–2017, which yielded a set of 4,191,763 publications. Because PubMed does not contain citation relations between publications, we also used Web of Science (WoS) data. The next step was to match, using PMID data, each publication in this set to publications included in the in-house version of the WoS database available at the Centre for Science and Technology Studies (CWTS) at Leiden University, which yielded a set of 3,577,358 publications. From this latter set, we selected each publication *p* such that *p* satisfies each of the following four conditions:

1. *p* has a WoS publication year in the period 2013–2017.
2. *p* is of WoS document type *Article* or *Review*.
3. *p* has both an abstract and a title with respect to its WoS record.
4. *p* has a citation relation to at least one publication *p*′ such that *p*′ satisfies points 1–3 in this list.

A total of 2,941,119 publications satisfied all four conditions. However, 10 of these publications were removed, because they are not indexed with MeSH descriptors in PubMed. Such descriptors are needed by our evaluation methodology (see subsection 2.3). Our final publication set, *P*_{MEDLINE}, then consists of 2,941,109 publications.
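As an illustration, the four selection conditions can be sketched as follows. The record fields and data structures are hypothetical, not the actual WoS/PubMed schema:

```python
# Sketch of the four selection conditions (hypothetical record fields).

def meets_conditions_1_to_3(pub):
    """Conditions 1-3: publication year, document type, title and abstract."""
    return (2013 <= pub["wos_year"] <= 2017
            and pub["doc_type"] in {"Article", "Review"}
            and bool(pub["title"]) and bool(pub["abstract"]))

def select_publications(pubs, citations):
    """citations: set of (citing_id, cited_id) pairs (used for condition 4)."""
    core = {p["id"] for p in pubs if meets_conditions_1_to_3(p)}
    # Condition 4: a citation relation to at least one publication satisfying 1-3.
    linked = ({i for (i, j) in citations if j in core and i != j}
              | {j for (i, j) in citations if i in core and i != j})
    return [p for p in pubs if p["id"] in core and p["id"] in linked]
```

The two-pass structure reflects that condition 4 refers back to publications satisfying conditions 1–3.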

### 2.1. Investigated approaches

As stated above, we compare seven approaches to publication community detection in this study. The main difference between the approaches is how the relatedness of two publications is defined. Five of the approaches—DC (direct citations), EDC (extended direct citations), BC (bibliographic coupling), CC (cocitation), and DC-BC-CC (combination of direct citations, bibliographic coupling, and cocitation)—use only citation relations. Of the remaining two approaches, BM25 and DC-BM25, BM25 uses only text relations, whereas DC-BM25 combines direct citations with text relations. We now describe the seven approaches in more detail.

#### DC

In the DC approach, the relatedness of publications *i* and *j*, $r_{ij}^{\mathrm{DC}}$, is defined as

$$r_{ij}^{\mathrm{DC}} = \max(c_{ij}, c_{ji}) \tag{1}$$

where *c*_{ij} is 1 if *i* cites *j*, 0 otherwise. Thus, the relatedness is 1 if there is a direct citation from *i* to *j* or such a relation from *j* to *i*; otherwise the relatedness is 0.
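As a minimal sketch, the binary DC relatedness can be computed from a set of citation pairs (an illustrative representation):

```python
def dc_relatedness(i, j, cites):
    """Eq. 1: relatedness is 1 if i cites j or j cites i, else 0.
    cites is a set of (citing, cited) publication-id pairs."""
    return 1 if (i, j) in cites or (j, i) in cites else 0
```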

#### EDC

The EDC approach takes into account not only direct citation relations within *P*_{MEDLINE}, but also direct citation relations within an extended set of publications. Let *N* be the number of publications under consideration, the so-called *focal* publications in the terminology of Waltman et al. (2017, 2019). In order to cluster the focal publications 1, …, *N*, we also take the publications *N* + 1, …, *N*^{EXT} into account, where each *j* (*j* = *N* + 1, …, *N*^{EXT}) has a direct citation relation with at least two of the focal publications. The relatedness of *i* and *j*, $r_{ij}^{\mathrm{EDC}}$, where *i* = 1, …, *N* and *j* = 1, …, *N*^{EXT}, is defined as

$$r_{ij}^{\mathrm{EDC}} = \max(c_{ij}, c_{ji}) \tag{2}$$

where *c*_{ij} and *c*_{ji} are as in Eq. 1. Thus, the same relatedness measure is used in the EDC approach as in the DC approach. However, the former approach also considers direct citation relations between the focal publications and the additional *N*^{EXT} − *N* publications. Note that direct citation relations are not considered within the additional publications (*i* takes values in the set {1, …, *N*}). In this study, *N*^{EXT} − *N* = 7,899,313, and the additional publications are published in the period 1980–2012. Thus, because the focal publications are published in the period 2013–2017, each additional publication is cited by at least two focal publications.
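As a sketch, the additional (nonfocal) publications can be collected as follows; `focal` and `cites` are illustrative data structures, not the actual CWTS database interface:

```python
def extended_publications(focal, cites):
    """Candidate additional publications: non-focal publications with a
    direct citation relation to at least two distinct focal publications.
    cites: iterable of (citing, cited) publication-id pairs."""
    focal = set(focal)
    neighbors = {}
    for citing, cited in cites:
        if citing in focal and cited not in focal:
            neighbors.setdefault(cited, set()).add(citing)
        elif cited in focal and citing not in focal:
            neighbors.setdefault(citing, set()).add(cited)
    return {p for p, linked in neighbors.items() if len(linked) >= 2}
```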

#### BC

Here, the relatedness of *i* and *j*, $rijBC$, is defined as the number of shared cited references in *i* and *j*, where only cited references pointing to publications covered by the CWTS in-house version of WoS are taken into account.

#### CC

The relatedness of *i* and *j*, $rijCC$, is defined as the number of publications that cite both *i* and *j*.
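Both the BC and the CC counts can be sketched as set intersections (the reference and citer lists are illustrative inputs):

```python
def bc_relatedness(refs_i, refs_j):
    """Bibliographic coupling: number of cited references shared by i and j."""
    return len(set(refs_i) & set(refs_j))

def cc_relatedness(citers_i, citers_j):
    """Cocitation: number of publications that cite both i and j."""
    return len(set(citers_i) & set(citers_j))
```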

#### BM25

The first step in this approach is to identify terms in the titles and abstracts of the publications in *P*_{MEDLINE}. Here a *term* is defined as a noun phrase: a sequence *s* of words of length *n* (*n* ≥ 1) such that (a) each word in *s* is either a noun or an adjective, and (b) *s* ends with a noun. The part-of-speech tagging algorithm provided by the Apache OpenNLP 1.5.2 library is used to identify the nouns and adjectives. Plural and singular noun phrases are regarded as the same term, and shorter terms appearing in longer terms are not counted.
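The noun-phrase rule can be sketched as follows; a simplified POS tag set is assumed here, rather than the Penn Treebank tags that OpenNLP actually emits, and singular/plural normalization is omitted:

```python
def extract_terms(tagged_tokens):
    """Extract maximal noun phrases (nouns/adjectives ending in a noun)
    from a list of (word, tag) pairs with tags in {'NOUN', 'ADJ', ...}."""
    terms, run = [], []
    for word, tag in list(tagged_tokens) + [("", "END")]:  # sentinel flush
        if tag in ("NOUN", "ADJ"):
            run.append((word, tag))
        else:
            # a term must end with a noun, so drop trailing adjectives
            while run and run[-1][1] != "NOUN":
                run.pop()
            if run:
                terms.append(" ".join(w for w, _ in run))
            run = []
    return terms
```

Only maximal runs are emitted, reflecting that shorter terms appearing in longer terms are not counted.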

Let *N* be the number of publications under consideration (in our case, *N* is equal to |*P*_{MEDLINE}| = 2,941,109) and *m* the number of unique terms occurring in the *N* publications. Let *o*_{il} be the number of occurrences of term *l* in publication *i*, and *n*_{l} the number of publications in which term *l* occurs. Further, *I*(*o*_{il} > 0) = 1 if *o*_{il} > 0 and 0 otherwise. The relatedness of *i* and *j*, $r_{ij}^{\mathrm{BM25}}$, is then defined as

$$r_{ij}^{\mathrm{BM25}} = \sum_{l=1}^{m} I(o_{il} > 0)\,\mathrm{idf}_l\,\frac{(k_1 + 1)\,o_{jl}}{o_{jl} + k_1\left(1 - b + b\,\frac{d_j}{\bar{d}}\right)} \tag{3}$$

where idf_{l} is the inverse document frequency of term *l*, *d*_{j} the length of publication *j*, and $\bar{d}$ the mean length of the *N* publications. *k*_{1} and *b* are parameters with respect to term frequency saturation and publication length normalization, respectively. For the values of these, we followed Boyack et al. (2011) and Waltman et al. (2017, 2019), and thereby used 2 and 0.75 for *k*_{1} and *b*, respectively. Note that it is possible that $r_{ij}^{\mathrm{BM25}}$ ≠ $r_{ji}^{\mathrm{BM25}}$, that is, the BM25 measure is not symmetrical. It follows from Eq. 3 that $r_{ij}^{\mathrm{BM25}}$ > 0 if and only if there is at least one term occurring in both *i* and *j*.
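A minimal sketch of the BM25 relatedness computation, assuming term occurrence counts and idf values have already been computed (the exact idf definition is not reproduced here):

```python
def bm25_relatedness(occ_i, occ_j, idf, d_j, d_mean, k1=2.0, b=0.75):
    """BM25 relatedness of i with j. occ_i/occ_j map term -> occurrence
    count in i/j; idf maps term -> inverse document frequency.
    k1 = 2 and b = 0.75, as in the paper."""
    norm = k1 * (1.0 - b + b * d_j / d_mean)  # publication length normalization
    score = 0.0
    for term in occ_i:                         # terms with I(o_il > 0) = 1
        o_jl = occ_j.get(term, 0)
        if o_jl:
            score += idf[term] * (k1 + 1.0) * o_jl / (o_jl + norm)
    return score
```

Note that swapping the roles of `occ_i` and `occ_j` generally yields a different value, matching the observation that BM25 is not symmetrical.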

#### DC-BC-CC

In this approach, following Waltman et al. (2017, 2019), we define the relatedness of *i* and *j*, $r_{ij}^{\mathrm{DC\text{-}BC\text{-}CC}}$, as

$$r_{ij}^{\mathrm{DC\text{-}BC\text{-}CC}} = \alpha\, r_{ij}^{\mathrm{DC}} + r_{ij}^{\mathrm{BC}} + r_{ij}^{\mathrm{CC}} \tag{4}$$

We use the values 1 and 5 for *α*, in agreement with Waltman et al. (2017, 2019). Note, in contrast to DC and EDC, that the relatedness value of *i* and *j* in DC-BC-CC (and in DC-BM25, see below) can be positive without a direct citation between *i* and *j*.

#### DC-BM25

In this approach, we define the relatedness of *i* and *j*, $r_{ij}^{\mathrm{DC\text{-}BM25}}$, as

$$r_{ij}^{\mathrm{DC\text{-}BM25}} = \alpha\, r_{ij}^{\mathrm{DC}} + r_{ij}^{\mathrm{BM25}} \tag{5}$$

We selected the values of *α* in the following way. The average across all BM25 relatedness values greater than 0 is calculated, an average that turned out to be equal to 50. By setting *α* to 50, the DC values are put on the same scale as the BM25 relatedness values, in an average sense. By setting *α* to 25 (100), less (more) emphasis is put on DC. We use all three of these *α* values in our analysis.
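Assuming the hybrid measure adds α-weighted DC values to BM25 values, the α estimation described above can be sketched as:

```python
def estimate_alpha(bm25_values):
    """Mean of the positive BM25 relatedness values (the paper reports 50)."""
    positive = [v for v in bm25_values if v > 0]
    return sum(positive) / len(positive)

def dc_bm25_relatedness(r_dc, r_bm25, alpha):
    """Hybrid relatedness: alpha-weighted DC plus BM25."""
    return alpha * r_dc + r_bm25
```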

When calculating $rijX$, *X* ∈ {BC, CC, BM25, DC-BC-CC, DC-BM25}, we only consider the *k*-nearest neighbors to *i* (i.e., the *k* publications with the highest relatedness values with *i*). If *j* is not among the *k* publications with the highest relatedness values with *i*, $rijX$ = 0. Here, *k* is set to 20. For a sensitivity analysis, we refer the reader to Waltman et al. (2019). We apply the *k*-nearest neighbors technique for efficiency reasons. However, we do not apply this technique in DC or EDC, because computer memory requirements are relatively modest for these two approaches.
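The *k*-nearest-neighbors sparsification can be sketched as follows (the dict-of-dicts sparse representation is illustrative):

```python
def knn_sparsify(relatedness, k=20):
    """Keep, for each publication i, only its k highest relatedness values.
    relatedness: dict i -> dict j -> r_ij; ties are broken arbitrarily."""
    sparsified = {}
    for i, neighbors in relatedness.items():
        top = sorted(neighbors.items(), key=lambda item: item[1], reverse=True)[:k]
        sparsified[i] = dict(top)
    return sparsified
```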

In contrast to DC, we do not enhance EDC by BC and CC. The reason for this is that BC and CC are both indirectly taken into account in the EDC approach due to the requirement for inclusion among the focal publications. To see this, consider a publication *p* that meets the requirement to be added to the extended set of publications (i.e., *p* has a direct citation relation with at least two of the focal publications). Now, because, in our case, *p* is published before year 2013 (the start publication year in our study), *p* is cited by at least two focal publications, and thereby *p* gives rise to a bibliographic coupling relation between at least two focal publications. If *p* had been published after year 2017 (which, however, is not the case in the study), *p* would cite at least two focal publications, and thereby give rise to a cocitation relation between at least two focal publications.

### 2.2. Normalization of the relatedness measures and clustering of publications

For all seven approaches, the corresponding relatedness measures are normalized. The *normalized relatedness* of publication *i* with publication *j* is the relatedness of *i* with *j*, divided by the total relatedness of *i* with all other publications that are considered. Without normalization, clustering solutions obtained using different relatedness measures, but associated with the same value of the resolution parameter of the clustering (see below in this section), might be far from satisfying the requirement that, with regard to accuracy, the compared solutions should have the same granularity, where the *granularity* of a solution is defined as the number of publications divided by the sum of the squared cluster sizes (Waltman et al., 2017, 2019). With the indicated normalization, the granularity requirement can be assumed to be approximately satisfied by the solutions. To further deal with the granularity issue, granularity–accuracy plots (GA plots) are used in the study (Waltman et al., 2017, 2019). GA plots are described in the section on evaluation of approach performance below.
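The granularity definition above can be expressed directly:

```python
def granularity(cluster_sizes):
    """Granularity of a clustering solution: the number of publications
    divided by the sum of the squared cluster sizes."""
    n = sum(cluster_sizes)
    return n / sum(size * size for size in cluster_sizes)
```

Many small clusters give a granularity close to 1, whereas one large cluster drives it toward 1/*n*.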

In this study, we use the Leiden algorithm (Traag, Waltman, & Van Eck, 2018, 2019) to generate a series of clustering solutions for each of the relatedness measures. The algorithm maximizes the Constant Potts Model quality function (Traag, Van Dooren, & Nesterov, 2011; Waltman & Van Eck, 2012). In EDC, however, an adjusted quality function is used in order to accommodate the nonfocal publications *N* + 1, …, *N*^{EXT} (Waltman et al., 2019). After maximization of the adjusted quality function, the cluster assignments of the nonfocal publications are disregarded, because we are only interested in the cluster assignments of the focal publications (i.e., the publications in *P*_{MEDLINE}). Using 11 values of the resolution parameter γ (0.000001, 0.000002, 0.000005, 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002), we obtain 11 clustering solutions for each relatedness measure. Compared to our earlier study (Ahlgren et al., 2019), we exclude the clustering solutions for the two largest resolution values used in that study (0.005 and 0.01). These clustering solutions have around 300,000 and 500,000 clusters, respectively, and most of the clusters consist of fewer than 10 publications. From a practical point of view, the utility of such fine-grained clustering solutions can be questioned, and we therefore exclude them.

The normalization of the relatedness measures transforms these measures to nonsymmetrical counterparts. However, the clustering methodology we use requires that the relatedness values are symmetrical. We solve this issue in the following way. Let $r\u02c6ijX$ denote the relatedness of *i* with *j* with respect to approach *X* ∈ {DC, EDC, BC, CC, BM25, DC-BC-CC, DC-BM25} after normalization of $rijX$. The relatedness value for *i* and *j* given as input to the clustering algorithm is $r\u02c6ijX$ + $r\u02c6jiX$ (i.e., the sum of the two normalized relatedness values). Clearly, then, the relatedness values are made symmetrical before being given as input to the clustering algorithm.
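The normalization and symmetrization steps can be sketched together, using an illustrative dict-of-dicts sparse representation:

```python
def normalize_and_symmetrize(relatedness):
    """Normalize each publication's relatedness values to sum to 1, then
    return the symmetric values rhat_ij + rhat_ji, keyed by unordered
    pair (i, j). relatedness: dict i -> dict j -> r_ij."""
    rhat = {}
    for i, neighbors in relatedness.items():
        total = sum(neighbors.values())
        rhat[i] = {j: v / total for j, v in neighbors.items()} if total else {}
    symmetric = {}
    for i, neighbors in rhat.items():
        for j, v in neighbors.items():
            key = (i, j) if i <= j else (j, i)
            symmetric[key] = symmetric.get(key, 0.0) + v
    return symmetric
```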

### 2.3. Evaluation of approach performance

For the evaluation of the performance of the seven approaches, an external and independent subject classification scheme, MeSH, is used. MeSH descriptors and subheadings are used to index publications in PubMed. MeSH contains more than 28,000 descriptors that are arranged hierarchically by subject categories, with more-specific descriptors arranged beneath broader descriptors (U.S. National Library of Medicine, 2019a). MeSH descriptors can be designated as major, indicating that they correspond to the major topics of the publication, whereas nonmajor descriptors are added to reflect additional topics substantively discussed within the publication. Further, approximately 80 subheadings (or qualifiers) can be used by the indexer to qualify a descriptor. Subheadings are thus not standalone terms and are only used in conjunction with a descriptor to describe specific aspects of the descriptor that are pertinent to the publication. For example, the descriptor “Ectopia Lentis” can be combined with the subheading “surgery” to specify that the publication deals with surgical treatment of the displacement of the eye’s crystalline lens. Descriptors will usually be indexed with one or more subheadings.

The assignment of MeSH descriptors and subheadings to publications is based on a manual reading of these publications by human indexers (U.S. National Library of Medicine, 2019b). Relatedness measurement based on MeSH, described below, thus differs substantially from the seven evaluated relatedness approaches, as the latter are based on directly observable features in the publications (words and references), whereas assigned MeSH descriptors and subheadings are the result of a human intellectual indexing process, whose aim is to produce standardized subject descriptions.

Let *freq*(*desc*_{i}) denote the frequency of descriptor *i* (here calculated over all MEDLINE publications published within the period 2013–2017), and let *descendants*(*desc*_{i}) be the set of descriptors that are children, direct or indirect, of descriptor *i* in the MeSH tree.

Each publication is represented by a vector of length *s* + (*s* × *m*), where *s* and *m* are the total number of unique MeSH descriptors and the total number of unique subheadings in the data set, respectively. The vector position for the *i*th descriptor is given by (*m* + 1) × *i* − *m*, and the corresponding *weight for publication l*, ω_{i}(*l*), is defined on the basis of *freq*(*desc*_{i}) and *descendants*(*desc*_{i}). The vector position for the *j*th subheading connected to the *i*th descriptor is given by (*m* + 1) × *i* − *m* + *j*, and the corresponding *weight for publication l*, ϕ_{ji}(*l*), is defined analogously.

Note that many descriptor–subheading pairs are nonsensical and will never exist in practice, and the subheading in such a pair will thus always take on the value 0 in the vectors.

We estimate the relatedness between the publications by the cosine similarity (Salton & McGill, 1983) between their corresponding vectors as defined above. As in the case of calculating relatedness in BC, CC, BM25, DC-BC-CC, and DC-BM25, and for the same reason, we apply the *k*-nearest neighbors technique. As in these five approaches, *k* is set to 20. We then normalize the cosine similarities in the same way as we normalize the relatedness measures of all seven approaches, resulting in $r\u02c6ijMeSH$. Finally, the publications in *P*_{MEDLINE} are clustered based on the normalized cosine similarities using the same clustering methodology, and the same set of values of the resolution parameter, as for the seven approaches.
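A sketch of the cosine similarity between two sparse weight vectors (positions mapped to weights, as in the vector representation above):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between sparse vectors given as dicts
    position -> weight. Returns 0 for a zero vector."""
    dot = sum(w * v.get(pos, 0.0) for pos, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```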

The accuracy of the *l*th (1 ≤ *l* ≤ 11) clustering solution for *X* ∈ {DC, EDC, BC, CC, BM25, DC-BC-CC, DC-BM25, MeSH}, where the accuracy is based on MeSH cosine similarity, symbolically $A_{X_l \mid \mathrm{MeSH}}$, is defined as follows (Waltman et al., 2017, 2019):

$$A_{X_l \mid \mathrm{MeSH}} = \frac{1}{|P_{\mathrm{MEDLINE}}|} \sum_{i \neq j} I\!\left(c_i^{X_l} = c_j^{X_l}\right) \hat{r}_{ij}^{\mathrm{MeSH}}$$

where *i*, *j* ∈ *P*_{MEDLINE}, $c_i^{X_l}$ ($c_j^{X_l}$) is a positive integer denoting the cluster to which publication *i* (*j*) belongs with respect to the *l*th clustering solution for *X*, *I*($c_i^{X_l}$ = $c_j^{X_l}$) is 1 if its condition is true, otherwise 0, and $\hat{r}_{ij}^{\mathrm{MeSH}}$ the normalized MeSH cosine similarity of *i* with *j*. Recall that DC-BC-CC (DC-BM25) has two (three) variants, α ∈ {1, 5} (α ∈ {25, 50, 100}), and that we thereby, in total, work with 11 relatedness measures. Note that we want to compare, with respect to clustering solution accuracy, the 10 measures distinct from MeSH. However, we also include clustering solutions based on the MeSH cosine similarity in a part of the evaluation exercise (cf. Section 3.1). The accuracy results obtained for MeSH give an upper bound for the results that can be obtained when the relatedness measures of the seven approaches are used to cluster the publications and accuracy is based on MeSH cosine similarity. We remind the reader that the value of the resolution parameter γ is held constant across the seven approaches and MeSH regarding the *l*th clustering solution.
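Under our reading of the accuracy definition, a clustering solution's accuracy is the sum of normalized within-cluster MeSH relatedness over ordered pairs of distinct publications, divided by the number of publications; the per-publication scaling is an assumption of this sketch:

```python
def clustering_accuracy(cluster_of, rhat_mesh, n):
    """Accuracy of a clustering solution against normalized MeSH
    relatedness. cluster_of: dict publication -> cluster id;
    rhat_mesh: dict i -> dict j -> normalized MeSH cosine similarity;
    n: number of publications (assumed scaling)."""
    total = 0.0
    for i, neighbors in rhat_mesh.items():
        for j, v in neighbors.items():
            if i != j and cluster_of[i] == cluster_of[j]:
                total += v
    return total / n
```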

We visualize the evaluation results by using GA plots. The use of such plots is a way to counteract the difficulty that the requirement of equal granularity across the compared clustering solutions is only approximately satisfied. In a GA plot, the horizontal axis represents granularity (as defined above), whereas the vertical axis represents accuracy. For a given approach, such as DC, a point in the plot represents the accuracy and granularity of a clustering solution obtained using a certain value of the resolution parameter γ. Further, a line connects the points of the approach, where accuracy values for granularity values between points are estimated by Piecewise Cubic Hermite Interpolation. Based on the interpolations, the performance of the approaches can be compared at a given granularity level. The interpolation technique is described in the Appendix.

## 3. RESULTS

In this section, we first present performance results for the seven tested approaches using GA plots. We then deal with relative overall approach performance, where a summary value based on interpolated accuracy values is obtained for each of the 10 relatedness measures.

### 3.1. Performance results: GA plots

We present three figures containing GA plots. The first plot contains curves for DC and the other citation-based approaches, the second for DC and the text-based approaches, and the last for DC and the best performing approaches. As should be clear from Section 2, MeSH is consistently used as the evaluation criterion. Note that all three plots also contain a curve for MeSH, which represents an upper bound for the performance of the seven approaches. One might ask what different granularity levels mean in terms of numbers of clusters. When the granularity is around 0.0001, a clustering solution typically has 500 significant clusters (defined as clusters with 10 or more publications). When the granularity is around 0.001 (0.01), a clustering solution typically has 5,000 (50,000) significant clusters.

The GA plot of Figure 1 visualizes the accuracy results of enhancing DC by indirect citations. The performance of EDC and the combination of DC with BC and CC (α = 1, 5), as well as the performance of DC, BC, and CC, is shown. CC exhibits the worst performance among the citation-based approaches. EDC has the best performance, followed by DC-BC-CC (α = 5). BC performs slightly worse than DC-BC-CC (α = 1), and DC is outperformed by all three approaches in which DC is enhanced by indirect citation relations.

In Figure 2, a GA plot that shows the results of enhancing DC by BM25, and thereby by textual relations, is given (α = 25, 50, 100). The plot also shows the performance of DC and BM25. BM25 performs better than DC, but is outperformed by all three DC-BM25 variants. Of these, those with α equal to 50 and 100 perform about equally well, and better than the variant that puts less emphasis on DC (α = 25).

Our final GA plot (Figure 3) shows the performance of DC and the best performing approaches, namely EDC, DC-BC-CC (α = 5), and DC-BM25 (α = 100). Extended direct citations (i.e., EDC) and enhancing DC by BM25 yield the best performance. DC-BC-CC, in which DC is enhanced by the combination of BC and CC, performs worse than DC-BM25, whereas DC, as we already know (Figures 1 and 2), has the worst performance. Although the lines of EDC and DC-BM25 largely overlap in Figure 3, EDC seems to perform slightly better than DC-BM25 for clustering solutions with a higher granularity (i.e., solutions with a larger number of clusters). This difference is further studied in the next subsection.

### 3.2. Performance results: Relative overall clustering solution accuracy

In this subsection, we complement the picture of relative performance given in the preceding subsection. We do this by introducing a methodology that yields one numerical value per relatedness measure. This value summarizes the relative clustering solution accuracy of the corresponding measure, and thereby condenses the information in the GA plots into a form that is easier to grasp.

Let *p*_{j}(*x*) denote the interpolation function for the *j*th (1 ≤ *j* ≤ 10) relatedness measure, where *x* is a granularity value and Piecewise Cubic Hermite Interpolation (see Appendix) is used. We then define the average interpolated accuracy value with respect to *x*, *p*_{Avg}(*x*), as

$$p_{\mathrm{Avg}}(x) = \frac{1}{m} \sum_{j=1}^{m} p_j(x)$$

where *m*, in this context, is equal to 10.

Let *a* and *b* be the minimum and maximum values, respectively, such that for each relatedness measure *j*, *p*_{j}(*a*) and *p*_{j}(*b*) are defined (extrapolation is not used). Let *s*^{l} = (*a*, …, *b*) be a sequence of *l* evenly spaced values between *a* and *b*, and let $s_i^l$ denote the *i*th value in *s*^{l}. Then a reasonable summary value for the relative clustering solution accuracy of relatedness measure *j* is defined as

$$acc_j = \frac{1}{l} \sum_{i=1}^{l} \frac{p_j(s_i^l)}{p_{\mathrm{Avg}}(s_i^l)} \tag{14}$$

For a given relatedness measure *j*, and for each value $s_i^l$ in *s*^{l}, the interpolated accuracy value with respect to $s_i^l$ is divided by the average interpolated accuracy value with respect to $s_i^l$ across the relatedness measures. The mean across the *l* ratios is then obtained, and constitutes the summary value for the relative clustering solution accuracy of relatedness measure *j*. Note that *acc*_{j} = 1 corresponds to average performance. In the study, *l* was set to 500.
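The summary value of Eq. 14 can be sketched as follows; the interpolation functions are passed in as callables (in practice, piecewise cubic Hermite interpolants):

```python
def summary_accuracy(interpolators, a, b, l=500):
    """Eq. 14 sketch: for each measure j, the mean over l evenly spaced
    granularity values in [a, b] of p_j(x) divided by the average
    interpolated accuracy across all measures at x."""
    m = len(interpolators)
    xs = [a + (b - a) * i / (l - 1) for i in range(l)]
    averages = [sum(p(x) for p in interpolators) / m for x in xs]
    return [sum(p(x) / avg for x, avg in zip(xs, averages)) / l
            for p in interpolators]
```

A value of 1 corresponds to average performance, matching the interpretation in the text.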

The bar chart of Figure 4 visualizes the relative overall clustering solution accuracy of the 10 relatedness measures. The measures, corresponding to the bars, are ordered descending from left to right according to their accuracy values (Eq. 14). Further, the color of a bar indicates measure type. The red bar corresponds to direct citations (DC), the two blue bars to indirect citations (BC and CC), the three green bars to DC enhanced by indirect citations (the two DC-BC-CC variants and EDC), the purple bar to textual relations (BM25), and the three orange bars to DC enhanced by textual relations (the three variants of DC-BM25). The horizontal dotted line indicates average performance.

EDC has the highest overall performance, an outcome that provides additional information compared to the GA plot of Figure 3. Similarly, from the point of view of overall performance, DC-BM25 (α = 100) performs better than DC-BM25 (α = 50) (cf. the GA plot of Figure 2). The overall performance order of the two DC-BC-CC variants and BC agrees with the GA plot of Figure 1, and the overall performance order of DC, CC, and BM25 agrees with the GA plots of Figures 1 and 2. In general, then, our conclusions based on the relative clustering solution accuracy values are in line with the conclusions that can be drawn based on the GA plots.

## 4. DISCUSSION AND CONCLUSIONS

We have analyzed the effects of enhancing direct citations, with respect to publication–publication relatedness measurement, by indirect citation relations and text relations on clustering solution accuracy. We used an approach based on MeSH, one of the most sophisticated publication-level classification schemes available, as the independent evaluation criterion. Seven approaches were investigated, and the results show that using extended direct citations (EDC), as well as enhancing direct citations (DC) with bibliographic coupling (BC) and cocitation (CC) or text relations (BM25), gives rise to substantial performance gains relative to DC. The best performance was obtained by EDC, followed by DC-BM25 and DC-BC-CC. Thus, in our analysis, extended direct citations give the best performance and, interestingly, enhancing direct citations by text relations outperforms enhancing them by bibliographic coupling and cocitation.

The poor performance of CC has been observed in earlier research (Klavans & Boyack, 2017; Waltman et al., 2017, 2019) and was expected. Clearly, a publication that has not received any citations is not cocited with another publication, and can therefore not be adequately clustered. In the study by Klavans and Boyack (2017), in which a more expansive EDC variant was used compared to our variant, EDC yielded more accurate clusters than BC. In this respect, our study reinforces the results of Klavans and Boyack (2017).

Waltman et al. (2017, 2019) compared DC, EDC, BC, CC, and DC-BC-CC (α = 1, 5), using BM25 as the evaluation criterion and a considerably smaller publication set than the publication set of our analysis. Our results for these citation-based approaches demonstrate the same pattern as the results of these authors. This supports the robustness of the results for the five citation-based approaches, because the two studies used different publication sets and different evaluation criteria.

In our study, BM25 is outperformed by EDC. Boyack and Klavans (2018), though, concluded that clusters obtained on the basis of the text-only relatedness measures used in their study are as accurate as those obtained on the basis of EDC. However, their study used a different evaluation criterion than ours.

Chen et al. (2017) used the TF-IDF term weighting approach combined with the cosine similarity measure in order to weight direct citations by textual similarities. We tested the same approach (without taking term position information into account), as well as an approach in which BM25 is used for the weighting of direct citations. These two approaches, called *DC-TF-IDF* and *DC-BM25 (weighted links)*, were outperformed, though, by DC-BM25, DC-BC-CC, and BC. Note that, for DC-TF-IDF and DC-BM25 (weighted links), and in contrast to DC-BM25, a necessary (but not sufficient) condition for obtaining a positive relatedness value for two publications *i* and *j* is that there is a direct citation from *i* to *j*, or conversely.

A limitation of our study is that the MeSH approach is arguably not fully independent of relatedness measures based on text in abstracts and titles of publications, because the indexers who assign MeSH terms to publications partially rely on the title and full text of the publications. Therefore, the MeSH approach might not be fully independent of the BM25 and DC-BM25 approaches. However, MeSH constitutes a controlled vocabulary, whereas BM25 makes use of an uncontrolled vocabulary, the source of which is the authors of the publications. In view of this, we believe that the MeSH approach is sufficiently different from approaches that make use of terms in abstracts and titles.

We also obtained results for an enhancement of EDC by BM25 (EDC-BM25), which intuitively is a reasonable combination. These results showed that EDC-BM25 performed almost as well as the best performing approach (EDC). However, for efficiency reasons, we had to use a methodology for EDC-BM25 that deviates from the one used for EDC: Due to demanding computer memory requirements, we needed to apply the *k*-nearest neighbors technique in the case of EDC-BM25, which was not needed in the case of EDC. We suspect that this is the reason behind the somewhat counterintuitive result that EDC-BM25 did not outperform EDC.

Finally, two clustering solutions with similar accuracy do not necessarily have similar groupings of publications into clusters. In future studies, we therefore aim to compare the clustering solutions further, in order to deepen the insight into how solutions based on different relatedness measures diverge.

## AUTHOR CONTRIBUTIONS

Per Ahlgren: Conceptualization, Methodology, Formal analysis, Writing—original draft, Writing—review & editing. Yunwei Chen: Conceptualization, Methodology, Writing—original draft, Writing—review & editing. Cristian Colliander: Conceptualization, Methodology, Software, Formal analysis, Writing—original draft, Writing—review & editing, Visualization. Nees Jan van Eck: Conceptualization, Methodology, Software, Formal analysis, Writing—original draft, Writing—review & editing.

## COMPETING INTERESTS

The authors have no competing interests.

## FUNDING INFORMATION

The article processing charge (APC) is covered by the National Key Research and Development Program of China (Grant No. 2017YFB1402400).

## DATA AVAILABILITY

The data used in this paper were partly obtained from the WoS database produced by Clarivate Analytics. Due to license restrictions, the data cannot be made openly available. To obtain WoS data, please contact Clarivate Analytics (https://clarivate.com/products/web-of-science).

## ACKNOWLEDGMENTS

We would like to thank two anonymous reviewers for their valuable comments on an earlier version of this paper.

## Notes

^{1} A group of MeSH descriptors that are routinely added to most articles, “check tags,” are concepts of potential interest, regardless of the general subject content of the article (examples are “Human” and “Adult”). We do not include such check tags in any calculations.

^{2} We do not consider our relatedness evaluator, MeSH, in this part of the evaluation exercise.

## REFERENCES

### APPENDIX: PIECEWISE CUBIC HERMITE INTERPOLATION

In our context, we want an interpolation function that is smooth in the following sense: The function belongs to the class *C*^{1} (i.e., it is differentiable and its derivative is continuous). Moreover, we want the interpolation function to be shape preserving. This is connected to the fact that monotonicity must be guaranteed, because, for any relatedness measure, an increase in granularity will always cause a decrease in accuracy (Waltman et al., 2017, 2019). Linear interpolation will not do (not smooth), a single high-order polynomial will not do (it tends to oscillate, so monotonicity is not guaranteed), and “standard” spline interpolation might not do (monotonicity not guaranteed). In this study, we use Piecewise Cubic Hermite Interpolation, an interpolation technique that satisfies the first condition (membership in the class *C*^{1}) indicated above. The monotonicity condition is satisfied by our use of Eq. 17 below. We now describe the interpolation technique in question.

Given data points $(x_i, y_i)$ ($i$ = 1, …, $n$), where $x_i < x_{i+1}$ ($i$ = 1, …, $n$ − 1), a piecewise interpolation function $p(x) \in C^1[x_1, x_n]$ is defined such that for $i$ = 1, …, $n$

$$p(x_i) = y_i, \qquad p'(x_i) = d_i \qquad \text{(15)}$$

where the $d_i$ are the approximations to the derivatives of $f$ at $x_i$, and where $f$ is the underlying, unknown function that we want to approximate with interpolation. Now, let $\Delta_i = \frac{y_{i+1} - y_i}{x_{i+1} - x_i}$ and $h_i = x_{i+1} - x_i$. Then, for each $i$ = 1, …, $n$ − 1 and with $s = x - x_i$,

$$p(x) = \frac{h_i^3 - 3 h_i s^2 + 2 s^3}{h_i^3} y_i + \frac{3 h_i s^2 - 2 s^3}{h_i^3} y_{i+1} + \frac{s (s - h_i)^2}{h_i^2} d_i + \frac{s^2 (s - h_i)}{h_i^2} d_{i+1} \qquad \text{(16)}$$

defines the cubic polynomial $p(x)$ on the interval $[x_i, x_{i+1}]$ (e.g., Fritsch & Carlson, 1980) and is the function used in this study.

The derivative approximations $d_i$ must be chosen such that $p(x)$ is monotonic in each interval. One straightforward method, which guarantees monotonicity and which we use in this study, is given by Fritsch and Butland (1984). For $i$ = 2, …, $n$ − 1,

$$d_i = \left( \frac{w_i}{\Delta_{i-1}} + \frac{1 - w_i}{\Delta_i} \right)^{-1} \qquad \text{(17)}$$

where $w_i = \frac{1}{3} \left( 1 + \frac{h_i}{h_{i-1} + h_i} \right)$. Eq. 17 thus gives the weighted harmonic mean between $\Delta_{i-1}$ and $\Delta_i$, so that the relative spacing between the data points is taken into account. Eq. 17 is only valid (for preserving monotonicity) if $\Delta_{i-1} \Delta_i > 0$, that is, if $\Delta_{i-1}$ and $\Delta_i$ have the same sign and are distinct from zero. If this is not the case, one sets $d_i$ to zero. This should never happen in our context, however.
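To make the appendix concrete, the following Python sketch estimates the derivatives via the weighted harmonic mean of Eq. 17 and evaluates the resulting piecewise cubic Hermite interpolant. It is an illustration, not the authors' implementation; in particular, the one-sided treatment of the endpoint derivatives $d_1$ and $d_n$ is our own simplification, as the text above only specifies the interior points.

```python
import numpy as np

def fritsch_butland_derivatives(x, y):
    """Derivative approximations d_i at interior points via the
    Fritsch-Butland weighted harmonic mean (Eq. 17). Endpoints use
    one-sided secant slopes (a simplifying assumption)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    h = np.diff(x)                       # h_i = x_{i+1} - x_i
    delta = np.diff(y) / h               # secant slopes Delta_i
    d = np.empty_like(y)
    d[0], d[-1] = delta[0], delta[-1]    # simple endpoint choice
    for i in range(1, len(x) - 1):
        dl, dr = delta[i - 1], delta[i]  # Delta_{i-1}, Delta_i
        if dl * dr > 0:                  # same sign, both nonzero
            w = (1 + h[i] / (h[i - 1] + h[i])) / 3.0
            d[i] = 1.0 / (w / dl + (1 - w) / dr)  # weighted harmonic mean
        else:
            d[i] = 0.0                   # preserves monotonicity otherwise
    return d

def pchip_eval(x, y, d, xq):
    """Evaluate the piecewise cubic Hermite interpolant p at points xq,
    using the standard Hermite basis on each interval [x_i, x_{i+1}]."""
    x, y, d = (np.asarray(a, float) for a in (x, y, d))
    xq = np.atleast_1d(np.asarray(xq, float))
    out = np.empty_like(xq)
    for k, xv in enumerate(xq):
        i = int(np.clip(np.searchsorted(x, xv) - 1, 0, len(x) - 2))
        h = x[i + 1] - x[i]
        t = (xv - x[i]) / h              # local coordinate in [0, 1]
        h00 = (1 + 2 * t) * (1 - t) ** 2 # Hermite basis functions
        h10 = t * (1 - t) ** 2
        h01 = t ** 2 * (3 - 2 * t)
        h11 = t ** 2 * (t - 1)
        out[k] = h00 * y[i] + h * h10 * d[i] + h01 * y[i + 1] + h * h11 * d[i + 1]
    return out
```

On monotonically increasing data, the interpolant reproduces the data points exactly and remains monotone between them, which is the property required for the granularity–accuracy curves in this study.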

## Author notes

Handling Editor: Vincent Larivière