A principled methodology for comparing relatedness measures for clustering publications

There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an important consistency property. The empirical analyses that we present are based on publications in the fields of cell biology, condensed matter physics, and economics. Using the BM25 text-based relatedness measure as evaluation criterion, we find that bibliographic coupling relations yield more accurate clustering solutions than direct citation relations and co-citation relations. The so-called extended direct citation approach performs similarly to or slightly better than bibliographic coupling in terms of the accuracy of the resulting clustering solutions. The other way around, using a citation-based relatedness measure as evaluation criterion, BM25 turns out to yield more accurate clustering solutions than other text-based relatedness measures.


Introduction
Clustering of scientific publications is an important problem in the field of bibliometrics. Bibliometricians have employed many different clustering techniques (e.g., Gläser, Scharnhorst, & Glänzel, 2017; Šubelj, Van Eck, & Waltman, 2016). In addition, they have used various relatedness measures to cluster publications. These relatedness measures are typically based on either citation relations (e.g., direct citation relations, bibliographic coupling relations, or co-citation relations) or textual similarity, or sometimes a combination of the two.
Which relatedness measure yields the most accurate clustering of publications?
Two perspectives can be taken on this question. One perspective is that there is no absolute notion of accuracy (e.g., Gläser et al., 2017). Following this perspective, each relatedness measure yields clustering solutions that are accurate in their own right, and it is not meaningful to ask whether one clustering solution is more accurate than another one. For instance, different citation-based and text-based relatedness measures each emphasize different aspects of the way in which publications relate to each other, and the corresponding clustering solutions each provide a legitimate viewpoint on the organization of the scientific literature. The other perspective is that for some purposes it is useful, and perhaps even necessary, to assume the existence of an absolute notion of accuracy (e.g., Klavans & Boyack, 2017). When this perspective is taken, it is possible, at least in principle, to say that some relatedness measures yield more accurate clustering solutions than others.
We believe that both perspectives are useful. From a purely conceptual point of view, the first perspective is probably the more satisfactory one. However, from a more applied point of view, the second perspective is highly important. In many practical applications, users expect to be provided with a single clustering of publications. Users typically have some intuitive idea of accuracy and, based on this idea of accuracy, they expect the clustering provided to them to be as accurate as possible. In this paper, we take this applied viewpoint and we therefore focus on the second perspective.
Identifying the relatedness measure that yields the most accurate clustering of publications is challenging because of the lack of a ground truth. There is no perfect classification of publications that can be used to evaluate the accuracy of different clustering solutions. For instance, suppose we study the degree to which a clustering solution resembles an existing classification of publications (e.g., Haunschild, Schier, Marx, & Bornmann, 2018). The difficulty then is that it is not clear how discrepancies between the clustering solution and the existing classification should be interpreted.
Such discrepancies could indicate shortcomings of the clustering solution, but they could equally well reflect problems of the existing classification.
As an alternative, the accuracy of clustering solutions can be evaluated by domain experts who assess the quality of different clustering solutions in a specific scientific domain (e.g., Šubelj et al., 2016). This approach has the difficulty that it is hard to find a sufficiently large number of experts who are willing to spend a considerable amount of time making a detailed assessment of the quality of different clustering solutions. Moreover, the knowledge of experts will often be restricted to relatively small domains, and it will be unclear to what extent the conclusions drawn by experts generalize to other domains.
In this paper, we take a large-scale data-driven approach to compare different relatedness measures based on which publications can be clustered. The basic idea is to cluster publications based on a number of different relatedness measures and to use another more or less independent relatedness measure as a benchmark for evaluating the accuracy of the clustering solutions. This approach has already been used extensively in a series of papers by Kevin Boyack, Dick Klavans, and colleagues.
They compared different citation-based relatedness measures (Boyack & Klavans, 2010; Klavans & Boyack, 2017), including relatedness measures that take advantage of full-text data (Boyack, Small, & Klavans, 2013), as well as different text-based relatedness measures (Boyack et al., 2011). To evaluate the accuracy of clustering solutions, they used grant data, textual similarity (Boyack & Klavans, 2010; Boyack et al., 2011, 2013), and more recently also the reference lists of 'authoritative' publications, defined as publications with at least 100 references (Klavans & Boyack, 2017).
Our aim in this paper is to introduce a principled methodology for performing analyses similar to the ones mentioned above. We restrict ourselves to the use of one specific clustering technique, namely the technique introduced in the bibliometric literature by Waltman and Van Eck (2012), but we allow the use of any measure of the relatedness of publications. For two relatedness measures $A$ and $B$, our proposed methodology offers a principled way to evaluate the accuracy of clustering solutions obtained using the two measures, where a third relatedness measure $C$ is used as the evaluation criterion. Unlike approaches taken in earlier papers, our methodology has an important consistency property.
This paper is organized as follows. In Section 2, we introduce our methodology for evaluating the accuracy of clustering solutions obtained using different relatedness measures. In Section 3, we discuss the relatedness measures that we consider in our analyses. We report the results of the analyses in Section 4. We present comparisons of different citation-based and text-based relatedness measures that can be used to cluster publications. Our analyses are based on publications in the fields of cell biology, condensed matter physics, and economics. We summarize our conclusions in Section 5.

Methodology
To introduce our methodology for evaluating the accuracy of clustering solutions obtained using different relatedness measures, we first discuss the quality function that we use to cluster publications. We then explain how we evaluate the accuracy of a clustering solution, and we analyze the consistency of our evaluation framework. Finally, we discuss the importance of using an independent evaluation criterion.
Publications are assigned to clusters by maximizing a quality function. We focus on the quality function of Waltman and Van Eck (2012). For a set of $N$ publications, let $r_{ij}^X$ denote the relatedness of publications $i$ and $j$ based on relatedness measure $X$, and let $x_i^X$ denote the cluster to which publication $i$ is assigned. This quality function is given by
$$Q = \sum_{i} \sum_{j} I(x_i^X = x_j^X)\,(r_{ij}^X - \gamma), \tag{1}$$
where $I(x_i^X = x_j^X)$ equals 1 if $x_i^X = x_j^X$ and 0 otherwise and where $\gamma \geq 0$ denotes a so-called resolution parameter. The higher the value of this parameter, the larger the number of clusters that will be obtained. Hence, the resolution parameter $\gamma$ determines the granularity of the clustering. An appropriate value for this parameter can be chosen based on the specific purpose for which a clustering of publications is intended to be used. For some purposes it may be desirable to have a highly granular clustering, while for other purposes a less granular clustering may be preferable. Sjögårde and Ahlgren (2018, 2019) proposed approaches for choosing the value of the resolution parameter $\gamma$ that allow clusters to be interpreted as research topics or specialties.
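The quality function in (1) can also be written as
$$Q = \sum_{i} \sum_{j} I(x_i^X = x_j^X)\, r_{ij}^X - \gamma \sum_{k} (s_k^X)^2, \tag{2}$$
where $s_k^X$ denotes the number of publications assigned to cluster $k$, that is,
$$s_k^X = \sum_{i} I(x_i^X = k). \tag{3}$$
We also refer to $s_k^X$ as the size of cluster $k$.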
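The equivalence of the two forms of the quality function can be checked numerically. The following Python sketch is only an illustration: the function names and the toy relatedness matrix are hypothetical, and it assumes that the sums run over all ordered pairs of publications, including a publication paired with itself (with zero self-relatedness), so that the number of same-cluster pairs equals the sum of the squared cluster sizes.

```python
from collections import Counter

def quality_pairwise(r, x, gamma):
    """Form (1): sum of (r[i][j] - gamma) over all ordered pairs in the same cluster."""
    n = len(x)
    return sum(r[i][j] - gamma
               for i in range(n) for j in range(n) if x[i] == x[j])

def quality_by_size(r, x, gamma):
    """Form (2): within-cluster relatedness minus gamma times the sum of squared cluster sizes."""
    n = len(x)
    within = sum(r[i][j] for i in range(n) for j in range(n) if x[i] == x[j])
    sizes = Counter(x)  # cluster label -> number of publications
    return within - gamma * sum(s * s for s in sizes.values())

# Hypothetical toy data: 4 publications, zero relatedness of a publication with itself.
r = [[0.0, 0.6, 0.4, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.3, 0.3, 0.0, 0.4],
     [0.0, 0.0, 1.0, 0.0]]
x = [0, 0, 1, 1]      # cluster assignment of each publication
gamma = 0.1           # resolution parameter

q1 = quality_pairwise(r, x, gamma)
q2 = quality_by_size(r, x, gamma)
print(q1, q2)         # the two forms agree up to floating-point rounding
```

In this toy case the within-cluster relatedness is 2.5 and the sum of the squared cluster sizes is 8, so both forms evaluate to 2.5 - 0.1 × 8 = 1.7.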
In the network science literature, the above quality function was proposed by Traag, Van Dooren, and Nesterov (2011), who referred to it as the constant Potts model. The quality function is closely related to the well-known modularity function introduced by Newman and Girvan (2004) and Newman (2004). However, as shown by Traag et al. (2011), it has the important advantage that it does not suffer from the so-called resolution limit problem (Fortunato & Barthélemy, 2007). Waltman and Van Eck (2012) introduced the above quality function in the bibliometric literature. In the field of bibliometrics, the quality function has been used by, among others, Boyack and Klavans (2014), Klavans and Boyack (2017), Perianes-Rodriguez and Ruiz-Castillo (2017), Ruiz-Castillo and Waltman (2015), Sjögårde and Ahlgren (2018, 2019), Small, Boyack, and Klavans (2014), and Van Eck and Waltman (2014).

Evaluating the accuracy of a clustering solution
Suppose that we have three relatedness measures $A$, $B$, and $C$, and suppose also that we have used relatedness measures $A$ and $B$ to cluster a set of publications.
Furthermore, suppose that we want to use relatedness measure $C$ to evaluate the accuracy of the clustering solutions obtained using relatedness measures $A$ and $B$. Let $\mathcal{A}^{X|C}$ denote the accuracy of a clustering solution obtained using relatedness measure $X$ (with $X = A$ or $X = B$), where the accuracy is evaluated using relatedness measure $C$. We define $\mathcal{A}^{X|C}$ as
$$\mathcal{A}^{X|C} = \frac{1}{N} \sum_{i} \sum_{j} I(x_i^X = x_j^X)\, r_{ij}^C, \tag{4}$$
where $N$ denotes the number of publications. The clustering solution obtained using relatedness measure $A$ is considered to be more accurate than the clustering solution obtained using relatedness measure $B$ if $\mathcal{A}^{A|C} > \mathcal{A}^{B|C}$, and the other way around.
The above approach for evaluating the accuracy of a clustering solution favors less granular solutions over more granular ones. Of all possible clustering solutions, the least granular solution is the one in which all publications belong to the same cluster. According to (4), this least granular clustering solution always has the highest possible accuracy. There can be no other clustering solution with a higher accuracy.
In order to perform meaningful comparisons, (4) should be used only for comparing clustering solutions that have the same granularity.
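The bias toward less granular solutions is easy to see in a small example. The sketch below is illustrative only: the function name and the toy evaluation matrix are hypothetical, and it assumes that the accuracy in (4) is the within-cluster sum of the evaluation relatedness divided by the number of publications.

```python
def accuracy(x, r_eval):
    """Within-cluster sum of the evaluation relatedness, divided by the number
    of publications (assumed form of the accuracy measure in (4))."""
    n = len(x)
    return sum(r_eval[i][j]
               for i in range(n) for j in range(n) if x[i] == x[j]) / n

# Hypothetical evaluation relatedness for 4 publications (each row sums to 1).
r_eval = [[0.0, 0.6, 0.4, 0.0],
          [0.5, 0.0, 0.5, 0.0],
          [0.3, 0.3, 0.0, 0.4],
          [0.0, 0.0, 1.0, 0.0]]

acc_all = accuracy([0, 0, 0, 0], r_eval)    # all publications in one cluster
acc_split = accuracy([0, 0, 1, 1], r_eval)  # two clusters of two publications
print(acc_all, acc_split)
```

Because every non-negative relatedness value is counted when all publications share one cluster, no other clustering can reach a higher value, which is why accuracy should be compared only at equal granularity.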
How do we determine whether two clustering solutions have the same granularity? We could require that both clustering solutions have been obtained using the same value for the resolution parameter $\gamma$. Alternatively, we could require that both clustering solutions consist of the same number of clusters. We do not take either of these approaches. Instead, we require that the sum of the squared cluster sizes is the same for two clustering solutions. In other words, two clustering solutions obtained using relatedness measures $A$ and $B$ have the same granularity if
$$\sum_{k} (s_k^A)^2 = \sum_{k} (s_k^B)^2. \tag{5}$$
If (5) is satisfied, (4) can be used to compare in an unbiased way the clustering solutions obtained using relatedness measures $A$ and $B$. On the other hand, if (5) is not satisfied, a comparison based on (4) will be biased in favor of the less granular clustering solution. In practice, obtaining two clustering solutions that satisfy (5) typically will not be easy. For both clustering solutions, it may require a significant amount of trial and error with different values of the resolution parameter $\gamma$. In the end, it may turn out that (5) can be satisfied only approximately, not exactly. We will get back to this issue in Subsection 4.3.
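A minimal sketch of the granularity condition follows. The helper names and the optional relative tolerance are hypothetical additions for illustration; the text itself only states the exact condition on the sum of the squared cluster sizes.

```python
from collections import Counter

def sum_squared_sizes(x):
    """Sum of the squared cluster sizes of a clustering given as a label list."""
    return sum(s * s for s in Counter(x).values())

def same_granularity(x_a, x_b, rel_tol=0.0):
    """Check the granularity condition, exactly by default or up to a
    relative tolerance (the tolerance is an assumption of this sketch)."""
    s_a, s_b = sum_squared_sizes(x_a), sum_squared_sizes(x_b)
    return abs(s_a - s_b) <= rel_tol * max(s_a, s_b)

# Three clusterings of 6 publications, each consisting of 3 clusters.
x_a = [0, 0, 0, 1, 1, 2]   # sizes 3, 2, 1 -> 9 + 4 + 1 = 14
x_b = [0, 0, 1, 1, 2, 2]   # sizes 2, 2, 2 -> 4 + 4 + 4 = 12
x_c = [0, 1, 1, 1, 2, 2]   # sizes 1, 3, 2 -> 1 + 9 + 4 = 14

print(same_granularity(x_a, x_c))               # equal sums of squares
print(same_granularity(x_a, x_b))               # same number of clusters, different sums
print(same_granularity(x_a, x_b, rel_tol=0.2))  # approximately equal
```

The example shows why the number of clusters is not used as the criterion: `x_a` and `x_b` both have three clusters, yet their sums of squared cluster sizes differ.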
A conceptual motivation for the evaluation framework introduced in this subsection is presented in Appendix A.1. This motivation is based on an analogy with the evaluation of the accuracy of different indicators that provide estimates of values drawn from a probability distribution.

Consistency of the evaluation framework
The choice of the accuracy measure defined in (4) and the granularity condition presented in (5) may seem quite arbitrary. However, provided that we use the quality function defined in (1), this choice has an important justification. Suppose that the accuracy of clustering solutions is evaluated using some relatedness measure $C$. Our choice of the accuracy measure in (4) and the granularity condition in (5) then guarantees that of all possible clustering solutions of a certain granularity the solution obtained using relatedness measure $C$ will be the most accurate one. In other words, it is guaranteed that $\mathcal{A}^{C|C} \geq \mathcal{A}^{X|C}$ for any relatedness measure $X$. This is a fundamental consistency property that we believe should be satisfied by any sound framework for evaluating the accuracy of clustering solutions obtained using different relatedness measures.
Suppose for instance that we have three clustering solutions, all of the same granularity, one solution obtained based on direct citation relations between publications, another one obtained based on bibliographic coupling relations, and a third one obtained based on co-citation relations. Suppose also that the accuracy of the clustering solutions is evaluated based on direct citation relations. It would then be a rather odd outcome if the clustering solution obtained based on bibliographic coupling or co-citation relations turned out to be more accurate than the solution obtained based on direct citation relations. In our evaluation framework, it is guaranteed that there can be no such inconsistent outcomes. When the accuracy of clustering solutions is evaluated based on direct citation relations, the clustering solution obtained based on direct citation relations will always be the most accurate one. We refer to Appendix B for a formal analysis of this important consistency property.
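The consistency property can be illustrated by brute force on a tiny example. The sketch below is not the formal analysis of Appendix B. It assumes that the quality function is maximized exactly (here by enumerating all clusterings of six publications) and uses a hypothetical random relatedness matrix. Since accuracy is the within-cluster relatedness divided by a constant, comparing within-cluster sums is sufficient.

```python
import random

def partitions(n):
    """Yield every clustering of n items as a list of cluster labels."""
    def extend(labels, used):
        if len(labels) == n:
            yield list(labels)
            return
        for c in range(used + 1):  # an existing cluster 0..used-1, or a new one
            labels.append(c)
            yield from extend(labels, max(used, c + 1))
            labels.pop()
    yield from extend([], 0)

def within(r, x):
    """Sum of relatedness over all ordered pairs in the same cluster."""
    n = len(x)
    return sum(r[i][j] for i in range(n) for j in range(n) if x[i] == x[j])

def sum_sq(x):
    """Sum of the squared cluster sizes."""
    counts = {}
    for c in x:
        counts[c] = counts.get(c, 0) + 1
    return sum(v * v for v in counts.values())

def consistent(r_eval, gamma):
    """True if the exact maximizer of the quality function under r_eval is at
    least as accurate (under r_eval) as every clustering of equal granularity."""
    n = len(r_eval)
    best = max(partitions(n), key=lambda x: within(r_eval, x) - gamma * sum_sq(x))
    best_within = within(r_eval, best)
    return all(within(r_eval, x) <= best_within + 1e-9
               for x in partitions(n) if sum_sq(x) == sum_sq(best))

rng = random.Random(0)  # fixed seed, hypothetical data
N = 6
r_c = [[0.0 if i == j else rng.random() for j in range(N)] for i in range(N)]
results = [consistent(r_c, g) for g in (0.1, 0.3, 0.5, 0.8)]
print(results)
```

The check succeeds for every resolution value because, among clusterings with the same sum of squared cluster sizes, the quality function and the accuracy differ only by a constant and a positive scale factor.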

Independent evaluation criterion
As already mentioned in Section 1, the approach that we take in this paper is to [...] surprising. This illustrates the importance of using an independent evaluation criterion. The more the relatedness measure used for evaluation can be considered to be independent of the relatedness measures being evaluated, the more informative the evaluation will be.
In Appendix A.2, we provide a further demonstration of the importance of using an independent evaluation criterion.

Relatedness measures
We now discuss the relatedness measures that we consider in this paper. We first discuss relatedness measures based on citation relations, followed by relatedness measures based on textual similarity. We also discuss the so-called top $M$ relatedness approach as well as the idea of normalized relatedness measures.

Citation-based relatedness measures
Below we discuss a number of citation-based approaches for determining the pairwise relatedness for a set of $N$ publications. We use $c_{ij}$ to indicate whether publication $i$ cites publication $j$ ($c_{ij} = 1$) or not ($c_{ij} = 0$).
The relatedness of publications $i$ and $j$ based on direct citation relations is given by
$$r_{ij}^{DC} = \max(c_{ij}, c_{ji}). \tag{6}$$
The relatedness of publications $i$ and $j$ based on bibliographic coupling relations equals the number of common references in the two publications. This can be written as
$$r_{ij}^{BC} = \sum_{k} c_{ik} c_{jk}, \tag{7}$$
where the summation extends over all publications in the database that we use, not only over the $N$ publications for which we aim to determine their pairwise relatedness.
As is well known, co-citation can be seen as the opposite of bibliographic coupling. The relatedness of publications $i$ and $j$ based on co-citation relations equals the number of publications in which publications $i$ and $j$ are both cited. In mathematical terms,
$$r_{ij}^{CC} = \sum_{k} c_{ki} c_{kj}, \tag{8}$$
where the summation again extends over all publications in the database that we use.
The above approaches for determining the relatedness of publications may also be combined. This results in
$$r_{ij}^{DC\text{-}BC\text{-}CC} = \alpha\, r_{ij}^{DC} + r_{ij}^{BC} + r_{ij}^{CC}, \tag{9}$$
where $\alpha$ denotes a parameter that determines the weight of direct citation relations relative to bibliographic coupling and co-citation relations. A direct citation relation may be considered a stronger signal of the relatedness of two publications than a bibliographic coupling or co-citation relation (Waltman & Van Eck, 2012), and therefore one may want to give more weight to a direct citation relation than to the two other types of relations. This can be achieved by setting $\alpha$ to a value above 1. The idea of combining different types of citation-based relations is not new. This idea was also explored by Small (1997) and Persson (2010).
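As an illustration, the citation-based relatedness measures can be computed from a binary citation matrix as follows. This is a sketch under stated assumptions: the function name, the `focal` argument, and the toy citation data are hypothetical, and the combined measure is assumed to be the weighted sum of the three individual measures, with the weight applied to the direct citation term.

```python
def citation_relatedness(c, focal, alpha=1.0):
    """c[i][j] = 1 if publication i cites publication j, over ALL publications
    in the database; focal lists the publications whose pairwise relatedness
    is wanted. Returns dicts keyed by (i, j) for DC, BC, CC, and the combination."""
    m = len(c)
    dc, bc, cc = {}, {}, {}
    for i in focal:
        for j in focal:
            if i == j:
                continue  # relatedness of a publication with itself is ignored
            dc[i, j] = max(c[i][j], c[j][i])                      # direct citation
            bc[i, j] = sum(c[i][k] * c[j][k] for k in range(m))   # common references
            cc[i, j] = sum(c[k][i] * c[k][j] for k in range(m))   # common citing publications
    combined = {pair: alpha * dc[pair] + bc[pair] + cc[pair] for pair in dc}
    return dc, bc, cc, combined

# Hypothetical database of 5 publications; 0, 1, 2 are focal,
# 3 is an older cited publication, 4 is a later citing publication.
c = [[0] * 5 for _ in range(5)]
for citing, cited in [(0, 3), (1, 3), (1, 0), (2, 1), (4, 0), (4, 1)]:
    c[citing][cited] = 1

dc, bc, cc, comb = citation_relatedness(c, focal=[0, 1, 2], alpha=5.0)
print(dc[0, 1], bc[0, 1], cc[0, 1], comb[0, 1])
```

Publications 0 and 1 are linked by one direct citation, one shared reference (publication 3), and one co-citing publication (publication 4), so with a weight of 5 their combined relatedness is 5 + 1 + 1 = 7. Note that the sums for bibliographic coupling and co-citation run over the whole database, not only over the focal publications.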
In addition to the above citation-based approaches for determining the relatedness of publications, we also consider a so-called extended direct citation approach. Like the ordinary direct citation approach, the extended direct citation approach takes into account only direct citation relations between publications. However, direct citation relations are considered not just within the set of $N$ focal publications but within an extended set of publications. In addition to the $N$ focal publications, the extended set of publications includes all publications in our database that have a direct citation relation with at least two focal publications. The technical details of the extended direct citation approach are somewhat complex. These details are discussed in Appendix C. We note that an approach similar to our extended direct citation approach was also used by Boyack and Klavans (2014) and Klavans and Boyack (2017).

Text-based relatedness measures
We consider two text-based approaches for determining the relatedness of publications. We use $o_{it}$ to denote the number of occurrences of term $t$ in publication $i$. To count the number of occurrences of a term in a publication, only the title and abstract of the publication are considered, not the full text. Part-of-speech tagging is applied to the title and abstract of the publication to identify nouns and adjectives.
The part-of-speech tagging algorithm provided by the Apache OpenNLP 1.5.2 library is used. A term is defined as a sequence of nouns and adjectives, with the last word in the sequence being a noun. No distinction is made between singular and plural nouns, so neural network and neural networks are regarded as the same term. Furthermore, shorter terms embedded in longer terms are not counted. For instance, if a publication contains the term artificial neural network, this is counted as an occurrence of artificial neural network but not as an occurrence of neural network or network.
Finally, no stop word list is used, so there are no terms that are excluded from being counted.
A straightforward text-based measure of the relatedness of publications $i$ and $j$ is relatedness based on common terms, defined in (10). The denominator in (10) aims to reduce the influence of frequently occurring terms. A parameter in the denominator determines the extent to which the influence of these terms is reduced. If this parameter equals 0, no reduction in the influence of frequently occurring terms takes place. On the other hand, if it equals 1, the influence of frequently occurring terms is strongly reduced, following a so-called fractional counting approach (Perianes-Rodriguez, Waltman, & Van Eck, 2016).
Boyack et al. (2011) identified BM25 as one of the most accurate text-based relatedness measures for clustering publications. We therefore also include BM25 in our analysis. BM25 originates from the field of information retrieval, where it is used to determine the relevance of a document for a search query (Sparck Jones, Walker, & Robertson, 2000a, 2000b). Following Boyack et al. (2011), we use BM25 as a text-based measure of the relatedness of publications. The BM25 relatedness measure is defined as
$$r_{ij}^{BM25} = \sum_{t} I(o_{it} > 0)\, \mathrm{IDF}_t\, \frac{o_{jt}\,(k_1 + 1)}{o_{jt} + k_1 \left(1 - b + b\, l_j / \bar{l}\right)}, \tag{11}$$
where $I(o_{it} > 0)$ equals 1 if $o_{it} > 0$ and 0 otherwise and where $l_i$ and $\bar{l}$ denote, respectively, the length of publication $i$ and the average length of all $N$ publications.
We define the length of a publication as the total number of occurrences of terms in the publication. This results in
$$l_i = \sum_{t} o_{it}, \qquad \bar{l} = \frac{1}{N} \sum_{i} l_i. \tag{12}$$
$\mathrm{IDF}_t$ in (11) denotes the inverse document frequency of term $t$, which is defined in (13) as a function of $n_t$, the number of publications in which term $t$ occurs, that is,
$$n_t = \sum_{i} I(o_{it} > 0). \tag{14}$$
The BM25 relatedness measure in (11) depends on the parameters $k_1$ and $b$.
Following Boyack et al. (2011), we set these parameters to values of 2 and 0.75, respectively. Unlike all other relatedness measures that we consider in this paper, the BM25 relatedness measure is not symmetrical. In other words, $r_{ij}^{BM25}$ does not need to be equal to $r_{ji}^{BM25}$.
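The asymmetry of BM25 can be seen in a small Python sketch. Several choices here are assumptions rather than the exact definitions used in this paper: the standard BM25 weighting is used with one publication in the role of the query and the other in the role of the document, the inverse document frequency takes the common form log((N - n + 0.5) / (n + 0.5)), and the function name and toy term counts are hypothetical.

```python
import math

def bm25_relatedness(occ, k1=2.0, b=0.75):
    """occ[i] maps each term to its number of occurrences in publication i.
    Returns an n x n matrix r with r[i][j] the BM25 relatedness of i with j."""
    n = len(occ)
    lengths = [sum(o.values()) for o in occ]   # publication length = total term occurrences
    avg_len = sum(lengths) / n
    doc_freq = {}                              # number of publications containing each term
    for o in occ:
        for term, count in o.items():
            if count > 0:
                doc_freq[term] = doc_freq.get(term, 0) + 1
    # ASSUMPTION: a common IDF form; it can be negative for very frequent terms.
    idf = {t: math.log((n - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()}
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            norm = k1 * (1 - b + b * lengths[j] / avg_len)
            # Only terms occurring in both publications contribute.
            r[i][j] = sum(idf[t] * occ[j][t] * (k1 + 1) / (occ[j][t] + norm)
                          for t, count in occ[i].items()
                          if count > 0 and occ[j].get(t, 0) > 0)
    return r

# Hypothetical term counts for 5 publications.
occ = [{"neural network": 3, "training": 1},
       {"neural network": 1, "image": 1},
       {"market": 2, "price": 2},
       {"market": 1, "growth": 3},
       {"protein": 2}]
r = bm25_relatedness(occ)
print(r[0][1], r[1][0])
```

Publications 0 and 1 share one term but use it with different frequencies and have different lengths, so the relatedness of 0 with 1 differs from the relatedness of 1 with 0.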

Top 𝑴 relatedness approach
Our interest focuses on large-scale clustering analyses that may involve hundreds of thousands or even millions of publications. These analyses impose significant challenges in terms of computing time and memory requirements. In particular, in these analyses, it may not be feasible to store all non-zero relatedness values in the main memory of the computer that is used.
To deal with this problem, we use the top $M$ relatedness approach. This approach is quite similar to the idea of similarity filtering typically used by Kevin Boyack and Dick Klavans (e.g., Boyack & Klavans, 2010; Boyack et al., 2011). In the top $M$ relatedness approach, only the top $M$ strongest relations per publication are kept. The remaining relations are discarded. We use $\tilde{r}_{ij}^X$ to denote the relatedness of publications $i$ and $j$ based on relatedness measure $X$ after discarding relations that are not in the top $M$ per publication. This means that $\tilde{r}_{ij}^X = r_{ij}^X$ if publication $j$ is among the $M$ publications that are most strongly related to publication $i$ and that $\tilde{r}_{ij}^X = 0$ otherwise.
Relatedness of a publication with itself is ignored. Hence, $\tilde{r}_{ij}^X = 0$ if $i = j$. In general, $\tilde{r}_{ij}^X$ will not be symmetrical.
In most of the analyses presented in this paper, we use a value of 20 for $M$, although we also explore alternative values. We apply the top $M$ relatedness approach to all our relatedness measures except for the measures based on (extended) direct citation relations. As pointed out by Waltman and Van Eck (2012), the use of direct citation relations has the advantage of requiring only a relatively limited amount of computer memory, and therefore there is no need to use the top $M$ relatedness approach when working with direct citation relations. Applying the top $M$ relatedness approach in the case of direct citation relations would also be problematic because all relations are equally strong, making it difficult to decide which relations to keep and which ones to discard. Hence, in the case of direct citation relations, we simply have $\tilde{r}_{ij}^{DC} = r_{ij}^{DC}$ for all publications $i$ and $j$.
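The top-M filtering step can be sketched as follows. The function name and toy matrix are hypothetical, and the tie-breaking rule (earlier publications win when relations are equally strong) is an assumption of this sketch, since the text does not specify how ties are resolved.

```python
def top_m(r, m):
    """Keep, for each publication i, only its m strongest non-zero relations;
    all other entries, including the diagonal, are set to zero."""
    n = len(r)
    kept = [[0.0] * n for _ in range(n)]
    for i in range(n):
        candidates = [j for j in range(n) if j != i and r[i][j] > 0]
        # Stable sort: among equally strong relations, lower indices come first (assumption).
        candidates.sort(key=lambda j: r[i][j], reverse=True)
        for j in candidates[:m]:
            kept[i][j] = r[i][j]
    return kept

# Hypothetical symmetric relatedness matrix for 4 publications.
r = [[0.0, 3.0, 1.0, 2.0],
     [3.0, 0.0, 5.0, 0.0],
     [1.0, 5.0, 0.0, 4.0],
     [2.0, 0.0, 4.0, 0.0]]
kept = top_m(r, 1)
for row in kept:
    print(row)
```

Even though the input matrix is symmetric, the filtered matrix is not: publication 1 is the strongest relation of publication 0, but the strongest relation of publication 1 is publication 2, so the relation between 0 and 1 survives in one direction only.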

Normalization of relatedness measures
We also normalize all relatedness measures. The normalized relatedness of publication $i$ with publication $j$ equals the relatedness of publication $i$ with publication $j$ divided by the total relatedness of publication $i$ with all publications.
Hence, the normalized relatedness of publication $i$ with publication $j$ based on relatedness measure $X$ is given by
$$\hat{r}_{ij}^X = \frac{\tilde{r}_{ij}^X}{\sum_{k} \tilde{r}_{ik}^X}. \tag{15}$$
This normalization was also used by Waltman and Van Eck (2012). The idea of the normalization is that relatedness values of publications in different fields of science should be of the same order of magnitude, so that clusters in different fields will be of similar size. Without the normalization, citation-based relatedness values for instance can be expected to be much higher in the life sciences than in the social sciences. In a clustering analysis that involves both publications in the life sciences and publications in the social sciences, this would result in life science clusters being systematically larger than social science clusters. The normalization in (15) can be used to correct for such differences between fields. The normalization also has the advantage that, regardless of the choice of a relatedness measure, a specific value of the resolution parameter $\gamma$ will always yield clustering solutions that have approximately the same granularity.
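A minimal sketch of the normalization follows. The function name and toy data are hypothetical, and leaving the row of a publication without any relations at zero is an assumption made here to avoid division by zero.

```python
def normalize(r):
    """Divide each relatedness value by the total relatedness of the publication
    in that row. Rows without any relations are left at zero (assumption)."""
    normalized = []
    for row in r:
        total = sum(row)
        normalized.append([value / total if total > 0 else 0.0 for value in row])
    return normalized

# Hypothetical relatedness values: the first two publications come from a field
# with high citation density, the third from a field with low density.
r = [[0.0, 30.0, 10.0, 0.0],
     [30.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 3.0],
     [0.0, 0.0, 0.0, 0.0]]
r_norm = normalize(r)
for row in r_norm:
    print(row)
```

After normalization, every publication with at least one relation has a total relatedness of 1, regardless of how large the raw values in its field are.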
All results presented in the next section are based on normalized relatedness measures.

Results
We start the discussion of the results of our analyses by explaining the data collection and the way in which publications were clustered. We then introduce the idea of granularity-accuracy plots. Next, we present a comparison of different citation-based relatedness measures that can be used to cluster publications. This is followed by a comparison of different text-based relatedness measures.

Data collection
Data was collected from the Web of Science database. We used the in-house version of the Web of Science database available at the Centre for Science and Technology Studies at Leiden University. This version of the database includes the Science Citation Index Expanded, the Social Sciences Citation Index, and the Arts & Humanities Citation Index.
As in our earlier work (e.g., Klavans & Boyack, 2017; Waltman & Van Eck, 2012), our final interest is in clustering all publications available in the database that we use, without restricting ourselves to certain fields of science. However, to keep the analyses presented in this paper manageable, we restricted ourselves to three specific fields. We selected all publications of the document types article and review that appeared in the period 2007-2016 in journals belonging to the Web of Science subject categories Cell biology, Physics, condensed matter, and Economics. Our aim was to cover three broad scientific domains, namely the life sciences, the physical sciences, and the social sciences. The subject categories Cell biology, Physics, condensed matter, and Economics were chosen because they cover these three domains and because they are relatively large in terms of the number of publications they include.
The number of publications is 252,954 in cell biology, 272,935 in condensed matter physics, and 172,690 in economics.
The relatedness measures discussed in Section 3 were calculated for the selected publications. Two comments need to be made. First, in determining bibliographic coupling relations between publications, only common references to publications indexed in our Web of Science database were considered. This database includes publications starting from 1980. Common references to non-indexed publications (e.g., books, conference proceedings publications, and PhD theses) were not taken into account. Non-indexed publications were not considered in the extended direct citation approach either. Second, when we collected the data in Spring 2017, our database included a limited number of publications from 2017. These publications were not used in determining co-citation relations between publications. They also were not considered in the extended direct citation approach.

Clustering of publications
For each of our three fields (i.e., cell biology, condensed matter physics, and economics), the selected publications were clustered based on each of our relatedness measures. Clustering was performed by maximizing the quality function presented in (1). To maximize the quality function, we used an iterative variant (Waltman & Van Eck, 2013) of the well-known Louvain algorithm (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008). Five iterations of the algorithm were performed. In addition, to speed up the algorithm, we employed ideas similar to the pruning idea of Ozaki, Tezuka, and Inaba (2016) and the prioritization idea of Bae, Halperin, West, Rosvall, and Howe (2017). Our algorithm is a predecessor of the recently introduced Leiden algorithm (Traag, Waltman, & Van Eck, 2018), which was not yet available when we carried out our analyses. In general, our algorithm will not be able to find the global maximum of the quality function, but it can be expected to get close to the global maximum.
Different levels of granularity were considered. For each relatedness measure, we obtained ten clustering solutions, each of them for a different value of the resolution parameter $\gamma$. The following values of $\gamma$ were used: 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, and 0.01. Because of the normalization discussed in Subsection 3.4, the same values of $\gamma$ could be used for all relatedness measures. Without the normalization, different values of $\gamma$ would need to be used for each of the relatedness measures.

Granularity-accuracy plots
A difficulty of the evaluation framework presented in Subsection 2.2 is the requirement that the clustering solutions being compared have exactly the same granularity. This requirement, which is formalized in the condition in (5), is hard to meet in practice. Clustering solutions obtained using different relatedness measures but the same value of the resolution parameter $\gamma$ will approximately satisfy (5), but the condition normally will not be satisfied exactly.
To deal with this problem, we propose a graphical approach based on the idea of granularity-accuracy (GA) plots. Using a GA plot, relatedness measures can be compared despite differences in granularity between clustering solutions. The horizontal axis in a GA plot represents the granularity of a clustering solution. We define the granularity of a clustering solution obtained using relatedness measure $X$ as
$$G^X = \frac{N}{\sum_{k} (s_k^X)^2}. \tag{16}$$
Two clustering solutions that have the same granularity according to (16) indeed satisfy the condition in (5). The vertical axis in a GA plot represents the accuracy of a clustering solution as defined in (4). Clustering solutions are plotted in a GA plot based on their granularity and accuracy. Lines are drawn between clustering solutions obtained using the same relatedness measure but different values of the resolution parameter $\gamma$. We use a logarithmic scale for both the horizontal and the vertical axis in a GA plot.
In the interpretation of a GA plot, one should be aware that for any relatedness measure an increase in granularity will always cause a decrease in accuracy. This is a mathematical necessity in our evaluation framework, and therefore it is not something one should be concerned about. A GA plot can be interpreted by comparing the accuracy of different relatedness measures at a specific level of granularity. As explained above, clustering solutions obtained using different relatedness measures normally do not have exactly the same granularity. However, in a GA plot, lines are drawn between different clustering solutions obtained using the same relatedness measure, providing interpolations between these solutions. Based on such interpolations, the accuracy of different relatedness measures can be compared at a specific level of granularity. These comparisons can be performed at different levels of granularity. Sometimes different levels of granularity will yield inconsistent results, with for instance relatedness measure $A$ outperforming relatedness measure $B$ at one level of granularity and the opposite outcome at another level of granularity. In other cases, consistent results will be obtained at all levels of granularity. For instance, relatedness measure $A$ may consistently outperform relatedness measure $B$, regardless of the level of granularity.
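The interpolation that a GA plot performs visually can also be carried out numerically. The sketch below is illustrative only. It assumes that granularity is the number of publications divided by the sum of the squared cluster sizes, and it interpolates linearly between neighboring points on the logarithmic scales of both axes. The function names and the two curves are hypothetical and do not correspond to results reported in this paper.

```python
import math
from collections import Counter

def granularity(x):
    """Number of publications divided by the sum of squared cluster sizes
    (assumed form of the granularity measure)."""
    return len(x) / sum(s * s for s in Counter(x).values())

def accuracy_at(points, g):
    """Interpolate accuracy at granularity g along a GA curve.
    points: (granularity, accuracy) pairs with distinct granularities;
    the interpolation is linear on the log-log scale."""
    pts = sorted(points)
    for (g0, a0), (g1, a1) in zip(pts, pts[1:]):
        if g0 <= g <= g1:
            w = (math.log(g) - math.log(g0)) / (math.log(g1) - math.log(g0))
            return math.exp(math.log(a0) + w * (math.log(a1) - math.log(a0)))
    raise ValueError("granularity outside the range covered by the curve")

# Hypothetical GA curves for two relatedness measures.
curve_a = [(0.001, 0.5), (0.01, 0.25)]
curve_b = [(0.0012, 0.45), (0.011, 0.2)]

g = 10 ** -2.5  # a granularity level covered by both curves
acc_a = accuracy_at(curve_a, g)
acc_b = accuracy_at(curve_b, g)
print(granularity([0, 0, 1, 1]), acc_a, acc_b)
```

Comparing `acc_a` and `acc_b` at the same value of `g` corresponds to reading off the vertical distance between two lines in a GA plot at a fixed position on the horizontal axis.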
In the next two subsections, GA plots will be used to compare different citation-based and text-based relatedness measures.

Comparison of citation-based relatedness measures
For each of our three fields (i.e., cell biology, condensed matter physics, and economics), Figure 1 presents a GA plot for comparing the DC, BC, CC, DC-BC-CC, and EDC citation-based relatedness measures discussed in Subsection 3.1. In the case of the DC-BC-CC relatedness measure, two values of the parameter $\alpha$ are considered, $\alpha = 1$ and $\alpha = 5$. The BM25 text-based relatedness measure discussed in Subsection 3.2 is used as the evaluation criterion. Results obtained when this relatedness measure is used to cluster publications are also included in the GA plots. These results provide an upper bound for the results that can be obtained using the citation-based relatedness measures. (Recall from Subsection 2.3 that the highest possible accuracy is obtained when publications are clustered based on the same relatedness measure that is also used as the evaluation criterion.) All relatedness measures (except for DC and EDC; see Subsection 3.3) use a value of 20 for the parameter $M$ of the top $M$ relatedness approach.
To interpret the GA plots in Figure 1, it is important to have some understanding of the meaning of the different levels of granularity. For each of our three fields, a clustering solution consists of several hundred significant clusters when the granularity is around 0.001, where we define a significant cluster as a cluster that includes at least ten publications. A granularity around 0.01 corresponds to several thousand significant clusters.

As can be seen in Figure 1, the results obtained for cell biology, condensed matter physics, and economics are quite similar. Using BM25 as the evaluation criterion, CC has the worst performance of all citation-based relatedness measures. This is not surprising. Uncited publications have no co-citation relations with other publications and therefore cannot be properly clustered. This is an important explanation of the poor performance of CC, which is in line with recent results of Klavans and Boyack (2017). DC outperforms CC but is outperformed by all other citation-based relatedness measures. The performance of DC is especially weak in cell biology. The disappointing performance of DC in all three fields is an important finding, in particular given the increasing popularity of DC in recent years. BC, DC-BC-CC, and EDC all perform about equally well. DC-BC-CC and EDC seem to slightly outperform BC, but the difference is tiny, especially in cell biology and condensed matter physics. Likewise, there is hardly any difference between the parameter values 1 and 5 for DC-BC-CC. Our finding that BC and EDC perform about equally well differs from results of Klavans and Boyack (2017), who found that an approach similar to EDC significantly outperforms BC. Our results are based on a more principled evaluation framework and a different evaluation criterion than the results of Klavans and Boyack, which most likely explains why our findings are different from theirs.
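The observation that uncited publications have no co-citation relations can be checked on a toy citation list. The Python sketch below is an illustration only. DC is computed as a 0/1 relation between two publications of which one cites the other. BC and CC are computed as raw counts of shared references and shared citing publications, which is the textbook form and may differ from the exact definitions in Subsection 3.1. The function name and the publication identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def citation_relatedness(citations):
    """Compute DC, BC, and CC relatedness from (citing, cited) pairs.

    Returns three dicts that map an unordered pair of publications,
    stored as a frozenset, to a relatedness value.
    """
    refs = defaultdict(set)    # publication -> publications it cites
    citers = defaultdict(set)  # publication -> publications that cite it
    for citing, cited in citations:
        refs[citing].add(cited)
        citers[cited].add(citing)

    # DC: 1 if one of the two publications cites the other.
    dc = {frozenset((i, j)): 1 for i, cited in refs.items() for j in cited if i != j}

    # BC (assumed raw count): number of references shared by two publications.
    bc = defaultdict(int)
    for citing_set in citers.values():
        for i, j in combinations(sorted(citing_set), 2):
            bc[frozenset((i, j))] += 1

    # CC (assumed raw count): number of publications citing both publications.
    cc = defaultdict(int)
    for ref_set in refs.values():
        for i, j in combinations(sorted(ref_set), 2):
            cc[frozenset((i, j))] += 1

    return dc, dict(bc), dict(cc)


# Hypothetical citation list in which p4 is never cited.
citations = [("p3", "p1"), ("p3", "p2"), ("p4", "p1"), ("p4", "p2"), ("p4", "p3")]
dc, bc, cc = citation_relatedness(citations)

uncited_in_cc = any("p4" in pair for pair in cc)
print("p4 has co-citation relations:", uncited_in_cc)
print("p4 has bibliographic coupling relations:", any("p4" in pair for pair in bc))
```

In this toy example the uncited publication p4 is still linked to other publications through DC and BC, but it does not appear in any CC pair, so a clustering based on CC alone has no information about where to place it.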
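To test the sensitivity of our results to the value of the parameter of the top  relatedness approach, Figure 2 compares the DC-BC-CC citation-based relatedness measure for different values of this parameter, again using the BM25 text-based relatedness measure as the evaluation criterion.

Comparison of text-based relatedness measures
For each of our three fields, Figure 3 presents a GA plot for comparing text-based relatedness measures. The DC-BC-CC citation-based relatedness measure (with its parameter set to 1) is used as the evaluation criterion. Results obtained when this relatedness measure is used to cluster publications are also included in the GA plots. These results provide an upper bound for the results that can be obtained using the text-based relatedness measures. All relatedness measures use a value of 20 for the parameter of the top  relatedness approach.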
The results presented in Figure 3 for cell biology, condensed matter physics, and economics are very similar. Using DC-BC-CC as the evaluation criterion, BM25 outperforms CT, regardless of the value of the CT parameter. The good performance of BM25 is in agreement with the results of Boyack et al. (2011). By far the worst performance is obtained when CT is used with a parameter value of 0.0. This confirms the importance of reducing the influence of frequently occurring terms. However, CT with a parameter value of 0.5 outperforms CT with a parameter value of 1.0. Hence, the influence of frequently occurring terms should not be reduced too strongly.

Conclusions
The problem of clustering scientific publications involves significant conceptual and methodological challenges. We have introduced a principled methodology for evaluating the accuracy of clustering solutions obtained using different relatedness measures. Our methodology can be applied to evaluate the accuracy of clustering solutions obtained using two relatedness measures, where a third relatedness measure is used as the evaluation criterion. Preferably, the third relatedness measure should be as independent as possible from the two relatedness measures being compared. The two relatedness measures being compared may for instance be citation-based relatedness measures, and the third relatedness measure may be a text-based relatedness measure (or the other way around).
The empirical results that we have presented are based on a large-scale analysis of publications in the fields of cell biology, condensed matter physics, and economics indexed in the Web of Science database. We have used our proposed methodology, complemented with a graphical approach based on so-called GA plots, to compare different citation-based relatedness measures that can be used to cluster publications. Using the BM25 text-based relatedness measure as the evaluation criterion, we have found that co-citation relations and direct citation relations yield less accurate clustering solutions than a number of other citation-based relatedness measures. Bibliographic coupling relations, possibly combined with direct citation relations and co-citation relations, can be used to obtain more accurate clustering solutions. The so-called extended direct citation approach yields clustering solutions with an accuracy that is similar to or even somewhat higher than the accuracy of clustering solutions obtained using bibliographic coupling relations. The other way around, we have compared different text-based relatedness measures using a citation-based relatedness measure (obtained by combining direct citation relations, bibliographic coupling relations, and co-citation relations) as the evaluation criterion. BM25 has turned out to yield more accurate clustering solutions than the other text-based relatedness measures that we have studied.
We have also analyzed the use of the so-called top  relatedness approach. This approach can be used to reduce the amount of computing time and computer memory needed to cluster publications. We have found that the use of the top  relatedness approach does not decrease the accuracy of clustering solutions. In fact, in the case of text-based relatedness measures, the accuracy of clustering solutions may even increase.
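The saving in computing time and memory can be made concrete with a small Python sketch. The precise definition of the approach is given in Subsection 3.3 and is not reproduced here. The sketch assumes that, for each publication, only a fixed number of its strongest relations with other publications are kept. The function name `keep_top_relations`, the parameter name `n_keep`, and the relatedness values are hypothetical.

```python
import heapq


def keep_top_relations(relatedness, n_keep):
    """Keep, for each publication, only its n_keep strongest relations.

    relatedness: dict mapping a publication to a dict that maps other
    publications to a positive relatedness value.
    Assumption: pruning is done per publication, so the result need not be
    symmetric. How relations are combined afterwards is left open here.
    """
    pruned = {}
    for pub, neighbours in relatedness.items():
        strongest = heapq.nlargest(n_keep, neighbours.items(), key=lambda kv: kv[1])
        pruned[pub] = dict(strongest)
    return pruned


# Hypothetical text-based relatedness values for four publications.
relatedness = {
    "p1": {"p2": 7.5, "p3": 0.4, "p4": 2.1},
    "p2": {"p1": 7.5, "p3": 1.2, "p4": 0.3},
    "p3": {"p1": 0.4, "p2": 1.2, "p4": 5.0},
    "p4": {"p1": 2.1, "p2": 0.3, "p3": 5.0},
}
pruned = keep_top_relations(relatedness, n_keep=2)

before = sum(len(v) for v in relatedness.values())
after = sum(len(v) for v in pruned.values())
print(f"stored relations: {before} before pruning, {after} after pruning")
```

The number of stored relations then grows linearly with the number of publications instead of quadratically, which is particularly relevant for text-based relatedness measures, where almost every pair of publications has a nonzero relatedness.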
In this paper, we have adopted the perspective that it is useful to assume the existence of an absolute notion of accuracy. Given the lack of a ground truth, the accuracy of a clustering solution cannot be directly measured. However, by assuming the existence of an absolute notion of accuracy, our methodology allows the accuracy of a clustering solution to be evaluated in an indirect way. An alternative perspective is that there is no absolute notion of accuracy and that it is not meaningful to ask whether one clustering solution is more accurate than another one (e.g., Gläser et al., 2017). From this perspective, clustering solutions obtained using different relatedness measures each provide a legitimate viewpoint on the organization of the scientific literature. We fully acknowledge the value of this alternative perspective, and we recognize the need to better understand how clustering solutions obtained using different relatedness measures offer complementary viewpoints. Nevertheless, from an applied point of view, we believe that there is a need to evaluate the accuracy of clustering solutions obtained using different relatedness measures and to identify the relatedness measures that yield the most accurate clustering solutions. This motivates our choice to assume the existence of an absolute notion of accuracy. For those who consider this assumption to be problematic, we would like to suggest that the results provided by our methodology could be given an alternative interpretation that does not depend on this assumption. Instead of interpreting the results in terms of accuracy, they could be interpreted in terms of the degree to which different relatedness measures yield similar clustering solutions.
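The most obvious direction for future research is to apply our methodology to a broader set of relatedness measures. Examples include relatedness measures based on full-text data, grant data, and keyword data (e.g., MeSH terms). Some of this work is already ongoing (Boyack & Klavans, 2018).

However, this is not possible, since the clustering solution in question is defined as the clustering solution that maximizes (2) for the relatedness measure and resolution parameter concerned. We therefore have a contradiction. This proves that the accuracy obtained when publications are clustered based on the relatedness measure that is also used as the evaluation criterion is at least as high as the accuracy obtained based on any other relatedness measure.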
A minor qualification needs to be made. In practice, heuristic algorithms are usually used to maximize the quality function in (2). There is no guarantee that these algorithms are able to find the global maximum of the quality function (see Subsection 4.2). In exceptional cases, this might cause the consistency of our evaluation framework to be violated.

2.1. Quality function for clustering publications
Consider a set of publications. For each pair of publications, a relatedness measure yields a nonnegative relatedness value, and each publication is assigned to a cluster, labeled 1, 2, and so on, when publications are clustered based on this relatedness measure.
cluster publications based on a number of different relatedness measures and to use another more or less independent relatedness measure to evaluate the accuracy of the clustering solutions. More specifically, we use a text-based relatedness measure to evaluate the accuracy of different clustering solutions obtained using citation-based relatedness measures, and the other way around, we use a citation-based relatedness measure to evaluate the accuracy of different clustering solutions obtained using text-based relatedness measures. Importantly, we are not interested in evaluating citation-based clustering solutions using a citation-based relatedness measure or text-based clustering solutions using a text-based relatedness measure. Such evaluations are of little interest because the relatedness measure used for evaluation is not sufficiently independent of the relatedness measures being evaluated. For instance, when direct citation relations are used to evaluate the accuracy of different clustering solutions obtained using citation-based relatedness measures, the clustering solution obtained based on direct citation relations will be the most accurate one. The evaluation simply shows that the clustering solution obtained based on direct citation relations is best aligned with an evaluation criterion based on direct citation relations, which of course is not

The DC relatedness of two publications equals 1 if one publication cites the other and 0 if neither publication cites the other.

Figure 1. GA plots for comparing citation-based relatedness measures. The BM25 text-based relatedness measure is used as the evaluation criterion.
Figure 2 presents a GA plot in which the DC-BC-CC citation-based relatedness measure (with its parameter set to 1) is compared for different values of the parameter of the top  relatedness approach. The BM25 text-based relatedness measure is again used as the evaluation criterion. Only the field of condensed matter physics is considered. As can be seen in Figure 2, our results are rather insensitive to the value of this parameter.

Figure 2. GA plot for comparing the DC-BC-CC citation-based relatedness measure (with its parameter set to 1) for different values of the parameter of the top  relatedness approach. The BM25 text-based relatedness measure is used as the evaluation criterion.

Figure 3. GA plots for comparing text-based relatedness measures. The DC-BC-CC citation-based relatedness measure (with its parameter set to 1) is used as the evaluation criterion.

Figure 4. GA plot for comparing the BM25 text-based relatedness measure for different values of the parameter of the top  relatedness approach. The DC-BC-CC citation-based relatedness measure (with its parameter set to 1) is used as the evaluation criterion.