Abstract
Global algorithms have taken precedence in bibliometrics as approaches to the reconstruction of topics from networks of publications. They partition a large set of publications, and the resulting disjoint clusters are then interpreted as individual topics. This is at odds with a sociological understanding of topics as formed by the participants working on and being influenced by them, an understanding that is best operationalized by algorithms that prioritize cohesion rather than separation, use local information, and allow topics to overlap. Thus, a different kind of algorithm is needed for topic reconstruction to be successful. Local algorithms represent a promising solution. In this paper, we present for consideration a new Multilayered, Adjustable, Local Bibliometric Algorithm (MALBA), which is in line with sociological definitions of topics and reconstructs dense regions in bibliometric networks locally. MALBA grows a subgraph from a publication seed by either interacting with a fixed network data set or querying an online database to obtain up-to-date linkage information. New candidates for addition are evaluated by assessing the links in two data models. Experiments with publications on the h-index and with ground truth data positioned in a data set of AMO physics illustrate the properties of MALBA and its potential.
1. INTRODUCTION
Reconstructing scientific topics from networks of papers is a primary focus of bibliometric research, with applications in both science studies and science policy. The dominant approach applies global algorithms—algorithms that partition the whole network by optimizing a global quality function—with data models based on direct citation or bibliographic coupling. The clusters produced by these approaches are interpreted as representing topics (Gläser, Glänzel, & Scharnhorst, 2017; Šubelj, van Eck, & Waltman, 2016). The concept “topic,” its definition, and its theoretical background are rarely discussed in these approaches.
This disregard of theory is problematic because it decouples the bibliometric reconstruction of topics from the sociological discussion of their role in the production of scientific knowledge, which renders bibliometric methods useless for sociological purposes. The disconnection of empirical approaches from theory also undermines validation attempts because the general difficulty of constructing a ground truth on which methods can be validated is exacerbated by a lack of consensus about the theoretical referent of such a ground truth. Attempts to circumvent these problems by determining the convergent validity of methods with a comparison of their outcomes implicitly assume that the different methods operationalize the same concept. This assumption cannot be verified unless the concept is explicated.
One of the causes of that problem appears to be that bibliometrics is “stuck” with a few data models and algorithms, which makes attempts to operationalize theoretical concepts run the risk of not finding suitable bibliometric approaches at all. While this may be the case for data models, because the number of bibliographic metadata from which they can be constructed does not change, the same cannot be said for algorithms. Bibliometrics uses only very few of the constantly growing number of algorithms for the detection of communities in networks. An operationalization of theoretical concepts by choosing or developing suitable algorithms seems possible. Therefore, looking for alternatives to the currently used algorithms seems worthwhile.
With this paper, we turn to a class of algorithms that has rarely been considered in bibliometrics so far. We present a local algorithm and demonstrate that it meets the criteria for algorithms that operationalize a sociological concept of “topic” because it constructs subgraphs whose properties correspond to properties of the theoretical concept. While these subgraphs cannot be considered as topics without further extensive evaluation, the algorithm proves to be a useful tool for the exploration of publication networks.
We start from a theoretical discussion of the concept “topic” and show that the sociological research tradition on thematic structures in science implies that topics are constructed by the researchers working on them and are thus fluid and overlap. From these and other properties it follows that subgraphs of publication networks that represent topics should be constructed using local information, should be cohesive, and should overlap. Summarizing a previous extensive analysis (Held, 2022), we argue that such subgraphs cannot be constructed by global algorithms. Not surprisingly, established bibliometric approaches to topic reconstruction, which all utilize this type of algorithm, proved unsuccessful when applied to “ground truths” that were defined by scientists (Held, Laudel, & Gläser, 2021).
We operationalize this concept of topics by using two data models that partially represent their properties and by developing a local algorithm for the reconstruction of topics. Local algorithms—algorithms that grow clusters from seed subgraphs until a condition for their termination is met—appear to be a promising solution to the tension between theoretically derived properties of topics and the common output of currently used algorithms. Some local algorithms use quality functions that maximize cohesion, which corresponds to the sociological understanding of topics as shared perspectives of researchers.
The algorithm we present in this paper is a Multilayered, Adjustable, Local Bibliometric Algorithm (MALBA), which utilizes ideas from existing local algorithms and implements them for bibliometric purposes. We explore its behavior by expanding seed graphs in two different fields, namely bibliometrics and atomic and molecular optics in physics (AMO physics). The latter seeds correspond to the ground truths we previously constructed for the test of global algorithms (Held et al., 2021). Initial results reveal interesting stable patterns in bibliometric networks which are easily reconstructed but whose correspondence to topics remains to be established. Regardless of this outstanding validation, the local approach raises interesting questions and opens new avenues for research into structures of publication networks.
The paper is organized as follows. Section 2 provides a sociological definition of topics and explains how it is operationalized by algorithms. Section 3 reviews existing local algorithms, compares their potential with that of global algorithms for operationalizing the definition, and presents our new algorithm. We then describe our experiments (Section 4), followed by their results (Section 5), a discussion (Section 6), and conclusions (Section 7).
2. A SOCIOLOGICAL DEFINITION OF TOPICS AND ITS OPERATIONALIZATION BY ALGORITHMS
An important methodological starting point of our search for approaches to topic reconstruction is the premise that these approaches, as procedures of empirical identification, operationalize a concept. If this concept and the way in which it is operationalized are not made explicit, operationalization still occurs but is applied to an implicit concept, which most likely is an everyday notion that generalizes bibliometricians’ experiences with topics in their own field. If no theoretical concept is operationalized, the topic reconstruction exercise is decoupled from theory and cannot contribute to research purposes.
While there are few well-defined sociological concepts ready for bibliometric operationalization, some definitions can be derived from the literature. For the concept “topic,” we follow Havemann, Gläser, and Heinz (2017, p. 1091) in defining a topic as “a focus on theoretical, methodological or empirical knowledge that is shared by a number of researchers and thereby provides these researchers with a joint frame of reference for the formulation of problems, the selection of methods or objects, the organization of empirical data, or the interpretation of data.” The researchers who share such a frame of reference form a scientific specialty or scientific community: that is, a collective that jointly advances the shared knowledge and has a collective identity (self-perception) of jointly advancing that knowledge (Gläser, 2019, pp. 421–423; Whitley, 2000).
This joint activity is based on dense communication (Kuhn, 2012 [1962], pp. 19, 23, 177) because community members offer each other contributions for further use in publications and frequently discuss their work in formal and informal settings. It is also based on, and strengthens, the thematic similarity of community members’ work because they work with and contribute to the same body of knowledge (Gläser, 2006). As publications commonly contain several knowledge claims, which are likely to address different topics (Cozzens, 1985; Amsterdamska & Leydesdorff, 1989), it is likely that researchers work on more than one topic, and that topics are likely to overlap in publications. Finally, from the definition of a topic as a joint frame of reference, it follows that a topic is first and foremost a topic to those who work on it. The insider perspective of what constitutes a topic is likely to deviate from outsider perspectives (i.e., perspectives of colleagues from other communities).
These properties of topics, which can be derived from theoretical considerations and have been confirmed empirically, should correspond to properties of subgraphs in publication networks if these subgraphs are meant to represent topics. The operationalization of properties of topics by bibliometric algorithms is determined by the latter’s use of information and by the properties of subgraphs they identify as communities in networks (Table 1).
Theoretical properties | Operationalization by algorithms |
---|---|
Dense communication and thematic similarity | Prioritize cohesion of subgraphs |
Shared focus of researchers | Prioritize local information; allow for variable size of topics |
Overlap | Allow for pervasive overlaps of subgraphs |
Variation in structures | Allow for structural variation in subgraphs |
Dense communication and thematic similarity can be operationalized by algorithms prioritizing the cohesion of subgraphs as a criterion for identifying them as communities. The definition of a topic as a shared focus of researchers suggests the importance of an insider perspective on a topic, which can be operationalized by prioritizing the use of local information (information about a subgraph and its environment) for its delineation (Held, 2022). The same property also suggests that topics may have different sizes depending on the number of researchers who share the focus, which can be operationalized by algorithms allowing for different sizes of topics. The overlap of topics can be operationalized by algorithms producing communities that overlap pervasively (i.e., not only in their boundary nodes), and structural variation of topics can be operationalized by algorithms producing communities with different structures or at least not prioritizing a particular structure of a community.
3. A LOCAL DENSITY-MAXIMIZING ALGORITHM FOR THE ANALYSIS OF BIBLIOMETRIC NETWORKS
3.1. State of the Art
Both frequent communication and thematic similarity can be operationalized as finding dense subgraphs in publication networks. The general problem of discovering dense subgraphs in networks has been taken up by different branches of science, with different perspectives on the problem. Computer science has been working on a problem named dense subgraph discovery for decades (Charikar, 2000; Deng & Xiang, 2022; Khuller & Saha, 2009; Lee, Ruan et al., 2010). The first important applications included the search for highly connected subgraphs in the World Wide Web, which were considered as “web communities” suggesting the existence of a common “topic” (Gibson, Kleinberg, & Raghavan, 1998). Algorithms for the efficient extraction of these subgraphs, such as various (relaxed) versions of cliques, have been developed (Lee et al., 2010, p. 311). These algorithms differed from each other in the definitions of density they applied, which were based on the degrees of a subgraph’s nodes (e.g., high average degree), on a definition of density as a (high) number of edges divided by the number of nodes, or on a distance measure of how easily nodes can reach each other (Lee et al., 2010). Because these algorithms reconstruct dense network structures, their possible fields of application include social studies of cliques (Kegen, 2015). They are also interesting algorithms for bibliometric purposes, not least because the cliques in networks to which the algorithms are applied are known to overlap (Palla, Derényi et al., 2005). However, as their approach to density maximization searches for very specific network structures, they are limited in their ability to allow for structural variability of subgraphs. Furthermore, their approach is global, because they typically use global network statistics to identify dense or the densest subgraphs (Tsourakakis, Bonchi et al., 2013). This also holds for algorithms that find “local dense subgraphs” near seed nodes (Deng & Xiang, 2022), because here, too, statistics from the entire graph are taken into consideration to determine the subgraph.
Another approach to measuring density in subgraph structure is implemented by community1 detection algorithms developed in network science. Typically, these algorithms implement a notion of the classical idea of a network community, having “more edges ‘inside’ […] than edges linking vertices […] with the rest of the graph” (Fortunato, 2010). This definition includes a more relaxed notion of density, which leaves much room for interpretation, an ambiguity that led to different algorithmic implementations of the general idea. Two groups of community detection algorithms can be distinguished, namely global and local algorithms2. Global community detection algorithms (GCDAs) use statistics of the entire network, and often construct a partition of the entire network composed of network communities. When these algorithms partition a network, the problem of having many edges inside the community (cohesion) and few between the community and its environment (separation) requires striking a compromise, because in a partition neither cohesion nor separation can be optimized individually (Fortunato & Hric, 2016). This task of creating a partition is solved by many GCDAs by constructing well-separated communities (Held, 2022).
In contrast to the GCDA approach that partitions the whole network, local community detection algorithms (LCDAs) start from a seed in the network and explore its immediate surrounding to grow a community around it. This allows for a truly local community definition, which is in accordance with the local property of topics, as well as overlapping communities. LCDAs are easily able to maximize either separation or cohesion of a subgraph independently because they disregard the network beyond the immediate environment of a subgraph. Thus, LCDAs can either apply the general idea of community detection by identifying dense network regions surrounding the seed that are well separated from the rest of the network or they can be used to identify particularly dense or cohesive network regions around a seed without considering their separation. A review of LCDAs showed, however, that not all LCDAs can be considered local to the same degree, and that other properties render them more or less suitable for the task of reconstructing topics (Baltsou, Christopoulos, & Tsichlas, 2022)3. For example, most LCDAs evaluate a community’s quality by its separation from its environment, which is frequently measured as conductance (outward edges divided by volume; Hamann, Röhrs, & Wagner, 2017) or local modularity (Clauset, 2005, p. 2). Only a few local algorithms maximize cohesion. These include, among others:
the Local Tightness Expansion algorithm (LTE), which grows the subgraph by adding nodes that increase the subgraph’s “tightness” (number of shared neighbors of nodes inside the subgraph compared to number of neighbors of nodes inside and outside of the subgraph) and also uses “tightness” as a termination criterion (Huang, Chen et al., 2021); and
the Triangle Based Community Expansion (TCE) algorithm (Hamann et al., 2017), which adds nodes when they have a large share of triangular relationships with the subgraph compared to the nodes’ degree, but uses a separation-oriented criterion (conductance) for termination.
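The separation-oriented criterion of conductance mentioned above can be illustrated with a minimal sketch. It follows the description given in the text (outward edges divided by the volume of the subgraph, where volume is the sum of the degrees of its nodes) and assumes an undirected network stored as an adjacency dictionary. The function name and the toy network are hypothetical and are not taken from any of the cited implementations. Note that many definitions divide by the smaller of the two volumes (subgraph and rest of the network), which coincides with this version whenever the subgraph is the smaller side.

```python
from typing import Dict, Set


def conductance(subgraph: Set[str], adjacency: Dict[str, Set[str]]) -> float:
    """Outward edges of the subgraph divided by its volume (sum of node degrees)."""
    # Edges that leave the subgraph: one endpoint inside, the other outside.
    cut = sum(1 for u in subgraph for v in adjacency[u] if v not in subgraph)
    # Volume: total degree of all nodes in the subgraph.
    volume = sum(len(adjacency[u]) for u in subgraph)
    return cut / volume if volume else 0.0


# Toy network (illustrative only): a triangle a-b-c with a tail c-d-e.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
print(conductance({"a", "b", "c"}, adjacency))
```

A low value indicates a well-separated subgraph. A cohesion-oriented criterion such as tightness would instead reward links inside the subgraph without penalizing links to its environment.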
The general idea to expand a paper set using citation relations is well known in bibliometrics and information retrieval. Applications include the notion of a “citation viewpoint” from the perspective of an initial paper set considering outgoing citations (“backward expansion”) or incoming citations (“forward expansion”) (Chen, 2018; Chen, Lin, & Zhu, 2006), Garfield’s “algorithmic history” (Garfield, Pudovkin, & Istomin, 2003), and the process of seed expansion in the context of information retrieval (Zitt & Bassecoulard, 2006; Zitt, Lelu et al., 2019, pp. 55–58). However, no dedicated algorithm has yet been developed that uses these ideas for the analysis of publication networks.
Another idea that is relevant for the operationalization of the topic concept is to extend the community detection approach to multilayered networks. This idea can be used with both GCDAs and LCDAs. It needs more attention because, as indicated by Held et al. (2021), one data model might not suffice to reconstruct all types of topics in a publication network. Various global and local multilayered approaches have been developed in the past years (Huang et al., 2021). Multilayered approaches have been used for topic reconstruction in “hybrid” approaches that combine reference-based and lexical data models (Thijs, 2019, pp. 221–222).
3.2. Comparing the Potential of Global and Local Algorithms for Operationalizing the Definition of Topics
GCDAs are very popular in bibliometrics because they efficiently partition an entire network. They have also been frequently applied in topic reconstruction exercises (e.g., Sjögårde & Ahlgren, 2018; Velden, Boyack et al., 2017). However, their ability to operationalize the properties of topics identified in the previous section is rather limited, and they compare poorly to LCDAs in this respect (Table 2). Held (2022) has shown that the global orientation of GCDAs can come at the price of identifying noncohesive structures as communities. GCDAs have further deficiencies regarding the operationalization of topic reconstruction, including the fact that all nodes are treated equally in the sense that each must be allocated to a community (Held, 2022, p. 1074), which is something that was recently questioned by Park, Tabatabaee et al. (2023), and that most of them produce disjoint clusters. GCDAs used in bibliometrics (and most GCDAs in general) do not prioritize the cohesion of subgraphs or local information. While they allow for a variable size of topics, their ability to produce overlaps is limited to the boundary nodes of clusters. Structural variance is possible but is a by-product rather than something that can be controlled.
Operationalization demands | GCDAs | LCDAs |
---|---|---|
Prioritize cohesion of subgraphs | No: cohesion is not a priority | Yes: possible and implemented in some algorithms |
Prioritize local information | No: global information is prioritized | Yes: only local information is used |
Allow for variable size of topics | Yes: community sizes vary | Yes: community sizes vary |
Allow for pervasive overlaps of subgraphs | No: overlaps, if any, only in boundary nodes | Yes: pervasive overlaps because subgraphs are grown independently of each other |
Allow for structural variation in subgraphs | Yes: but structural variance is a by-product | Yes: variability through local structural exploration |
In contrast, LCDAs may meet all the demands of operationalization. While many LCDAs also prioritize separation over cohesion, some of them do not, which indicates that the prioritization of cohesion can be easily achieved. By definition, all LCDAs use only local information in the construction of subgraphs. LCDAs also allow for varying community sizes. Pervasive overlaps are possible because the subgraphs are grown independently of each other. For the same reason, structural variance can be accommodated by letting different subgraphs grow differently for structural exploration.
3.3. The MALBA Algorithm
We present for consideration MALBA, which is inspired by the LTE algorithm and the idea of algorithms applied to multilayer data models. MALBA constructs cohesive communities in networks of papers by iteratively growing a subgraph from a seed (i.e., it operates locally). It can reconstruct overlapping subgraphs because each subgraph is grown independently of all others.
MALBA operates in a multilayered network and adds publications to the subgraph if they are densely connected in at least one of the two data models “direct citation” or “bibliographic coupling” (Figure 1). Thus, both the density of communication (direct citation, DC) and the density of thematic similarity between papers (bibliographic coupling, BC) are considered in the exploration. Dense communication between researchers working on a topic has been identified as one of its properties (see Section 2). This includes both informal communication, which is difficult to operationalize bibliometrically, and formal communication through publications. Citations can be assumed to represent communication channels, at least in the statistical aggregate. Therefore, dense communication should be reflected in above-average subgraph cohesion in DC networks even though publications cover only part of a community’s communication. BC, usually measured as Salton’s cosine of shared references between two papers, is the most common measure of the thematic similarity of two papers (Glänzel & Czerwon, 1996, p. 198). The reasoning behind this is that two papers A and B that are bibliographically coupled draw on the same referenced papers as sources of knowledge. While this need not always be the case, because different knowledge claims from the same source can be used by the citing papers (Martyn, 1964), the argument can again be made that in the statistical aggregate, bibliographic coupling is likely to indicate thematic similarity. Our algorithm is, to our knowledge, the first that combines subgraph expansion steps that use different bibliometric data models. Its use with only one data model is of course possible.
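As an illustration of the pairwise BC measure mentioned above, Salton’s cosine for two reference lists treated as sets is the number of shared references divided by the geometric mean of the two list lengths. The following minimal sketch assumes that references are identified by unique strings; the function name and the example identifiers are hypothetical.

```python
import math
from typing import Set


def salton_cosine(refs_a: Set[str], refs_b: Set[str]) -> float:
    """Salton's cosine of two reference sets: |A & B| / sqrt(|A| * |B|)."""
    if not refs_a or not refs_b:
        return 0.0  # a paper without references cannot be coupled
    return len(refs_a & refs_b) / math.sqrt(len(refs_a) * len(refs_b))


# Paper A cites 4 items, paper B cites 9 items, and 3 items are shared.
paper_a = {"r1", "r2", "r3", "r4"}
paper_b = {"r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10"}
print(salton_cosine(paper_a, paper_b))  # 3 / sqrt(4 * 9) = 0.5
```

The normalization keeps papers with long reference lists from appearing similar to everything merely because they cite a lot.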
The local expansion in the DC and BC networks enables the identification of both the common knowledge base of the subgraph (via “backward expansion”/“DCout”) and of those papers which use the subgraph as knowledge base (via “forward expansion”/“DCin”). MALBA can also be applied to other (combinations of) data models (e.g., to Small’s combined linkage: Small, 1997). In this paper, only DC and BC links are considered.
The community definition of MALBA is as follows: A community is a subgraph of a publication network whose growth terminated because no publications can be added that meet the preset cohesion criteria in the chosen data models. The separation of the subgraph from its neighborhood is considered only collaterally because papers that are not connected well enough to be included are in turn better separated from the subgraph.
The thresholds for the density of connections (DCin, DCout, and BC) are adjustable by the user:
DCin threshold (fraction): share of references of a citing publication that are in subgraph
DCout threshold (number): number of times x that a reference is cited by publications in the subgraph
BC threshold (fraction): share of references of a citing publication which overlap with subgraph’s references4
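The three criteria can be read as simple set operations on reference lists. The sketch below is one possible reading of the definitions above, not the authors’ implementation: it assumes that the network is given as a dictionary mapping each publication to the set of publications it cites, and all function and variable names are hypothetical.

```python
from typing import Dict, Set

Refs = Dict[str, Set[str]]  # publication id -> ids of the publications it cites


def dc_in_share(pub: str, subgraph: Set[str], refs: Refs) -> float:
    """DCin: share of the candidate's references that are subgraph members."""
    cited = refs.get(pub, set())
    return len(cited & subgraph) / len(cited) if cited else 0.0


def dc_out_count(pub: str, subgraph: Set[str], refs: Refs) -> int:
    """DCout: number of subgraph members that cite the candidate."""
    return sum(1 for member in subgraph if pub in refs.get(member, set()))


def bc_share(pub: str, subgraph: Set[str], refs: Refs) -> float:
    """BC: share of the candidate's references also cited by the subgraph."""
    cited = refs.get(pub, set())
    if not cited:
        return 0.0
    subgraph_refs = set().union(*(refs.get(m, set()) for m in subgraph))
    return len(cited & subgraph_refs) / len(cited)


# Toy data (illustrative only).
refs = {"a": {"x", "y"}, "b": {"x", "z"}, "c": {"a", "b", "x", "q"}}
subgraph = {"a", "b"}
print(dc_in_share("c", subgraph, refs))   # 2 of 4 references are in the subgraph
print(dc_out_count("x", subgraph, refs))  # cited by both subgraph members
print(bc_share("c", subgraph, refs))      # 1 of 4 references is shared
```

A candidate would be added when the respective value reaches the user-defined threshold for that step.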
Higher thresholds focus on reconstructing denser regions, while lower thresholds enable the exploration of less dense regions of a network. The criteria for adding publications do not include any measures of the cohesion of the subgraph. Therefore, the growth of the subgraph does not necessarily increase its cohesion, which means that structures of subgraphs vary depending on the seeds chosen. The current implementation of the algorithm was written in Java (clojure) by Bastian Steudel5.
In Figure 3, the first cycle of three steps of MALBA according to Figure 1 is depicted for the h-index seed of seven publications (see Section 4.2). The left and right columns symbolize operations in the direct citation and bibliographic coupling layers, respectively. The example shows the growing subgraph (blue) and the nodes connected to it in the two data models (gray). MALBA starts the first cycle by evaluating the DC layer and in this example searches for publications at or above the DCin threshold of 0.55 (none are found), then searches for publications at or above the BC threshold of 0.95, of which four are found and added (green). In the last step of the first cycle, DCout, one publication that is cited at least 11 times by the subgraph is found and added (green): the original Hirsch paper of 2005.
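The cycle described in this example can be sketched as a loop that repeats the three steps until a complete cycle adds no publication. This is a simplified illustration under several assumptions, not the authors’ implementation: the step order (DCin, BC, DCout) and the immediate addition of publications after each step are taken from the first-cycle description above, the network is a hypothetical dictionary mapping each publication to the set of publications it cites, and the default parameter values merely echo the thresholds of the h-index example and the minimum of eight references for BC expansion.

```python
from typing import Dict, Set


def grow_subgraph(seed: Set[str], refs: Dict[str, Set[str]],
                  dc_in: float = 0.55, bc: float = 0.95,
                  dc_out: int = 11, min_refs: int = 8) -> Set[str]:
    """Repeat the cycle DCin -> BC -> DCout until a full cycle adds nothing."""
    subgraph = set(seed)
    while True:
        size_before = len(subgraph)

        # Step 1 (DCin): citing publications with a high share of their
        # references inside the subgraph.
        subgraph |= {p for p, cited in refs.items()
                     if p not in subgraph and cited
                     and len(cited & subgraph) / len(cited) >= dc_in}

        # Step 2 (BC): publications with enough references, a high share of
        # which is also cited by the subgraph.
        subgraph_refs = set().union(*(refs.get(m, set()) for m in subgraph))
        subgraph |= {p for p, cited in refs.items()
                     if p not in subgraph and len(cited) >= min_refs
                     and len(cited & subgraph_refs) / len(cited) >= bc}

        # Step 3 (DCout): references cited by at least dc_out subgraph members.
        counts: Dict[str, int] = {}
        for member in subgraph:
            for ref in refs.get(member, set()):
                counts[ref] = counts.get(ref, 0) + 1
        subgraph |= {r for r, n in counts.items() if n >= dc_out}

        if len(subgraph) == size_before:  # termination: nothing was added
            return subgraph


# Toy network with deliberately low thresholds (illustrative only).
refs = {
    "s1": {"h", "a"}, "s2": {"h", "b"},
    "c1": {"s1", "s2"}, "c2": {"h", "a", "b", "z"},
    "far": {"z", "y"},
}
print(sorted(grow_subgraph({"s1", "s2"}, refs,
                           dc_in=0.5, bc=0.75, dc_out=2, min_refs=3)))
```

In the toy run, the publication `far` is never added because it shares too few references with the growing subgraph, which illustrates how termination follows from the cohesion criteria alone.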
The interface of MALBA allows two modes of applying the algorithm. In one mode, it can be applied to publication networks derived from databases by growing seeds selected from these networks. Here, MALBA can support the exploration of networks by identifying dense regions. In this mode, the BC links for the BC expansion step are calculated not directly from a separate BC network, but from the provided DC network. Thus, likely not all of a publication’s references are considered, only those from the designated DC network.
Alternatively, the algorithm can be used to explore a publication database (which must contain links between publications and their cited sources) directly by starting from a seed and searching the database for densely connected publications. In this case, MALBA utilizes all information about a subgraph’s environment that exists in the database but provides less information about less well-connected publications. Because databases typically store only a DC network to save storage, the BC links are calculated in an analogous manner. In this paper, and in the current version of MALBA, only references that are source items in the database are considered. In fields with low coverage (i.e., many nonsource references), this can increase the likelihood that the remaining source references, which are responsible for the BC links to the subgraph, lead to the addition of publications not belonging to the subgraph’s topic. Even though including nonsource items is an option that can be implemented in MALBA, we decided to exclude them completely from the calculations, because we do not know enough about their role regarding dense communication and thematic similarity.
The user can affect the operation of MALBA in four ways:
By deciding to work with a dedicated network or to explore a database.
By choosing the seed subgraph that serves as starting point for MALBA. The seed has a strong influence on the subgraph through its size, through the region of the network in which it is located, and through its internal structure. Publications belonging to the seed must be bibliographically (well-) coupled or must cite each other.
By deciding on the thresholds. The interface offers the option of automatically identifying the thresholds for DC and BC that return the largest subgraph that can be grown out of the seed with the algorithm terminating. However, the user can also set thresholds manually to achieve an earlier termination of the algorithm; or can set lower thresholds to explore a less dense region of the network (in which case the user must manually terminate the algorithm). Higher thresholds focus on reconstructing denser regions, while lower thresholds also allow the reconstruction of less dense regions. An additional threshold is the number of references a publication needs to have to be considered in the BC expansion. A threshold of eight is currently implemented to avoid the inclusion of publications with few references that are all shared with the subgraph.
By implementing other data models (e.g., combined linkage).
4. EXPERIMENTS
4.1. Strategy
Proposing a new type of algorithm for the exploration of publication networks in bibliometrics confronts us with a general challenge: How can our algorithm be validated? The challenge occurs because
there is no established standard of validation for algorithms in bibliometrics (neither for validating the algorithms nor for validating their outcomes);
we do not know any other algorithm for the exploration of publication networks;
the only other LCDA (proposed by Havemann et al. (2017)) has only been used by the authors proposing it, and has only been used twice;
developing a strategy for a comparison with other algorithms used in bibliometrics is not straightforward at all because GCDAs and LCDAs operate differently; which means
a comparison with the commonly used global algorithms in bibliometrics on the same ground truth would disadvantage one of the algorithms depending on whether the ground truth is used as the seed or not (see below).
To illustrate the behavior of our algorithm compared to a commonly used global algorithm, experiment 1 applies both MALBA and the Leiden algorithm to a small bibliometric network. In experiment 2, the h-index topic was chosen because it makes it easier to understand publications included in the subgraph and in its environment. In the results we discuss the reasons why publications may be included or excluded, the impact of seed sizes, and the impact of thresholds. Experiment 3 is based on a previous study of Held et al. (2021), where the validity of two commonly applied global algorithms for topic reconstruction was assessed by identifying several ground truths for topics in a network of papers and testing whether they were reconstructed by the algorithms’ network partitions. However, we cannot directly compare our results with Held et al. (2021) for two reasons. On the one hand, a comparison would systematically disadvantage the global algorithm if a ground truth were to be used both as seed for MALBA and as evaluation criterion, and would systematically disadvantage MALBA if a ground truth were to be excluded from the seed and MALBA would be expected to reconstruct it. On the other hand, a comparison with global algorithms is impossible because MALBA cannot fully cover a network, with the trivial exception of low thresholds leading to one large community covering the whole network.
To nevertheless facilitate a comparison with the bibliometric GCDAs assessed in Held et al. (2021), the same ground truths are used for evaluating MALBA’s results. Instead of testing whether ground truths are recovered by MALBA, we grow communities from individual-level ground truths of the same topic and test for overlaps of communities. If perspectives of researchers are shared, as the definition of “topic” demands, communities grown from independent seeds belonging to this topic should substantially overlap. Hence, in experiment 3 we use mutually exclusive ground truths of a topic in AMO physics (see Held et al. (2021) for details) as seeds, let MALBA grow their surrounding cohesive subgraphs, and assess their mutual overlaps. A substantial overlap would support the claim that the researchers work on the same topic and provide an indication of its reconstruction.
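The overlap test can be sketched as a pairwise comparison of the communities grown from different seeds. The text does not commit to a particular overlap measure at this point, so the two measures used below (Jaccard index and the shared publications as a share of the smaller community) are assumptions chosen for illustration, as are the function name and the toy communities.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def pairwise_overlaps(communities: Dict[str, Set[int]]
                      ) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Compare every pair of communities grown from different seeds."""
    result = {}
    for name_a, name_b in combinations(sorted(communities), 2):
        a, b = communities[name_a], communities[name_b]
        shared = len(a & b)
        result[(name_a, name_b)] = {
            "shared": shared,
            # Jaccard index: shared publications relative to the union.
            "jaccard": shared / len(a | b) if a | b else 0.0,
            # Shared publications relative to the smaller community.
            "share_of_smaller": shared / min(len(a), len(b)) if a and b else 0.0,
        }
    return result


# Toy communities (illustrative only), keyed by the seed they were grown from.
communities = {
    "seed_A": {1, 2, 3, 4},
    "seed_B": {3, 4, 5, 6},
    "seed_C": {7, 8},
}
for pair, values in pairwise_overlaps(communities).items():
    print(pair, values)
```

Reporting the share of the smaller community alongside the Jaccard index helps when communities differ strongly in size, because a small community fully contained in a large one has a low Jaccard index despite complete overlap.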
4.2. Data and Methods
MALBA can be applied either to a database that contains up-to-date information on a set of publications (online, experiment 2) or to specially constructed publication networks (offline, experiments 1 and 3). For experiment 1, we use the direct citation network of a biomedical review paper (Sharma, Suk, & Kim, 2021) and its references. These 147 publications are clustered by the Leiden algorithm (using CPM as the quality function and resolution values of 0.05 and 0.007). For comparison, a subset of two publications from these 147 publications is used as the seed for MALBA. Here, only DC expansion is used to run MALBA until termination to produce results that are comparable to those of the Leiden algorithm.
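To make the role of the resolution value more concrete, the following minimal sketch computes the CPM quality function in its commonly stated form for unweighted, undirected networks: the sum over clusters of the number of internal edges minus the resolution times the number of possible node pairs in the cluster. The toy network, the two partitions, and the resolution values 0.05 and 0.5 are assumptions chosen for illustration. They are not the 147-publication network or the settings of experiment 1, and the sketch only evaluates given partitions rather than running the Leiden optimization.

```python
from collections import Counter

def cpm_quality(edges, membership, resolution):
    """CPM quality: sum over clusters of (internal edges - resolution * n_c*(n_c-1)/2)."""
    sizes = Counter(membership.values())      # cluster id -> number of nodes
    internal = Counter()                      # cluster id -> number of internal edges
    for u, v in edges:
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] - resolution * n * (n - 1) / 2 for c, n in sizes.items())

# Hypothetical toy network: a triangle (a, b, c), one weakly linked paper (d),
# and one isolated paper (e).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "a")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2}   # dense core plus two singletons
merged = {node: 0 for node in split}               # everything in one cluster

for gamma in (0.05, 0.5):
    print(gamma, cpm_quality(edges, split, gamma), cpm_quality(edges, merged, gamma))
```

In this toy case the merged partition scores higher at the lower resolution and the split partition scores higher at the higher resolution, which is the general tendency of CPM: higher resolution values favor smaller, denser clusters.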
In experiment 2, the algorithm starts from a seed set of the seven most highly cited publications in the WoS that have “h-index” in their title (Table 3) and searches for densely connected publications in the stable version of the bibliometric database of the Competence Centre for Bibliometrics (version January 2023, https://bibliometrie.info/).
# | Seed publications | Citations |
---|---|---|
1 | Jin et al. (2007). The R-and AR-indices: Complementing the h-index. Chinese Science Bulletin, 52(6), 855–863. | 438 |
2 | Bornmann et al. (2008). Are there better indices for evaluation purposes than the h index? […] JASIST, 59(5), 830–837. | 338 |
3 | Alonso et al. (2009). h-Index: A review focused in its variants, computation […] fields. Journal of Informetrics, 3(4), 273–289. | 538 |
4 | Bornmann/Daniel (2007). What do we know about the h index? JASIST, 58(9), 1381–1385. | 345 |
5 | Hirsch (2007). Does the h index have predictive power? PNAS, 104(49), 19193–19198. | 640 |
6 | Costas/Bordons (2007). The h-index: Advantages, limitations […]. Journal of Informetrics, 1(3), 193–203. | 309 |
7 | Bar-Ilan (2008). Which h-index?—A comparison of WoS, Scopus and Google Scholar. Scientometrics, 74(2), 257–271. | 446 |
In the third experiment, we use an existing data set of publications in atomic, molecular and optical (AMO) physics, which has been used in a previous study (Held et al., 2021). It consists of 96,137 publications and spans the years 1990–2005. The previous study investigated the reconstruction of individual-level ground truths for topics using different clusterings of global algorithms of this field data set. The same individual-level ground truths will also be used in this study for the seeds to grow the publication sets on the AMO physics data set. This includes publications by 10 researchers on the topic of Bose-Einstein Condensation (BEC), which these researchers confirmed to be BEC publications in interviews. These mutually exclusive publication sets vary in size between nine and 34 publications (see Table 4 for details of the seeds).
Researcher | Seed size | Median of seed’s publications’ citations | Seed’s average degree (DC) | Seed’s average degree (BC) | Thresholds used DCout/BC/DCin | Terminal subgraph size |
---|---|---|---|---|---|---|
BEC Researcher 4 | 13 publications | 11 | 37 | 1,815 | 12/0.95/0.60 | 79 |
BEC Researcher 8 | 15 publications | 7 | 22 | 1,526 | 6/0.86/0.84 | 159 |
BEC Researcher 6 | 9 publications | 163 | 360 | 3,030 | 11/0.98/0.81 | 777 |
BEC Researcher 5 | 12 publications | 12 | 35 | 1,000 | 5/0.94/0.94 | 935 |
BEC Researcher 9 | 34 publications | 16 | 42 | 1,473 | 6/0.92/0.97 | 1,036 |
BEC Researcher 7 | 15 publications | 7 | 21 | 1,004 | 5/0.98/0.89 | 1,765 |
BEC Researcher 1 | 9 publications | 12 | 44 | 2,372 | 5/0.94/0.94 | 1,949 |
BEC Researcher 3 | 13 publications | 42 | 75 | 2,136 | 5/0.94/0.94 | 1,980 |
BEC Researcher 2 | 24 publications | 18 | 85 | 2,189 | 8/0.97/0.86 | 2,007 |
BEC Researcher 10 | 20 publications | 15 | 39 | 1,461 | 5/0.94/0.94 | 2,270 |
The seeds were used for experiments with MALBA using the following parameters:
Maximum subgraph size: 3,000
Maximum parents (see Note 6): 3,000
Minimum number of references for BC calculation: 8
5. RESULTS
In all experiments, one of three types of growth dynamics of a subgraph could be observed (see typical growth curves in Figure 4). Depending on the thresholds used, MALBA:
converges rapidly (rapid convergence, typically in fewer than five iterations);
converges smoothly (smooth convergence, after ∼5–15 iterations); or
grows exponentially without any convergence (exponential growth).
We performed initial experiments with the AMO data set on how excluding high-degree nodes influences the tendency toward exponential growth. Excluding high-degree nodes (highly cited publications and publications with more than 100 references) from the network reduces the probability of exponential growth of the subgraph. For example, when reviews with many references are added to the subgraph, this enlarges the pool of references and often triggers exponential growth. Very highly cited papers show a similar pattern. Excluding such papers enables the use of lower thresholds, which in turn may lead to larger terminal subgraphs. The experiments with the AMO data set indicate that this is a viable option. As there are arguments for and against the relevance of such high-degree publications for a topic’s communication, decisions about their exclusion must assess the respective epistemic context.
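The exclusion of high-degree nodes can be sketched as a simple preprocessing step. The function below is an illustration under stated assumptions, not MALBA's code: the network is assumed to be a dictionary mapping each publication to the set of publications it cites, the limit of 100 references follows the text, and the optional citation limit `max_citations` is a hypothetical parameter because the text gives no cutoff for "highly cited".

```python
from collections import Counter

def exclude_high_degree(references, max_refs=100, max_citations=None):
    """Drop publications with too many references (or citations) from the network."""
    # Count how often each publication is cited within the network.
    citations = Counter(r for refs in references.values() for r in refs)
    keep = {
        pub for pub, refs in references.items()
        if len(refs) <= max_refs
        and (max_citations is None or citations[pub] <= max_citations)
    }
    # Remove links that point to excluded publications as well.
    return {pub: references[pub] & keep for pub in keep}

# Hypothetical toy network: one review with 101 references and one regular paper.
cited = {f"r{i}": set() for i in range(101)}
network = {"review": set(cited), "paper": {"r0", "review"}, **cited}

filtered = exclude_high_degree(network)
print("review" in filtered, filtered["paper"])
```

Removing the review also removes the link from the regular paper to it, so the review's long reference list can no longer enlarge the reference pool of a growing subgraph.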
The existence of smooth convergence is an interesting finding in itself. Passing the thresholds becomes easier during the expansion of a subgraph because the number of subgraph members that may be cited by or may cite a publication outside the subgraph grows, as does the number of references in the subgraph. This suggests that a subgraph should always grow exponentially. However, empirically we find a small threshold range in which the growth of the subgraph terminates. This indicates interesting patterns in bibliometric network structures: in particular, strong variations in the density of bibliometric networks. Figure 4 plots an example using a seed from experiment 3, where small variations of the thresholds change smooth convergence to exponential growth or rapid convergence, respectively. Furthermore, the plot exemplifies the possibility of using a terminal subgraph again as the seed (with slightly altered thresholds) for a second run of MALBA to obtain a larger terminal subgraph.
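The growth dynamics described above can be illustrated with a strongly simplified, threshold-based growth loop. This is a sketch under assumptions and not the MALBA implementation: the network is a dictionary mapping each publication to the set of publications it cites, all non-members are evaluated in every iteration, the function name `grow_subgraph` and the toy data are hypothetical, and the toy thresholds (including a minimum of two references for BC instead of eight) are chosen only so that the small example terminates.

```python
def grow_subgraph(references, seed, dc_in=0.6, bc=0.95, dc_out=2,
                  min_refs_bc=2, max_size=3000):
    """Add publications that pass at least one threshold until nothing is added."""
    subgraph = set(seed)
    history = [len(subgraph)]                 # subgraph size after each iteration
    while len(subgraph) < max_size:
        # Pool of all references of current members (used for the BC share).
        pool = set().union(*(references[m] for m in subgraph))
        accepted = set()
        for pub, refs in references.items():
            if pub in subgraph:
                continue
            # DCout: number of subgraph members citing the candidate.
            cited_by = sum(1 for m in subgraph if pub in references[m])
            # DCin: share of the candidate's references that are members.
            share_in = len(refs & subgraph) / len(refs) if refs else 0.0
            # BC: share of the candidate's references shared with members.
            share_bc = (len(refs & pool) / len(refs)
                        if len(refs) >= min_refs_bc else 0.0)
            if share_in >= dc_in or share_bc >= bc or cited_by >= dc_out:
                accepted.add(pub)
        if not accepted:                      # termination: no candidate passes
            break
        subgraph |= accepted
        history.append(len(subgraph))
    return subgraph, history

# Hypothetical toy network (publication -> set of cited publications).
references = {
    "s1": {"k1", "k2"}, "s2": {"k1", "k2"},
    "p1": {"s1", "s2"}, "p2": {"s1", "p1", "x1"},
    "q1": {"x1", "x2", "x3"},
    "k1": set(), "k2": set(), "x1": set(), "x2": set(), "x3": set(),
}
terminal, sizes = grow_subgraph(references, {"s1", "s2"})
print(sorted(terminal), sizes)
```

In the toy run, "p2" fails all thresholds in the first iteration but passes the DCin threshold once "p1" has joined the subgraph. This mirrors the observation that thresholds become easier to pass as the subgraph expands, while growth still terminates when the remaining neighborhood is too sparsely linked.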
Regarding the three thresholds, we found the following general results:
Setting DCout too high diminishes the possibility of detecting the knowledge base of the subgraph. If the subgraph does not yet have a large-enough knowledge base (in the form of references), high thresholds for DCout will likely prevent the subgraph from growing with any of the expansion steps. Smooth convergence leads to different terminal subgraphs when this parameter is varied within a certain range. When the automatic parameter search of MALBA for the largest subgraph finds a high value for DCout (e.g., >10), this suggests that the environment of the subgraph includes many highly cited references, which would fuel the exponential growth of the subgraph. A higher value of DCout prevents these references from being included.
The DCin and BC parameters must be set relatively high for smooth convergence (both at a roughly similar level, typically around 80% ± 20%), and smooth convergence can typically be achieved only within a narrow range (less than ±5%).
5.1. Comparing MALBA With a Global Algorithm
The direct citation network of the 147 publications cited by a biomedical review is shown in all four plots of Figure 5, with the review paper itself separating the two groups. References that do not cite other referenced papers are positioned on the left side of the plots. Referenced publications that also frequently cite each other are shown on the right side. MALBA was applied using a small seed of two publications (red, left-hand side), because using only the review paper as seed did not let a subgraph grow at all. In the first iteration (Step 1, upper left), a few publications are added (blue). The terminal subgraph produced by MALBA includes 117 publications (lower left). Here the results show that MALBA constructs a dense subgraph and does not include less well-linked publications (gray). It does not include any of the weakly linked papers to the left of the review paper. The Leiden algorithm, as a representative global algorithm, was applied to the same network using two different resolution values (the two plots on the right-hand side). It assigns every publication to a cluster. The lower resolution leads to one cluster containing all well-linked publications on the right side and many clusters containing only one or a few weakly linked publications on the left side of the review. The higher resolution makes the algorithm construct many clusters on both sides of the review paper.
5.2. Exploration of Publications on the h-index
Figure 6 shows the experiments with different seeds to grow subgraphs of publications on the h-index in a WoS database. If only the seminal paper by Hirsch (2005) from which the h-index topic emerged is used as the seed, the subgraph does not grow at all. This is not surprising because the original Hirsch paper is a publication outside the field of bibliometrics from which the topic grew. In contrast, Hirsch’s publication of 2007 is better connected to bibliometrics, and using it as the seed lets the subgraph terminate at 13 publications. When only some of the seven seed publications are used, the subgraph terminates at sizes between 450 and 660 publications. Using all seven leads to a steep increase in size. Here, the maximal subgraph contains 1,018 publications (see Note 7), at the thresholds DCin = 0.53, BC = 0.95, DCout = 11. This means that at each stage of growth, publications were added to the subgraph if at least 53% of their references that are also source items were publications in the subgraph, if they shared at least 95% of their references that are also source items with the subgraph’s publications, or if they were cited at least 11 times by the subgraph. Lowering any of these thresholds slightly (e.g., DCin to 0.50) leads to exponential growth of the subgraph without termination (i.e., the queries get too large to be processed). Smaller subgraphs can be obtained by increasing the thresholds. When the seed size is further increased to 15–20 publications and the same thresholds are used, very similar subgraphs emerge (with differences of fewer than 50 publications).
In the surroundings of this subgraph, we find false negatives (FNs), that is, publications that address the h-index but are not included, and true negatives (TNs). An example of an FN is the study by Bertoli-Barsotti and Lando (2015). It has only 42% of its references in the subgraph and thus did not pass the DCin threshold of 55%. Nor did it pass the thresholds for DCout or BC, as it is cited by the subgraph only three times and shares only 71% of its 54 source item references. Another example of an FN is the study by Opthof and Wilde (2009), which is cited only 10 times by the subgraph, not meeting the DCout threshold of 11. The FNs demonstrate that any density threshold is bound to create “near misses” (i.e., that a definitive delineation of a topic is not possible that way; see Note 8). The publication by Opthof and Wilde is representative of the usage and discussion of the h-index in medical fields, whose publications constitute about 20% of the subgraph, as indicated by the area with the red cross in the subgraph in Figure 6.
Most of the publications that were not included in the subgraph were TNs, for example the paper by Kosmulski (2018), which has 45% of its references in the subgraph but is not predominantly an h-index publication. Another TN is the study by Abramo, D’Angelo, and Rosati (2013), which has only six of its 21 source item references in the subgraph and is not cited by the subgraph, but shares 81% of its source item references with publications in the subgraph. The paper deals with the h-index only marginally.
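The inclusion rule and the excluded examples can be checked with a few lines of code. The sketch below assumes the rule as described in the text (a publication is added if it meets at least one of the three thresholds) and uses as defaults the thresholds reported for the subgraph of 1,018 publications. The class and function names are hypothetical, and only the values reported in the text for the two publications are used.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str
    dc_in_share: float   # share of source item references that are in the subgraph
    bc_share: float      # share of source item references shared with the subgraph
    dc_out_count: int    # number of citations received from the subgraph

def passed_criteria(c, dc_in=0.53, bc=0.95, dc_out=11):
    """Return the names of the thresholds the candidate meets (empty list: not added)."""
    checks = (
        ("DCin", c.dc_in_share >= dc_in),
        ("BC", c.bc_share >= bc),
        ("DCout", c.dc_out_count >= dc_out),
    )
    return [name for name, ok in checks if ok]

# Values as reported in the text for one false negative and one true negative.
bertoli = Candidate("Bertoli-Barsotti and Lando (2015)", 0.42, 0.71, 3)
abramo = Candidate("Abramo et al. (2013)", 6 / 21, 0.81, 0)

for c in (bertoli, abramo):
    print(c.label, passed_criteria(c))
```

Both publications meet none of the three thresholds, which is why neither is added in the first run, regardless of whether they are judged to belong to the h-index topic.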
The subgraph of 1,018 publications also includes very few false positives, which were hardly linked via direct citation links but were added via BC. This includes Lehmann, Jackson, and Lautrup (2005), which shares all of its source item references with the subgraph (BC = 1.0), but has no link to the h-index topic. If, however, the nonsource items of this publication could be considered in the calculation of BC, it would likely not have been added to the subgraph.
We used this subgraph as a seed for a second run of MALBA with thresholds DCin = 0.97, BC = 0.95, DCout = 5. This led to a termination at 2,985 publications. After this increase by more than 1,900 publications, some FNs that were found after the first run are still FNs, for example Bertoli-Barsotti and Lando (2015). The large increase in publications after the second run led to the inclusion of some previous FNs (the study by Opthof and Wilde, for example) but also led to the addition of false positives. For example, the abovementioned study by Abramo et al. (2013) is now included in the larger subgraph. This could indicate either that the subgraph is now too large or that more false positives have to be accepted to cover all publications on the h-index. Further work is needed to assess the connection between subgraph size, FNs, and TNs.
Figure 7 shows the distribution of the 1,018 publications over publication years and the number of publications with the keyword “h-index” for each year. The patterns clearly differ, which means that the growth of the subgraph is not influenced by the increasing number of publications on a topic.
5.3. Individual-Level Ground Truths as Seeds for the Exploration of BEC Publications
In a third set of experiments, we used publications that were categorized as BEC publications by their authors as seeds for growing subgraphs in a predefined publication network. Table 4 lists statistics for the seeds. The thresholds which led to termination of the growth of each subgraph indicate the density of the region of the network that was explored. The size of the terminal subgraphs varies considerably. The largest eight subgraphs range from 777 to 2,270 publications, whereas subgraphs of Researchers 4 and 8 grew only to sizes of 79 and 159, respectively.
The subgraphs overlap pervasively (Table 5). Eight subgraphs overlap in 529 publications, which represent 16% of the total 3,348 publications of all subgraphs. The two smallest subgraphs overlap only a little and are not considered here. Considering six of the subgraphs, our results show an overlap of more than 1,000 publications between them, which is almost a third of all publications. These results suggest that a substantial part of a collectively shared BEC topic is recovered by the overlap of separately grown subgraphs. Further support for this assumption is provided by our previous qualitative and quantitative research on BEC, which suggests a size of the BEC topic between 1,500 and 3,000 publications for the time from 1990 to 2005 (Held et al., 2021).
Overlap in # subgraphs | Numbers of publications in overlap (rel) | Cumulative numbers of publications in overlap (rel) |
---|---|---|
10 | 0 | 0 |
9 | 0 | 0 |
8 | 529 (0.16) | 529 (0.16) |
7 | 218 (0.07) | 747 (0.22) |
6 | 307 (0.09) | 1,054 (0.31) |
5 | 188 (0.06) | 1,242 (0.37) |
4 | 274 (0.08) | 1,516 (0.45) |
3 | 306 (0.09) | 1,822 (0.54) |
2 | 943 (0.28) | 2,765 (0.83) |
No overlap | 583 (0.17) | 3,348 (1.0) |
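The overlap statistics of this table can be derived from the terminal subgraphs by counting, for each publication, the number of subgraphs that contain it. The following sketch shows one way to do this. The function names and the three toy subgraphs are hypothetical, and the last lines recompute the cumulative column from the per-row counts given in the table.

```python
from collections import Counter

def overlap_profile(subgraphs):
    """Map k -> number of publications contained in exactly k subgraphs."""
    membership = Counter(pub for sg in subgraphs for pub in set(sg))
    return Counter(membership.values())

def cumulative_profile(profile, n_subgraphs):
    """Rows (k, count, cumulative count, cumulative share), from k = n down to 1."""
    total = sum(profile.values())
    running, rows = 0, []
    for k in range(n_subgraphs, 0, -1):
        running += profile.get(k, 0)
        rows.append((k, profile.get(k, 0), running, round(running / total, 2)))
    return rows

# Hypothetical toy example with three small subgraphs.
toy = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 6}]
print(cumulative_profile(overlap_profile(toy), 3))

# Per-row counts from the table (k = 1 corresponds to "No overlap").
table = {8: 529, 7: 218, 6: 307, 5: 188, 4: 274, 3: 306, 2: 943, 1: 583}
for row in cumulative_profile(table, 10):
    print(row)
```

Applied to the counts in the table, the sketch yields the same cumulative numbers and shares, for example 1,054 publications (0.31) contained in at least six subgraphs.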
6. DISCUSSION
The local growth algorithm MALBA is a new method for the exploration of the surrounding of a paper set in different user-defined bibliometric data models. In particular, the dense neighborhood of a seed can be reconstructed, with a transparent and controllable tool. All the experiments we presented show promising results that warrant future analyses of MALBA’s potential.
6.1. Assessment of the Experiments
Our experiments show that in a small range of threshold combinations the algorithm terminates at limited subgraph sizes. Thresholds set below that range lead to an exponential growth of the subgraph. This behavior of MALBA indicates that it may follow the general connectedness of science without adding information. There are thresholds below which it makes no sense to assume that the terminal subgraph represents a coherent thematic structure because it covers tens of thousands of publications. Above a certain threshold range, no growth occurs at all. Thus, setting thresholds is necessary when locally grown subgraphs are to be used for the exploration of publication networks. The existence of a narrow range of thresholds that leads to smooth convergence is somewhat counterintuitive given that these thresholds can be exceeded more easily the more the subgraph grows. The only explanation for this behavior of MALBA we see is that the density of publication networks varies so strongly that MALBA can include some regions of the networks only at the price of exponential growth. The limited growth of subgraphs for such a set of thresholds indicates that a seed is surrounded over some distance by a dense area in one or both data models that can be reconstructed by a cohesion-oriented algorithm.
Furthermore, the experiments show that the regions of similar density surrounding the seeds that make up the terminal subgraph are often rather small, ranging from hundreds to at most a few thousand publications. An important reason for this is that our algorithm is not suitable for reconstructing less dense regions of a network because publications in these regions do not exceed thresholds in the range that prevents exponential growth. Thus, MALBA’s focus might be more on the cores of topics, which, however, future work still has to validate. Additionally, we see two more reasons why subgraphs do not grow to sizes that would make them plausible representations of topics. Most publications belonging to a topic may not be included in the data models. This is less of a problem when MALBA is applied to an online database where no ex-ante field delineation is necessary (Boyack, 2017). Alternatively, a topic might not be represented by density fluctuations in data models at all. This, however, would render all attempts to construct topics in networks of papers useless.
Another important decision made by users concerns the size of the seed. Some of our results indicate that larger seeds lead on average to larger terminal subgraphs. The experiments on the h-index, however, revealed a range of seed sizes (approximately between seven and 30 publications) that hardly altered the final subgraph. This might be explained by the choice of highly cited h-index publications as seeds, which are all “very close to each other” (i.e., they belong to the same dense network region). Still larger seeds require higher thresholds for smooth convergence. As these thresholds are stricter in terms of dense communication (DC) and thematic similarity (BC), the terminal subgraph is likely to be thematically more homogeneous.
Connectedness and thus thematic homogeneity are also relevant for the choice of the seed. Our experiments show that thematically less homogeneous seeds (seeds with fewer internal connections) will lead either to exponential growth or to very small subgraphs. This happens because no parameter combination can be found that enables the subgraph to grow in all “thematic directions.” Structurally speaking, while no single connected dense subgraph can be found, there are probably several separate subgraphs that are not densely connected to each other.
6.2. Preliminary Assessment of MALBA
MALBA’s behavior in the reconstruction of the dense surrounding of a paper set in different bibliometric data models can be exploited for the structural exploration of publication networks. This reconstruction of regions of different density starting from thematically related seeds can aid our understanding of the bibliometric representation of topics in various data models. Setting lower thresholds seems to make MALBA respond to the “background noise” of a field, which cannot be distinguished from the background noise of other fields. When MALBA reaches these less dense regions of a network, it makes the subgraph grow exponentially. If MALBA does indeed reconstruct topics (which is our hope but still requires further validation), its ability to adjust the thresholds of two data models would make it possible to gauge topics’ different representations in the data models. Topics might be differently represented in the DC model or the BC model. This kind of analysis might be extended because other data models can be included in MALBA as well. Adjusting the focus on a particular data model is not only useful to learn about the seed’s neighborhood but may increase the size of the terminal subgraph.
Contrasting our algorithm with other algorithms also reveals limitations. First, while common topic reconstruction algorithms can simply be run on giant bibliometric networks without any domain knowledge, MALBA needs a thematically informed choice of seeds to reconstruct meaningful subgraphs. This requires somewhat more effort in the application. Second, MALBA cannot easily explore all regions of a network, because less dense regions are ignored by our algorithm and the terminal subgraphs are rather small. Covering an entire network would require very many runs of the algorithm, which makes it much less efficient than other common algorithms.
6.3. Are All Publications Equal?
The experiments have illustrated MALBA’s behavior of excluding some publications in the surroundings of the subgraph and in the network in general. This stands in stark contrast to the dominant topic reconstruction exercises performed with global algorithms (Velden et al., 2017), where all publications are included and cluster sizes are usually in the range of several thousand or tens of thousands of publications. It could be argued that both approaches have their merit. On the one hand, the inclusion of all publications in a network in the final mapping result might be justified as a useful abstraction from all evaluative considerations. On the other hand, researchers cautioned early on against treating every publication as though it contributed equally to knowledge generation (i.e., against performing undue abstractions from the research content; Whitley, 1970, pp. 61–64). While this warning is rather general and not specifically addressed to algorithms, it points to a current tension between evaluative and structural bibliometrics. Evaluative bibliometrics has demonstrated over and over again that not all publications are equal in their scientific value. But structural bibliometrics includes them all and often gives them “equal votes” when reconstructing topics. From the perspective of the sociology of science and the process of knowledge generation, there is no justification for these “equal votes,” and algorithms that can differentiate between publications having more and less relevance for a topic should be more trustworthy.
7. CONCLUSIONS
MALBA iteratively builds subgraphs in networks of publications for the purpose of structural exploration of publication networks. It is the first local algorithm of this kind that incorporates simple structural evaluations in the direct citation and bibliographic coupling neighborhood of a subgraph, with all important decisions remaining transparently in the hands of the user. Our experiments with MALBA have revealed that the dense neighborhood of a publication set can be reconstructed, which is very much in line with Kuhn’s idea of dense communication and thematic similarity characterizing scientific communities. With the presented approach, users are able to easily conduct a structural investigation of the citation neighborhood of a publication set, considering different types of links, which can aid our understanding of the bibliometric representation of topics in various data models. Another practical value of MALBA is that it can also be used to search for publications related to a seed of publications of interest without knowing much about a topic in advance. Currently, the algorithm only interacts with the Web of Science database. Negotiations with open data providers are ongoing.
Before MALBA can be applied as a topic reconstruction algorithm with confidence, future work needs to address several empirical questions:
To what degree do the reconstructed subgraphs of this algorithm correspond to topics?
(Why) are relevant publications left out?
How can the algorithm be adapted to prevent thematically implausible exponential growth?
ACKNOWLEDGMENTS
We would like to thank Bastian Steudel for extensive support with the software development.
AUTHOR CONTRIBUTIONS
Matthias Held: Conceptualization, Formal analysis, Methodology, Software, Writing—original draft, Writing—review & editing. Jochen Gläser: Conceptualization, Formal analysis, Methodology, Writing—original draft, Writing—review & editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
The work of MH was supported by the German Ministry of Education and Research (Grant 16PU17003). We acknowledge support by the Open Access Publication Fund of TU Berlin.
DATA AVAILABILITY
Some of the data analyzed in this manuscript is subject to copyright (by Clarivate Analytics). Thus, not all the data can be made available.
Notes
1. Note here a far-reaching semantic difference between the abovementioned scientific community and a network community.
2. The global or local character of an algorithm’s approach is a matter of degree, which depends on the amount of local or global information that is used for identifying communities (Held, 2022). We use these categories to highlight relevant general differences.
3. Another aspect is that only a few publications of the algorithms reviewed by Baltsou et al. (2022) provide the source code, which makes it hard to apply most of these algorithms in bibliometrics.
4. We limit the calculation of BC links to publications with at least eight references.
5. The source code is available on GitHub: https://github.com/blnote/malba-public.
6. To avoid an explosion in the number of candidate publications we only include the “parents” (citing publications) of candidates that are cited at most x times (e.g., 3,000). Note that parents excluded this way may still end up in the candidate set if they cite another less cited candidate.
7. Find the metadata for this subgraph at https://doi.org/10.5281/zenodo.11392856.
8. This suggests that the definitive boundaries created by algorithms that partition networks are an illusion, too.
REFERENCES
Author notes
Handling Editor: Li Tang