Hyperauthorship, a phenomenon in which a single paper lists a disproportionately large number of authors, is increasingly common in several scientific disciplines, but its consequences for the network metrics used to study scientific collaboration are unknown. This uncertainty calls into question the validity of coauthorship as a proxy for scientific collaboration. Using bibliometric data from publications in the field of genomics, we examine the impact of hyperauthorship on metrics of scientific collaboration, propose a method to determine a suitable cutoff threshold for hyperauthored papers, and compare coauthorship networks with and without hyperauthored works. Our analysis reveals that including hyperauthored papers dramatically impacts the structural positioning of central authors and the topological characteristics of the network, while having only a small influence on whole-network cohesion measures. We present two solutions to minimize the impact of hyperauthorship: using a mathematically grounded and reproducible calculation of a threshold cutoff to exclude hyperauthored papers, or using fractional counting to weight network results. Our findings affirm the structural influences of hyperauthored papers and suggest that scholars should be mindful when using coauthorship networks to study scientific collaboration.

Scientific collaboration is vital to solving complex scientific problems that require integration of knowledge across disciplines. Bibliometric studies show that discipline-spanning collaborations play an important role in spurring scientific innovation and producing impactful papers (Collins & Evans, 2015; Thelwall & Maflahi, 2022; Uzzi, Mukherjee et al., 2013). To that end, increasing efforts have been made to support scientific research teams as well as to better understand the relationship between diversity in scientific collaborations and research outcomes. Many studies in this space use paper coauthorship as the primary indicator upon which to assess the diversity of a collaboration and that collaboration’s effects on outcomes of scholarly interest.

The average size of authorship teams has increased over time (Ioannidis, 2008; Wuchty, Jones, & Uzzi, 2007), especially in fields such as high-energy physics (Birnholtz, 2006; Milojević, 2010), genomics (Dinh & Cheng, 2018), and medicine (Franceschet & Costantini, 2010), where hyperauthorship is relatively common. The rapid growth in average team size may impact measures of scientific collaboration outcomes, which traditionally have been examined using a mix of bibliometric (Porter & Rafols, 2009; Rafols, Leydesdorff et al., 2012; Schummer, 2004) and network analysis methods (Akbaritabar, 2021; Barley, Dinh et al., 2022; Cummings & Cross, 2003; Fegley & Torvik, 2013). In fact, Sinatra, Deville et al. (2015) found that the number of citations per paper and number of papers per author in the field of interdisciplinary physics have been inflated over the past 15 years, and that the number of authors per paper increased at a similar rate to the number of papers produced in this field. Thus, these “citation” measures are not a proxy for scientific collaboration (Strumia & Torre, 2019). In some leading medical journals, for example, hyperauthored works can include long lists of authors that represent an honorary role in the research process despite not having contributed substantively to the work (Kennedy, Barnsteiner, & Daly, 2014; Wislar, Flanagin et al., 2011). Furthermore, the validity of coauthorship as a primary indicator of research collaborations is another subject of inquiry, as evidence suggests that not all collaborations result in coauthored papers (Lundberg & Brommels, 2006; Smith & Katz, n.d.; Tijssen, 2004) and that not all coauthorships signify collaboration in terms of contribution to writing (Cronin, 2001; Dinh & Cheng, 2018). It is important to distinguish that scientific collaboration is a process of working together, whereas coauthorship is an indicator of scientific contribution with certain norms and guidelines (Cronin, 2001). 
Thus, they refer to different aspects of scientific research, and while coauthorship may suggest collaboration, there may be other factors beyond direct collaboration that can explain coauthorship (Birnholtz, 2006).

Given the complex relationship between coauthorship and scientific collaboration, especially in the presence of hyperauthorship, this study examines how hyperauthored papers impact the coauthorship network metrics that scholars use to study scientific collaborations. The inclusion of even a few hyperauthored papers within a bibliometrically constructed network may substantially inflate the average number of collaborators per author. Consequently, including hyperauthored works may inflate author-level network measures frequently used to assess an author’s influence in scientific collaboration. We test this hypothesis by examining a database of papers from the interdisciplinary field of genomics, specifically focusing on papers authored by 413 researchers affiliated with a large biological research institute, where hyperauthorship is common. Using these data, we (1) propose a method to determine a suitable cutoff threshold for hyperauthored papers using the cumulative frequency distribution of number of authors per paper; (2) compare the changes (if any) in network metrics of coauthorship networks with and without hyperauthored papers; and (3) present two solutions to minimize the impact of hyperauthorship by using a threshold cutoff to exclude hyperauthored papers or using fractional counting (i.e., Newman and Jaccard weighting functions) to weight network results.

Our analysis reveals that including hyperauthored papers dramatically impacts the structural positioning of central authors and the topological characteristics of the network, while producing comparatively small influences on whole-network cohesion measures. These findings suggest that scholars should be mindful when using bibliometric networks to study scientific collaboration, especially if the object of analysis focuses on egocentric dependent variables. We argue that researchers should consider whether including hyperauthored works is necessary to address their research questions, and consider omitting them from analysis when unnecessary. Further, when including hyperauthored work, we find that a fractional counting approach overall can mitigate the impact of hyperauthorship compared to full counting, with the most optimal solution being fractional counting based on the number of shared coauthors across all papers. Our findings affirm researchers’ concerns about the structural influences of hyperauthored papers and indicate that scholars must directly consider how hyperauthored works will affect their results when studying scientific collaboration using coauthorship networks.

2.1. Network Metrics as Indicators of Scientific Collaboration

Scholars in the science of science have used network measures to analyze structural patterns of scientific collaboration (Bordons, Aparicio et al., 2015; Leydesdorff, 2007) as well as to identify factors that impact collaboration across disciplinary (Morillo, Bordons, & Gómez, 2003; Porter, Cohen et al., 2007) and geographical (Bordons & Gomez, 2000; Naik, Sugimoto et al., 2023) boundaries. The benefits of collaboration in the production of scientific knowledge are well documented in the literature, including that more diverse research teams can benefit from increased creativity and innovation (Burt, 2004; Nemeth & Nemeth-Brown, 2003). Leydesdorff (2007) found that betweenness centrality is a reliable indicator of interdisciplinarity in journal–journal citation networks; the higher the betweenness, the more diverse the disciplines that cite a journal. Bordons et al.’s (2015) network analysis of coauthors in three fields (Nanoscience, Pharmacology, and Statistics) showed that authors with the largest number of “strong tie” coauthors (i.e., those with repeated collaborations) tend to have the highest research productivity. Costanza and Kubiszewski (2012) examined the coauthorship network of the 172 authors who published the most papers in an interdisciplinary field (Ecosystem Services) and found that the number of coauthors had a positive linear relationship with the number of citations an article received, which also had a positive correlation with the average h-index of each article. These studies exemplify that network analysis is a preferred method of analysis in which coauthorship and citation patterns often are used as proxies for scientific collaboration.

2.2. Hyperauthorship and Scientific Collaboration

Bibliometric studies have found a continuous and consistent growth in coauthorship that spans all scientific disciplines (Dehdarirad & Nasini, 2017; Milojević, 2010; Valderas, 2007). While there are notable benefits of scientific collaboration, there are also drawbacks to consider, particularly in terms of fair allocation of credit when coauthorship is given for reasons other than scientific collaboration (Birnholtz, 2006; Cronin, 2001). With the rising prevalence of publications with large numbers of coauthors, known as hyperauthorship, norms and requirements for authorship in a collaborative work are also affected (Cronin, 2001). Scholars in bibliometrics and network science have found that hyperauthorship affects traditional indicators of scholarly productivity, such as the h-index (Koltun & Hafner, 2021), degree centrality (Fegley & Torvik, 2013), and author degree distributions (Milojević, 2010). Koltun and Hafner’s (2021) analysis of over two million publications on Google Scholar and the citations between them revealed that authors with 100 coauthors or more over the course of their careers have disproportionately high h-indices. However, the h-indices were found to be uncorrelated with other productivity indicators, such as the number of scientific awards received. Fegley and Torvik (2013) found that hyperauthorship influenced coauthorship network structure, where groups of authors were completely connected within their own clusters (i.e., a common multiauthored paper) and thus had higher degree centrality than expected. Milojević (2010) compared the probability distributions of new collaboration based on prior coauthorship with and without hyperauthorship and showed that while both distributions follow a power law, the distribution with hyperauthorship includes anomalous noise. The author also found that for authors with fewer than 20 coauthors over the course of their careers, the degree distribution was a log-normal “hook” instead of a power law. This finding illustrates that the number of coauthors has an effect on the topology of the collaboration network. Altogether, these studies show that hyperauthorship may impact a scholar’s interpretation of a particular author’s (or a group of authors’) collaboration activity and their connectedness within a network.

2.3. Mitigating the Impact of Hyperauthorship

We have observed that in many studies using bibliometric data, hyperauthored papers are not explicitly acknowledged or addressed, despite the known impacts of hyperauthorship on coauthorship networks, where papers with a high number of authors often have inflated weights due to the presence of numerous large complete subgraphs (Batagelj, 2020; Batagelj & Cerinšek, 2013). In some cases, keeping hyperauthored papers may be useful or necessary, such as in studies examining author name disambiguation (Farber & Ao, 2022; Kim, 2019), researcher productivity (Costas, van Leeuwen, & Bordons, 2010; Thelwall & Maflahi, 2022), or growth in coauthorship size over time (Borner, Dall’Asta et al., 2005; Thelwall & Maflahi, 2022). Other studies have kept hyperauthored papers in order to evaluate network normalization techniques that minimize the impact of hyperauthorship, such as fractional counting (Batagelj, 2020; Perianes-Rodriguez, Waltman, & van Eck, 2016), or to examine core-periphery structures of scientific collaboration networks (Batagelj & Zaveršnik, 2011; Fetscherin & Heinrich, 2015; Uzzi, Amaral, & Reed-Tsochas, 2007). Among those works that do explicitly seek to mitigate the impact of hyperauthorship, the most common practice has been to exclude papers that have more than a certain number of coauthors (Cronin, 2001; Fegley & Torvik, 2013; Milojević, 2010; Morris & Goldstein, 2007). However, practices for choosing this threshold have been inconsistent. Cronin (2001) was the first to define hyperauthorship as any paper with more than 100 authors. Similarly, Milojević (2010) set a threshold of more than 200 authors for a hyperauthored paper. Both Fegley and Torvik (2013) and Morris and Goldstein (2007) operationalized a hyperauthored paper as one with at least 20 authors. Kim (2019) identified and removed papers in the top 1% for the highest number of authors (≥9 authors in the DBLP data set, ≥16 authors in the MAG data set, and ≥13 authors in the MEDLINE data set). Ahmed, Cambo et al. (2013) removed 3% of papers that were identified as hyperauthored, but did not specify the threshold. While these empirical solutions are important first steps, variability in how hyperauthored papers are handled creates challenges for comparability across studies. Therefore, a standardized and reproducible method for identifying and excluding hyperauthorship would be beneficial to the field of bibliometrics. To address this need, we demonstrate how a reproducible pipeline for preprocessing hyperauthorship data can help mitigate any potential effects of hyperauthorship on network measures of interest, while also offering researchers a standard to consider when seeking to exclude hyperauthored works from their data sets.

3.1. Data

The data set used for this study consisted of bibliometric records for 413 researchers within a large biological research institute. Among the 413 researchers, 208 have coauthored one or more papers together during their careers at the research institute. Each researcher’s publication data throughout their academic career (up until 2021) were collected using a Scopus database, including metadata such as full title, publication type, journal/conference proceeding name, publisher, DOI, author names, organizational unit of author(s), citation counts (based on Scopus), open access status, and keywords. We added a unique ID to each publication for quick retrieval and matching purposes. The publication data from this set of seed authors allow us to examine the immediate impact of hyperauthorship on each researcher’s egocentric network of coauthors, particularly in terms of collaboration frequency and strength with their coauthors. Additionally, this data set allows us to compare researchers’ egocentric networks and determine whether the presence of hyperauthored papers may benefit some researchers while disadvantaging others.

The original format of the data set was a two-mode network, in which two types of nodes, papers and authors, are connected. An edge between a paper node and an author node indicates that the paper is authored by that author. As our goal was to analyze coauthorship network patterns, we transformed the two-mode network into an author-author one-mode network via the weighted bipartite projection method described by Breiger (1974) and later by Borgatti and Halgin (2011). The resulting weighted projected graph contains an edge between two author nodes whenever both were connected to the same paper node in the original bipartite graph. In other words, the projected graph contains coauthorship edges, along with weights reflecting the number of papers that two authors have coauthored together.
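The projection step can be illustrated with a minimal sketch. It assumes the two-mode data are available as a mapping from paper IDs to author lists; the function name `project_coauthorship` and the toy data are hypothetical and are not taken from the study's pipeline. NetworkX offers a comparable routine, `bipartite.weighted_projected_graph`, for graphs already loaded into that library.

```python
from collections import Counter
from itertools import combinations


def project_coauthorship(paper_authors):
    """Project a paper -> authors mapping onto weighted author-author edges.

    The weight of an edge is the number of papers the two authors share.
    """
    weights = Counter()
    for authors in paper_authors.values():
        # Deduplicate and sort so that each pair has one canonical key.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy data: three papers written by four authors.
papers = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "B"],
    "p3": ["C", "D"],
}
edges = project_coauthorship(papers)
```

In this toy example, authors A and B share two papers, so their edge carries a weight of 2, while every other pair that appears together has a weight of 1. A single paper with n authors contributes n(n − 1)/2 edges, which is why a few hyperauthored papers can dominate the projected network.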

The resulting publication data set contained 19,100 unique papers produced by 35,658 unique authors. Figure 1 shows the distribution of the number of coauthors for the papers in the resulting data set. This serves as evidence that hyperauthorship is a relatively recent but increasingly common trend in publishing, marked by a notable increase in the number of hyperauthored papers observed after the 2000s. Considering the trajectory of hyperauthored papers in our data set, it is imperative to examine the impacts that these papers would have on our understanding of scientific collaboration patterns over time.

Figure 1.

Scatterplot depicting number of coauthors per paper for all articles in the data set, including hyperauthored papers.


3.1.1. Threshold for hyperauthorship

We establish a threshold for categorizing a paper as hyperauthored based on the distribution of the number of authors per paper. Our goal is to provide a generalized and reproducible method for identifying and removing hyperauthored papers from any skewed distribution of publications. The process involves several steps to determine the cutoff point for hyperauthorship, above which outliers with a large number of authors can be excluded from analysis. First, we check whether the number of authors per paper is normally distributed. If it is, we use the empirical rule (i.e., the 68–95–99.7 rule) to identify outliers, setting the threshold at which 95% of the data are captured (i.e., two standard deviations from the mean). If it is not, we apply Chebyshev’s inequality (i.e., the 75–88.9 rule) to set a threshold at which at least 88.9% of the data are captured (i.e., three standard deviations from the mean). We then compare the resulting threshold with one obtained from a cumulative frequency distribution, which involves ranking observations in order of magnitude and calculating cumulative frequencies based on the ranking. We use two methods for determining the cutoff point so that we can cross-validate the findings and establish the reliability of our approach. While both approaches are highly dependent on the data and their distribution, we anticipate that the method can be applied to any bibliometric data set, given the expected nonnormal distribution as per Lotka’s law.
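The two cutoff calculations described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the function names are hypothetical, the population standard deviation is used (the text does not say whether a sample or population estimate was applied), and the input list is invented toy data.

```python
from statistics import mean, pstdev


def chebyshev_upper_bound(counts, k=3):
    """Return mean + k * SD as an upper outlier cutoff.

    Chebyshev's inequality guarantees that at least 1 - 1/k**2 of the
    values lie within k standard deviations of the mean (about 88.9%
    for k = 3), whatever the shape of the distribution.
    Assumption: population SD (pstdev) is used here.
    """
    return mean(counts) + k * pstdev(counts)


def cumulative_cutoff(counts, share=0.90):
    """Return the smallest author count at which the cumulative share
    of papers, ranked by number of authors, reaches `share`."""
    ordered = sorted(counts)
    n = len(ordered)
    for rank, value in enumerate(ordered, start=1):
        if rank / n >= share:
            return value
    return ordered[-1]


# Hypothetical authors-per-paper counts, with one very large paper.
authors_per_paper = [1, 2, 2, 3, 3, 4, 4, 5, 6, 40]
bound = chebyshev_upper_bound(authors_per_paper, k=3)
cutoff = cumulative_cutoff(authors_per_paper, share=0.90)
```

Because Chebyshev's bound holds for any distribution, it is conservative; comparing it against the cumulative-frequency cutoff gives a second, data-driven reference point for the same threshold.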

3.2. Network Weighting Functions

Another potential solution to addressing hyperauthored works is to apply weighting functions to potentially reduce these products’ structural influence on collaboration networks. Fair allocation of authorship credit to authors engaged in multiauthored papers has been a topic of considerable interest for bibliometrics researchers (Abramo, D’Angelo, & Rosati, 2013; Perianes-Rodriguez et al., 2016; Sivertsen, Rousseau, & Zhang, 2019). This problem is especially relevant to researchers who use a combination of bibliometric and network approaches, as choices about credit allocation have a direct impact on how the network is constructed and weighted (Gauffriau, 2017; Perianes-Rodriguez et al., 2016). Gauffriau (2021) in their comprehensive literature review find that full counting and fractional counting are the two primary methods used for credit allocation. The full counting method assigns a weight of one to each author of a paper, whereas the fractional counting method distributes a single weight among all the coauthors of a paper. In this study, we will implement both counting methods, as summarized in Table 1, and with formulations stated below.

Table 1.

Summary of network weighting functions and measures

| Function/Measure | Description | Usage |
| --- | --- | --- |
| Full counting | Assigns a weight of one to each author | Constructing whole and egocentric networks |
| Fractional counting (Newman’s method) | Distributes a single weight among all coauthors of a paper based on the number of coauthors | Constructing whole and egocentric networks |
| Fractional counting (Jaccard’s method) | Computes the Jaccard index to measure the neighborhood overlap between two authors | Constructing whole and egocentric networks |

3.2.1. Full counting

$$w_{ij} = \sum_{p} a_{ip}\, a_{jp} \tag{1}$$
Here $a_{ip}$ indicates whether $i$ is an author of paper $p$, where 1 indicates authorship and 0 indicates no authorship. Similarly, $a_{jp}$ is 1 if $j$ is an author of paper $p$. Thus, the product $a_{ip} a_{jp}$ is 1 if $i$ and $j$ are both authors of $p$. This counting method attributes a full weight of 1 to each coauthorship instance, so that $w_{ij}$ is the number of papers on which $i$ and $j$ are both coauthors.

3.2.2. Fractional counting

There are several approaches to fractional counting (Batagelj, 2020; Gauffriau, 2021), which are essentially means of normalizing the coauthorship weights based on the number of coauthors in a paper. Batagelj (2020) applied three weighting algorithms to normalize coauthorship and cocitation networks, effectively mitigating the impact of hyperauthorship on the network structure caused by overrepresentation of edge weights. Based on prior literature, we utilize two main weighting functions, namely Newman’s and Jaccard’s methods. Newman’s method and variants of the method have been used in prior studies such as Griffin, Arth et al. (2021) and Perianes-Rodriguez et al. (2016). Jaccard’s method has been used in Brandão and Moro (2017) and Pan, Sinha et al. (2012) as a measure of neighborhood overlap between two authors.

Weight $w_{ij}$ based on Newman (2001):
$$w_{ij} = \sum_{p \in P_{ij}} \frac{1}{N_p (N_p - 1)} \tag{2}$$
where $P_{ij}$ is the set of all papers coauthored by $i$ and $j$, and $N_p$ is the number of authors of paper $p$. $N_p(N_p - 1)$ calculates the number of possible pairs of neighbors for paper $p$, which is used to normalize the weight based on Newman’s approach. $\frac{1}{N_p(N_p - 1)}$ calculates the influence of every coauthor in paper $p$ on the weight of the edge between $i$ and $j$ (Newman, 2001, Formula 2). This essentially weights the influence of each paper inversely proportional to the number of potential coauthor pairs it contains. Hence, papers with fewer coauthors will have a greater influence on the weight of the edge between two authors.
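As an illustration of this weighting, the sketch below sums 1 / (N_p (N_p − 1)) over the papers shared by each pair of authors, following the normalization described in the text. The function name `newman_weights` and the toy papers are hypothetical, and the input is assumed to be a mapping from paper IDs to author lists.

```python
from collections import defaultdict
from itertools import combinations


def newman_weights(paper_authors):
    """Fractional edge weights: each shared paper p adds
    1 / (N_p * (N_p - 1)) to the edge between two of its authors."""
    weights = defaultdict(float)
    for authors in paper_authors.values():
        unique = sorted(set(authors))
        n = len(unique)
        if n < 2:
            continue  # Single-author papers create no coauthorship edge.
        for a, b in combinations(unique, 2):
            weights[(a, b)] += 1.0 / (n * (n - 1))
    return dict(weights)


# Hypothetical toy data: one two-author and one three-author paper.
papers = {
    "p1": ["A", "B"],
    "p2": ["A", "B", "C"],
}
w = newman_weights(papers)
```

Here the two-author paper contributes 1/2 to the A–B edge and the three-author paper contributes 1/6 to each of its edges, so a paper with 100 authors would add only 1/9,900 per edge.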
Weight $w_{ij}$ based on the Jaccard index (Borgatti & Halgin, 2011):
$$w_{ij} = \frac{|N(i) \cap N(j)|}{|N(i) \cup N(j)|} \tag{3}$$
where $|N(i)|$ is the number of coauthors that $i$ has and $|N(j)|$ is the number of coauthors that $j$ has. Thus, $|N(i) \cap N(j)|$ is the number of shared coauthors between $i$ and $j$; $|N(i) \cup N(j)|$ is the number of all coauthors that $i$ and $j$ have between them. Note that this weighting approach does not yet consider the weights associated with the edges between common neighbors, but rather focuses on the presence or absence of common neighbors between the nodes. In future work, we will incorporate the weights of common neighbors to take into account the strength of coauthorship between the neighbors as well as in relation to the focal node. However, this extension may introduce additional computational complexity and require further refinement of the algorithm.
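A minimal sketch of the Jaccard weighting, |N(i) ∩ N(j)| / |N(i) ∪ N(j)|, is given below. It assumes that N(i) is the set of all coauthors of i across every paper and that the two focal authors are not removed from each other's neighborhoods; both are assumptions of this illustration, as are the function name and the toy data.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_weights(paper_authors):
    """Weight each coauthorship edge by the neighborhood overlap of its
    endpoints: shared coauthors divided by all coauthors of either."""
    neighbors = defaultdict(set)
    edges = set()
    for authors in paper_authors.values():
        for a, b in combinations(sorted(set(authors)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
            edges.add((a, b))
    weights = {}
    for a, b in edges:
        shared = neighbors[a] & neighbors[b]
        union = neighbors[a] | neighbors[b]  # Never empty for an edge.
        weights[(a, b)] = len(shared) / len(union)
    return weights


# Hypothetical toy data: four authors across two papers.
papers = {
    "p1": ["A", "B", "C"],
    "p2": ["C", "D"],
}
w = jaccard_weights(papers)
```

Under these assumptions, an edge between two authors with no coauthor in common (C–D above) receives a weight of zero, because only the presence or absence of shared neighbors is counted.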

3.3. Network Measures

Here, we define the network measures that are computed for this study. We use existing algorithms available in NetworkX, a Python library for network analysis, and modify a subset of measures based on our operationalization. We conduct network analysis at both the whole-network and egocentric network levels, computing the same set of metrics (discussed below) at both levels, as shown in Table 2.

Table 2.

Summary of whole and egocentric network measures

| Measure | Description | Usage |
| --- | --- | --- |
| Density | Proportion of edges in the network relative to total possible edges | Whole-network cohesion measure |
| Average clustering | Measures local neighborhood formation in the network | Whole-network cohesion measure |
| Average path length | Measures average shortest path distance between every pair of nodes | Whole-network cohesion measure |
| Giant component | Identifies the largest connected subgraph | Whole-network cohesion measure |
| Clauset-Newman-Moore | Agglomerative clustering that initializes each node as a separate community, then uses a greedy approach to iteratively merge pairs of communities that enhance modularity | Community detection algorithm |
| Clique percolation | Divisive clustering that detects overlapping cliques, then forms communities by grouping these cliques | Community detection algorithm |
| Louvain modularity | Agglomerative clustering that rearranges nodes within communities, then reaggregates them into separate communities to enhance modularity | Community detection algorithm |
| Omega coefficient | Indicates the small-world property of the network based on path length and clustering | Topological measure |
| Alpha exponent | Measures the network’s degree distribution to check for a power-law fit | Topological measure |
| Degree centrality | Measures the number of edges each node has to other nodes | Whole-network and egocentric measure |
| Betweenness centrality | Measures the number of shortest paths that pass through each node | Whole-network and egocentric measure |
| Closeness centrality | Measures average reachability of one node to other nodes | Whole-network and egocentric measure |
| Eigenvector centrality | Measures the extent to which a node is an immediate neighbor of well-connected nodes | Whole-network and egocentric measure |

3.3.1. Whole-network cohesion measures

Density measures the proportion of edges that exist in a network relative to the total number of possible edges. We use the formula (2m)/(n(n − 1)) to calculate density, where n is the number of nodes, and m is the number of edges.

Average clustering measures the extent to which nodes in a network tend to form local neighborhoods. We calculate this by dividing the fraction of triangles in the network by the possible number of triangles that could exist with a given network size.

Average path length measures the average shortest path distance between every pair of nodes. We use Dijkstra’s algorithm with Newman’s (2001) modification for weighted networks, where each node is iteratively selected as a source node and a shortest path to every other node is calculated. This modification entails inverting the edge weights to reflect the strength of the collaboration tie, whereby a higher weight means a lower “distance/cost” to traverse.
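The weight inversion can be sketched with a small Dijkstra implementation in which an edge of weight w is traversed at a cost of 1/w. This is an illustration only: the adjacency format (a dictionary of neighbor-to-weight dictionaries), the function names, and the toy network are assumptions, and the average is taken over reachable ordered pairs.

```python
import heapq


def shortest_distances(adjacency, source):
    """Dijkstra from `source`, treating 1 / weight as the edge cost so
    that stronger collaboration ties are 'closer'."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # Stale heap entry; a shorter path was found.
        for nbr, weight in adjacency[node].items():
            candidate = d + 1.0 / weight
            if candidate < dist.get(nbr, float("inf")):
                dist[nbr] = candidate
                heapq.heappush(heap, (candidate, nbr))
    return dist


def average_path_length(adjacency):
    """Mean shortest-path cost over all reachable ordered node pairs."""
    total, pairs = 0.0, 0
    for source in adjacency:
        for target, d in shortest_distances(adjacency, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical network: A-B share 4 papers, B-C share 2, A-C share 1.
adjacency = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2},
    "C": {"A": 1, "B": 2},
}
apl = average_path_length(adjacency)
```

In the toy network, the cheapest route from A to C runs through B (1/4 + 1/2 = 0.75) rather than along the direct but weak A–C tie (cost 1), which is the behavior the inversion is meant to produce.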

The giant component is the largest connected subgraph in a network, in which every node is reachable from every other node. The algorithm iterates through each source node and conducts a breadth-first search to identify the nodes reachable from it. We first extract all the connected components that exist in the network, then determine the largest component and designate it as the giant component.
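The extraction of connected components by breadth-first search can be sketched as follows. The adjacency format (a dictionary mapping each node to a set of neighbors), the function names, and the toy network are assumptions made for this illustration.

```python
from collections import deque


def connected_components(adjacency):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:  # Breadth-first search from `start`.
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    component.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components


def giant_component(adjacency):
    """Return the largest connected component by number of nodes."""
    return max(connected_components(adjacency), key=len)


# Hypothetical network with three components, one of them an isolate.
adjacency = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
giant = giant_component(adjacency)
```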

3.3.2. Subgroup measures

The Clauset-Newman-Moore greedy modularity maximization algorithm starts by treating each node as its own community and then greedily merges the pair of communities whose union yields the largest increase in the network’s modularity, repeating until no further merge improves modularity.

The clique percolation method functions as a divisive clustering approach by identifying cohesive subgraphs or cliques of nodes, which are groups where each node is connected to every other node. These cliques are gradually merged to create a hierarchy of community structures.

The Louvain modularity algorithm is an agglomerative clustering method that starts with each node in its own community and optimizes modularity by iteratively merging nodes into communities, locally optimizing modularity by considering nodes’ connections within their neighborhoods, effectively identifying community structures within a network.

3.3.3. Centrality measures

Degree centrality measures the number of edges that each node has to other nodes in the network. We normalize each node’s degree centrality by dividing it by the maximum possible degree in the network (n − 1).

Betweenness centrality measures the number of shortest paths that pass through each node. This measure indicates the extent to which a node can bridge connection(s) to other nodes in the network. Given the size of the network and the computational complexity, we approximate betweenness centrality based on a random sample of 1,000 nodes. This measure is normalized by 1/((n − 1)(n − 2)), where n is the number of nodes in a directed network.

Closeness centrality measures the average reachability of one node to other nodes in the network. We calculate this based on the reciprocal of the average path length between a source node and all other n − 1 nodes.

Eigenvector centrality measures the extent to which a node is an immediate neighbor of well-connected nodes. It is defined by Ax = λx, where A is the adjacency matrix of the network, λ is its largest eigenvalue, and the entries of the corresponding eigenvector x give the centrality of each node. The algorithm iterates over the nodes, repeatedly updating x, and is complete when x converges to the eigenvector associated with the largest λ.
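The iteration behind Ax = λx can be illustrated with a short power-iteration sketch on an unweighted network. The function name, the adjacency format, and the toy star network are assumptions; the identity shift (iterating with A + I rather than A) is a choice made here so that the iteration does not oscillate on bipartite graphs, and it leaves the leading eigenvector unchanged.

```python
import math


def eigenvector_centrality(adjacency, max_iter=1000, tol=1e-10):
    """Power iteration for the leading eigenvector of the adjacency
    matrix, normalized to unit Euclidean length."""
    nodes = list(adjacency)
    x = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # One multiplication by (A + I): own score plus neighbors' scores.
        new = {n: x[n] + sum(x[m] for m in adjacency[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if sum(abs(new[n] - x[n]) for n in nodes) < tol:
            return new
        x = new
    raise RuntimeError("power iteration did not converge")


# Hypothetical star network: one hub connected to three leaves.
star = {
    "hub": {"l1", "l2", "l3"},
    "l1": {"hub"}, "l2": {"hub"}, "l3": {"hub"},
}
centrality = eigenvector_centrality(star)
```

In the star network the hub receives the highest score because it neighbors every other node, while the three leaves receive identical, lower scores.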

3.3.4. Topological measures

The omega coefficient indicates the extent to which a network exhibits a small-world property. The formula is $\omega = L_r/L - C/C_l$, where $C$ is the clustering coefficient of the network, $L$ is the average path length of the network, $L_r$ is the average path length of the simulated random network, and $C_l$ is the clustering coefficient of the simulated lattice network. The ω coefficient ranges from −∞ to +∞, where ω close to zero reflects a small-world topology. A random network is indicated by a positive ω. A lattice network is indicated by a negative ω. We compute ω on the giant component, with five rewiring iterations per edge, and five random graphs generated to calculate the simulated statistics.
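To make the sign convention concrete, the sketch below evaluates ω = L_r / L − C / C_l for precomputed statistics. The function and argument names are hypothetical and the input values are invented; in practice the random and lattice reference values come from simulation, for which NetworkX provides an `omega` function that takes the number of rewirings per edge and the number of random graphs as parameters.

```python
def omega_coefficient(path_length, clustering, random_path_length,
                      lattice_clustering):
    """Small-world omega: L_r / L - C / C_l.

    Values near zero suggest a small-world topology, positive values a
    more random topology, and negative values a more lattice-like one.
    """
    return (random_path_length / path_length
            - clustering / lattice_clustering)


# Invented statistics: paths as short as a random graph's, but low
# clustering compared with a lattice, so omega comes out positive.
omega_random_like = omega_coefficient(2.0, 0.1, 2.0, 0.5)

# Invented statistics: short paths and lattice-level clustering, so
# omega is zero.
omega_small_world = omega_coefficient(2.0, 0.5, 2.0, 0.5)
```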

The alpha exponent indicates the extent to which the network’s degree distribution exhibits a power-law fit. The algorithm is implemented via the powerlaw package in Python, where the optimal α exponent is computed for the network. α ranges from 1 to ∞, and α between 2 and 3 indicates that the network degree distribution is a power-law fit (Newman, 2005).

3.4. Comparison Between Networks Without and With Hyperauthorship

We calculate the percentage change in network measures without and with the inclusion of hyperauthored papers. The formula used to calculate the percentage change between two values (Value 1 and Value 2) is:
$$\%\ \text{change} = \frac{\text{Value 2} - \text{Value 1}}{|\text{Value 1}|} \times 100 \qquad (4)$$
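The percentage change between the two networks can be sketched as a small helper. The absolute value in the denominator is an assumption, chosen so that the sign of the result follows the direction of change even for negative measures such as ω; the input values are those reported for the two networks in Table 3.

```python
def percent_change(value_1, value_2):
    """Percentage change from value_1 (without) to value_2 (with hyperauthorship)."""
    if value_1 == 0:
        raise ValueError("percentage change is undefined when value_1 is zero")
    # Assumption: divide by |value_1| so the sign follows the direction of change.
    return (value_2 - value_1) / abs(value_1) * 100.0


papers = percent_change(18_897, 19_100)    # counts of unique papers
edges = percent_change(199_581, 441_892)   # counts of coauthorship ties
omega = percent_change(-0.295, -0.355)     # small-world coefficient
```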

We first present the hyperauthorship cutoff results based on our authorship threshold approach. The number of authors per paper ranges from 1 to 156 in our data set (mean = 5.46, median = 4, SD = 6.37; Figure 1). Because the mean number of authors exceeds the median, we expect a positively skewed distribution. As shown in Figure 3, the probability distribution is indeed skewed, and thus we opted to use Chebyshev’s inequality to estimate a suitable outlier threshold. Based on Chebyshev’s inequality at k = 3, under which at least approximately 89% of the data lie within three standard deviations of the mean, we find the upper bound at 25.85. This means that papers with approximately 26 or more authors would be considered outliers in this data set. We further evaluate the reliability of this threshold by using a cumulative percentage approach, as shown in Figure 2. We find that 90% of papers are included within a threshold of 25 authors per paper. Thus, this method suggests a cutoff that excludes all papers with more than 25 authors from analyses. Using this cutoff, we removed 203 papers that have more than 25 authors. These papers have a range of 26 to 156 authors per paper (mean = 50.88, median = 43, SD = 26.25). After removing these works, the updated data set has a range of 1 to 25 authors per paper (mean = 4.97, median = 4, SD = 3.36).
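The two threshold checks described above can be sketched as follows. This is an illustration on toy data rather than a reproduction of the reported values; the function names, the use of the population standard deviation, and the toy author counts are all assumptions.

```python
import statistics


def chebyshev_upper_bound(author_counts, k=3):
    """Upper outlier bound: mean + k standard deviations."""
    mean = statistics.fmean(author_counts)
    # Assumption: population standard deviation; the source does not say which.
    sd = statistics.pstdev(author_counts)
    return mean + k * sd


def chebyshev_min_coverage(k):
    """Minimum share of any distribution lying within k standard deviations."""
    return 1.0 - 1.0 / k ** 2


def cumulative_cutoff(author_counts, share):
    """Smallest author count c such that at least `share` of papers have <= c authors."""
    ordered = sorted(author_counts)
    needed = share * len(ordered)
    for rank, count in enumerate(ordered, start=1):
        if rank >= needed:
            return count
    return ordered[-1]


# Hypothetical data set: 100 papers, one of them with 100 authors.
counts = [2] * 50 + [4] * 40 + [10] * 9 + [100]
bound = chebyshev_upper_bound(counts, k=3)
outliers = [c for c in counts if c > bound]
```

Because Chebyshev's inequality makes no assumption about the shape of the distribution, the coverage it guarantees at k = 3 (about 89%) is a lower bound; a skewed empirical distribution can place considerably more of the data inside the bound.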

Figure 2.

Histogram depicting number of coauthors per paper for all articles included in this analysis. The red dotted line indicates the cutoff threshold in authorship at 90% cumulative percentage, indicating hyperauthorship.

Figure 3.

Histogram depicting the distribution of the number of authors per paper in this analysis, as estimated by Chebyshev’s inequality (blue line) and cumulative normal distribution (orange line).


The structural and topological characteristics of the networks are impacted to varying degrees as a result of excluding versus including hyperauthored papers (Table 3). The coauthorship network without hyperauthored papers was projected based on 18,897 unique papers. The network including hyperauthored papers was projected based on 19,100 papers, 203 more than the first network. Although including hyperauthored papers results in a minimal percentage change in the number of papers between the two networks (1.07%), the resulting changes to network size are notable. There is a 17% (n = 5,191) increase in the number of authors when hyperauthored papers are included, which resulted in a notable increase of 121% (n = 242,311) in coauthorship ties. The density of the network also increases by 75% given the rise in the number of edges; however, there is no change in average clustering across the two networks. Because the networks are sparsely connected, which makes density an unreliable measure of cohesion, we also report average degree centrality and find that the increase is likewise notable, at 89%, when hyperauthored papers are included. Although the number of edges increases when hyperauthored papers are included, average clustering, which reflects the share of closed triangles around each node, remains the same. This indicates that more edges do not necessarily lead to a higher level of triadic closure among the authors. There is an increase in the average shortest path length (+16%) and a decrease in the number of components (from 15 to 14) in the network with hyperauthored papers. Interestingly, the size of the largest (giant) component increases by a magnitude (+122%) similar to the increase in the number of edges. In particular, the giant component in the network without hyperauthorship excludes 831 edges, and the giant component in the network with hyperauthorship excludes 729 edges, suggesting that the network with hyperauthored papers has slightly fewer pendant edges that are not connected to the rest of the network.
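The density and average degree figures follow directly from the node and edge counts. The sketch below assumes a simple undirected graph without self-loops, which is an assumption about how the projection was built; the helper names are hypothetical.

```python
def density(n_nodes, n_edges):
    """Density of a simple undirected graph: 2m / (n(n - 1))."""
    return 2.0 * n_edges / (n_nodes * (n_nodes - 1))


def average_degree(n_nodes, n_edges):
    """Mean number of ties per node in an undirected graph: 2m / n."""
    return 2.0 * n_edges / n_nodes


# Node and edge counts as reported for the two networks.
without = {"nodes": 30_467, "edges": 199_581}
with_hyper = {"nodes": 35_658, "edges": 441_892}

deg_without = average_degree(without["nodes"], without["edges"])
deg_with = average_degree(with_hyper["nodes"], with_hyper["edges"])
dens_without = density(without["nodes"], without["edges"])
dens_with = density(with_hyper["nodes"], with_hyper["edges"])
```

Under these assumptions the average degrees agree with the reported 13.101 and 24.785, and the densities round to the reported 0.0004 and 0.0007. Note that a percentage change computed from densities rounded to four decimals can differ noticeably from one computed from the unrounded values, which is one more reason to prefer average degree for sparse networks.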

Table 3.

Network descriptives for coauthorship networks without and networks with hyperauthorship

| Network measures | Without hyperauthors | With hyperauthors | % change |
| --- | --- | --- | --- |
| # of unique papers | 18,897 | 19,100 | +1.07 |
| # of nodes (authors) | 30,467 | 35,658 | +17.04 |
| # of edges (coauthorship) | 199,581 | 441,892 | +121.41 |
| Density | 0.0004 | 0.0007 | +75 |
| Average clustering | 0.854 | 0.854 | |
| Average path length (of subgraph) | 4.27 | 4.95 | +15.93 |
| Size of giant component | 198,750 | 441,163 | +121.97 |
| # of components | 15 | 14 | −6.67 |
| Clauset-Newman-Moore | 292 | 307 | +5.14 |
| Clique percolation | 945 | 910 | −3.70 |
| Louvain modularity | 74 | 68 | −8.11 |
| Small-worldliness (ω) | −0.295 | −0.355 | −20.339 |
| Power-law exponent (α) | 2.919 | 4.867 | +66.735 |
| Average degree centrality (unweighted) | 13.101 | 24.785 | +89.184 |
| Average eigenvector centrality (unweighted) | 0.003 | 0.0007 | −76.667 |
| Average betweenness centrality (unweighted) | 0.0001 | 0.000 | −100 |
| Average closeness centrality (unweighted) | 0.231 | 0.240 | +3.896 |

In terms of topology, the networks both without and with hyperauthored papers exhibit a lattice-like structure as opposed to a small-world structure (negative ω values). The network without hyperauthored papers (α = 2.919) exhibits a closer fit to a power-law topology (i.e., a hub-and-spokes structure, consistent with Newman’s (2005) finding) than the network with hyperauthored papers (α = 4.867). This result also highlights how hyperauthorship alters the degree distribution and thereby changes the topology of the network.

Figures 4 and 5 show the distribution (log-normal) of the number of coauthors of an author’s egonetwork and a paper’s egonetwork, respectively. The inclusion of hyperauthored papers notably impacts the right tail of the distribution, where a number of authors had a large number of coauthors. As a result, the slope of the right tail in the (b) network is less steep than in the (a) network. The impact is also visible in the paper egonetwork distribution, with more oscillations along the right tail. Altogether, this shows that hyperauthorship strongly shapes the degree distribution of the coauthorship network, owing to the high variability of coauthorship counts when hyperauthored papers are included.

Figure 4.

Log-normal distribution of the number of coauthors of a given author: (a) without hyperauthors; (b) with hyperauthors.

Figure 5.

Log-normal distribution of the number of authors of a given paper: (a) without hyperauthors; (b) with hyperauthors.


We observe notable differences in average centrality measures when hyperauthored papers are included, as shown in Table 3. Both average degree and closeness centrality increased, by 89% and 4%, respectively. Average eigenvector centrality and betweenness centrality decreased substantially, by 77% and 100%, respectively. It is important to note that although the absolute changes in average centrality values seem small, their magnitude is notable given that the values are averages over a large number of observations.

We further examined how centrality measures are impacted by the presence of hyperauthored papers when different weighting functions are used in the calculation. Table 4 shows the average centrality measures based on full counting (i.e., “weighted”) and two partial counting methods, Newman’s and Jaccard’s functions. We also include the unweighted measures to compare with the weighted counterparts. The percentage change is reported to show the difference in measures when hyperauthored papers are included, and the optimal weighting function is the one that minimizes this percentage difference. We find that for degree centrality, Newman weighting is most effective in minimizing the difference in measures across the two networks. For betweenness centrality, full counting is preferred, as there is no difference in betweenness centrality across the two networks when this weighting function is used. For closeness centrality, Jaccard weighting and the unweighted measure are preferred, as they show the smallest differences in closeness centrality when hyperauthored papers are included. For eigenvector centrality, the full counting method is preferred, as there is no change in centrality in the presence of hyperauthorship.
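The selection rule stated above (prefer the weighting whose percentage change is closest to zero) can be sketched as follows, using the percentage changes reported in Table 4. Reading the blank cells as 0.0 is an assumption based on the text's description of "no difference"; the dictionary layout and function name are hypothetical.

```python
# Percentage changes as reported in Table 4; blank cells are read as 0.0
# (the text describes them as "no difference"), which is an assumption.
PCT_CHANGE = {
    "degree": {"unweighted": 89.18, "full": 80.56, "newman": -5.08, "jaccard": 152.97},
    "betweenness": {"unweighted": -100.0, "full": 0.0, "newman": -33.33, "jaccard": -50.0},
    "closeness": {"unweighted": 3.90, "full": 5.97, "newman": 51.85, "jaccard": 3.06},
    "eigenvector": {"unweighted": -76.67, "full": 0.0, "newman": -33.33, "jaccard": 100.0},
}


def most_stable_weighting(changes):
    """Return the weighting whose percentage change is closest to zero."""
    return min(changes, key=lambda name: abs(changes[name]))


preferred = {measure: most_stable_weighting(changes)
             for measure, changes in PCT_CHANGE.items()}
```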

Table 4.

Average centrality measures with various weighting functions for networks without and networks with hyperauthorship

| Network measures | Without hyperauthors | With hyperauthors | % change |
| --- | --- | --- | --- |
| Average degree centrality (unweighted) | 13.10 | 24.78 | +89.18 |
| Average degree centrality (weighted) | 19.23 | 34.72 | +80.56 |
| Average degree centrality (Newman weighted) | 3.05 | 2.90 | −5.08 |
| Average degree centrality (Jaccard weighted) | 4.61 | 11.67 | +152.97 |
| Average betweenness centrality (unweighted) | 0.0001 | 0.000 | −100 |
| Average betweenness centrality (weighted) | 0.0002 | 0.0002 | |
| Average betweenness centrality (Newman weighted) | 0.0003 | 0.0002 | −33.33 |
| Average betweenness centrality (Jaccard weighted) | 0.0002 | 0.0001 | −50 |
| Average closeness centrality (unweighted) | 0.23 | 0.24 | +3.90 |
| Average closeness centrality (weighted) | 0.45 | 0.46 | +5.97 |
| Average closeness centrality (Newman weighted) | 1.52 | 2.30 | +51.85 |
| Average closeness centrality (Jaccard weighted) | 21.61 | 22.27 | +3.06 |
| Average eigenvector centrality (unweighted) | 0.003 | 0.0007 | −76.67 |
| Average eigenvector centrality (weighted) | 0.0003 | 0.0003 | |
| Average eigenvector centrality (Newman weighted) | 0.0003 | 0.0002 | −33.33 |
| Average eigenvector centrality (Jaccard weighted) | 0.0002 | 0.0004 | +100 |
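The three weighting schemes compared in Table 4 can be illustrated on a toy paper-author list. The definitions below are the commonly used ones and are assumed here rather than taken from the study's code: full counting sums the number of shared papers, Newman weighting sums 1/(n_k − 1) over shared papers with n_k authors, and Jaccard weighting divides the number of shared papers by the number of papers written by either author. All function and variable names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_weights(papers):
    """Build full-counting, Newman, and Jaccard edge weights.

    papers maps a paper ID to the list of its author IDs.
    """
    full = defaultdict(float)     # number of papers a pair shares
    newman = defaultdict(float)   # sum of 1 / (n_k - 1) over shared papers
    papers_of = defaultdict(set)  # author -> set of paper IDs

    for paper_id, authors in papers.items():
        authors = sorted(set(authors))
        for a in authors:
            papers_of[a].add(paper_id)
        for pair in combinations(authors, 2):
            full[pair] += 1.0
            newman[pair] += 1.0 / (len(authors) - 1)

    # Jaccard: shared papers divided by papers written by either author.
    jaccard = {
        (a, b): len(papers_of[a] & papers_of[b]) / len(papers_of[a] | papers_of[b])
        for (a, b) in full
    }
    return dict(full), dict(newman), jaccard


def weighted_degree(weights, node):
    """Sum of edge weights incident to node."""
    return sum(w for pair, w in weights.items() if node in pair)


toy = {"p1": ["a", "b"], "p2": ["a", "b", "c"]}
full, newman, jaccard = coauthor_weights(toy)
```

Under the Newman definition assumed here, each multi-author paper contributes exactly 1 to the weighted degree of each of its authors, regardless of how many authors it has. A hyperauthored paper therefore adds no more to an author's weighted degree than a two-author paper, which is consistent with the small change in Newman-weighted degree reported in Table 4.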

Altogether, the findings suggest that including hyperauthored papers distorts microlevel and egocentric measures while maintaining relative stability for the network as a whole. This conclusion is based on the observation that while there is little change in whole-network structure (despite growth in network size), there is significant change in the average centrality measures at the microlevel, indicating that the inclusion of hyperauthored papers can greatly affect the position and influence of individual authors within the network.

4.1. Egocentric Network Case Study

Given our initial findings that hyperauthored papers produce meaningful structural influences on egocentric measures, we conducted an egocentric case study of a specific set of authors to explore how their positions in the network changed due to the inclusion of hyperauthorship. The authors were selected according to their importance in the network as measured by degree centrality (selection criteria shown in Table 5). Degree centrality is a reliable indicator of power and prestige in our network, as authors with high degree centrality are more likely to benefit, in terms of knowledge and skills, from their immediate coauthors and those coauthors’ respective coauthorship networks (Badar, Frantz, & Jabeen, 2016; Li, Liao, & Yen, 2013). A high number of connections within the network also suggests that these authors are actively collaborating and contributing to the field.

Table 5.

Egocentric networks that were impacted by the inclusion of hyperauthorship

| | With hyperauthor: high centrality | With hyperauthor: low centrality |
| --- | --- | --- |
| Without hyperauthor: high centrality | Node 67 | Node 135 |
| Without hyperauthor: low centrality | Node 16 | Node 3918 |
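The egonetwork sizes and degree rankings used in the case study can be computed along the following lines. This is a sketch on a toy adjacency dictionary; counting the ego's own ties among the egonetwork edges and letting tied nodes share the better rank are assumptions, and the function names are hypothetical.

```python
def ego_network(adj, ego):
    """Return the alters of ego and the edges among ego and its alters."""
    alters = set(adj[ego])
    members = alters | {ego}
    edges = {
        frozenset((u, v))
        for u in members
        for v in adj[u]
        if v in members and u != v
    }
    return alters, edges


def degree_rank(adj, node):
    """Rank of node by degree (1 = highest); ties share the better rank."""
    degree = len(adj[node])
    return 1 + sum(1 for v in adj if len(adj[v]) > degree)


toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
alters, edges = ego_network(toy, "a")
```

Comparing `degree_rank` for the same author in the networks built without and with hyperauthored papers gives the rank shifts discussed for each node below; an author's rank can fall even when their own egonetwork is unchanged, because other authors' degrees rise.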

Node 67 (egonetworks in Figure 6) meets our selection criterion of high centrality in both the network without and the network with hyperauthorship (rank one and rank three, respectively). This author produced 747 papers in total, two of which were hyperauthored (one with 27 coauthors and one with 49 coauthors). The egonetwork consisted of 850 unique coauthors and 5,425 edges between them when hyperauthorship was excluded. When hyperauthorship was included, the network grew to 916 unique coauthors and 6,896 edges between them. As shown in Figure 6(a) and (b), the egonetwork size increased slightly, with the appearance of a large cluster of nodes resulting from hyperauthorship.

Figure 6.

Node 67 egonetworks. Node colors: ego – light blue; alter – dark blue. Nodes sized by degree centrality. Without hyperauthors: nodes = 850, edges = 5,425; With hyperauthors: nodes = 916, edges = 6,896. (a) Shows network without hyperauthors, weighted using full counting. (b)–(d) Show network with hyperauthors used, weighted using (b) full counting, (c) fractional counting using the Newman algorithm, and (d) the Jaccard algorithm.


Node 135 (egonetworks in Figure 7) provides a case where including hyperauthorship negatively impacted the ego’s centrality in the coauthorship network. This node was ranked 28th in degree centrality when excluding hyperauthored works, but dropped to 80th place when hyperauthorship was included. This is because the author was not involved in any hyperauthored papers. While they published 142 papers (range of coauthors: 0–21) and were central in the network overall, other authors who benefited from hyperauthorship surpassed this author’s centrality. The author’s own network size did not change: it has 364 coauthors in both networks, with 2,297 edges in the network without hyperauthorship and 2,298 edges in the network with hyperauthorship. The one additional edge in the network with hyperauthorship represents two of the ego’s coauthors who published a hyperauthored paper together that Node 135 was not a part of.

Figure 7.

Node 135 egonetworks. Node colors: ego – light blue; alter – dark blue. Nodes sized by degree centrality. Without hyperauthors: nodes = 364, edges = 2,297; with hyperauthors: nodes = 364, edges = 2,298. (a) Shows network without hyperauthors, weighted using full counting. (b)–(d) Show network with hyperauthors used, weighted using (b) full counting, (c) fractional counting using the Newman algorithm, and (d) the Jaccard algorithm.


Node 16 (egonetworks in Figure 8) exemplifies how hyperauthorship can benefit an author’s position in the network. The author had 394 papers that were not hyperauthored (range of coauthors: 0–24), and three papers that were hyperauthored, with 49, 70, and 155 coauthors, respectively. In the network without hyperauthorship, they had 276 coauthors and 1,343 edges between them. Including their three additional papers in the network with hyperauthorship grew their network to 502 coauthors and 17,284 edges. As a result, this author rose from 52nd to 28th in degree centrality when hyperauthored works were included.

Figure 8.

Node 16 egonetworks. Node colors: ego – light blue; alter – dark blue. Nodes sized by degree centrality. Without hyperauthors: nodes = 276, edges = 1,343; with hyperauthors: nodes = 502, edges = 17,284. (a) Shows network without hyperauthors, weighted using full counting. (b)–(d) Show network with hyperauthors used, weighted using (b) full counting, (c) fractional counting using the Newman algorithm, and (d) the Jaccard algorithm.


Node 3918 (egonetworks in Figure 9) exemplifies a case in which hyperauthorship may not influence the overall centrality of an author in the network. This author produced six papers that were not hyperauthored (range of coauthors: 2–5). Their egonetwork contained 13 unique coauthors and 36 edges between them. When their one hyperauthored paper (with 54 coauthors) was added to the network, their network size grew to 64 unique coauthors and 1,521 edges between them. Even though Node 3918’s ranking in terms of degree centrality relative to other authors remained largely unchanged with the inclusion of hyperauthorship, their egonetwork grew substantially when the hyperauthored paper was included.

Figure 9.

Node 3918 egonetworks. Node colors: ego – light blue; alter – dark blue. Nodes sized by degree centrality. Without hyperauthors: nodes = 13, edges = 36; with hyperauthors: nodes = 64, edges = 1,521. (a) Shows network without hyperauthors, weighted using full counting. (b)–(d) Show network with hyperauthors used, weighted using (b) full counting, (c) fractional counting using the Newman algorithm, and (d) the Jaccard algorithm.


4.1.1. Impact of weighting functions

Having demonstrated the influence of hyperauthored works on standard metrics of centrality, we next examined the impact of weighting functions on the four egonetworks’ centrality measures and determined whether certain approaches to weighting centrality assessments offer an optimal method to curtail the inflation effects associated with hyperauthorship. We find that weighting by full and fractional counting substantially changes the four average centrality values of the egonetworks in our case study.

As shown in Tables S1 and S2 in the Supplementary material, the choice of weighting function matters for all measures of centrality. In the case of node 67, weighting by full counting yields the highest average degree centrality for the networks both without and with hyperauthors. On the other hand, weighting by fractional counting (Newman and Jaccard) brings the average degree centrality down substantially, to levels even lower than the unweighted degree measures for both networks. Specifically, centrality measures with Jaccard weighting are often the lowest compared to measures from other weighting functions, with the exception of average closeness centrality. The notably high average closeness centrality (173 in the network without hyperauthorship and 180 in the network with hyperauthorship) suggests that there are many authors that receive high scores because their immediate neighbors are well connected. This effect is magnified in the closeness centrality measure when Jaccard is used as the edge weight. For nodes 135, 16, and 3918, we observe similar effects of Jaccard weighting on closeness centrality, where the average closeness measures are notably inflated compared to the other two weighting functions.

We also examined which weighting function(s) are effective in minimizing the percentage change between the network without hyperauthorship and the network with hyperauthorship. This allowed us to determine the optimal weighting function to mitigate the inflating effects of hyperauthorship. We focus on node 16 and node 3918 (results in Table S2 of the Supplementary material) in this analysis because the effects of hyperauthorship were most pronounced in their egonetworks. For node 16, Newman weighting was most effective for degree centrality, as the average degree centrality actually decreased (−23%) when hyperauthorship was included. For betweenness centrality, the unweighted measure was preferred as it best minimizes the percentage change between the two networks. For closeness centrality, the unweighted measure yields the lowest percentage change while the Newman weighted measure yields the highest percentage change. For eigenvector centrality, Newman weighting was the only function that yields a decrease (−44%), while the measures under the other weighting functions increased due to hyperauthorship.

We observe similar patterns of results for node 3918, with the exception of betweenness centrality and eigenvector centrality. Jaccard weighting is the most effective for betweenness centrality, yielding the smallest percentage decrease (−79%). Eigenvector centrality with Jaccard weighting is also preferred, as the impact of hyperauthorship is smaller (−36%) than with the other weighting functions. Newman weighting was the most effective in minimizing the inflating effects of hyperauthorship for degree centrality (−52%), but the least effective for closeness centrality (+559%).

4.2. Discussion and Conclusion

Our structural analysis and egocentric analysis of coauthorship networks revealed notable effects that hyperauthorship has on certain centrality measures. First, including even a small number of hyperauthored papers created noise in the overall distribution of the number of authors of a given paper (shown in Figure 5) as well as in the number of coauthors (shown in Figure 4). Second, hyperauthorship inflated the average degree and closeness measures, and deflated the average eigenvector and betweenness centrality measures. Structural measures, including network size, giant component size, density, and average path length, were inflated as well. Interestingly, average clustering remained unchanged, indicating that although the number of edges in the network roughly doubled when hyperauthored papers were included, the network is not more connected in terms of local neighborhoods.

Our approach for establishing a hyperauthorship threshold yielded an appropriate cutoff point of 25 authors in our data set. This threshold value is similar to the threshold of 20 authors set in the studies by Fegley and Torvik (2013) and Morris and Goldstein (2007). On the other hand, this threshold is substantially smaller than the values of 100 authors and 200 authors determined in the studies by Cronin (2001) and Milojević (2010), respectively. We suspect that our cutoff value differs from those in the literature due to differences in the bibliographic databases from which the data were collected and in the research fields that comprise the data sets. Our data were collected from an internal API that retrieves bibliographic information from Scopus, while others have used PubMed (Fegley & Torvik, 2013), Web of Science (Morris & Goldstein, 2007), or Thomson Reuters (Milojević, 2010). As exemplified in the study by Glänzel and Thijs (2004), coauthorship dynamics differ significantly across research fields. Our data contain publications specific to the field of genomics, while other studies focus on nanotechnology (Milojević, 2010), library and information science (Cronin, 2001; Morris & Goldstein, 2007), and biomedicine (Fegley & Torvik, 2013). The observed differences also suggest that the appropriate hyperauthorship cutoff point may depend on the distribution of the data set. Therefore, a generalizable approach like ours for handling hyperauthorship data would be beneficial, as it could be applied across different disciplines and sources of bibliometric data.

We also examined whether weighting functions mitigate the impacts of hyperauthorship on centrality network metrics. We compared four different weighting scenarios (i.e., unweighted, weighted based on full counting, weighted based on the Newman method, and weighted based on the Jaccard method) for the entire network and four egocentric networks. The impact of weighting functions at the whole network level was slightly different than the impact at the egocentric network level, and varied based on the centrality measure. For degree centrality, Newman weighting is preferred at both the whole network and egocentric network levels. On the other hand, eigenvector centrality with weighting based on full counting can best curtail the effects of hyperauthorship for both network levels. For betweenness centrality, weighting by full counting is preferred at the whole network level, whereas fractional counting (Newman and Jaccard) is preferred for egocentric networks. For closeness centrality, all weighting methods inflated the measure, and thus no weighting is preferred. Overall, the choices in weighting methods have an observable impact on most centrality measures (except closeness centrality) and their resilience to the inflated effects of hyperauthorship.

Our study contributes to research in bibliometrics and scientific collaboration in specific ways. Our network-analytic approach demonstrates the significant structural impacts that even a small proportion of hyperauthored papers produces at multiple levels of a network, from the egocentric level to the whole network. In particular, we find that degree centrality and closeness centrality are overestimated when hyperauthored papers are included, whereas betweenness and eigenvector centrality are notably underestimated. Thus, our work should be taken as a cautionary message for scholars who are interested in using coauthorship networks to study collaboration. We encourage analysts and readers to think carefully about the nature of the relationships they are studying before deciding whether to include hyperauthored works in their analyses. Furthermore, our findings show that network metrics that are typically used by academic institutions and funders as indicators of research success are susceptible to distortion in the presence of hyperauthorship. In particular, these metrics have been used to infer researchers’ productivity, collaboration effectiveness, and structural positioning relative to peers in their respective fields, and may be used to guide decisions related to promotions, funding, and research rankings (Cummings & Cross, 2003; Larivière & Gingras, 2010). These insights underscore the need for a more nuanced approach to evaluating research productivity, particularly as hyperauthorship becomes increasingly prevalent, thereby encouraging discussions on how best to account for these complexities in research assessments.

Our analysis offers several options for how to handle hyperauthored works in analyses. First, if hyperauthored works are unnecessary for analysis (e.g., because they represent such a small proportion of a data set), or are semantically distinct from the object of study (e.g., if the analyst wishes to use coauthorship as a proxy for close collaborative relationships), we recommend considering removing them from the data set. Our paper offers a generalized and mathematically grounded approach for doing so. Second, if hyperauthored works are important for analysis, we encourage analysts to be mindful when interpreting network statistics that are likely to be inflated by these works. The removal of hyperauthored papers is not ideal in this context, as it may distort the network structure that could otherwise explain certain macro-level mechanisms driving collaborations. Hence, we propose multiple weighting functions to mitigate the impact of hyperauthorship, as opposed to removing hyperauthored papers, and find that weighting based on full counting and Newman-based fractional counting are preferred.

While this study has yielded insights into the impact of hyperauthorship on the collaborative network structures of a group of researchers in a large biological research institute, we acknowledge several limitations in our study design and suggest areas for future research. First, our study relies on data obtained from a research center in genomics, which may limit the generalizability of our findings to other fields and research contexts. In future research, we will broaden the scope of our analysis by incorporating data from other research centers to validate that our method can be reliably applied to any bibliometric data set. Second, we will compare our threshold approach with existing methods for reducing network size without removing hyperauthored works, such as extracting network backbones, skeletons, cuts, cores, and islands (Batagelj, Doreian et al., 2014; Batagelj & Zaveršnik, 2011). Third, we will examine the impact of hyperauthorship on both two-mode paper-author networks and one-mode coauthorship networks in terms of network cohesion, centrality, and topology. Finally, we will enhance our Newman- and Jaccard-based weighting methodologies by incorporating the weights among the coauthors of the focal authors, including those within the 2-hop and 3-hop neighborhoods. Our initial approach represents a first step in demonstrating how weights can be effectively incorporated for immediate neighbors in the authorship network. The weighting methods proposed in this study may be used to develop metrics for weighted common neighbors in coauthorship networks.

Collectively, our findings are directly relevant to researchers who want to use bibliometrics and network measures to make inferences about scientific collaboration. The removal of hyperauthorship is recommended especially before construction and analysis of coauthorship networks in order to avoid misrepresentation of network density, degree distribution, and centrality ranking of coauthors. Despite the relatively small number of hyperauthored papers in our data set, their impact on the coauthorship network structure, which serves as a proxy for understanding collaboration, is significant. This effect reflects the changing nature of guidelines and norms that determine who qualifies as an author in a scientific publication (Cronin, 2001). As a result, researchers should be cautious about using coauthorship as a proxy for scientific collaboration, as the nature and extent of each author’s contribution in a collaboration can vary widely.

The authors are grateful to the research institute whose data are featured herein for providing staff and financial support to collect and analyze the data. Multiple research assistants enabled this work, including Rachel Rosenberg, Tianyi Liang, and Petiya Stoichkova. We thank Dr. Yi-Yun Cheng for their valuable feedback on drafts of this manuscript. This work was also funded by a seed grant provided by UIUC’s campus research board.

Ly Dinh: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Validation, Writing—original draft, Writing—review & editing. William C. Barley: Conceptualization, Funding acquisition, Investigation, Resources, Writing—original draft, Writing—review & editing. Lauren Johnson: Conceptualization, Resources, Validation, Writing—review & editing. Brian F. Allan: Conceptualization, Funding acquisition, Investigation, Resources, Writing—original draft, Writing—review & editing.

The authors have no competing interests.

This work was funded by a seed grant provided by the University of Illinois Urbana-Champaign’s campus research board.

The data and code used in this project are openly accessible to facilitate reproducibility and further adoption. In accordance with ethical considerations and IRB regulations, all sensitive or personally identifiable information has been removed from the shared data to protect the privacy of the researchers in the data set. The data are therefore provided as a de-identified network edgelist in which papers and authors are given unique IDs. The code is licensed under the MIT License and may be freely used and modified, with attribution to the original work: Dinh, L., Barley, W. C., Johnson, L., & Allan, B. F. (2024). Dataset for manuscript: Hyperauthored papers disporportionately amplify important egocentric network metrics. Zenodo. https://doi.org/10.5281/zenodo.10668904.
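For readers who wish to work with an edgelist of this kind, the following minimal Python sketch groups paper-author rows into bylines. The column names `paper_id` and `author_id`, the CSV layout, and the sample rows are assumptions made for illustration; the files in the Zenodo record may be organized differently.

```python
import csv
import io
from collections import defaultdict


def read_edgelist(handle):
    """Group (paper, author) rows into a dict of paper ID -> set of author IDs.

    Assumes a CSV with header columns 'paper_id' and 'author_id'
    (hypothetical names used only in this sketch).
    """
    paper_authors = defaultdict(set)
    for row in csv.DictReader(handle):
        paper_authors[row["paper_id"]].add(row["author_id"])
    return dict(paper_authors)


# Made-up sample rows standing in for a de-identified edgelist file.
sample = io.StringIO("paper_id,author_id\nP1,A1\nP1,A2\nP2,A2\n")
paper_authors = read_edgelist(sample)

# Number of authors per paper, the quantity a hyperauthorship cutoff acts on.
byline_sizes = {paper: len(authors) for paper, authors in paper_authors.items()}
```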

Abramo, G., D’Angelo, C. A., & Rosati, F. (2013). The importance of accounting for the number of co-authors and their order when assessing research performance at the individual level in the life sciences. Journal of Informetrics, 7(1), 198–208.
Ahmed, S. I., Cambo, S. A., Lagoze, C., & Velden, T. (2013). Toward a mesoscopic analysis of the temporal evolution of scientific collaboration networks. In iConference 2013 Proceedings (pp. 878–881).
Akbaritabar, A. (2021). A quantitative view of the structure of institutional scientific collaborations using the example of Berlin. Quantitative Science Studies, 2(2), 753–777.
Badar, K., Frantz, T. L., & Jabeen, M. (2016). Research performance and degree centrality in co-authorship networks: The moderating role of homophily. Aslib Journal of Information Management, 68(6), 756–771.
Barley, W. C., Dinh, L., Workman, H., & Fang, C. (2022). Exploring the relationship between interdisciplinary ties and linguistic familiarity using multilevel network analysis. Communication Research, 49(1), 33–60.
Batagelj, V. (2020). On fractional approach to analysis of linked networks. Scientometrics, 123(2), 621–633.
Batagelj, V., & Cerinšek, M. (2013). On bibliographic networks. Scientometrics, 96(3), 845–864.
Batagelj, V., Doreian, P., Ferligoj, A., & Kejzar, N. (2014). Understanding large temporal networks and spatial networks: Exploration, pattern searching, visualization and network evolution (Vol. 2). Chichester: John Wiley & Sons.
Batagelj, V., & Zaveršnik, M. (2011). Fast algorithms for determining (generalized) core groups in social networks. Advances in Data Analysis and Classification, 5(2), 129–145.
Birnholtz, J. P. (2006). What does it mean to be an author? The intersection of credit, contribution, and collaboration in science. Journal of the American Society for Information Science and Technology, 57(13), 1758–1770.
Bordons, M., Aparicio, J., Gonzalez-Albo, B., & Diaz-Faes, A. A. (2015). The relationship between the research performance of scientists and their position in co-authorship networks in three fields. Journal of Informetrics, 9(1), 135–144.
Bordons, M., & Gomez, I. (2000). Collaboration networks in science. In B. Cronin & H. B. Atkins (Eds.), The web of knowledge: A Festschrift in honor of Eugene Garfield (pp. 197–213). Medford, NJ: Information Today, Inc. & ASIS.
Borgatti, S. P., & Halgin, D. S. (2011). Analyzing affiliation networks. In P. J. Carrington & J. Scott (Eds.), The Sage handbook of social network analysis (pp. 417–433). Thousand Oaks, CA: Sage.
Borner, K., Dall’Asta, L., Ke, W., & Vespignani, A. (2005). Studying the emerging global brain: Analyzing and visualizing the impact of co-authorship teams. Complexity, 10(4), 57–67.
Brandão, M. A., & Moro, M. M. (2017). The strength of co-authorship ties through different topological properties. Journal of the Brazilian Computer Society, 23, 5.
Breiger, R. L. (1974). The duality of persons and groups. Social Forces, 53(2), 181–190.
Burt, R. S. (2004). Structural holes and good ideas. American Journal of Sociology, 110(2), 349–399.
Collins, H., & Evans, R. (2015). Expertise revisited, Part I—Interactional expertise. Studies in History and Philosophy of Science Part A, 54, 113–123.
Costanza, R., & Kubiszewski, I. (2012). The authorship structure of “ecosystem services” as a transdisciplinary field of scholarship. Ecosystem Services, 1(1), 16–25.
Costas, R., van Leeuwen, T. N., & Bordons, M. (2010). A bibliometric classificatory approach for the study and assessment of research performance at the individual level: The effects of age on productivity and impact. Journal of the American Society for Information Science and Technology, 61(8), 1564–1581.
Cronin, B. (2001). Hyperauthorship: A postmodern perversion or evidence of a structural shift in scholarly communication practices? Journal of the American Society for Information Science and Technology, 52(7), 558–569.
Cummings, J. N., & Cross, R. (2003). Structural properties of work groups and their consequences for performance. Social Networks, 25(3), 197–210.
Dehdarirad, T., & Nasini, S. (2017). Research impact in co-authorship networks: A two-mode analysis. Journal of Informetrics, 11(2), 371–388.
Dinh, L., & Cheng, Y.-Y. (2018). Middle of the (by)line: Examining hyperauthorship networks in the Human Genome Project. Proceedings of the Association for Information Science and Technology, 55(1), 790–791.
Farber, M., & Ao, L. (2022). The Microsoft Academic Knowledge Graph enhanced: Author name disambiguation, publication classification, and embeddings. Quantitative Science Studies, 3(1), 51–98.
Fegley, B. D., & Torvik, V. I. (2013). Has large-scale named-entity network analysis been resting on a flawed assumption? PLOS ONE, 8(7), e70299.
Fetscherin, M., & Heinrich, D. (2015). Consumer brand relationships research: A bibliometric citation meta-analysis. Journal of Business Research, 68(2), 380–390.
Franceschet, M., & Costantini, A. (2010). The effect of scholar collaboration on impact and quality of academic papers. Journal of Informetrics, 4(4), 540–553.
Gauffriau, M. (2017). A categorization of arguments for counting methods for publication and citation indicators. Journal of Informetrics, 11(3), 672–684.
Gauffriau, M. (2021). Counting methods introduced into the bibliometric research literature 1970–2018: A review. Quantitative Science Studies, 2(3), 932–975.
Glänzel, W., & Thijs, B. (2004). Does co-authorship inflate the share of self-citations? Scientometrics, 61(3), 395–404.
Griffin, D. J., Arth, Z. W., Hakim, S. D., Britt, B. C., Gilbreath, J. N., … Bolkan, S. (2021). Collaborations in communication: Authorship credit allocation via a weighted fractional count procedure. Scientometrics, 126(5), 4355–4372.
Ioannidis, J. P. (2008). Measuring co-authorship and networking-adjusted scientific impact. PLOS ONE, 3(7), e2778.
Kennedy, M. S., Barnsteiner, J., & Daly, J. (2014). Honorary and ghost authorship in nursing publications. Journal of Nursing Scholarship, 46(6), 416–422.
Kim, J. (2019). Scale-free collaboration networks: An author name disambiguation perspective. Journal of the Association for Information Science and Technology, 70(7), 685–700.
Koltun, V., & Hafner, D. (2021). The h-index is no longer an effective correlate of scientific reputation. PLOS ONE, 16(6), e0253397.
Larivière, V., & Gingras, Y. (2010). On the relationship between interdisciplinarity and scientific impact. Journal of the American Society for Information Science and Technology, 61(1), 126–131.
Leydesdorff, L. (2007). Betweenness centrality as an indicator of the interdisciplinarity of scientific journals. Journal of the American Society for Information Science and Technology, 58(9), 1303–1319.
Li, E. Y., Liao, C. H., & Yen, H. R. (2013). Co-authorship networks and research impact: A social capital perspective. Research Policy, 42(9), 1515–1530.
Lundberg, J. T., & Brommels, M. (2006). Collaboration uncovered: Exploring the adequacy of measuring university-industry collaboration through co-authorship and funding. Scientometrics, 69(3), 575–589.
Milojević, S. (2010). Modes of collaboration in modern science: Beyond power laws and preferential attachment. Journal of the American Society for Information Science and Technology, 61(7), 1410–1423.
Morillo, F., Bordons, M., & Gómez, I. (2003). Interdisciplinarity in science: A tentative typology of disciplines and research areas. Journal of the American Society for Information Science and Technology, 54(13), 1237–1249.
Morris, S. A., & Goldstein, M. L. (2007). Manifestation of research teams in journal literature: A growth model of papers, authors, collaboration, coauthorship, weak ties, and Lotka’s law. Journal of the American Society for Information Science and Technology, 58(12), 1764–1782.
Naik, C., Sugimoto, C. R., Larivière, V., Leng, C., & Guo, W. (2023). Impact of geographic diversity on citation of collaborative research. Quantitative Science Studies, 4(2), 442–465.
Nemeth, C. J., & Nemeth-Brown, B. (2003). Better than individuals. The potential benefits of dissent and diversity for group creativity. In P. B. Paulus & B. A. Nijstad (Eds.), Group creativity: Innovation through collaboration (pp. 63–84). Oxford University Press.
Newman, M. E. J. (2001). Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E, 64(1), 016132.
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–351.
Pan, R. K., Sinha, S., Kaski, K., & Saramäki, J. (2012). The evolution of interdisciplinarity in physics research. Scientific Reports, 2(1), 551.
Perianes-Rodriguez, A., Waltman, L., & van Eck, N. J. (2016). Constructing bibliometric networks: A comparison between full and fractional counting. Journal of Informetrics, 10(4), 1178–1195.
Porter, A., Cohen, A., Roessner, J. D., & Perreault, M. (2007). Measuring researcher interdisciplinarity. Scientometrics, 72(1), 117–147.
Porter, A., & Rafols, I. (2009). Is science becoming more interdisciplinary? Measuring and mapping six research fields over time. Scientometrics, 81(3), 719–745.
Rafols, I., Leydesdorff, L., O’Hare, A., Nightingale, P., & Stirling, A. (2012). How journal rankings can suppress interdisciplinary research: A comparison between innovation studies and business & management. Research Policy, 41(7), 1262–1282.
Schummer, J. (2004). Multidisciplinarity, interdisciplinarity, and patterns of research collaboration in nanoscience and nanotechnology. Scientometrics, 59(3), 425–465.
Sinatra, R., Deville, P., Szell, M., Wang, D., & Barabási, A.-L. (2015). A century of physics. Nature Physics, 11(10), 791–796.
Sivertsen, G., Rousseau, R., & Zhang, L. (2019). Measuring scientific contributions with modified fractional counting. Journal of Informetrics, 13(2), 679–694.
Smith, D., & Katz, J. (n.d.). HEFCE fundamental review of research policy and funding: Collaborative approaches to research: Final report. Higher Education Policy Unit (HEPU), University of Leeds and the Science Policy Research Unit (SPRU) University of Sussex.
Strumia, A., & Torre, R. (2019). Biblioranking fundamental physics. Journal of Informetrics, 13(2), 515–539.
Thelwall, M., & Maflahi, N. (2022). Research coauthorship 1900–2020: Continuous, universal, and ongoing expansion. Quantitative Science Studies, 3(2), 331–344.
Tijssen, R. J. (2004). Is the commercialisation of scientific research affecting the production of public knowledge?: Global trends in the output of corporate research articles. Research Policy, 33(5), 709–733.
Uzzi, B., Amaral, L. A., & Reed-Tsochas, F. (2007). Small-world networks and management science research: A review. European Management Review, 4(2), 77–91.
Uzzi, B., Mukherjee, S., Stringer, M., & Jones, B. (2013). Atypical combinations and scientific impact. Science, 342(6157), 468–472.
Valderas, J. M. (2007). Why do team-authored papers get cited more? Science, 317(5844), 1496–1498.
Wislar, J. S., Flanagin, A., Fontanarosa, P. B., & DeAngelis, C. D. (2011). Honorary and ghost authorship in high impact biomedical journals: A cross sectional survey. British Medical Journal, 343, d6128.
Wuchty, S., Jones, B. F., & Uzzi, B. (2007). The increasing dominance of teams in production of knowledge. Science, 316(5827), 1036–1039.

Author notes

Handling Editor: Vincent Larivière

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.
