Abstract
Hyperauthorship, a phenomenon whereby a disproportionately large number of authors appear on a single paper, is increasingly common in several scientific disciplines, but its consequences for the network metrics used to study scientific collaboration are unknown. It also calls into question the validity of coauthorship as a proxy for scientific collaboration. Using bibliometric data from publications in the field of genomics, we examine the impact of hyperauthorship on metrics of scientific collaboration, propose a method to determine a suitable cutoff threshold for hyperauthored papers, and compare coauthorship networks with and without hyperauthored works. Our analysis reveals that including hyperauthored papers dramatically impacts the structural positioning of central authors and the topological characteristics of the network, while producing small influences on whole-network cohesion measures. We present two solutions to minimize the impact of hyperauthorship: using a mathematically grounded and reproducible calculation of threshold cutoff to exclude hyperauthored papers, or using fractional counting to weight network results. Our findings affirm the structural influences of hyperauthored papers and suggest that scholars should be mindful when using coauthorship networks to study scientific collaboration.
1. INTRODUCTION
Scientific collaboration is vital to solving complex scientific problems that require integration of knowledge across disciplines. Bibliometric studies show that discipline-spanning collaborations play an important role in spurring scientific innovation and producing impactful papers (Collins & Evans, 2015; Thelwall & Maflahi, 2022; Uzzi, Mukherjee et al., 2013). To that end, increasing efforts have been made to support scientific research teams as well as to better understand the relationship between diversity in scientific collaborations and research outcomes. Many studies in this space use paper coauthorship as the primary indicator upon which to assess the diversity of a collaboration and that collaboration’s effects on outcomes of scholarly interest.
The average size of authorship teams has increased over time (Ioannidis, 2008; Wuchty, Jones, & Uzzi, 2007), especially in fields such as high-energy physics (Birnholtz, 2006; Milojević, 2010), genomics (Dinh & Cheng, 2018), and medicine (Franceschet & Costantini, 2010), where hyperauthorship is relatively common. The rapid growth in average team size may impact measures of scientific collaboration outcomes, which traditionally have been examined using a mix of bibliometric (Porter & Rafols, 2009; Rafols, Leydesdorff et al., 2012; Schummer, 2004) and network analysis methods (Akbaritabar, 2021; Barley, Dinh et al., 2022; Cummings & Cross, 2003; Fegley & Torvik, 2013). In fact, Sinatra, Deville et al. (2015) found that the number of citations per paper and number of papers per author in the field of interdisciplinary physics have been inflated over the past 15 years, and that the number of authors per paper increased at a similar rate to the number of papers produced in this field. Thus, these “citation” measures are not a proxy for scientific collaboration (Strumia & Torre, 2019). In some leading medical journals, for example, hyperauthored works can include long lists of authors that represent an honorary role in the research process despite not having contributed substantively to the work (Kennedy, Barnsteiner, & Daly, 2014; Wislar, Flanagin et al., 2011). Furthermore, the validity of coauthorship as a primary indicator of research collaborations is another subject of inquiry, as evidence suggests that not all collaborations result in coauthored papers (Lundberg & Brommels, 2006; Smith & Katz, n.d.; Tijssen, 2004) and that not all coauthorships signify collaboration in terms of contribution to writing (Cronin, 2001; Dinh & Cheng, 2018). It is important to distinguish that scientific collaboration is a process of working together, whereas coauthorship is an indicator of scientific contribution with certain norms and guidelines (Cronin, 2001). 
Thus, they refer to different aspects of scientific research, and while coauthorship may suggest collaboration, there may be other factors beyond direct collaboration that can explain coauthorship (Birnholtz, 2006).
Given the complex relationship between coauthorship and scientific collaboration, especially in the presence of hyperauthorship, this study examines how hyperauthored papers impact the coauthorship network metrics that scholars use to study scientific collaborations. The inclusion of even a few hyperauthored papers within a bibliometrically constructed network may substantially inflate the average number of collaborators per author. Consequently, including hyperauthored works may inflate author-level network measures frequently used to assess an author’s influence in scientific collaboration. We test this hypothesis by examining a database of papers from the interdisciplinary field of genomics, specifically focusing on papers authored by 413 researchers affiliated with a large biological research institute, where hyperauthorship is common. Using these data, we (1) propose a method to determine a suitable cutoff threshold for hyperauthored papers using the cumulative frequency distribution of number of authors per paper; (2) compare the changes (if any) in network metrics of coauthorship networks with and without hyperauthored papers; and (3) present two solutions to minimize the impact of hyperauthorship by using a threshold cutoff to exclude hyperauthored papers or using fractional counting (i.e., Newman and Jaccard weighting functions) to weight network results.
Our analysis reveals that including hyperauthored papers dramatically impacts the structural positioning of central authors and the topological characteristics of the network, while producing comparatively small influences on whole-network cohesion measures. These findings suggest that scholars should be mindful when using bibliometric networks to study scientific collaboration, especially if the object of analysis focuses on egocentric dependent variables. We argue that researchers should consider whether including hyperauthored works is necessary to address their research questions, and consider omitting them from analysis when unnecessary. Further, when hyperauthored works are included, we find that a fractional counting approach can mitigate the impact of hyperauthorship compared to full counting, with the best-performing solution being fractional counting based on the number of shared coauthors across all papers. Our findings affirm researchers’ concerns about the structural influences of hyperauthored papers and indicate that scholars must directly consider how hyperauthored works will affect their results when studying scientific collaboration using coauthorship networks.
2. BACKGROUND
2.1. Network Metrics as Indicators of Scientific Collaboration
Scholars in the science of science have used network measures to analyze structural patterns of scientific collaboration (Bordons, Aparicio et al., 2015; Leydesdorff, 2007) as well as to identify factors that impact collaboration across disciplinary (Morillo, Bordons, & Gómez, 2003; Porter, Cohen et al., 2007) and geographical (Bordons & Gomez, 2000; Naik, Sugimoto et al., 2023) boundaries. The benefits of collaboration in the production of scientific knowledge are well documented in the literature, including that more diverse research teams can benefit from increased creativity and innovation (Burt, 2004; Nemeth & Nemeth-Brown, 2003). Leydesdorff (2007) found that betweenness centrality is a reliable indicator of interdisciplinarity in journal–journal citation networks; the higher the betweenness, the more diverse the disciplines that cite a journal. Bordons et al.’s (2015) network analysis of coauthors in three fields (Nanoscience, Pharmacology, and Statistics) showed that authors with the largest number of “strong tie” coauthors (i.e., those with repeated collaborations) tend to have the highest research productivity. Costanza and Kubiszewski (2012) examined the coauthorship network of the 172 authors who published the most papers in an interdisciplinary field (Ecosystem Services) and found that the number of coauthors had a positive linear relationship with the number of citations an article received, which also had a positive correlation with the average h index of each article. These studies exemplify that network analysis is a preferred method of analysis in which coauthorship and citation patterns often are used as proxies for scientific collaboration.
2.2. Hyperauthorship and Scientific Collaboration
Bibliometric studies have found a continuous and consistent growth in coauthorship that spans all scientific disciplines (Dehdarirad & Nasini, 2017; Milojević, 2010; Valderas, 2007). While there are notable benefits of scientific collaboration, there are also drawbacks to consider, particularly in terms of fair allocation of credit when coauthorship is given for reasons other than scientific collaboration (Birnholtz, 2006; Cronin, 2001). Especially with the rising prevalence of publications with large numbers of coauthors, known as hyperauthorship, norms and requirements for authorship in a collaborative work are also impacted (Cronin, 2001). Scholars in bibliometrics and network science have found that hyperauthorship affects traditional indicators of scholarly productivity, such as the h-index (Koltun & Hafner, 2021), degree centrality (Fegley & Torvik, 2013), and author degree distributions (Milojević, 2010). Koltun and Hafner’s (2021) analysis of over two million publications on Google Scholar and the citations between them revealed that authors with 100 coauthors or more over the course of their careers have disproportionately high h-indices. However, the h-indices were found to be uncorrelated with other productivity indicators, such as the number of scientific awards received. Fegley and Torvik (2013) found that hyperauthorship influenced coauthorship network structure, where groups of authors were completely connected within their own clusters (i.e., a common multiauthored paper) and thus had higher degree centrality than expected. Milojević (2010) compared the probability distributions of new collaboration based on prior coauthorship with and without hyperauthorship and showed that while both distributions follow a power law, the distribution with hyperauthorship includes anomalous noise. The author also found that for authors with fewer than 20 coauthors over the course of their careers, the degree distribution was a log-normal “hook” instead of a power law. This finding illustrates that the number of coauthors has an effect on the topology of the collaboration network. Altogether, these studies show that hyperauthorship may impact a scholar’s interpretation of a particular author’s (or a group of authors’) collaboration activity and their connectedness within a network.
2.3. Mitigating the Impact of Hyperauthorship
We have observed that in many studies using bibliometric data, hyperauthored papers are not explicitly acknowledged or addressed, despite the known impacts of hyperauthorship on coauthorship networks, where papers with a high number of authors often have inflated weights due to the presence of numerous large complete subgraphs (Batagelj, 2020; Batagelj & Cerinšek, 2013). In some cases, keeping hyperauthored papers may be useful or necessary, such as in studies examining author name disambiguation (Farber & Ao, 2022; Kim, 2019), researcher productivity (Costas, van Leeuwen, & Bordons, 2010; Thelwall & Maflahi, 2022), or growth in coauthorship size over time (Borner, Dall’Asta et al., 2005; Thelwall & Maflahi, 2022). Other studies have kept hyperauthored papers in order to evaluate network normalization techniques to minimize the impact of hyperauthorship, such as fractional counting (Batagelj, 2020; Perianes-Rodriguez, Waltman, & van Eck, 2016), or to examine core-periphery structures of scientific collaboration networks (Batagelj & Zaveršnik, 2011; Fetscherin & Heinrich, 2015; Uzzi, Amaral, & Reed-Tsochas, 2007). Among those works that do explicitly seek to mitigate the impact of hyperauthorship, the most common practice has been to exclude papers that have more than a certain number of coauthors (Cronin, 2001; Fegley & Torvik, 2013; Milojević, 2010; Morris & Goldstein, 2007). However, practices for choosing this threshold have been inconsistent. Cronin (2001) was the first to define hyperauthorship as any paper with more than 100 authors. Milojević (2010) set a threshold of more than 200 authors for a hyperauthored paper. Both Fegley and Torvik (2013) and Morris and Goldstein (2007) operationalized a hyperauthored paper as one with at least 20 authors. Kim (2019) identified and removed papers in the top 1% for the highest number of authors (≥9 authors in the DBLP data set, ≥16 authors in the MAG data set, and ≥13 authors in the MEDLINE data set). Ahmed, Cambo et al. (2013) removed 3% of papers that were identified as hyperauthored, but did not specify the threshold. While these empirical solutions are important first steps, variability in how hyperauthored papers are handled creates challenges for comparability across studies. Therefore, a standardized and reproducible method for identifying and excluding hyperauthorship would be beneficial to the field of bibliometrics. To address this need, we demonstrate how a reproducible pipeline for preprocessing hyperauthorship data can help mitigate any potential effects of hyperauthorship on network measures of interest, while also offering researchers a standard to consider when seeking to exclude hyperauthored works from their data sets.
3. METHODOLOGY
3.1. Data
The data set used for this study consisted of bibliometric records for 413 researchers within a large biological research institute. Among the 413 researchers, 208 have coauthored one or more papers together during their careers at the research institute. Each researcher’s publication data throughout their academic career (up until 2021) were collected from the Scopus database, including metadata such as full title, publication type, journal/conference proceeding name, publisher, DOI, author names, organizational unit of author(s), citation counts (based on Scopus), open access status, and keywords. We added a unique ID to each publication for quick retrieval and matching purposes. The publication data from this set of seed authors allow us to examine the immediate impact of hyperauthorship on each researcher’s egocentric network of coauthors, particularly in terms of collaboration frequency and strength with their coauthors. Additionally, this data set allows us to compare researchers’ egocentric networks and determine whether the presence of hyperauthored papers may benefit some researchers while disadvantaging others.
The original format of the data set was a two-mode network, in which two types of nodes, papers and authors, are connected. An edge between a paper node and an author node indicates that the paper is authored by that author. As our goal was to analyze coauthorship network patterns, we transformed the two-mode network into an author-author one-mode network via the weighted bipartite projection method introduced by Breiger (1974) and later discussed by Borgatti and Halgin (2011). The resulting weighted projected graph contains an edge between two author nodes whenever both were connected to the same paper node in the original bipartite graph. In other words, the projected graph contains coauthorship edges, along with weights reflecting the number of papers that two authors have coauthored together.
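The projection described above can be sketched with NetworkX as follows. This is a minimal illustration on hypothetical toy records, not the pipeline used in the study; the `papers` dictionary and its identifiers are assumptions introduced for the example.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy records: paper ID -> list of author names.
papers = {
    "p1": ["A", "B"],
    "p2": ["A", "B", "C"],
    "p3": ["C", "D"],
}

# Build the two-mode (paper-author) graph.
B = nx.Graph()
for paper_id, authors in papers.items():
    B.add_node(paper_id, bipartite="paper")
    for author in authors:
        B.add_node(author, bipartite="author")
        B.add_edge(paper_id, author)

author_nodes = [n for n, d in B.nodes(data=True) if d["bipartite"] == "author"]

# One-mode projection: an edge's weight is the number of papers two authors share.
G = bipartite.weighted_projected_graph(B, author_nodes)

for u, v, w in sorted(G.edges(data="weight")):
    print(u, v, w)
```

In this toy example, authors A and B share two papers, so their tie carries twice the weight of the other ties.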
The resulting publication data set contained 19,100 unique papers produced by 35,658 unique authors. Figure 1 shows the distribution of the number of coauthors for the papers in the resulting data set. This serves as evidence that hyperauthorship is a relatively recent but increasingly common trend in publishing, marked by a notable increase in the number of hyperauthored papers observed after the 2000s. Considering the trajectory of hyperauthored papers in our data set, it is imperative to examine the impacts that these papers would have on our understanding of scientific collaboration patterns over time.
Figure 1. Scatterplot depicting number of coauthors per paper for all articles in the data set, including hyperauthored papers.
3.1.1. Threshold for hyperauthorship
We establish a threshold for categorizing a paper as hyperauthored based on the distribution of the number of authors per paper. Our goal is to show a generalized and reproducible method for identifying and removing hyperauthored papers from any skewed distribution of publications. Our process involves several steps to determine the cutoff point for hyperauthorship, beyond which outliers with a large number of authors can be excluded from analysis. First, we check whether the number of authors per paper is normally distributed. If it is, we use the empirical rule (i.e., the 68–95–99.7 rule) to identify outliers, setting the threshold at the point within which 95% of the data are captured (i.e., two standard deviations from the mean). If it is not, we apply Chebyshev’s inequality (i.e., the 75–88.9 rule), setting the threshold at the point within which at least 88.9% of the data are captured (i.e., three standard deviations from the mean). We then cross-validate this threshold against a cumulative frequency distribution, which involves ranking observations in order of magnitude and calculating cumulative frequencies based on the ranking. Using two methods to determine the cutoff point allows us to establish the reliability of our approach. While both methods depend on the data and their distribution, we anticipate that they can be applied to any bibliometric data set, given the expected nonnormal distribution as per Lotka’s law.
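The two cutoff calculations can be sketched as below. This is an illustrative sketch only: the function names, the use of the sample standard deviation, and the toy author counts are assumptions, and the study's own implementation may differ in these details.

```python
import numpy as np

def chebyshev_upper_bound(counts, k=3.0):
    """Upper cutoff at mean + k * SD.

    Chebyshev's inequality guarantees that at least 1 - 1/k**2 of the
    observations lie within k SDs of the mean for any distribution
    (about 88.9% for k = 3). The sample SD (ddof=1) is an assumption here.
    """
    counts = np.asarray(counts, dtype=float)
    return counts.mean() + k * counts.std(ddof=1)

def cumulative_cutoff(counts, coverage=0.90):
    """Smallest author count whose cumulative share of papers reaches `coverage`."""
    values, freq = np.unique(np.asarray(counts), return_counts=True)
    cum_share = np.cumsum(freq) / freq.sum()
    # First index at which the cumulative share is >= coverage.
    return int(values[np.searchsorted(cum_share, coverage)])

# Hypothetical authors-per-paper counts with one very large paper.
authors_per_paper = [1] * 5 + [2] * 3 + [3] + [50]
print(chebyshev_upper_bound(authors_per_paper, k=3))
print(cumulative_cutoff(authors_per_paper, coverage=0.85))
```

Papers whose author count exceeds the chosen cutoff would then be dropped before the network is projected.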
3.2. Network Weighting Functions
Another potential solution to addressing hyperauthored works is to apply weighting functions to potentially reduce these products’ structural influence on collaboration networks. Fair allocation of authorship credit to authors engaged in multiauthored papers has been a topic of considerable interest for bibliometrics researchers (Abramo, D’Angelo, & Rosati, 2013; Perianes-Rodriguez et al., 2016; Sivertsen, Rousseau, & Zhang, 2019). This problem is especially relevant to researchers who use a combination of bibliometric and network approaches, as choices about credit allocation have a direct impact on how the network is constructed and weighted (Gauffriau, 2017; Perianes-Rodriguez et al., 2016). Gauffriau (2021) in their comprehensive literature review find that full counting and fractional counting are the two primary methods used for credit allocation. The full counting method assigns a weight of one to each author of a paper, whereas the fractional counting method distributes a single weight among all the coauthors of a paper. In this study, we will implement both counting methods, as summarized in Table 1, and with formulations stated below.
Table 1. Summary of network weighting functions and measures
Function/Measure | Description | Usage |
---|---|---|
Full counting | Assigns a weight of one to each author | Constructing whole and egocentric networks |
Fractional counting (Newman’s method) | Distributes a single weight among all coauthors of a paper based on the number of coauthors | Constructing whole and egocentric networks |
Fractional counting (Jaccard’s method) | Computes the Jaccard index to measure the neighborhood overlap between two authors | Constructing whole and egocentric networks |
3.2.1. Full counting
3.2.2. Fractional counting
There are several approaches to fractional counting (Batagelj, 2020; Gauffriau, 2021), which are essentially means of normalizing the coauthorship weights based on the number of coauthors in a paper. Batagelj (2020) applied three weighting algorithms to normalize coauthorship and cocitation networks, effectively mitigating the impact of hyperauthorship on the network structure caused by overrepresentation of edge weights. Based on prior literature, we utilize two main weighting functions, namely Newman’s and Jaccard’s methods. Newman’s method and variants of the method have been used in prior studies such as Griffin, Arth et al. (2021) and Perianes-Rodriguez et al. (2016). Jaccard’s method has been used in Brandão and Moro (2017) and Pan, Sinha et al. (2012) as a measure of neighborhood overlap between two authors.
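The three weighting schemes in Table 1 can be sketched as follows. This sketch assumes that full counting adds 1 to a tie for each shared paper, that Newman's method adds 1/(n − 1) for each shared paper with n authors (Newman, 2001), and that Jaccard's method reweights each existing tie by the overlap of the two authors' coauthor neighborhoods. The function names and toy data are hypothetical, and the exact formulations used in the study may differ.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

def coauthorship_graph(papers, scheme="full"):
    """Build a weighted coauthorship graph from an iterable of author lists.

    scheme="full":   each shared paper adds 1 to the tie.
    scheme="newman": each shared paper with n authors adds 1 / (n - 1).
    """
    weights = defaultdict(float)
    for authors in papers:
        authors = sorted(set(authors))
        n = len(authors)
        if n < 2:
            continue  # single-author papers create no ties
        increment = 1.0 if scheme == "full" else 1.0 / (n - 1)
        for u, v in combinations(authors, 2):
            weights[(u, v)] += increment
    G = nx.Graph()
    G.add_weighted_edges_from((u, v, w) for (u, v), w in weights.items())
    return G

def jaccard_weights(G):
    """Reweight each tie (u, v) by |N(u) & N(v)| / |N(u) | N(v)|."""
    H = nx.Graph()
    H.add_nodes_from(G)
    for u, v, j in nx.jaccard_coefficient(G, G.edges()):
        H.add_edge(u, v, weight=j)
    return H

toy_papers = [["A", "B"], ["A", "B", "C"]]
full = coauthorship_graph(toy_papers, scheme="full")
newman = coauthorship_graph(toy_papers, scheme="newman")
jaccard = jaccard_weights(full)
print(full["A"]["B"]["weight"], newman["A"]["B"]["weight"], jaccard["A"]["B"]["weight"])
```

Under Newman's scheme a paper with many authors contributes very little to any single tie, which is why fractional counting dampens the large complete subgraphs that hyperauthored papers create. NetworkX also provides `bipartite.collaboration_weighted_projected_graph` for Newman's weighting when starting from a two-mode graph.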
3.3. Network Measures
Here, we define the network measures that are computed for this study. We use existing algorithms available in NetworkX, a Python library for network analysis, and modify a subset of measures based on our operationalization. We conduct network analysis at both the whole-network and egocentric network levels, computing the same set of metrics (discussed below) at both levels, as shown in Table 2.
Table 2. Summary of whole and egocentric network measures
Measure | Description | Usage |
---|---|---|
Density | Proportion of edges in the network relative to total possible edges | Whole-network cohesion measure |
Average clustering | Measures local neighborhood formation in the network | Whole-network cohesion measure |
Average path length | Measures average shortest path distance between every pair of nodes | Whole-network cohesion measure |
Giant component | Identifies the largest connected subgraph | Whole-network cohesion measure |
Clauset-Newman-Moore | Agglomerative clustering that initializes each node as a separate community, then uses a greedy approach to iteratively merge pairs of communities that enhance modularity | Community detection algorithm |
Clique percolation | Divisive clustering that detects overlapping cliques, then forms communities by grouping these cliques | Community detection algorithm |
Louvain modularity | Agglomerative clustering that rearranges nodes within communities, then reaggregates them into separate communities to enhance modularity | Community detection algorithm |
Omega coefficient | Indicates the small-world property of the network based on path length and clustering | Topological measure |
Alpha exponent | Measures the network’s degree distribution to check for a power-law fit | Topological measure |
Degree centrality | Measures the number of edges each node has to other nodes | Whole-network and egocentric measure |
Betweenness centrality | Measures the number of shortest paths that pass through each node | Whole-network and egocentric measure |
Closeness centrality | Measures average reachability of one node to other nodes | Whole-network and egocentric measure |
Eigenvector centrality | Measures the extent to which a node is an immediate neighbor of well-connected nodes | Whole-network and egocentric measure |
3.3.1. Whole-network cohesion measures
Density measures the proportion of edges that exist in a network relative to the total number of possible edges. We use the formula (2m)/(n(n − 1)) to calculate density, where n is the number of nodes, and m is the number of edges.
Average clustering measures the extent to which nodes in a network tend to form local neighborhoods. We calculate this by dividing the number of triangles in the network by the number of triangles that could possibly exist for a given network size.
Average path length measures the average shortest path distance between every pair of nodes. We use Dijkstra’s algorithm with Newman’s (2001) modification for weighted networks, where each node is iteratively selected as a source node and a shortest path to every other node is calculated. This modification entails inverting the edge weights to reflect the strength of the collaboration tie, whereby higher weights mean a lower “distance/cost” to traverse.
Giant component is the largest connected subgraph in a network, in which all nodes are reachable from one another. The algorithm iterates through each source node and conducts a breadth-first search to identify the nodes reachable from it. We first extract all the connected components that exist in the network, then determine the largest component and assign it as the giant component.
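The four cohesion measures above can be sketched with NetworkX as follows. The function name `cohesion_summary`, the `distance` attribute, and the toy graph are assumptions made for illustration; the sketch also assumes a non-empty graph with strictly positive edge weights, and computes average path length on the giant component because it is undefined across disconnected components.

```python
import networkx as nx

def cohesion_summary(G, weight="weight"):
    # Invert tie strength so that stronger collaborations are "closer".
    H = G.copy()
    for u, v, d in H.edges(data=True):
        d["distance"] = 1.0 / d.get(weight, 1.0)

    # Extract all connected components and keep the largest as the giant component.
    components = sorted(nx.connected_components(H), key=len, reverse=True)
    giant = H.subgraph(components[0])

    return {
        "density": nx.density(H),  # 2m / (n(n - 1)) for an undirected graph
        "average_clustering": nx.average_clustering(H),
        "n_components": len(components),
        "giant_component_size": giant.number_of_nodes(),
        # Weighted shortest paths (Dijkstra) on the inverted weights.
        "average_path_length": nx.average_shortest_path_length(giant, weight="distance"),
    }

# Hypothetical toy network: one triangle plus a separate pair.
toy = nx.Graph()
toy.add_weighted_edges_from(
    [("A", "B", 2.0), ("B", "C", 1.0), ("A", "C", 1.0), ("D", "E", 1.0)]
)
summary = cohesion_summary(toy)
print(summary)
```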
3.3.2. Subgroup measures
The Clauset-Newman-Moore greedy modularity maximization algorithm begins by treating each node as its own community and then uses a greedy approach to iteratively merge the pair of communities that yields the largest increase in the network’s modularity.
The clique percolation method functions as a divisive clustering approach by identifying cohesive subgraphs or cliques of nodes, which are groups where each node is connected to every other node. These cliques are gradually merged to create a hierarchy of community structures.
The Louvain modularity algorithm is an agglomerative clustering method that starts with each node in its own community. It locally optimizes modularity by iteratively moving nodes into neighboring communities based on their connections, then aggregates the resulting communities and repeats the process to identify community structures within the network.
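The three subgroup methods are available in NetworkX and can be sketched on a toy graph as below. The barbell graph, the clique size k = 3, and the random seed are assumptions chosen for illustration, and `louvain_communities` requires a reasonably recent NetworkX release.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph: two 5-node cliques joined by a single bridging edge.
G = nx.barbell_graph(5, 0)

# Clauset-Newman-Moore greedy modularity maximization.
cnm = community.greedy_modularity_communities(G)

# Louvain modularity optimization (seeded for repeatability).
louvain = community.louvain_communities(G, seed=42)

# Clique percolation with k = 3; communities may overlap in general.
cpm = list(community.k_clique_communities(G, 3))

print(len(cnm), len(louvain), len(cpm))
print(community.modularity(G, louvain))
```

Because clique percolation can assign a node to several communities or to none, its output is not directly comparable to the two modularity-based partitions without further processing.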
3.3.3. Centrality measures
Degree centrality measures the number of edges that each node has to other nodes in the network. We normalize each node’s degree centrality by dividing it by the maximum possible degree in the network (n − 1).
Betweenness centrality measures the number of shortest paths that pass through each node. This measure indicates the extent to which a node can bridge connection(s) to other nodes in the network. Given the size of the network and the computational complexity, we approximate betweenness centrality based on a random sample of 1,000 nodes. This measure is normalized by 1/((n − 1)(n − 2)), where n is the number of nodes in a directed network.
Closeness centrality measures the average reachability of one node to other nodes in the network. We calculate this based on the reciprocal of the average path length between a source node and all other n − 1 nodes.
Eigenvector centrality measures the extent to which a node is an immediate neighbor of well-connected nodes. It is calculated by solving Ax = λx, where A is the adjacency matrix of the network and λ is its largest eigenvalue; the centrality of each node is the corresponding entry of the eigenvector x. The algorithm iterates until x converges to the eigenvector associated with the largest eigenvalue of A.
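The four centrality measures can be sketched with NetworkX as follows. The function name `centrality_table`, the seed, and the use of the karate club graph as test data are assumptions for illustration; the sketch uses unweighted centralities and samples source nodes for betweenness only when the graph has more nodes than `sample_size`.

```python
import networkx as nx

def centrality_table(G, sample_size=1000, seed=0):
    n = G.number_of_nodes()
    k = min(sample_size, n)  # with k == n every node is a source, so the result is exact
    return {
        "degree": nx.degree_centrality(G),  # degree / (n - 1)
        "betweenness": nx.betweenness_centrality(G, k=k, normalized=True, seed=seed),
        "closeness": nx.closeness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
    }

# Small, well-known test network bundled with NetworkX.
karate = nx.karate_club_graph()
table = centrality_table(karate)
for name, scores in table.items():
    top = max(scores, key=scores.get)
    print(name, top, round(scores[top], 3))
```

On a network with tens of thousands of authors, sampling 1,000 source nodes trades some accuracy in betweenness for a large reduction in running time.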
3.3.4. Topological measures
The omega coefficient indicates the extent to which a network exhibits a small-world property. The formula is ω = Lr/L − C/Cl, where C is the clustering coefficient of the network, L is the average path length of the network, Lr is the average path length of the simulated random network, and Cl is the clustering coefficient of the simulated lattice network. The ω coefficient typically falls between −1 and +1, where ω close to zero reflects a small-world topology. A random network is indicated by a positive ω. A lattice network is indicated by a negative ω. We compute ω on the giant component, with five rewiring iterations per edge, and five random graphs generated to calculate the simulated statistics.
The alpha exponent indicates the extent to which the network’s degree distribution exhibits a power-law fit. The algorithm is implemented via the powerlaw package in Python, where the optimal α exponent is computed for the network. α ranges from 1 to ∞, and α between 2 and 3 indicates that the network degree distribution is a power-law fit (Newman, 2005).
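Both topological measures can be sketched as below. The ω calculation uses NetworkX's `omega` function with five rewiring iterations and five reference graphs. For α, the sketch swaps in the closed-form continuous maximum-likelihood estimator with a fixed, user-supplied lower bound `x_min` rather than the powerlaw package, which also selects `x_min` automatically, so its estimates will generally differ. The function names, `x_min` default, and toy inputs are assumptions.

```python
import math

import networkx as nx

def small_world_omega(G, niter=5, nrand=5, seed=0):
    """Omega on the giant component: omega = Lr / L - C / Cl."""
    giant = G.subgraph(max(nx.connected_components(G), key=len)).copy()
    return nx.omega(giant, niter=niter, nrand=nrand, seed=seed)

def alpha_mle(degrees, x_min=1):
    """Continuous MLE: alpha = 1 + n / sum(ln(x_i / x_min)) over x_i >= x_min.

    Simplified stand-in for a full power-law fit; assumes at least one
    degree is strictly greater than x_min.
    """
    tail = [d for d in degrees if d >= x_min]
    return 1.0 + len(tail) / sum(math.log(d / x_min) for d in tail)

# Hypothetical toy inputs.
toy = nx.connected_watts_strogatz_graph(30, 4, 0.1, seed=1)
omega = small_world_omega(toy)
alpha = alpha_mle([d for _, d in toy.degree()], x_min=1)
print(omega, alpha)
```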
3.4. Comparison Between Networks Without and With Hyperauthorship
4. RESULTS
We first present the hyperauthorship cutoff results based on our authorship threshold approach. The number of authors per paper ranges from 1 to 156 in our data set (mean = 5.46, median = 4, SD = 6.37; Figure 1). Given that the mean number of authors is greater than the median, we expect a positively skewed distribution. As shown in Figure 3, the distribution is indeed skewed, and thus we use Chebyshev’s inequality to estimate a suitable outlier threshold. Based on Chebyshev’s inequality at k = 3, where at least approximately 89% of the data fall within three standard deviations of the mean, we find the upper bound at 25.85. This means that papers with approximately 26 or more authors would be considered outliers in this data set. We further evaluate the reliability of this threshold by using a cumulative percentage approach, as shown in Figure 2. We find that 90% of papers are included within a threshold of 25 authors per paper. Thus, this method suggests excluding all papers with more than 25 coauthors from analyses. Using this cutoff, we removed 203 papers that have more than 25 authors. These papers have a range of 26 to 156 authors per paper (mean = 50.88, median = 43, SD = 26.25). After removing these works, the updated data set has a range of 1 to 25 authors per paper (mean = 4.97, median = 4, SD = 3.36).
Figure 2. Histogram depicting number of coauthors per paper for all articles included in this analysis. The red dotted line indicates the cutoff threshold in authorship at 90% cumulative percentage, indicating hyperauthorship.
Histogram depicting the distribution of the number of authors per paper in this analysis, as estimated by Chebyshev’s inequality (blue line) and cumulative normal distribution (orange line).
The structural and topological characteristics of the networks are impacted to varying degrees by including versus excluding hyperauthored papers (Table 3). The coauthorship network without hyperauthored papers was projected from 18,897 unique papers. The network including hyperauthored papers was projected from 19,100 papers, 203 more than the first network. Although including hyperauthored papers produces a minimal change in the number of papers between the two networks (1.074%), the resulting changes to network size are notable. The number of authors increases by 17% (n = 5,191) when hyperauthored papers are included, and the number of coauthorship ties increases by 121% (n = 242,311). The density of the network also increases by 75% given the rise in the number of edges; however, average clustering does not change across the two networks. Because the networks are sparsely connected, which makes density an unreliable measure of cohesion, we also report average degree centrality, which increases by 89% when hyperauthored papers are included. Although the number of edges increases when hyperauthored papers are included, the number of closed triangles between nodes remains the same. This indicates that more edges do not necessarily lead to a higher level of triadic closure among the authors. The average shortest path length increases moderately (+16%, from 4.27 to 4.95) and the number of components decreases (from 15 to 14) in the network with hyperauthored papers. Interestingly, the size of the largest (giant) component grows by a magnitude similar to the increase in the number of edges (+122%).
In particular, the giant component in the network without hyperauthorship excludes 831 edges, and the giant component in the network with hyperauthorship excludes 729 edges, thus suggesting that the network with hyperauthored papers has slightly fewer pendant edges that are not connected to the rest of the network.
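Several of the descriptives above follow directly from the node and edge counts. The sketch below recomputes density, average degree, and percentage change from the counts reported in Table 3, assuming an undirected simple graph (density = 2E / (N(N − 1)), average degree = 2E / N). It also shows why a handful of large papers adds so many ties: a paper with n authors can contribute up to n(n − 1)/2 coauthor pairs. The helper names are illustrative only.

```python
def density(n_nodes, n_edges):
    """Density of an undirected simple graph."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def average_degree(n_nodes, n_edges):
    """Mean number of ties per node in an undirected graph."""
    return 2 * n_edges / n_nodes


def pct_change(before, after):
    """Percentage change from `before` to `after`."""
    return 100 * (after - before) / before


def clique_ties(n_authors):
    """Maximum number of coauthor pairs created by one paper."""
    return n_authors * (n_authors - 1) // 2


# Node and edge counts as reported in Table 3.
without_hyper = {"nodes": 30467, "edges": 199581}
with_hyper = {"nodes": 35658, "edges": 441892}

for label, net in (("without", without_hyper), ("with", with_hyper)):
    print(
        label,
        f"density={density(net['nodes'], net['edges']):.4f}",
        f"avg degree={average_degree(net['nodes'], net['edges']):.3f}",
    )

print(f"edge change: {pct_change(199581, 441892):+.2f}%")
print(f"pairs from a 4-author paper: {clique_ties(4)}")
print(f"pairs from a 156-author paper: {clique_ties(156)}")
```

Because the number of pairs grows quadratically with the author count, the largest paper in the data set (156 authors) can by itself account for thousands of ties, which helps explain how roughly 1% more papers can more than double the number of edges.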
Network descriptives for coauthorship networks without and networks with hyperauthorship
Network measures | Without hyperauthors | With hyperauthors | % change |
---|---|---|---|
# of unique papers | 18,897 | 19,100 | +1.07 |
# of nodes (authors) | 30,467 | 35,658 | +17.04 |
# of edges (coauthorship) | 199,581 | 441,892 | +121.41 |
Density | 0.0004 | 0.0007 | +75 |
Average clustering | 0.854 | 0.854 | 0 |
Average path length (of subgraph) | 4.27 | 4.95 | +15.93 |
Size of giant component | 198,750 | 441,163 | +121.97 |
# of Components | 15 | 14 | −6.67 |
Clauset-Newman-Moore | 292 | 307 | +5.14 |
Clique percolation | 945 | 910 | −3.70 |
Louvain modularity | 74 | 68 | −8.11 |
Small-worldliness (ω) | −0.295 | −0.355 | −20.339 |
Power-law exponent (α) | 2.919 | 4.867 | +66.735 |
Average degree centrality (unweighted) | 13.101 | 24.785 | +89.184 |
Average eigenvector centrality (unweighted) | 0.003 | 0.0007 | −76.667 |
Average betweenness centrality (unweighted) | 0.0001 | 0.000 | −100 |
Average closeness centrality (unweighted) | 0.231 | 0.240 | +3.896 |
In terms of topology, the networks both without and with hyperauthored papers exhibit a lattice-like structure rather than a small-world structure (negative ω values; Table 3). The network without hyperauthored papers (α = 2.919) fits a power-law topology (i.e., a hub-and-spokes structure, consistent with Newman’s (2005) finding) more closely than the network with hyperauthored papers (α = 4.867), whose exponent falls outside the 2–3 range. This result highlights how hyperauthorship alters the degree distribution and, in turn, the topology of the network.
Figures 4 and 5 show the distribution (log-normal) of the number of coauthors in an author’s egonetwork and in a paper’s egonetwork, respectively. Including hyperauthored papers notably affects the right tail of the distribution, where a number of authors have a large number of coauthors. As a result, the slope of the right tail in network (b) is less steep than in network (a). The impact is also visible in the paper egonetwork distribution, which shows more oscillations along the right tail. Altogether, this shows that hyperauthorship strongly shapes the degree distribution of the coauthorship network, because coauthorship counts vary widely when hyperauthored papers are included.
Log-normal distribution of the number of coauthors of a given author: (a) without hyperauthors; (b) with hyperauthors.
Log-normal distribution of the number of authors of a given paper: (a) without hyperauthors; (b) with hyperauthors.
We observe notable differences in average centrality measures when hyperauthored papers are included, as shown in Table 3. Average degree and closeness centrality increased, by 89% and 4%, respectively. Average eigenvector and betweenness centrality decreased substantially, by 77% and 100%, respectively. Although the absolute changes in average centrality values seem small, their magnitude is notable given that the values are averages over a large number of observations.
We further examined how centrality measures are impacted by the presence of hyperauthored papers when different weighting functions are used in the calculation. Table 4 shows the average centrality measures based on full counting (i.e., “weighted”) and two partial counting methods, Newman’s and Jaccard’s functions. We also include the unweighted measures to compare with the weighted counterparts. The percentage change is reported to show the difference in measures when hyperauthored papers are included, and the optimal weighting function is one that can minimize this percentage difference. We find that for degree centrality, Newman weighting is most effective in minimizing the difference in measures across the two networks. For betweenness centrality, full counting is preferred, as there is no difference in betweenness centrality across the two networks when this weighting function is used. For closeness centrality, Jaccard weighting along with the unweighted measure are preferred, with the least difference in closeness centrality when hyperauthored papers are included. For eigenvector centrality, full counting method is preferred as there is no change in centrality in the presence of hyperauthorship.
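To make the three weighting schemes concrete, the sketch below computes edge weights and weighted degree (strength) from a toy list of papers. The definitions are assumptions based on common usage rather than a reproduction of the study’s code: full counting adds 1 per shared paper, Newman weighting adds 1/(n − 1) per shared paper with n authors, and Jaccard weighting divides the number of shared papers by the size of the union of the two authors’ paper sets. All names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edge_weights(papers):
    """papers: dict mapping paper id -> list of author ids.

    Returns (full, newman, jaccard) dicts keyed by sorted author pairs.
    """
    full = defaultdict(float)
    newman = defaultdict(float)
    papers_of = defaultdict(set)
    for paper_id, authors in papers.items():
        authors = sorted(set(authors))
        for author in authors:
            papers_of[author].add(paper_id)
        n = len(authors)
        if n < 2:
            continue  # sole-authored papers create no ties
        for a, b in combinations(authors, 2):
            full[(a, b)] += 1.0
            newman[(a, b)] += 1.0 / (n - 1)
    jaccard = {
        (a, b): len(papers_of[a] & papers_of[b]) / len(papers_of[a] | papers_of[b])
        for (a, b) in full
    }
    return dict(full), dict(newman), jaccard


def strength(weights):
    """Weighted degree: sum of edge weights attached to each author."""
    totals = defaultdict(float)
    for (a, b), w in weights.items():
        totals[a] += w
        totals[b] += w
    return dict(totals)


# Hypothetical data: author A has two small papers and one 30-author paper.
toy_papers = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "B"],
    "p3": ["A"] + [f"H{i}" for i in range(29)],
}

full, newman, jaccard = edge_weights(toy_papers)
full_strength = strength(full)
newman_strength = strength(newman)
print(f"A, full counting: {full_strength['A']:.1f}")
print(f"A, Newman weighting: {newman_strength['A']:.1f}")
print(f"A-B Jaccard weight: {jaccard[('A', 'B')]:.3f}")
```

Under these definitions each multi-author paper contributes exactly 1 to an author’s Newman-weighted strength regardless of its size, whereas full counting lets the 30-author paper dominate. This is one plausible reading of why Newman weighting changes degree centrality the least between the two networks.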
Average centrality measures with various weighting functions for networks without and networks with hyperauthorship
Network measures | Without hyperauthors | With hyperauthors | % change |
---|---|---|---|
Average degree centrality (unweighted) | 13.10 | 24.78 | +89.18 |
Average degree centrality (weighted) | 19.23 | 34.72 | +80.56 |
Average degree centrality (Newman weighted) | 3.05 | 2.90 | −5.08 |
Average degree centrality (Jaccard weighted) | 4.61 | 11.67 | +152.97 |
Average betweenness centrality (unweighted) | 0.0001 | 0.000 | −100 |
Average betweenness centrality (weighted) | 0.0002 | 0.0002 | 0 |
Average betweenness centrality (Newman weighted) | 0.0003 | 0.0002 | −33.33 |
Average betweenness centrality (Jaccard weighted) | 0.0002 | 0.0001 | −50 |
Average closeness centrality (unweighted) | 0.23 | 0.24 | +3.90 |
Average closeness centrality (weighted) | 0.45 | 0.46 | +5.97 |
Average closeness centrality (Newman weighted) | 1.52 | 2.30 | +51.85 |
Average closeness centrality (Jaccard weighted) | 21.61 | 22.27 | +3.06 |
Average eigenvector centrality (unweighted) | 0.003 | 0.0007 | −76.67 |
Average eigenvector centrality (weighted) | 0.0003 | 0.0003 | 0 |
Average eigenvector centrality (Newman weighted) | 0.0003 | 0.0002 | −33.33 |
Average eigenvector centrality (Jaccard weighted) | 0.0002 | 0.0004 | +100 |
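The selection rule used in this comparison (prefer the weighting whose percentage change is closest to zero) can be applied mechanically to the % change column of the table above. The sketch below does so; the dictionary layout and function name are illustrative, "full" stands for the rows labeled "weighted", and the numbers are copied from the table.

```python
# % change values copied from the table of average centrality measures.
reported_pct_change = {
    "degree": {"unweighted": 89.18, "full": 80.56, "Newman": -5.08, "Jaccard": 152.97},
    "betweenness": {"unweighted": -100.0, "full": 0.0, "Newman": -33.33, "Jaccard": -50.0},
    "closeness": {"unweighted": 3.90, "full": 5.97, "Newman": 51.85, "Jaccard": 3.06},
    "eigenvector": {"unweighted": -76.67, "full": 0.0, "Newman": -33.33, "Jaccard": 100.0},
}


def most_stable(changes):
    """Return the weighting with the smallest absolute percentage change."""
    return min(changes, key=lambda name: abs(changes[name]))


preferred = {
    measure: most_stable(changes)
    for measure, changes in reported_pct_change.items()
}

for measure, weighting in preferred.items():
    print(f"{measure}: {weighting} ({reported_pct_change[measure][weighting]:+.2f}%)")
```

For closeness centrality the unweighted measure is a close second to Jaccard weighting, which matches the text’s statement that both are acceptable choices.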
Altogether, the findings suggest that including hyperauthored papers distorts microlevel and egocentric measures while maintaining relative stability for the network as a whole. This conclusion is based on the observation that while there is little change in whole-network structure (despite growth in network size), there is significant change in the average centrality measures at the microlevel, indicating that the inclusion of hyperauthored papers can greatly affect the position and influence of individual authors within the network.
4.1. Egocentric Network Case Study
Given our initial findings that hyperauthored papers produce meaningful structural influences on egocentric measures, we conducted an egocentric case study of a specific set of authors to explore how their positions in the network changed when hyperauthorship was included. The authors were selected based on their importance in the network as measured by degree centrality (selection criteria shown in Table 5). Degree centrality is a reliable indicator of power and prestige in our network, as authors with high degree centrality are more likely to benefit, in terms of knowledge and skills, from their immediate coauthors and their respective coauthorship networks (Badar, Frantz, & Jabeen, 2016; Li, Liao, & Yen, 2013). A high number of connections within the network also suggests that these authors are actively collaborating and contributing to the field.
Egocentric networks that were impacted by the inclusion of hyperauthorship
Without hyperauthor | With hyperauthor: high centrality | With hyperauthor: low centrality |
---|---|---|
High centrality | Node 67 | Node 135 |
Low centrality | Node 16 | Node 3918 |
Node 67 (egonetworks in Figure 6) meets our selection criterion of high centrality in both networks, ranking first without hyperauthorship and third with hyperauthorship. This author produced 747 papers in total, two of which were hyperauthored (one with 27 coauthors and one with 49 coauthors). The egonetwork consisted of 850 unique coauthors and 5,425 edges between them when hyperauthorship was excluded. When hyperauthorship was included, the network grew to 916 unique coauthors and 6,896 edges between them. As shown in Figure 6(a) and (b), the egonetwork grew slightly, with a large cluster of nodes resulting from hyperauthorship.
Node 67 egonetworks. Node colors: ego – light blue; alter – dark blue. Nodes sized by degree centrality. Without hyperauthors: nodes = 850, edges = 5,425; With hyperauthors: nodes = 916, edges = 6,896. (a) Shows network without hyperauthors, weighted using full counting. (b)–(d) Show network with hyperauthors used, weighted using (b) full counting, (c) fractional counting using the Newman algorithm, and (d) the Jaccard algorithm.
Node 135 (egonetworks in Figure 7) provides a case where including hyperauthorship negatively impacted the ego’s centrality in the coauthorship network. This node ranked 28th in degree centrality when hyperauthored works were excluded, but dropped to 80th when they were included. The drop occurred because the author was not involved in any hyperauthored papers: although they published 142 papers (range of coauthors: 0–21) and were central in the network overall, other authors who benefited from hyperauthorship surpassed them in centrality. The author’s own network size therefore did not change, with 364 coauthors in both networks, 2,297 edges in the network without hyperauthorship, and 2,298 edges in the network with hyperauthorship. The one additional edge represents two of the ego’s coauthors who published a hyperauthored paper together that Node 135 was not part of.
Node 135 egonetworks. Node colors: ego – light blue; alter – dark blue. Nodes sized by degree centrality. Without hyperauthors: nodes = 364, edges = 2,297; with hyperauthors: nodes = 364, edges = 2,298. (a) Shows network without hyperauthors, weighted using full counting. (b)–(d) Show network with hyperauthors used, weighted using (b) full counting, (c) fractional counting using the Newman algorithm, and (d) the Jaccard algorithm.
Node 16 (egonetworks in Figure 8) exemplifies how hyperauthorship can benefit an author’s position in the network. The author had 394 papers that were not hyperauthored (range of coauthors: 0–24) and three papers that were hyperauthored, with 49, 70, and 155 coauthors, respectively. In the network without hyperauthorship, they had 276 coauthors and 1,343 edges between them. Including the three hyperauthored papers grew their network to 502 coauthors and 17,284 edges. As a result, this author rose from 52nd to 28th in degree centrality when hyperauthored works were included.
Node 16 egonetworks. Node colors: ego – light blue; alter – dark blue. Nodes sized by degree centrality. Without hyperauthors: nodes = 276, edges = 1,343; with hyperauthors: nodes = 502, edges = 17,284. (a) Shows network without hyperauthors, weighted using full counting. (b)–(d) Show network with hyperauthors used, weighted using (b) full counting, (c) fractional counting using the Newman algorithm, and (d) the Jaccard algorithm.
Node 3918 (egonetworks in Figure 9) exemplifies a case in which hyperauthorship may not influence an author’s overall centrality ranking in the network. This author produced six papers that were not hyperauthored (range of coauthors: 2–5). Their egonetwork without hyperauthorship contained 13 unique coauthors and 36 edges between them. When their one hyperauthored paper (with 54 coauthors) was added, the network grew to 64 unique coauthors and 1,521 edges between them. Even though Node 3918’s degree centrality ranking relative to other authors remained relatively unchanged, their egonetwork grew substantially once the hyperauthored paper was included.
Node 3918 egonetworks. Node colors: ego – light blue; alter – dark blue. Nodes sized by degree centrality. Without hyperauthors: nodes = 13, edges = 36; with hyperauthors: nodes = 64, edges = 1,521. (a) Shows network without hyperauthors, weighted using full counting. (b)–(d) Show network with hyperauthors used, weighted using (b) full counting, (c) fractional counting using the Newman algorithm, and (d) the Jaccard algorithm.
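The rank shifts described for Nodes 135 and 16 can be reproduced in miniature. The sketch below counts each author’s distinct coauthors with and without papers above a size threshold and then ranks authors by that count. The data, the `max_authors` parameter, and the ranking convention (1 plus the number of authors with a strictly higher degree) are assumptions chosen for illustration.

```python
from itertools import combinations


def coauthor_degree(papers, max_authors=None):
    """Number of distinct coauthors per author.

    Papers with more than `max_authors` authors are skipped when a
    threshold is given.
    """
    neighbors = {}
    for authors in papers:
        authors = sorted(set(authors))
        if max_authors is not None and len(authors) > max_authors:
            continue
        for author in authors:
            neighbors.setdefault(author, set())
        for a, b in combinations(authors, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {author: len(others) for author, others in neighbors.items()}


def rank_by_degree(degree):
    """Rank 1 is the highest degree; ties share the same rank."""
    return {
        author: 1 + sum(d > degree[author] for d in degree.values())
        for author in degree
    }


# Hypothetical data: X never joins a large paper, Y joins one 31-author paper.
toy_papers = [
    ["X", "a", "b", "c"],
    ["X", "d", "e"],
    ["Y", "a"],
    ["Y"] + [f"h{i}" for i in range(30)],
]

deg_without = coauthor_degree(toy_papers, max_authors=25)
deg_with = coauthor_degree(toy_papers)
rank_without = rank_by_degree(deg_without)
rank_with = rank_by_degree(deg_with)

print(f"X: degree {deg_without['X']} -> {deg_with['X']}, "
      f"rank {rank_without['X']} -> {rank_with['X']}")
print(f"Y: degree {deg_without['Y']} -> {deg_with['Y']}, "
      f"rank {rank_without['Y']} -> {rank_with['Y']}")
```

In this toy example X’s degree is identical in both networks, yet X falls in the ranking because everyone on the large paper gains 30 coauthors at once, which mirrors the mechanism described for Node 135.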
4.1.1. Impact of weighting functions
Having demonstrated the influence of hyperauthored works on standard metrics of centrality, we next examined the impact of weighting functions on the four egonetworks’ centrality measures and determined whether certain approaches to weighting centrality assessments offer an optimal method to curtail the inflation effects associated with hyperauthorship. We find that weighting by full and fractional counting substantially changes all four average centrality values of the egonetworks in our case study.
As shown in Tables S1–S2 in the Supplementary material, the choices of weighting functions matter to all measures of centrality. In the case of node 67, weighting by full counting yields the highest average degree centrality for networks both without and with hyperauthors. On the other hand, weighting by fractional counting (Newman and Jaccard) brings the average degree centrality down significantly, and even lower than the unweighted degree measures for both networks. Specifically, centrality measures with Jaccard weighting are often the lowest compared to measures from other weighting functions, with the exception of average closeness centrality. The notably high average closeness centrality (173 in the network without hyperauthorship and 180 in the network with hyperauthorship) suggests that there are many authors that receive high scores because their immediate neighbors are well connected. This effect is magnified in the closeness centrality measure when Jaccard is used as edge weight. For nodes 135, 16, and 3918, we observe similar effects of Jaccard weighting on closeness centrality, where the average closeness measures are notably inflated compared to two other weighting functions.
We also examined which weighting function(s) are effective in minimizing the percentage change between the network without hyperauthorship and the network with hyperauthorship. This allowed us to determine the optimal weighting function to mitigate the inflated effects of hyperauthorship. We focus on node 16 and node 3918 (results in Table S2 of the Supplementary material) in this analysis because the effects of hyperauthorship were most profound on their egonetworks. For node 16, Newman weighting was most effective, as the average degree centrality actually decreased (−23%) when hyperauthorship was included. For betweenness centrality, the unweighted measure was preferred, as it best minimized the percentage change between the two networks. For closeness centrality, the unweighted measure yielded the lowest percentage change, while the Newman weighted measure yielded the highest. For eigenvector centrality, Newman weighting was the only function that yielded a decrease (−44%), while the measures under the other weighting functions increased due to hyperauthorship.
We observe similar patterns of results for node 3918, with the exception of betweenness centrality and eigenvector centrality. Jaccard weighting is most effective for betweenness centrality, where it yields the smallest percentage decrease (−79%). Jaccard weighting is also preferred for eigenvector centrality, as it minimizes the impact of hyperauthorship (−36%) compared with the other weighting functions. Newman weighting was the most effective in minimizing the inflated effects of hyperauthorship for degree centrality (−52%), but the least effective for closeness centrality (+559%).
4.2. Discussion and Conclusion
Our structural analysis and egocentric analysis of coauthorship networks revealed notable effects of hyperauthorship on certain centrality measures. First, including even a small number of hyperauthored papers created noise in the overall distribution of the number of authors of a given paper (shown in Figure 5) as well as in the number of coauthors (shown in Figure 4). Second, hyperauthorship inflated the average degree and closeness measures, and deflated the average eigenvector and betweenness centrality measures. Structural measures, including network size, giant component size, density, and average path length, were inflated as well. Interestingly, average clustering remained unchanged, indicating that while the number of ties more than doubled when hyperauthored papers were included, the network did not become more connected in terms of local neighborhoods.
Our approach for establishing a hyperauthorship threshold yielded a cutoff point of 25 authors in our data set. This threshold is similar to the threshold of 20 authors set in Fegley and Torvik’s (2013) and Morris and Goldstein’s (2007) studies. On the other hand, it is substantially smaller than the values of 100 authors and 200 authors determined in Cronin’s (2001) and Milojević’s (2010) studies, respectively. We suspect that our cutoff value differs from those in the literature because of differences in the bibliographic databases from which the data were collected and in the research fields that comprise the data sets. Our data were collected from an internal API that retrieves bibliographic information from Scopus, while others have used PubMed (Fegley & Torvik, 2013), Web of Science (Morris & Goldstein, 2007), or Thomson Reuters (Milojević, 2010). As exemplified in Glänzel and Thijs’s (2004) study, coauthorship dynamics differ significantly across research fields. Our data contain publications specific to the field of genomics, while other studies focus on nanotechnology (Milojević, 2010), library and information science (Cronin, 2001; Morris & Goldstein, 2007), and biomedicine (Fegley & Torvik, 2013). The observed differences also suggest that the appropriate hyperauthorship cutoff point may depend on the distribution of the data set. Therefore, a generalizable approach like ours for handling hyperauthorship data would be beneficial, as it could be applied across different disciplines and sources of bibliometric data.
We also examined whether weighting functions mitigate the impacts of hyperauthorship on centrality network metrics. We compared four different weighting scenarios (i.e., unweighted, weighted based on full counting, weighted based on the Newman method, and weighted based on the Jaccard method) for the entire network and four egocentric networks. The impact of weighting functions at the whole network level was slightly different than the impact at the egocentric network level, and varied based on the centrality measure. For degree centrality, Newman weighting is preferred at both the whole network and egocentric network levels. On the other hand, eigenvector centrality with weighting based on full counting can best curtail the effects of hyperauthorship for both network levels. For betweenness centrality, weighting by full counting is preferred at the whole network level, whereas fractional counting (Newman and Jaccard) is preferred for egocentric networks. For closeness centrality, all weighting methods inflated the measure, and thus no weighting is preferred. Overall, the choices in weighting methods have an observable impact on most centrality measures (except closeness centrality) and their resilience to the inflated effects of hyperauthorship.
Our study contributes to research in bibliometrics and scientific collaboration in specific ways. Our network-analytic approach demonstrates the significant structural impacts that even a small proportion of hyperauthored papers produces at multiple levels of a network, from the egocentric level to the whole network. In particular, we find that degree centrality and closeness centrality are overestimated when hyperauthored papers are included, whereas betweenness and eigenvector centrality are notably underestimated. Thus, our work should be taken as a cautionary message for scholars who are interested in using coauthorship networks to study collaboration. We encourage analysts and readers to think carefully about the nature of the relationships they are studying before deciding whether to include hyperauthored works in their analyses. Furthermore, our findings show that network metrics that are typically used by academic institutions and funders as indicators of research success are susceptible to distortion in the presence of hyperauthorship. In particular, these metrics have been used to infer researchers’ productivity, collaboration effectiveness, and structural positioning relative to peers in their respective fields, and may be used to guide decisions related to promotions, funding, and research rankings (Cummings & Cross, 2003; Larivière & Gingras, 2010). These insights underscore the need for a more nuanced approach to evaluating research productivity, particularly as hyperauthorship becomes increasingly prevalent, and encourage discussion on how best to account for these complexities in research assessments.
Our analysis offers several options for how to handle hyperauthored works in analyses. First, if hyperauthored works are unnecessary for analysis (e.g., because they represent such a small proportion of a data set), or are semantically distinct from the object of study (e.g., if the analyst wishes to use coauthorship as a proxy for close collaborative relationships), we recommend considering removing them from the data set. Our paper offers a generalized and mathematically grounded approach for doing so. Second, if hyperauthored works are important for analysis, we encourage analysts to be mindful when interpreting network statistics that are likely to be inflated by these works. The removal of hyperauthored papers is not ideal in this context, as it may distort the network structure that could otherwise explain certain macro-level mechanisms driving collaborations. Hence, we propose multiple weighting functions to mitigate the impact of hyperauthorship, as opposed to removing hyperauthored papers, and find that weighting based on full counting and Newman-based fractional counting are preferred.
While this study has yielded insights into the impact of hyperauthorship on the collaborative network structures of a group of researchers in a large biological research institute, we acknowledge several limitations in our study design and suggest areas for future research. First, our study relies on data obtained from a research center in genomics, which may limit the generalizability of our findings to other fields and research contexts. In future research, we will broaden the scope of our analysis by incorporating data from other research centers to validate that our method can be reliably applied to other bibliometric data sets. Second, we will compare our threshold approach with existing measures for reducing network sizes without removing hyperauthored works, such as extracting network backbones, skeletons, cuts, cores, and islands (Batagelj, Doreian et al., 2014; Batagelj & Zaveršnik, 2011). Third, we will examine the impact of hyperauthorship on both two-mode paper-author networks and one-mode coauthorship networks in terms of network cohesion, centrality, and topology. Lastly, we will enhance our Newman- and Jaccard-based weighting methodologies by incorporating the weights among the coauthors of the focal authors, including those within the 2-hop and 3-hop neighborhoods. Our initial approach represents a first step in demonstrating how weights can be effectively incorporated for immediate neighbors in the authorship network. The weighting methods proposed in this study may be used to develop metrics for weighted common neighbors in coauthorship networks.
Collectively, our findings are directly relevant to researchers who want to use bibliometrics and network measures to make inferences about scientific collaboration. The removal of hyperauthorship is recommended especially before construction and analysis of coauthorship networks in order to avoid misrepresentation of network density, degree distribution, and centrality ranking of coauthors. Despite the relatively small number of hyperauthored papers in our data set, their impact on the coauthorship network structure, which serves as a proxy for understanding collaboration, is significant. This effect reflects the changing nature of guidelines and norms that determine who qualifies as an author in a scientific publication (Cronin, 2001). As a result, researchers should be cautious about using coauthorship as a proxy for scientific collaboration, as the nature and extent of each author’s contribution in a collaboration can vary widely.
ACKNOWLEDGMENTS
The authors are grateful to the research institute whose data are featured herein for providing staff and financial support to collect and analyze the data herein. Multiple research assistants enabled this work, including Rachel Rosenberg, Tianyi Liang, and Petiya Stoichkova. We thank Dr. Yi-Yun Cheng for their valuable feedback on drafts of this manuscript. This work was also funded by a seed grant provided by UIUC’s campus research board.
AUTHOR CONTRIBUTIONS
Ly Dinh: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Validation, Writing—original draft, Writing—review & editing. William C. Barley: Conceptualization, Funding acquisition, Investigation, Resources, Writing—original draft, Writing—review & editing. Lauren Johnson: Conceptualization, Resources, Validation, Writing—review & editing. Brian F. Allan: Conceptualization, Funding acquisition, Investigation, Resources, Writing—original draft, Writing—review & editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
This work was funded by a seed grant provided by University of Illinois Urbana-Champaign’s campus research board.
DATA AVAILABILITY
The data and code used in this project are openly accessible to facilitate reproducibility and further adoption. In accordance with ethical considerations and IRB regulations, the data shared strictly adhere to the de-identification of any sensitive or personally identifiable information to protect the privacy of the researchers in the data set. Hence, the data are a de-identified network edgelist where papers and authors are given unique IDs. The code is licensed under MIT License, and may be freely used and modified, with attribution to the original work: Dinh, L., Barley, W. C., Johnson, L., & Allan, B. F. (2024). Dataset for manuscript: Hyperauthored papers disporportionately amplify important egocentric network metrics. Zenodo. https://doi.org/10.5281/zenodo.10668904.
REFERENCES
Author notes
Handling Editor: Vincent Larivière