Reproducible science of science at scale: pySciSci

Abstract Science of science (SciSci) is a growing field encompassing diverse interdisciplinary research programs that study the processes underlying science. The field has benefited greatly from access to massive digital databases containing the products of scientific discourse—including publications, journals, patents, books, conference proceedings, and grants. The subsequent proliferation of mathematical models and computational techniques for quantifying the dynamics of innovation and success in science has made it difficult to disentangle universal scientific processes from those dependent on specific databases, data-processing decisions, field practices, etc. Here we present pySciSci, a freely available and easily adaptable package for the analysis of large-scale bibliometric data. The pySciSci package standardizes access to many of the most common data sets in SciSci and provides efficient implementations of common and advanced analytical techniques.


INTRODUCTION
Science of science (SciSci) as a discipline has grown rapidly over the last century, reflecting an increasing interest in quantitatively modeling the processes underlying science, from the novelty of scientific discoveries to the interconnectivity of scientists. The increasing prevalence of SciSci research is due in large part to the availability of large-scale bibliometric data capturing the products of scientific discourse, including publications, patents, and funding. Jointly with the analysis of scientific processes, such bibliometric data are used to map the evolution of specific fields, evaluate scientific performance and eminence, and support government policy and funding decisions (Fortunato, Bergstrom et al., 2018; Wang & Barabási, 2021; Wu, Kittur et al., 2022). However, bibliometric data are distributed across diverse databases, each with its own criteria for inclusion and its own processes to assure the data's quality and accuracy (Csiszar, 2017). The manifold uses and applications of bibliometric data, combined with the call for reproducible and replicable science, have prompted the need for flexible analyses that can be reliably reproduced across multiple data sets.
Here, we introduce pySciSci, an open-source Python package for the analysis of large-scale bibliometric data. The pySciSci package provides
• standardized preprocessing and access to many of the most common data sets in SciSci;
• an extensive library of quantitative measures fundamental to SciSci; and
• advanced methods for mapping bibliometric networks.
The pySciSci package is intended for researchers of SciSci working from complete bibliometric databases or those who wish to integrate large-scale bibliometric data into other existing projects. By creating a standardized and adaptable programmatic base for the study of bibliometric data, we intend to help democratize SciSci, support diverse research efforts based on bibliometric data sets, and address calls for open access and reproducibility in the SciSci literature and community (Light, Polley, & Börner, 2014).
To the best of our knowledge, our package constitutes one of the most comprehensive collections of methods and data sources in scientometrics and bibliometrics. It complements and extends the capabilities of the Bibliometrix (Aria & Cuccurullo, 2017), BiblioTools (Grauwin & Jensen, 2011), and Citan (Gagolewski, 2011) libraries to multiple databases and more advanced metrics. Two of the most popular bibliometric programs, VOSviewer (van Eck & Waltman, 2010) and CiteSpace (Chen, 2006), are designed to provide graphical network maps of science, but neither program is open source and modifiable. Several programs are much more specialized than pySciSci and focus on implementations of method families for specific tasks (Moral-Muñoz, Herrera-Viedma et al., 2020); for example, CRXexplorer analyzes a publication's distribution of reference years (Marx, Bornmann et al., 2014), and the open-source Python package ScientoPy offers tools specifically for topical trend analysis (Ruiz-Rosero, Ramírez-González, & Viveros-Delgado, 2019). Our package also complements the CADRE (Mabry, Yan et al., 2020) environment built to host bibliometric data sets. Ultimately, our goal is not to supplant these other efforts to provide access to SciSci research but to facilitate a unified and generalizable open-source environment across different databases and methods of analysis.

THE pySciSci PACKAGE
The pySciSci package is built around Python Pandas data frames (McKinney, 2010), providing the simplicity of Python with the increased speed of SQL relational databases. pySciSci provides a standardized interface for working with several of the major data sets in the Science of Science, including the Microsoft Academic Graph (MAG), the Web of Science (WOS), the American Physical Society (APS), PubMed, the DBLP Computer Science Bibliography (DBLP), and OpenAlex (Priem, Piwowar, & Orr, 2022). Each data set is referenced in pySciSci as a customized variant of the BibDataBase class, which handles all data loading and preprocessing. For an example of loading and preprocessing each database, we include a Getting Started Jupyter notebook in the examples directory. The storage and processing frameworks are highly generalizable and can be extended to other databases not mentioned above (e.g., United States Patent Office, Scopus, Lens).
The pySciSci pipeline starts by preprocessing raw data into a standardized tabular format (Figure 1). The package creates several relational data tables based on a balance between commonly associated data fields and memory footprint. Bibliometric records are split into five types of entities: publication, author, affiliation (institution), journal/venue, and field of study. The primary unit of analysis in pySciSci is the publication, a catch-all term encompassing scientific articles, preprints, patents, books, conference papers, and other bibliometric products disseminated as a single entry in a database. The publication objects are stored in their own data table, publication. As the year of publication is the most commonly used publication property, the mapping of publications to year is also replicated in its own Python dictionary, pub2year, for quick reference. Depending on the specific database, the author names, affiliation names, and journal names may be available and are stored in their own data tables: author, affiliation, and journal, respectively. In some databases, data fields represent expert-curated entries, and in other databases, data fields may be algorithmically inferred by the database curators; see the specific database references for details. Finally, three relational tables are built to link the entities: pub2ref captures reference and citation relationships between publications; publicationauthoraffilliation links publications, their authors, and the author affiliations; and pub2field links publications to their fields of study. The preprocessing step that builds the data tables only needs to be run once for each database.
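The relational layout described above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the package's own loader: the column names (PublicationId, Year, CitingPublicationId, CitedPublicationId) and the toy records are hypothetical and chosen only to show how a publication table, a pub2ref table, and a pub2year dictionary relate to one another.

```python
import pandas as pd

# Toy publication table: one row per publication (column names are assumptions).
publication = pd.DataFrame({
    "PublicationId": [1, 2, 3, 4],
    "Year": [2001, 2003, 2005, 2005],
})

# Relational table linking each citing publication to the publications it references.
pub2ref = pd.DataFrame({
    "CitingPublicationId": [2, 3, 3, 4],
    "CitedPublicationId": [1, 1, 2, 1],
})

# The publication-to-year mapping replicated as a plain dictionary for quick lookup.
pub2year = dict(zip(publication["PublicationId"], publication["Year"]))

# Total citations per publication, counted directly from the relational table.
citation_counts = pub2ref.groupby("CitedPublicationId").size().to_dict()

print(pub2year)
print(citation_counts)
```

Keeping the links in a separate two-column table means that the same rows serve as a reference list when grouped by the citing publication and as a citation list when grouped by the cited publication.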
After extracting the data tables, the pySciSci package precomputes several of the most common and useful bibliometric properties that form the backbone of many more advanced methods. For example, if the author information is available, the team size (number of authors) is found for all publications (Wuchty, Jones, & Uzzi, 2007). When the reference/citation information is available, the number of citations within a user-defined window (default 10 years) is also precomputed (Wang, 2013). Finally, when both the author and reference/citation information are available, the pySciSci package will archive a copy of the reference/citation relationships in which self-citations are removed, pub2ref_noself.
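The three precomputed properties named above (team size, citations within a fixed window, and a citation table without self-citations) can be sketched in pandas as follows. The table and column names are illustrative assumptions, and the self-citation rule used here (the citing and cited publications share at least one author) is one common convention rather than a statement of how the package implements it.

```python
import pandas as pd

# Toy inputs; all names and values are hypothetical.
pub2year = {1: 1990, 2: 1995, 3: 2005, 4: 1998}
pub2ref = pd.DataFrame({
    "CitingPublicationId": [2, 3, 4],
    "CitedPublicationId": [1, 1, 2],
})
pub_author = pd.DataFrame({
    "PublicationId": [1, 1, 2, 3, 4],
    "AuthorId": ["a", "b", "a", "c", "d"],
})

# Team size: number of distinct authors on each publication.
team_size = pub_author.groupby("PublicationId")["AuthorId"].nunique()

# Citations received within a fixed window (here 10 years) after publication.
window = 10
years = pub2ref.assign(
    CitingYear=pub2ref["CitingPublicationId"].map(pub2year),
    CitedYear=pub2ref["CitedPublicationId"].map(pub2year),
)
in_window = years[(years["CitingYear"] - years["CitedYear"]).between(0, window)]
c10 = in_window.groupby("CitedPublicationId").size()

# Self-citations: the citing and cited publications share at least one author.
authors_of = pub_author.groupby("PublicationId")["AuthorId"].apply(set)
is_self = pd.Series(
    [
        bool(authors_of[citing] & authors_of[cited])
        for citing, cited in zip(
            pub2ref["CitingPublicationId"], pub2ref["CitedPublicationId"]
        )
    ],
    index=pub2ref.index,
)
pub2ref_noself = pub2ref[~is_self]
```

In this toy example the citation from publication 2 to publication 1 is dropped because author "a" appears on both, and the citation from publication 3 falls outside the 10-year window.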
To facilitate data movement and lower memory overhead when the complete tables are not required, the pySciSci preprocessing step chunks the data tables into smaller tables. When loading a table into memory, the user can quickly load the full table by referencing the table name as a database property, or specify multiple filters to load only a subset of the data. The pySciSci package also supports dask dataframes (Rocklin, 2015), which add parallelization and block scheduling, allowing large dataframes to be processed without loading the full dataframe into memory.

Quantitative Science Studies
Due to variations in data coverage between databases, the available package functionality will vary between data sets. For example, the DBLP database does not provide citation relationships between publications, and the APS database does not disambiguate author careers. The pySciSci package supports methods to link bibliometric entities between databases, and the framework easily facilitates augmenting a database with additional data sources, allowing for enriched analysis (Gates, Gysi et al., 2021).
Our distribution of pySciSci is accompanied by a growing library of Jupyter notebooks that illustrate its basic functionalities and usage. We also encourage the SciSci community to contribute their own implementations, data, use cases, or attempts to reproduce key results from the Science of Science.

PUBLICATIONS AND CITATIONS
The coverage of bibliometric databases varies, with some focusing only on a narrow subset of publications defined by journal or field, and others attempting to encompass all peer-reviewed scientific communication. As shown in Figure 2A, the number of publications and temporal coverage vary dramatically between four common databases. This variability reflects important decisions about data quality and generalizability that a researcher must make; for example, DBLP provides user-curated author careers in computer science, but does not contain citation information, whereas MAG contains a wide range of document types from all of science, with algorithmically inferred fields and author career information. With few exceptions, these databases focus on English-language publications, offering only sparse coverage of publications in other languages. The pySciSci package facilitates restricting each database to specific document types, fields, or years, allowing researchers more control over the publications and authors under study.
Citation analysis is the examination of the frequency, patterns, and networks of citation relationships between publications. Some citation measures have become commonplace, with many implementations available; others are precomputed by major database portals based on proprietary algorithms; and still others require complex processing and computational steps that have largely inhibited their general usage (Bollen, Van de Sompel et al., 2009). The pySciSci package facilitates the analysis of total citation counts for publications, as well as citation time series, fixed time window citation analysis, citation count normalization by year and field, and fractional citation contribution based on team size. Due to the package's modular design, the choice of citation count and normalization is made before calculating specific metrics. The package also includes a simplified interface for fitting models to citation time series, such as in the prediction of the long-term citation counts of a publication (Wang, Song, & Barabási, 2013), or in the assignment of the sleeping beauty score (Ke, Ferrara et al., 2015). Exemplar code illustrating citation metrics can be found in the examples folder.
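As an illustration of a metric computed from a citation time series, the sleeping beauty score mentioned above can be sketched in a few lines. The sketch follows the beauty coefficient as defined by Ke et al. (2015): the yearly citation curve is compared with the straight line joining the publication year to the year of peak citations. The function name and the input format (a list of yearly counts starting at the publication year) are assumptions for this example and do not describe the package's interface.

```python
def sleeping_beauty_coefficient(yearly_citations):
    """Beauty coefficient B for yearly citation counts c_0, c_1, ..., c_T.

    B sums, for every year t up to the peak year t_m, the gap between the
    line joining (0, c_0) and (t_m, c_tm) and the observed count c_t,
    divided by max(1, c_t).
    """
    c = list(yearly_citations)
    if not c:
        return 0.0
    # Year of maximum yearly citations (first occurrence if there are ties).
    t_m = max(range(len(c)), key=lambda t: c[t])
    if t_m == 0:
        # The peak occurs in the publication year, so there is no sleeping period.
        return 0.0
    slope = (c[t_m] - c[0]) / t_m
    return sum(
        (slope * t + c[0] - c[t]) / max(1, c[t])
        for t in range(t_m + 1)
    )


# A publication that is ignored for four years and then suddenly cited.
print(sleeping_beauty_coefficient([0, 0, 0, 0, 10]))
```

A publication whose citations grow linearly up to the peak has a coefficient of zero, while long dormancy followed by a sharp peak gives a large positive value.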
Due to the prevalence of citation metrics as measures of scientific prominence, techniques for "gaming the system" that inflate an author's citation metrics for reasons other than scientific impact have flourished. For instance, it has been found that men tend to cite themselves more often than women, contributing to widening gender imbalances in scientific impact (King, Bergstrom et al., 2017). Consequently, one of the primary preprocessing steps for contemporary citation analysis is the removal of self-citations occurring between publications by the same author. All analysis facilitated in the pySciSci package can be run either with or without the self-citations when authors are available in the database.
The comparison of citation counts between different disciplines and fields is complicated by differing citation norms and community sizes (Radicchi, Fortunato, & Castellano, 2008). Therefore, it is common to normalize citation counts by field or year averages to create a common reference point, or to rank publications to identify "top publications" in the top 1% or 5% of publications from a field. The citation_rank function facilitates the ranking of publications by different citation metrics and groups. We also provide extended normalization measures that account for a publication's interdisciplinarity by controlling for citation patterns in the immediate cocitation neighborhood.
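The two normalizations described above, division by the field and year average and ranking within a field and year, can be sketched with pandas group operations. This is an independent illustration rather than the citation_rank function itself; the table, its column names, and the 0.9 cutoff used to pick "top publications" in such a small toy sample are assumptions.

```python
import pandas as pd

# Toy table of publications with raw citation counts (names are hypothetical).
pubs = pd.DataFrame({
    "PublicationId": [1, 2, 3, 4, 5, 6],
    "Field": ["physics", "physics", "physics", "biology", "biology", "biology"],
    "Year": [2000, 2000, 2000, 2000, 2000, 2000],
    "Citations": [2, 4, 6, 10, 30, 50],
})

group = pubs.groupby(["Field", "Year"])["Citations"]

# Relative citation count: raw count divided by the field-year average.
norm = pubs["Citations"] / group.transform("mean")

# Percentile rank within the field-year group (1.0 is the most cited).
rank = group.rank(pct=True)

pubs = pubs.assign(NormCitations=norm, CitationRank=rank)

# "Top publications" of each field-year group under the chosen cutoff.
top = pubs[pubs["CitationRank"] >= 0.9]
print(top[["PublicationId", "Field", "NormCitations"]])
```

After normalization, the physics publication with 4 citations and the biology publication with 30 citations both score 1.0, which is the common reference point that makes the two fields comparable.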
The diversity of disciplines or journals reflected in a publication's reference and citation relationships has been used to quantify the publication's interdisciplinarity or novelty (Gates, Ke et al., 2019; Porter & Rafols, 2009; Stirling, 2007; Uzzi, Mukherjee et al., 2013). The pySciSci package provides several measures of interdisciplinarity, including the Rao-Stirling diversity index, the Gini coefficient, Simpson's diversity index, and entropy measures, which can be computed using the distribution of publication references or publication citations. For example, consider the publication shown in Figure 3A, with five references in three disciplines: physics, biology, and economics. The Rao-Stirling reference interdisciplinarity, calculated using the raostriling_interdisciplinarity function, reflects the diversity of the disciplines referenced by the publication (left, 0.34), and the Rao-Stirling citation interdisciplinarity reflects the diversity of the disciplines citing the publication (right, 0.37). The pySciSci package also facilitates the computation of publication novelty and conventionality as measured by atypical combinations of journals in the reference list using the novelty_conventionality function (Uzzi et al., 2013). Other measures based on the local citation graph capture the disruptive influence of a publication as measured by the frequency with which the publication is cited alongside its own references (Funk & Owen-Smith, 2017; Park, Leahey, & Funk, 2023; Wu, Wang, & Evans, 2019), calculated by the disruption_index function. For example, in Figure 3A, four of the citing publications also cite three of the references, resulting in a disruption index of 0.2. Exemplar code for the analysis of publication interdisciplinarity can be found in the examples folder.
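Two of the measures above can be written down compactly. The sketch below is a standalone illustration, not the package's raostriling_interdisciplinarity or disruption_index functions. It assumes the Rao-Stirling index is the sum of p_i * p_j * d_ij over ordered pairs of distinct fields, with a user-supplied distance d_ij, and the disruption index is (n_i - n_j) / (n_i + n_j + n_k), where n_i counts publications citing only the focal publication, n_j those citing both the focal publication and at least one of its references, and n_k those citing only its references. The unit distances and the toy citation pairs are placeholders, so the values do not reproduce those of Figure 3A.

```python
from collections import Counter


def rao_stirling(fields, distance):
    """Sum of p_i * p_j * d_ij over ordered pairs of distinct fields."""
    counts = Counter(fields)
    total = sum(counts.values())
    p = {field: n / total for field, n in counts.items()}
    return sum(p[i] * p[j] * distance[i][j] for i in p for j in p if i != j)


def disruption_index(focal, citations):
    """Disruption of `focal` given an iterable of (citing, cited) pairs."""
    refs_of = {}
    for citing, cited in citations:
        refs_of.setdefault(citing, set()).add(cited)
    focal_refs = refs_of.get(focal, set())

    n_i = n_j = n_k = 0
    for pub, refs in refs_of.items():
        if pub == focal:
            continue
        cites_focal = focal in refs
        cites_refs = bool(refs & focal_refs)
        if cites_focal and not cites_refs:
            n_i += 1  # cites the focal publication but none of its references
        elif cites_focal and cites_refs:
            n_j += 1  # cites the focal publication and at least one reference
        elif cites_refs:
            n_k += 1  # cites a reference but not the focal publication
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else float("nan")


# Five references across three fields, with placeholder unit distances.
names = ["physics", "biology", "economics"]
unit_distance = {a: {b: 1.0 for b in names} for a in names}
reference_fields = ["physics", "physics", "biology", "biology", "economics"]
print(rao_stirling(reference_fields, unit_distance))

# Focal publication F cites R1 and R2; A and B cite only F, C cites F and R1, D cites only R2.
pairs = [("F", "R1"), ("F", "R2"), ("A", "F"), ("B", "F"),
         ("C", "F"), ("C", "R1"), ("D", "R2")]
print(disruption_index("F", pairs))
```

With unit distances the Rao-Stirling index reduces to one minus the sum of squared field shares; a field distance derived from citation or cocitation similarity would lower the contribution of closely related fields.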

PUBLICATION GROUPS: AUTHORS, JOURNALS, FIELDS AND AFFILIATIONS
The next unit of analysis aggregates publications into groups by common author, journal, discipline/field, or affiliation. For example, the infamous journal impact factor considers the group of all publications from the same journal over a fixed time window (typically 2, 3, or 5 years), and is found by averaging their citation counts (Bordons, Fernández, & Gómez, 2002). The pySciSci package implements over 12 citation metrics for groups of publications, which can be easily applied to journal, author, discipline/field, or affiliation aggregations when available in the database. Combined with the different normalization decisions for citation counts, the pySciSci package implements nearly 200 different measures for scientific impact.
At the heart of scientific discoveries are the scientists themselves. Consequently, the sociology of science has analyzed scientific careers in terms of individual incentives, productivity, competition, collaboration, and success. The pySciSci package facilitates author career analysis through both aggregate career statistics and temporal career trajectories. We implement more than 10 metrics for author citation analysis, including the h-index (Hirsch, 2005), author_hindex, and Q-factor (Sinatra, Wang et al., 2016), author_qfactor. The package also includes a simplified interface for fitting models to author career trajectories, such as identifying topic switches (Zeng, Shen et al., 2019), the assessment of yearly productivity patterns (Way, Morgan et al., 2017), or the hot-hand effect (Liu, Wang et al., 2018).

For example, consider Derek de Solla Price's publication career as represented in the MAG, shown in Figure 3B. It captures the citations received by 71 articles and books published over 50 years (even though Dr. de Solla Price died in 1983, articles can be reprinted or published posthumously). Using this career trajectory, we find that Dr. de Solla Price has an h-index of 20 and a Q-factor of 14. Exemplar code for the analysis of author careers can be found in the examples folder.
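For readers unfamiliar with the h-index used in this example, its definition (Hirsch, 2005) is the largest h such that the author has h publications with at least h citations each. A minimal standalone sketch is given below; it is not the package's author_hindex function, and the citation counts are invented for illustration.

```python
def h_index(citation_counts):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Five publications with invented citation counts.
print(h_index([10, 8, 5, 4, 3]))
```

Sorting first makes the definition easy to check by eye: the h-index is the last rank at which the citation count is still at least as large as the rank.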
Greater scrutiny is being given to the prevalence of systematic bias in science (Saini, 2019), supported by observations that, for example, female authors have fewer publications than their male colleagues (Larivière, Ni et al., 2013; Xie & Shauman, 1998). Although most databases do not include author biographical information (gender, race, age, position, sexual orientation, etc.), the pySciSci package facilitates linking user-provided biographical information to author careers. Implementations are then available for advanced measures of inequality, including the measurement of categorical bias in reference lists (Dworkin, Linn et al., 2020), or career lengths (Huang, Gates et al., 2020).
In addition, studying the movement of scientists between institutions and countries requires longitudinal data capturing the changes in affiliation throughout a career. When the affiliations are disambiguated, the pySciSci package allows for the construction of collaboration and mobility networks between affiliations. These affiliations can be aggregated to the city, state, and country level, allowing for large-scale analysis of global patterns in scientific production and impact.

NETWORK ANALYSIS
Scientific discoveries and careers do not exist in isolation; rather, science evolves as a conversation between scientists, empowered by links between authors, publications, institutions, and other entities. Consequently, many key results from SciSci consider publications, authors, or fields as embedded in a complex web of interrelationships. The pySciSci package provides a flexible interface for working with networked bibliometric data. First, the bibliometric relationships are processed to extract the edge list representation of the network. The package then maps these edge lists to an adjacency matrix, treated internally as a scipy sparse matrix, which is a memory-efficient and highly flexible network representation. All network relationships can be further unraveled over time by considering snapshots of the network for each year. pySciSci facilitates basic network measures, including the number of connected components, extraction of the largest connected component, threshold filtering, disparity filter (Serrano, Boguñá, & Vespignani, 2009), and analysis of degree distributions (Barabási, 2016). The scipy sparse adjacency matrix can also be directly imported into many of the most common packages for more advanced network analysis and visualization.
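The path from an edge list to a sparse adjacency matrix and on to basic network measures can be sketched directly with scipy. The example below assumes that nodes have already been relabeled as 0-based integers and uses an invented toy edge list; it illustrates the representation described above rather than the package's own network interface.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components

# Toy undirected edge list over six nodes (node 5 is isolated).
edges = np.array([[0, 1], [1, 2], [3, 4]])
n_nodes = 6

# Map the edge list to a sparse adjacency matrix and symmetrize it.
weights = np.ones(len(edges))
adj = sparse.coo_matrix(
    (weights, (edges[:, 0], edges[:, 1])), shape=(n_nodes, n_nodes)
).tocsr()
adj = adj + adj.T

# Number of connected components and the component label of every node.
n_components, labels = connected_components(adj, directed=False)

# Nodes belonging to the largest connected component.
largest = np.argmax(np.bincount(labels))
lcc_nodes = np.flatnonzero(labels == largest)

# Degree of every node, the input to a degree distribution analysis.
degree = np.asarray(adj.sum(axis=1)).ravel()

print(n_components, lcc_nodes, degree)
```

Because only nonzero entries are stored, the same code scales to networks with millions of nodes as long as the number of edges fits in memory.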
One of the most common bibliometric networks is the coauthorship network, in which nodes represent authors and two authors are linked if they coauthored a publication (Barabási, Jeong et al., 2002; Gold, Gates et al., 2022; Newman, 2004). Coauthorship networks are used to capture general patterns of collaboration, including how many different people an author publishes with, how often an author's collaborators are also each other's collaborators (network clustering), what the typical network-based distance between authors is (average path length), and how patterns of collaboration vary between fields and over time. Given a subset of publication and author relations, the coauthorship_network function can build both the static and temporal coauthorship networks. Exemplar code for the analysis of coauthorship networks can be found in the examples folder.
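The construction of a static coauthorship edge list from publication and author relations can be sketched as a self-join in pandas. This is a simplified illustration and not the coauthorship_network function; the column names and the toy records are assumptions.

```python
import pandas as pd

# Toy publication-author relations (column names are hypothetical).
pub_author = pd.DataFrame({
    "PublicationId": [1, 1, 1, 2, 2, 3],
    "AuthorId": ["a", "b", "c", "a", "b", "d"],
})

# Pair every author with every other author on the same publication.
pairs = pub_author.merge(pub_author, on="PublicationId", suffixes=("_1", "_2"))

# Keep each unordered pair once and drop pairs of an author with themselves.
pairs = pairs[pairs["AuthorId_1"] < pairs["AuthorId_2"]]

# Edge weight: the number of publications the two authors wrote together.
coauthor_edges = (
    pairs.groupby(["AuthorId_1", "AuthorId_2"]).size().reset_index(name="Weight")
)
print(coauthor_edges)
```

A temporal version follows the same pattern: with a publication year column carried through the join, adding the year to the grouping keys gives one weighted edge list per year.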
The scientific community's perception of which publications are most related to each other is reflected in the publication cocitation network (Boyack & Klavans, 2010; Gates, Ke et al., 2019). Here, nodes represent publications and two publications are linked if they are both cited by another publication. For example, consider the cocitation network shown in Figure 4, in which nodes come from the set of publications that cite Stirling (2007). The cocitation network shows three distinct clusters of publications, each of which is enriched by a subset of related fields (Computer Science, Economics, Sociology), and a fourth that features Other publications. Indeed, modularity maximization using the Louvain heuristic (Blondel, Guillaume et al., 2008; Newman, 2006) identifies four communities. The similarity of the publication fields and the detected communities can be assessed using the element-centric similarity (Gates & Ahn, 2019; Gates, Wood et al., 2019), a measure between 0 and 1, where 1 captures that the two network communities are identical, and 0 reflects two network communities that group publications very differently. The element-centric similarity between the publication fields and the detected communities is 0.33, reflecting a modest level of agreement. Decomposing the error terms into contributions from publications in different fields, we find that the majority of the error arises from the Other publications (0.26), whereas publications in Economics and Computer Science are more faithfully recovered (0.45 and 0.35, respectively). This cocitation network analysis demonstrates how the diversity measure introduced in Stirling (2007) has impacted three distinct scientific communities. Exemplar code for the analysis of cocitation networks can be found in the examples folder.
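The cocitation relationship defined above has a compact sparse-matrix form: if C is the citation matrix with C[i, j] = 1 when publication i cites publication j, then the off-diagonal entries of the product of C transposed with C count how many publications cite both j and k. The sketch below illustrates this identity on invented data with 0-based integer publication ids; it is not the package's own cocitation routine.

```python
import numpy as np
from scipy import sparse

# Toy (citing, cited) pairs: publication 0 cites 3 and 4, publication 1
# cites 3, 4, and 5, and publication 2 cites 5.
citing = np.array([0, 0, 1, 1, 1, 2])
cited = np.array([3, 4, 3, 4, 5, 5])
n_pubs = 6

# Citation matrix: rows are citing publications, columns are cited ones.
C = sparse.csr_matrix(
    (np.ones(len(citing), dtype=int), (citing, cited)), shape=(n_pubs, n_pubs)
)

# Entry (j, k) of C^T C counts the publications that cite both j and k.
cocite = (C.T @ C).tolil()
cocite.setdiag(0)  # the diagonal holds plain citation counts, not cocitations
cocite = cocite.tocsr()
cocite.eliminate_zeros()

print(cocite.toarray())
```

The resulting symmetric weighted matrix can be passed to a community detection routine; thresholding the weights first removes pairs that are cocited only once.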
Citation networks form the basis for collective measures of scientific impact. For example, the collective assignment of credit to a publication's authors can be measured by the frequency with which an author's other publications are cocited alongside the focus publication (Shen & Barabási, 2014). The pySciSci package algorithmically calculates the collective credit allocation temporally for each year since the article's publication.
Advances in statistical learning methods for graph embedding allow networks to be represented in high-dimensional metric spaces (Goyal & Ferrara, 2018). Such graph embedding methodologies provide compressed representations of the original network that can be used, for example, to predict new connections based on node similarities (Martinez, Berzal, & Cubero, 2016). The pySciSci package provides implementations of the node2vec (Grover & Leskovec, 2016) graph embedding method and its extension for authors, persona2vec (Yoon, Yang et al., 2021), which produces effective representations of scientific journals (Peng, Ke et al., 2021) and author mobility (Murray, Yoon et al., 2020). Exemplar code for graph embedding can be found in the examples folder.

SUMMARY AND DISCUSSION
Here we introduced the open-source pySciSci Python package for bibliometric data analysis. Due to its modular structure, the pySciSci framework is highly generalizable and can easily accommodate many available data sets beyond those mentioned here. The package also provides efficient implementations of common and advanced SciSci methods, facilitating reproducible analysis across multiple data sets. Most importantly, it is our hope that this package stimulates other researchers to add their own methods and facilitates large-scale collaborations throughout the Science of Science community.

Figure 1. Data processing overview. The pySciSci package preprocesses many of the common bibliometric data sources (A) into a standardized set of relational tables (B). The package also cleans and precomputes measures that are frequent building blocks for more advanced computations (C). The pySciSci package provides efficient implementations for many advanced metrics focusing on publications, authors, or journals (D), as well as advanced network analysis (E).

Figure 2. Growth of science across databases. The database size as measured by the number of (A) publications, (B) journals, and (C) authors varies over several orders of magnitude between the APS (gold), DBLP (purple), MAG (dark green), and WOS (teal). As the APS does not include disambiguated author careers, it does not appear in (C).

Figure 3. Advanced career and publication metrics. (A) The pySciSci package captures several advanced characterizations of a publication's influence (references) and impact (citations), including the disruption index and Rao-Stirling interdisciplinarity. (B) The package also facilitates the analysis of full author careers and summarizing metrics such as total productivity, h-index, and Q-factor.

Figure 4. Cocitation network. The interdisciplinary impact of a publication is illustrated through the cocitation network between citing articles. Here nodes are publications that cited Stirling (2007). Two nodes are linked if some other publication cited both. Node color reflects the publication's discipline: (yellow) computer science, (magenta) economics, (blue) sociology, and (green) other. The three prominent clusters reflect the fact that Stirling (2007) impacted three distinct communities of researchers.