Abstract
Software is a central part of modern science, and knowledge of its use is crucial for the scientific community with respect to reproducibility and attribution of its developers. Several studies have investigated in-text mentions of software and its quality, while the quality of formal software citations has only been analyzed superficially. This study performs an in-depth evaluation of formal software citation based on a set of manually annotated software references. It examines which resources are cited for software usage, to what extent they allow proper identification of software and its specific version, how this information is made available by scientific publishers, and how well it is represented in large-scale bibliographic databases. The results show that software articles are the most cited resource for software, while direct software citations are better suited for identification of software versions. Moreover, we found current practices by both publishers and bibliographic databases to be unsuited to represent these direct software citations, hindering large-scale analyses such as assessing software impact. We argue that current practices for representing software citations—the recommended way to cite software by current citation standards—stand in the way of their adoption by the scientific community, and urge providers of bibliographic data to explicitly model scientific software.
1. INTRODUCTION
Software is an important part of modern science and contributes to the provenance of research results. From the microscopic perspective, identification of the particular software, and its specific version, that was used for a respective study is important to allow for the reproduction of the results. The macroscopic perspective on provenance enables large-scale analyses of software impact similar to impact factors for scholarly publications and thus allows credit to be given to the developers and funders of the software and permits analysis of the patterns of software usage across research domains. In general, software is considered one of the main pillars of science—besides articles and data—as it contains the logic of data transformation (Di Cosmo, Gruenpeter et al., 2020). Therefore, it is advocated that its contribution to research should be indicated and formally cited (Katz, Hong et al., 2021; Smith, Katz, & Niemeyer, 2016). The value of software is also recognized by different stakeholders in the scientific community, with journal policies requiring indication of software usage, and funders requiring researchers to make developed software and source code available as research results. Moreover, researchers themselves have taken up the role of software developers, with 84% reporting that developing software is essential for their research (Goble, 2014), while there is also an increasing need for funding allocated for the development of research software (Chue Hong, 2016).
Software development cannot be properly recognized without being included in measures of impact (Wright, Nagle, & Greenstein, 2023). In science, such measures rely on bibliographic analysis. With respect to research software, however, citation analyses are currently limited to in-text software mentions, and have been performed by automatic identification of software mentions in the full-text documents (Du, Cohoon et al., 2022; Istrate, Li et al., 2023; Schindler, Bensmann et al., 2022). On the one hand, this is due to historic reasons, because the high impact of software on research has only recently been recognized by the scientific community and formal software citation has only recently been advocated. On the other hand, it is not clear whether current citation practices and the infrastructure for citation analysis are suited to represent software, and whether they could be utilized for such analyses. In particular, software should not be cited through a proxy, such as a software article, but directly to allow its proper identification. This is crucial because aspects of software citation differ from the citation of articles, with the specific requirements of proper software citation defined by citation guidelines (Katz et al., 2021). Specifically, versioning—essential for provenance and reproducibility—is not considered in article citation. Hence, technical updates might be necessary to create suitable and machine-readable software representations (Stall, Bilder et al., 2023).
Bibliometric and citation analyses are performed with the aid of providers of bibliometric data and are based on the structured data they provide. Semantic Scholar (Kinney, Anastasiades et al., 2023) or Crossref (Hendricks, Tkaczyk et al., 2020), for instance, provide powerful application programming interfaces (APIs) to access the already preprocessed data about millions of articles. Such infrastructure should also form the basis for integrating software into bibliometric analyses. However, besides the structure of the provided data, such analyses also heavily depend on the data quality (Haustein & Larivière, 2014). With respect to software citations (and citations in general), this includes different stages of data collection and processing, beginning with the authors of scholarly publications who use software and provide all information necessary to identify the particular software and the actual code base1, continuing with publishers who provide structured data, and ending with bibliographic databases that collect and provide the data, which are later used for scientometric studies. According to Batini, Cappiello et al. (2009), measurements of data quality consist of different dimensions, including accuracy and completeness, where the first describes the correctness of the provided data and the second describes whether all necessary information is covered. Regarding software citations, completeness can be interpreted with respect to the identification of the software and the particular code base. Accuracy, in contrast, describes to what extent processed data (i.e., provided by databases) reflect the original content as provided by the authors.
In this article, we analyze the data quality of software citations across the entire data lifecycle, beginning with references as initially provided by the authors, by publishers, and finally by two major databases for bibliographic data. All the analyses we perform are based on a high-quality, manually annotated data set, established in the scope of this work by extending the existing gold standard corpus SoMeSci of software mentions in scholarly publications. We first examine what exactly formal software citations refer to and investigate the completeness of such citations with respect to the particular software, software creator attribution, and the particular software version. Finally, we evaluate whether the bibliometric databases Semantic Scholar and Crossref can actually be used to estimate the impact that software might have.
The results of our analyses show that formal software citations most frequently refer to software articles, illustrating the importance of such articles as a surrogate citation target for the software. While this typically does not help to identify the specific software version used, it certainly allows the identification of the software itself and provides credit for its development. When using direct software citations, we find that only about two-thirds of them allow the identification of the actual software code base. With respect to bibliometric data providers, we find that significant parts of direct software citations are not represented or contain errors. We presume that algorithms for matching references to scholarly articles often produce wrong results when applied to direct software references and, therefore, conclude that such databases are currently not suited for large-scale analyses of software citation patterns. Furthermore, our work shows how different stakeholders in science—especially authors, software developers, and providers of bibliographic data—can contribute to improving the traceability and identification of software in scientific literature.
2. RELATED WORK
As outlined in Section 1, software is ubiquitous in data-driven science, and knowledge of its use is essential for the scientific community. Recent work has found that software is either mentioned informally within the full-text document of scientific articles or formally with a bibliographic reference, with the first practice being more common (Du et al., 2022; Howison & Bullard, 2016; Schindler, Bensmann et al., 2021a; Schindler et al., 2022). The analysis of informal mentions has been the subject of multiple investigations, either performing high-quality manual analyses on small corpora (Du, Cohoon et al., 2021; Howison & Bullard, 2016; Nangia & Katz, 2017; Schindler et al., 2021a) or automatic large-scale analyses (Duck, Nenadic et al., 2016; Pan, Yan et al., 2015; Schindler et al., 2022; Schindler, Zapilko, & Krüger, 2020). The reported results often vary greatly due to different underlying data and the chosen approach. The scientific domain, for instance, has a strong influence on software usage, with reported values ranging from 0.2 software mentions per article in Economics up to 30.8 in Bioinformatics. Some work has further included analyses on how often formal citations are provided together with informal mentions, with results varying due to the approach and underlying data. Howison and Bullard (2016) report that 44% of informal mentions include a formal citation, while Schindler et al. (2021a) report 16%, 24.8% are found in the data of Schindler et al. (2022), and Du et al. (2022) report 18%. Only the work of Howison and Bullard (2016) and Du et al. (2022) further investigates the formal citations themselves. They distinguish the resource types behind formal citations and report that 84% and 89% of citations, respectively, refer to articles. Howison and Bullard (2016) further identified 5% of citations referring to software manuals and 11% to software directly, while Du et al. (2022) report 8% referring to software directly. In this work, further analyses of resource types for software citations are performed to show the validity of the data.
Identification, credit, and provenance are three central aspects for software citation (Smith et al., 2016; Soito & Hwang, 2016). The identification of software is considered possible when the provided metadata allow the software used to be uniquely determined. Software names are, in general, insufficient for this purpose because they have been shown to be ambiguous (Duck, Kovacevic et al., 2015; Schindler et al., 2022) and can potentially refer to legacy software that is no longer findable by name. Therefore, the use of persistent unique identifiers is advocated for software citations (Katz et al., 2021; Smith et al., 2016; Soito & Hwang, 2016). Credit and attribution for the development of software is important for multiple stakeholders who have an interest in assessing the impact of software, including software developers and research funders. Software has not consistently been treated as a citable resource by the scientific community (Bouquin, Chivvis et al., 2020), which made it hard to assess its impact and to provide proper credit for its costly development (Mayernik, Hart et al., 2017). In general, proper attribution of a software developer can be challenging when multiple people or instances with different contributions are involved (Katz & Smith, 2015), or even impossible for open source projects. The use of software is part of research’s provenance; therefore, not only the software but also its specific development state—referred to as its code base in this work and usually indicated by a version—needs to be uniquely identifiable by the metadata provided with a software citation because most software is under constant development and changing in its range of functions and behavior (Katz et al., 2021; Smith et al., 2016). In general, the development state can either be uniquely identified by version numbers assigned by the developer or by a release date corresponding to a version (Katz et al., 2021). The completeness of informal software mentions with respect to identification, credit, and provenance has received some attention in the existing literature (Du et al., 2021, 2022; Howison & Bullard, 2016; Schindler et al., 2021a, 2022), which found that current mention practices often lack information. Again, only the work of Howison and Bullard (2016) and Du et al. (2022) takes formal references into account in this context, by including the information in formal references when determining the overall completeness of software citations. Moreover, Du et al. (2022) explicitly provide analyses of formal citations and report that 35% include a version and 78% identify software developers. In this work, we perform further systematic analyses of the completeness of formal software citation in terms of identification, credit, and provenance to gain a better understanding of software citation practices in scientific literature.
Large-scale scientometric analyses are commonly performed with bibliographic databases (Cho & Yu, 2018; Dion, Sumner, & Mitchell, 2018; Napolitano, Xu, & Gao, 2022; Peroni, Ciancarini et al., 2020), as they provide the necessary structured metadata for formal citations. Software could be included in such analyses if its formal citations are represented within those databases and correctly structured semantically. A semantic representation is important because metadata that is not correctly structured can become useless for downstream tasks. Proper representation of the software name and version, for instance, is crucial for tracking of software usage, and is necessary for the disambiguation of citation targets by data providers themselves. As described, Stall et al. (2023) argue that updates to existing infrastructure might be necessary for this purpose. We analyze how well software references are represented within the state of the art bibliographic databases Semantic Scholar2 and Crossref3 to assess how well they can be used as scientometric resources with respect to the analysis of scientific software usage. Semantic Scholar is a discovery service for scientific literature (Wade, 2022) developed and maintained by the Allen Institute for Artificial Intelligence (AI2). The service is based on the Semantic Scholar Academic Graph (S2AG) containing metadata of scientific publications for 205 million publications and 2.5 billion citation edges (Wade, 2022). A major aspect of Semantic Scholar is to integrate machine learning methods to enhance data quality and search. They did, for instance, develop a system for publication deduplication named S2APLER and perform citation linking based on fuzzy text-matching heuristics (Kinney et al., 2023). Data are provided free and open by Semantic Scholar and can be accessed via API. Semantic Scholar is widely used as a search mechanism for academic publications, while the underlying knowledge graph also enables scientometric analyses (Napolitano et al., 2022).
Crossref is the result of a joint effort by an association of publishers (Lammey, 2015) with over 17,000 members as of June 2023, with the goal to improve linking between publications made by heterogeneous publishers. The main application of Crossref is a database of metadata for scholarly articles and professional materials that enables unique identification of the resources covered by incorporating and introducing persistent identifiers. The corresponding bibliographic information is integrated into the database by publishers with central quality control by Crossref. Moreover, metadata is enriched by Crossref, mainly by adding citation links, but also through adding further information, such as funder registry information or journal classification codes (Hendricks et al., 2020). Crossref makes the database openly available without any license restriction through an API, and therefore enables scientometric analyses. The resource is widely used for this purpose—for instance, for citation analyses by Dion et al. (2018) and Peroni et al. (2020) and citation network analyses by Cho and Yu (2018).
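Both services expose their metadata through public web APIs. As a minimal sketch of the kind of access such analyses rely on, the following Python snippet retrieves the reference list of one article from Crossref and from the Semantic Scholar Graph API; the endpoints and field names follow the public API documentation, and the example DOI is a placeholder rather than an article from this study.

```python
import requests

doi = "10.1234/example.doi"  # placeholder DOI, not an article from this study

# Crossref: /works/{doi} returns the article metadata, including its 'reference' list
cr_resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
cr_resp.raise_for_status()
crossref_refs = cr_resp.json()["message"].get("reference", [])

# Semantic Scholar Graph API: list the references of a paper identified by DOI
s2_resp = requests.get(
    f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}/references",
    params={"fields": "title,year,externalIds"},
    timeout=30,
)
s2_resp.raise_for_status()
s2_refs = [entry["citedPaper"] for entry in s2_resp.json().get("data", [])]

print(f"Crossref: {len(crossref_refs)} references, Semantic Scholar: {len(s2_refs)} references")
```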
3. ANALYSES
The goal of this study is to investigate the data quality of formal software citation in science based on manually annotated, high-quality data. This section outlines the four main research questions and the analyses employed to investigate them:
What types of resources are referenced by formal software citations?
Is software formally cited without being mentioned in the full-text document?
Do formal software references provide all necessary information to identify software, developer, and the code base used?
How well are formal software references represented in bibliographic databases?
3.1. Citation Resource Types
The first analysis investigates what type of resource is referenced by the bibliographic entry associated with in-text software mentions in scientific publications. Different resources can be referenced within these bibliographic entries because different software citation practices exist in the scientific community. Not all of these practices are suited for formal software citation because they might not provide all the necessary information to identify the software. Analyzing them, therefore, allows us to assess current practices and can reveal shortcomings.
The resource type is analyzed by determining the distribution of resource types in the data set introduced in Section 4. Furthermore, the resource type is analyzed with respect to the metadata provided in the context of the corresponding informal in-text software mention because authors using unsuited citation types that do not provide all required metadata to identify software could systematically add the missing information in the full-text document. The relevant resource types were defined based on the previous work of Howison and Bullard (2016), who distinguished between citations to publications, user manuals, and project names or websites, and further extended based on observations made during data annotation.
3.1.1. Direct software citations
These describe the cited software itself and are the recommended way to cite software by recently established software citation guidelines (Katz et al., 2021). An example of a Direct Software Citation (hereafter referred to as Direct Citation) is given in Listings 1 and 2, corresponding to the bibliographic entry as available in the PDF publication and in the Journal Article Tag Suite (JATS) XML. Properly executed Direct Citations capture all the metadata of software that is required for unique identification of the software, its developer, and the exact code base. In practice, not all required information may be present and the references themselves can be structured arbitrarily, as there is no commonly agreed citation style for direct software citations, which is further investigated in Section 3.3.
3.1.2. Software articles
These are scientific articles, published by the developers of scientific software to describe it, that are cited in place of the software (Howison & Bullard, 2016). An example is provided in Listing 3. The practice is common as it allows developers of scientific software to receive scientific attribution for the costly development of the research software; historically, the publication of software itself has been considered a weaker contribution than the publication of an article (Hafer & Kirkpatrick, 2009). Software articles are among the most highly cited scientific papers (Van Noorden, Maher, & Nuzzo, 2014) and specific journals for publishing software articles have been established (e.g., The Journal of Open Research Software or Source Code for Biology and Medicine). Software articles are cited in the same way as other scientific articles without software-specific information. Therefore, information identifying the code base, such as the version or release date, is generally missing from this citation type.
3.1.3. Software manuals
These are textual instructions for using software, and, particularly for commercial software, they are often the closest textual document associated with the software. It is established practice to cite such manuals instead of the software itself, with an example given in Listing 4. Corresponding references are formatted for citing a text source, with information typically provided for an online source such as a URL and date of access. As with software articles, this citation type omits crucial information that more closely describes the corresponding software.
3.1.4. Websites
Websites associated with software are sometimes cited instead of the software. The corresponding references are structured as typical online resources, potentially providing the date of access. An example is provided in Listing 5. As with the other citation styles, relevant information specifying the software itself is missing. As Direct Citations often include URLs, it is important to distinguish them from Websites. Here, all cases where additional information about the software is provided (beyond name, URL, and date of access) are considered as Direct Citations.
3.1.5. Other
This category covers the rare instances in which no verifiable resource is described by a reference. Such cases do occur in practice and are likely to result from faulty automatic citation recommendations or author errors.
3.2. Formal Software Reference Without In-Text Software Mentions
The second part of the analysis investigates whether software is formally referenced even if it is not mentioned within an article’s full-text document. In theory, it is possible that authors formally cite software but do not state the name of the software, for instance, if they replace the software name with a generic term such as “source code.” Howison and Bullard (2016) report that generic terms make up only 1% of overall software mentions, but there could be further reasons why software is formally cited but not mentioned. As this aspect has—to the best of our knowledge—not yet been analyzed, we investigate whether this practice exists and include the resulting set of references in the analyses for completeness. However, we consider only Direct Citations, Websites, and Manuals because it is not feasible to annotate Software Articles, as explained in Section 4.1. Analyzing this practice is highly relevant to formal software citation because it allows a better assessment of its importance for software traceability: in the described cases, software usage is only identifiable through formal citations, which is not considered by current methods of analyzing software in scientific publications.
3.3. Direct Citation Completeness
The third part of the analysis investigates the completeness of Direct Citations, Manuals, and Websites4 in terms of metadata describing the software as provided by the authors of the scientific publication, who have the responsibility for providing complete information to identify the software. In general, Direct Citations are the recommended practice for software citation (Katz et al., 2021) because they allow unique identification of software, developer, and the exact code base. However, it has not yet been analyzed what metadata is actually provided in practice, aside from the version and developer (Du et al., 2022). Therefore, completeness is analyzed in terms of: Name, Creator, Identifier, Archive, URL, Release Date (exact or only year), Version, Date of Access, Type of Citation, and Description, by analyzing the number of cases where the information was provided. An Identifier is defined as a specific unique identifier for software (e.g., an RRID); an Archive is assumed to be a persistent link to a repository where the software is published; a URL is any other link that is provided; and the Type of Citation is usually provided to identify the type of source that was cited (e.g., “[Source Code]” or “[Software]”), as shown in Listings 1 and 5. These specific metadata were selected as they are the recommended information to be provided for software citations to allow proper identification of the software, where Identifier, Archive, or URL can be applied in the given priority depending on how the software was published, while a Date of Access and Version both allow unique identification of the exact software development state, and Type of Citation as well as Description are considered as optional information (Katz et al., 2021).
Further, it is analyzed whether the referenced software is identifiable, whether its developer can be attributed, and whether the specific code base can be determined. Software is considered as identifiable if either Identifier, Archive, or Name and Creator are provided. Furthermore, references providing URLs are considered as conditionally identifiable because URLs are not persistent and commonly become invalid over time, with the effect of link rot regarding scientific data having been shown in prior work by Lakic, Rossetto, and Bernstein (2023). Moreover, they can link to different resources associated with the software (e.g., the download website, reference manuals, or the creator). The information for proper attribution is considered given if the software is identifiable and its Creator is stated. The code base refers to the exact software development state, and is defined as identifiable when either Release Date or Version number is given. A Date of Access is considered as conditionally identifiable under the assumption that the newest available release was used. The release year is considered as insufficient as multiple minor versions or even more than one major version can exist in the same year. Lastly, the overall completeness of software citations is analyzed if the software has a corresponding informal in-text software mention because authors could choose to provide part of the metadata within the full-text document and part of the metadata in the formal citation.
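The decision rules above can be stated compactly in code. The following sketch applies them to a single annotated reference represented as a dictionary; the field names are illustrative and do not correspond to the actual annotation format used in this study.

```python
def assess_reference(ref: dict) -> dict:
    """Apply the identifiability rules of Section 3.3 to one annotated reference.

    `ref` maps metadata fields ('identifier', 'archive', 'name', 'creator', 'url',
    'version', 'release_date', 'date_of_access') to values, or None if absent.
    """
    def has(field: str) -> bool:
        return bool(ref.get(field))

    # Software is identifiable via an Identifier, an Archive, or Name plus Creator;
    # a URL only makes it conditionally identifiable, since links may rot or point
    # to resources other than the software itself.
    identifiable = has("identifier") or has("archive") or (has("name") and has("creator"))
    conditionally_identifiable = identifiable or has("url")

    # Attribution requires identifiable software with a stated Creator.
    attributable = identifiable and has("creator")

    # The code base is identifiable via a Version or an exact Release Date;
    # a Date of Access only counts under the assumption that the newest
    # release available at that date was used.
    code_base = conditionally_identifiable and (has("version") or has("release_date"))
    code_base_conditional = code_base or (conditionally_identifiable and has("date_of_access"))

    return {
        "software_identifiable": identifiable,
        "software_conditionally_identifiable": conditionally_identifiable,
        "creator_attributable": attributable,
        "code_base_identifiable": code_base,
        "code_base_conditionally_identifiable": code_base_conditional,
    }
```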
This analysis is of central importance because authors have the main responsibility for providing complete information in their software citations, which provides the basis for representation by publishers and literature databases. It can reveal how well software is formally cited, and builds the basis to formulate recommendations for potential improvements. These can serve as a basis for funders or journals to update their policies and require proper attribution of software usage by formal citation as recommended.
3.4. Database Accuracy
The fourth analysis investigates how well Direct Citations are represented by publishers and especially within scientific bibliographic databases. There are several studies analyzing informal software mentions in scientific literature (see Section 2), while no large-scale analysis of formal references has been performed. To implement such bibliometric analyses, scientists typically employ large-scale bibliographic databases, while the information in the databases is based on the information provided by scientific publishers, who need to mark up the metadata provided by authors in a suitable manner. However, it is not clear how Direct Citations are represented by both publishers and databases because the structure of software citations differs strongly from that of the scientific publications they usually represent (Stall et al., 2023). Regarding publishers, it is quantitatively analyzed how software citations are structured from the publishers’ side to assess the quality of the semantic representation. With respect to bibliographic databases, we analyze quantitatively what information is available within the databases, whether the information is represented in a structured manner that would allow a systematic analysis, and whether the information provided within the database is correct. For correctness, it is considered whether the information from the originally provided reference differs from the information contained in a database, but also whether new information, which can potentially be added by a database, is correct. All aspects are examined for all individual references by comparing the information provided by the publisher and the state-of-the-art bibliographic databases Crossref and Semantic Scholar.
This analysis is essential with respect to formal software citations because it allows us to assess whether they can be systematically utilized by the scientific community based on the current infrastructure of publishers and bibliographic databases. This could, in turn, allow an extension and enhancement of current analyses of software citation. On the other hand, the analysis has the potential to gather insights on the current shortcomings of the representation of formal software citations and provides the basis to update the infrastructure of bibliographic data providers.
3.5. Confidence Intervals (CIs)
Confidence Intervals are used for reporting statistical results, with the underlying data being either multinomial or binary. Multinomial CIs are calculated based on the method proposed by Glaz and Sison (1999) and Sison and Glaz (1995) for the calculation of simultaneous CIs. For binary variables, CIs are based on the binomial distribution and calculated as Wald intervals, using a normal approximation as described by Wallis (2013).
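For a binary outcome observed in $n$ samples with estimated proportion $\hat{p}$, the Wald interval takes the familiar normal-approximation form

$$\hat{p} \;\pm\; z_{1-\alpha/2}\,\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},$$

where $z_{1-\alpha/2} \approx 1.96$ for the 95% intervals reported throughout this article.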
4. DATA SET
A high-quality data set to analyze all aspects outlined in Section 3 was established, with quality control at each annotation step. The data set was based on the SoMeSci corpus of informal software mentions in scientific publications, which was extended to also cover formal citations. SoMeSci is a manually annotated data set with data quality ensured by a high Inter Annotator Agreement (IAA) of κ = .82. It covers several aspects of software mentions within scientific literature, of which we utilize the information on informal software mentions and their associated formal citations. The approach of extending SoMeSci was chosen as the annotation performed required considerable manual effort (details described below), which could be systematically reduced based on SoMeSci without compromising gold standard annotation quality.
Overall, SoMeSci contains 1,367 articles in four sets with varying sources and annotation properties: PLoS methods includes 480 methods sections from articles published by the open access scientific publisher PLoS; PLoS sentences includes selected sentences from 677 articles published by PLoS; PubMed fulltext includes 100 full-text publications from the PubMed Central Open Access (PMC OA) set; and Creation sentences includes selected sentences from 110 articles selected from PLoS and PMC OA that specifically publish software. Within those articles, 3,756 in-text mentions of software and 591 corresponding in-text citations are contained. An example of such an annotation is illustrated in Figure 1 with the corresponding bibliographic entry provided in Listing 6.
4.1. Annotation
Data were systematically annotated to answer all the research questions outlined in Section 3. An overview of the data annotation for analyses and the corresponding data sources is given in Figure 2. As high-quality data is required to make reliable statements, data quality was evaluated at every step throughout the annotation process. The quality was assessed by calculating the Inter Annotator Agreement (IAA) based on Cohen’s κ (Cohen, 1960) to account for chance agreement for the categorical annotation tasks. Particularly challenging annotations with insufficient agreement were handled by double annotation and reannotation of diverging cases.
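Cohen’s κ relates the observed agreement between the two annotators to the agreement expected by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed proportion of matching annotations and $p_e$ is the proportion expected if both annotators assigned labels independently according to their marginal label distributions.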
4.1.1. Software citation types
As outlined in Section 3.1, these were manually annotated for all references associated with an in-text software mention in SoMeSci (Figure 2, left). The reference content itself is not contained in SoMeSci and was obtained from the publishers on January 22, 20225 and the reference type was then annotated based upon it. An initial overlapping annotation of 10% of data showed an IAA of κ = .75, which can be considered substantial agreement (Landis & Koch, 1977) but was determined as insufficient for analysis. Therefore, all samples were annotated by two annotators. To test the consistency of the annotation we also evaluated the final overall agreement at a value of κ = .76. Differences were then discussed and reannotated to ensure high data quality.
4.1.2. Formal software citation without in-text mentions
To identify cases where software is formally referenced without being mentioned in-text (Figure 2, middle-left), all 28,903 bibliographic references in the combined 579 articles of the PLoS Methods and PubMed Fulltext sets of SoMeSci were annotated for citation type. The remaining articles were not included to keep the number of references and the corresponding annotation cost at a feasible level. The annotation was implemented in two steps to improve recall and annotation efficiency, and was performed based on the reference JATS information available from the publisher (as illustrated in Listings 2, 4, 5). It was performed with a simple tool that displays the reference information and allows point-and-click annotation. In the first step, it was only annotated whether a reference was potentially relevant; in the second step, the citation type was annotated for the marked references. Only the citation types Direct Citation, Manual, and Website were considered, as Software Articles can generally not be distinguished from other articles based on the reference information alone. In the first step, the set of references was reduced to 1,392 references. The quality of this annotation was measured by assessing the recall with respect to the references connected to in-text software mentions, which are known to be present in the reference lists. Overall, 74 of 75 (99%) known Direct Citations, 24 of 25 (96%) Manuals, and 5 of 5 (100%) Websites were successfully identified, which was considered satisfactory quality.
4.1.3. Direct citation completeness
This was annotated on all available Direct Citations, Manuals, Websites, and Other references from the first two annotation steps (detailed results for the first annotations are provided in Sections 5.1 and 5.2; Figure 2, middle-right). The corresponding reference texts were manually extracted from the published PDF documents in February 2023. All information introduced in Section 3.3 was then annotated using the annotation tool BRAT (Stenetorp, Pyysalo et al., 2012). An example of the annotation is provided in Listing 1. The annotation was performed by two annotators and the IAA was initially estimated on 10% overlap at a value of κ = .82, which is considered in the range of almost perfect agreement (Landis & Koch, 1977). The remaining data were then annotated by a single annotator, while challenging cases were further discussed throughout the process.
4.1.4. Database accuracy
This was annotated for the same references as the completeness analysis, based on the JATS information available from publishers and on the reference entries provided by Semantic Scholar and Crossref, as described in Section 3.4 (Figure 2, right). The publishers’ references were automatically gathered, while the corresponding Semantic Scholar and Crossref entries were manually gathered in August 2022. Automatic collection was not possible due to partially missing entries in both databases that hindered precise matching. Additionally, multiple entries per reference are in some cases present in Semantic Scholar, which were all extracted and annotated separately. The extracted information was then annotated for the same information considered for citation completeness, described above. To assess the quality of data representation and capture potential errors in reference entries, specific tags were introduced in the annotation:
unstructured marks information that is not individually labeled within the database but is instead part of a single field containing multiple pieces of information about the software. Entries can be partially structured (e.g., the creator and publication date being labeled) while the version and software name remain unstructured within one field;
wrong place indicates that information is structured but with a false underlying concept (e.g., a creator being labeled as a publication venue);
wrong content indicates that the information in a database is wrong;
incomplete content indicates that some information is only partly presented, and part of the original information is lost;
duplicate indicates duplicate information introduced by a database. Note that this entry refers to duplicate information within one reference, not the duplicate entries for one reference within Semantic Scholar mentioned above.
4.1.5. Annotation effort
Overall, considerable annotation effort was necessary to generate the high-quality data set described above. The annotation of 603 references for citation type took an estimated 30 seconds per reference, summing to 11 hours for two annotators and the subsequent reannotation of 68 references. The annotation of ≈29,000 references to identify software citations without plain-text mentions is estimated to take 5 seconds per reference in the first run, summing to 40 hours, and 20 seconds in the second run, summing to 4 hours. The annotation of plain-text citation completeness for all Direct Citations is estimated to take around 2 minutes per reference for the 205 references, summing to 8 hours, including the overlap for quality control. The final annotation of JATS and database entries—including gathering the corresponding entries—is estimated to take around 7 minutes per reference, summing to 48 hours as all references were examined by two annotators, with an additional 2 hours for manual identification of database entries. In total, this amounts to 113 hours spent on annotation and quality control.
5. RESULTS
In this section the results of the analyses outlined in Section 3 are presented, addressing each of the four main research objectives individually. All results are based on the manually annotated data, for which the annotation process is outlined above.
5.1. Citation Resource Types
The resource types for bibliography entries connected to in-text software mentions were systematically annotated to analyze their distribution. Overall, 603 entries were annotated based on the original SoMeSci annotations of 591 in-text citations connected to software. The numbers differ because multiple reference entries can be referenced by one in-text citation string within SoMeSci. For the following analyses, nine duplicate entries in which the same software was cited twice in one article were excluded. Further, all 30 references contained in the SoMeSci Creation Sentences set were excluded because new research software is developed in their scope and we argue that including them could add a bias as authors publishing software might be more particular about software citation than other authors. Moreover, 25 (4%) additional references had to be excluded from the analysis because the bibliography entries were not related to the in-text software mention, even though the citation was directly associated with the software in the full-text document. Further investigation into the underlying reasons showed that 12 (2%) cases described prior use cases for the software, seven (1%) were article errors, either by authors or publishers, where all citation numbers in the document were mixed up, and six (1%) were entirely unrelated to the software.
The distribution of resource types as introduced in Section 3.1 is illustrated in Figure 3. The annotation results show that most references are Software Articles in 375 or 69.6% (95% CI: [65.9, 73.6]) cases, followed by Direct Citations in 120 or 22.3% (95% CI: [18.6, 26.3]) and Manuals in 35 or 6.5% (95% CI: [2.8, 10.5]) cases. Websites and Other references were only found in five or 0.9% (95% CI: [0, 5]) and four or 0.7% (95% CI: [0, 4.8]) cases, respectively.
It was further investigated whether there is an interaction between the citation type and the metadata provided within the full-text document of a publication, because authors might provide the information inherently missing from resource types such as Software Article in the full-text document of a publication. Therefore, the number of stated versions, release dates, developers, and URLs in the full-text document is compared with respect to the type of formal citation, including all mentions that were not formally cited as an additional class. Versions and releases are summarized under version if at least one of both is given because releases are rare in the SoMeSci data set. Further, the types Website and Other are excluded because there are too few data points for them. The results are illustrated in Figure 4. The results show that fewer versions are mentioned with software articles, with 31.3% (95% CI: [26.7, 36.0]) as compared to mentions without formal citations with 51.5% (95% CI: [49.4, 53.7]) but also in comparison with direct citations with 59.8% (95% CI: [51.1, 68.5]). We employed a chi-square test, χ2(1, N = 2,482) = 51.8, p < .001, to test whether the number of provided versions systematically differs between not cited software and software cited by a software article, and use Cramer’s V to estimate the effect size, V = 0.15. The test shows that significantly fewer versions are mentioned in-text when software is cited by a software article, with a small effect size. Developers are mentioned less often in all cases where software is formally cited (direct 7.4% (95% CI: [2.7, 12.0]), manual 0%, software articles 3.7% (95% CI: [1.8, 5.6])) as compared to not formally cited software (36%, 95% CI: [34.0, 38.1]); however, most citation types, including software articles, provide developer attribution through the reference itself. The in-text mention of URLs is at a similar level between software cited with software articles and software that was not cited, while URLs were never provided in-text when software was cited directly or through a manual. A similar picture is found for alternative names.
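For readers wishing to reproduce this type of test, the following sketch computes a chi-square statistic and Cramér’s V from a 2 × 2 contingency table with SciPy; the cell counts are placeholders and not the exact counts underlying the reported values.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: cited via software article vs. not formally cited;
# columns: version mentioned in-text vs. no version (placeholder counts).
table = np.array([[120, 280],
                  [1060, 1020]])

chi2, p, dof, _ = chi2_contingency(table)

# Cramér's V for an r x c table: sqrt(chi2 / (N * (min(r, c) - 1)))
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2({dof}, N={n}) = {chi2:.1f}, p = {p:.3g}, V = {cramers_v:.2f}")
```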
5.2. Formal Software Citations Without In-Text Software Mentions
All references within the SoMeSci PLoS Methods and PubMed Fulltext sets were analyzed to determine if software is formally cited without being mentioned in the article’s full-text document. All contexts of identified references were further manually examined to determine the reasons why the software was not mentioned in-text. In total, 32 formal software citations were identified within the references of all articles. However, 11 of these citations are actually connected to in-text software mentions but appear in nonannotated parts of the SoMeSci PLoS Methods set. The remaining 21 are not connected to in-text software and are contained within 17 articles. Closer examination of the corresponding reference contexts revealed that two of the cases are due to annotation errors in the original SoMeSci data, and five can be attributed to errors within the articles (e.g., mixed up references, as described in Section 5.1).
The remaining 14 cases reflect the citation practice of interest where software is formally cited without in-text mention and consist of seven Direct Citations, four Manuals, and three Websites. To assess the extent of this practice, these numbers are considered in relation to the number of overall formal citations within the analyzed articles from PLoS Methods and PubMed Fulltexts. This amounts to 8.5% of Direct Citations (82 overall, of which 75 have an in-text mention), 13.8% of Manuals (29 overall, 25 with in-text mentions), and 37.5% of Websites (eight overall, five with in-text mentions). Note that the sample size for Manuals and Websites is quite small and that all additionally identified Websites result from the same article. The manual analyses of the underlying citation practices showed that in four cases generic terms were mentioned instead of the software name (e.g., “processing was done with [19],” where [19] is the software citation), in seven cases the use of software was not indicated at all, and in three cases knowledge from the software was cited (e.g., the FAQ of the software being referenced instead of the software).
5.3. Citation Completeness
The completeness of formal software citations was analyzed for 153 Direct Citations—including eight Website and four Other citations as incomplete Direct Citations—and 44 Manuals identified in Sections 5.1 and 5.2, but, as in Section 5.1, excluding eight references from the Creation sentences set. The corresponding results are summarized in Figure 5. Regarding Direct Citations, we found that the Name, Creator, and Publication Year of software are commonly mentioned in 146 or 94.8% (95% CI: [91.3, 98.3]), 141 or 91.6% (95% CI: [87.2, 95.9]), and 132 or 85.7% (95% CI: [80.2, 91.2]) instances, respectively. Version, Description, and URL are less common, with 100 or 64.9% (95% CI: [57.4, 72.5]), 77 or 50% (95% CI: [42.1, 57.9]), and 62 or 40.3% (95% CI: [32.5, 48.0]) of instances, while the Type of Citation and Date of Access are only rarely provided, in 44 or 28.6% (95% CI: [21.4, 35.7]) and 25 or 16.2% (95% CI: [10.4, 22.1]) of cases. Release Date (3, 2% with 95% CI: [0, 4.13]), Identifier (1, 0.6% with 95% CI: [0, 1.92]), and Archive (1, 0.6% with 95% CI: [0, 1.92]) were only sporadically found. Regarding Manuals, most results are at a comparable level, with the exception of Version, which is less often contained in Manual citations, at 25% (95% CI: [12.2, 37.8]). Exact results for Manuals are provided in the Supplementary material6.
The analysis was extended to further cover whether software is identifiable, whether the creator can be attributed, and whether the code base can be identified, as defined in Section 3.3. The corresponding results are illustrated in Figure 6. Regarding Direct Citations, software can be identified with high confidence in 132 or 87.4% (95% CI: [82.1, 92.7]) of cases based on the mention of Name and Creator. Archive and Identifier are only stated in one case each and overlap with mentions of Name and Creator. Furthermore, if cases where a URL is provided are considered as identifiable, the overall number of identifiable cases increases to 149 or 98.7% (95% CI: [96.9, 100]). However, URLs are often not persistent or might only point to the developer instead of the software, which makes it risky to assume that they always allow identification. The creator can be attributed in 132 or 87.7% (95% CI: [82.5, 92.9]) of cases, that is, where the software is identifiable and a Creator is provided. The exact code base can be identified with high confidence in 101 or 65.6% (95% CI: [58.1, 73.1]) of cases, with Versions being provided in 99 cases and Release Dates in three, with one case overlapping. Note that the software itself also has to be identifiable to identify the code base. Therefore, the numbers for code base identification are based on the 98.7% of software that was found to be identifiable before. If a Date of Access (available in 25 cases) is considered sufficient for code base identification, under the assumption that the newest version available at the date of access was used, the code base can be identified in 113 or 73.4% (95% CI: [66.4, 80.4]) of cases. Regarding Manual citations, the values for software identification and creator attribution are at equal levels; however, the value for code base identification is lower, with only 25% (95% CI: [12.2, 37.8]) and 27.3% (95% CI: [14.1, 40.4]) being identifiable without and with considering a Date of Access, respectively. This value is mostly caused by the lower number of provided versions, as outlined above. Exact results for Manuals are provided in the Supplementary material6.
As in Section 5.1, completeness was further investigated including metadata provided with in-text software mentions. It is possible that metadata describing a software are provided in the full-text document instead of the software citation. While this would have the drawback of making the in-text information not directly identifiable, it would still mean that the required information to describe the software has been provided within an article. Therefore, the formally and informally provided information is compared and aggregated to determine if completeness can be gained by observing both. This was only performed for samples that have an in-text mention and are annotated in SoMeSci, therefore excluding the samples described in Section 5.2. Further, only the metadata of Version/Release, Creator, and URL is considered, as the remaining information does not overlap between the annotations. The results are given in Figure S5 of the Supplementary material. By definition, the software Name is always given for informal mentions in the SoMeSci annotation; therefore, the overall share of provided names within these samples is 100%. Creators are only rarely provided in-text when software is cited, in 6.9% (95% CI: [2.6, 11.3]) of cases for Direct Citations and in no cases for Manuals. All of the cases where a developer was mentioned in-text overlapped with cases where the Creator was also provided in the formal reference, therefore not improving the overall coverage. Versions are quite often provided with an informal mention when software is cited, in 58.5% (95% CI: [50, 66.9]) of cases for Direct Citations and 47.1% (95% CI: [30.3, 63.8]) for Manual citations. Further, by aggregating the formal and informal information, the overall coverage of versions improves to 80% (95% CI: [73.1, 86.9]) for Direct Citations and to 55.9% (95% CI: [39.2, 72.6]) for Manuals. URLs were never mentioned in the full-text document when software was formally cited by either a Direct Citation or a Manual.
5.4. Database Accuracy
The quality of database representation was evaluated on the same references as the completeness analysis, plus the set of references from the Creation Sentences, because only the representation of the information is investigated here, not its amount. As outlined in Section 4.1, it is investigated which information is available from the different databases, whether all available information is covered, how it is structured, and whether it is correct. The quality of database representation was investigated individually for the considered information (e.g., Name, Version, Developer) and is illustrated in newly established, adapted alluvial plots that compare the availability, structure, and correctness of individual references between the publisher’s JATS information, Semantic Scholar, and Crossref. Potential errors that can occur in a database representation are illustrated in Listing 9, including unstructured representation, incomplete representation of information, errors in information, and addition of wrong information. As described in Section 3.4, multiple entries for one reference can exist in Semantic Scholar. Overall, 40 (24.8%) out of 161 represented citations have duplicate entries, with 33 (27% of 121) in Direct Citations and seven (17.5% of 40) in Manuals. For all following analyses, the most complete entry, covering the most relevant information, was selected.
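A simple way to implement this selection is to count the relevant fields each duplicate entry actually fills and keep the entry with the highest coverage, as sketched below; the field list and entry format are illustrative assumptions rather than the exact procedure used here.

```python
RELEVANT_FIELDS = ("title", "authors", "year", "venue", "externalIds", "url")

def most_complete(entries: list[dict]) -> dict:
    """Return the duplicate entry that fills the most relevant metadata fields."""
    return max(entries, key=lambda e: sum(1 for f in RELEVANT_FIELDS if e.get(f)))
```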
5.4.1. Database errors
Before analyzing the individual information, some general analyses were performed. In both databases, missing entries for references were identified, where it is necessary to distinguish two cases of missing references: entire articles missing and individual references missing. References that are missing because the entire article is not contained in a database are ignored because this is a problem of overall coverage and does not provide any information about the quality of software citation representation. However, individual references missing, even when an article is represented, can point to a problem regarding software citation representation and need to be investigated. Overall, 36 of 157, 22.9% (95% CI: [16.4, 29.5]) of Direct Citations and five of 45, 11.1% (95% CI: [1.93, 20.3]) of Manual references were individually missing from Semantic Scholar. In Crossref, eight of 155, 5.2% (95% CI: [1.7, 8.6]) of Direct Citations and six of 47, 12.8% (95% CI: [3.2, 22.3]) of Manuals were missing7. To investigate if this is a systematic bias concerning software citations, we further investigated what proportion of Software Articles is missing from the databases, serving as a sample of regularly published articles. We identified four of 382, 1.1% (95% CI: [0, 2.1]) and one of 381, 0.3% (95% CI: [0, 1]) of Software Articles missing from Semantic Scholar and Crossref, respectively. To test whether the proportion of missing references differs between Software Articles and Direct Software Citations, we employed a chi-square test for each of the databases, Semantic Scholar χ2(1, N = 539) = 74.4, p < .001 and Crossref χ2(1, N = 536) = 13.2, p < .001, with effect sizes of V = 0.38 for Semantic Scholar and V = 0.17 for Crossref, estimated by Cramer’s V. We do not employ further tests regarding software manuals because fewer data are available and statements would be less reliable.
Furthermore, it was observed during annotation that Semantic Scholar sometimes adds wrong information without relation to the original reference information (see Listing 9), and that correct information is in some cases duplicated in a wrong location (e.g., the software name is represented as both title and publication venue). In total, wrong information is added to the reference representations of 19 Direct Citations, 15.7% (95% CI: [9.2, 22.2]), and 25 Manuals, 62.5% (95% CI: [47.55, 77.5]), while duplicate information is added in 26, 21.5% of Direct Citations and two, 5% of Manuals, with three cases overlapping. Neither problem was observed for Crossref.
5.4.2. Presentation of results
In the following, the individual metadata fields are illustrated through adapted alluvial plots, which are introduced here. An example of the adapted alluvial plot is given in Figure 7. All annotated samples are individually listed in the plot from top to bottom, while their order can change from left to right. The flow of a specific sample is indicated by the color originating from the middle column, named JATS. If multiple samples have the same information flow, their lines are summarized. The middle JATS column shows the information available from the publisher, and indicates whether the information is available and whether it is correctly structured. The columns to the left and right of the middle show the same information for Crossref (CRO) and Semantic Scholar (SEM), respectively. The outermost columns, CRO_ERR and SEM_ERR, show whether the represented information is correct or whether an error is present, for Crossref and Semantic Scholar, respectively. Because we observed that some references are entirely missing in Crossref and Semantic Scholar, they are shown with the special label “missing” to indicate that no information is available in these cases. This information-flow illustration allows a direct comparison of how the different sources structure the metadata and whether errors are introduced. In particular, the difference between the structure provided by the publisher and the corresponding representation by the databases can be observed.
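The adapted alluvial plots were created specifically for this study; as a rough approximation of the same idea, a three-column flow diagram can be drawn with plotly’s generic Sankey chart, as in the sketch below. The flow counts are invented for illustration, and the plotting library is an assumption, not the authors’ actual implementation.

```python
import plotly.graph_objects as go

# Columns from left to right: Crossref (CRO), publisher JATS, Semantic Scholar (SEM).
labels = [
    "CRO: structured", "CRO: unstructured", "CRO: missing",   # indices 0-2
    "JATS: structured", "JATS: unstructured",                  # indices 3-4
    "SEM: structured", "SEM: unstructured", "SEM: missing",    # indices 5-7
]

# Links are drawn CRO -> JATS and JATS -> SEM so that JATS ends up in the middle column.
source = [0, 1, 2, 3, 3, 4, 4, 4]
target = [3, 4, 4, 5, 6, 5, 6, 7]
value  = [18, 105, 20, 12, 6, 20, 80, 25]  # invented flow counts

fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=15, thickness=12),
    link=dict(source=source, target=target, value=value),
))
fig.update_layout(title_text="Illustrative flow of metadata representation across sources")
fig.show()
```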
5.4.3. Software name
The results for database accuracy of software names are given in Figure 8, and a further summary of the results is provided in Table 1. The software name is commonly included by publishers in both Direct Citations (94.3%) and Manual citations (91.5%), with only some of this information represented in a structured manner, in 12.7% of Direct Citations and 29.8% of Manuals. Crossref and Semantic Scholar only lose information on software names in rare cases, with 2% and 3.3%, respectively, for Direct Citations and 9.8% and 0% for Manuals. In turn, information is added by Semantic Scholar in 0.8% of Direct Citations. Semantic Scholar manages to increase the ratio of structured information, for both Direct Citations (24%) and Manuals (60%), with structured samples outweighing unstructured samples for Manuals, while Crossref directly reflects publisher structure, when information is not lost8. Notably, Semantic Scholar does not retain structure for all Direct references, but instead loses structure for 5.8%, and adds structure for 14.9% of references. Regarding Manuals, Semantic Scholar does not lose structure, but adds it in 30% of cases. All information on software names contained in Crossref is correct, while Semantic Scholar introduces a small number of errors in both Direct Citations (3.5%) and Manuals (5.4%)9. Regarding Manuals, all errors are due to misrepresentation of software names as other information, while for Direct Citations, 75% of errors are due to misrepresentation and 25% due to wrong information.
Table 1. Summary of metadata representation by citation type and database. Citation types: D = Direct Citation, M = Manual. Databases: JATS = publisher metadata, CRO = Crossref, SEM = Semantic Scholar. n = number of covered references. Structure: NA = not available, US = unstructured, S = structured. Correctness: C = correct, E = erroneous.

| Metadata | Citation | Database | n | NA | (%) | US | (%) | S | (%) | C | (%) | E | (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Name | D | JATS | 158 | 9 | 5.7 | 129 | 81.6 | 20 | 12.7 | – | – | – | – |
| Name | D | CRO | 147 | 11 | 7.5 | 120 | 81.6 | 16 | 10.9 | 136 | 92.5 | 0 | 0 |
| Name | D | SEM | 121 | 6 | 5.0 | 86 | 71.1 | 29 | 24.0 | 111 | 91.7 | 4 | 3.3 |
| Name | M | JATS | 47 | 4 | 8.5 | 29 | 61.7 | 14 | 29.8 | – | – | – | – |
| Name | M | CRO | 41 | 7 | 17.1 | 24 | 58.5 | 10 | 24.4 | 34 | 82.9 | 0 | 0 |
| Name | M | SEM | 40 | 3 | 7.5 | 13 | 32.5 | 24 | 60.0 | 35 | 87.5 | 2 | 5.0 |
| Identifier | D | JATS | 158 | 86 | 54.4 | 0 | 0 | 72 | 45.6 | – | – | – | – |
| Identifier | D | CRO | 147 | 90 | 61.2 | 6 | 4.1 | 51 | 34.7 | 57 | 38.8 | 0 | 0 |
| Identifier | D | SEM | 121 | 103 | 85.1 | 13 | 10.7 | 5 | 4.1 | 15 | 12.4 | 3 | 2.5 |
| Identifier | M | JATS | 47 | 28 | 59.6 | 0 | 0 | 19 | 40.4 | – | – | – | – |
| Identifier | M | CRO | 41 | 29 | 70.7 | 1 | 2.4 | 11 | 26.8 | 12 | 29.3 | 0 | 0 |
| Identifier | M | SEM | 40 | 40 | 100.0 | – | – | – | – | – | – | – | – |
| Creator | D | JATS | 158 | 18 | 11.4 | 88 | 55.7 | 52 | 32.9 | – | – | – | – |
| Creator | D | CRO | 147 | 36 | 24.5 | 87 | 59.2 | 24 | 16.3 | 94 | 63.9 | 17 | 11.6 |
| Creator | D | SEM | 121 | 41 | 33.9 | 39 | 32.2 | 41 | 33.9 | 55 | 45.5 | 25 | 20.7 |
| Creator | M | JATS | 47 | 1 | 2.1 | 19 | 40.4 | 27 | 57.4 | – | – | – | – |
| Creator | M | CRO | 41 | 16 | 39.0 | 16 | 39.0 | 9 | 22.0 | 17 | 41.5 | 8 | 19.5 |
| Creator | M | SEM | 40 | 6 | 15.0 | 7 | 17.5 | 27 | 67.5 | 29 | 72.5 | 5 | 12.5 |
| Version | D | JATS | 158 | 60 | 38.0 | 92 | 58.2 | 6 | 3.8 | – | – | – | – |
| Version | D | CRO | 147 | 64 | 43.5 | 80 | 54.4 | 3 | 2.0 | 81 | 55.1 | 2 | 1.4 |
| Version | D | SEM | 121 | 53 | 43.8 | 67 | 55.4 | 1 | 0.8 | 51 | 42.1 | 17 | 14.0 |
| Version | M | JATS | 47 | 36 | 76.6 | 10 | 21.3 | 1 | 2.1 | – | – | – | – |
| Version | M | CRO | 41 | 32 | 78.0 | 9 | 22.0 | 0 | 0 | 9 | 22.0 | 0 | 0 |
| Version | M | SEM | 40 | 31 | 77.5 | 8 | 20.0 | 1 | 2.5 | 6 | 15.0 | 3 | 7.5 |
| Date | D | JATS | 158 | 20 | 12.7 | 89 | 56.3 | 49 | 31.0 | – | – | – | – |
| Date | D | CRO | 147 | 19 | 12.9 | 80 | 54.4 | 48 | 32.7 | 128 | 87.1 | 0 | 0 |
| Date | D | SEM | 121 | 18 | 14.9 | 10 | 8.3 | 93 | 76.9 | 77 | 63.6 | 26 | 21.5 |
| Date | M | JATS | 47 | 2 | 4.3 | 17 | 36.2 | 28 | 59.6 | – | – | – | – |
| Date | M | CRO | 41 | 2 | 4.9 | 17 | 41.5 | 22 | 53.7 | 37 | 90.2 | 2 | 4.9 |
| Date | M | SEM | 40 | 2 | 5.0 | 0 | 0 | 38 | 95.0 | 16 | 40.0 | 22 | 55.0 |
5.4.4. Creator
The results for database accuracy of software creators are given in Figure 9, and a further summary of the results is provided in Table 1. The software creator is commonly included by publishers in both Direct Citations (88.6%) and Manual citations (97.9%). It is structured in a majority of Manuals (57.4%) but less often in Direct Citations (32.9%). Crossref and Semantic Scholar both lose information on software creators in a notable number of cases for Direct Citations (15% and 24.8%, respectively) and Manuals (36.6% and 15%). Semantic Scholar manages to increase the ratio of structured information slightly for Direct Citations, to 33.9%, and strongly for Manuals, to 67.5%, with structured samples clearly outweighing unstructured samples for Manuals. As for software names, Crossref mostly reflects the publisher's structure for creators when information is not lost. Again, Semantic Scholar does not retain structure for all Direct Citation references, but instead loses structure for 15.7% and adds structure for 21.5% of references. Regarding Manuals, Semantic Scholar also loses structure in 10% but adds it in 32.5% of references. Semantic Scholar introduces a notable number of errors in both Direct Citations (31.2%) and Manuals (14.7%). For Direct Citations they are distributed between wrong information (44%), incomplete entries (32%), and misrepresentation (28%)10, and for Manuals between misrepresentation (80%) and incomplete entries (20%). Crossref also introduces a notable number of errors in both Direct Citation (15.3%) and Manual (32%) references. In Crossref, almost all errors (Direct Citations 94.1%, Manuals 100%) are due to incomplete entries because Crossref only includes the first author when representing article references. For articles covered in Crossref, the full author information can then be gathered from the article entry corresponding to the reference, but for Direct Software Citations and Software Manuals this can result in a loss of information due to missing persistent identifiers.
5.4.5. Identifier
Information on the publication venue of a software is analyzed as a combination of ID, Archive Link, and URL, where the most relevant available information is chosen in the given order11. The results for database accuracy of software identifiers are given in Figure 10, and a further summary of the results is provided in Table 1. A software identifier is included by the publisher in almost half of the references for both Direct Citations (45.6%) and Manual citations (40.4%), always in a structured manner. Semantic Scholar loses information on identifiers in a high number of cases for Direct Citations (26.4%) and always loses it for Manuals. Crossref loses information in fewer cases, with 6.8% for Direct Citations and 14.6% for Manuals. Further, Semantic Scholar loses structure for 10.7% of Direct Citations, while Crossref loses structure for 4.1% of Direct Citations and 2.4% of Manuals. Errors are only present in rare cases, concerning Semantic Scholar and Direct Citations, affecting 16.7% of covered references. The errors are due to misrepresentation in 33.3% of cases and wrong information in 66.7%.
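As a concrete illustration of this prioritization, the following R sketch (our own simplified helper, not part of the published analysis scripts) selects the first available value in a fixed priority order; the same logic applies to the release-date prioritization described in Section 5.4.7.

```r
# Pick the first non-missing value in priority order (illustration only).
pick_first_available <- function(...) {
  values <- c(...)
  available <- values[!is.na(values) & values != ""]
  if (length(available) > 0) available[1] else NA_character_
}

# Identifier: ID > Archive Link > URL
pick_first_available(NA, NA, "https://example.org/tool")   # falls back to the URL
# Release date: release date > date of access > publication year
pick_first_available(NA, "2020-06-15", "2020")             # uses the access date
```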
5.4.6. Version
The results for database accuracy of software versions are given in Figure 11, and a further summary of the results is provided in Table 1. The software version is commonly included by publishers in Direct Citations (62%), but less frequently in Manual citations (23.4%). For both citation types, versions are rarely represented in a structured manner, with 3.8% in Direct Citations and 2.1% in Manuals. Crossref rarely loses version information, in 6.8% of Direct Citations and 2.4% of Manuals. Semantic Scholar, on the other hand, loses version information in a considerable number of Direct Citations (14%), but never in Manuals. Crossref does not lose structure information when samples are represented, but adds structure for 0.7% of Direct Citations. Semantic Scholar loses structure for 1.7% of Direct Citations and 2.5% of Manuals. No errors are present in Crossref for Manuals and only a few for Direct Citations (2.4%), all due to misrepresentation of the version as other information. For Semantic Scholar, a notable number of errors are present in Direct Citations (25%) and Manuals (33.3%). The errors in Semantic Scholar for Direct Citations are mainly due to wrong information (76.5%), followed by incomplete information (17.6%) and misrepresentation (5.9%), while for Manuals they are due to incomplete information (100%) and misrepresentation (66.7%).
5.4.7. Release dates
As for the identifier, information on release dates is summarized for the analyses, where the release date, date of access, and publication year are prioritized in the given order to always select the most complete information. The results for database accuracy of release date information are given in Figure 12, and a further summary of the results is provided in Table 1. The publication date is commonly included by publishers in both Direct Citations (87.3%) and Manual citations (95.7%). It is often represented in a structured manner for Manuals (59.6%) but less frequently for Direct Citations (31%). Crossref and Semantic Scholar lose information on publication dates in only a few references, with 1.4% and 7.4% for Direct Citations and 0% and 5% for Manuals. In turn, information is added by Semantic Scholar in 0.8% of Direct Citations. Semantic Scholar manages to strongly increase the ratio of structured information for both Direct Citations (76.9%) and Manual citations (95%), while Crossref mainly reflects the structure of the publisher. Notably, Semantic Scholar retains structure in all cases except for 2.5% of Direct Citations, and adds structure for 45.5% of Direct Citations and 37.5% of Manuals, while Crossref adds structure in 2.7% of Direct Citations. High numbers of errors are present in Semantic Scholar for Direct Citations (25.2%) and Manuals (57.9%). For Crossref, few errors are present in Manuals (5.1%), all due to incomplete information. Errors in Semantic Scholar for Direct Citations are distributed between wrong information (88.5%), incomplete information (7.7%), and misrepresentation (3.8%), and for Manuals between wrong information (90.9%) and incomplete information (9.1%).
5.4.8. Description and type of citation
Description and type of citation are considered less crucial to software citation than the information discussed so far. Therefore, the results are only briefly discussed here, with the corresponding alluvial plots available in Figures S10 and S11 in the Supplementary material. A software description is included by publishers in about half of Direct Citations (49.4%) and Manual citations (57.4%), but it is represented in a structured manner for only 8.2% of Direct Citations and 31.9% of Manuals. Crossref and Semantic Scholar lose information on software descriptions only in rare cases for Manuals, with 2.4% and 5%, respectively. Semantic Scholar manages to increase the ratio of structured information for both Direct Citations (18.2%) and Manual citations (55%), while Crossref directly reflects the publisher's structure when information is not lost. Errors are rare for descriptions and only appear in 3% of Direct Citations in Semantic Scholar and 1.3% in Crossref.

The type of citation is commonly not included, for both Direct Citations (26.6%) and Manual citations (34%), and it is never represented in a structured manner. Information is in some cases lost for Direct Citations (Semantic Scholar 9.1%, Crossref 3.4%) and Manuals (Semantic Scholar 2.5%, Crossref 7.3%). Both databases mainly mirror the publisher's structure, with Semantic Scholar adding some structure to Direct Citations (0.8%). Errors only appear for Direct Citations, with 15.8% in Semantic Scholar and 2.9% in Crossref.
6. LIMITATIONS
In general, manual annotation was required to assess the quality of formal software citation, which entails high manual effort. To make this annotation feasible, SoMeSci was chosen as a basis because it allows existing annotations to be extended, greatly reducing the required effort. Overall, we consider the sample size sufficient to make reliable statements about the quality of formal software citation and its representation in bibliographic databases. We assess that further large-scale analyses examining the completeness of Direct Citations would be possible; however, we consider an automatic evaluation of database accuracy extremely challenging given the problems we faced during annotation.
Overall, we consider the data selection of SoMeSci suitable for the given analysis because it includes articles using software and articles creating software, covering formal citations from both groups. But there are also limitations resulting from this data selection. A main drawback is that a large number of articles are published by PLoS, which leads to a publisher bias, specifically affecting the analyses of database accuracy. We argue that PLoS is a representative choice regarding the handling of software citation because PLoS has a high interest in software. PLOS ONE, by a considerable margin the largest journal published by PLoS, allows software submissions and publishes corresponding software articles, but also encourages proper software mentions, with the journal's policy stating that authors should provide all software used for statistical analyses with versions and related references12. Moreover, an analysis excluding PLoS articles showed the same general trends with the same major issues from both the publisher side and the bibliographic databases. While these findings are based on a small sample size, they highlight that the identified issues of formal software citation representation are not specific to one publisher but exist broadly across the current bibliographic infrastructure. However, it should be noted that specific publishers might already handle software citation in a suitable manner.
Furthermore, all articles are available from the PubMed Central Open Access set and therefore carry a selection bias towards the life sciences and open access publications. This bias is likely to influence the type of software used within articles and can also influence how the corresponding software is published and cited (e.g., differing ratios of commercial to freely available software and of biomedical tools compared to other disciplines). It could, for instance, be argued that the use of archive links is more common in computer science or other software-heavy domains where the reuse and adaptation of source code is more common.
The articles selected for the analyses span publication dates from 2007 to 2020, so statements can only be made about software citation within this time span, and the amount of available data is not sufficient to analyze trends in formal software citation over this period. However, a previous large-scale analysis including data up to 2021 has shown that the overall number of software citations has plateaued since 2009 (Schindler et al., 2022). Based on these findings we assume that there were no major changes in software citation practices in the examined time frame and that our results remain valid. A benefit of the given article selection is that bibliographic data providers had sufficient time to react to and represent the formal software citations given within the articles, making the examined data set well suited to assessing the representation of formal software citation in bibliographic databases.
7. DISCUSSION AND CONCLUSION
We analyzed the data quality of software citations along the three quality dimensions of structure, completeness, and accuracy and found significant issues across all stages of the data life cycle. Starting with the references as given by the authors, the analysis of software citation types showed a strong trend towards citation of Software Articles, which is suitable for identifying software and its developers but does not allow identification of the code base. A reason why this practice is so widely adopted may be that authors are familiar with article citation. This means that the majority of formal software references do not enable identification of the code base used. Furthermore, we showed that authors citing software articles also provide significantly less information in the article full text to enable identification of the software code base. The second largest group is Direct Citations, which, when properly executed, are the most complete way to cite software and are discussed in detail below; we also identified a small trend towards the citation of software manuals. Compared to prior studies, we observe a higher number of Direct Citations but confirm the finding that Software Articles are the most cited resource for scientific software. This shows that better awareness on the author side is required, as the majority of software references used are unsuitable for representing software, a practice largely rooted in the outdated notion that articles are more valuable scientific contributions than software (Hafer & Kirkpatrick, 2009).
In practice, the type of citation is influenced by the citation recommendation made by the developers (Du et al., 2022), often placed on the software's download website or provided via the newly established citation file format13 (CFF). Authors are likely to follow this recommendation in order to provide the desired attribution to the software developers, and because the provided information can be readily used. However, these recommendations often point not to Direct Citations but to other citation forms, such as Manuals, and therefore omit essential information such as the version, which identifies the code base and is part of the research provenance. For the widely used statistical framework R, for instance, the recommended citation is a citation of the Manual, which does not include a version number. Further, authors who publish software articles have a vested interest in the article being cited and are likely to recommend citation of the Software Article, because otherwise they do not receive attribution and impact for the creation of the software. This means that action is also required from software developers to update citation recommendations so that they do not impose a conflict on authors.
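The R example can be inspected directly: the citation recommended by R itself is available through the built-in citation() function. The following sketch merely illustrates where the recommended Manual-style reference and the actual version information live; the printed output is paraphrased in the comments rather than reproduced verbatim.

```r
# The reference recommended by R is a Manual-style entry for the current release;
# it names the R Core Team and the release year but contains no version number.
print(citation(), style = "Bibtex")

# The version that actually identifies the code base has to be added by the author,
# for example from the running session:
R.version.string   # e.g. "R version 4.3.0 (2023-04-21)"
```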
The analysis of citation completeness showed that software is almost always identifiable from Direct Citations (88–99% of cases), but this practice is only used in about 23% of formal software references. However, unique identifiers and archive links are almost never utilized to identify software, showing that this is not yet common practice. Attribution of developers is also possible for the majority of references (88% of cases), while the code base is only identifiable in 66–73%. In general, the use of version numbers was found to be common practice, while release dates are almost never used, and then mostly in cases where the software developers themselves use release dates as version identifiers (e.g., “Matlab r2023a”). Overall, the completeness of Direct Citations is at a good level, and the situation for Manual citations is similar, with software identification and attribution given in most cases, while code base identification is at a notably lower level (see Figure 6). Citation completeness could be further improved with better awareness of the importance of code base identification for reproducibility and research provenance, which further supports the argument made above.
Information provided by publishers is mostly unstructured for Direct Citations, hindering systematic analyses of software usage in science, which rely on structured information. Creator and date information are structured to some extent, but still in less than 50% of cases, while the name is structured in even fewer cases and versions only in single instances. The only information that is consistently structured is identifiers. In general, information from Manuals is more often structured than information from Direct Citations (see Table 1), which reflects the similarity of Manuals to scholarly articles, in contrast to other research objects such as software and data.
Bibliographic databases take information about references either directly from the publishers or by analyzing the reference lists of scholarly articles. Crossref was found to mostly take over information from the publishers, retaining both structure and content. In particular, names and identifiers are almost identical in their representation, while some information on version, creator, and identifier is lost. With respect to completeness, Crossref systematically represents authors by the last name of the first author for all references, while the full information is maintained in the entries of the referenced elements themselves. To access that data, a persistent identifier (i.e., a DOI) can be used to link the reference of a citing article to the actual entry in Crossref. However, this raises a problem when such an identifier is not present, as is the case for most software citations. Moreover, Crossref omits a small number of direct software citations and manuals, compared to software articles. Overall, Crossref does not perform any special treatment of software citations, but is mostly successful in representing the information available from publishers.
Semantic Scholar omits a significant number of direct software references (22.9%). It employs an automatic approach to link similar references to the same element. While this increases the structure of the information to some extent, it introduces errors when it comes to direct software citations and software manuals. Wrong information (14%) as well as duplicated information (18%) is present in 30% of the references represented in Semantic Scholar, which likely results from erroneously linking different references to the same element (see Listing 9). Moreover, Semantic Scholar loses information within the software citations it retains; it drops, for instance, a high number of versions and the majority of URLs. It also introduces a high number of errors in versions, creators, and dates. Overall, software citation representation in Semantic Scholar is poor, apparently because the underlying implementation has not been adapted to handle Direct Software Citations. Specifically, the concepts of versions and URLs that are common in software citations are not represented in Semantic Scholar. However, it succeeds in improving some references, and it can be assumed that, with proper adaptation, it could be successful in representing and even adding structure to software citations when the original published information lacks it.
In general, both databases are currently unable to adequately represent software citations, as proper handling of Direct Citations does not seem to be implemented in either. Some fault also lies with publishers, who likewise provide no suitable format for direct software citations; instead, software citations are forced into the fields used for regular citations. Based on the current systems, systematic analyses of formal software citations in scientific articles are not possible. In general, our results show that both publishers and bibliographic databases need to update their infrastructure to create suitable and machine-readable software representations, as suggested by Stall et al. (2023). We therefore urge the providers of bibliographic data to update their implementations to take the intricacies of software citation into account, enabling and facilitating systematic representation and analysis of formal software citation in the future. We recommend including at least two different views of software: provide all information about the particular software as given by the authors to enable reproducibility, and link different versions of the same software to a common element so that its creators can be credited, as sketched below. A spot check of the Scopus database, generally regarded as a high-quality data source for bibliometric studies (Baas, Schotten et al., 2020), revealed similar issues, for instance unstructured data, no specialized treatment of different versions of the same software, and citations to the same software that are not linked and are consequently evaluated independently. The data published in the scope of this work can serve as a starting point for analyzing formal software citations and the requirements for representing them, and as initial training data for machine learning methods.
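To make the two recommended views more concrete, the following R sketch is purely our own hypothetical illustration, not an existing database schema or API: a version-independent software element for aggregating credit, linked to version-specific records that preserve the reference information as given by the authors.

```r
# Hypothetical two-view representation of a cited software (illustration only).

# View 1: version-independent element, used to aggregate credit across versions.
software_element <- list(
  element_id = "software/example-tool",        # assumed internal identifier
  name       = "ExampleTool",
  creators   = c("Doe, J.", "Roe, R.")
)

# View 2: version-specific record, preserving what the authors actually cited and
# linked to the element above, sufficient to identify the exact code base.
cited_version <- list(
  element_id       = software_element$element_id,  # link for impact aggregation
  version          = "2.1.0",
  release_date     = "2020-05-04",
  identifier       = NA,                            # DOI or archive link, if provided
  url              = "https://example.org/example-tool",
  reference_string = "Doe J, Roe R. ExampleTool version 2.1.0. 2020. https://example.org/example-tool"
)
```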
Authors looking to cite software in their articles face a conflict. Because bibliographic databases currently fail at representing software citations, it cannot be recommended to use only direct software citations in scientific publications. Instead, software usage should additionally be indicated by mentioning the software in the full-text document with all information required to identify it and citing corresponding software articles. This practice allows systematic analyses by employing methods such as the SoftwareKG information extraction pipeline (Schindler et al., 2022), and also gives direct credit to developers. There is still a strong argument for the use of formal software citation, as it clearly identifies software and provides credit without requiring elaborate machine learning methods to extract the knowledge. However, as long as providers of bibliographic data do not adequately represent direct software citations this knowledge will stay inaccessible. We hope this situation can quickly be resolved by updates to existing bibliographic databases. In this context, we advocate the use of formal software citations, as further adoption of this practice increases the need and urgency to address this problem.
7.1. Software
In the following, all software used during this investigation is listed, including software citations and software articles for all software for which they exist. We used both Python (Van Rossum & Drake, 2022) 3.8.16 and R (R Core Team, 2023) 4.3.0 for data processing. For Python we further used the package articlenizer R-14.06.2021 (Schindler, 2021). For R we used the packages tidyverse (Wickham, Averick et al., 2019, 2021) 2.0.0 and magrittr (Bache & Wickham, 2022) 2.0.3 for data processing, patchwork (Pedersen, 2022) 1.1.2, ggalluvial (Brunson & Quentin, 2023) 0.12.5, easyalluvial (Koneswarakantha, 2022) 0.3.1, and xtable (Dahl, Scott et al., 2019) 1.8-4 for output generation, and DescTools (Signorell, 2023) 0.99.48 and rcompanion (Mangiafico, 2023) 2.4.30 for statistical analysis. Further, we used RStudio (Posit Team, 2023) 2023.3.1.446 for development and Quarto (Allaire, Teague et al., 2023) 1.2.475 to generate a literate data analysis document.
AUTHOR CONTRIBUTIONS
David Schindler: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing—original draft, Writing—review & editing. Tazin Hossain: Conceptualization, Data curation, Investigation, Methodology, Resources, Software, Writing—original draft. Sascha Spors: Conceptualization, Funding acquisition, Project administration, Writing—original draft. Frank Krüger: Conceptualization, Data curation, Formal analysis, Funding acquisition, Methodology, Project administration, Supervision, Validation, Visualization, Writing—original draft, Writing—review & editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) SFB 1270/2: 299150580.
DATA AVAILABILITY
The script and data to replicate the analyses described in this work are available at Zenodo (Schindler & Krüger, 2024) and Github14, with the data being made available together with the original SoMeSci data (Schindler, Bensmann et al., 2021b) at Github15 and Zenodo16.
Notes
This term refers to the specific development state of software, typically indicated by a version.
Direct Citations and Websites are summarized under the assumption that Websites are incomplete Direct Citations.
This is when the SoMeSci data set was obtained.
Available at https://github.com/dave-s477/SoMeSci_Citation.
Note that the overall number of references differs between databases because they differ in the number of references ignored because entire articles are missing.
The results are always reported excluding entirely missing references (M).
The number of errors is always reported excluding not covered information (NA).
Errors are not exclusive; therefore the percentages do not need to sum to 1.
All metadata are semantically related and samples for ID and Archive are too rare to analyze individually.
https://journals.plos.org/plosone/s/submission-guidelines, accessed February 23, 2024.
REFERENCES
Author notes
Handling Editor: Rodrigo Costas