Abstract
One of the key goals of the FAIR guiding principles is defined by its final principle – to optimize data sets for reuse by both humans and machines. To do so, data providers need to implement and support consistent machine readable metadata to describe their data sets. This can seem like a daunting task for data providers, whether it is determining what level of detail should be provided in the provenance metadata or figuring out what common shared vocabularies should be used. Additionally, for existing data sets it is often unclear what steps should be taken to enable maximal, appropriate reuse. Data citation already plays an important role in making data findable and accessible, providing persistent and unique identifiers plus metadata on over 16 million data sets. In this paper, we discuss how data citation and its underlying infrastructures, in particular associated metadata, provide an important pathway for enabling FAIR data reuse.
1. INTRODUCTION
Data citation has been a core part of the infrastructure in the movement toward Open Science [1]. Support for data citation was incorporated in version 1.2 of the ANSI/NISO JATS XML schema required for deposition in repositories [2] at the initiative of an expert group convened by FORCE11①. Major publishers and data providers have supported initiatives such as the Joint Declaration of Data Citation Principles (Data Citation Synthesis Group 2014②) and are rolling out support for those principles in their submission, publishing, and data archiving systems [3,4]. Support for data citation through robust data set archival and identifier generation is a common feature of many research data repositories, whether domain specific repositories like ICPSR③ or more generic repositories like Figshare④, Dataverse [5] and Zenodo.⑤ DataCite alone now registers over 16 million unique identifiers (DOIs) for data sets and other non-traditional research outputs.⑥
It is not only that data citations are being created – they are being used [6,7], with projects underway to start measuring and exposing data reuse in the form of views, downloads, and citations of data sets [8]. Data citation has also enabled research data to begin to emerge as a first-class scholarly object, allowing the work involved to be recognized [9,10].
As others have noted (For Attribution – Developing Data Attribution and Citation Practices and Standards 2012 [11]), one of the advantages of data citation is that it builds on existing scholarly practice. In Figure 1, we see an example of a data citation as it would appear in a published work. Data citations share the affordances (i.e., features) of any other form of citation: they are copy-and-pastable; they provide credit through clear delineation of authorship; they give simple situatedness through a notion of a repository as a venue; they provide unique identification and time-stamping; and they are included in the list of references. The data set that is being referred to is thus elevated to the same level as the other scholarly works (e.g., articles, books) that are being cited.
Data citations fit, with some modifications, into existing scholarly workflows, whether it is drafting an article or building a curated list of material in a reference manager. The effort for the researcher to cite data is in some sense the same as that of citing a research article: figuring out the appropriate citation to use and including it in the reference list. Guidelines for data citation and their recommended format have recently been outlined by a group of publishers [3]. This ensures data citations are consistent in terms of both human and machine readability, and compatible with existing publisher practices. We still do not, for the most part, automatically generate references to data. Instead, they are included manually by researchers in the same fashion as references to research articles.
Here, we want to emphasize the sometimes invisible underlying capabilities available in the recommended approaches for data citation [12]. For example, a persistent identifier is required in the format of the citation. Thus, the citation is not just the string of text in an article's reference section but also enables the associated technical infrastructure to support referring to data in a unique and persistent manner. Like the data citation string itself, data citation infrastructure often also builds upon existing scholarly infrastructure. However, it expands this infrastructure to enable new functionality that provides a strong foundation for not just referring to data, but injecting them into the scholarly ecosystem and making them more reusable.
The aim of this paper is to introduce an exemplar data citation infrastructure as implemented by DataCite, a global non-profit organization that provides persistent identifiers (DOIs [13]) for research data and other research outputs, and to show how its capabilities may be used to enhance the reusability of data. We note that data citation infrastructures such as identifiers.org and ARKs also support many of the capabilities we will discuss [14]. The important point is to illustrate what these capabilities are. We hope that this can serve as a guide for data providers to use the capabilities of these infrastructures more completely. Just as data citation builds on existing scholarly practice, so too can the move toward producing more Findable, Accessible, Interoperable and Reusable data [15] build on the success of data citation.
The rest of this article is organized in four sections. First, we begin by introducing the data citation infrastructure, followed by a discussion of the role of metadata. We then address the use of data citation infrastructure for both expressing data provenance and contextualizing data for reuse. Finally, we touch on the need for the grounding of data citation in the scientific social ecosystem through the scholarly literature.
2. UNDERSTANDING DATA CITATION INFRASTRUCTURE
As Figure 1 shows, after the authors, title, and archival repository of the data set, the citation ends with a persistent identifier. A persistent identifier⑦ is a long-lasting reference to an object and usually directs to a landing page with information about the underlying object. The main idea is that over time the location referred to by the identifier will either still exist, or will need to redirect to a new location for the object, or will state that the object is no longer available. Often DOIs, as in the case of DataCite citations, are used for these persistent identifiers. Other identifiers like ARKs or CURIEs can also be used [14]. In each case, there is an intermediary who is responsible for creating (i.e., registering) these identifiers. At an institutional level, intermediaries such as DataCite (in the case of DataCite DOIs) or California Digital Library (in the case of ARKs) provide social guarantees about the longevity of these identifiers, and work with institutions to ensure that these are redirected as the underlying institutional infrastructures change. In our example case (Figure 1), the DOI redirects the user to a Landing Page URI at the Zenodo data repository. If for some reason Zenodo changes its URL scheme or ceases to exist, the DOI can be redirected to another location, thus maintaining access to the data.
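The indirection described above can be illustrated with a minimal sketch. It assumes only that DOIs are resolved through the public https://doi.org/ resolver; the function names and the example DOI are hypothetical and chosen purely for illustration.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Prefixes under which a DOI commonly appears in citations.
_KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(identifier: str) -> str:
    """Reduce a cited identifier to the bare DOI (prefix/suffix)."""
    value = identifier.strip()
    for prefix in _KNOWN_PREFIXES:
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10." and has a suffix.
    if not value.startswith("10.") or "/" not in value:
        raise ValueError(f"not a DOI: {identifier!r}")
    return value


def resolver_url(identifier: str) -> str:
    """Build the resolver URL; the resolver, not the citation, knows the
    current landing page, so the citation stays valid if the data move."""
    return DOI_RESOLVER + quote(normalize_doi(identifier), safe="/")


# Hypothetical DOI used only as an example.
print(resolver_url("doi:10.5281/zenodo.0000000"))
```

The citation only ever carries the identifier; where it currently points is a matter for the intermediary and the repository.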
This intermediation is a critical component of such systems, not just from a social or longevity perspective, but also from a technical perspective. When organizations register persistent identifiers for their data with DataCite, they also deposit associated metadata which is then hosted by the intermediary. This is done following a metadata schema⑧ so that metadata terms are clearly and interoperably defined, and consistently provided. For example, the intermediary can guarantee that information about the author of the metadata is always accessible using the same schema property. Thus, while data citation is often seen as largely addressing the “Findable” and “Accessible” components of the FAIR principles, it is worth emphasizing its role in “Interoperability” and “Reusability”.
3. METADATA AND DATA CITATION INFRASTRUCTURE
Data citation intermediaries provide a convenient home for the addition of metadata that should be available for any digital object. What does this mean in practice? Figure 2 shows example metadata retrieved for our example citation. Figure 2a shows the redirection, the title and author names. Figure 2b, however, shows the beginnings of the power of data citation metadata: additional metadata beyond that in the citation itself, which provides information useful in determining the reusability of the data. In Figure 2b, we see that the data set has a well-defined Creative Commons license, is open access, was funded by the European Commission, and has a prior version.
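To make this concrete, the following sketch extracts the reuse-relevant facts mentioned above (license, funder, prior version) from a metadata record. The record is hypothetical; its property names are modeled on the DataCite JSON representation as we understand it and should be checked against the schema documentation. The function name reuse_summary is our own.

```python
from typing import Any

# Hypothetical record; property names are an assumption modeled on DataCite JSON.
record: dict[str, Any] = {
    "doi": "10.1234/example.v2",
    "titles": [{"title": "Example data set"}],
    "creators": [{"name": "Doe, Jane"}],
    "rightsList": [
        {
            "rights": "Creative Commons Attribution 4.0 International",
            "rightsUri": "https://creativecommons.org/licenses/by/4.0/",
        }
    ],
    "fundingReferences": [{"funderName": "European Commission"}],
    "relatedIdentifiers": [
        {
            "relationType": "IsNewVersionOf",
            "relatedIdentifier": "10.1234/example.v1",
            "relatedIdentifierType": "DOI",
        }
    ],
}


def reuse_summary(rec: dict[str, Any]) -> dict[str, list[str]]:
    """Collect the fields a prospective reuser would look at first."""
    return {
        # Prefer the license URI; fall back to the free-text rights statement.
        "licenses": [
            r.get("rightsUri") or r.get("rights", "")
            for r in rec.get("rightsList", [])
        ],
        "funders": [
            f["funderName"]
            for f in rec.get("fundingReferences", [])
            if "funderName" in f
        ],
        "previous_versions": [
            r["relatedIdentifier"]
            for r in rec.get("relatedIdentifiers", [])
            if r.get("relationType") == "IsNewVersionOf"
        ],
    }


print(reuse_summary(record))
```

Because every record follows the same schema, the same few lines apply to any data set registered with the intermediary, regardless of the hosting repository.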
Therefore, intermediation provides two benefits. First, it offers a convenient location to provide additional metadata. Second, it drives the standardization of core metadata on data sets and other digital research objects. For example, DataCite was able to easily provide schema.org formatted metadata for the data sets registered with it, improving availability of data set information for search engines to index [16]. The DataCite Metadata Schema (DataCite Metadata Working Group 2019⑨) provides a wide variety of properties ranging from denoting the type of contribution someone made (e.g., Project Leader, Editor) to the specific geolocation to which data are related. This is just a tiny fraction of the available descriptor space. Thus, by hosting the metadata necessary for FAIR data, data citation intermediaries provide permanent and easy access to consistent metadata. In the next section, we discuss a set of metadata properties already available for use that would greatly enhance data reuse.
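One way such standardized metadata can be requested is through HTTP content negotiation on the DOI itself. The sketch below only constructs the request and does not send it. The two media types are assumptions based on our reading of DataCite's content negotiation service and should be verified against its documentation; the function name and the example DOI are hypothetical.

```python
import urllib.request

# Assumed media types; verify against the content negotiation documentation.
MEDIA_TYPES = {
    "datacite_json": "application/vnd.datacite.datacite+json",
    "schema_org": "application/vnd.schemaorg.ld+json",
}


def metadata_request(doi: str, fmt: str = "schema_org") -> urllib.request.Request:
    """Prepare a GET request asking the resolver for metadata, not the landing page."""
    if fmt not in MEDIA_TYPES:
        raise ValueError(f"unknown format: {fmt!r}")
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": MEDIA_TYPES[fmt]},
    )


# Hypothetical DOI used only as an example.
request = metadata_request("10.5281/zenodo.0000000")
print(request.full_url, request.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would perform the lookup; it is left out here so that the sketch runs without network access.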
4. PLACING DATA IN THE GLOBAL PROVENANCE GRAPH
A critical part of understanding whether data can be reused is to understand how they fit in a larger context. This includes understanding how data were produced (their provenance [17,18]), unambiguous description of the concepts under consideration, and relationships to other sources. The DataCite metadata schema has introduced the notion of relation types. These 32 types allow the expression of many kinds of relations, including, for example, that a data set is derived from another data set (IsDerivedFrom); that a data set is a new version of a data set (IsNewVersionOf); that a data set is documented by a particular piece of documentation (IsDocumentedBy); or that a data set is created using a piece of software (IsCompiledBy). By asserting these links, a data set provider can express the provenance of the data [19].
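As a minimal sketch of how such links compose into a provenance graph, the following code follows relation types from one data set to everything it depends on. The identifiers and the in-memory records table are hypothetical; in practice the relations would come from each record's related identifiers. The relation names use the capitalization of the DataCite controlled list.

```python
# Relation types treated as provenance links in this sketch.
PROVENANCE_RELATIONS = {
    "IsDerivedFrom",
    "IsNewVersionOf",
    "IsCompiledBy",
    "IsDocumentedBy",
}

# Hypothetical records: identifier -> list of (relationType, relatedIdentifier).
records = {
    "10.1234/c": [
        ("IsDerivedFrom", "10.1234/b"),
        ("IsCompiledBy", "10.1234/software"),
    ],
    "10.1234/b": [("IsNewVersionOf", "10.1234/a")],
    "10.1234/a": [],
}


def provenance_edges(start, table):
    """Return all (subject, relation, object) triples reachable from start."""
    seen, stack, edges = set(), [start], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the asserted links
        seen.add(current)
        for relation, target in table.get(current, []):
            if relation in PROVENANCE_RELATIONS:
                edges.append((current, relation, target))
                stack.append(target)
    return edges


for edge in provenance_edges("10.1234/c", records):
    print(edge)
```

Identifiers without a local record, such as the software in this example, still appear as endpoints of an edge, which reflects the fact that links may point outside any single repository.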
Importantly, it is not just that the data citation allows for the expression of links between data but also links into the existing literature citation graph. Thus, data can be contextualized not only by their relation to data and software but by their relation to the scholarly discourse.
Beyond provenance and the literature context, it is also possible to express the actual entities a data item is about through the use of specific subject identifiers. Here, the emergence of Wikidata is of interest in the stabilization and coalescence of conceptual terminology because of the diversity of scholarly communities with established languages and vocabularies. Wikidata provides a global space for referring to common entities and concepts in a language independent fashion [20]. It provides large numbers of definitions and multilingual links and allows data to be contextualized within a large global knowledge base of facts. Wikidata is increasingly being used to provide a common linking point across research databases [21,22]. Thus, by providing links to this common space through the subject identifier metadata property available, the specific subjects can be defined in this common space. This, for example, could enable one to find all the data sets about a particular gene or protein.
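The idea can be sketched as follows, assuming that a subject entry carries a value URI pointing at a Wikidata entity. The property names are modeled on the DataCite JSON representation and should be checked against the schema documentation. The helper functions and records are hypothetical, and the Wikidata identifiers Q7187 and Q8054 are used as illustrative examples for "gene" and "protein" that should be verified before reuse.

```python
import re

WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"


def wikidata_subject(label: str, qid: str) -> dict:
    """Build a subject entry that points at a Wikidata entity."""
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return {
        "subject": label,
        "subjectScheme": "Wikidata",
        "schemeUri": "https://www.wikidata.org/",
        "valueUri": WIKIDATA_ENTITY + qid,
    }


def datasets_about(qid: str, recs: list) -> list:
    """Return the DOIs of all records whose subjects include the entity."""
    target = WIKIDATA_ENTITY + qid
    return [
        r["doi"]
        for r in recs
        if any(s.get("valueUri") == target for s in r.get("subjects", []))
    ]


# Hypothetical records with illustrative identifiers.
catalogue = [
    {"doi": "10.1234/genes", "subjects": [wikidata_subject("gene", "Q7187")]},
    {"doi": "10.1234/proteins", "subjects": [wikidata_subject("protein", "Q8054")]},
    {"doi": "10.1234/untagged"},
]

print(datasets_about("Q7187", catalogue))
```

Matching on the entity URI rather than on the label is what makes the lookup independent of language and of community-specific vocabulary.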
In all these cases, the links are expressed through globally unique persistent identifiers. The provenance graph and other context information thus become part of the global scientific record [23].
While context is critical for reuse, even more interesting is to be able to potentially regenerate or build upon a data set by having access to its entire experimental context in the form of a Research Object [24], which we now discuss.
5. BUILDING RESEARCH OBJECTS USING DATA CITATION RECORDS
A Research Object⑪ is a bundle of all the artefacts associated with an investigation or piece of research into one whole or package that can itself be cited [24]. This may be done by packing a set of elements into a container (e.g., a zip file or a BagIt file [25]) with a manifest file that describes the contents. The manifest metadata describes the relationship between elements within the bundle. Data citation metadata provides another possible route to bundling these elements together and exposing the elements of a research object together in an accessible fashion.
The following is a sketch of how this could be done. First, one would create a research object or research object stub. Using DataCite Metadata terminology this would be a Collection. By using the aforementioned relationship types, one can express the relationship to the software, workflows, data, documentation and papers that are all members of the collection. Importantly, these relationships are already expressible using DataCite metadata. The collection then defines the distinct research object package while not necessarily needing to encapsulate all parts and instead holding references to the constituent parts. Additionally, because data citation supports versioning natively, one can express accurately the notion of the true contents of a research object.
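In code, such a research object stub might look like the following minimal sketch. It assumes that members are linked with the HasPart relation type and that the record's general resource type is Collection. The property names are modeled on the DataCite JSON representation, the per-member resource type is an optional attribute we assume is available, and all identifiers and the function name are hypothetical.

```python
def research_object_stub(doi: str, title: str, members: dict) -> dict:
    """Describe a research object as a Collection that references its parts.

    members maps each member identifier to its general resource type,
    e.g. "Dataset", "Software", "Workflow" or "Text".
    """
    return {
        "doi": doi,
        "titles": [{"title": title}],
        "types": {"resourceTypeGeneral": "Collection"},
        # The collection holds references to its parts, not the parts themselves.
        "relatedIdentifiers": [
            {
                "relationType": "HasPart",
                "relatedIdentifier": member,
                "relatedIdentifierType": "DOI",
                "resourceTypeGeneral": kind,
            }
            for member, kind in members.items()
        ],
    }


# Hypothetical identifiers, including a versioned data set.
stub = research_object_stub(
    "10.1234/ro.1",
    "Example research object",
    {
        "10.1234/data.v2": "Dataset",
        "10.1234/software": "Software",
        "10.1234/workflow": "Workflow",
        "10.1234/paper": "Text",
    },
)

for part in stub["relatedIdentifiers"]:
    print(part["resourceTypeGeneral"], part["relatedIdentifier"])
```

Because each member is referenced by its own versioned identifier, the stub pins down exactly which version of each constituent belongs to the research object.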
Such packaging is crucial for reuse, as data do not stand alone: they are by their nature contextualized by both the computational environment in which they can be generated and used, and their broader social embedding [26].
6. REUSE AND THE IMPORTANCE OF THE HUMAN
As discussed above, data citation provides a critical component often lost in the discourse around machine reusability, which is the link to the human. While the goal to promote the ability for machines to understand data is an exciting one, it is poor scholarly practice to reuse data without understanding their original context and conditions of creation [27,28]. In addition, providing all the necessary elements to generate completely machine reusable data may be too resource intensive, outside of the most high-value data [29]. In the end it is the responsibility of the researchers to understand the nature of data, and the appropriate conditions for data reuse, employing the associated metadata and literature artefacts we provide them. By linking data to metadata and to the literature or other human readable documentation, data citation provides a critical outlet to facilitate reusability.
7. CONCLUSION: DATA CITATION INFRASTRUCTURE AS SCAFFOLDING
In this short article, our aim was to point out the powerful features that are already available in data citation infrastructures and that make them amenable to supporting reusable data. However, it is just that: support. Data citation infrastructure does not provide the metadata; it provides a uniform place to deposit and access it. Given this, we think there are some steps that tool builders and data repositories should implement to facilitate reuse of data based on this infrastructure:
Make it easy to express the relationships supported by the data citation infrastructure and publish those in the associated metadata repository.
Promote the use of subject identifiers, in particular from Wikidata.
Develop tools that contextualize data within the larger scholarly ecosystem.
Ensure both data creators and users are aware of the possibilities to add and use metadata.
We encourage data providers, data hosts, and the entire FAIR data community to consider how we can use this already available scaffolding to expose the elements needed to make reusable data.
AUTHOR CONTRIBUTIONS
P. Groth conceptualized and wrote the first draft of the paper. T. Clark, H. Cousijn and C. Goble clarified the ideas and concepts in the paper. All authors edited and reviewed the final version of the article.
ACKNOWLEDGEMENTS
This work was partially supported by Horizon 2020, INFRADEV-4-2014-2015, 654248, CORBEL, Coordinated Research Infrastructures Building Enduring Life-science services.
Notes
DataCite Metadata Working Group. 2019. “DataCite Metadata Schema Documentation for the Publication and Citation of Research Data v4.2.” DataCite. https://schema.datacite.org/meta/kernel-4.2/index.html.