Abstract
Since their publication in 2016, we have seen a rapid adoption of the FAIR principles in many scientific disciplines where the inherent value of research data and, therefore, the importance of good data management and data stewardship, are recognized. This has led many communities to ask “What is FAIR?” and “How FAIR are we currently?”, questions which were addressed, respectively, by a publication revisiting the principles and by the emergence of FAIR metrics. However, early adopters of the FAIR principles have already run into the next question: “How can we become (more) FAIR?” This question is more difficult to answer, as the principles do not prescribe any specific standard or implementation. Moreover, there does not yet exist a mature ecosystem of tools, platforms and standards to support human and machine agents in managing, producing, publishing and consuming FAIR data in a user-friendly and efficient (i.e., “easy”) way. In this paper we show, however, that many examples of FAIR tools are already under development. This paper puts forward the position that we are likely already in a creolization phase, in which FAIR tools and technologies merge and combine, before converging in a subsequent phase to solutions that make FAIR feasible in daily practice.
1. INTRODUCTION
At a glance, the FAIR principles simply stipulate a number of “best practices” on how to deal with data and their associated metadata. However, a more careful reading of both the principles and their associated publications [1,2] reveals some of the potential complexities when trying to implement FAIR [3]. These issues break down into at least three specific, orthogonal aspects. Firstly, a number of principles provide guidelines about the relationship between data, the representation of the data and the associated metadata that describes the data more fully (e.g., F1, F2, F3, I1, I2, I3, R1, R1.1, R1.2, R1.3). Even though it is clear what is required by these principles, it is not specified how it should be done, i.e., FAIR is not, in itself, a standard [2]. Secondly, there are a number of principles that require extensive infrastructural support such as search engines, communication protocols and identifier resolution services (e.g., F4, A1, A2). Thirdly, there are a number of principles that refer to a community consensus or standard, either explicitly (R1.3 and, by recursion, I2) or implicitly, concerning, for example, the definitions of “rich”, “shared” and “relevant” (F2, I1, R1). Moreover, the principles are open to interpretation with regard to the type of digital resource and its granularity. For example, when a principle talks about “data”, does it refer to a data set as a whole, or could it refer to each individual data record (or item) contained in the data set? Finally, the principles need to be taken as guidelines that primarily aim to enable machines to (autonomously) interact with data [1], thus adding another possible layer of interpretation and implementation complexity.
In this paper we consider which tools and technologies are currently available and which functionality, to the best of our knowledge, is still lacking to support stakeholders in each step from FAIR data management planning to FAIR data creation, publication, evaluation and (re)use. As authors we have also developed such tools in recent years, and we include them here in order to illustrate possible solutions and highlight open issues. A full and comprehensive review of relevant tools and technologies is out of scope for this paper, but the references in this paper are available as a community-editable Wiki page [4], and we welcome contributions there in order to increase awareness of existing efforts and to facilitate technological creolization [5] and convergence.
2. FAIR DATA MANAGEMENT PLANNING
With the increase of data-driven research and the rising importance of digital research objects and other digital artifacts [6], e.g., for the purpose of reuse and reproducibility [7], there is more need than ever for researchers to follow proper data management procedures. Moreover, researchers are increasingly required to provide a Data Management Plan (DMP) that meets the requirements set out by different funding organizations [8] and that serves as an adaptable, guiding document of the data management process during the project. A large number of DMP tools have emerged to assist researchers in creating and maintaining DMPs.
The main challenge for a DMP tool is to efficiently transfer knowledge regarding the many organizational, procedural and technical aspects of data management and data stewardship to an audience of researchers from different backgrounds and domains, in order to produce an application- and domain-relevant DMP and to maximize opportunities for good data handling and reuse during and after the project. Many of these tools use the FAIR guiding principles for data management, but do so in a variety of ways. Here we take a look at two examples: DMPOnline [9] and the Data Stewardship Wizard (DSW) [10]; for a more complete discussion, please see [8]. DMPOnline has recently seen rapid adoption by researchers and organizations as the go-to tool to produce funder-compliant DMPs. It provides an online, collaborative environment with (mostly) open text forms divided into sections following a configurable funder's DMP template. For each section, DMPOnline embeds explanatory text from a configurable set of sources, which may be DMP guidelines from funding organizations or academic institutions and may (or may not) contain FAIR-specific guidance. In contrast, the DSW tool guides the user through a comprehensive, “FAIR-aware” data management knowledge model by asking a number of multiple-choice questions with embedded book excerpts for additional explanation [11]. This organization allows DSW to very efficiently point the user to the relevant data stewardship issues, tools and other resources, by omitting the parts of the larger knowledge model that would only apply to other cases. DSW also facilitates automatic evaluation of the answers, for example in order to produce FAIRness metrics or other evaluation scores. In the future we are likely to see a continuation of efforts toward machine-actionable DMPs and tooling①, thus enabling DMP interoperability, exchange and (semi-)automatic evaluation of (parts of) the reported data management process. Interestingly, the FAIR metrics (see the last section) share similar objectives, which suggests that DMP and FAIR metrics tools may be destined for co-evolution.
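To illustrate how such automatic evaluation of answered questions might work, consider the following minimal sketch in Python. The questions, their mapping to FAIR principles and the weights are entirely hypothetical and do not reflect the actual DSW knowledge model; the sketch only shows how multiple-choice answers could be turned into an indicative FAIRness score.

```python
# Hypothetical sketch: scoring answered DMP questions against weighted,
# FAIR-related criteria. Questions, principle mappings and weights are
# illustrative, not the actual DSW knowledge model.

ANSWERS = {
    "uses_persistent_identifiers": True,   # relates to F1
    "metadata_in_public_registry": True,   # relates to F4
    "standard_access_protocol": True,      # relates to A1
    "uses_community_vocabularies": False,  # relates to I2
    "license_specified": False,            # relates to R1.1
}

WEIGHTS = {
    "uses_persistent_identifiers": 2.0,
    "metadata_in_public_registry": 1.0,
    "standard_access_protocol": 1.0,
    "uses_community_vocabularies": 1.5,
    "license_specified": 1.5,
}

def fairness_score(answers: dict, weights: dict) -> float:
    """Return a weighted FAIRness indication in [0, 1]."""
    achieved = sum(weights[q] for q, ok in answers.items() if ok)
    return achieved / sum(weights.values())

print(f"Expected FAIRness: {fairness_score(ANSWERS, WEIGHTS):.0%}")
```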
3. FAIR DATA PRODUCTION
One of the main challenges following from the FAIR guidelines is that they propose a number of attributes to be associated with the data: unique identifiers [12], (qualified references to) rich metadata, use of vocabularies, provenance, etc. The value of these attributes to any downstream data consumer (be it a human or a machine agent) is quite clear, but providing them can also pose a burden on the data producer. We foresee the emergence of a category of tools that support data producers in ensuring that the data contain the required attributes. These “FAIRifier” tools may come in many different flavors: supporting either generic or domain-specific use cases, FAIRifying at the source or post hoc, targeting different end-users (e.g., data scientists or data stewards), using different technologies (e.g., Semantic Web technology) and supporting (semi-)automated or manual workflows.
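As an illustration of these attributes, the following minimal sketch (in Python, using the rdflib library) attaches a unique identifier, descriptive metadata drawn from shared vocabularies (DCAT, Dublin Core) and a provenance link (PROV-O) to a data set description. All example URIs are hypothetical.

```python
# A minimal sketch of the attributes a FAIRifier could attach to a data set.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
# F1: a globally unique identifier for the data set (hypothetical).
dataset = URIRef("https://example.org/dataset/42")
g.add((dataset, RDF.type, DCAT.Dataset))
# F2: descriptive metadata using a shared vocabulary (Dublin Core).
g.add((dataset, DCTERMS.title, Literal("Example measurements")))
# R1.1: an explicit license.
g.add((dataset, DCTERMS.license,
       URIRef("https://creativecommons.org/licenses/by/4.0/")))
# R1.2: a provenance link to the (hypothetical) source data.
g.add((dataset, PROV.wasDerivedFrom,
       URIRef("https://example.org/dataset/raw/42")))

print(g.serialize(format="turtle"))
```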
We have developed a general-purpose FAIRifier on the basis of the OpenRefine data cleaning and wrangling tool [13,14,15] and the RDF plugin②. This FAIRifier enables a post-hoc FAIRification workflow: load an existing data set (from a wide range of formats), (optionally) perform data wrangling tasks, add FAIR (metadata) attributes to the data, generate a linked data version of the data and, finally, push the result to an online FAIR data infrastructure to make it accessible and discoverable. Literal values in a data set can be replaced by identifiers (URLs) either manually, by semi-automatic mapping to pre-loaded ontologies (using the OpenRefine reconciliation function) or by embedded, customizable script expressions. The interoperability of the data set can be improved by connecting these identifiers into a meaningful semantic graph structure (model) of ontological classes and properties using the integrated RDF model editor. A provenance trail automatically keeps track of each modification and additionally enables “undo” operations and repetition of operations on similar data sets. A FAIR data export function opens up a metadata editor to provide information about the data set itself: title, publisher (author), license, and a range of additional optional metadata.
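The following sketch illustrates the core of such a post-hoc FAIRification step: free-text literals in a tabular column are replaced by ontology identifiers before linked data is generated. The lookup table stands in for OpenRefine's reconciliation against a pre-loaded ontology, and the data rows and namespace are hypothetical (the NCBI Taxonomy URIs are real).

```python
# A minimal sketch of the literal-to-identifier reconciliation step.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("https://example.org/fairified/")  # hypothetical namespace

# Mapping from free-text literals to ontology terms, normally suggested
# by a reconciliation service against a pre-loaded ontology.
SPECIES = {
    "human": URIRef("http://purl.obolibrary.org/obo/NCBITaxon_9606"),
    "mouse": URIRef("http://purl.obolibrary.org/obo/NCBITaxon_10090"),
}

rows = [("sample-1", "human"), ("sample-2", "mouse")]  # hypothetical data

g = Graph()
for sample_id, species_text in rows:
    sample = EX[sample_id]
    g.add((sample, RDF.type, EX.Sample))
    # Replace the literal with an identifier where a mapping exists;
    # keep the literal otherwise, so no information is lost.
    g.add((sample, EX.fromOrganism,
           SPECIES.get(species_text, Literal(species_text))))

print(g.serialize(format="turtle"))
```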
Future development plans include features to make the FAIRifier easier to use for non-technical users. This includes functionality to suggest transformations and to (semi-)automatically apply graph models based on libraries of ontologies and graph models created by other (expert) users. Many other tools have demonstrated FAIRification capabilities with different benefits and limitations. To name a few: Karma③ offers a user-friendly interface and an automatic model selection capability that are not available in the OpenRefine-based FAIRifier, but lacks some of its other features. RightField [16] and OntoMaton [17] transparently integrate FAIRification for end-users by pre-configuring spreadsheet applications with a semantic data model. The different concepts and functionalities offered by these tools are all worth further evaluation and development in the context of creating a rich ecosystem of FAIRifier tools. Finally, note that the tools mentioned in this section and the next adopt ontologies [18] and linked data [19]. These technologies align very well with a number of FAIR principles “out of the box”, but other tools may choose a different core technology for their implementation.
4. PUBLISHING FAIR DATA
Data coming from a FAIRifier can still not be considered fully FAIR and machine actionable unless they have been published to, or otherwise made available via, the Internet. Here we focus mainly on the principles collected under the “A” and related infrastructural aspects; for issues regarding the Findability of FAIR data sets, please see the last section. Arguably, the main challenge regarding Accessibility is to make every part of the access process machine actionable, so that machines can automatically negotiate access (based on conditions set by the data owner) and retrieve data and metadata in order to (semi-)automatically evaluate their fitness for purpose. Part of this problem relates to the representation of accessibility conditions and their organizational, regulatory or legal framework [20,21,22]. Another part requires specific support from the infrastructure, i.e., if conditions permit access, the infrastructure should allow data consumers to get to the data in a straightforward, predictable way. This means choosing among a large number of protocols and APIs and their respective standards and conventions.
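By way of illustration, the following sketch expresses one such machine-readable access condition using the W3C ODRL vocabulary and parses it with rdflib. The policy, data set and purpose URIs are hypothetical; the point is only that a machine agent could evaluate such a policy before attempting retrieval.

```python
# A minimal sketch of a machine-readable access condition in ODRL.
from rdflib import Graph

policy = """
@prefix odrl: <http://www.w3.org/ns/odrl/2/> .
@prefix ex:   <https://example.org/> .

# Hypothetical policy: dataset42 may be read for academic research purposes.
ex:policy1 a odrl:Offer ;
    odrl:permission [
        odrl:target ex:dataset42 ;
        odrl:action odrl:read ;
        odrl:constraint [
            odrl:leftOperand odrl:purpose ;
            odrl:operator odrl:eq ;
            odrl:rightOperand ex:AcademicResearch
        ]
    ] .
"""

g = Graph()
g.parse(data=policy, format="turtle")
print(f"Parsed {len(g)} policy triples")
```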
We have developed the concept of a FAIR Data Point (FDP) [23] with a dual, ongoing goal: 1) to demonstrate comprehensive compliance with the FAIR principles and metrics, and 2) to serve as a lightweight infrastructural component and standard that may be used by existing repositories and infrastructures. Primary design objectives to support these goals were to require only minimal (but extensible) semantic descriptions and to adopt a lightweight interface. An FDP serves relevant, FAIR metadata as RDF over a simple RESTful API [24] on five hierarchical layers, starting with metadata about the FDP itself, followed by Catalogs, Data sets, Distributions and, finally, record-level metadata. Its metadata is mainly based on the widely used DCAT④ and Dublin Core⑤ standards, with minor extensions to comply with the FAIR principles (detailed in the FDP specification document⑥). Given an FDP URL, a DCAT-aware REST client can automatically traverse the FDP hierarchy down to the level of actual data records. Traversal may be directed by the client's evaluation of the metadata (e.g., for relevance) or may be halted by the FDP if access restrictions apply at that level. We intend to use the FDP in combination with more refined, currently emerging semantic models that describe access conditions (e.g., based on consent and GDPR regulations), and to integrate it with an Authentication and Authorization Infrastructure for applications in the health domain [25]. There are a number of other standards (most notably the Linked Data API⑦, Hydra⑧ and the Linked Data Platform⑨) that provide more sophisticated descriptions to the client about API state transitions and additional API functionality such as querying. We consider these efforts complementary to the FDP, and combinations are likely possible. We are currently evaluating in which scenarios such combinations would offer additional benefit before extending the FDP core functionality accordingly.
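The following sketch illustrates how such a DCAT-aware client could traverse the hierarchy, starting from a catalog layer. It uses plain DCAT link predicates (dcat:dataset, dcat:distribution, dcat:downloadURL) and a hypothetical catalog URL, and omits error handling and access negotiation; the authoritative predicates for each layer are defined in the FDP specification⑥.

```python
# A minimal sketch of a DCAT-aware client walking an FDP-style hierarchy.
from rdflib import Graph, Namespace

DCAT = Namespace("http://www.w3.org/ns/dcat#")

def fetch(url) -> Graph:
    """Retrieve one metadata layer; rdflib content-negotiates an RDF
    serialization over HTTP."""
    g = Graph()
    g.parse(str(url))
    return g

catalog_url = "https://fdp.example.org/catalog/1"  # hypothetical FDP catalog
catalog = fetch(catalog_url)

# Walk Catalog -> Dataset -> Distribution, collecting access URLs.
for dataset_ref in catalog.objects(None, DCAT.dataset):
    dataset = fetch(dataset_ref)
    for dist_ref in dataset.objects(dataset_ref, DCAT.distribution):
        distribution = fetch(dist_ref)
        for access_url in distribution.objects(dist_ref, DCAT.downloadURL):
            print(f"Data available at: {access_url}")
```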
5. EVALUATING THE FAIRNESS OF A RESOURCE
An emerging consideration for the different stakeholders involved in FAIR activities is the assessment of the FAIRness level of resources. It is often useful to assess to what extent a resource (data or metadata) follows the FAIR principles. This assessment can help evaluate whether the initial goals for the resource have been achieved, and can also help identify desirable points for improvement. A number of different initiatives are currently working on defining frameworks, methods and criteria for evaluating FAIRness, including the FAIR Metrics Group⑩, the RDA FAIR Data Maturity Model Working Group⑪ and the NIH Data Commons Pilot Phase Consortium⑫; these are mostly ongoing efforts. Nevertheless, a number of online evaluation tools and forms have become available [26,27,28,29,30], which illustrates the perceived importance of helping users measure the FAIRness of their own or other people's resources in all phases of the data life cycle.
For instance, the aforementioned Data Stewardship Wizard has incorporated metrics from the FAIR Metrics Group into its knowledge model, so that the user can get an indication of the FAIRness level that is expected of the yet-to-be-created data. After data creation, another evaluation can be performed to measure the achieved FAIRness level and, if necessary, the plan can be revised to mitigate any problems [31].
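The following sketch illustrates what such a post-creation, machine-testable evaluation could look like: it probes a published resource's metadata for a resolvable identifier (cf. principle A1) and an explicit license (cf. R1.1). The checks are deliberate simplifications for illustration, not an implementation of any published FAIR metric, and the URL is hypothetical.

```python
# A minimal sketch of automated post-hoc FAIRness checks on published metadata.
import urllib.request
from rdflib import Graph
from rdflib.namespace import DCTERMS

def check_resource(metadata_url: str) -> dict:
    results = {}
    # Simplified A1 check: does the identifier resolve over HTTP?
    try:
        with urllib.request.urlopen(metadata_url, timeout=10) as response:
            results["identifier_resolves"] = response.status == 200
    except OSError:
        results["identifier_resolves"] = False
    # Simplified R1.1 check: does the metadata carry an explicit license?
    g = Graph()
    try:
        g.parse(metadata_url)
        results["license_present"] = (None, DCTERMS.license, None) in g
    except Exception:
        results["license_present"] = False
    return results

print(check_resource("https://fdp.example.org/catalog/1"))  # hypothetical URL
```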
6. FINDING AND (RE)USING FAIR DATA
Arguably, efficient use and reuse of data is a major objective of the FAIR guiding principles. Consider an ideal digital world where all data are FAIR: machine agents should then be able to (autonomously) execute a process or workflow that finds (principles F) and accesses (A) any available, relevant data sources and automatically integrates, queries and reasons over the interoperable (I) data toward a useful result for a problem formulated by either human users or indeed other machine agents. It may therefore seem that reusability (R) is trivially solved if resources fully comply with the F, A and I principles and infrastructure exists to support them. However, we would argue that without due consideration of the principles under R, the data would still not be very (re)usable, and that the effects and requirements of the R principles permeate through to all the other principles, all steps in the data life cycle, as well as any FAIR-supporting infrastructures and tools. Consider, for example, the step of finding relevant data, a problem for which many technical solutions exist, including some that exhibit certain FAIR characteristics. One example is the FAIR data search engine prototype, which harvests FDP metadata, indexes it and offers a search UI and API for human and machine searches, respectively [32]. An alternative approach uses structured embedded metadata that may be crawled and indexed by existing online search services: for example, a Web page related to a data set could contain structured “Dataset” metadata⑬, which would allow the data set to show up in the Google Dataset Search⑭ service. Hybrid approaches are also possible: for example, the FDP includes a simple UI that embeds schema.org metadata. Yet even though there appears to be sufficient infrastructure to support Findability, the data that are found will not actually be usable if the metadata does not specify the legal conditions under which they may be used (R1.1), if the origin, relevance and trustworthiness of the data are not clear (R1.2), or if the data do not follow standards relevant for a given domain (R1.3). The main challenge regarding the reusability of the data is therefore to make sure that any FAIR resource includes such a “plurality of accurate and relevant attributes” (R1) to support data reuse. In the findability use case, these attributes could furthermore be used to improve search results by automatically prioritizing relevant, trustable results that the requester is legally able to use for their specific purpose. We note that non-technical developments are of influence as well: a positive example is the recent adoption of the GDPR [33], which is increasingly cited as motivation for work on capturing and modeling data usage conditions and constraints [20]. Such works are important precursors for convergence toward broadly accepted and generically applicable metadata standards for data use and access constraints, which have yet to emerge. Finally, communities themselves need to identify, develop and promote the required metadata standards, and metadata registry services play an important role toward convergence within and across domain boundaries. Registries may range from full-featured, generic solutions like FAIRsharing⑮ to relatively simple community recommendation lists [34,35,36].
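To make the embedded-metadata approach mentioned above concrete, the following sketch produces a schema.org “Dataset” description as JSON-LD, ready to be placed in a script tag on the data set's Web page so that crawlers such as Google Dataset Search can index it. All values in the description are hypothetical.

```python
# A minimal sketch of embedded schema.org "Dataset" metadata as JSON-LD.
import json

dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example measurements",
    "description": "Hypothetical example data set for illustration.",
    "identifier": "https://example.org/dataset/42",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Institute"},
}

# Wrap the description for embedding in the data set's HTML page.
html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset_jsonld, indent=2)
    + "\n</script>"
)
print(html_snippet)
```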
7. CONCLUSIONS
In this paper we have shown that there are many ongoing efforts that directly or indirectly contribute to the objective of making FAIR a reality. We have shown that these tools contribute to an ecosystem of FAIR tooling that covers everything from FAIR data management planning to the production, publication, evaluation, finding and (re)use of FAIR data. Some of these tools contribute to the design and development of (components of) FAIR infrastructures and platforms, while others address a very specific FAIR challenge. In most cases there are a number of alternative solutions with some overlapping, but also many complementary, features. Moreover, almost all of these efforts have dependencies on, or reach their full potential only in combination with, other FAIR tools and resources: e.g., FAIRifiers are typically more effective when registries of (community-adopted) FAIR data models and metadata standards are available, FAIR search and accessibility services cannot work without descriptions of usage and license conditions, etc. In our opinion this signals a creolization phase [5] of FAIR tool development. In the near future we will likely see an increase in the number of available FAIR tools, while simultaneously these tools will evolve, converge and merge in ways that cannot currently be foreseen. Periodically checking alignment with the original aim and intention of the FAIR principles will help to converge such efforts toward the realization of mature FAIR tool ecosystems and infrastructures, FAIR-based domain-specific applications like the Personal Health Train⑯ [37], and the generic Internet of FAIR Data and Services [38].
AUTHOR CONTRIBUTIONS
M. Thompson ([email protected]) has drafted the first version of this paper; R. Kaliyaperumal ([email protected]), L.O. Bonino da Silva Santos ([email protected]) and K. Burger ([email protected]) have proof-read and contributed improvements to the text; all authors have contributed to the design and implementation of the FAIRifier and FAIR Data Point software and specifications described in the paper.
ACKNOWLEDGEMENTS
Part of this work is funded by the NWA program (project VWData – 400.17.605), by the Netherlands Organization for Scientific Research (NWO), by the European Joint Program Rare Diseases (grant agreement #825575) and ELIXIR-EXCELERATE (H2020-INFRADEV-1-2015-12).
Notes
DMP Common Standards WG. Available at: https://www.rd-alliance.org/groups/dmp-common-standards-wg.
OpenRefine RDF plugin. Available at: https://github.com/stkenny/grefine-rdf-extension.
Karma: A data integration tool. Available at: http://usc-isi-i2.github.io/karma/.