Abstract
We examine the intersection of the FAIR principles (Findable, Accessible, Interoperable and Reusable), the challenges and opportunities presented by the aggregation of widely distributed and heterogeneous data about biological and geological specimens, and the use of the Digital Object Architecture (DOA) data model and components as an approach to solving those challenges that offers adherence to the FAIR principles as an integral characteristic. This approach will be prototyped in the Distributed System of Scientific Collections (DiSSCo) project, the pan-European Research Infrastructure which aims to unify over 110 natural science collections across 21 countries. We take each of the FAIR principles, discuss them as requirements in the creation of a seamless virtual collection of bio/geo specimen data, and map those requirements to Digital Object components and facilities such as persistent identification, extended data typing, and the use of an additional level of abstraction to normalize existing heterogeneous data structures. The FAIR principles inform and motivate the work and the DO Architecture provides the technical vision to create the seamless virtual collection vitally needed to address scientific questions of societal importance.
1. INTRODUCTION
For hundreds of years, scientists have collected and studied plants, animals, rocks, minerals and fossils from our planet. Representing the world's known biological and geological diversity, more than 3 billion physical specimens are housed, organized and cataloged as natural science collections (NSC) in thousands of museums around the world. These represent an unparalleled resource, a scientific infrastructure for discovering and documenting the world's bio- and geo-diversity: its past, present and future, and its influence on global challenges in environment and society. Today's systems for exploiting this material, however, are slow, expensive, inefficient and limited [1,2]. Despite significant existing global domain resources such as the Global Biodiversity Information Facility (GBIF), which aggregate and serve primary biodiversity data that include collections-related elements, the systematic absence of linkages to other data classes, such as DNA sequences, literature, ecosystem and medical/chemical data, represents a significant impediment to maximizing the impact of NSCs. By creating representations of specimens and collections in cyberspace and treating these assets digitally, as “digital specimens” and “digital collections”, it is possible to persistently link data classes together, enabling seamless unified access to information. Such data-rich “virtual collections” offer possibilities for wider, more flexible and meaningful access for a varied range of science and policy applications.
2. CHALLENGES
A true virtual collection resource requires that the data it holds be findable, accessible, interoperable and reusable. These FAIR principles [3] are first principles in building this global scientific asset for natural sciences research, as well as being first principles in other research domains and, to the degree possible, across them.
The importance of achieving data “FAIRness” (the attribute of data to be findable, accessible, interoperable and reusable) in biodiversity science and geoscience is becoming increasingly clear. As humanity confronts the reality of climate change and all that it entails in precedents and outcomes, understanding the variety of life and its environment on the planet, reliably and at a fine level of detail, is essential to our well-being. What has been lost, what is the current state, and what are the trends? And how do changes affect the balance of ecosystems that mankind depends upon for water, food, health, etc.? While the combined natural science collections and information related to those collections do not provide any easy answers, they do form an invaluable information resource that must be fully exploited toward addressing such questions.
Findable: The first requirement in building digital research infrastructures for bio- and geo-diversity is the ability to find relevant resources, starting with the digital specimens and digital collections that anchor the information facet of infrastructure. Finding resources requires that i) they are uniquely and persistently identified, and ii) their identities are closely bound to enough metadata to discover identifiers when those are unknown. In library and information retrieval terms, this represents the difference between a known item search and a subject search. In network terms, the subject search, run against one or more relevant catalogs or indexes, reveals the identifier(s) and the identifier resolution system reveals the network locations of the identified resources. The current state of specimen resource identification and associated metadata falls far short of this requirement when considered on a global or regional scale but is not impossible today at an institutional level. Meeting this challenge requires completing the massive digitization effort to create the digital surrogates (including metadata) for the physical specimens, based on agreed identifier and metadata standards and, of course, a consistent effort on the part of collection holders to apply those standards. It should also be noted that, like libraries and archives, the resources of interest in this area are expected, and have already been shown, to be useful over centuries. Mechanisms for finding digital specimens and collections must persist over unimaginable changes in technology.
Accessible: Identifying and locating a digital specimen or other resource is not the same as being able to access and use it. Access may require use of an unknown protocol or special permissions. The current sets of digitized specimens and related information are held in various collection management systems with widely different functionalities and management approaches. A recent survey by Koureas (2018, unpublished) across 115 European natural science museums shows that more than 100 commercial and in-house solutions are in use. This legacy must be respected and, even following consolidation and harmonization, each different system must be approached on its own terms. This makes it difficult to create a seamless virtual collection, so an approach that aggregates inputs from multiple heterogeneous collection management systems is needed.
Interoperable: The point of building a virtual collection of distributed data resources is to treat those as a unified scientific asset: to be able to easily find and access data across the combined set, and to be able to re-combine and/or otherwise compute across the data to develop new knowledge and test the old. Digital specimens and related data must be represented in a common manner using known formats and metadata schemes without replacing all that exist now. Metadata must be detailed enough for data to be understood by those who do not own and did not create the data. Meaning acquired from interpreting specimens must be made explicit by using appropriate standard representation schemes; otherwise, semantic differences create substantial barriers to interoperability [4]. Additionally, users should not need to know different methods for working with logically similar but locationally separate parts of the collection. The results of applications and analyses that run across the virtual collections must be trusted, and that trust comes only from an understanding that apples are compared to apples and oranges to oranges. Applying algorithms, statistical tests, or other analysis to heterogeneous data without understanding whether the measurements or observations being used represent the same information in the same way can produce reasonable-sounding results that are in fact misleading. There will be great value in the envisioned virtual collection provided it is built with care and a congruent understanding of what is being assembled.
Reusable: Given that data resources can be found, accessed, and sufficiently well represented and understood to be interoperable, those resources can be reused within individual bio- and geo-diversity domains, across those domains, and, with effort and care, across related domains. Information is created by interpreting data to attach meaning, and that meaning exists in a context [5]. A pattern of changes in temperature does not mean much if its context is unknown. Each time data cross a boundary into a new context they must carry their original meaning with them or allow the new context to obtain an understanding of that meaning through another mechanism. This will not always be self-evident, and data that carried a clear meaning in one context can be in danger of losing it in a new context, e.g., when combined with data originating elsewhere under different circumstances. Addressing this issue, that is, making sure that data are not misrepresented in the new context of seamless unified access to bio- and geo-diversity data, or indeed in any aggregating environment, is one of the more formidable challenges in achieving FAIRness. Technology alone cannot solve this problem, but it can provide tools, approaches and standards that make it possible for people to solve [5].
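The apples-to-apples point can be illustrated with a minimal sketch. The field names, units and values below are invented for illustration and are not drawn from any real collection. The sketch pools measurements recorded in different units, first naively and then after normalizing to one agreed unit.

```python
from statistics import mean

# Hypothetical wing-length records; values, field names and units are invented.
collection_a = [{"wing_length": 52.0, "unit": "mm"}, {"wing_length": 48.0, "unit": "mm"}]
collection_b = [{"wing_length": 5.0, "unit": "cm"}, {"wing_length": 5.5, "unit": "cm"}]

# Conversion factors to a single agreed unit (millimetres).
TO_MM = {"mm": 1.0, "cm": 10.0}


def naive_mean(records):
    """Average the raw numbers, ignoring what they represent."""
    return mean(r["wing_length"] for r in records)


def normalized_mean_mm(records):
    """Convert every value to millimetres before averaging."""
    return mean(r["wing_length"] * TO_MM[r["unit"]] for r in records)


pooled = collection_a + collection_b
print(naive_mean(pooled))          # mixes mm and cm values
print(normalized_mean_mm(pooled))  # like compared with like
```

With these invented values the naive mean is 27.625 and the normalized mean is 51.25 mm. Both look like plausible numbers, which is exactly why the unit has to travel with the measurement.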
3. DIGITAL OBJECT ARCHITECTURE
One approach to solving the challenges described above, to aggregating and manipulating heterogeneous research data, is the set of principles embodied in Digital Object Architecture (DOA). DOA began at the Corporation for National Research Initiatives (CNRI), the home institution of Lannom, but as interest and use have grown it has been handed off for the public good to the non-profit DONA Foundation, Geneva, Switzerland. The DONA website describes it as follows [6]:
“The Digital Object (DO) Architecture (also known as the DO Architecture or simply the DOA) is a logical extension of the Internet architecture that addresses the need to support information management more generally than just conveying information in digital form from one location in the Internet to another. The DOA enables interoperability across participating information systems, whether in the Internet or not. It is a non-proprietary architecture and is publicly available.”
The approach is described in detail elsewhere [6,7,8]. We describe it only briefly here to show how it can be applied to address the challenges outlined above. DOA's fundamental benefit to the management of heterogeneous data is to provide a means of grouping, managing and processing fragments of data and information in a uniform manner through a new layer of abstraction: the digital object. There is a rich history of adding layers of abstraction to solve problems of complexity in computing and information management and the DO Architecture continues this trend. The Internet today is a prime example: it provides a virtual network connecting many heterogeneous networks through use of routers and a single address space (the Internet Protocol (IP) address). Computer operating systems we all use every day provide a layer of abstraction over bit storage using files, thus allowing general interoperability across computing platforms. High-level programming languages combined with interpreters and compilers allow software applications to be used with ease across multiple computing environments. The DO Architecture aims to do this with networked data and information, as Figure 1 illustrates.
Services shown in the “cloud” can be orchestrated to provide an object view of underlying storage, e.g., file systems, or basic data management systems such as databases. The resulting set of identified objects provides a common, and constant, view with “remote control” management of data distributed in various locations and systems, which can change without changing the virtualized object. These services exist today in one form or another. The well-established and successful use of persistent identifiers for scholarly journal articles is one such example. However, others are not yet widely used, and few are tightly coordinated and orchestrated in the way needed.
We do not claim that simply articulating this vision solves the enormous challenges involved in creating seamless distributed research environments, e.g., semantic interoperability [9], but we feel strongly that it provides a way forward that matches well with the FAIR principles [3,10]. It needs to be developed and evaluated, both for specific disciplines, as is being proposed here, and as an approach to more generic scientific data challenges, such as the findability of provenance data coming out of workflow automation [11].
4. ADDRESSING THE CHALLENGES
Findable: Referential integrity, the ability to reference items reliably and persistently over time and beyond changes in technology, is key to addressing findability. Persistent identifiers are at the heart of the DO Architecture, with each DO assigned an identifier and that identifier globally resolvable to current state data about the DO, e.g., where and how to access data. Note here that everything in the DO Architecture is treated as an object; thus, metadata objects are also uniquely and persistently identified. Metadata can be tightly linked with the object to which it refers and, from the point of view of client software, included with the object, or can be linked through identifiers, with each metadata object containing the identifier of the object it describes as well as its own identifier. Searching across appropriate collections of metadata to find the identifier(s) of relevant object(s) requires aggregating that metadata either centrally ahead of time, in real-time across distributed collections, or some combination of the two. In the DO Architecture this is the role of one or more metadata registries, which are special forms of repositories providing access to metadata objects. The registry maintained for film and television assets by the Entertainment Identifier Registry Association (EIDR) is one example [12].
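The relationship between metadata objects, a registry and identifier resolution can be sketched as follows. This is an in-memory illustration only: the class names, the identifier strings with the made-up prefix 99999.TEST, and the example URL are all assumptions, and no real resolution system or registry API is being modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MetadataObject:
    pid: str                 # the metadata object's own identifier
    describes: str           # identifier of the object it describes
    fields: Dict[str, str]   # descriptive metadata used for subject search


class Resolver:
    """Maps an identifier to current state data (here: access locations only)."""

    def __init__(self) -> None:
        self._records: Dict[str, List[str]] = {}

    def register(self, pid: str, locations: List[str]) -> None:
        self._records[pid] = list(locations)

    def resolve(self, pid: str) -> List[str]:
        return self._records[pid]


class MetadataRegistry:
    """Holds metadata objects and answers subject searches with identifiers."""

    def __init__(self) -> None:
        self._objects: Dict[str, MetadataObject] = {}

    def add(self, md: MetadataObject) -> None:
        self._objects[md.pid] = md

    def search(self, **criteria: str) -> List[str]:
        # Return identifiers of described objects whose metadata match all criteria.
        return [
            md.describes
            for md in self._objects.values()
            if all(md.fields.get(k) == v for k, v in criteria.items())
        ]


# All identifiers and URLs below are made up.
resolver = Resolver()
registry = MetadataRegistry()
resolver.register("99999.TEST/specimen-1", ["https://repo.example.org/objects/specimen-1"])
registry.add(MetadataObject(
    pid="99999.TEST/md-1",
    describes="99999.TEST/specimen-1",
    fields={"genus": "Quercus", "country": "NL"},
))

# Subject search reveals the identifier; resolution reveals where to access the object.
found = registry.search(genus="Quercus")
locations = [resolver.resolve(pid) for pid in found]
```

The metadata object carries two identifiers, its own and that of the object it describes, so the link between the two survives even if either is moved to a different location.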
Building FAIR data and services begins with creating digital objects. Establishing the proper granularity level at which to apply DOA is essential. What exactly receives an identifier and becomes the first-class citizen in the environment? In the case of NSC the answer is clear – the digital surrogate of the physical specimen is the primary object type. From that point on, the metadata (especially provenance of the specimen) can be made part of the object by mapping or brokering underlying information into object form, as illustrated in Figure 1. Metadata is made accessible through a registry presenting uniform access methods across heterogeneous collection management systems.
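The idea of mapping or brokering underlying information into object form can be sketched as below. The two source systems, their field names and the fields of the common object are hypothetical, chosen only to show the pattern; they do not reflect any agreed digital specimen specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalSpecimen:
    """Common object form; the chosen fields are an assumption for illustration."""
    pid: str
    scientific_name: str
    catalog_number: str
    institution: str


# Hypothetical field mappings from two source systems to the common object form.
FIELD_MAPS = {
    "cms_a": {"sciName": "scientific_name", "catNo": "catalog_number", "inst": "institution"},
    "cms_b": {"TAXON": "scientific_name", "ACCESSION": "catalog_number", "HOLDER": "institution"},
}


def to_digital_specimen(source_system, record, pid):
    """Broker one source record into the common object form."""
    mapping = FIELD_MAPS[source_system]
    # Only mapped fields are carried over; unmapped source fields are ignored here.
    common = {target: record[source] for source, target in mapping.items()}
    return DigitalSpecimen(pid=pid, **common)


record_a = {"sciName": "Quercus robur", "catNo": "A-0001", "inst": "Museum A"}
record_b = {"TAXON": "Fagus sylvatica", "ACCESSION": "B-7", "HOLDER": "Museum B", "SHELF": "12"}

specimen_a = to_digital_specimen("cms_a", record_a, "99999.TEST/specimen-a")
specimen_b = to_digital_specimen("cms_b", record_b, "99999.TEST/specimen-b")
```

Client software that receives `specimen_a` and `specimen_b` sees the same structure in both cases and does not need to know which collection management system each came from.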
Accessible: Once identified and found, NSC data must be accessible. The identifier system can help with this, providing authentication and authorization requirements as part of the current state data of the identified object(s). Repositories serve as object portals, regardless of where or how the data are stored. This level of indirection allows clients to use a single access protocol across multiple underlying information organization schemes – a role fulfilled by the Digital Object Interface Protocol (DOIP) [13].
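The level of indirection described here can be illustrated with a small sketch in which one repository interface fronts two different storage schemes. It does not implement DOIP or its message format: the class names, the token check and the identifiers are all hypothetical, and the authorization step is a placeholder.

```python
import json
from abc import ABC, abstractmethod


class Backend(ABC):
    """One underlying storage scheme; each knows how to fetch by its own local key."""

    @abstractmethod
    def fetch(self, local_key):
        ...


class TableBackend(Backend):
    def __init__(self, rows):
        self._rows = rows            # e.g. rows keyed by a database primary key

    def fetch(self, local_key):
        return dict(self._rows[local_key])


class JsonFileBackend(Backend):
    def __init__(self, files):
        self._files = files          # e.g. file name -> JSON text

    def fetch(self, local_key):
        return json.loads(self._files[local_key])


class Repository:
    """Object portal: clients address objects by identifier only."""

    def __init__(self):
        self._index = {}             # pid -> (backend, local_key, requires_auth)

    def add(self, pid, backend, local_key, requires_auth=False):
        self._index[pid] = (backend, local_key, requires_auth)

    def retrieve(self, pid, token=None):
        backend, local_key, requires_auth = self._index[pid]
        # Placeholder check only; real authentication and authorization are out of scope.
        if requires_auth and token is None:
            raise PermissionError(f"{pid} requires authorization")
        return backend.fetch(local_key)


repo = Repository()
repo.add("99999.TEST/specimen-a", TableBackend({17: {"name": "Quercus robur"}}), 17)
repo.add("99999.TEST/specimen-b",
         JsonFileBackend({"b.json": '{"name": "Fagus sylvatica"}'}), "b.json",
         requires_auth=True)

open_object = repo.retrieve("99999.TEST/specimen-a")
restricted_object = repo.retrieve("99999.TEST/specimen-b", token="demo-token")
```

The client calls `retrieve` in the same way for both objects, while the storage details and the access requirement stay behind the repository.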
Interoperable: Interoperability is a key challenge presented by heterogeneous scientific data collections and is the raison d'être of the DO approach. Multiple information systems confront the research community with multiple access paths, multiple data organization and representation schemes, varying degrees of metadata completeness and heterogeneous methods. Invoking a digital object approach does not solve these problems by itself but offers a set of approaches to ameliorate difficulties. Mapping multiple schemes into one or a small number of common schemes, identifying those schemes, and associating a set of methods or named operations with each scheme type allows client software to navigate across and operate within multiple environments without detailed knowledge of underlying systems.
A key ingredient of DO Architecture addressing interoperability is an extended notion of data types. This was explored by the Research Data Alliance [14], has been adopted as an ICT standard in public procurement [15] and is currently under consideration by ISO. These data types are intended to serve as an additional level of indirection such that the type of an object can be associated with a set of common characteristics and identified processes and operations, tied together in one or more publicly accessible registries of types.
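A minimal sketch of this extra level of indirection follows, assuming a registry that associates a type identifier with named operations. The registry structure, the type identifiers and the operation name are invented for illustration and do not follow any published type registry specification.

```python
TYPE_REGISTRY = {}   # type identifier -> {operation name: callable}


def operation(type_id, name):
    """Decorator that registers a named operation for a type identifier."""
    def decorator(func):
        TYPE_REGISTRY.setdefault(type_id, {})[name] = func
        return func
    return decorator


@operation("99999.TEST/type-specimen", "describe")
def describe_specimen(obj):
    return f"{obj['content']['name']} ({obj['pid']})"


@operation("99999.TEST/type-collection", "describe")
def describe_collection(obj):
    return f"collection of {len(obj['content']['members'])} objects ({obj['pid']})"


def invoke(obj, name):
    """Look up the operation by the object's type, then apply it."""
    operations = TYPE_REGISTRY.get(obj["type"], {})
    if name not in operations:
        raise LookupError(f"type {obj['type']} has no operation {name!r}")
    return operations[name](obj)


# Hypothetical objects; each carries its own identifier and a type identifier.
specimen = {"pid": "99999.TEST/specimen-a", "type": "99999.TEST/type-specimen",
            "content": {"name": "Quercus robur"}}
collection = {"pid": "99999.TEST/coll-1", "type": "99999.TEST/type-collection",
              "content": {"members": ["99999.TEST/specimen-a", "99999.TEST/specimen-b"]}}

descriptions = [invoke(obj, "describe") for obj in (specimen, collection)]
```

The client asks for the same named operation on both objects and the type identifier selects the implementation, so the client needs no built-in knowledge of either structure.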
Reusable: All the principles of the Digital Object Architecture come together in reusability. To be leveraged, existing data must be findable, accessible, and interoperable. Once the current level of heterogeneity has been simplified through an added layer of abstraction, the resulting objects are ready for further investigation and reuse.
The work of building this set of interoperable objects within a domain of biodiversity and geoscience data appears daunting, but the results promise to be worthwhile. When a pattern of creating and linking objects becomes clear, it can be pursued over time and increasing amounts of data will become available for further research and for linking to new data.
5. DISTRIBUTED SYSTEM OF SCIENTIFIC COLLECTIONS (DISSCO)
DiSSCo is a priority pan-European Research Infrastructure aiming to unify more than 110 individual NSCs across 21 countries into a single open and FAIR scientific information resource [17]. DiSSCo evaluates the DO Architecture approach toward building seamless access for bio/geo specimen data, beginning with digitized surrogates of over a billion specimens from natural science collections across Europe. The Handle System [17,18] will serve as the basis for the Natural Science Identifier (NSId). DiSSCo will prototype use of the CORDRA digital object repository [19]. These, plus other existing components and concepts related to the DO Architecture, including PID Kernel information [20] and data types [14], are shown in context in Figure 2. The FAIR principles inform and motivate the work and the DO Architecture provides the technical vision to create the seamless virtual collection vitally needed to address scientific questions of societal importance.
AUTHOR CONTRIBUTIONS
L. Lannom is a key member of the team at Corporation for National Research Initiatives (CNRI) that has designed and developed the Digital Object Architecture. A.R. Hardisty and D. Koureas conceived and investigated use of Digital Object Architecture to address challenges in building research infrastructure for biodiversity science and geoscience. All authors contributed equally to the writing, review and approval of the present article.