Abstract
According to the FAIR guiding principles, one of the central attributes for maximizing the added value of information artifacts is interoperability. In this paper, I discuss the importance of, and propose a characterization for, the notion of Semantic Interoperability. Moreover, I show that a direct consequence of this view is that Semantic Interoperability cannot be achieved without the support of, on the one hand, (i) ontologies, as meaning contracts capturing the conceptualizations represented in information artifacts and, on the other hand, (ii) Ontology, as a discipline proposing formal methods and theories for clarifying these conceptualizations and articulating their representations. In particular, I discuss the fundamental role of formal ontological theories (in the latter sense) to properly ground the construction of representation languages, as well as methodological and computational tools for supporting the engineering of ontologies (in the former sense) in the context of FAIR.
1. INTRODUCTION
In their seminal paper, Wilkinson et al. [1] propose the so-called FAIR guiding principles as the cornerstone for maximizing the added value of information artifacts. The principles are organized around the four general notions of Findability, Accessibility, Interoperability and Reusability (hence the acronym). Here, I focus on Interoperability. Firstly, we should be reminded that interoperability is not about finding ways to connect data artifacts but ultimately about affording the interoperation of humans mediated by these artifacts①. Information artifacts are instruments used by humans to harmonize their conceptualizations and, hence, interoperability approaches succeed to the extent that they can safely connect these conceptualizations. Secondly, in the description of the FAIR principles, interoperability is described in a recursive manner, stipulating that, in order for an information artifact to be interoperable, it should be described using semantic resources that themselves follow the FAIR principles. In other words, these artifacts can only be interoperable if they are grounded on artifacts that are themselves interoperable. In this article, I discuss the role of Formal Ontology, as a discipline, and of representation languages based on formal ontological principles, for grounding this entire enterprise.
The remainder of this paper is organized as follows. In Section 2, I discuss the importance of information integration and information systems interoperation, and the challenges of our current scenario, in which the information needed for addressing fundamental questions exists but is dispersed in multiple silos. In Section 3, I propose a theoretical characterization of the notion of semantic interoperability, as a relation between worldviews or, more technically, a relation between Conceptualizations. Finally, in Section 4, I defend a view of ontologies as kinds of “meaning contracts”, i.e., as artifacts that precisely characterize a given domain conceptualization. Moreover, I discuss the essential role of Formal Ontology and of ontology-driven representation languages for addressing this semantic interoperability challenge.
2. THE DATAVERSE IS A WORLD OF SILOS
Information is the foundation of all rational decision-making. Without the proper information, individuals, organizations, communities and governments can neither systematically take optimal decisions nor understand the full effect of their actions. In the past decades, information technology has played a fundamental role in automating an increasing number of information spaces. Simultaneously, there has been a substantial improvement in information access, motivated not only by advances in communication technology, but also by more recent demands on transparency and public access to information.
Despite these advances, most of these automated spaces have remained independent components in large and increasingly complex silo-based architectures. The problem is that, nowadays, several of the critical questions in large corporations, governments and scientific communities can only be answered by precisely connecting pieces of information distributed over these silos.
An example illustrating this point is put forth by Wilkinson et al. [1]:
“Suppose a researcher has generated a dataset of differentially-selected polyadenylation sites in a non-model pathogenic organism grown under a variety of environmental conditions that stimulate its pathogenic state. The researcher is interested in comparing the alternatively-polyadenylated genes in this local dataset, to other examples of alternative-polyadenylation, and the expression levels of these genes—both in this organism and related model organisms—during the infection process. Given that there is no special-purpose archive for differential polyadenylation data, and no model organism database for this pathogen, where does the researcher begin?”
Now, suppose that the information required to answer this question exists “in the ether”, but also that (as is usually the case) it only exists in dispersed form in a number of autonomous information silos. As a consequence, despite the increasing amount of information produced, as well as the improvements in information access, answering such critical questions is still extremely hard. In practice, they are still answered in a case-by-case fashion and still require a significant amount of human effort, which is slow, costly and error-prone. The problem of combining independently conceived information spaces and providing unified analytics over them is termed the problem of Interoperability [3].
Wilkinson et al. approach this thought experiment by considering (in case the appropriate datasets exist) the multiple aspects of dataset discovery, of operational and authorship rights over them, of formatting, and of integration. In this paper, I focus on the latter. Having access to datasets, as well as formatting issues, are indeed aspects connected to interoperability. The former is connected to physical or communication interoperability, i.e., to how we can connect networked systems to allow for distributed access to information in heterogeneous computational platforms. The latter is connected to syntactical interoperability, i.e., to how we can agree on standard syntactical structures for symbol processing that can be shared among parties. We have managed to make substantial advances in both aspects in the past decades, and having heterogeneous networked systems that exchange information in standardized formats (e.g., XML) is state of the practice. A much more difficult problem, which we are far from solving on a larger scale, is that of Semantic Interoperability [3].
3. INFORMATION STRUCTURES AND SEMANTIC INTEROPERABILITY
I subscribe here to the so-called representation view of information systems [4]. Following this view, an information system is a representation of a certain conceptualization of reality. To be more precise, an information system contains information structures that represent abstractions over certain portions of reality, capturing aspects that are relevant for a class of problems at hand. There are three direct consequences of this view.
Firstly, all information systems make ontological commitments. This has been discussed by several authors [3,5], but it was already clear in the first paper to mention the term “ontology” in computer science, namely, the classical Another Look at Data by George Mealy [6]. In that paper, Mealy defends that “data are fragments of a theory of the real world (my emphasis), and data processing juggles representations of these fragments of theory […] The issue is ontology, or the question of what exists”. For an information structure to represent a conceptualization, it must commit to the existence of the entities constituting that conceptualization. For example, an information system that records information about organ transplants commits to a theory about the existence of certain entities such as persons, surgeons, transplants, donors, donees (organ recipients), etc. This is illustrated in Figure 1. Let us postpone questions of notation for now. The important point here is that this commitment to a particular ontology of transplants is what defines the real-world semantics [3,7] of that information structure, and this is inevitable, even if the designers of those information systems are not aware of this commitment. Paraphrasing Collier [8], I defend that the opposite of ontology is not non-ontology, but just bad ontology.
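To make this commitment tangible, consider a minimal sketch (in Python, with purely hypothetical names) of an information structure for the transplant example. Even this small schema commits to the existence of persons, organs and transplant events, and to transplants being the kind of entity that connects donors, donees and surgeons:

```python
from dataclasses import dataclass

# Even a minimal schema makes ontological commitments: it asserts
# that persons, organs and transplant events exist in the domain,
# and that a transplant connects a donor, a donee and a surgeon.

@dataclass
class Person:
    name: str

@dataclass
class Organ:
    organ_type: str  # e.g., "heart" or "brain"

@dataclass
class Transplant:
    donor: Person
    donee: Person    # the organ recipient
    surgeon: Person
    organ: Organ
```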
Secondly, a direct consequence of this view is that the quality of an information system directly depends on how truthful its information structures are to the aspects of reality they purport to represent. These structures must represent all the relevant aspects of the underlying conceptualization in an unambiguous way and constrain the possible states of that information system to the states that represent intended states of affairs according to that conceptualization [9,10]. For example, they should proscribe the existence of data populations in that transplant system that reflect states of affairs in which the donor, the donee and the surgeon involved in a transplant are the same person!② This is a fundamental issue for semantic interoperability (as we will see soon) because, if these information structures are under-constrained [3,10], we can have two information systems agreeing on their possible data populations without having them agreeing on their intended populations. This is the so-called False Agreement problem [5]. To put it loosely, it is not a problem if two systems disagree, but it can be a significant problem if they falsely believe they agree.
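As an illustration, the intended constraint from the transplant example can be rendered as a first-order axiom (a sketch, using hypothetical predicate names). An information structure lacking an axiom of this sort is under-constrained in exactly the sense above: it admits data populations instantiating the unintended state of affairs.

```latex
\forall t, x \; (\mathit{Transplant}(t) \land \mathit{donor}(t, x)
   \rightarrow \lnot\, \mathit{donee}(t, x) \land \lnot\, \mathit{surgeon}(t, x))
```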
Thirdly, in order to connect two information systems A and B, we first need to understand the precise relation between the abstractions of entities in reality represented in A and B. Take for example the situation depicted in Figure 2. These two systems commit to different theories (ontologies) of transplants. Can we assume that, just because the same term (e.g., Person or Transplant) is used in both structures, they mean the same thing? Of course not! The only way to precisely characterize the relation between, for example, the type Person in system A and the homonymous type in system B is to find out the relation between their respective referents in their respective underlying conceptualizations. If that relation happens to be one of identity, then we can code the meta-properties of that relation (i.e., reflexivity, symmetry, transitivity and Leibniz's Law [11]) in the representation of that relation in the corresponding system. If that relation, however, turns out to be a different one (e.g., specialization, historical dependence, existential dependence, parthood [11]), we can also code its representation accordingly.
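For instance, the meta-properties of identity mentioned above admit a standard first-order formulation (the last line is an axiom schema, i.e., Leibniz's Law as the indiscernibility of identicals):

```latex
\begin{aligned}
& \forall x \; (x = x) && \text{(reflexivity)}\\
& \forall x, y \; (x = y \rightarrow y = x) && \text{(symmetry)}\\
& \forall x, y, z \; (x = y \land y = z \rightarrow x = z) && \text{(transitivity)}\\
& \forall x, y \; (x = y \rightarrow (\varphi(x) \leftrightarrow \varphi(y))) && \text{(Leibniz's Law)}
\end{aligned}
```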
Semantic interoperability can, thus, be characterized in the following way: two systems A and B semantically interoperate if the coded relations connecting the information structures of A and B: (i) preserve the semantics of the referents represented in those structures; (ii) reflect the real-world meta-properties of the represented relations; and (iii) yield a resulting information structure that constrains the possible states of the resulting system to the intended ones, i.e., to those that represent intended states of affairs according to the conceptualizations underlying A and B.
As this example illustrates, in order to safely interoperate systems A and B, we need to safely integrate the information structures of A and B. In order to do that, we need a set of conceptual tools that help us to: (i) produce ontologically consistent information structures; (ii) uncover the worldview embedded in an existing information structure; (iii) clarify the nature of the notions constituting that worldview; and (iv) calculate the relations between notions constituting different worldviews. These tasks of domain analysis, conceptual clarification and meaning negotiation are the very business of the discipline of Formal Ontology.
4. NO ONTOLOGY WITHOUT ONTOLOGY
As explained in [1], one of the essential FAIR principles is interoperability, which means to guarantee that:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation;
I2. (meta)data use vocabularies that follow FAIR principles;
I3. (meta)data include qualified references to other (meta)data.
Item I2 refers to vocabularies. However, given the characterization of semantic interoperability in the previous section, these cannot be mere vocabularies. Vocabularies are terminological resources, and the only way to safely make references to other (meta)data artifacts (i.e., the only way to satisfy item I3) is by precisely clarifying and characterizing the nature of the relations between the referents represented by these artifacts. For this reason, at the bare minimum, we need more than merely terminological resources: we need formal, shared and explicit representations of conceptualizations, or what the area of knowledge representation has conventionally called ontologies. This desideratum is reflected in item I1, which also requires the use of broadly applicable knowledge representation languages. The immediate question that comes to mind is then: which criteria must a representation language meet in order to satisfy requirements I2 and I3? The attributes described in I1 are, perhaps, necessary, but they are clearly not sufficient! For example, First-Order Logic (FOL) is, historically, the most used language for Knowledge Representation. It is also, obviously, formal, accessible, shared and broadly applicable. However, as explained in depth in [12,13], it is not a suitable language for systematically addressing requirements I2 and I3.
The reason follows directly from the definition of semantic interoperability I previously defended. In order to address these three interoperability requirements, we need a language that supports us in: (a) systematically making ontologically consistent representation choices; (b) making explicit the ontological nature of the elements represented, i.e., the ontological commitment that is being made; and (c) identifying and characterizing the nature of the relations between real-world entities represented in these different data artifacts. In other words, we need a language that is truly ontological in nature, i.e., a language that explicitly commits to a foundational ontology [14].
As previously discussed, providing formal theories for addressing these challenges is the very business of Formal Ontology. This discipline aims at developing formal theories dealing with general aspects of reality such as identity, dependence, parthood, truthmaking, causality, etc. These domain-independent theories can then be used to investigate and articulate representations of conceptualizations across all domains. A foundational ontology is a particular consistent system of such ontological theories.
FOL is not a truly ontological language in this sense. In fact, it is exactly its ontological neutrality that makes it attractive as a formalism that can be employed in a large variety of cases, ranging from the foundations of mathematics to the representation of particular domains. However, also because of its ontological neutrality, users have no support when choosing how best to represent the elements constituting a domain conceptualization (a point beautifully demonstrated in [15]); there is also no support for making explicit the ontological commitments that the user thinks he or she is making. To put it boldly, FOL allows one to represent almost everything, including the things one should not represent! As discussed by [16] almost three decades ago:
“Formal semantics of current knowledge representation languages usually account for a set of models which is much larger than the models we are interested in, i.e., real-world models. As a consequence, the possibility to state something which is reasonable for the system but not reasonable in the real world is very high. What we need, instead, is a semantics which is not neutral with respect to some basic ontological assumptions.”
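To illustrate the point, the following formula is perfectly well-formed and consistent FOL, even though it is ontologically nonsensical in the transplant domain (the predicate names are, again, merely illustrative). Nothing in the formalism itself signals that an enduring object such as a person cannot be an event such as a transplant:

```latex
\forall x \; (\mathit{Person}(x) \rightarrow \mathit{Transplant}(x))
```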
Another alleged candidate for addressing I1 is the Web Ontology Language (OWL). However, as explained in [12,13], the name actually hides a misnomer: there is no Ontology in OWL. OWL is a logical language that can be used to produce logical specifications. As such, it inherits all the problems of FOL, with the additional non-trivial aspect of having a much lower expressivity than FOL. As a result, as demonstrated in [17,18], there are many critical interoperability problems that can go undetected when integrating information structures represented in languages such as OWL.
An example of a language that is truly ontological in this sense is OntoUML [11,14]. OntoUML has been designed to conform to the Unified Foundational Ontology (UFO) such that the modeling primitives of this language reflect the ontological distinctions put forth by UFO. Moreover, the grammar of this language includes formal constraints that reflect the axiomatization of UFO, i.e., the grammatically valid models of OntoUML are those that respect the axiomatization of the ontological theories comprising UFO.
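As a rough illustration of how such grammatical constraints operate, the sketch below (hypothetical code, not the actual OntoUML toolset) checks one well-known rule of the language: every role and phase must inherit its principle of identity from exactly one kind along its specialization hierarchy.

```python
# Hypothetical sketch of one OntoUML well-formedness rule: every
# role or phase must specialize (directly or indirectly) exactly
# one kind, from which it inherits its principle of identity.

classes = {
    # name: (stereotype, direct supertypes)
    "Person":       ("kind",  []),
    "Organ":        ("kind",  []),
    "Surgeon":      ("role",  ["Person"]),
    "Donor":        ("role",  ["Person"]),
    "LivingPerson": ("phase", ["Person"]),
}

def kinds_of(name: str) -> set[str]:
    """Collect all kinds reachable from `name` via specialization."""
    stereotype, supertypes = classes[name]
    kinds = {name} if stereotype == "kind" else set()
    for supertype in supertypes:
        kinds |= kinds_of(supertype)
    return kinds

for name, (stereotype, _) in classes.items():
    if stereotype in ("role", "phase") and len(kinds_of(name)) != 1:
        print(f"Violation: {name} must specialize exactly one kind")
```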
Over the years, OntoUML has been successfully employed in academic, industrial and governmental settings to create conceptual models in a number of different domains, ranging from Geology and Biodiversity Management to Telecommunications and Bioinformatics, among many others [14]. In fact, research shows that it is among the most used ontology-driven conceptual modeling languages in the literature [19]. Moreover, empirical evidence shows that it significantly contributes to improving the quality of domain representations without requiring additional effort to produce them [20]. Finally, as shown by [19], UFO is the second-most used foundational ontology in conceptual modeling and the one with the fastest adoption rate.
By building on the ontological semantics of this language, the OntoUML community has developed several methodological and computational tools. Regarding the former, I can cite a pattern grammar for the language comprising a library of ontological design patterns and their relations [21,22], as well as a library of ontological anti-patterns [10]. Regarding the latter, I can mention computational support for pattern-based ontology construction [22,23], formal ontology verification, ontology verbalization, proactive anti-pattern detection and rectification [10], and ontology validation by visual simulation. In particular, regarding this last point, as shown in [9], this computational support allows the user to check exactly whether the states admitted by a particular information structure (a conceptual model, an ontology) are the ones representing the states of affairs intended by its underlying conceptualization. Finally, as discussed in [14], there are many proposals for model transformations mapping OntoUML models to a variety of languages, including OWL [24]. The idea, also strongly defended here, is that a truly ontological language must be used to address the requirements of domain and comprehensibility appropriateness [11], as well as semantic interoperability. From these models, however, one can then generate several operational (or codification) ontologies addressing non-functional requirements (e.g., efficient computational reasoning, executability) for different classes of applications.
In Figure 3, I present the models of Figure 2, now in OntoUML③. In the model on the left, the language makes a clear distinction between the kinds of objects that exist in this domain, namely, people and organs. Kinds are types that necessarily classify their instances, being responsible for their principle of identity, individuation and persistence [25]. The kind Organ is specialized into the subkinds Heart and Brain. Subkinds are also types that necessarily classify their instances: all hearts (brains) are necessarily hearts (brains). Instances of Person can contingently instantiate the type Living Person and the type Deceased Person. These types are mutually disjoint and they exhaust the extension of the type Person. Instances of Person can also contingently instantiate the types Surgeon, Donor and Donee. However, while people move from Living Person to Deceased Person due to a change in their intrinsic properties, they move in and out of the extension of Surgeon, Donor and Donee due to a change in their relational properties. The former contingent types are called Phases; the latter, which are defined by association with the relational context of a Transplant, are called Roles. Entities such as Transplants are existentially dependent on a multitude of individuals and, thus, connect them. These are called Relators [26]. Finally, the model can establish the distinction between a mandatory part (here, every person must have as a part an individual of the type Heart, which can change from situation to situation) and an essential part (here, every person must have as a part a specific instance of Brain, which remains the same from situation to situation) [27]. Now, in the model on the right, we have examples of these different types of ontological categories as well, with the addition of a classification type [28,29]. A classification type is a higher-order type in OntoUML, i.e., it is a type whose instances are types. In this case, its instances are different types of Transplants.
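The distinction between mandatory and essential parthood can be made precise in modal terms (a sketch following the treatment in [27]): a mandatory part involves generic dependence on some instance of a type, while an essential part involves specific dependence on one particular individual:

```latex
\begin{aligned}
& \Box\, \forall x \; (\mathit{Person}(x) \rightarrow \exists y \;
    (\mathit{Heart}(y) \land \mathit{partOf}(y, x)))
    && \text{(mandatory part)}\\
& \forall x \; (\mathit{Person}(x) \rightarrow \exists y \; (\mathit{Brain}(y) \land
    \Box\, (\mathit{exists}(x) \rightarrow \mathit{partOf}(y, x))))
    && \text{(essential part)}
\end{aligned}
```

In the first formula, the heart that a person has may differ across situations; in the second, the very same brain must be part of the person in every situation in which that person exists.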
By using a language that makes explicit the ontological categories to which each of these types and relations belong, we can clearly analyze the connection between the categories in these two models. For example, Person in system A (Person-A) cannot be identical to Person in system B (Person-B), because the former is a kind of entity and the latter is a phase of Human Beings (suppose, for example, that, legally speaking, Human Beings that lose their cognitive capacities are no longer instances of Person). In fact, Person-A is identical to Human Being in system B and, hence, the relation between Person-A and Person-B is not one of identity but one of generalization. Analogously, the relation between Transplant-A and Transplant-B is not one of identity. The instances of Transplant-A are individual transplants that occur at particular times and places; the instances of Transplant-B are types of transplant. So the relation between them is one of instantiation, i.e., the instances of Transplant-A are instances of instances of Transplant in the sense of system B (Transplant-B)!
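These findings can be summarized formally (a sketch, with :: denoting instantiation and type names subscripted by system):

```latex
\begin{aligned}
& \forall x \; (\mathit{Person}_B(x) \rightarrow \mathit{Person}_A(x))
    && \text{(generalization, not identity)}\\
& \forall x \; (\mathit{Transplant}_A(x) \rightarrow \exists t \;
    (t :: \mathit{Transplant}_B \land x :: t))
    && \text{(instantiation, not identity)}
\end{aligned}
```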
In summary, the “I” (interoperability) of FAIR is only possible with the support of information structures that are ontologically consistent and that make explicit the ontological commitments that they inevitably make. We need more than vocabularies. We need good domain ontologies. To construct these ontologies, however, we need engineering support based on Ontology, as a discipline with its numerous and mature theories and methods. Or, as beautifully put by the philosopher Achille Varzi: “No ontology without Ontology”! [30]
Notes
① In the FAIR literature, authors commonly speak of semantic interoperability involving interoperation between humans, between humans and machines, and between machines. In the spirit of the semiotic engineering literature [2], I defend here that, with the possible exception of a scenario in which by “machine” we mean strong artificial intelligence (AI), semantic interoperability is always about interoperation with meaning preservation between humans, even in the cases in which these are mediated by machines and information structures. In Section 3, I elaborate on the notion of semantic interoperability defended in this paper.
② Assuming that this is indeed an unintended state of affairs in this domain.