The Semantic Data Dictionary – An Approach for Describing and Annotating Data

It is common practice for data providers to include text descriptions for each column when publishing data sets in the form of data dictionaries. While these documents are useful in helping an end-user properly interpret the meaning of a column in a data set, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation of data, enabling standardization and harmonization across diverse data sets. In this paper, we present our Semantic Data Dictionary work in the context of our work with biomedical data; however, the approach can and has been used in a wide range of domains. The rendition of data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large National Institutes of Health (NIH)-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.


INTRODUCTION
With the rapid growth of data-driven applications and data science research over the past decade, data providers and users alike have relied on data sets as a means for recording and accessing information from a variety of distinct domains. Data sets are composed of distinct structures that require additional information to help users understand the meaning of the data. A common approach used by data providers involves providing descriptive information for a data set in the form of a data dictionary, defined as a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format" [1]. Data dictionaries are useful for many data management tasks, including aiding users in data conversion processes, testing data generation, validating data, and storing data usage criteria [2].
When storing data in a system that adheres to the structure of a particular data dictionary, that document can be used to aid in validation both when inputting new data into the system and when updating existing data. By including additional information about a data set itself, data dictionaries can be used to store data usage criteria. Additionally, data conversion is aided by the inclusion of the format and units of the data points, which allows users to apply conversion formulae to convert the data into another format or unit. When considering these benefits, we see that the use of data dictionaries has had a significant impact on data use and reuse. Nevertheless, we argue that data dictionaries can be improved by leveraging emerging Semantic Web technologies.
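The validation and conversion roles described above can be illustrated with a minimal sketch. The column names, value ranges, category codes, and conversion table below are hypothetical and are not drawn from any particular data set; the sketch only shows how the format, unit, and code information recorded in a traditional data dictionary could be applied programmatically.

```python
# Hypothetical data dictionary: column names, types, units, ranges, and codes
# are invented for illustration and do not come from any real data set.
DATA_DICTIONARY = {
    "wt": {"label": "Body weight", "type": float, "unit": "kg",
           "min": 0.0, "max": 500.0},
    "sex": {"label": "Sex of participant", "type": int,
            "codes": {1: "Male", 2: "Female"}},
}

# Conversion factors keyed by (from_unit, to_unit); only what the sketch needs.
CONVERSIONS = {("kg", "lb"): 2.20462, ("lb", "kg"): 1 / 2.20462}


def validate_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for column, spec in dictionary.items():
        if column not in row:
            problems.append(f"{column}: missing")
            continue
        value = row[column]
        if not isinstance(value, spec["type"]):
            problems.append(f"{column}: expected {spec['type'].__name__}")
            continue
        if "codes" in spec and value not in spec["codes"]:
            problems.append(f"{column}: unknown code {value!r}")
        if "min" in spec and not spec["min"] <= value <= spec["max"]:
            problems.append(f"{column}: {value} outside documented range")
    return problems


def convert(value, from_unit, to_unit):
    """Convert a value using the unit recorded in the data dictionary."""
    if from_unit == to_unit:
        return value
    return value * CONVERSIONS[(from_unit, to_unit)]
```

Note that the dictionary in this sketch is still only a structural description: nothing in it states what kind of thing a weight is or whom it belongs to, which is the gap that semantic annotation is meant to fill.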
The use of data dictionaries to record descriptions about data sets and their elements has become widely adopted by data providers, often with the intent of aiding reusability. These data dictionaries are useful to data users in reducing ambiguity when interpreting data set content. Considering the structure and annotations that traditional data dictionaries are comprised of, we find that for each column header in a data set, these documents often contain a label that is more informative than the column name, as well as a comment describing the column header. Such annotations in themselves are essential for an end-user to understand the data, as column names are often arbitrary or encoded. Existing data dictionaries often contain structural information about a data set column, such as the format of the data, the data type, or the associated units of measurement. As this information is required for the proper analysis of data, we commend data providers for including it in their data dictionaries. For data sets that contain categorical codes, data providers have done well to document the possible values and include descriptive labels for each category.
While many publicly available data sets include documents resembling data dictionaries, we find that, across institutions, these documents do not adhere to a common metadata standard. Metadata, defined as "structured data about data" [3], should be able to be processed using software. Existing data dictionary standards typically are aimed at human consumption and do not subscribe to models that are machine-understandable, and thus lack support for formal semantics. Consequently, tasks involving the combination of data from multiple data sets that are described using data dictionaries are not easily automated.

A Need for Semantics
From the data set production perspective, data sets can convey much more information than the data themselves. Data set entries often correspond to physical observations, such as the weight of a sample, an event duration, or a person's gender. Traditional data dictionaries do well in describing these measurements but cannot represent the measured objects. There is a need to annotate these implicit concepts (representing the measured objects) that are indispensable to a complete understanding of the data but do not correspond to columns in the data set. Annotations of both explicit and implicit concepts allow for the conversion of a tabular format of data into a semantically richer graphical representation.
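As a minimal sketch of this conversion, consider a single record with a weight column. The `ex:` terms, the record layout, and the function name are hypothetical placeholders rather than terms from any real ontology or from the SDD specification. The point is only that the person being weighed has no column of their own, yet must appear in the graph for the weight value to be interpretable.

```python
def row_to_triples(row):
    """Turn one tabular record into (subject, predicate, object) triples.

    The person is an implicit concept: no column holds it, but the weight
    column is meaningless without something for the weight to belong to.
    All 'ex:' terms are hypothetical placeholders.
    """
    person = f"ex:person-{row['id']}"   # implicit entity (no column)
    weight = f"ex:weight-{row['id']}"   # explicit attribute (a column)
    return [
        (person, "rdf:type", "ex:Person"),
        (weight, "rdf:type", "ex:Weight"),
        (weight, "ex:isAttributeOf", person),
        (weight, "ex:hasValue", row["weight_kg"]),
        (weight, "ex:hasUnit", "ex:kilogram"),
    ]


triples = row_to_triples({"id": "P001", "weight_kg": 70.5})
```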
There may be a variety of ways that a data user can benefit from a semantic representation of data, such as enhanced provenance attributions, query capabilities, and the ability to infer new knowledge. We argue for the applicability of the Semantic Data Dictionary (SDD) as a standard model for representing machine-readable metadata for data sets. The SDD comprises a set of specifications formalizing the assignment of a semantic representation to data by annotating data set columns and their values using concepts from best practice vocabularies and ontologies. It is a collection of individual documents, where each plays a role in creating a concise and consistent knowledge representation. Each of these components, described in Section 3, is implemented using tables. In Appendix B, we provide the specifications for each of the SDD tables. Throughout the remainder of this article, we describe modeling methods, include informative examples from projects employing this approach, discuss modeling challenges, and evaluate our approach against traditional data dictionaries, mapping languages, and data integration tools.
As science moves towards a more open approach, priority has been given to publishing scientific data in a way that is Findable, Accessible, Interoperable, and Reusable (FAIR) [4]. The FAIR principles are used to evaluate the quality of published data sets or the workflow that is used to produce data. As part of our approach to evaluating our methodology, we examine adherence to the FAIR guiding principles. While we have considered these guidelines in designing our approach, and they have been adopted for many projects, the FAIR principles are not without limitations. For example, methods for the facilitation of data sharing are not specified, which may result in error perpetuation from differing interpretations of design choices, and more vigorous privacy concerns need to be addressed [5]. The use of the FAIR guidelines and traditional data integration approaches alone does not guarantee enough granularity of representation to support the pooling of data across studies, thereby limiting the potential impact for more significant statistical analyses. However, this capability has been demonstrated using the SDD approach for the Children's Health Exposure Analysis Resource (CHEAR) project [6].

Motivation
In Section 2.1, we consider institutions that provide guidelines for the use of data dictionaries to record descriptive content for a data set. While existing guidelines have helped create human-understandable documents, we believe that there is room for improvement by introducing a formalization that is machine-readable. With the current advances in Artificial Intelligence technologies, there is an increased need for data users to have annotated data that adhere to Semantic Web standards [10,11]. We consider the benefits of combining data from disparate sources in such a way that it can be used in a unified manner. Harmonization across data sets allows for the comparison between similar columns, using a controlled vocabulary. The ability to combine data from various sources and formats into a single cohesive knowledge base allows for the implementation of innovative applications, such as faceted browsers or data visualizers.
Data and provenance understanding refer respectively to data interpretability and the ability to discern provenance attributions, both by humans and machines. This level of knowledge is necessary for the reuse of data and the reproduction of scientific experiments. Annotation of data improves query and integration capabilities [12], and the use of Semantic Web standards enhances the ability to find the data through a Web search [13]. Unfortunately, it is difficult for data users, who have a second-hand understanding of the data compared to data providers, to create these annotations themselves. As an example, a study related to data dissemination revealed that three researchers, independently analyzing a single data set and using similar approaches, arrived at noticeably dissimilar interpretive conclusions [14]. Additionally, difficulties arise for someone without a technology background to develop competence in technical approaches, due to challenges associated with technological semantics, such as research problems being defined, clarified, and communicated in a way that is perceptible by a general audience [15]. Therefore, the desire to create a standard for people from a wide variety of domains, including those who are untrained in Computer Science and semantic technologies, is an additional motivation. Easing the semantic annotation process for these users is a significant challenge. A machine-readable standard for data set metadata can improve data harmonization, integration, reuse, and reproducibility.
https://github.com/tetherless-world/SemanticDataDictionary

Claims
We claim that the formalism of the Semantic Data Dictionary addresses some of the limitations of existing data dictionary approaches. Traditional data dictionaries provide descriptions about the columns of a data set, which typically represent physical measurements or characteristics, but omit details about the described entities. Existing data dictionaries do not acknowledge the notion that the data values are instances of concepts that may have relationships with other instances of concepts, such as entity-entity, attribute-attribute, or entity-attribute relations.
In contrast, the SDD approach allows for the direct annotation of concepts implicitly referenced in a data set. Existing data dictionaries focus on the structure of the data rather than the inherent meaning, including value ranges, formats, and data types. Further information about the data, including the units, meaning, and associated objects, is provided in text descriptions that are not machine-interpretable. The SDD, on the other hand, focuses on the semantics of the data and includes the above information in a way that is readily able to be processed. The SDD consists of an intrinsic model with relationships that can be further customized, allowing the annotator to describe relationships between both explicit and implicit concepts inherent in the data set. By considering these characteristics of SDDs, we argue that a standardized machine-readable representation for recording data set metadata and column information is achieved.
We also claim that the SDD approach presents a level of abstraction over methodologies that use mapping languages. This is achieved by simplifying the programming knowledge requirements by separating the annotation portion of the approach from the software component. As a result, the SDD approach improves the ease of use for a domain scientist over other semantic tools. Additionally, by presenting the annotation component in a form that resembles traditional data dictionaries, this approach provides a bridge between the conventional data dictionary approaches, used by domain scientists, and the formal techniques used by Semantic Web researchers.

RELATED WORK
The SDD approach leverages state-of-the-art advancements in many data and knowledge related areas: traditional data dictionaries, data integration, mapping languages, semantic extract-transform-load (ETL) methods, and metadata standards. In this section, we present related work in each of those extensive areas by highlighting their accomplishments and discussing their limitations.

Data Dictionaries
There are several patents relating to the use of dictionaries to organize metadata [16,17,18]. However, published articles mentioning data dictionaries tend to refrain from including the associated formalism. Thus, we expanded our scope to search for data dictionaries that included standards published on the Web, several of which are discussed below.
The Stony Brook Data Governance Council recommendations list required elements and present principles associated with data dictionaries. However, they do not support semantic representation of the data. Additionally, while data columns can be explicitly described, this approach does not allow the description of implicit concepts referenced by the data set, a capability we refer to as object elicitation. The ability to annotate implicit concepts (described in Section 3.2) is one of the distinguishing features of our work. The Open Science Framework and the United States Government (USG) Statistical Community of Practice and Engagement (SCOPE) also guide the creation of a data dictionary that includes required, recommended, and optional entries. These data dictionaries support the specification of data types and categorical values, but only minimally allow for the incorporation of semantics and do not leverage existing ontologies or vocabularies. The data dictionary specifications for the Biosystematic Database of World Diptera include both general and domain-specific elements [19]. Nevertheless, use of this data dictionary outside of the biological domain appears improbable. Based on the Data Catalog Vocabulary (DCAT [20]), the Project Open Data Metadata Schema provides a data dictionary specification. Of the data dictionary recommendations examined, the Project Open Data Metadata Schema was the most general and the only one to use Semantic Web standards.
There are many recommendations for constructing data dictionaries; however, we found that most are project- or domain-specific, and we find no clear evidence that they are consistently applied by users outside of these individual groups. The exploration of these data dictionaries reveals the need for a standard formalization that can be used across institutions and projects.

Data Integration Approaches
Data integration is a technique that utilizes data from multiple sources to construct a unified view of the combined data [21]. Here we consider existing approaches that have been employed to address data integration challenges.
The Semantic Web Integration Tool (SWIT) can be used to perform transformation and integration of heterogeneous data through a Web interface in a manner that adheres to the Linked Open Data (LOD) principles [22]. While the writing of mapping rules is simplified through the use of a Web interface, the use of this approach may still prove difficult for users without a Semantic Web background. Neo4j is designed as a graph database (GDB) system that supports data integration based on the labeled property graph (LPG) model, which consists of attributed nodes with directed and labeled edges [23]. Despite being implemented using an LPG model rather than Resource Description Framework (RDF), Neo4j can read and write RDF, and by using GraphScale [24], it can further employ reasoning capabilities [25]. Nevertheless, data integration capabilities, such as using ontologies to semantically annotate data schema concepts and the associated objects, are limited.
To provide an integrated view of data collected on moving entities in geographical locations, RDF-Gen was developed as a means of SPARQL-based knowledge graph generation from heterogeneous streaming and archival data sources [26]. While this approach is promising and does support the representation of implicit objects, we find that the requirement of creating SPARQL-based graph transformation mappings would likely make it difficult for domain scientists to use. DataOps is an integration toolkit that supports the combination of data in varying formats, including relational databases, Comma Separated Value (CSV), Excel, and others, which can be accessed via R [27]. While existing user interface components can be used to ease the annotation process and the use of DataOps in industry is expanding, the expertise required to use this approach presents a steep learning curve. OpenRefine is a standalone, open-source tool capable of cleaning and transforming large data sets [28]. Some limitations of this approach pertain to difficulties in performing subset selection, cell-based operations, and data set merging.
It is important to note that most data integration approaches fall short when eliciting objects and relations to comprehensively characterize the semantics of the data. We continue this discussion on data integration by considering mapping languages and semantic extract-transform-load (ETL) applications.

Mapping Languages
In this section, we introduce mapping languages that can be used to convert a relational database (RDB), tabular file, or hierarchical structure to an RDF format and their related tool support.
The RDB to RDF Mapping Language (R2RML) is a W3C standard language for expressing mappings from relational databases to RDF data sets [29]. R2RML mappings contain properties to define the components of the mapping, including the source table, columns retrieved using SQL queries, relationships between columns, and a template for the desired output Uniform Resource Identifier (URI) structure. The R2RML limitations stem from the requirement of writing the mapping using RDF format, the need to be familiar with the R2RML vocabulary to write mappings, and the support for only relational databases. R2RML extensions exist to address these limitations. The RDF Mapping Language (RML) extends the R2RML vocabulary to support a broader set of possible input data formats, including CSV, XML, and JSON [30]. In this regard, RML extends the R2RML logical table class to be instead defined as a logical source, which allows the user to specify the source URI, reference, reference formulation, and iterator. RML is supported by a tool to define mappings called the RMLEditor, which allows users to make edits to heterogeneous data source mappings using a graphical user interface (GUI) [31]. Both R2RML and RML are robust and provide a solid cornerstone for general RDF generation from tabular data. Still, they fall short when dealing with some particularities of our problem scenario, including the creation of implicit relationships for elicited objects and the annotation of categorical data values. The xR2RML language leverages RML to expand the R2RML vocabulary to support several additional RDF data formats as well as the mapping of nonrelational databases [32]. With the use of R2RML mappings, the OpenLink Virtuoso Universal Server has an extension to import relational databases or CSV files that can then be transformed into RDF [33].
Due to the usage requirement of a mapping language to specify graph transformations, a domain scientist may be reluctant to employ the above approaches.
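To make the notion of an output URI template concrete, the following sketch approximates the substitution step in plain Python. R2RML templates reference columns with `{COLUMN}` placeholders; the function name, the example row, and the `example.org` base URI are hypothetical, and `urllib.parse.quote` is used here only as an approximation of the percent-encoding that a conforming R2RML processor would apply. It is an illustration of the idea, not an R2RML implementation.

```python
import re
from urllib.parse import quote


def expand_template(template, row):
    """Fill {COLUMN} placeholders with percent-encoded values from one row."""
    def substitute(match):
        column = match.group(1)
        # safe="" so that characters such as '/' inside a value are encoded too
        return quote(str(row[column]), safe="")
    return re.sub(r"\{([^{}]+)\}", substitute, template)


# Hypothetical row and template, invented for illustration.
row = {"ID": 7, "NAME": "Ada Lovelace"}
uri = expand_template("http://example.org/person/{ID}/{NAME}", row)
```

Even this single step presupposes that the annotator knows the source column names and the target URI design, which hints at why writing complete mappings can be a barrier for domain scientists.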
KR2RML is an extension to R2RML addressing several of its limitations, including support for multiple input and output data formats, new serialization formats, transformations and modeling that do not rely on knowledge about domain-specific languages, and scalability when handling large amounts of data [34]. KR2RML is implemented in an open-source application called Karma. Karma is a system that uses semantics to integrate data by allowing users to import data from a variety of sources, clean and normalize the data, and create semantic descriptions for each of the data sources used [35]. Karma includes a visual interface that helps automate parts of the modeling process by suggesting proposed mappings based on semantic type assignments, and hence reduces some of the usage barriers associated with other mapping language methodologies. Nevertheless, some distinguishing factors between this and our approach include the following: when using the SDD approach, there is no need to write mapping transformation rules, and through the use of the Codebook (described in Section 3.3), the SDD approach supports cell value annotation.
CSV2RDF is a W3C standard for converting tabular data into RDF [36]. Introduced to address the limitation of R2RML that only relational data could be mapped, CSV2RDF extends R2RML to allow the mapping of additional structured data formats, such as CSV, TSV, XML and JSON [37]. The applicability of CSV2RDF for converting large amounts of data has been demonstrated using publicly available resources from a data portal [38]. CSV2RDF has also been used in an approach to automatically convert tabular data to RDF [39].
The Sparqlification Mapping Language (SML) progresses towards a formal model for RDB2RDF mappings, maintaining the same expressiveness as R2RML while simplifying usage by providing a more concise syntax, achieved by combining traditional SQL CREATE VIEW statements with SPARQL CONSTRUCT queries [40]. SML is intended to be a more human-readable mapping language than R2RML. The R2R Mapping Language, also based on SPARQL, is designed for writing data set mappings represented as RDF using "dereferenceable" URIs [41]. While it is possible for the user to specify metadata about each mapping, the possible mappings that can be specified correspond to direct translations between the data and the vocabulary being used, rather than allowing for detailed object elicitation.
Another mapping language based on SPARQL is Tarql, where databases are referenced in FROM clauses, mappings can be specified using a SELECT or ASK clause, and RDF can be generated using a CONSTRUCT clause [42]. One limitation of this approach is that it uses SPARQL notation for tasks that were not originally intended by the grammar, rather than extending SPARQL with additional keywords. The D2RQ mapping language, which allows for querying on mapped databases using SPARQL, is a declarative language that allows for querying through the use of the RDF Data Query Language (RDQL), publication of a database on the Semantic Web with the RDF Net API, reasoning over database content using the Jena ontology API, and accessing database information through the Jena model API [43]. D2RQ is limited in its integration capabilities over multiple databases, in its support for write operations such as CREATE, DELETE, or UPDATE, and in its support for Named Graphs [44].
While many of the mapping languages above focus on the conversion of RDBs to knowledge graphs, RDB2OWL is a high-level declarative RDB-to-RDF/OWL mapping language used to generate ontologies from RDBs [45]. It is achieved by mapping the target ontology to the database structure. RDB2OWL supports the reuse of RDB table column and key information, includes an intuitive human-readable syntax for mapping expressions, allows for both built-in and user-defined functions, incorporates advanced mapping definition primitives, and allows for the utilization of auxiliary structures defined at the SQL level [45].
In addition to the difficulties associated with writing mapping transformations, we find that mapping-language-based methodologies have limited object and relation elicitation capabilities, and cell value annotation is typically not permitted. These limitations are addressed in the SDD approach.

Semantic Extract-Transform-Load
The extract-transform-load (ETL) operations refer to processes that read data from a source database, convert the data into another format, and write the data into a target database. In this section, we examine several ETL approaches that leverage semantic technologies. LinkedPipes ETL (LP-ETL) is a lightweight, linked data preparation tool supporting SPARQL queries, including debug capabilities, and can be integrated into external platforms [46]. LP-ETL contains both back-end software for performing data transformations, as well as a front-end Web application that includes a pipeline editor and an execution monitor. A pipeline is defined as "a repeatable data transformation process consisting of configurable components, each responsible for an atomic data transformation task" [46]. As transformations in this approach are typically written as SPARQL CONSTRUCT statements, this methodology would be difficult to employ for someone who is unfamiliar with SPARQL. Semantic extract-transform-load-er (SETLr) is a scalable tool that uses the JSON-LD Template (JSLDT) language for the creation of RDF from a variety of data formats [47]. This approach permits the inclusion of conditionals and loops (written in JSLDT) within the mapping file, allowing for the transformation process to iterate through the input data in interesting ways. Nevertheless, there may be a steep learning curve for researchers without a programming background to adopt this approach.
Eureka! Clinical Analytics is a Web application that performs ETL on Excel spreadsheets containing phenotype data [48]. Since this application was designed for use on clinical projects, it cannot easily be generalized for use in domains outside of biomedicine. The open-source Linked Data Integration Framework (LDIF) leverages Linked Data to provide both data translation and identity resolution capabilities [49]. LDIF uses runtime environments to manage data flow between a set of pluggable modules that correspond to data access, transformation, and output components. Improvements in the framework extended the importer capabilities to accept input in the form of RDF/XML, N-Triples, and Turtle, to import data by crawling RDF links through the use of LDspider, and to replicate data through SPARQL CONSTRUCT queries [50]. One limitation of LDIF is that the runtime environment that supports RDF is slower than the in-memory and cluster environment implementations, which do not support RDF. Other approaches use existing semantic technologies to perform ETL [51,52,53]. These approaches, however, have a similar hurdle for adoption, in that they are often perceived as challenging by those unfamiliar with Semantic Web vocabularies and standards. SDDs provide a means of performing Semantic ETL without requiring the writing of complex transformation scripts.

Metadata Standards
The collection of SDD specifications that we discuss in Section 3 serves to provide a standard guideline for semantically recording the metadata associated with the data set being annotated. In this section, we examine existing metadata standards for describing data that incorporate semantics. The ISO/IEC 11179 standard includes several components, including the (1) framework, (2) conceptual model for managing classification schemes, (3) registry metamodel and basic attributes, (4) formulation of data definitions, (5) naming and identification principles, (6) registration instructions, and (7) registry specification for data sets. This standard is intended to address the semantics, representation, and registration of data. Nevertheless, a limitation of ISO/IEC 11179 is that it mainly focuses on the lifecycle management of the metadata describing data elements rather than of events associated with the data values [54]. The Cancer Data Standards Repository (caDSR) implements the ISO/IEC 11179 standard to organize a set of common data elements (CDEs) used in cancer research [55]. The Clinical Data Interchange Standards Consortium (CDISC) has produced several Unified Modeling Language (UML) models that provide schemas for expressing clinical data for research purposes [56]. However, as these schemas are based on the Health Level 7 (HL7) reference implementation model (RIM), which focuses on representing information records instead of things in the world, semantic concepts are used as codes that tag records rather than to provide types for entities.

THE SEMANTIC DATA DICTIONARY
The Semantic Data Dictionary approach provides a way to create semantic annotations for the columns in a data set, as well as for categorical or coded cell values. This is achieved by encoding mappings to terms in an appropriate ontology or set of ontologies, resulting in an aggregation of knowledge formed into a graphical representation. A well-formed SDD contains information about the objects and attributes represented or referred to by each column in a data set, utilizing the relevant ontology URIs to convey this information in a manner that is both machine-readable and unambiguous.
The main outputs of interpreting SDDs are RDF graphs that we refer to as knowledge graph fragments, since they can be included as part of a larger knowledge graph. Knowledge graphs, or structured graph-based representations that encode information, are variably defined but often share a common set of characteristics: (i) real world entities and their interrelations are described, (ii) classes and relations of entities are defined, (iii) interrelating of entities is allowed, and (iv) diverse domains are able to be covered [57]. We have published a number of SDD resources, such as tutorials, documentation, complete examples, and the resulting knowledge graph fragments. Full sets of annotated SDDs for several public data sets are also available.
To ease adoption, we implement the SDD as a collection of tables that can be written as Excel spreadsheets or as CSV files, organized into several components that modularize the annotation process. We introduce the components here and go into further detail on each throughout the remainder of this section. A document called the Infosheet is used to specify the location of each of the SDD component tables. Furthermore, the user can record descriptive metadata about the data set or SDD in this document. The Dictionary Mapping (DM) is used to specify mappings for the columns in the data set that is being annotated. If only this component is included with the SDD, an interpreter can still be used to convert the data into an RDF representation. Therefore, we focus the majority of our discussion in this section on the DM table. We also briefly describe the remaining SDD components that allow for richer annotation capabilities and ease the annotation process. The Codebook is used to interpret categorical cell values, allowing the user to assign mappings for data points in addition to just the column headers.
The Code Mapping table is used to specify shorthand notations to help streamline the annotation process. For example, the user can specify 'mm' to be the shorthand notation for uo:0000016  , the class in the Units of Measurement Ontology (UO [58]) for millimeter. The Timeline table is used to include detailed annotations for events or time intervals. Finally, the Properties table allows the user to specify custom predicates employed during the mapping process. We use SmallCaps font when referring to columns in an SDD table and italics when referring to properties from ontologies. Further information on the SDD modeling process is available on the SDD documentation website  .
https://tetherless-world.github.io/sdd/resources
A listing of ontology prefixes used in this article is provided in Appendix

Infosheet
To organize the collection of tables in the SDD, we use the Infosheet (Appendix Table B.1), which contains location references for the Dictionary Mapping, Code Mapping, Timeline, Codebook, and Properties tables. The Infosheet allows for the use of absolute, relative, or Web resource locations. In addition to location references, the Infosheet is used to include supplemental metadata (Appendix Table B.2) associated with the SDD, such as a title, version information, description, or keywords. In this regard, the Infosheet serves as a configuration document, weaving together each of the individual pieces of the Semantic Data Dictionary and storing the associated data set-level metadata.
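As a rough illustration of the three kinds of location reference mentioned above, the following sketch classifies a reference string as absolute, relative, or a Web resource. The helper name `classify_location` and the example paths are hypothetical; an actual SDD interpreter may resolve locations differently.

```python
import ntpath
import posixpath
from urllib.parse import urlparse


def classify_location(ref: str) -> str:
    """Classify an Infosheet location reference (hypothetical helper)."""
    # Web resources are recognized by an http(s) scheme.
    if urlparse(ref).scheme in ("http", "https"):
        return "web"
    # Accept both POSIX-style and Windows-style absolute paths.
    if posixpath.isabs(ref) or ntpath.isabs(ref):
        return "absolute"
    return "relative"


# Hypothetical Infosheet rows: table name -> location reference.
infosheet = {
    "Dictionary Mapping": "sdd/dm.csv",
    "Codebook": "/data/sdd/codebook.csv",
    "Code Mapping": "https://example.org/sdd/code_mappings.csv",
}
kinds = {table: classify_location(ref) for table, ref in infosheet.items()}
```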
The properties that are included support distribution-level data set descriptions based on the Health Care and the Life Sciences (HCLS) standards 11, as well as the Data on the Web Best Practices (DWBP) 12. The HCLS standards contain a set of metadata concepts that should be used to describe data set attributes. While the resulting document was developed by stakeholders working in health-related domains, the properties included are general enough to be used for data sets in any domain. The DWBP were developed by a working group to better foster communications between data publishers and users, improve data management consistency, and promote data trust and reuse. The associated document lists 35 best practices that should be followed when publishing data on the Web, each of which includes an explanation for why the practice is relevant, the intended outcome, possible implementation and testing strategies, and potential benefits of applying the practice.
In Section 4, we provide an example of using the SDD approach to annotate the National Health and Nutrition Examination Survey (NHANES). An example Infosheet for the demographics table of this data set is provided in Appendix Table C.1.

Dictionary Mapping
The Dictionary Mapping (DM) table includes a row for each column in the data set being annotated (referred to as explicit entries), and columns corresponding to specific annotation elements, such as the type of the data (ATTRIBUTE, ENTITY) 13, label (LABEL), unit (UNIT), format (FORMAT), time point (TIME), relations to other data columns (INRELATIONTO, RELATION), and provenance information (WASDERIVEDFROM, WASGENERATEDBY). Figure 1 shows the conceptual diagram of the DM. Such a representation is similar to the structure of general science ontologies, such as the Semanticscience Integrated Ontology (SIO) [59] or the Human-Aware Science Ontology (HAScO) [60]. We use SIO properties for the mapping of many of the DM columns, as shown in the Dictionary Mapping specification in Appendix Table B.3, while also leveraging the PROV-O ontology [61] to capture provenance information. Despite specifying this default set of mappings, we note that the Properties table of the SDD can be used to determine the set of predicates used in the mapping process, allowing the user to customize the foundational representation model.
11 https://www.w3.org/TR/hcls-dataset/
12 https://www.w3.org/TR/dwbp/
13 When referencing columns from any of the SDD tables, the Small Caps typeface is used.

In addition to allowing for the semantic annotation of data set columns, unlike traditional mapping approaches, the SDD supports the annotation of implicit concepts referenced by the data. These concepts, referred to as implicit entries, are typically used to represent the measured entity or the time of measurement. For example, for a column in a data set for a subject's age, the concept of age is explicitly included, while the idea that the age belongs to a human subject is implicit. These implicit entries can then be described to have a type, a role, relationships, and provenance information in the same manner as the explicit entries. For example, to represent the subject that had their age measured, we could create an implicit entry, ??subject  14 .
Figure 1. A conceptual diagram of the Dictionary Mapping that allows for a representation model that aligns with existing scientific ontologies. The Dictionary Mapping is used to create a semantic representation of data columns. Each box, along with the "Relation" label, corresponds to a column in the Dictionary Mapping table. Blue rounded boxes correspond to columns that contain resource URIs, while white boxes refer to entities that are generated on a per-row/column basis. The actual cell value in concrete columns is, if there is no Codebook for the column, mapped to the "has value" object of the column object, which is generally either an attribute or an entity.

Attributes and Entities
ATTRIBUTE and ENTITY are included in the DM to allow for the type assignment of an entry. While both of these columns map to the property rdf:type  15 , they are both included as it may be semantically significant to distinguish between characteristics and objects. If an entry describes a characteristic, ATTRIBUTE should be populated with an appropriate ontology class. The entity that contains the characteristic described, which can be either explicit or implicit, should be referenced in ATTRIBUTEOF. While columns in a data set typically describe an observed characteristic, this is not always the case. If an entry describes an object, such as a person, place, thing, or event, ENTITY should be populated with an appropriate ontology class.
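To make the interplay of explicit and implicit entries concrete, the sketch below turns two hypothetical DM rows (an explicit age column and the implicit ??subject it describes) into triples for one data row. It is an illustration under stated assumptions rather than the behavior of any particular SDD interpreter: the class names, the `ex:` node naming scheme, and the use of sio:isAttributeOf for ATTRIBUTEOF are assumptions, while the rdf:type and "has value" mappings follow the text and the Figure 1 caption.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical DM rows: one explicit entry (the "age" column) and the
# implicit entry it refers to. Class names are illustrative only.
DM_ROWS: List[Dict[str, str]] = [
    {"Column": "age", "Attribute": "sio:Age", "attributeOf": "??subject"},
    {"Column": "??subject", "Entity": "sio:Human"},
]


def node_id(name: str, row_key: str) -> str:
    """Mint one node per DM entry and data row (assumed naming scheme)."""
    return f"ex:{name.lstrip('?')}-{row_key}"


def dm_to_triples(dm_rows: List[Dict[str, str]],
                  record: Dict[str, str], row_key: str) -> List[Triple]:
    """Apply the DM rows to a single data record."""
    triples: List[Triple] = []
    for entry in dm_rows:
        subject = node_id(entry["Column"], row_key)
        # ATTRIBUTE and ENTITY both map to rdf:type by default.
        for type_column in ("Attribute", "Entity"):
            if entry.get(type_column):
                triples.append((subject, "rdf:type", entry[type_column]))
        # ATTRIBUTEOF links a characteristic to the entity that bears it.
        if entry.get("attributeOf"):
            owner = node_id(entry["attributeOf"], row_key)
            triples.append((subject, "sio:isAttributeOf", owner))
        # Only explicit entries have a cell value in the data record.
        if entry["Column"] in record:
            value = record[entry["Column"]]
            triples.append((subject, "sio:hasValue", f'"{value}"'))
    return triples


triples = dm_to_triples(DM_ROWS, {"age": "34"}, "row1")
```

Because the implicit entry has no column in the data, it receives a type but no value; it exists only as a connection point for the attributes that refer to it.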

Annotation Properties and Provenance
A set of annotation properties, including comments, labels, or definitions, allows for the description of an explicit or implicit entry in further detail. While LABEL is the only column included in the DM Specification for an annotation property, if support for comments and definitions is included in an SDD interpreter, we recommend the use of the rdfs:comment and skos:definition predicates, respectively. In terms of including provenance, WASDERIVEDFROM can be used to reference pre-existing entities that are relevant in the construction of the entry, and WASGENERATEDBY can be used to describe the generation activity associated with the entry.
14 When including implicit entries in an SDD table, the prefix "??" is used as a distinguishing labeling feature. The typewriter typeface is used in this article when referring to instances of implicit entries.
15 The italics typeface is used when a property from an ontology is mentioned.

Additional Dictionary Mapping Columns
The ROLE, RELATION, and INRELATIONTO columns of the DM are used to specify roles and relationships associated with entries. A reference to objects or attributes an entry is related to should be populated in INRELATIONTO. By populating ROLE, the sio:hasRole property is used to assign the specified role to the entry. Custom relationships using properties that are not included in the SDD can be specified using RELATION. Events in the form of time instances or intervals associated with an entry should be referenced in TIME.
The unit of measurement of the data value can be specified in UNIT. In general, we recommend the use of concepts in the Units of Measurement Ontology (UO) for the annotation of units, as many existing vocabularies in various domains leverage this ontology. A W3C XML Schema Definition Language (XSD) primitive data type  16 can be included in FORMAT to specify the data type associated with the data value.
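The effect of UNIT and FORMAT on a single measurement can be sketched as follows. This is a simplified illustration, not the published mapping: it attaches the unit term directly with sio:hasUnit (the Timeline formalism instead introduces an instance of the unit class), and the function name and node identifier are hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


def value_triples(node: str, value: str,
                  unit: Optional[str] = None,
                  xsd_type: Optional[str] = None) -> List[Triple]:
    """Emit a measurement value, typed by FORMAT, together with its UNIT."""
    literal = f'"{value}"'
    if xsd_type:
        # Turtle-style typed literal, e.g. "12.5"^^xsd:decimal
        literal += f"^^{xsd_type}"
    triples: List[Triple] = [(node, "sio:hasValue", literal)]
    if unit:
        triples.append((node, "sio:hasUnit", unit))
    return triples


# A length of 12.5 recorded in millimeters (uo:0000016, as named in the text).
length = value_triples("ex:length-row1", "12.5",
                       unit="uo:0000016", xsd_type="xsd:decimal")
```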

Dictionary Mapping Formalism
We define a formalism for the mapping of DM columns to an RDF serialization. The notation we use for formalizing the SDD tables is based on an approach for translating constraints into first-order predicate logic [62]. While most of the DM columns have one-to-one mappings, we can see the interrelation of the mapping of ROLE, RELATION, and INRELATIONTO. In the formalism included below, 'Value' represents the cell value of the data point that is being mapped.

Codebook
The Codebook table of the SDD allows for the annotation of individual data values that correspond to categorical codes. The Codebook table contains the possible values of the codes in CODE, their associated labels in LABEL, and a corresponding ontology concept assignment in CLASS. If the user wishes to map a Codebook value to an existing Web resource or instance of an ontology class, rather than a reference to a concept in an ontology, RESOURCE can be populated with the corresponding URI. We recommend that the class assigned to each code for a given column be a subclass of the attribute or entity assigned to that column. A conceptual diagram of the Codebook is shown in Figure 2(a). The Codebook Specification is provided in Appendix
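A minimal sketch of how a Codebook lookup might behave is given below. The codes, labels, and `ex:` classes are invented for illustration and are not taken from any published SDD; the fallback to a plain value when no Codebook entry exists follows the Figure 1 caption.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical Codebook rows keyed by (Column, Code).
CODEBOOK: Dict[Tuple[str, str], Dict[str, str]] = {
    ("education", "1"): {"Label": "Less than high school",
                         "Class": "ex:LessThanHighSchool"},
    ("education", "2"): {"Label": "High school graduate",
                         "Class": "ex:HighSchoolGraduate"},
}


def annotate_cell(node: str, column: str, cell: str) -> List[Triple]:
    """Map a coded cell value to its ontology class, if the Codebook has one."""
    entry = CODEBOOK.get((column, cell))
    if entry is None:
        # No Codebook entry for this column and code: keep the raw value.
        return [(node, "sio:hasValue", f'"{cell}"')]
    label = entry["Label"]
    return [
        (node, "rdf:type", entry["Class"]),
        (node, "rdfs:label", f'"{label}"'),
    ]


coded = annotate_cell("ex:education-row1", "education", "2")
uncoded = annotate_cell("ex:age-row1", "age", "34")
```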

Code Mapping
The Code Mapping table contains mappings of abbreviated terms or units to their corresponding ontology concepts. This aids the human annotator by allowing the use of short-hand notations instead of repeating a search for the URI of the ontology class. The set of mappings used in the CHEAR project is useful for a variety of domains and is available online  17 .
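The shorthand expansion described here amounts to a table lookup, as the following sketch suggests. The 'mm' entry comes from the example given earlier in this section; the function name and the pass-through behavior for unmapped cells are assumptions.

```python
from typing import Dict

# Code Mapping table: shorthand -> ontology term.
CODE_MAPPING: Dict[str, str] = {
    "mm": "uo:0000016",  # millimeter in the Units of Measurement Ontology
}


def expand_shorthand(cell: str, code_mapping: Dict[str, str]) -> str:
    """Return the mapped ontology term, or the cell unchanged if unmapped."""
    key = cell.strip()
    return code_mapping.get(key, key)


unit_term = expand_shorthand(" mm ", CODE_MAPPING)
```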

Timeline
If an implicit entry for an event included in the DM corresponds to a time interval, the implicit entry can be specified with greater detail in the Timeline table. Timeline annotations include the corresponding class of the time associated entry, the units of the entry, start and end times associated with an event entry, and a connection to other entries that the Timeline entry may be related to. Shown in Figure 2(b) is a conceptual diagram of the Timeline. The Timeline Specification is provided in Appendix Table B.5. The formalism for mapping the Timeline is included below.
∃NAME ∧ ∃TYPE ⇒ TYPE(NAME)
∃NAME ∧ ∃LABEL ⇒ rdfs:label(NAME)
∃NAME ∧ ∃START ⇒ ∃S ∧ sio:hasStartTime(NAME, S) ∧ sio:hasValue(S, START)
∃NAME ∧ ∃END ⇒ ∃E ∧ sio:hasEndTime(NAME, E) ∧ sio:hasValue(E, END)
∃NAME ∧ ∃START ∧ ∃END ∧ START ≡ END ⇒ ∃T ∧ sio:existsAt(NAME, T) ∧ sio:hasValue(T, START)
∃NAME ∧ ∃UNIT ⇒ ∃U ∧ UNIT(U) ∧ sio:hasUnit(NAME, U)
∃NAME ∧ ∃INRELATIONTO ⇒ sio:inRelationTo(NAME, INRELATIONTO)
Figure 2. (a) A conceptual diagram of the Codebook, which can be used to assign ontology classes to categorical concepts. Unlike other mapping approaches, the use of the Codebook allows for the annotation of cell values, rather than just columns. (b) A conceptual diagram of the Timeline, which can be used to represent complex time associated concepts, such as time intervals.
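The Timeline rules can be read as independent checks applied to each Timeline row, which the sketch below follows rule by rule. It is one possible reading, not a reference implementation: the blank-node labels, the `ex:` prefix, the example classes, and the treatment of rdfs:label as a binary relation between the entry and its label are assumptions.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def timeline_to_triples(entry: Dict[str, str]) -> List[Triple]:
    """Apply each Timeline rule independently to one Timeline row."""
    key = entry["Name"].lstrip("?")
    name = f"ex:{key}"
    triples: List[Triple] = []
    if entry.get("Type"):
        triples.append((name, "rdf:type", entry["Type"]))
    if entry.get("Label"):
        triples.append((name, "rdfs:label", entry["Label"]))
    start, end = entry.get("Start"), entry.get("End")
    if start:
        triples.append((name, "sio:hasStartTime", f"_:{key}_start"))
        triples.append((f"_:{key}_start", "sio:hasValue", start))
    if end:
        triples.append((name, "sio:hasEndTime", f"_:{key}_end"))
        triples.append((f"_:{key}_end", "sio:hasValue", end))
    if start and end and start == end:
        # Equal start and end denote a time instant rather than an interval.
        triples.append((name, "sio:existsAt", f"_:{key}_time"))
        triples.append((f"_:{key}_time", "sio:hasValue", start))
    if entry.get("Unit"):
        # The unit is modeled as an instance U of the given unit class.
        triples.append((f"_:{key}_unit", "rdf:type", entry["Unit"]))
        triples.append((name, "sio:hasUnit", f"_:{key}_unit"))
    if entry.get("inRelationTo"):
        triples.append((name, "sio:inRelationTo",
                        f"ex:{entry['inRelationTo'].lstrip('?')}"))
    return triples


# Hypothetical Timeline rows: an instant and an interval.
instant = timeline_to_triples(
    {"Name": "??visit", "Type": "ex:Visit", "Start": "0", "End": "0",
     "Unit": "ex:Day", "inRelationTo": "??subject"})
interval = timeline_to_triples({"Name": "??stay", "Start": "0", "End": "3"})
```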

Property Customization
The Semantic Data Dictionary approach creates a linked representation of the class or collection of data sets it describes. The default model provided is based on SIO, which can be used to express a wide variety of objects using a fixed set of terms, incorporates annotation properties from RDFS and Simple Knowledge Organization System (SKOS), and uses provenance predicates from PROV-O. Shown in Appendix Table B.6 are the default sets of properties that we recommend.
By specifying the associated properties with specific columns of the Dictionary Mapping Table, the properties used in generating the knowledge graph can be customized. This means that it is possible to use an alternate knowledge representation model, thus making this approach ontology-agnostic. Nevertheless, we urge the user to practice caution when customizing the properties used to ensure that the resulting graph is semantically consistent (for example, not to replace an object property with a datatype property).
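One way to picture property customization is as a set of default predicates that a Properties table selectively overrides, as sketched below. The default predicates shown for typing, roles, and annotation are named in this section; the exact PROV-O predicate names, the dictionary layout, and the choice to reject unknown column names are assumptions made for illustration.

```python
from typing import Dict

# Default predicates keyed by DM column name.
DEFAULT_PREDICATES: Dict[str, str] = {
    "Attribute": "rdf:type",
    "Entity": "rdf:type",
    "Label": "rdfs:label",
    "Role": "sio:hasRole",
    "wasDerivedFrom": "prov:wasDerivedFrom",  # assumed PROV-O predicate
    "wasGeneratedBy": "prov:wasGeneratedBy",  # assumed PROV-O predicate
}


def resolve_predicates(properties_table: Dict[str, str]) -> Dict[str, str]:
    """Overlay user-specified predicates on the defaults."""
    unknown = set(properties_table) - set(DEFAULT_PREDICATES)
    if unknown:
        raise KeyError(f"Unknown DM columns: {sorted(unknown)}")
    # Entries in the Properties table take precedence over the defaults.
    return {**DEFAULT_PREDICATES, **properties_table}


# Hypothetical customization: use a SKOS label predicate instead of rdfs:label.
predicates = resolve_predicates({"Label": "skos:prefLabel"})
```

Swapping one annotation property for another, as here, keeps the graph consistent; replacing an object property with a datatype property would not, which is the kind of change the caution above is meant to prevent.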
In the formalism presented above and the DM, CB, and TL specifications of Appendix Tables B.3 Table are included to allow the alteration of any of these predicates. The two additional rows pertain to ATTRIBUTE and ENTITY, which, like TYPE, by default map to rdf:type, but can be customized to use an alternate predicate if the user wishes. In this way, by allowing for the complete customization of the predicates that are used to write the formalism, the SDD approach is ontology-agnostic. Note that the predicates used in the Infosheet Metadata Supplement of Table B.2, which are based on the best practices described in Section 3.1, are not included in the Properties Specification.

EXAMPLE -THE NATIONAL HEALTH AND NUTRITION EXAMINATION SURVEY
The National Health and Nutrition Examination Survey (NHANES) contains publicly available demographic and biomedical information. A challenge in creating a knowledge representation from this data set is determining how to represent the implicit entities referenced by the data, such as a participant of the study or the household that they live in. Additionally, information about a participant may be dispersed throughout multiple tables that consequently need to be integrated, resulting in difficulties when following traditional mapping approaches.
NHANES data dictionaries include a variable list that contains names and descriptions for the columns in a given data set component, as well as a documentation page that consists of a component description, data processing and editing information, analytic notes, and a Codebook. Unfortunately, the data set description provided is textual and is therefore not readily processed.
We find that neither the data documentation nor the codebooks included in NHANES incorporate mappings to ontology concepts. Thus, we provide a simple example of how several columns from the NHANES demographics data set would be represented using the SDD approach. The terms in this example are annotated using the CHEAR, SIO, and National Cancer Institute Thesaurus (NCIT) ontologies. Tables 1, 2, and 3 show a portion of the SDD we encoded for the NHANES demographics data set, in which we respectively present a subset of the explicit DM entries, implicit DM entries, and the Codebook entries. An example Infosheet for the NHANES demographic data set is provided in Appendix Table C.1.
In Table 1, we provide the explicit entries that would be included in the DM. The data column SEQN corresponds to the identifier of the participant. The resource created from this column can be used to align any number of NHANES tables, helping address the data integration problem. Another column included is the categorical variable that corresponds to education level. Also included are two variables that correspond to the age of the participant taking the survey and the age of the specified reference person of the household, referred to as the head of the household (HH in Table 1), defined as the person who owns or pays rent for the house. We see how the use of implicit entries, as well as the use of specified Code Mapping units, helps differentiate the two ages. The corresponding implicit entries referenced by the explicit entries are annotated in Table 2. In Table 3, we include a subset of the Codebook for this example. The SDD Codebook here is similar to the original NHANES Codebook, with the addition of COLUMN, so that multiple codebooks do not have to be created to correspond to each categorical variable, and CLASS, used to specify a concept from an ontology to which the coded value maps.

CURRENT USE
In this section, we provide a case study on projects that have leveraged the SDD for health-related use cases. We focus on work done for the Health Empowerment by Analytics, Learning, and Semantics (HEALS) project, while also briefly discussing efforts in other programs. In our funded research, our sponsors often desire the representation of their data in a semantically consistent way that supports their intended applications. They wish to play a role in the annotation process by contributing their subject matter expertise. We find that the SDD approach is more accessible to domain scientists than other programming-intensive approaches. Additionally, they appreciate that the ability to reuse SDDs limits the number of future updates needed when, for example, a data schema changes.

Health Empowerment by Analytics, Learning and Semantics
As part of the RPI and IBM collaborative Health Empowerment by Analytics, Learning, and Semantics (HEALS) project  20 , SDDs have been used to aid in semantic representation tasks for use cases involving breast cancer and electronic health record (EHR) data.

Breast Cancer Use Case
For the creation of an application used for the automatic re-staging of breast cancer patients, the SDD approach was used to create a knowledge representation of patient data from the Surveillance, Epidemiology, and End Results (SEER) program [63]. In order to integrate treatment recommendations associated with a given biomarker into the application, an SDD for the Clinical Interpretation of Variants in Cancer (CIViC) database was also created. By applying the SDD approach to help solve this problem, seamless data integration between these two distinct sources was demonstrated, which would have been more difficult to achieve using some of the methods described in Section 2.2. For example, if any of the mapping language or Semantic ETL approaches were applied, the writing of a script that requires an intrinsic understanding of the data set would be necessary, rather than needing to just fill out the SDD tables. While the SDD approach still requires an understanding of the data set, the data apprehension requirement on the user is greatly reduced when the data sets mentioned above are described using SDDs. Another advantage demonstrated by using this approach was that, since a limited set of properties is leveraged in the semantic model that was created, the cost of implementing the application, in terms of programming resources and overhead, was reduced. A subset of the explicit entries from the SEER DM is shown in Table 4. Additional cancer-related work for the HEALS project involves the annotation of a subset of The Cancer Genome Atlas (TCGA) through the NCI Genomic Data Commons (GDC) portal. While these SDDs are not included here, they are openly available on our SDD resources webpage. The clinical subset of the TCGA data that was annotated contains patient demographic and tumor information, and the methylation portion contains genetic information.
By using the same ontology classes that were used for the SEER data set to annotate these concepts, we are able to leverage TCGA data to further enrich the cancer staging application described above.

Electronic Health Record Data
To create a knowledge representation from electronic health record (EHR) data, we annotated the Medical Information Mart for Intensive Care III (MIMIC-III) data set using SDDs. While this effort involved annotating 26 relational tables, we only include a subset of the Dictionary Mapping of the admission table in Table 5. Using this approach, we can represent implicit concepts associated with the data. The inclusion of implicit concepts provides connection points for linking the various EHR data tables into a single coherent knowledge representation model that reflects the reality recorded by the data. This would be difficult to accomplish using many alternate approaches we examined that do not support object elicitation.

Additional Use Cases
Several institutions are employing the Semantic Data Dictionary approach for a variety of projects. The Icahn School of Medicine at Mount Sinai uses SDDs for the NIH CHEAR and the follow-on HHEAR projects to annotate data related to demographics, anthropometry, birth outcomes, pregnancy characteristics, and biological responses. The Lighting Enabled Systems & Applications (LESA) Center is using SDDs to annotate sensor data. SDDs are being used in Brazil for the Big Data Ceara project, through Universidade de Fortaleza, and the Global Burden of Disease project, through Universidade Federal de Minas Gerais.

Remarks
In this section, we discussed how SDDs help represent knowledge for a variety of other projects that involve collaborative efforts with domain scientists, exhibiting the applicability of this approach for researchers in a variety of specializations. For the HEALS project, we have shown DMs for use cases that involve breast cancer and EHR records. In addition to patient demographic characteristics from the SEER data, we encode the size of the patient's tumor, the number of lymph nodes affected, whether or not the cancer metastasized, and several genetic biomarkers. Using this data, the successful automation of re-staging breast cancer patients was accomplished. While we only show a single DM for the MIMIC-III data set, this use case involves the annotation of multiple relational data tables and demonstrates how data integration can be performed using SDDs.

MODELING CHALLENGES FOR DOMAIN SCIENTISTS
We used an initial strategy of training followed by qualitative evaluation to examine the difficulty experienced by researchers who do not have a Semantic Web background when first using the Semantic Data Dictionary. Domain scientists, including epidemiologists and biostatisticians, were presented with initial training by a Semantic Web expert. Supporting materials were developed in collaboration with a domain expert and then were made available to provide guidance and examples to facilitate domain scientists' use of the Semantic Data Dictionary.
First, a template for completing the Semantic Data Dictionary that included pre-populated fields for common demographic concepts, such as age, race, and gender, was provided to domain scientists to use for each study. Second, a help document was created that included instructions and representations of more complex concepts, including measurements of environmental samples, measurements of biological samples, and measurements taken at specific time-points. Third, a practical workshop was held where a semantic scientist provided training in semantic representation to the domain scientists. Following the workshop and distribution of supporting materials, domain scientists completed at least one Semantic Data Dictionary for an epidemiologic study and were then asked about the challenges they faced. Despite this training and workshop being conducted in a context related to epidemiology and health, the key takeaways resulted in general lessons learned.
The first identified challenge was the representation of implicit objects implied by the features in the data set. This is an uncommon representation in the public health domain. While the modeling of simple concepts may be intuitive (e.g., maternal age has a clear implicit reference to mother), the representation of complex ideas, such as fasting blood glucose levels, proves to be more difficult, as the implicit object and the relationships between concepts are not as intuitive for domain scientists. A second modeling challenge involved discussions on how to represent time-associated concepts that power the ontology-enabled tools and allow domain scientists to harmonize data across studies. Additionally, when a concept was not found in a supporting ontology, there were questions of how to best represent the concept in a semantically-appropriate way. In many cases, these challenges resulted in a need to go back to a Semantic Web expert for clarification.
To alleviate these challenges, we have refined and expanded the number of publicly-available resources that include documentation, step-by-step modeling methods, tutorials, demonstrations, and informative examples. We increased the complexity of examples and incorporated time-associated concepts into the initial templates and help documents. To facilitate further communication, a Web-based Q&A document has been shared between the Semantic Web experts and the domain scientists to enable timely feedback and answers to specific questions on the representation of concepts and the need to generate new concepts.
In addition to the solutions presented above, we plan for future training events to explicitly demonstrate the use of the Semantic Data Dictionary. We will provide an overview on the semantic representation, as well as guidelines for using the corresponding documentation and training materials.

EVALUATION
To evaluate the Semantic Data Dictionary approach, we categorize metrics from earlier evaluations on mapping languages [64,65] and requirements of data integration frameworks. In addition to evaluating the SDD for adherence to these metrics, we survey similar work to determine the extent to which they meet the metrics in comparison. We include a set of evaluation metrics that we organized into four categories. These categories are respectively related to data, semantics, the FAIR principles, and generality.
To measure the degree to which an approach meets each metric, we provide a value of 0, 0.5, or 1, depending on the extent to which an approach responds to an evaluation parameter. In general, if an approach does not meet a metric, it is given a score of 0. If it meets a metric partially, we assign a score of 0.5. We also assign this score to approaches that meet a metric by omission, such as being ontology-agnostic by not supporting the use of ontologies at all. If an approach completely meets the metric, it is given a score of 1. We list the criteria used for the assignment of numerical values below (refer to Table 6 for the complete list of categorized metrics).

Data Integration Capabilities
In this category, we consider how the approach can harmonize and ingest data, allows for subset data selection, and permits a data type assignment. We evaluate whether the approach is harmonizable in the sense that it has the capability of creating a cohesive representation for similar concepts across columns or data sets in general. We check that knowledge generated across data sets can be compared using similar terms from a controlled set of vocabularies. For this metric, we respectively assign a score of 0, 0.5, or 1 if data integration capabilities are not supported, somewhat supported, or wholly supported.
Next, we consider whether the approach is ingestible, outputting data in a standard format that can be uploaded and stored (ingested) and supporting inputs of varying formats. We assign a score of 1 if the resulting data representation can be stored in a database or triplestore, and if it can input data of varying formats. If one of the two features is supported, we assign a score of 0.5. If neither is supported, we assign a score of 0. Furthermore, we consider a subset selection metric, where we check if the approach allows the user to select a subset of the data, in terms of columns or rows, on which to perform the annotation. For this metric, a score of 0 is assigned if this capability is not included in the approach. We assign a score of 0.5 if either a subset of the rows or the columns can be specified for annotation, but not both. If the approach allows for the selection of both a subset of rows and a subset of columns to be annotated, we assign a score of 1.
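Both rubrics in this paragraph share the same shape: two features, each contributing half of the score. The sketch below encodes that reading; the function and parameter names are hypothetical and are not part of the evaluation itself.

```python
def two_feature_score(first: bool, second: bool) -> float:
    """Return 0, 0.5, or 1 depending on how many of two features are supported."""
    return (int(first) + int(second)) / 2


def ingestible_score(storable_output: bool, varied_inputs: bool) -> float:
    """Output can be stored in a database or triplestore; inputs of varying formats."""
    return two_feature_score(storable_output, varied_inputs)


def subset_selection_score(row_subset: bool, column_subset: bool) -> float:
    """A subset of rows and/or a subset of columns can be selected for annotation."""
    return two_feature_score(row_subset, column_subset)


# A hypothetical approach that stores output in a triplestore but reads only CSV.
example = ingestible_score(storable_output=True, varied_inputs=False)
```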
Finally, we include the data type assignment metric, measuring the extent to which XML data types can be assigned to attributes when mapping data. We assign a score of 0 for this metric if the approach does not allow for the assignment of data types when mapping data. If the assignment of a limited set of data types that are not based on XML standards is incorporated, a score of 0.5 is assigned. If the approach allows the assignment of XML data types, a score of 1 is given.

Formal Semantics Capabilities
In this category, we consider if the approach allows for object or relation elicitation, as well as value, time, or space annotation. We also check if the resulting data representation is queryable and if the approach supports both domain-specific and general ontology foundations. Finally, graph materialization is the last assessment metric we apply. Data usually consist of attributing value to observations, measurements, or survey results. Data set descriptions contain metadata, but often omit details on the objects that the values describe. For a complete semantic representation, one must also consider the ability to represent implicit objects that are associated with the data points, which we measure using the object elicitation metric. If the approach does not include the ability to represent implicit objects, a score of 0 is assigned. If implicit objects are considered but not annotated in detail, we assign a score of 0.5. We assign a score of 1 if implicit objects can be represented and richly annotated.
In addition to being able to represent implicit concepts, we consider relation elicitation, where relationships between implicitly elicited objects can be represented. A score of 0 is assigned if an approach does not allow for the representation of relationships between elicited objects. If relationships between elicited objects can be represented, but not annotated in detail, a score of 0.5 is assigned. We assign a score of 1 if relationships between elicited objects can be represented and richly annotated.
Next, we consider if the resulting representation is queryable, so that specific data points can be easily retrieved using a query language. A score of 0 is assigned for this metric if specific content from the knowledge representation cannot be queried. If it can be queried using a relational querying method, such as SQL, but not a graph querying method, a score of 0.5 is assigned. If content can be queried using a graph querying method, such as SPARQL, we assign a score of 1.
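The queryable rubric is ordered rather than additive: graph querying earns full credit regardless of relational support. A small sketch of that ordering follows; the language sets contain only the two examples named in the text, and the function name is hypothetical.

```python
from typing import Iterable

GRAPH_LANGUAGES = {"sparql"}
RELATIONAL_LANGUAGES = {"sql"}


def queryable_score(supported: Iterable[str]) -> float:
    """Score an approach by the strongest query method it supports."""
    names = {name.lower() for name in supported}
    if names & GRAPH_LANGUAGES:
        return 1.0
    if names & RELATIONAL_LANGUAGES:
        return 0.5
    return 0.0


graph_score = queryable_score(["SQL", "SPARQL"])
```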
We further consider the annotation of cell values, rather than just column headers, using the value annotation metric. This covers the ability to annotate categorical cell values, assign units to annotate non-categorical cell values, and specify attribute mappings of object properties related to cell values. If the approach does not allow for the annotation of cell values at all, or allows for a limited set of annotations for cell values, we assign scores of 0 and 0.5, respectively. We assign a score of 1 if an approach includes the ability to annotate categorical cell values, assigns units to annotate non-categorical cell values, and specifies attribute mappings of object properties related to cell values.
We consider the ability to represent specific scientific concepts, including time and space. Using the time annotation metric, we check for the ability to use timestamps to annotate time-series values, as well as named time instances to annotate cell values. A score of 0 is assigned for this metric if an approach does not allow for the representation of time. If the approach allows for the representation of time, but does not permit detailed annotations, we assign a score of 0.5. We assign a score of 1 if the approach allows for detailed annotation of time, such as the use of timestamps to annotate time-series values and named time instances to annotate cell values.
The space annotation metric is added to check for the use of semantic coordinate systems to annotate the acquisition location of measurements. We assign a score of 0 if an approach does not allow for the representation of space. If it allows for the representation of space, but does not permit detailed annotations, we assign a score of 0.5. A score of 1 is assigned if the use of semantic coordinate systems to annotate the acquisition location of measurements is supported.
We examine domain knowledge support by checking if the approach permits the design of mappings using pre-existing domain-specific ontologies or controlled vocabularies. A score of 0 is assigned for this metric if the approach does not permit the design of reusable mappings driven by domain knowledge. We assign a score of 0.5 if it permits the design of reusable mappings using either pre-existing ontologies or controlled vocabularies, but not both. If annotations from both pre-existing ontologies and controlled vocabularies are allowed, we assign a score of 1.
Using the top-level ontology foundation metric, we consider the ability to use general upper ontologies as a foundation for the resulting model. If an approach cannot specify mapping rules based on foundation ontologies, a score of 0 is assigned for this metric. If a subset of mapping rules based on general foundation ontologies can be specified, we assign a score of 0.5. A score of 1 is assigned if the approach allows for the specification of all mapping rules based on general foundation ontologies. Essentially, we are checking if the semantic model that results from the annotation approach is structured based on a given ontology. While we recommend the use of well-known upper ontologies such as SIO or Basic Formal Ontology (BFO) [66], in evaluating this metric we allow the approach to leverage any ontology.
Finally, with the graph materialization metric, we assess the persistence of the generated knowledge graph into an accessible endpoint or file. If the approach does not allow for the materialization of the generated graph, a score of 0 is assigned. If the generated graph is reified into an accessible endpoint or downloadable file, but not both, a score of 0.5 is assigned. If both materializations into an accessible endpoint and a downloadable file are supported, we assign a score of 1.
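As an illustration of the "downloadable file" half of this metric, the sketch below serializes a handful of triples as N-Triples lines using only the Python standard library. The `to_ntriples` helper, the `("literal", text)` convention for objects, and the `example.org` IRIs are assumptions made for this example; a real pipeline would more likely rely on an RDF library, and publishing to an accessible endpoint is not shown.

```python
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text):
    """Escape the characters that must be escaped inside a quoted literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(triples, out):
    """Write (subject IRI, predicate IRI, object) triples as N-Triples lines.

    An object given as a ("literal", text) tuple is written as a plain
    string literal; any other object is treated as an IRI.
    """
    for s, p, o in triples:
        if isinstance(o, tuple) and o[0] == "literal":
            obj = '"' + escape_literal(o[1]) + '"'
        else:
            obj = "<" + o + ">"
        out.write("<%s> <%s> %s .\n" % (s, p, obj))


# Hypothetical triples standing in for a generated knowledge graph.
triples = [
    ("http://example.org/kg/participant-1", RDF_TYPE,
     "http://example.org/onto/Participant"),
    ("http://example.org/kg/participant-1", RDFS_LABEL,
     ("literal", 'Participant "1"')),
]
buffer = io.StringIO()
to_ntriples(triples, buffer)
ntriples_text = buffer.getvalue()
```

Passing an open file object instead of the `io.StringIO` buffer would persist the same lines to disk.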

FAIR
In the FAIR category, we consider the metrics associated with the FAIR guiding principles, including if the approach and resulting artifacts are findable, accessible, interoperable, and reusable. Furthermore, we also consider the related metrics of reproducibility and transparency, which are not included in the FAIR acronym. While several of the metrics we measure in the other categories of our evaluation aid with the creation of FAIR data, such as the incorporation of provenance or the inclusion of documentation as discussed in Section 7.3.1, we include these six metrics in the FAIR category since they are directly associated with the intent of the principles in enhancing data reuse and are explicitly discussed in the introductory article on the FAIR principles [4].
For the findable metric, we consider the use of unique persistent identifiers, such as URLs, as well as the inclusion of Web searchable metadata so that the knowledge is discoverable on the Web. If the knowledge representation is neither persistent nor discoverable, we assign a score of 0 for this metric. If the knowledge representation is one of the two, we assign a score of 0.5. A score of 1 is assigned if the knowledge representation is both persistent and discoverable.
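The findable rule follows a "neither, one, or both" pattern, which can be sketched as below. The function names are hypothetical and serve only to restate the rubric in executable form.

```python
def two_criteria_score(first, second):
    """Score 0 if neither criterion holds, 0.5 if exactly one, 1 if both."""
    return (int(bool(first)) + int(bool(second))) / 2


def findable_score(is_persistent, is_discoverable):
    """Apply the two-criteria rule to persistence and Web discoverability."""
    return two_criteria_score(is_persistent, is_discoverable)


findable_examples = [
    findable_score(False, False),
    findable_score(True, False),
    findable_score(True, True),
]
```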
We consider a knowledge representation to be accessible if resources are openly available using standardized communication protocols, with the consideration that data that cannot be made publicly available is accessible through authentication. Accessibility also includes the persistence of metadata, such that even if data are retired or made unavailable, their description still exists on the Web. As an additional consideration for evaluating accessibility, we examine whether or not the associated software for an approach is free and publicly available. If resources and metadata are not published openly, a score of 0 is assigned for this metric. If some resources and metadata are persistent and openly available, we assign a score of 0.5. A score of 1 is assigned if all of the resources and metadata from a given approach are both persistent and openly available using standardized communication protocols.
For the interoperable metric, we consider the use of structured vocabularies, such as best practice ontologies, that are RDF compliant. Mainly, we are checking to see if the knowledge representation is published using an RDF serialization. If the knowledge representation does not use a structured vocabulary, a score of 0 is assigned. If it uses structured vocabularies that are not RDF compliant, we assign a score of 0.5. A score of 1 is assigned if the knowledge representation uses formal vocabularies or ontologies that are RDF compliant.
To test if an approach or the resulting knowledge representation is reusable, we consider the inclusion of a royalty-free license that permits unrestricted reuse, and whether consent or terms of agreement documents are available when applicable. We also consider whether the included metadata about the resource is detailed enough for a new user to understand. A score of 0 is assigned for this metric if an approach does not include a royalty-free license. If a royalty-free license that permits unrestricted use of some portions of the tool is included, a score of 0.5 is assigned. We assign a score of 1 if the approach includes a royalty-free license that permits unrestricted use of all portions of the tool.

We examine if an approach is reproducible in terms of scientific activities introduced within a given methodology, such that experiments can be independently conducted and verified by an outside party. If the approach creates a knowledge representation that cannot be reproduced, a score of 0 is assigned. If the knowledge representation can be reproduced by an outside party only with the help of the involved party, rather than entirely independently, we assign a score of 0.5. A score of 1 is assigned if the approach for creating a knowledge representation can be independently reproduced.
Finally, we consider if data and software are transparent, such that there are no "black boxes" used in the process of creating a knowledge representation. Transparency is readily achieved by making sure that software is made openly available. If the associated code for a given approach is not openly accessible, we assign a score of 0. We assign a score of 0.5 if some of the associated code is open, while other portions are not openly available. This generally applies to approaches that offer both free and paid versions of software. If all of the associated code for an approach is open source, a score of 1 is given.

Generality Assessment
To evaluate the generality of an approach, we investigate whether or not the method is domain-agnostic, is ontology-agnostic, and adheres to existing best practices. We weigh whether the method incorporates provenance attributions, is machine-understandable, and contains documents to aid the user, such as documentation, tutorials, or demonstrations.
We analyze whether an approach is domain-agnostic, in that its applicability does not restrict usage to a particular domain. A score of 0 is assigned for this metric if the approach only applies to a single field of study. If the approach applies to multiple fields of study but does not work for specific domains, a score of 0.5 is assigned. We assign a score of 1 if the approach can be generalized to any area of study.
In a similar vein, we judge if the method is ontology-agnostic, where usage is not limited to a particular ontology or set of ontologies. If the approach depends on a particular ontology or set of ontologies, a score of 0 is assigned. If the dependence on particular ontologies is unclear from the examined literature and documentation, we assign a score of 0.5. A score of 1 is assigned for this metric if the approach is independent of any particular ontology.
We examine the literature and documentation associated with a given approach or knowledge representation to see if it leverages best practices. In particular, we consider the applicable best practices related to the HCLS and DWBP guidelines. The practices we test for include the ability of the approach to incorporate descriptive metadata, license and provenance information, version indicators, and standardized vocabularies, and to use locale-neutral data representations. A score of 0 is assigned if the literature associated with an approach does not acknowledge or adhere to existing best practice standards. If existing standards are acknowledged but are not adhered to or are only partially adhered to, we assign a score of 0.5. If the literature acknowledges and adheres to existing best practices, a score of 1 is assigned.

We consider the inclusion of provenance, involving the capture of existential source information, such as attribution information for how a data point was measured or derived. A score of 0 is assigned for this metric if the approach does not include attributions to source or derivation information. If attribution information that does not use Semantic Web standards is included, we assign a score of 0.5. If the approach covers attributions recorded using a Semantic Web vocabulary, such as the PROV-O ontology, a score of 1 is assigned.
In terms of documentation, we further search for the inclusion of assistive documents, tutorials, and demonstrations. We assign a score of 0 for this metric if only one of documentation, tutorials, or demonstrations is included. If two or all three of these are included, we assign scores of 0.5 or 1, respectively.
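For the provenance metric, attributions recorded with a Semantic Web vocabulary might look like the triples built below. This is a minimal sketch: the `provenance_triples` helper and the `example.org` IRIs are hypothetical, while `prov:wasDerivedFrom` and `prov:wasAttributedTo` are properties defined by PROV-O.

```python
PROV = "http://www.w3.org/ns/prov#"


def provenance_triples(value_iri, source_iri, agent_iri):
    """Return PROV-O style triples attributing a data point to its source."""
    return [
        # The data point was derived from a source entity (e.g., a data file).
        (value_iri, PROV + "wasDerivedFrom", source_iri),
        # The data point is attributed to a responsible agent.
        (value_iri, PROV + "wasAttributedTo", agent_iri),
    ]


prov_example = provenance_triples(
    "http://example.org/kg/measurement-42",  # hypothetical data point
    "http://example.org/data/source-file",   # hypothetical source
    "http://example.org/agent/lab-1",        # hypothetical agent
)
```

Under the rubric above, attributions of this kind would correspond to a score of 1, whereas the same information kept in free-text notes would correspond to 0.5.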
Finally, we consider the machine-readable metric, determining whether the resulting knowledge representation from an approach is discernible by software. In addition to the consideration of the machine-readability of output artifacts such as produced knowledge graphs, we also examine input artifacts, such as the document that contains the set of semantic mappings. If neither input nor output artifacts can be parsed using software, a score of 0 is assigned for this metric. If either input or output artifacts can be parsed, but not both, a score of 0.5 is assigned. We assign a score of 1 if both input and output artifacts are machine-readable.

RESULTS
In Table 6, we provide a high-level comparison between the Semantic Data Dictionary, traditional data dictionaries, mapping languages and semantic approaches that leverage them, and data integration tools. Of the conventional data dictionaries examined in Section 2.1, we use the Project Open Metadata Schema data dictionary for comparison since it was the only reviewed guideline that used a standard linked data vocabulary. Of the mapping languages, we use R2RML for comparison, as it is a standard that is well adopted by the Semantic Web community. Of the data integration tools we surveyed, we use Karma for this evaluation, as it is an example of a data integration approach that was designed with both the FAIR principles and ease of use for the end-user in mind. Rather than only using these approaches in conducting the evaluation, we think of these examples as guidelines and consider traditional data dictionaries, mapping languages, and data integration tools in general when assigning numerical scores.
We have demonstrated the benefits of using a standardized machine-readable representation for recording data set metadata and column information, which is achieved through SDDs, over earlier data dictionary formats. Furthermore, we demonstrate that the SDD approach presents a level of abstraction over methodologies that use mapping languages, allowing improved ease of use for a domain scientist over other semantic tools. In this regard, SDDs tend to provide a bridge between conventional data dictionary approaches used by domain scientists and formal semantic approaches used by Semantic Web researchers, thereby accommodating both user groups. We recognize that the RDF mapping tools that exist are intended to provide a bridge by reducing manual mapping or KG creation work that would otherwise be necessary, but also acknowledge that they may be unusable to domain scientists.

DISCUSSION
In presenting this work, we consider two general types of users. We consider those using SDDs to semantically annotate data as well as those using SDDs in place of traditional data dictionaries in order to understand the data being described. For the first group of users, benefits of using SDDs include that the annotation process is accessible for users outside of the Semantic Web domain and that existing SDDs can be reused to ease the creation of new annotations. Some benefits for the second group include that (i) traditionally humans alone can understand data descriptions in existing data dictionaries but SDDs can be interpreted by machines as well, (ii) SDDs are written using fixed vocabularies which reduce ambiguity, and (iii) the SDD provides a standard specification that can be used to interpret existing annotations.
By including a fixed set of tables for the annotator to fill out that are interpreted and converted using a standard set of rules, the SDD framework provides consistency by creating a formal semantic representation using direct RDF mappings, resulting in an increased likelihood of diverse annotators creating similar representations. This is in contrast with other mapping approaches, where multiple annotators are much less likely to produce similar results when addressing the same data set. The SDD approach reduces such representational biases as it abstracts away structural modeling decisions from the user, both supporting scalable production and lowering the barrier to entry, since not all of the authors have to be computer scientists. Moreover, the vocabulary used in an SDD can be easily updated by replacing terms in any of the tables, whereas similar updates are much harder to make when using standard mapping methods. An advantage of these features of the SDD is that users can focus on their topic of specialization rather than on the RDF, reducing the need for domain scientists to also become ontology experts. Given a recommended set of ontologies to use, any user should be able to create their own SDD for a given data set.
From the evaluation of Section 7, we find that in the data category, SDDs perform much better than traditional data dictionaries, and as well as mapping languages and data integration tools. SDDs outperform the three other approaches in the semantics category. In terms of semantics, a notable impact of this work is our approach to object and relation elicitation, where detailed annotations for objects implicitly referenced by the data can be included. SDDs and mapping languages perform equally well in the FAIR category, surpassing the scores of data integration tools and traditional data dictionaries. SDDs, mapping languages, and data integration tools tied for the best performance in the generality category, greatly outperforming traditional data dictionaries. While traditional DDs performed the worst over all four categories, they do outperform mapping languages and data integration tools in the value annotation metric.

CONCLUSION
While the use of SDDs addresses many of the shortcomings associated with the prior art, we do acknowledge several limitations of this approach. In Section 6, we mention several challenges faced by epidemiologists in creating SDDs. We found that the domain scientists had difficulties representing complex ideas, implicit concepts, and time associations. Additionally, determining the best ontology term to use when creating annotations was not always clear. These challenges relate to the limitation that this approach has some reliance on the annotator having knowledge about relevant ontologies in the domain of discourse. Several steps to help alleviate these challenges are discussed in Section 6.
Another limitation of this approach is that it currently only supports the annotation of tabular data. Adopting techniques from some of the methods discussed in Section 2.2.2 can help with a future extension to support XML data. Additions to support the annotation of unstructured text data are beyond the scope of this work. Finally, we acknowledge that the annotation process discussed in this article is mostly done manually. This limitation decreases the likelihood of the adoption of this approach by those wishing to streamline the annotation process or incorporate the approach as part of a larger workflow. While automated annotation is not yet supported, existing research on an SDD editor being conducted by members of the Tetherless World Constellation (TWC) involves the incorporation of Natural Language Processing (NLP) techniques to suggest concepts from ontologies based on text descriptions.
Our approach was outperformed in a few of the evaluation metrics, including space annotation, domain knowledge support, and the leveraging of best practices. Space annotation, to some degree, is supported through the use of implicit entries and property customization. Nevertheless, the SDD approach received a 0.5 rather than a 1 for this metric since it does not readily allow for the incorporation of longitudinal and latitudinal coordinates, unlike Karma, which supports the annotation of geospatial data and provides tutorials on annotating such data as well as tools developed specifically for geospatial data integration [67,68,69]. While the SDD approach allows the use of domain ontologies during the annotation process, a score of 0.5 was assigned to the domain knowledge support metric since we have not developed tools that suggest to the user the most appropriate domain concept to use. Nevertheless, as mentioned above, ongoing work on an SDD editor will leverage NLP techniques to allow for this capability. Finally, while many of the DWBP and HCLS recommendations are incorporated into our approach, a score of 0.5 was received in terms of leveraging best practices because additional standards for these guidelines have yet to be incorporated. Additionally, further alignment with the standards mentioned in Section 2.3 should be achieved. The relevant best practices associated with our approach have been a subject of much discussion; further incorporation of these recommendations will be included in future revisions.
An ideal knowledge model promotes improved discovery, interoperability, reuse, traceability, and reproducibility. The knowledge model resulting from the SDD approach adheres to Semantic Web standards, resulting in improved discovery on the Web, as well as interoperability with systems that also use RDF data serializations. These artifacts are reusable, as SDD tables created for one data set can be reused to annotate another similar data set. Scientific studies involving SDDs are traceable and reproducible by design, as the artifacts designed during the modeling process can be published and shared, helping to ensure consistency for other researchers attempting to examine the studies.
In this work, we advance the state of the art of metadata capture of data sets by improving on existing standards with the formalization of the Semantic Data Dictionary specification, which produces machine-readable knowledge representations by leveraging Semantic Web technologies. This is achieved by formalizing the assignment of a semantic representation of data and annotating data set columns and their values using concepts from best practice ontologies. We provide resources such as documentation, examples, tutorials, and modeling guidelines to aid those who wish to create their own Semantic Data Dictionaries. We claim that this approach and the resulting artifacts are FAIR, help address limitations of traditional data dictionaries, and provide a bridge between representation methods used by domain scientists and semantic mapping approaches. We evaluate this work by defining metrics over several relevant categorizations, and scoring the Semantic Data Dictionary, traditional data dictionaries, mapping languages, and data integration tools for each metric. As we provide a methodology to aid in scientific workflows, this work eases the semantic annotation process for data providers and users alike.

AUTHOR CONTRIBUTIONS
S.M. Rashid (rashis2@rpi.edu), in drafting the paper, introduced the research, motivation, and claims of this article in Section 1, conducted the majority of the literature review presented in Section 2, summarized the methodology associated with the approach in Section 3, formulated the example of Section 4, detailed the case studies presented in Section 5, performed the evaluation of Sections 7 and 8, helped with the discussion in Section 9, and summarized the conclusions of the article in Section 10. J.P. McCusker (mccusj2@rpi.edu) contributed to the content of Section 3 and aided in the formulation of the evaluation of Section 7. P. Pinheiro (pinhep@rpi.edu) helped scope the example of Section 4. M.P. Bax (bax@ufmg.br) helped conduct the literature review of Section 2 and aided in the formulation of the evaluation of Section 7. H.O. Santos (oliveh@rpi.edu) helped synthesize the related literature in Section 2 and presented some limitations of our approach in Section 10. J.A. Stingone (js5406@cumc.columbia.edu) conducted the experiment and drafted the content presented in Section 6. A.K. Das (amardas@us.ibm.com) led the proposal of the research problems associated with the HEALS projects mentioned in Section 5. D.L. McGuinness (dlm@cs.rpi.edu) has guided the overall direction of this research. All the authors have made meaningful and valuable contributions in revising and proofreading the resulting manuscript.

Data Intelligence
AUTHOR BIOGRAPHY
Sabbir M. Rashid is a PhD student at Rensselaer Polytechnic Institute (RPI) working with Professor Deborah L. McGuinness on research related to data annotation and harmonization, ontology engineering, knowledge representation, and various forms of reasoning. Prior to attending RPI, Mr. Rashid completed a double major at Worcester Polytechnic Institute, where he received B.S. degrees in both Physics and Electrical & Computer Engineering. Much of his graduate studies at RPI have involved the research discussed in this article. His current work includes the application of deductive and abductive inference techniques over linked health data, such as in the context of chronic diseases like diabetes. ORCID: 0000-0002-4162-8334
James P. McCusker is the Director of Data Operations at the Tetherless World Constellation at Rensselaer Polytechnic Institute. He works with Deborah McGuinness on using knowledge graphs to further scientific research, especially in biomedical domains. He has worked on applying semantics to numerous projects, including drug repurposing using systems biology, cancer genome resequencing, childhood health and environmental exposure, analysis of sea ice conditions and materials science. He is the architect of the open source Whyis knowledge graph development and management framework, which has been used across many of these domains. ORCID: 0000-0003-1085-6059
Paulo Pinheiro is a data scientist and software engineer managing projects at the frontier between artificial intelligence and databases. His areas of expertise include the following: data policies and information assurance, such as security and privacy; data operation including curation, quality monitoring, semantic integration, provenance management and uncertainty assessment; data visualization; and data analytics including automated reasoning. Paulo holds a PhD in Computer Science from the University of Manchester, UK.
McGuinness has been recognized with awards as a fellow of the American Association for the Advancement of Science (AAAS) for contributions to the Semantic Web, knowledge representation, and reasoning environments and as the recipient of the Robert Engelmore award from Association for the Advancement of Artificial Intelligence (AAAI) for leadership in Semantic Web research and in bridging Artificial Intelligence (AI) and eScience, significant contributions to deployed AI applications, and extensive service to the AI community. Deborah is a leading authority on the Semantic Web and has been working in knowledge representation and reasoning environments for over 30 years and leads the research group that designed and implemented the research presented in this paper. ORCID: 0000-0001-7037-4567