An Analysis of Crosswalks from Research Data Schemas to Schema.org

ABSTRACT The increased number of data repositories has greatly increased the availability of open data. To enable broad discovery of and access to research datasets, some data repositories have begun leveraging the web architecture by embedding structured metadata markup in dataset web landing pages using vocabularies from Schema.org and its extensions. This paper examines metadata interoperability for supporting global data discovery. Specifically, the paper reports a survey on which metadata schemas have been adopted by participating data repositories, and presents an analysis of crosswalks from fourteen research data schemas to Schema.org. The analysis indicates that most descriptive metadata are interoperable among the schemas, that the most inconsistent mapping is the rights metadata, and that a large gap exists in the structural metadata and in the controlled vocabularies used to specify various property values. The analysis and collated crosswalks can serve as a reference for data repositories when they develop crosswalks from their own schemas to Schema.org, and provide the research data community with a benchmark of structured metadata implementation.


INTRODUCTION
In recent years, it has become increasingly common to share research data together with its corresponding description through metadata, thanks to initiatives such as Open Science and the FAIR (Findable, Accessible, Interoperable and Reusable) data principles [26]. To make data publicly accessible, researchers and data collectors deposit their datasets into a data repository and provide metadata that conforms to the repository's metadata schema; data repositories or metadata aggregators provide data discovery capabilities to make datasets discoverable through indexed metadata. With the increase of datasets managed in data repositories, some challenges arise, including exchanging metadata, discovering relevant datasets, and supporting (semi)automatic metadata processing [29].
Data repositories typically host metadata, embed metadata in a web page and publish the web page on the Web to make the dataset discoverable; such a web page, as shown in Figure 1a, is referred to as a metadata landing page. Like any other web page, a landing page is encoded with HTML tags and optimised for human readability. Before the recent explosion in commercial web index and search technology, repositories also offered access to structured, machine-readable metadata for their holdings using various metadata content and serialization schemes such as Dublin Core XML, Ecological Metadata Language (EML), the U.S. Content Standard for Digital Geospatial Metadata (CSDGM), ISO 19115/19139, and so on. This metadata was accessed through a standard API such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) or the Open Geospatial Consortium Catalogue Service for the Web (OGC-CSW).
Around 2004, developers started introducing semantic markup in HTML documents to add information about the web page subject and content to improve the display of search results, making it easier for people to find the right web pages. In 2011, a consortium of search engines including Bing, Google, Yahoo! and Yandex began developing a vocabulary of entities and properties that could be used in this semantic markup to make it interoperable across browser systems [11]. The Schema.org vocabulary is the outcome of this effort, with version 1 released in 2013. This initial release included an Entity for describing datasets (https://www.w3.org/wiki/WebSchemas/Datasets), which was significantly revised in 2016 (https://github.com/schemaorg/schemaorg/pull/1247).
This approach of publishing machine-readable metadata, i.e., structured metadata as shown in Figure 1b, brings new opportunities for making research data FAIRer. For instance, the use of these common vocabularies makes it easier for commercial web search engines like Google dataset search, or any metadata aggregators, to crawl and index metadata across data repositories globally in a more useful, consistent and robust way. The interoperability of metadata sharing the same schema allows metadata from different sources to be harvested and indexed without any intermediate mapping between schemas. Furthermore, it makes it easier to create federated queries across resources from different sources relevant to a research need. Metadata aggregators are exploring new methods for metadata syndication via the web architecture. The NSF EarthCube GeoCODES platform is indexing schema.org metadata in landing pages from 12 US NSF data facilities. DataCite has already offered to crawl metadata through its embedded web page [10]; DataOne and ARDC's catalogue service Research Data Australia are planning to offer a similar service.
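To make the idea of structured metadata embedded in a landing page concrete, the following is a minimal sketch, assuming JSON-LD as the serialization. The helper names (`dataset_jsonld`, `embed_in_landing_page`) and all record values are hypothetical and chosen only for illustration; the property names are taken from the Schema.org vocabulary.

```python
import json


def dataset_jsonld(name, description, url, identifier, content_url, encoding_format):
    """Return a Schema.org Dataset description as a JSON-LD-ready dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": content_url,
            "encodingFormat": encoding_format,
        },
    }


def embed_in_landing_page(record):
    """Wrap the record in the script element that web crawlers look for."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical record; every value below is a placeholder.
record = dataset_jsonld(
    name="Example sea surface temperature grid",
    description="A placeholder description of a hypothetical dataset.",
    url="https://repository.example.org/dataset/123",
    identifier="https://doi.org/10.1234/example",
    content_url="https://repository.example.org/dataset/123/data.csv",
    encoding_format="text/csv",
)
snippet = embed_in_landing_page(record)
print(snippet)
```

The resulting snippet would sit in the HTML of the landing page alongside the human-readable content, so the same page serves both readers and crawlers.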
However, these opportunities also come with new challenges. Schema.org provides a domain agnostic vocabulary to describe common data entities. By design, Schema.org expects and has enabled domains of practice to extend this core vocabulary [11]. Similar to other domains of practice, research data communities have their own needs for extending the Schema.org core to describe research data and its relationships to other resources. These extensions include, for instance, specific data types and their corresponding properties pertaining to a particular domain, as well as support for persistent identifiers to meet the needs of a specific community: for example, bioschemas.org [12] for life sciences, science-on-schema.org for earth and environmental sciences [14] and CodeMeta for research software. To investigate the interoperability and usability of Schema.org for describing research data, we collected 14 crosswalks from research data schemas to Schema.org [28]; such a crosswalk is a crucial step for repositories to publish structured metadata [27]. A schema crosswalk is commonly expressed as a table showing equivalent terms across one or more data schemas. To source research data schemas, we used a survey asking participating data repositories to share any crosswalk they had, as well as gaps and challenges that they identified while creating the crosswalk. For schema providers, we used openly published crosswalks available on the Internet; in particular, we found crosswalks from DCAT, Dublin Core and ISO19115 to Schema.org. This collection of crosswalks helps us to identify and bridge gaps that research data communities encounter when they map their metadata schemas to Schema.org. This paper covers a report on the survey and an analysis of the crosswalks.
The sections below are organised as follows: we review the types of metadata schemas for research data in Section 2, present the analysis of a survey and crosswalks in Section 3 and conclude the paper with a discussion of findings in Section 4.

General and Discipline-specific Metadata
There are many metadata standards for documenting research datasets; Wallis et al. [25] analysed 9 metadata schemas for describing scientific data and synthesised 22 metadata-related goals. In general, a metadata schema should address the seven requirements for metadata schemas of all resources (abstraction, extensibility, flexibility, modularity, comprehensiveness, sufficiency, and simplicity) and the four requirements for any schema to support data interchange, retrieval, archiving and publication.
The metadata directory implemented by the RDA Metadata Standard Directory Working Group includes about 65 standards, ranging from general to extremely discipline specific [1]. General metadata schemas, for example, Data Catalogue Vocabulary (DCAT) and Dublin Core, include data properties that are common to almost all types of datasets. This general metadata can be widely adopted and easily used by metadata providers, and supports broad data discovery use cases from data seekers, regardless of their research areas.
Discipline specific metadata, for example, the Data Documentation Initiative (DDI for Social and Behavioral Science data) and the Space Physics Archive Search and Extract (SPASE for heliophysics data), usually include properties from general metadata standards, and provide additional properties and richer vocabularies to allow detailed and more granular contextual information. This enriched information increases data discovery efficiency and effectiveness for those with domain knowledge, and assists the assessment of data reusability.
It is common practice for data repositories to publish metadata for their holdings, allowing it to be harvested by aggregating metadata catalogs that offer indexing and user interfaces to support data search. Such aggregation typically involves a mapping or crosswalk between the metadata schemes or profiles used by the various contributing repositories if there isn't a schema agreed by all repositories for exchanging metadata. This landscape may change as major search engines start to harvest structured metadata that uses the standardized Schema.org vocabulary and is embedded in metadata landing pages, where it can be parsed and interpreted by machines to provide more accurate results and a richer presentation of results [4].
ISO19115-DCAT-Schema.org mapping: https://www.w3.org/2015/spatial/wiki/ISO_19115_-_DCAT_-_Schema.org_mapping
Metadata directory: http://rd-alliance.github.io/metadata-directory/standards/

Schema.org Vocabulary and Structured Metadata
Schema.org is among the most visible metadata vocabularies on the open Web, according to NISO [19]. The driving factor in the design of Schema.org was to make it easy for webmasters to publish information with a single schema for a wide range of topics that included people, places, events, products and so on [11]. Schema.org is a general schema, or a set of vocabularies; the current version (V13.0, 2021-07-07) consists of about 792 types (as RDF classes) and 1447 properties. The W3C Schema.org Community Group, which is governed by a steering group, is the main forum for collaboration on and development of the schema. New types and properties can be added if there is a community need and a supporting use case; for example, the new type 'LearningResource' was added as a subtype of 'CreativeWork' in the July 2020 release (9.0). As another example, Bioschemas, focusing on life science, has successfully incorporated many biomedical terms into the Schema.org vocabulary. The CodeMeta project has developed the CodeMeta vocabulary for the description of software; 58 out of 68 CodeMeta properties are from the existing Schema.org vocabulary, and the 10 proposed new properties are based on the analysis of crosswalks from 23 software metadata schemas, vocabularies and ontologies to Schema.org. There are also a steering group and communities that support developing conventions for usage of the data model and guidelines for consistently implementing it. For example, the Schema.org Cluster of the Earth Science Information Partners (ESIP) is working to develop best practices and to provide education and outreach to the Earth science community for web accessible structured data [14], and the Ocean InfoHub Project provides an architecture solution for a Schema.org based interoperability layer and supporting technology to allow existing and emerging ocean data and information systems to interoperate with one another.
To make data widely discoverable, many research data repositories have started to implement structured metadata markup in their metadata landing pages. As of March 26, 2020, Google dataset search had indexed 31M datasets from 4,600 domains, where the top 10 domains include data.gov, figshare.com and datacite.org. Geosciences and social sciences together accounted for 45% of the datasets, followed by biology (~15%) and other research topics [21]. Search results include those from NASA, NOAA, and many research repositories such as Harvard's Dataverse repository [20]. This approach allows for broader dissemination of metadata throughout the community to promote discoverability of datasets.
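From the harvesting side, the markup described above can be read back out of a landing page with very little machinery. The sketch below uses only the Python standard library and is a simplification under stated assumptions: it handles JSON-LD only (not Microdata or RDFa), expects a single top-level object with a plain string "@type", and the class name, function name and sample page are all hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than abort the crawl


def extract_datasets(html):
    """Return the JSON-LD objects in the page whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [r for r in parser.records
            if isinstance(r, dict) and r.get("@type") == "Dataset"]


# Hypothetical landing page used only to exercise the extractor.
LANDING_PAGE = """<html><head><title>Example dataset</title>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example dataset", "description": "A hypothetical record."}
</script></head><body><h1>Example dataset</h1></body></html>"""

datasets = extract_datasets(LANDING_PAGE)
print(datasets[0]["name"])
```

A production harvester would also need to handle "@graph" containers, list-valued "@type" and the other Schema.org serializations, which this sketch deliberately leaves out.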

Metadata Interoperability
The 'I' in 'FAIR' represents "interoperable" and is one of the four FAIR data principles [26], which apply to both data and metadata. According to this principle, metadata should use community agreed standards and vocabularies, and contain links to related information using persistent identifiers. Because there exist a number of community agreed metadata schemas for meeting specific community needs, mapping between schemas is necessary to make it possible for repositories to exchange and share metadata records [24].
There are different types of metadata interoperability; for example, Nilsson et al. [18] proposed four interoperability levels for Dublin Core Metadata. For a data repository implementing interoperable metadata, we adopt the three levels of metadata interoperability proposed by Chan and Zeng [6]:
• Schema level: efforts are focused on the elements of the schemas; common results may include crosswalks, application profiles, derived element sets, etc.;
• Record level: efforts are intended to integrate the metadata records through the crosswalk of elements; common results include converted records and new records resulting from combined values of existing records; and
• Repository level: efforts are focused on mapping values associated with particular elements; the results enable cross-collection searching.
We focus our analysis of crosswalks at the schema level: the elements of the schemas, independent of any applications. In particular, we use crosswalks to analyse the interoperability among the studied schemas. A crosswalk (or a mapping) is a chart or table (visual or virtual) that represents the semantic or technical mappings of data elements from one schema (source schema) to data elements in another schema (target schema) that have a similar function or meaning. The crosswalks guide record level interoperability, which enables repository level interoperability so that heterogeneous repositories can be searched simultaneously with a single query as if there were a single repository [2].
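The relationship between a schema-level crosswalk and a record-level conversion can be illustrated with a small sketch. It assumes the crosswalk is held as a simple lookup table from source terms to Schema.org properties; the table below is a toy example loosely based on mappings mentioned in this paper (such as 'format' to 'encodingFormat' and 'rights' to 'license'), not one of the collected crosswalks, and the function name `apply_crosswalk` is hypothetical.

```python
# Toy schema-level crosswalk: source term -> Schema.org property.
CROSSWALK = {
    "title": "name",
    "creator": "creator",
    "publisher": "publisher",
    "format": "encodingFormat",
    "rights": "license",
}


def apply_crosswalk(record, crosswalk):
    """Convert one source record; return (converted record, unmapped terms)."""
    converted, unmapped = {}, []
    for term, value in record.items():
        target = crosswalk.get(term)
        if target is None:
            unmapped.append(term)  # a gap: no equivalent in the target schema
        else:
            # Several source terms may share a target, so values are collected.
            converted.setdefault(target, []).append(value)
    return converted, unmapped


# Hypothetical source record.
source_record = {
    "title": "Example dataset",
    "creator": "A. Researcher",
    "format": "text/csv",
    "isCompiledBy": "Example Lab",
}
converted, unmapped = apply_crosswalk(source_record, CROSSWALK)
print(converted)
print(unmapped)
```

Terms that end up in the unmapped list are exactly the kind of gap that a schema-level analysis is meant to surface before records are converted in bulk.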

ANALYSIS OF MAPPINGS FROM RESEARCH DATA SCHEMAS TO SCHEMA.ORG
As discussed above, a crosswalk attempts to map equivalent or comparable metadata elements from two schemas. We acknowledge that a crosswalk developed by a specific repository or a schema development community would better reflect a proper and realistic mapping, as those repositories and communities can provide a better interpretation of their implemented metadata terms. For this reason, we launched the survey "Current practices in using schemas to describe research datasets" on 27th Feb. 2019 to gather information on how Schema.org is applied by data repositories to describe research data and related resources. We envisaged that the gathered information would help repositories and the proposed Research Metadata Schema WG understand current practices and identify commonalities, gaps and barriers in using schemas for describing and discovering research datasets.

In Section 3.1, we highlight relevant parts of the survey and indicate which schemas are adopted or implemented by participating respondents, followed by our analysis of crosswalks from the available mappings to Schema.org.

Survey on Repository's Metadata Schema and the Implementation of Schema.org
Twenty-two organisations/data repository representatives participated in the survey. One respondent failed to answer the survey questions, so that submission has been excluded from this summary. As shown in Table 1, six of the 21 responses are from general repositories covering all domains: four of them are either based on or directly adopt the DataCite schema; one uses DCAT-AP, an application profile of DCAT; and the other follows the Registry Interchange Format - Collections and Services (RIF-CS) schema, which is a profile of ISO 2146 that was originally developed for library registry services and is now used as a data interchange format.
Among the 13 disciplinary repositories or projects, five are from the domain of Geoscience and Arctic Research and have adopted the ISO19115 schema or an ISO19115 compatible schema (EML). ISO19115 is an internationally adopted schema for describing geographic information and services. ISO19115 provides information about the identification, the extent, the quality, the spatial and temporal schema, the spatial reference, and the distribution of digital geographic data. One Social and Behavioural Science repository adopted the international standard 'Data Documentation Initiative' (DDI) for describing the data produced by surveys and other observational methods in the social, behavioural, economic, and health sciences. The remaining nine disciplinary repositories and the two "other" repositories adopted community developed profiles or schemas. Most of them are compatible or interoperable with international standards; for example, the cultural heritage datasets in the 'Other' category define a metadata profile based on Schema.org, DCAT and VoID, while the European Clinical Research Infrastructure Network (ECRIN) schema is an extension of DataCite [5], and GigaDB from the Life Sciences and Biomedical domain can export metadata in general purpose schemas such as DataCite and Schema.org.
We observe the following two trends from the survey responses: 2) Repositories, whether general or discipline specific, tend to use a general-purpose schema but also support domain specific standards or vocabularies. For example, the Dataverse project supports general citation metadata compatible with the DataCite metadata schema [8] and DCMI metadata terms, but also a suite of domain specific metadata for Geoscience, Social Science and Humanities. RIF-CS supports subject vocabularies from a range of disciplines to satisfy a range of data discovery needs. This observation also applies to discipline specific repositories; for example, the DAta Tag Suite (DATS), a data description model adopted by DataMed, has both core elements and additional elements: the core elements are generic and applicable to any type of dataset, while the additional elements are specific to the life, environmental and biomedical science domains [22].
The observed trend is that general repositories adopt general purpose standards that support data discovery use cases at a high level for data searches across domains. Domain repositories adopt schemas that are compatible with general metadata profiles for metadata interoperability, but add elements to support a range of more granular disciplinary queries for more precise data discovery within a domain.

Analysis of the Mappings
We collected the 14 crosswalks to Schema.org through the survey and from other publicly available crosswalks. Since the survey results were collected, some crosswalks may have been updated (e.g. the DCAT to Schema.org alignment) and some schemas (including Schema.org) may have been revised with additional properties. In October 2021, the first author cross checked all crosswalks against publicly available crosswalks. These included, for example, ISO-19115 (from a W3C group and Habermann [13]), the DCAT alignment with Schema.org, the DataCite Schema to Dublin Core mapping, and the CodeMeta crosswalks. During the writing of this paper, the second author also added a mapping from ISO19115-1 to Schema.org. For the purposes of this analysis we used this subsequent mapping, as it covers more elements than the ISO-19115-1:2014 to Schema.org mapping that we originally collected. This resulted in 385 properties from the 14 crosswalks being mapped to the 40 Schema.org properties.
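The summary figures above (total mapped source properties, distinct target properties, and the number of crosswalks that map to each Schema.org property) can be derived mechanically once the crosswalks are collated. The following is a minimal sketch under the assumption that each crosswalk is a dictionary from source term to Schema.org property; the three toy crosswalks and the function name `summarise` are hypothetical and do not reproduce the collected data.

```python
from collections import Counter

# Toy data standing in for the collated crosswalks.
CROSSWALKS = {
    "schema_a": {"title": "name", "format": "encodingFormat", "rights": "license"},
    "schema_b": {"title": "name", "mediaType": "encodingFormat"},
    "schema_c": {"name": "name", "license": "license", "issued": "datePublished"},
}


def summarise(crosswalks):
    """Return (total mapped source terms, crosswalk count per target property)."""
    total_source_terms = sum(len(mapping) for mapping in crosswalks.values())
    per_target = Counter()
    for mapping in crosswalks.values():
        # Use a set so each crosswalk is counted once per target property,
        # even when several of its terms map to the same property.
        per_target.update(set(mapping.values()))
    return total_source_terms, per_target


total, per_target = summarise(CROSSWALKS)
print(total, len(per_target))
for prop, count in per_target.most_common():
    print(prop, count)
```

Counting each crosswalk once per target keeps the per-property figure comparable to a "number of crosswalks that have a term mapped" tally, rather than a count of individual source terms.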

Categories of mapped terms
We classified the 40 mapped Schema.org properties/terms into 6 categories from the NISO (2004) metadata classification model. As shown in Table 2, we use three top level categories: descriptive metadata, administrative metadata and structural metadata; administrative metadata is further classified into technical metadata, rights metadata and preservation metadata. We summarise the analysis of mapped terms as follows:
• Descriptive metadata: Most of the mapped terms (17 out of 40) fall into the descriptive metadata category. The mapped descriptive terms cover six of the seven recommended citation metadata from the DataCite guide:

Creator (PublicationYear): Title. Version. Publisher. (resourceTypeGeneral). Identifier
The citation term "resourceTypeGeneral" (recommended) is the only term not explicitly included in the mapping, and we infer it to be of the type dataset, since we asked for and collected all mappings from schemas for describing data. All 14 source schemas include the 6 mapped citation metadata, except for the terms "version" and "publisher", which occur in 13 out of 14 source schemas.
• Administrative metadata:
Technical metadata: 'encodingFormat' and 'contentSize' are the two technical metadata terms mapped by the majority of the source schemas. The mappings are consistent: the term 'format' is used by 9 out of 13 source schemas, the exact term 'encodingFormat' by one source schema, and the alternate terms 'resource file type', 'mediaType' and 'distributionFormat' each by one source schema.
Rights metadata: There are three mapped terms in rights metadata. The property "license" has a mapping from 12 source schemas; however, five of them use the original term "rights". The term "rights" is the only one of the 15 Dublin Core terms (http://purl.org/dc/elements/1.1) that doesn't have an exact mapping in Schema.org. In Dublin Core, "rights" is defined as "information about rights held in and over the resource", while "license" is a subproperty of "rights" and has the definition "a legal document giving official permission to do something with the resource". According to this definition, the closest semantically matched term in Schema.org is "copyrightHolder" (https://schema.org/copyrightHolder): the party holding the legal copyright to the CreativeWork.
Preservation metadata: There are 11 mapped preservation metadata terms: five of them are dates about data creation, modification, availability and copyright; another five are about data access method or location; and one is about data (observation/process/reprocess) frequency. The mappings of the dates and the access methods are consistent, except that the term 'expectedArrivalFrom/expectedArrivalUntil' is mapped from four different terms: 'distribution date', 'released date', 'available', and 'embargo'.

We also examine how many mapped terms are recommended by the Google dataset search guide. The Google guide recommends 23 properties (in italics in Table 2) to be included in structured data. Three of them are required terms ("name", "description", "distribution.contentUrl"), while the other 20 are recommended. The 23 terms are distributed among all six NISO metadata categories, and are mapped by more than half of the source schemas, especially those falling into the descriptive metadata category. Note that this analysis is at the schema level, and does not take into account whether a repository has implemented a property value at the metadata record level. Benjelloun et al. [3] from Google Research analysed the percentage of datasets in their index with specific properties, showing that the properties "name" and "description" both have 100% coverage, followed by "provider" (84.59%), "keywords" (80%) and "URL" (68.08%), while all other recommended properties had less than 50% coverage (e.g. "authors": 14.12%, "isAccessibleForFree": 3.04%). This indicates that even if a property is mappable at the schema level, a repository may decide not to implement that mapping or not to populate that property with a value. The reason, most likely, is that the repository does not have sufficient records requiring that property to warrant its implementation.

Table 2. Classification of the mapped Schema.org properties or terms.
NISO Metadata Type
Schema.org properties (The numbers in brackets indicate the number of crosswalks that have a term mapped to the schema.org property. Properties in italics are those recommended by the Google dataset search guide.)
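The distinction between schema-level mappability and record-level population can be checked with a simple coverage calculation of the kind reported by Benjelloun et al. [3]. The sketch below is illustrative only: the function name `property_coverage`, the four records and the chosen properties are hypothetical, and it treats empty strings, empty lists and empty objects as unpopulated.

```python
def property_coverage(records, properties):
    """Percentage of records that carry a non-empty value for each property."""
    coverage = {}
    for prop in properties:
        populated = sum(
            1 for record in records
            if record.get(prop) not in (None, "", [], {})
        )
        coverage[prop] = 100.0 * populated / len(records) if records else 0.0
    return coverage


# Hypothetical harvested records.
records = [
    {"name": "A", "description": "d", "keywords": ["x"],
     "license": "https://creativecommons.org/licenses/by/4.0/"},
    {"name": "B", "description": "d", "keywords": ["y"]},
    {"name": "C", "description": "d", "keywords": []},
    {"name": "D", "description": "d"},
]

coverage = property_coverage(records, ["name", "description", "keywords", "license"])
for prop, percent in coverage.items():
    print(f"{prop}: {percent:.2f}%")
```

Running such a check against a repository's own published markup is one way to see which schema-level mappings are actually exercised in practice.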

Gap analysis
From the survey, there are structural metadata elements that are recommended by source schemas but do not have mappings to Schema.org. These include elements that clearly describe:
• Relationships between datasets, for example: hasVersion, isNewVersionOf, isContinuedBy, isOriginalFromOf, isDerivedFrom (from DataCite);
• Relationships between a dataset and a responsible agent, for example: hasFunder, isFundedby, isCompiledBy, isOwnedBy, hasPrincipleInvestigator;
• Relationships between a dataset and the activity by which it was collected, for example: dataset -> Cruise, dataset -> study design; and
• Relationships between a dataset and the instrument/software/other services used to produce the data, for example: isProducedBy/produces, isPresentedBy/presents, isOperatedOnBy/operatedOn, isAnnotatedBy/annotate (from RIF-CS).
These structural, relation metadata properties are more granular than the PROV-O Ontology [16]. These gaps reflect both the difference between documentation needed to describe scientific datasets for research and that for more commercial data published on the Web (e.g. movies, businesses, product catalogs, etc.), and the difference between general data schemas and discipline specific schemas.
From information gathered from the survey and through inspection of the source schemas, we observe that:
• Controlled vocabularies, thesauri or code lists are used to specify property values for various elements in the source schemas. Schema.org doesn't offer any vocabularies for property values, but the serialization of Schema.org allows it to incorporate external vocabularies. For example, when populating the property schema:keywords or schema:about, one can specify a text string (either from a vocabulary or not) that can facilitate discovery but not interoperability, while an optimal way is to specify a URI reference to a term from a controlled vocabulary. There is a proposal to add a DefinedTerm element (https://schema.org/DefinedTerm) that could be substituted for plain text values to provide a URI along with the term, but this has not, as yet, been formally adopted into schema.org.
• A controlled vocabulary is a set of pre-defined, authorised terms that are used to specify a property value so that consistency can be achieved within and across repositories. A controlled vocabulary can be a standard controlled by an authoritative organisation (for example, Library of Congress Subject Headings, or the Australian and New Zealand Standard Research Classification, ANZSRC), a locally defined subset of a standard vocabulary, or a locally defined vocabulary [23]. Ideally, terms in the vocabulary have dereferenceable URIs for unambiguous identification. This requires a controlled vocabulary to be openly accessible, referenceable and identifiable with a unique and persistent identifier for the vocabulary and for each term in the vocabulary [7]. Research Vocabularies Australia is an example of such a service for finding, accessing and reusing vocabularies.
• There are semantically equivalent properties which are named differently among the schemas. For example, schema:variableMeasured, dats:dimensions, spase:parameter and sosa:observedProperty (DCATv3) all have the same meaning, related to observed or measured data variables. Likewise, schema:name has the same meaning as the 'title' property in other schemas. Thus, when developing a crosswalk it is necessary to check how each property is defined in each schema, and how it is actually used in the implemented examples. For example, schema:isBasedOn (a resource from which this work is derived or from which it is a modification or adaptation) can be mapped from datacite:IsOriginalFrom, datacite:isSourceOf, datacite:isDerivedFrom and datacite:isVersionOf.
• It is also inevitable that many terms from one schema are mapped to one term in Schema.org, because Schema.org is a general schema and simplicity is one of its design rules. For example, the granular relations from RIF-CS (collection/relatedInfo/isVersionOf, collection/relatedInfo/isEnrichedBy, collection/relatedInfo/isDerivedFrom, collection/relatedInfo/hasValueAddedBy) and from DataCite (isOriginalFromOf, isSourceOf, isDerivedFrom) can all be mapped to schema:isBasedOn (a resource from which this work is derived or from which it is a modification or adaptation).
• Rich granular information may be lost where a 'many to one' mapping occurs. Whether this loss of information is significant depends on the purpose of a mapping and how this granular information is utilised by a data discovery system. For example, if a use case is to make a dataset widely findable from the web, then adding more descriptive metadata is more important than having a detailed relation; if a use case is to track the history or provenance of a dataset, then this granular relation information is important to have. These two use cases can complement each other: a general repository can have descriptive metadata for discovery and include links so that when a user finds a potentially relevant dataset, they can follow a link to metadata with more granular contextual information to assess the fitness of the dataset for the intended purpose.
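The 'many to one' situation described in the observations above can be detected by inverting a crosswalk. The sketch below is a minimal illustration: the prefixed term names follow the examples given in the text for schema:isBasedOn, the 'datacite:title' entry is added purely as a contrasting one-to-one case, and the function name `invert` is hypothetical.

```python
from collections import defaultdict

# Source terms follow the examples in the text; the last entry is an
# illustrative one-to-one mapping added for contrast.
CROSSWALK = {
    "rif-cs:isVersionOf": "schema:isBasedOn",
    "rif-cs:isEnrichedBy": "schema:isBasedOn",
    "rif-cs:isDerivedFrom": "schema:isBasedOn",
    "datacite:isSourceOf": "schema:isBasedOn",
    "datacite:isDerivedFrom": "schema:isBasedOn",
    "datacite:title": "schema:name",
}


def invert(crosswalk):
    """Group source terms by the target property they map to."""
    inverse = defaultdict(list)
    for source, target in crosswalk.items():
        inverse[target].append(source)
    return dict(inverse)


inverse = invert(CROSSWALK)
many_to_one = {t: s for t, s in inverse.items() if len(s) > 1}

for target, sources in many_to_one.items():
    # Given only the target property, any of these source terms could have
    # produced it, so the original relation cannot be recovered.
    print(target, "<-", sorted(sources))
```

Any target listed in `many_to_one` marks a place where a round trip back to the source schema is ambiguous, which is the information loss discussed above.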

Visualization Tool for Facilitating Mapping
To make the crosswalks more useful for analysis, and for those who are going to develop a crosswalk for their own schema, the World Data System-International Technology Office has developed a tool to visualise the above 14 crosswalks (and one from the CodeMeta vocabulary to Schema.org), available at https://rd-alliance.github.io/Research-Metadata-Schemas-WG. The tool provides a user-friendly display of the collected crosswalks. By utilising the visualisation tool, crosswalk developers across domains can reference existing mappings, repeating the same types of matches between the Schema.org terms and similar elements found in different metadata schemas, regardless of whether the metadata format is standard or bespoke. The visualization tool is intended as a prototype service for the research data management community, in support of metadata managers who are investigating options for including schema.org markup in existing well formed metadata. The visualizations include various tables, a Sankey diagram, and a Gap Analysis, to support different views for crosswalk inspection. For example, Figure 2 can help to check, given a property from Schema.org, what its corresponding element is in other schemas; and Figure 3 shows these mappings in a 'Filter Table', where a parent type is also shown for properties from Schema.org.
Figure 2. This filter Sankey diagram allows a user to choose a schema.org property and see which crosswalked term is connected to which metadata standard. From left to right the labels go schema.org properties, crosswalked terms, then metadata standards.
Figure 3. This table is a free text search over both metadata terms and schema.org properties. Wildcard searches are not supported but partial searches are. For example, a search for "publish*" will not return any records, but the search for "publish" will return "datePublished", "publisher", and "Dataset Publisher."

DISCUSSION AND CONCLUSION
In summary, through the analysis of the 14 crosswalks, we find that most descriptive metadata are interoperable among the schemas and can be mapped to corresponding Schema.org properties. The most inconsistent mapping is the 'Rights' metadata, which requires a clearer and more consistent definition among the schemas of the terms Rights, License, Copyright Holders, and Data Use Agreement or Conditions, to name a few. The largest gap exists in the Structural metadata elements: first, there is a lack of consistency among the source metadata schemas themselves; and second, there are no rich relation terms in Schema.org. As Structural metadata is important in the linked-data world, the data community needs to agree on what Structural metadata from disciplinary schemas could be generalised and applied to all types of data. There also exists a gap in controlled vocabularies to specify various property values, for example, observational variables [17] and a subject classification vocabulary (e.g. Library of Congress Subject Headings) for populating Keyword or Subject elements to describe a dataset.
The gaps are due to the Schema.org design principle of starting simple and increasing complexity when a community need arises [11]. This challenge is complicated by the fact that relatively simple, domain independent vocabularies satisfy the most common web data search needs, whereas the research community tends to use more granular and rigorous schemas and controlled vocabularies in describing and cataloguing research datasets. Lagoze [15] argued that attempting to intermix a single descriptive vocabulary for coarse granularity queries with the complex semantics needed to enable 'drill-down' into more granular queries leads to metadata sets that are not ideally suited for either purpose; Lagoze advocated establishing frameworks for the creation of more complex descriptions that can coexist with similar ones as separate packages.
Like any other schema or vocabulary, Schema.org is evolving. To address the above gaps, the terms schema:DefinedTerm and schema:inDefinedTermSet were introduced as pending changes in Schema.org V12.0, and schema:hasDefinedTerm in Version 13.0, to enable the markup of external property names and pre-defined property values from discipline specific vocabularies. This approach balances the simplicity of a general schema and the complexity of disciplinary schemas by following some of the principles that guide the development of metadata schemas, especially the modularity principle and the extensibility principle [9]. The recent trend, as observed in the survey and from the development of application profiles by domains (e.g. DCAT-Application Profile and Bioschemas profiles), also follows Duval's metadata development principles.
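As an illustration of how these terms could carry a controlled-vocabulary value, the sketch below marks up one keyword as plain text and one as a schema:DefinedTerm that points to a term URI and its vocabulary. It is a sketch under stated assumptions: the helper name `keyword_as_defined_term`, the dataset, and the vocabulary and term URIs (under example.org) are all hypothetical placeholders rather than real identifiers.

```python
import json


def keyword_as_defined_term(label, term_uri, vocabulary_name, vocabulary_uri):
    """Describe a keyword as a DefinedTerm linked to its controlled vocabulary."""
    return {
        "@type": "DefinedTerm",
        "name": label,
        "url": term_uri,
        "inDefinedTermSet": {
            "@type": "DefinedTermSet",
            "name": vocabulary_name,
            "url": vocabulary_uri,
        },
    }


# Hypothetical dataset with one free-text keyword and one vocabulary term.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    "description": "A hypothetical record.",
    "keywords": [
        "oceanography",  # plain text: aids discovery, not interoperability
        keyword_as_defined_term(
            label="Oceanography",
            term_uri="https://vocab.example.org/subjects/oceanography",
            vocabulary_name="Example subject classification",
            vocabulary_uri="https://vocab.example.org/subjects",
        ),
    ],
}
print(json.dumps(dataset, indent=2))
```

A consumer that recognises the term URI can match records across repositories that label the same concept differently, which a free-text keyword alone does not allow.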
In summary, we present the analysis of crosswalks to Schema.org from a cross section of domain implemented metadata schemas. The analysis is limited by the survey and by the conceptual mapping, which focuses on the meaning of the elements or properties when mapping between two schemas. This analysis could be extended to include implemented marked up metadata across repositories, to obtain a more comprehensive picture of the interoperability of published structured metadata on the Web.
Schema.org releases: https://schema.org/docs/releases.html