The data set knowledge graph: Creating a linked open data source for data sets

Abstract Several scholarly knowledge graphs have been proposed to model and analyze the academic landscape. However, although the number of data sets has increased remarkably in recent years, these knowledge graphs do not primarily focus on data sets but rather on associated entities such as publications. Moreover, publicly available data set knowledge graphs do not systematically contain links to the publications in which the data sets are mentioned. In this paper, we present an approach for constructing an RDF knowledge graph that fulfills these criteria. Our data set knowledge graph, DSKG, is publicly available at http://dskg.org and contains metadata of data sets for all scientific disciplines. To ensure high data quality of the DSKG, we first identify suitable raw data set collections for creating the DSKG. We then establish links between the data sets and publications modeled in the Microsoft Academic Knowledge Graph that mention these data sets. As the author names of data sets can be ambiguous, we develop and evaluate a method for author name disambiguation and enrich the knowledge graph with links to ORCID. Overall, our knowledge graph contains more than 2,000 data sets with associated properties, as well as 814,000 links to 635,000 scientific publications. It can be used for a variety of scenarios, facilitating advanced data set search systems and new ways of measuring and awarding the provisioning of data sets.


INTRODUCTION
The number of data sets available on the web has increased steadily. Google Dataset Search (Brickley, Burgess, & Noy, 2019), for instance, covered more than 6 million data sets in September 2018 but over 28 million data sets by March 2020 (Benjelloun, Chen, & Noy, 2020). In addition, data portals and registration services, such as OpenAIRE with Zenodo (https://zenodo.org) as well as re3data (http://re3data.org/), have seen a sharp increase in the number of indexed data sets. Furthermore, scientific communities increasingly require researchers to publish their research data according to the FAIR principles (Wilkinson et al., 2016), fostering the provision and reuse of data sets and their metadata. Having access to and using high-quality, rich, and interoperable metadata of data sets is therefore essential in many scenarios and will continue to gain in importance.
At present, metadata about data sets are collected in diverse ways: (1) Web crawlers exist that search the web for data sets (Brickley et al., 2019). (2) There are open data portals with collections or catalogs that index metadata and refer to the data set files. (3) There are freely accessible databases that were created and expanded jointly by users (Neumaier, Polleres, Steyskal, & Umbrich, 2017). Aside from the data sources, the metadata about data sets are modeled by means of various standards (e.g., Schema.org and DCAT (Brickley et al., 2019), the DataCite metadata schema (Manghi, Bardi, et al., 2019), and the CKAN and Socrata metadata schemas (Neumaier, Umbrich, & Polleres, 2017)) and with varying degrees of quality (see Section 3). However, using Semantic Web technologies such as the Resource Description Framework (RDF) (W3C, 2014), which allow creating knowledge graphs based on a standardized data model and format, has turned out to be particularly helpful for modeling metadata and linking it to existing data sources on the web (Latif, Limani, & Tochtermann, 2021; Neumaier, Umbrich, & Polleres, 2017; Vahdati, Karim, Huang, & Lange, 2015). Specifically, in the academic field, several large knowledge graphs have been proposed and are freely available. For instance, the Microsoft Academic Knowledge Graph (MAKG) (Färber, 2019) contains 8 billion triples about publications and associated entities, such as authors, venues, and affiliations. Wikidata (https://wikidata.org), OpenCitations (Peroni & Shotton, 2020), and the Open Research Knowledge Graph (ORKG) (Jaradeh et al., 2019) are further noteworthy knowledge graphs.
Although the existing scholarly knowledge graphs model data sets to some degree, they do not primarily focus on data sets but rather on associated entities such as publications (Färber, 2019) (see Section 2). Moreover, they are often not publicly available (Brickley et al., 2019) and do not contain links to the publications in which the data sets are mentioned. We argue that a knowledge graph that fulfills the following criteria is highly beneficial for scholarly data mining: 1. The knowledge graph is publicly available and integrated into the Linked Open Data cloud. This means that the knowledge graph is based on RDF (W3C, 2014) as a widely used data model, facilitating data interoperability and data integration efforts, and that it is interlinked to other data sources, following the FAIR data principles (Wilkinson et al., 2016). 2. The knowledge graph is of high quality with respect to accuracy and coverage (i.e., a high number of provided properties). 3. All data sets modeled in the knowledge graph are linked to the scientific publications in which they are mentioned. Linking data sets to publications enables novel ways of knowledge discovery and scientific impact quantification (Baglioni, Manghi, & Mannocci, 2020). Specifically, the MAKG (Färber, 2019), modeling rich metadata about millions of publications from all scientific fields, can be used as a link target.
In this paper, we present an approach for constructing an RDF knowledge graph that fulfills these criteria. Our data set knowledge graph, DSKG, is publicly available at http://dskg.org and http://doi.org/10.5281/zenodo.4478921, and contains metadata of data sets for the various scientific disciplines. To ensure a high data quality (e.g., high accuracy of statements and high coverage of used properties) of the final knowledge graph, we first analyzed existing data set metadata collections and identified those that are particularly suitable for building a data set knowledge graph. Furthermore, we only considered data sets (together with their metadata) that are mentioned in scientific publications. To this end, we parsed all 146 million publications' abstracts and all 241.5 million citation contexts available in the MAKG (Färber, 2019). Data sets mentioned in the abstracts or citation contexts of these publications are an essential aspect of these papers, and we therefore link the data sets to the publications. As the author names of data sets can be ambiguous and our knowledge graph requires unique identifiers (URIs) for each entity, we developed and evaluated a method for author name disambiguation, which is, to the best of our knowledge, the first one considering data set authors. To ensure that the knowledge graph is well integrated into the Linked Open Data cloud, we enrich the knowledge graph with links to ORCID, Wikidata, and the MAKG. Last but not least, we provide data set entity embeddings for machine learning tasks. The embeddings were created by applying RDF2Vec (Ristoski, Rosati, Di Noia, De Leone, & Paulheim, 2019) to our knowledge graph. Overall, our knowledge graph contains 2,208 data sets and 813,551 links to scientific publications. We will update the knowledge graph quarterly via a semi-automatic process.

Quantitative Science Studies
The DSKG can be used for a variety of scenarios regarding data consumption and data analysis: (1) The DSKG can be used as a database and evaluation basis for new applications, particularly in the context of data set search. For instance, with our preliminary online system http://datasetsearch.net, we show how data sets can be retrieved based on scientific problem descriptions. To this end, we utilized the interlinkage between data sets and publications. (2) The DSKG allows for easier data integration through the use of a standard RDF vocabulary and by linking resources to other data sources. (3) The DSKG facilitates new ways of scholarly data analysis, such as determining the scientific influence and impact of data sets (Färber, Albers, & Schüber, 2021) ("h-index of data sets"), authors (Yi, Ludo, & Yong, 2021), and affiliations (Lin, Zhu, Lu, Shi, & Niu, 2021).
Overall, our main contributions can be summarized as follows: We analyze the metadata about data sets from several sources with respect to data quality aspects. We link data sets to the scientific publications in which they are mentioned, resulting in 813,551 links to 634,803 publications in the MAKG. We implement and evaluate a method for author name disambiguation based on our data set knowledge graph. We link our data set knowledge graph to other Linked Open Data sources (namely, ORCID, Wikidata, and the MAKG). We provide our data set knowledge graph with a SPARQL endpoint, resolvable URIs, and entity embeddings at http://dskg.org to the public and also share it at http://doi.org/10.5281/zenodo.4478921. Our source code for generating the data set knowledge graph is available at https://github.com/michaelfaerber/data-set-knowledge-graph.
Our paper is structured as follows: We first examine related work and delimit it from our work (see Section 2). In Section 3, we analyze existing data set metadata collections, based on which we select the sources for our data set knowledge graph. Section 4 presents the approach for creating the knowledge graph for data sets. We provide statistical key figures of the knowledge graph and an evaluation of our author name disambiguation in Section 5. In Section 6, we show possible application scenarios of the knowledge graph. Finally, in Section 7, we summarize our work.

RELATED WORK
In recent years, a field of research has arisen around the search for data sets (Chapman et al., 2020). Knowledge graphs play a central role here, as they facilitate semantic search and recommender systems. In the following, we first outline schemas for modeling data sets' metadata. We then describe existing approaches for data set knowledge graph creation and conclude with an overview of scholarly knowledge graphs in general.

Standards for Describing Data Set Metadata
In order to fully exploit the potential of knowledge graphs, interoperability between the knowledge graphs needs to be established (Manola et al., 2019). To achieve this, widespread RDF standard vocabularies for data set metadata, and mappings between them, already exist. For example, there is a recommended mapping between the vocabularies DCAT and Schema.org (W3C, 2020). The most important vocabularies in this paper are the following:
VoID. VoID is an RDF vocabulary for the representation of metadata concerning linked RDF data sets. It is therefore not entirely suitable for modeling metadata from open data portals, as these usually provide resources in different formats (Assaf, Troncy, & Senart, 2015; Neumaier, Umbrich, & Polleres, 2017).
DCAT. DCAT is an RDF vocabulary for the representation of metadata of data sets and data services. DCAT Version 2 was published on February 4, 2020, as a W3C recommendation (W3C, 2020). The aim of the W3C is to use DCAT to solve the problem of heterogeneous metadata schemas in data portals (Neumaier, Umbrich, & Polleres, 2017). The use of DCAT facilitates the interoperability of data set metadata from different data portals. This should make it easier for applications such as search engines to use metadata from different sources. The data set metadata can be published decentrally on the web and still be used for a common search (Assaf et al., 2015).
Schema.org. Schema.org is a collection of schemas for providing structured data on the web. Schema.org's vocabulary can be used with many different encodings, including RDFa, Microdata, and JSON-LD. This structured data enables many applications, such as search engines, to understand the information contained in web pages. This improves the display of search results in web browsers and makes it easier to find relevant information (Assaf et al., 2015). Schema.org covers many areas. In the context of this work, the data set schema (http://schema.org/Dataset), created on the basis of W3C DCAT, is particularly relevant (Brickley et al., 2019).
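To illustrate, a Schema.org Dataset annotation of the kind harvested by search engines might look as follows. It is expressed here as JSON-LD built in Python; the data set name, URL, and creator are hypothetical examples, not taken from any real collection.

```python
import json

# Hypothetical Schema.org Dataset annotation (JSON-LD), the kind of
# markup that Google Dataset Search collects from web pages.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example Climate Observations",              # hypothetical name
    "description": "Daily temperature readings from example stations.",
    "url": "https://example.org/climate-data",           # hypothetical URL
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Person", "name": "Jane Doe"},  # hypothetical creator
}

print(json.dumps(dataset_jsonld, indent=2))
```

A crawler that finds such a block embedded in a web page can map `name`, `description`, `url`, and `license` to the corresponding DCAT terms.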

Existing Data Set Knowledge Graphs
Table 1 provides an overview of existing knowledge graphs modeling data sets. Overall, one peculiarity of our knowledge graph is that each data set is linked to at least one scientific publication. As a result, the knowledge graph can be used as a data and evaluation basis for new research approaches that have this requirement, facilitating sophisticated data set search.
Existing data set knowledge graphs that combine data from several sources
In 2018, Google launched Google Dataset Search (https://datasetsearch.research.google.com/) (Brickley et al., 2019), which is based on collecting metadata descriptions of data sets in Schema.org or W3C DCAT from the web. The standardized metadata are processed into a common graph data model, which essentially corresponds to RDF triples. By crawling metadata from the full web, it is inevitable that the metadata corpus will contain a significant number of data set representations with incorrect metadata. For example, there are websites that use http://schema.org/Dataset but actually do not contain any metadata for a data set. In contrast to the DSKG, the Google Dataset Knowledge Graph is not part of the Linked Open Data cloud and not publicly available (Benjelloun et al., 2020; Brickley et al., 2019; Canino, 2019).
Open Data Portal Watch (Neumaier, Umbrich, & Polleres, 2016) collects data set metadata from more than 260 freely accessible data portals and focuses on open government data (OGD). The Open Data Portal Watch framework maps the metadata to the standard DCAT vocabulary, following a fixed scheme for the metadata standards of the individual data portals. This creates a uniform representation of the metadata. The framework also performs a quality assessment of the metadata. The unified DCAT metadata and their respective quality assessments are available via a SPARQL endpoint. However, the knowledge graph contains hardly any links to external knowledge graphs.

Quantitative Science Studies
DataMed (Ohno-Machado et al., 2017) contains metadata collected from 76 data portals in the field of the life sciences. The collected metadata is transformed into the uniform DATS schema (Sansone et al., 2017) and used for the DataMed data set search. The DATS core schema contains core elements that can be applied to any type of data set, as well as advanced elements specifically designed for the field of life sciences. The modeled data sets are linked to publications, software, and data portals. DATS can be mapped to Schema.org elements (Sansone et al., 2017). In contrast to the DSKG, DataMed does not cover all scientific disciplines but is limited to the life sciences.
Existing data set knowledge graphs that use data from one source
Ojo and Sennaike (2020) propose an approach to constructing a knowledge graph based on the metadata of an open data catalogue. The edges between the data sets of the knowledge graph represent the similarity of the data sets. The similarities between the data sets are computed using their metadata and the SOM algorithm (Sennaike et al., 2017). The knowledge graph is used to enhance the search and recommendation for data sets within a portal. It contains only the 205 data sets of the Dublin City Council (DubLinked) instance of the CKAN platform and has no links to other data on the web. Since no common standard vocabulary is used, the interoperability of the knowledge graph is poor.
Younsi Dahbi et al. (2020) present an approach to constructing a knowledge graph on the basis of freely accessible data from the public health sector. The metadata is transformed into RDF. Established vocabularies and schemas, such as DCAT, are reused and expanded with new properties. In addition, the authors interpret the contents of the data sets and generate RDF data from them, expressed in a schema specially adapted to the public health sector. Thus, both the metadata and the contents of the data sets are represented as a knowledge graph using a domain-specific schema. J. Wang et al. (2017) use Schema.org to model research graph data. In particular, the knowledge graph contains data sets, researchers, and scientific publications. The original data of the research graph is not described in a uniform vocabulary that would ensure interoperability of the data. By using Schema.org, the data can be made available semantically as linked data.

Dataset Metadata Collections
Aside from (RDF) knowledge graphs, metadata about data sets have been modeled and provided in various ways. First of all, we can mention initiatives for research data management (RDM), such as DataCite with re3data (https://datacite.org/re3data.html) and the Research Data Alliance (http://rd-alliance.org/), promoting the exchange and reuse of research data sets on an international level. The DataCite Metadata Working Group has published the DataCite metadata schema for the publication and citation of research data (Group et al., 2017). Several projects at the national and EU level complement the RDM landscape. Particularly noteworthy in this context is the German National Research Data Infrastructure (NFDI), which is an effort to fund consortia regarding research data management with up to 85 million euro per year. Furthermore, we can refer to the Generic Research Data Infrastructure (GeRDI) (https://www.gerdi-project.eu/), a project funded from 2016 to 2019 to provide a generic open software platform connecting heterogeneous research data repositories, facilitating interdisciplinary and FAIR research data management. Last but not least, during the time of writing this paper, Google Research published a data set on Kaggle with data set metadata derived from Schema.org (https://www.kaggle.com/googleai/dataset-search-metadata-for-datasets). A similar metadata data set based on Schema.org annotations is considered in our analysis in Section 3.

ANALYSIS OF THE DATA SETS
In this section, we analyze data set collections that contain metadata about data sets. We thereby only consider data sets that have values for the basic properties title and description (Benjelloun et al., 2020). Our analysis results are used to assess which data sets are suitable for building a data set knowledge graph that can be used for a variety of use cases, such as data set recommendation.
We came up with the following data sources as being available and relevant for building the knowledge graph.
1. Wikidata. Wikidata is a widely used, cross-domain knowledge graph edited by the crowd. It contains instances of data sets modeled by various classes. A list of all relevant classes that represent data sets can be found in our online repository https://github.com/michaelfaerber/data-set-knowledge-graph. The instances of the classes and their properties can be accessed via semantic queries and Wikidata's publicly available SPARQL endpoint. At the time of writing, Wikidata contains 4,209 data set instances.
2. OpenAIRE. We also consider a subset of the OpenAIRE Research Graph Dump (Manghi, Atzori, et al., 2019).
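A query against Wikidata's public SPARQL endpoint can be sketched as follows. This is an illustrative Python snippet, not the authors' actual code; the class wd:Q1172284 ("data set") stands in for the full list of relevant classes, which is given in the repository mentioned above.

```python
from urllib.parse import urlencode

# Illustrative SPARQL query for data set instances in Wikidata.
# wd:Q1172284 is assumed here as an example class; the authors'
# repository lists all classes they actually use.
SPARQL_QUERY = """
SELECT ?dataset ?datasetLabel WHERE {
  ?dataset wdt:P31 wd:Q1172284 .   # instance of "data set"
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

ENDPOINT = "https://query.wikidata.org/sparql"

def build_request_url(query):
    """Build a GET URL for the Wikidata SPARQL endpoint (JSON results)."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

print(build_request_url(SPARQL_QUERY)[:60])
```

The resulting URL can be fetched with any HTTP client; the JSON response contains one binding per data set instance.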

Degree of Filling of the Attributes
The data set collections contain data set metadata of varying quality. In the following, the metadata are evaluated according to their coverage of different information domains. We evaluate the coverage of the information domains based on the Dublin Core Metadata Element Set (DCMES), because the DCAT vocabulary makes extensive use of terms from Dublin Core and the DCMES information domains provide important information about data sets (W3C, 2020). DCMES describes 15 core fields that provide essential metadata about resources (NISO, 2004).
The 15 core fields can be summarized in six overarching information domains:
Date: For each data set, we store the creation date, modification date, embargo period, and similar dates. In the DCMES, this information is stored in the core field date.
People and Organizations: We can model the names of the persons or organizations who were involved in the creation or publication of the resource. In the DCMES, this information is represented by the core fields creator, publisher, and contributor.
Description of the content: In the DCMES, the information about the resource and its content is modeled by means of the core fields title, description, subject, and coverage.
Technical data: In the DCMES, the information concerning the technical nature of the data is modeled in the core fields format, type, and language.
ID: In the DCMES, unique identifiers for resources and web links are stored in the core field identifier.

Rights: In the DCMES, information regarding the property rights related to the resources is stored in the core field rights.
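This grouping can be captured in a small lookup table. The following Python sketch encodes only the assignments stated above; the two remaining DCMES elements (source and relation) are not assigned to any of the six domains in the text and are therefore left out.

```python
# Mapping of DCMES core fields to the six overarching information
# domains, following the assignments described in the text.
DOMAIN_OF_FIELD = {
    "date": "Date",
    "creator": "People and Organizations",
    "publisher": "People and Organizations",
    "contributor": "People and Organizations",
    "title": "Description of the content",
    "description": "Description of the content",
    "subject": "Description of the content",
    "coverage": "Description of the content",
    "format": "Technical data",
    "type": "Technical data",
    "language": "Technical data",
    "identifier": "ID",
    "rights": "Rights",
}

def fields_in_domain(domain):
    """Return the DCMES core fields belonging to one information domain."""
    return sorted(f for f, d in DOMAIN_OF_FIELD.items() if d == domain)

print(fields_in_domain("Technical data"))  # ['format', 'language', 'type']
```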
We evaluate the three data set collections with respect to the availability and degree of filling of attributes that contain information on these information domains. Table 2 shows the assignment of the attributes of the data sets to the information domains and the degree of filling of the attributes. It serves as an overview of the extent to which the data sets cover the respective information domains. More detailed information regarding the coverage of the information domains by single properties can be found in our online repository.
We can observe that OpenAIRE has the attributes with the highest filling degree covering the individual information domains. Only in the description of the data sets' content is specific information missing, such as the spatial coverage of the data set. The attributes of the Wikidata subset have the second highest filling degree. As with the OpenAIRE subset, Wikidata does not contain information about the spatial coverage of the data sets. However, Wikidata covers all other information domains better than the Schema.org data set collection. Although the Schema.org collection contains information on the spatial coverage of the data sets, it has the lowest degree of filling, averaged over all metadata, and therefore covers the information domains the least.

Qualitative Evaluation
We also carried out a manual evaluation to determine which scientific discipline each data set is to be assigned to and whether the metadata entries describe actually valid data sets. For this purpose, 100 randomly selected data sets were assessed manually as to whether they are valid data sets and to which scientific discipline they belong. According to the definition of DCAT, a data set is a collection of data that is published or managed by a single agent and is available in one or more representations for access or download (W3C, 2020). Thus, in our assessment, metadata entries (i.e., items describing data sets) were judged as non-valid data sets if they are pure data portals, pure software, or only websites containing information but not offering a data set download or other kinds of data set access.

Proportion of valid data sets
The results of the manual evaluation are as follows. Out of 100 data set representations, Wikidata contained 86 valid data sets, OpenAIRE 100, and Schema.org 68. In the case of Schema.org, a resource can belong to the class http://schema.org/Dataset even though it is not a data set. Such incorrect entries occur, for example, when a website contains a http://schema.org/Dataset description although it does not describe any data set metadata. It can therefore not be guaranteed that all metadata entries actually describe data sets. The resulting prevalence of incorrect entries is a known, unsolved problem in research (Benjelloun et al., 2020).

Scientific disciplines
We manually assigned scientific disciplines to each of the 100 randomly selected data sets in order to determine the discipline coverage. For this purpose, we reused the set of disciplines used by Benjelloun et al. (2020) for the analysis of the Google Dataset Search data corpus. In order to determine the scientific discipline of a data set, we used not only the available metadata entries but also, where available, the web pages on which the data sets are provided online. To be able to compare the data sets as intuitively as possible, we assigned each data set only to its main discipline and omitted possible double assignments.
The results of our analysis are shown in Figure 1. We can see that the resources of the three considered data set collections cover the scientific disciplines to varying extents. The data sets of Wikidata largely cover the disciplines of the humanities and social sciences. The data sets of OpenAIRE and Schema.org, however, mainly cover the natural sciences. In particular, the geosciences as well as biology and agricultural science are disproportionately represented. Such an imbalance in the domains can be observed for existing cross-domain knowledge graphs as well (Färber, Bartscherer, Menne, & Rettinger, 2018).

Quality of the Metadata Entries
We also analyzed the data set titles and descriptions in the data set collections.The average number of words of the data set descriptions is shown in Table 3. Table 4 shows the average number of words in the data set title.
We observe that the titles and descriptions of entries in Wikidata are significantly shorter than those in OpenAIRE and Schema.org. In Wikidata, it is intended that descriptions of resources are kept short; the descriptions are mainly used to disambiguate resources (Vrandecic, 2019). If a longer description is required for an application, the descriptive section of the Wikipedia article or the official website of the resource can be used. Of the 4,209 data sets from Wikidata, 901 (21.4%) have an English Wikipedia entry. A listed website is available for 2,522 (59.9%) of the data sets.

Bottom Line
We can summarize our analysis results as follows: The Wikidata data set collection contains the fewest resources, and the data set titles, like the data set descriptions, are kept short. However, it can be assumed that many more data sets will be added to Wikidata in the future. The metadata of the data sets can be retrieved via the freely accessible SPARQL endpoint. The OpenAIRE data set collection covers the examined information domains to the highest degree. Furthermore, no incorrect entries were found in the manual evaluation for this data set collection. The Schema.org data set collection covers the examined information domains the least and contains many entries that are not data sets.

In order to provide a good data basis for novel research approaches, it is important that the knowledge graph contains as few incorrect entries as possible. The many incorrect entries in the Schema.org data set collection would reduce the quality of the knowledge graph and represent a problem that should not be neglected. Thus, we decided to use Wikidata and OpenAIRE to build a knowledge graph with the lowest possible proportion of incorrect entries. The knowledge graph is therefore particularly suitable as a data basis for research approaches that focus on high precision rather than high recall.

APPROACH TO CREATING THE DATA SET KNOWLEDGE GRAPH
In the following section, we show our approach to creating a knowledge graph for data sets. The overall approach to the construction of the knowledge graph is sketched in Figure 2. We can differentiate between the following steps: 1. The data set metadata used are originally in tabular form. First, we link the data sets to the publications in which they are mentioned, using a string-matching algorithm. The procedure is described in detail in Section 4.3. 2. Next, the metadata entries are prepared and cleaned up so that they meet the requirements of the RDF target vocabulary. This includes, among other things, a classification of the resources contained in the metadata and an extensive author disambiguation. This step is described in Section 4.4. 3. We map the processed metadata to the RDF standard vocabulary DCAT, with which the knowledge graph is created. 4. The result of running our approach is an RDF knowledge graph based on the four design principles of linked data (Heath & Bizer, 2011). By mapping the metadata to an RDF vocabulary and linking to other data sources on the web, the created knowledge graph is a 5-star data set as defined by Tim Berners-Lee (Heath & Bizer, 2011).
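The linking in step 1 relies on a string-matching algorithm detailed in Section 4.3. As that algorithm is not reproduced here, the following is a simplified, hypothetical sketch of how data set titles could be matched against abstracts and citation contexts; the actual DSKG pipeline may use additional heuristics.

```python
import re

def find_dataset_mentions(dataset_titles, text):
    """Return the titles that occur in the text.

    Simplified sketch: case-insensitive whole-phrase matching with word
    boundaries. Not the authors' actual algorithm.
    """
    mentions = []
    for title in dataset_titles:
        pattern = r"\b" + re.escape(title) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            mentions.append(title)
    return mentions

abstract = ("We evaluate our model on the ImageNet benchmark and compare "
            "against prior results.")
print(find_dataset_mentions(["ImageNet", "OpenStreetMap"], abstract))
# ['ImageNet']
```

Run over all abstracts and citation contexts, each match yields one candidate link between a data set and a publication.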
In the following, we describe the knowledge graph schema and the single steps in more detail.

Knowledge Graph Schema
We developed a schema for our knowledge graph as depicted in Figure 3. The figure shows the entity types present in the knowledge graph and their properties. The reused vocabularies and their corresponding prefixes are also given. Elements that are literals, together with their corresponding data type, are indicated in a node with a green background. Elements that are uniquely identified by a URI are indicated in a node with a yellow background. The labeled edges between two nodes indicate the relationship between the nodes.
We reused the W3C DCAT vocabulary to build the knowledge graph. In this way, we are reusing a widely used standard vocabulary, which allows us to model metadata from different sources in a standardized way (see Section 2.1). The knowledge graph schema covers six different entity types, such as data sets and authors. In our knowledge graph, a data set describes a metadata instance that contains a data set distribution and a number of other properties for describing the data set. A data set distribution is a specific representation of a data set and thus represents the accessible form of the data set; this can be a downloadable CSV file, for instance. It also contains properties for license information and general information describing the resource (Neumaier, 2019; W3C, 2020). Listing 1 shows the metadata of an example data set in DCAT.

Listing 1. Example of an RDF serialization of a represented data set.

Mapping of the Metadata According to DCAT
Our data set knowledge graph is created based on the data set collections as outlined and analyzed in Section 3. As the initial data set collections contain the metadata in tabular form, the first step in our framework is to transform the data into RDF using the knowledge graph schema (i.e., ontology) outlined above.

Mapping of the Metadata Properties
As a next step, we define mapping rules and map the metadata properties to the RDF vocabulary DCAT. The mapping rules for the metadata properties are shown in Table 5. Note that Schema.org is not included, as it is not considered as a data source based on our data analysis in Section 3. We generate the mapping rules based on the following observations:
Wikidata data model
The project WikiProject Datasets (https://www.wikidata.org/wiki/Wikidata:WikiProject_Datasets) deals with the coordination and improvement of data set descriptions in Wikidata. As a result, mapping rules between the Wikidata data model and DCAT were drawn up (WikiProject Datasets/Data Structure/DCAT - Wikidata - Schema.org mapping, 2018). Denny Vrandecic presents mapping rules between the Wikidata data model and Schema.org, which are equivalent to DCAT (Vrandecic, 2019). We use these existing mappings of the Wikidata data model to DCAT. Since the above-mentioned documents were working drafts at the time this work was carried out, not all classes and properties have been finalized. For example, the drafts do not contain a mapping for the Wikidata property Also known as. For missing mappings, we analyzed whether two properties with the same name exist in the data models and whether they can be mapped onto one another.
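Once drawn up, such mapping rules can be applied mechanically. The following Python sketch illustrates the idea with a small, hypothetical excerpt of rules; the property identifiers and target terms below are illustrative examples, not the actual content of Table 5.

```python
# Hypothetical excerpt of Wikidata-to-DCAT mapping rules for illustration;
# the actual rules used for the DSKG are listed in Table 5 of the paper.
WIKIDATA_TO_DCAT = {
    "P1476": "dct:title",         # title
    "P856":  "dcat:landingPage",  # official website
    "P275":  "dct:license",       # copyright license
}

def map_properties(entity):
    """Map a dict of Wikidata property -> value to DCAT terms, where a rule exists."""
    return {WIKIDATA_TO_DCAT[p]: v for p, v in entity.items() if p in WIKIDATA_TO_DCAT}

entity = {"P1476": "Example Data Set", "P9999": "unmapped value"}
print(map_properties(entity))  # {'dct:title': 'Example Data Set'}
```

Properties without a rule (such as the unmapped `P9999` above) are simply dropped, which mirrors the situation described for properties like Also known as that lack a finalized mapping.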

OpenAIRE data model
The OpenAIRE Research Graph data model is inspired by several existing metadata standards. In particular, the DataCite metadata schema is reused to describe data sets (Manghi, Bardi, et al., 2019). A draft exists that describes how DataCite metadata can be mapped to a DCAT-compliant representation (Perego, Austin, Friis-Christensen, Vaccari, & Tsinaraki, 2020).

Quantitative Science Studies 13
The illustrations of the OpenAIRE Research Graph data model according to DCAT that we use are based on the assignments in this draft.

Preprocessing of Metadata Entries for the DSKG
We need to adapt the metadata entries to the requirements of DCAT. DCAT defines the data type and the data format for literals (W3C, 2020). Thus, if necessary, we adapt the metadata entries to the prescribed data formats. The size of a data set distribution is specified in DCAT in bytes; we therefore convert the size specifications of the metadata entries into bytes. Furthermore, the dates available in the metadata are converted uniformly into the ISO 8601 standard used by DCAT for the representation of date and time information.
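The two normalizations can be sketched as follows. This is a minimal sketch: the paper does not specify the exact source formats, so the size units and the day.month.year date pattern below are assumptions.

```python
import re
from datetime import datetime

# Decimal unit factors are an assumption; the source collections may use others.
UNIT_FACTORS = {"b": 1, "kb": 10**3, "mb": 10**6, "gb": 10**9, "tb": 10**12}

def size_to_bytes(size: str) -> int:
    """Convert a human-readable size such as '2.5 MB' into bytes (dcat:byteSize)."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([a-zA-Z]+)\s*", size)
    if not match:
        raise ValueError(f"unparseable size: {size!r}")
    value, unit = float(match.group(1)), match.group(2).lower()
    return int(value * UNIT_FACTORS[unit])

def date_to_iso8601(date: str, source_format: str = "%d.%m.%Y") -> str:
    """Normalize a date string to ISO 8601 (YYYY-MM-DD) as required by DCAT."""
    return datetime.strptime(date, source_format).date().isoformat()

print(size_to_bytes("2.5 MB"))        # 2500000
print(date_to_iso8601("03.11.2019"))  # 2019-11-03
```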

Mapping of the Data Set Instances
To avoid duplicate entries of data sets in the knowledge graph, we determine the overlap between the OpenAIRE and Wikidata collections. We consider two data sets duplicates if the term-frequency vectors of their titles have a cosine similarity of over 0.9. Data sets in this overlap are added to the knowledge graph only once. Due to the relatively low number of data sets that are linked to a publication (see Section 4.3), only one data set occurs in both sources.

Data Transformation
The transformation of the metadata in tabular form into a knowledge graph takes place with the help of operations from SPARQL (W3C, 2013a) and SPARQL 1.1 Update (W3C, 2013b). The built knowledge graph is described with the help of SPARQL CONSTRUCT clauses. In the WHERE clause, the metadata is extracted from the tables and assigned to the variables in the CONSTRUCT clause. The assignment takes place according to the mapping rules defined in Table 5. In order to comply with the W3C standards and the design principles of linked data, the resources contained in the metadata are designated with URIs. The implementation of the semantic modeling of the resources is given in our repository.
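The shape of such a CONSTRUCT query can be illustrated as follows. This is a sketch only: the actual clauses are in the authors' repository, and the source predicates and URI scheme in the WHERE clause are assumptions for illustration.

```python
# Illustrative SPARQL CONSTRUCT pattern for the tabular-to-RDF transformation.
# The urn:example predicates and the dataset URI scheme are hypothetical.
CONSTRUCT_QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>

CONSTRUCT {
  ?dataset a dcat:Dataset ;
           dct:title ?title ;
           dct:description ?description .
}
WHERE {
  ?row <urn:example:title> ?title ;
       <urn:example:description> ?description .
  BIND(IRI(CONCAT("urn:example:dataset/", STR(?row))) AS ?dataset)
}
"""
print("dcat:Dataset" in CONSTRUCT_QUERY)  # True
```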

Linking the Data Sets to Scientific Publications
Adding links to other data sources is another important step in integrating the knowledge graph into the Linked Open Data Cloud and facilitating important novel use cases, such as data set recommendation for given publications. Many scientific publications mention data sets that scientists used or created for their research (Gregory, Cousijn, Groth, Scharnhorst, & Wyatt, 2020; Henderson & Kotz, 2015). Therefore, we link the data sets to the scientific publications in which they are mentioned. In our knowledge graph, the data sets refer to the publications via the property dct:isReferencedBy (W3C, 2020). The MAKG is used as the data basis for the scientific publications because it covers all scientific disciplines and is one of the largest freely available scholarly knowledge graphs, containing a total of 210 million publications (Färber, 2019).
The MAKG contains 146 million publication abstracts and 241.5 million citation contexts (i.e., sentences in which other publications are cited via citation markers). We searched for mentions of data sets in these abstracts and citation contexts to create links between data sets and publications. We use a string-based algorithm for that purpose (see our GitHub repository). The titles of the data sets, alternative titles, and the data set IDs are used to identify data sets in the scientific publications. For data sets listed in OpenAIRE, we used the attributes title, doi, and originalId. For data sets listed in Wikidata, we used label, altLabel, official website, work URL, and url.
In order to minimize the number of incorrect links, some preprocessing steps are carried out before applying the matching algorithm. So that the comparison is not distorted by meaningless data set titles, such as Dataset, Language, or README, data set titles are only used for the comparison if they are not among frequently used English words. We use the English Word Frequency data set (Tatman, 2017) to filter out such titles. To account for the different nature of the metadata entries from the different sources (see Table 4), a case-sensitive comparison of the data set titles and the alternative titles is performed for the data sets listed in Wikidata. In addition, only data set titles that have a minimum length of four letters are considered; alternative data set titles are only considered if they have a minimum length of five letters.
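The title filtering rules above can be sketched as follows. The small frequent-word set is a stand-in for the English Word Frequency data set (Tatman, 2017) used in the paper.

```python
def usable_titles(titles, alt_titles, frequent_english_words):
    """Keep titles per the paper's rules: drop titles that are frequent English
    words, titles shorter than 4 letters, and alternative titles shorter than 5."""
    keep = [t for t in titles
            if len(t) >= 4 and t.lower() not in frequent_english_words]
    keep += [t for t in alt_titles
             if len(t) >= 5 and t.lower() not in frequent_english_words]
    return keep

# Tiny stand-in for the frequent-word list; the real filter uses Tatman (2017).
frequent = {"dataset", "language", "readme"}
# "README" is a frequent word, "GDB" is too short, "Data" is a too-short alt title.
print(usable_titles(["ImageNet", "README", "GDB"], ["WordNet", "Data"], frequent))
# ['ImageNet', 'WordNet']
```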
A total of 2,208 data sets are linked to 634,803 scientific publications in the MAKG. The data sets are mentioned in the abstracts or citation contexts of the linked publications and are therefore an essential aspect of these publications. 588 data sets originate from OpenAIRE and 1,620 data sets from Wikidata. More links to publications are found for data sets from Wikidata because, on the one hand, alternative titles are available for the comparison in addition to the data set titles and, on the other hand, the data set titles in Wikidata are shorter (see Table 4). As Wikidata enables its users to provide metadata in a decentralized manner, many well-known data sets from third parties have been registered in Wikidata (Vrandecic, 2019). These well-known data sets are often mentioned in publications; for example, the ten data sets with the most linked publications are all from Wikidata. Table 6 provides an overview of these ten data sets.
The distribution of the number of links to scientific publications is shown in Figure 4. We can clearly see that the number of linked publications per data set is very unbalanced and that most data sets are associated with around 10 to 1,000 publications. We argue that, in particular, this long tail of data sets makes it difficult for researchers and data scientists to be aware of all relevant data sets for a given field and that data set recommendation based on our knowledge graph is a promising avenue for future work. The distribution of the number of data sets over the number of publications in which they are mentioned is shown in Table 7. As we can see, the vast majority of publications mention only very few data sets. The publications included in the DSKG mention on average 1.28 data sets.
In total, linking the data sets to scientific publications results in 813,551 links from the 2,208 DSKG data sets to 634,803 unique publications represented in the MAKG.

Fields of Application of the Data Sets
The fields of application of the data sets are useful for domain-specific data set search and recommendation. They can be derived from the linked scientific publications. The publications in the MAKG have the property http://purl.org/spar/fabio/hasDiscipline, which has a resource of the class http://ma-graph.org/class/FieldOfStudy as range. The class contains 19 different subject areas. The subject areas of the publications that reference the data sets are therefore known and can be added to the metadata of the data sets. The dcat:theme property is used to model the application areas, since DCAT does not provide any property that specifically describes the field of application of a data set. This property specifies a subject area of a data set, and a data set can be assigned to several fields. A data set is assigned to an application area if at least 25% of its linked publications are assigned to this subject area. The distribution of the fields of application of the data sets in the knowledge graph is described in Section 5.
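The 25% assignment rule can be sketched as follows; the field names in the example are illustrative placeholders for MAKG fields of study.

```python
from collections import Counter

def application_areas(linked_publication_fields, threshold=0.25):
    """Assign a data set to every field of study covering at least 25% of its
    linked publications (modeled via dcat:theme in the DSKG)."""
    counts = Counter(field
                     for fields in linked_publication_fields
                     for field in fields)
    total = len(linked_publication_fields)
    return sorted(f for f, c in counts.items() if c / total >= threshold)

# Five linked publications with their MAKG fields of study (illustrative):
fields = [["Computer science"], ["Computer science", "Biology"],
          ["Computer science"], ["Biology"], ["Medicine"]]
# Computer science: 3/5, Biology: 2/5 (both >= 25%); Medicine: 1/5 (dropped).
print(application_areas(fields))  # ['Biology', 'Computer science']
```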

Semantic Representation of the Metadata
A fundamental idea of linked data is the use of URIs to name resources (W3C, 2014). This enables resources to be linked to other things on the web. To provide the metadata according to the Linked Data principles, we therefore transform the strings representing resources into URIs. This is achieved by classifying the strings and resolving word ambiguities.

Classification of Resources
The resources from the OpenAIRE data set are only available as character strings, not as URIs with an associated entity type. The Wikidata data set also contains resources that are not semantically represented (e.g., author name strings). In order to transform the character strings into semantic resources in the knowledge graph, we first automatically determine the class of each resource. We use the Python library spaCy (http://spacy.io) to determine whether a character string represents a person (foaf:Person, e.g., "Menghan Hu") or an organization (foaf:Organization, e.g., "US Department of Health and Human Services"). If the class of a character string remains unknown, it is modeled as foaf:Agent.

Resolving Ambiguities
The next step is to resolve ambiguities (i.e., to identify strings representing the same real-world object) so that we can assign unique URIs to the resources. Overall, our knowledge graph contains 1,169 resources of type foaf:Person, 246 resources of type foaf:Organization, 102 resources of type foaf:Agent, and 19 resources of type vcard:Kind. Due to the negligibly small number of ambiguous names for entities of the classes foaf:Organization, foaf:Agent, and vcard:Kind, no automated disambiguation is necessary in these cases. The resources of the class foaf:Person are called authors in the following. Due to the large number of entries, author name disambiguation is necessary, which is described in the following subsection.
In the approach of Caron and van Eck (2014), only authors whose names have a certain similarity are compared. If the score of the compared authors of two publications exceeds a certain threshold, the authors are clustered using single-linkage clustering. It is then assumed that the publications were written by the same author (i.e., in our case, that the data sets were published by the same person).
Based on the approach of Caron and van Eck (2014), we developed a rule-based approach for author name disambiguation that is adapted to the metadata of data sets. To calculate the similarity between two author names and, thus, to determine the candidates for author name disambiguation, we use the Jaro-Winkler similarity. The Jaro-Winkler similarity is often used to calculate the similarity of short strings, especially personal names (Bilenko, Mooney, Cohen, Ravikumar, & Fienberg, 2003). In our implementation, two authors are compared with each other if their names have a Jaro-Winkler similarity of at least 0.9, following Donner (2014) and Hajra, Radevski, and Tochtermann (2015). Concerning the rules for the author name disambiguation, we use factors that have already proven to be reliable in the literature and that are weighted according to their importance (Caron & van Eck, 2014; Cen, Dragut, Si, & Ouzzani, 2013; Dendek, Bolikowski, & Lukasik, 2012; Protasiewicz & Dadas, 2016).
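The candidate selection can be sketched with a plain implementation of the Jaro-Winkler similarity; the paper's actual implementation is in its repository, so this is only an illustrative reimplementation of the standard formula.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Two author names become comparison candidates if the similarity is >= 0.9:
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```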
The rules used for author name disambiguation are based on different types of evidence and can be divided into two categories. On the one hand, we can use explicit evidence that comes directly from the data set metadata. On the other hand, we can use implicit evidence derived from the data set metadata (Ferreira, Gonçalves, & Laender, 2012; Protasiewicz & Dadas, 2016). One technique for finding implicit evidence is to identify latent topics of data sets shared by the data set authors. LDA (Blei, Ng, & Jordan, 2003) is among the most popular techniques for obtaining latent topics. We create an LDA model for the data sets from the combined titles, descriptions, and keywords. We use ten topics for our LDA model because it has been shown in the literature that an LDA model with ten topics usually achieves the best results as an implicit attribute for author disambiguation (Song, Huang, Councill, Li, & Giles, 2007). When calculating the similarity of two authors, the cosine similarity between the LDA vectors of the underlying data sets is considered as one of the criteria. This counteracts the problem that some data sets only have values for a few properties, so that only a few bibliographic elements can be used for the comparison.
Table 8 provides an overview of the rules used in our approach, together with their weights and threshold values. Overall, we use the following rules: The first rule checks the initials of the authors of two data sets. The second rule checks whether the adjusted first names are equal; it is assumed that the authors' names are in the form first name last name. The third rule evaluates the number of joint co-authors between two data sets. The fourth and fifth rules check whether the data sets have common publishers and contributors. The sixth rule checks the titles of the data sets for common words. The seventh rule considers the years of publication of the data sets. The eighth rule compares the fields of application of the data sets. The ninth rule checks the cosine similarity of the LDA vectors of the data sets calculated using the LDA model.
Two compared author names are rated as identical if their score reaches the given threshold (θ ≥ 11, following Caron and van Eck (2014)).
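The weighted rule aggregation can be sketched as follows. Only the threshold θ ≥ 11 is taken from the text; the rule names, weights, and per-rule thresholds below are placeholders, as the real values are given in Table 8.

```python
def disambiguation_score(pair_features: dict, rules: dict) -> int:
    """Weighted rule-based score for an author-name pair (cf. Table 8).
    Each rule contributes its weight when its feature meets the rule's threshold."""
    score = 0
    for name, (weight, threshold) in rules.items():
        if pair_features.get(name, 0) >= threshold:
            score += weight
    return score

# Hypothetical weights/thresholds; the authoritative values are in Table 8.
RULES = {
    "initials_equal":    (2, 1),
    "first_names_equal": (3, 1),
    "shared_coauthors":  (4, 1),
    "shared_publisher":  (2, 1),
    "title_overlap":     (2, 2),  # at least two common title words
    "lda_cosine":        (3, 1),  # 1 = LDA cosine similarity met its threshold
}

features = {"initials_equal": 1, "first_names_equal": 1,
            "shared_coauthors": 2, "lda_cosine": 1}
score = disambiguation_score(features, RULES)
print(score, score >= 11)  # 12 True
```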
The results and evaluation of the performed author disambiguation are described in Section 5.3.

Linking of the Data Set Authors to ORCID
The transformation of the metadata into RDF as linked data opens up new possibilities to link the resources in the knowledge graph to other resources on the web. In particular, linking to the ORCID records of researchers offers added value. ORCID provides a registry with persistent identifiers for the unique identification of scientific authors (Haak, Fenner, Paglione, Pentz, & Ratner, 2012). It was designed to solve the problem of author name disambiguation for scientific publications (Caron & van Eck, 2014). We can thus use ORCID iDs as ground truth for evaluating our author name disambiguation approach. As of December 2020, ORCID (https://orcid.org) has issued more than 10 million author identifiers. The authors in our knowledge graph are enriched with links to their ORCID records.
In order to determine whether an author in our knowledge graph is identical to an author registered at ORCID, we compare the authors by means of a variety of features, such as the author's name, the titles of the author's publications, co-authorship, and the identifiers of publications (Hajra et al., 2015; Radevski, Hajra, & Limani, n.d.). A challenge here is that scientific authors usually only list their scientific publications on their public ORCID record, but not their published data sets (see Table 13). Therefore, in addition to the titles of the data sets published by the authors, we consider the titles of the linked publications of the data sets. Since there are over 10 million records in ORCID in total, many of which are incomplete, it is a nontrivial task to identify the correct ORCID record for an author. Therefore, we use the following rules for comparing the authors:
1. Similarity between the names of the authors: Two author profiles are only compared with each other if the names of the author profiles have at least two identical terms (strings).
2. Similarity of the titles of the publications: The titles of the data sets of the author profile from the knowledge graph are compared with the titles of the published works of the author profile from ORCID. To enable a comparison between data set titles and publication titles, the number of identical words in the titles that are not stop words is counted. One identical word is given a rating of 1, two a rating of 2, and more than two a rating of 4.
In addition, the titles of the linked publications of the data sets are compared with the titles of the published works from ORCID. There is a match for two compared titles if their cosine similarity exceeds a given threshold.

The share of data sets in the knowledge graph for which an author is specified is, at 23.5%, significantly larger than in the data set corpus of Google Dataset Search. This is due to the fact that the specification of the authors is standard in OpenAIRE, while the publisher is specified less often. In Wikidata, too, the publisher of the data sets is rarely given. As a result, only 17.8% of the data sets have the dct:publisher property.
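The word-overlap rating of rule 2 above can be sketched as follows; the small stop word set is a placeholder, as the paper does not list the one it uses.

```python
def title_overlap_rating(dataset_title: str, orcid_work_title: str,
                         stop_words) -> int:
    """Rate the overlap of non-stop-words between a data set title and an
    ORCID work title: one common word -> 1, two -> 2, more than two -> 4."""
    words_a = {w for w in dataset_title.lower().split() if w not in stop_words}
    words_b = {w for w in orcid_work_title.lower().split() if w not in stop_words}
    common = len(words_a & words_b)
    if common > 2:
        return 4
    return common  # 0, 1, or 2

stop_words = {"a", "an", "the", "of", "for"}  # placeholder stop word list
print(title_overlap_rating("A corpus of English tweets",
                           "Sentiment analysis of English tweets",
                           stop_words))  # 2
```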

Areas of Application of the Data Sets
The areas of application of the data sets are determined on the basis of their linked scientific publications in the MAKG, using the MAKG's fields of study. In total, no application area can be determined for 286 data sets of our knowledge graph; for these data sets, no subject area (i.e., field of study) is assigned to at least 25% of the linked publications. One application area can be determined for 1,492 data sets, two application areas for 383 data sets, three for 41 data sets, and four for 6 data sets.
Figure 6 gives an overview of the distribution of the application areas of the data sets based on the associated publications' fields of study. The relatively high coverage of computer science and biology, compared to disciplines such as engineering and medicine, can be traced back to the fact that the number of publications using data sets in these disciplines is higher, and we can therefore add more links to publications in these disciplines. Furthermore, Wikidata and OpenAIRE contain a relatively high number of biological data sets (see Figure 1). The fact that many biological data sets are published is also reflected in the corpus of Google Dataset Search: A total of 15.2% of the data sets of that corpus belong to this discipline (Benjelloun et al., 2020).

Comparing the titles of the linked publications with the titles of the works specified on the ORCID records resulted in a total of 37 matches. The unique identification of the authors of data sets can be made easier in the future if authors include their published data sets in their publication lists or, if possible, link their data sets to their own publications.

APPLICATION SCENARIOS
In this section, we outline how the presented data set knowledge graph can be used, particularly in the context of new application scenarios.
Linked open data. Since the DSKG is part of the Linked Open Data cloud and contains links to other data sources, it contributes to the use of linked data in the context of data sets on the web. By using the SPARQL endpoint available at http://dskg.org/sparql or URI resolution, both users and programs can query the data. The reusability of the knowledge graph by third parties is simplified by the reuse of the DCAT vocabulary (Neumaier, Umbrich, & Polleres, 2017). As a result, the knowledge graph can be used as a data source for new or existing applications related to data sets on the web (Hallo, Luján-Mora, Maté, & Trujillo, 2016). In particular, the high number of links indicating in which publications data sets were mentioned can be highly beneficial. Applications that previously only used metadata concerning publications (e.g., as provided by the MAKG) can now use the data set representations to better understand the publications' key content.
Scholarly search and recommender systems. Our knowledge graph can be used as a data and evaluation basis for innovative search engines for data sets. The online demonstration system http://datasetsearch.net, for instance, illustrates how users can search for data sets based on scientific problem descriptions as input, with the DSKG as the database. Having the DSKG in RDF allows us to compute data set entity embeddings and to use these embeddings in state-of-the-art neural network-based search and recommender systems. In addition, representing the data in RDF enables developers to deploy semantic search systems that combine the data set metadata with metadata concerning publications, authors, venues, and institutions (Chapman et al., 2020; Färber, 2019) and allows users to search across scholarly data, such as data sets, publications, citations, authors, and venues (Baglioni et al., 2020). Due to the careful selection of the data sources when creating the DSKG (see Section 3), the DSKG exhibits a high level of accuracy of the data set metadata. Thus, we avoid situations in which users first need to identify and filter out faulty metadata (e.g., items that are not data sets or that have incorrect attributes).
Scholarly data analysis and trend detection. Using SPARQL queries, we can determine statistical key figures regarding the modeled data sets and authors. For example, SPARQL queries can be used to determine authors of data sets in a specific application area whose data sets are linked to a large number of publications. Listing 2 shows a possible SPARQL query to identify the most influential authors in the field of biology. The query determines the 50 authors with the most frequently referenced data sets from the total of 422 authors in this area. The authors are returned in descending order of their number of references. In addition, the data sets of the authors and, if available, their ORCID iDs are provided. Listing 3 shows a possible SPARQL query to determine the most scientifically influential data sets in the field of computer science: It returns the ten data sets in this area that are referenced by the most publications.
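A query of the kind described for Listing 2 could look as follows. This is not the paper's exact listing: the predicates follow the DCAT/FOAF modeling described above, while the creator predicate and the omitted field-of-study filter are assumptions.

```python
# Illustrative query in the spirit of Listing 2 (the paper's exact query is
# given in the listing itself). A production query would additionally bind
# ?field to the biology field-of-study URI from the MAKG.
TOP_AUTHORS_QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?author (COUNT(?publication) AS ?references)
WHERE {
  ?dataset a dcat:Dataset ;
           dct:creator ?author ;
           dcat:theme ?field ;
           dct:isReferencedBy ?publication .
  ?author a foaf:Person .
}
GROUP BY ?author
ORDER BY DESC(?references)
LIMIT 50
"""
print("ORDER BY DESC" in TOP_AUTHORS_QUERY)  # True
```

Such queries can be run directly against the public endpoint at http://dskg.org/sparql without downloading any dump.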
The number of times data sets are mentioned in a paper collection, as modeled in the DSKG, can be used as a first step for measuring the scientific impact of data sets (Belter, 2014; Konkiel, 2020). Furthermore, thanks to the rich metadata about scholarly entities in the MAKG (Färber, 2019), which is interlinked with the DSKG, such as information about publications, authors, affiliations, conferences, journals, and fields of study, there is a great basis for advanced scientific impact quantification and research evaluation. For such measurements, SPARQL queries (see Listings 2 and 3 as examples) can be executed without the user having to download and process any data dump.
Overall, the DSKG provides the basis for realizing the vision that, by establishing and using links to other entities, such as publications, data sets are no longer seen as isolated marginal products of research, but as research artifacts in their own right that can be reused and make a major contribution to science.

CONCLUSION
In this paper, we presented the DSKG, a knowledge graph for data sets with a corresponding schema.
Based on an analysis of several data set collections, OpenAIRE and Wikidata were selected as the data basis for the DSKG. Specifically, we included the metadata of data sets that are mentioned in at least one publication of the Microsoft Academic Knowledge Graph (Färber, 2019).
In order to resolve ambiguities of author names, we adapted an author name disambiguation approach to data set metadata. In addition to explicit evidence, implicit evidence was taken into account in the form of latent topic modeling. When evaluating the author name disambiguation method, we obtained a satisfactory F1 score of 99.2%.
In addition, we developed a method for linking the data set authors given in the DSKG to their ORCIDs.Using this method, a link to their ORCID record could be added for 17.8% of the authors in the DSKG.
Our knowledge graph is available as Linked Open Data at http://dskg.org. Besides resolving URIs via HTTP content negotiation and providing RDF dump files on Zenodo, we provide a public SPARQL endpoint for querying. The knowledge graph comprises a total of 2,208 data set instances and 813,551 links to scientific publications.
We outlined potential use cases of the created knowledge graph and showed that the DSKG can be used in particular in the context of search and recommender systems, as well as for scientific impact quantification.
We can assume that the number of published data sets will continue to increase in the coming years (Benjelloun et al., 2020), not least because of the increasing adoption of the FAIR principles (Wilkinson et al., 2016). Therefore, the need for knowledge graphs covering data sets will also continue to grow. We will tackle this challenge by periodically updating the DSKG and by linking the DSKG to future scholarly knowledge graphs that cover the key content of scientific publications in a fine-grained manner (Jaradeh et al., 2019).

Figure 1 .
Figure 1. Coverage of the scientific disciplines.

Figure 2 .
Figure 2. Approach to create our data set knowledge graph.

Figure 3 .
Figure 3. Schema of the Data Set Knowledge Graph.

Figure 4 .
Figure 4. Distribution of the number of linked publications using a log scale.

Table 7. Distribution of the number of data sets to the number of publications in which they are mentioned
# Data sets:     1        2       3       4      5      6    7    8   9   10-20
# Publications:  510,524  85,537  27,572  8,175  2,102  584  165  69  28  47

Table 2 .
Coverage of data sets (with title and description) regarding the information domains

Table 3 .
Number of words in the data set description

Table 8 .
Criteria for author name disambiguation

Table 9 .
Share of data sets and data set distributions with certain properties