Abstract
Although several large knowledge graphs have been proposed in the scholarly field, such graphs are limited with respect to several data quality dimensions such as accuracy and coverage. In this article, we present methods for enhancing the Microsoft Academic Knowledge Graph (MAKG), a recently published large-scale knowledge graph containing metadata about scientific publications and associated authors, venues, and affiliations. Based on a qualitative analysis of the MAKG, we address three aspects. First, we adopt and evaluate unsupervised approaches for large-scale author name disambiguation. Second, we develop and evaluate methods for tagging publications by their discipline and by keywords, facilitating enhanced search and recommendation of publications and associated entities. Third, we compute and evaluate embeddings for all 239 million publications, 243 million authors, 49,000 journals, and 16,000 conference entities in the MAKG based on several state-of-the-art embedding techniques. Finally, we provide statistics for the updated MAKG. Our final MAKG is publicly available at https://makg.org and can be used for the search or recommendation of scholarly entities, as well as enhanced scientific impact quantification.
1. INTRODUCTION
In recent years, knowledge graphs have been proposed and made publicly available in the scholarly field, covering information about entities such as publications, authors, and venues. They can be used for a variety of use cases: (1) Using the semantics encoded in the knowledge graphs and RDF as a common data format, which allows easy data integration from different data sources, scholarly knowledge graphs can be used for providing advanced search and recommender systems (Noia, Mirizzi et al., 2012) in academia (e.g., recommending publications (Beel, Langer et al., 2013), citations (Färber & Jatowt, 2020), and data sets (Färber & Leisinger, 2021a, 2021b)). (2) The representation of knowledge as a graph and the interlinkage of entities of various entity types (e.g., publications, authors, institutions) allows us to propose novel ways to scientific impact quantification (Färber, Albers, & Schüber, 2021). (3) If scholarly knowledge graphs model the key content of publications, such as data sets, methods, claims, and research contributions (Jaradeh, Oelen et al., 2019b), they can be used as a reference point for scientific knowledge (e.g., claims) (Fathalla, Vahdati et al., 2017), similar to DBpedia and Wikidata in the case of cross-domain knowledge. In light of the FAIR principles (Wilkinson, Dumontier et al., 2016) and the overload of scientific information resulting from the increasing publishing rate in the various fields (Johnson, Watkinson, & Mabe, 2018), one can envision that researchers’ working styles will change considerably over the next few decades (Hoffman, Ibáñez et al., 2018; Jaradeh, Auer et al., 2019a) and that, in addition to PDF documents, scientific knowledge might be provided manually or semiautomatically via appropriate forms (Jaradeh et al., 2019b) or automatically based on information extraction on the publications’ full-texts (Färber et al., 2021).
The Microsoft Academic Knowledge Graph (MAKG) (Färber, 2019), AMiner (Tang, Zhang et al., 2008), OpenCitations (Peroni, Dutton et al., 2015), AceKG (Wang, Yan et al., 2018), and OpenAIRE (OpenAIRE, 2021) are examples of large domain-specific knowledge graphs with millions or sometimes billions of facts about publications and associated entities, such as authors, venues, and fields of study. In addition, scholarly knowledge graphs edited by the crowd (Jaradeh et al., 2019b) and providing scholarly key content (Färber & Lamprecht, 2022; Jaradeh et al., 2019b) have been proposed. Finally, freely available cross-domain knowledge graphs such as Wikidata (https://wikidata.org/) provide an increasing amount of information about the academic world, although not as systematically structured as in the domain-specific knowledge graphs.
The Microsoft Academic Knowledge Graph (MAKG) (Färber, 2019) was published in its first version in 2019 and is distinctive in that (1) it is one of the largest freely available scholarly knowledge graphs (over 8 billion RDF triples as of September 2019), (2) it is linked to other data sources in the Linked Open Data cloud, and (3) it provides metadata for entities that are—particularly in combination—often missing in other scholarly knowledge graphs (e.g., authors, institutions, journals, fields of study, in-text citations). As of June 2020, the MAKG contains metadata for more than 239 million publications from all scientific disciplines, as well as over 1.38 billion references between publications. As outlined in Section 2.2, since 2019, the MAKG has already been used in various scenarios, such as recommender systems (Kanakia, Shen et al., 2019), data analytics, bibliometrics, and scientific impact quantification (Färber, 2020; Färber et al., 2021; Schindler, Zapilko, & Krüger, 2020; Tzitzikas, Pitikakis et al., 2020), as well as knowledge graph query processing optimization (Ajileye, Motik, & Horrocks, 2021).
Despite its data richness, the MAKG suffers from data quality issues arising primarily due to the application of automatic information extraction methods from the publications (see further analysis in Section 2). We highlight as major issues (1) the containment of author duplicates in the range of hundreds of thousands, (2) the inaccurate and limited tagging (i.e., assignment) of publications with keywords given by the fields of study (Färber, 2019), and (3) the lack of embeddings for the majority of MAKG entities, which hinders the development of machine learning approaches based on the MAKG.
In this article, we present methods for solving these issues and apply them to the MAKG, resulting in an enhanced MAKG.
First, we perform author name disambiguation on the MAKG’s author set. To this end, we adopt an unsupervised approach to author name disambiguation that uses the rich publication representations in the MAKG and that scales for hundreds of millions of authors. We use ORCID iDs to evaluate our approach.
Second, we develop a method for tagging all publications with fields of study and with a newly generated set of keywords based on the publications’ abstracts. While the existing field of study labels assigned to papers are often misleading (see Wang, Shen et al. (2019) and Section 4) and, thus, often not beneficial for search and recommender systems, the enhanced field of study labels assigned to publications can be used, for instance, to search for and recommend publications, authors, and venues, as our evaluation results show.
Third, we create embeddings for all 239 million publications, 243 million authors, 49,000 journals, and 16,000 conference entities in the MAKG. We experimented with various state-of-the-art embedding approaches. Our evaluations show that the ComplEx embedding method (Trouillon, Welbl et al., 2016) outperforms other embeddings in all metrics. To the best of our knowledge, RDF knowledge graph embeddings have not yet been computed for such a large (scholarly) knowledge graph. For instance, RDF2Vec (Ristoski, Rosati et al., 2019) was trained on 17 million Wikidata entities. Even DGL-KE (Zheng, Song et al., 2020), a recently published package optimized for training knowledge graph embeddings at a large scale, was evaluated on a benchmark with only 86 million entities.
Finally, we provide statistics concerning the authors, papers, and fields of study in the newly created MAKG. For instance, we analyze the authors’ citing behaviors, the number of authors per paper over time, and the distribution of fields of study using the disambiguated author set and the new field of study assignments. We incorporate the results of all mentioned tasks into a final knowledge graph, which we provide online to the public at https://makg.org (formerly: http://ma-graph.org) and http://doi.org/10.5281/zenodo.4617285. Thanks to the disambiguated author set, the new paper tags, and the entity embeddings, the enhanced MAKG opens the door to improved scholarly search and recommender systems and advanced scientific impact quantification.
Overall, our contributions are as follows:
- ▪ We present and evaluate an approach for large-scale author name disambiguation, which can deal with the peculiarities of large knowledge graphs, such as heterogeneous entity types and 243 million author entries.
- ▪ We propose and evaluate transformer-based methods for classifying publications according to their fields of study based on the publications’ abstracts.
- ▪ We apply state-of-the-art entity embedding approaches to provide entity embeddings for 243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences, and evaluate them.
- ▪ We provide a statistical analysis of the newly created MAKG.
Our implementation for enhancing scholarly knowledge graphs can be found online at https://github.com/lin-ao/enhancing_the_makg.
The remainder of this article is structured as follows. In Section 2, we describe the MAKG, along with typical application scenarios and its wide usage in the real world. We also outline the MAKG’s limitations regarding its data quality, thereby providing our motivation for enhancing the MAKG. Subsequently, in Sections 3, 4, and 5, we describe in detail our approaches to author name disambiguation, paper classification, and knowledge graph embedding computation. In Section 6, we describe the schema of the updated MAKG, information regarding the provisioning of the knowledge graph, and statistical key figures of the enhanced MAKG. We provide a conclusion and give an outlook in Section 7.
2. OVERVIEW OF THE MICROSOFT ACADEMIC KNOWLEDGE GRAPH
2.1. Schema and Key Statistics
We can differentiate between three data sets:
1. the Microsoft Academic Graph (MAG) provided by Microsoft (Sinha, Shen et al., 2015),
2. the Microsoft Academic Knowledge Graph (MAKG) in its original version provided by Färber since 2019 (Färber, 2019), and
3. the enhanced MAKG outlined in this article.
The initial MAKG (Färber, 2019) was derived from the MAG, a database consisting of tab-separated text files (Sinha et al., 2015). The MAKG is based on the information provided by the MAG and enriches the content by modeling the data according to linked data principles to generate a Linked Open Data source (i.e., an RDF knowledge graph with resolvable URIs, a public SPARQL endpoint, and links to other data sources). During the creation of the MAKG, the data originating from the MAG is not modified (except for minor tasks, such as data cleaning, linking locations to DBpedia, and providing sameAs-links to DOI and Wikidata). As such, the data quality of the MAKG is largely equivalent to the data quality of the MAG provided by Microsoft.
Table 1 shows the number of entities in the MAG as of May 29, 2020. Accordingly, the MAKG created from the MAG also exhibits these numbers. The MAKG is impressive in size: It contains the metadata for 239 million publications (including 139 million abstracts), 243 million authors, and more than 1.64 billion references between publications (see also https://makg.org/).
| Key | # in MAG/MAKG |
| --- | --- |
| Papers | 238,670,900 |
| Papers with Link | 224,325,750 |
| Papers with Abstract | 139,227,097 |
| Authors | 243,042,675 |
| Affiliations | 25,767 |
| Journals | 48,942 |
| Conference Series | 4,468 |
| Conference Instances | 16,142 |
| Fields of Study | 740,460 |
It is remarkable that the MAKG contains more authors than publications. The number of authors (243 million) appears too high, given that there were eight million scientists in the world in 2013 according to UNESCO (Baskaran, 2017); for more information about the increase in the number of scientists worldwide, see Shaver (2018). In addition, the number of affiliations in the MAKG (about 26,000) appears relatively low, given that all research institutions in all fields should be represented and that there are 20,000 officially accredited or recognized higher education institutions (World Higher Education Database, 2021).
Compared to a previous analysis of the MAG in 2016 (Herrmannova & Knoth, 2016), whose statistics would match those of an MAKG counterpart had it existed in 2016, the number of instances has increased for all entity types (including the number of conference series, from 1,283 to 4,468), except for the number of conference instances, which has dropped from 50,202 to 16,142. An obvious reason for this reduction is the data cleaning process that is part of the MAG generation at Microsoft. While the numbers of journals, authors, and papers have doubled compared to the 2016 version (Herrmannova & Knoth, 2016), the numbers of conference series and fields of study have nearly quadrupled.
Figure 1 shows how many publications represented in the MAKG have been published per discipline (i.e., level-0 field of study). Medicine, materials science, and computer science occupy the top positions. This was not always the case. According to the analysis of the MAG in 2016 (Herrmannova & Knoth, 2016), physics, computer science, and engineering were the disciplines with the highest numbers of publications. We assume that additional and changing data sources of the MAG resulted in this change.
Figure 2 presents the overall number of publication citations per discipline. The descending order of the disciplines is, to a large extent, similar to the descending order of the disciplines considering their associated publication counts (see Figure 1). However, specific disciplines, such as biology, exhibit a large publication citation count compared to their publication count, while the opposite is the case for disciplines such as computer science. The paper citation count per discipline is not provided by the 2016 MAG analysis (Herrmannova & Knoth, 2016).
Table 2 shows the frequency of instances per subclass of mag:Paper, generated by means of a SPARQL query using the MAKG SPARQL endpoint. Listing 1 shows an example of how the MAKG can be queried using SPARQL.
| Document type | Number |
| --- | --- |
| Journal | 85,759,950 |
| Patent | 52,873,589 |
| Conference | 4,702,268 |
| Book chapter | 2,713,052 |
| Book | 2,143,939 |
| No type given | 90,478,102 |
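As a minimal illustration of how such counts can be obtained programmatically, the following Python sketch (using the SPARQLWrapper library) counts papers per document type via the public SPARQL endpoint; the endpoint URL and the class namespace are assumptions and may need to be adapted to the current MAKG schema.

```python
# Minimal sketch (not the original implementation): count papers per document
# type via the MAKG SPARQL endpoint. The endpoint URL and the class namespace
# are assumptions and may need adjustment.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://makg.org/sparql"  # assumed public endpoint

QUERY = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX magc: <https://makg.org/class/>   # hypothetical class namespace

SELECT ?type (COUNT(?paper) AS ?numPapers)
WHERE {
  ?paper rdf:type ?type .
  ?type  rdfs:subClassOf magc:Paper .
}
GROUP BY ?type
ORDER BY DESC(?numPapers)
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["type"]["value"], row["numPapers"]["value"])
```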
2.2. Current Usage and Application Scenarios
The MAKG RDF dumps on Zenodo have been viewed almost 6,000 times and downloaded more than 42,000 times (as of June 15, 2021). As the RDF dumps were also available directly at https://makg.org/rdf-dumps/ (formerly: http://ma-graph.org/rdf-dumps/) until January 2021, the 21,725 visits (since April 4, 2019) to this web page are also relevant.
Figures 3, 4, and 5 were created based on the log files of the SPARQL endpoint. They show the number of SPARQL queries per day, the number of unique users per day, and the extent to which different user agents were used. Given these figures and a further analysis of the SPARQL endpoint log files, the following facts are observable:
- ▪ Except for 2 months, the number of daily requests increased steadily.
- ▪ The number of unique user agents remained fairly constant, apart from a period between October 2019 and January 2020.
- ▪ The frequency of more complex queries (based on query length) is increasing.
Within only one year of its publication in November 2019, the MAKG has been used in diverse ways by various third parties. Below we list some of them based on citations of the MAKG publication (Färber, 2019).
2.2.1. Search and recommender systems and data analytics
- ▪ The MAKG has been used for recommender systems, such as paper recommendation (Kanakia et al., 2019).
- ▪ Scholarly data is becoming increasingly important for businesses. Due to its large number of items (e.g., publications, researchers), the MAKG has been discussed as a data source in enterprises (Schubert, Jäger et al., 2019).
- ▪ The MAKG has been used by nonprofit organizations for data analytics. For instance, Nesta uses the MAKG in its business intelligence tools (see https://www.nesta.org.uk and https://github.com/michaelfaerber/MAG2RDF/issues/1).
- ▪ As a unique data source for scholarly data, the MAKG has been used as one of several publicly available knowledge graphs to build a custom domain-specific knowledge graph that considers specific domains of interest (Qiu, 2020).
2.2.2. Bibliometrics and scientific impact quantification
- ▪ The Data Set Knowledge Graph (Färber & Lamprecht, 2022) provides information about data sets as a linked open data source and contains links to MAKG publications in which the data sets are mentioned. Utilizing the publications’ metadata in the MAKG allows researchers to employ novel methods for scientific impact quantification (e.g., working on an “h-index” for data sets).
- ▪ SoftwareKG (Schindler et al., 2020) is a knowledge graph that links about 50,000 scientific articles from the social sciences to the software mentioned in those articles. The knowledge graph also contains links to other knowledge graphs, such as the MAKG. In this way, SoftwareKG provides the means to assess the current state of software usage.
- ▪ Publications modeled in the MAKG have been linked to the GitHub repositories containing the source code associated with the publications (Färber, 2020). For instance, this facilitates the detection of trends on the implementation level and monitoring of how and by whom the FAIR principles are followed (e.g., considering who provides the source code to the public in a reproducible way).
- ▪ According to Tzitzikas et al. (2020), the scholarly data of the MAKG can be used to measure institutions’ research output.
- ▪ In Färber et al. (2021), an approach for extracting the scientific methods and data sets used by authors is presented. The extracted methods and data sets are linked to the publications in the MAKG, enabling novel scientific impact quantification tasks (e.g., measuring how often which data sets and methods have been reused by researchers) and the recommendation of methods and data sets. Overall, linking the key content of scientific publications as modeled in knowledge graphs, or integrating such information into the MAKG, can be considered a natural extension of the MAKG in the future.
- ▪ The MAKG has inspired other researchers to use it in the context of the data-driven history of science (see https://www.downes.ca/post/69870) (i.e., for the science of science (Fortunato, Bergstrom et al., 2018)).
- ▪ Daquino, Peroni et al. (2020) present the OpenCitations data model and evaluate the representation of citation data in several knowledge graphs, such as the MAKG.
2.2.3. Benchmarking
- ▪ As a very large RDF knowledge graph, the MAKG has served as a data set for evaluating novel approaches to streaming partitioning of RDF graphs (Ajileye et al., 2021).
2.3. Current Limitations
Based on the statistical analysis of the MAKG and the analysis of the usage scenarios of the MAKG so far, we have identified the following shortcomings:
- ▪ Author name disambiguation is apparently one of the most pressing needs for enhancing the MAKG.
- ▪ The assigned fields of study associated with the papers in the MAKG are not accurate (e.g., architecture), and the field of study hierarchy is quite erroneous.
- ▪ The use cases of the MAKG show that the MAKG has not been used extensively for machine learning tasks. So far, only entity embeddings for the MAKG as of 2019 concerning the entity type paper are available, and these have not been evaluated. Thus, we perceive a need to provide state-of-the-art embeddings for the MAKG covering many instance types, such as papers, authors, journals, and conferences.
3. AUTHOR NAME DISAMBIGUATION
3.1. Motivation
The MAKG is a highly comprehensive data set containing more than 243 million author entities alone. As is the case with any large database, duplicate entries cannot be easily avoided (Wang, Shen et al., 2020). When adding a new publication to the database, the maintainers must determine whether the authors of the new paper already exist within the database or if a new author entity is to be created. This process is highly susceptible to errors, as certain names are common. Given a large enough sample size, it is not rare to find multiple people with identical surnames and given names. Thus, a plain string-matching algorithm is not sufficient for detecting duplicate authors. Table 3 showcases the 10 most frequently occurring author names in the MAKG to further emphasize the issue, using the December 2019 version of the MAKG for this analysis. All author names are of Asian origin. While it is true that romanized Asian names are especially susceptible to causing duplicate entries within a database (Roark, Wolf-Sonkin et al., 2020), the problem is not limited to any geographical or cultural origin and is, in fact, a common problem shared by Western names as well (Sun, Zhang et al., 2017).
| Author name | Frequency |
| --- | --- |
| Wang Wei | 20,235 |
| Zhang Wei | 19,944 |
| Li Li | 19,049 |
| Wang Jun | 16,598 |
| Li Jun | 15,975 |
| Li Wei | 15,474 |
| Wei Wang | 14,020 |
| Liu Wei | 13,578 |
| Zhang Jun | 13,553 |
| Wei Zhang | 13,366 |
The goal of the author name disambiguation task is to identify the maximum number of duplicate authors, while minimizing the number of “false positives”; that is, it aims to limit the number of authors classified as duplicates even though they are distinct persons in the real world.
In Section 3.2, we dive into the existing literature concerning author name disambiguation and, more generally, entity resolution. In Section 3.3, we define our problem formally. In Section 3.4, we introduce our approach, and we present our evaluation in Section 3.5. Finally, we conclude with a discussion of our results and lessons learned in Section 3.6.
3.2. Related Work
3.2.1. Entity resolution
Entity resolution is the task of identifying and removing duplicate entries in a data set that refer to the same real-world entity. This problem persists across many domains and, ironically, is itself affected by duplicate names: “object identification” in computer vision, “coreference resolution” in natural language processing, “database merging,” “merge/purge processing,” “deduplication,” “data alignment,” or “entity matching” in the database domain, and “entity resolution” in the machine learning domain (Maidasani, Namata et al., 2012). The entities to be resolved are either part of the same data set or may reside in multiple data sources.
Newcombe, Kennedy et al. (1959) were the first to define the entity linking problem, which was later modeled mathematically by Fellegi and Sunter (1969). They derived a set of formulas to determine the probability of two entities matching based on given preconditions (i.e., similarities between feature pairs). Later studies refer to these probabilistic formulas as equivalent to a naïve Bayes classifier (Quass & Starkey, 2003; Singla & Domingos, 2006).
Generally speaking, there exist two approaches to dealing with entity resolution (Wang, Li et al., 2011). In statistics and machine learning, the task is formulated as a classification problem, in which all pairs of entries are compared to each other and classified as matching or nonmatching by an existing classifier. In the database community, a rule-based approach is usually used to solve the task. Rule-based approaches can often be transformed into probabilistic classifiers, such as naïve Bayes, and require certain prior domain knowledge for their setup.
3.2.2. Author name disambiguation
Author name disambiguation is a subcategory of entity resolution and is performed on collections of authors. Table 4 provides an overview of papers specifically approaching the task of author name disambiguation in the scholarly field in the last decade.
| Authors | Year | Approach | Supervised |
| --- | --- | --- | --- |
| Pooja, Mondal, and Chandra (2020) | 2020 | Graph-based combination of author similarity and topic graph | ✗ |
| Wang, Wang et al. (2020) | 2020 | Adversarial representation learning | ✓ |
| Kim, Kim, and Owen-Smith (2019) | 2019 | Matching email address, self-citation and coauthorship with iterative clustering | ✗ |
| Zhang, Xinhua, and Pan (2019) | 2019 | Hierarchical clustering with edit distances | ✗ |
| Ma, Wang, and Zhang (2019) | 2019 | Graph-based approach | ✗ |
| Kim, Rohatgi, and Giles (2019) | 2019 | Deep neural network | ✓ |
| Zhang, Yan, and Zheng (2019) | 2019 | Graph-based approach and clustering | ✗ |
| Zhang et al. (2019) | 2019 | Molecular cross clustering | ✗ |
| Xu, Li et al. (2018) | 2018 | Combination of single features | ✓ |
| Pooja, Mondal, and Chandra (2018) | 2018 | Rule-based clustering | ✗ |
| Sun et al. (2017) | 2017 | Multi-level clustering | ✗ |
| Lin, Zhu et al. (2017) | 2017 | Hierarchical clustering with combination of similarity metrics | ✗ |
| Müller (2017) | 2017 | Neural network using embeddings | ✓ |
| Kim, Khabsa, and Giles (2016) | 2016 | DBSCAN with random forest | ✗ |
| Momeni and Mayr (2016) | 2016 | Clustering based on coauthorship | ✗ |
| Protasiewicz and Dadas (2016) | 2016 | Rule-based heuristic, linear regression, support vector machines and AdaBoost | ✓ |
| Qian, Zheng et al. (2015) | 2015 | Support vector machines | ✓ |
| Tran, Huynh, and Do (2014) | 2014 | Deep neural network | ✓ |
| Caron and van Eck (2014) | 2014 | Rule-based scoring | ✗ |
| Schulz, Mazloumian et al. (2014) | 2014 | Pairwise comparison and clustering | ✗ |
| Kastner, Choi, and Jung (2013) | 2013 | Random forest, support vector machines and clustering | ✓ |
| Wilson (2011) | 2011 | Single layer perceptron | ✓ |
Ferreira, Gonçalves, and Laender (2012) surveyed existing methods for author name disambiguation. They categorized existing methods by their types of approach, such as author grouping or author assignment methods, as well as their clustering features, such as citation information, web information, or implicit evidence.
Caron and van Eck (2014) applied a strict set of rules for scoring author similarities, such as 100 points for identical email addresses. Author pairs scoring above a certain threshold are classified as identical. Although the creation of such a rule set requires specific domain knowledge, the approach is much simpler in nature than supervised learning approaches. In addition, it significantly outperforms other clustering-based unsupervised approaches (Tekles & Bornmann, 2019). For these reasons, we base our approach on the one presented in their paper.
3.3. Problem Formulation
Existing papers usually aim to introduce a new fundamental approach to author name disambiguation and do not focus on the general applicability of their approaches. As a result, these approaches are often impractical when applied to a large data set. For example, some clustering-based approaches require prior knowledge of the number of clusters (Sun et al., 2017), others require the pairwise comparison of all entities (Qian et al., 2015), and some require external information gathered through web queries (Pooja et al., 2018), which is not feasible for millions of entries, as the inherent bottleneck of web requests greatly limits the speed of the overall process. Therefore, instead of choosing a single approach, we select features from different models and combine them to fit our target data set containing millions of author names.
We favor unsupervised learning for several reasons: the lack of training data, no need to maintain and update training data, and generally more favorable time and space complexity. Thus, in our approach, we choose the hierarchical agglomerative clustering algorithm (HAC). We formulate the problem as follows: Given the set of author entities in the MAKG, partition it into disjoint clusters such that each cluster contains exactly those author entities that refer to the same real-world person.
3.4. Approach
We follow established procedures from existing research for unsupervised author name disambiguation (Caron & van Eck, 2014; Ferreira et al., 2012) and utilize a two-part approach consisting of pairwise similarity measurement using author and paper metadata, and clustering. Additionally, we use blocking (see Section 3.4.3) to reduce the complexity considerably. Figure 6 shows the entire system used for the author name disambiguation process. The system’s steps are as follows:
Preprocessing. We preprocess the data by aggregating all relevant information (e.g., concerning authors, publications, and venues) into one single file for easier access. We then sort our data by author name for the final input.
Disambiguation. We apply blocking to significantly reduce the complexity of the task. We then use hierarchical agglomerative clustering with a rule-based binary classifier as our distance function to group authors into distinct disambiguated clusters.
Postprocessing. We aggregate the output clusters into our final disambiguated author set.
Below, the most important aspects of these steps are outlined in more detail.
3.4.1. Feature selection
We use both author and publication metadata for disambiguation. We choose the features based on their availability in the MAKG and on their previous use in similar works from Table 4. Overall, we use the following features:
- ▪ Author name: This is not used explicitly for disambiguation, but rather as a feature for blocking to reduce the complexity of the overall algorithm.
- ▪ Affiliation: This determines whether two authors share a common affiliation.
- ▪ Coauthors: This determines whether two authors share common coauthors.
- ▪ Titles: This calculates the most frequently used keywords in each author’s published titles in order to determine common occurrences.
- ▪ Years: This compares the time frames in which the authors published works.
- ▪ Journals and conferences: These compare the journals and conferences where each author published.
- ▪ References: This determines whether two authors share common referenced publications.
Although email has proven to be a highly effective distinguishing feature for author name disambiguation (Caron & van Eck, 2014; Kim, 2018; Schulz et al., 2014), this information is not available to us directly and therefore omitted from our setup. Coauthorship, on the other hand, is one of the most important features for author name disambiguation (Han, Giles et al., 2004). Affiliation could be an important feature, though we could not rely solely on it, as researchers often change their place of work. In addition, as the affiliation information is automatically extracted from the publications, it might be on varying levels (e.g., department vs. university) and written in different ways (e.g., full name vs. abbreviation). Journals and conferences could be effective features, as many researchers tend to publish in places familiar to them. For a similar reason, references can be an effective measure as well.
3.4.2. Binary classifier
We adopt a rule-based binary classifier as seen in the work of Caron and van Eck (2014), chosen for its simplicity, interpretability, and scalability. Being unsupervised, the approach does not require any training data and is therefore well suited for our situation. Furthermore, it is easily adapted and fine-tuned to achieve the best performance on our data set. Its lack of necessary training time, as well as its fast run time, makes it ideal when working with large-scale data sets containing millions of authors.
For each feature, the similarity function consists of rule-based scoring. Below, we briefly describe how the similarity for each individual feature is calculated.
- For features with a single value, as is the case with affiliation (which does not record historical data), the classifier determines whether both entries match and assigns a fixed score s_affiliation.
- For features consisting of multiple values, such as coauthors, the classifier determines the intersection of both value sets. Here, we assign scores using a stepping function (i.e., fixed scores for an intersection size of one, two, three, etc.). For the feature coauthors, the similarity function is

  $$ s_{\text{coauthors}}(a, b) = \begin{cases} s_{\text{coauthors},1} & \text{if } |C_a \cap C_b| = 1 \\ s_{\text{coauthors},2} & \text{if } |C_a \cap C_b| = 2 \\ s_{\text{coauthors},3} & \text{if } |C_a \cap C_b| \geq 3 \\ 0 & \text{otherwise,} \end{cases} $$

  where C_a and C_b denote the coauthor sets of the two compared authors; the same form holds for the features journals, conferences, titles, and references with their respective value sets. Papers’ titles are a special case for scoring, as they must be numericalized to allow a comparison. Ideally, we would use a form of word embeddings to measure the true semantic similarity between two titles, but, based on the results of preliminary experiments, the added computation would be significant and would most likely not translate into a notable performance increase. We therefore adopt a plain surface-form string comparison: We extract the top 10 most frequently used words from the tokenized and lemmatized titles of the works published by an author and calculate their intersection with the corresponding set of another author. A special case also exists for the references feature: A bonus score s_self-reference is applied in the case of self-referencing, that is, if the two compared authors directly reference each other in their respective works, as in Caron and van Eck (2014).
- For some features, such as journals and conferences, a large intersection between two authors may be uncommon. We therefore only assign a nonzero score if both authors share at least one common value.
- Other features, such as publication year, also consist of multiple values, though we interpret them as the extremes of a time span. Based on the feature values, we construct for each author the time span in which they were active and check for an overlap in active years when comparing two authors (similar to Qian et al. (2015)). Again, a fixed score is assigned based on this binary decision. For example, if author A published papers in 2002, 2005, and 2009, we extrapolate the active research period of author A as 2002–2009. If another author B was active during the same time period or within 10 years of both ends of the span (i.e., 1992–2019), we assign the score s_years. We expect most author comparisons to share an overlap in research time span and thus to receive a score greater than zero; this feature is therefore mainly aimed at “punishing” obvious nonmatches. The scoring function takes the following shape:

  $$ s_{\text{years}}(a, b) = \begin{cases} s_{\text{years}} & \text{if the active periods of } a \text{ and } b \text{ overlap (extended by 10 years at both ends)} \\ 0 & \text{otherwise.} \end{cases} $$
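A minimal Python sketch of this rule-based classifier is given below; the score values mirror the high-precision configuration in Table 5, while the representation of an author record as a dictionary of feature sets and year bounds is an assumption about the preprocessed data layout rather than the exact implementation.

```python
# Sketch of the rule-based binary classifier (high-precision configuration).
# Author records are assumed to be dicts with feature sets and year bounds;
# this layout is an assumption, not the exact implementation.

STEP_SCORES = {                  # scores for intersection sizes of 1, 2, >=3
    "coauthors":  (3, 5, 8),
    "titles":     (3, 5, 8),
    "references": (2, 3, 5),
}
S_AFFILIATION, S_JOURNALS, S_CONFERENCES = 1, 3, 3
S_YEARS, S_SELF_REFERENCE = 3, 8
THETA_MATCHING = 10              # minimum total score for a "same author" decision

def step_score(set_a, set_b, steps):
    """Fixed score depending on the intersection size (1, 2, or >=3)."""
    overlap = len(set_a & set_b)
    if overlap == 0:
        return 0
    return steps[min(overlap, 3) - 1]

def similarity(a, b):
    score = 0
    if a["affiliation"] and a["affiliation"] == b["affiliation"]:
        score += S_AFFILIATION
    for feature in ("coauthors", "titles", "references"):
        score += step_score(a[feature], b[feature], STEP_SCORES[feature])
    if a["journals"] & b["journals"]:
        score += S_JOURNALS
    if a["conferences"] & b["conferences"]:
        score += S_CONFERENCES
    # Active periods overlap, each extended by 10 years at both ends.
    if a["first_year"] - 10 <= b["last_year"] and b["first_year"] <= a["last_year"] + 10:
        score += S_YEARS
    # Bonus if the two authors directly reference each other's papers.
    if a["papers"] & b["references"] or b["papers"] & a["references"]:
        score += S_SELF_REFERENCE
    return score

def is_same_author(a, b):
    return similarity(a, b) >= THETA_MATCHING
```

In the high-recall configuration, only the scores for affiliation, journals, and conferences are increased (see Table 8).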
3.4.3. Blocking
Due to the high complexity of traditional clustering algorithms (e.g., O(n²)), there is a need to implement a blocking mechanism to improve the scalability of the algorithm to accommodate large amounts of input data. We implement sorted neighborhood (Hernández & Stolfo, 1995) as a blocking mechanism. We sort authors based on their names as provided by the MAKG and measure name similarity using the Jaro-Winkler distance (Jaro, 1989; Winkler, 1999), which performs well for name-matching tasks while being a fast heuristic (Cohen, Ravikumar, & Fienberg, 2003).
The Jaro-Winkler similarity returns values between 0 and 1, where a greater value signifies a closer match. We choose 0.95 as the threshold θ_blocking, based on the performance on our evaluation data set, and we choose 0.1 as the standard value for the scaling factor p. Similar names are grouped into blocks, within which we perform pairwise comparisons and cluster the authors classified as matching by our binary classifier.
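The following sketch illustrates the blocking step under the assumption that the jellyfish library (one possible Jaro-Winkler implementation) is used for the string similarity; the windowing logic of the original pipeline may differ.

```python
# Sketch of sorted-neighborhood-style blocking: authors are sorted by name, and
# consecutive names whose Jaro-Winkler similarity reaches the threshold are
# grouped into one block. jellyfish is one possible library choice.
import jellyfish

THETA_BLOCKING = 0.95
MAX_BLOCK_SIZE = 500   # optional cap on the block size (discussed in Sections 3.5.3 and 3.6)

def build_blocks(authors, key=lambda a: a["name"]):
    """Group authors with near-identical names into blocks."""
    authors = sorted(authors, key=key)
    blocks, current = [], []
    for author in authors:
        if current:
            sim = jellyfish.jaro_winkler_similarity(key(current[-1]), key(author))
            if sim < THETA_BLOCKING or len(current) >= MAX_BLOCK_SIZE:
                blocks.append(current)
                current = []
        current.append(author)
    if current:
        blocks.append(current)
    return blocks
```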
3.4.4. Clustering
The final step of our author name disambiguation approach consists of clustering the authors. To this end, we choose the traditional hierarchical agglomerative clustering approach. We generate all possible pairs between authors for each block and apply our binary classifier to distinguish matching and nonmatching entities. We then aggregate the resulting disambiguated blocks and receive the final collection of unique authors as output.
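As a sketch, grouping all pairs classified as matching within a block transitively (e.g., with a union-find structure) yields the same clusters as single-linkage agglomeration over the binary match decision; the actual implementation may organize this step differently.

```python
# Sketch of the per-block clustering step: every author pair in a block is
# compared with the binary classifier, and matching pairs are merged
# transitively via union-find (single-linkage over binary matches).
from itertools import combinations

def cluster_block(block, is_same_author):
    parent = list(range(len(block)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(block)), 2):
        if is_same_author(block[i], block[j]):
            parent[find(i)] = find(j)       # merge the two clusters

    clusters = {}
    for i, author in enumerate(block):
        clusters.setdefault(find(i), []).append(author)
    return list(clusters.values())
```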
3.5. Evaluation
3.5.1. Evaluation data
The MAKG contains bibliographical data on scientific publications, researchers, organizations, and their relationships. We use the version published in December 2019 for the evaluation, though our final published results were produced with an updated version (with only minor changes) from June 2020 consisting of 243,042,675 authors.
3.5.2. Evaluation setup
For the evaluation, we use the ORCID iD, a persistent digital identifier for researchers, as ground truth, following Kim (2019). ORCID iDs have been established as a common way to identify researchers. Although ORCID is still in the process of being adopted, it is already widely used: More than 7,000 journals already collect ORCID iDs from authors (see https://info.orcid.org/requiring-orcid-in-publications/). Our ORCID evaluation set consists of 69,742 author entities.
Although we use ORCID as ground truth, we are aware that this data set may be characterized by imbalanced metadata. First, ORCID became widely adopted only a few years ago; thus, primarily author names from publications published in recent years are considered in our evaluation. Furthermore, we can assume that ORCID is more likely to be used by active researchers with a comparatively higher number of publications, and the more publication metadata is available for an author, the higher the probability of a correct author name disambiguation.
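One straightforward way to derive labeled pairs from this data, sketched below under the assumption that ORCID iDs have already been attached to the MAKG author records, is to treat two entries in the same block as a positive (duplicate) pair if they carry the same ORCID iD and as a negative pair otherwise.

```python
# Sketch: derive labeled evaluation pairs from ORCID iDs attached to author
# records (an assumed field "orcid"). Entries in the same block with the same
# ORCID are positives; entries with different ORCIDs are negatives.
from itertools import combinations

def labeled_pairs(blocks):
    """Yield (author_a, author_b, is_duplicate) tuples for the evaluation."""
    for block in blocks:
        with_orcid = [a for a in block if a.get("orcid")]
        for a, b in combinations(with_orcid, 2):
            yield a, b, a["orcid"] == b["orcid"]
```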
We set the parameters as given in Table 5. We refer to this as the high-precision configuration. These values were chosen based on similar approaches (Caron & van Eck, 2014) and adjusted through experimentation with our evaluation data as well as an analysis of the relevance of each individual feature (see Section 3.5.3).
| Hyperparameter | Value |
| --- | --- |
| s_affiliation | 1 |
| s_coauthors,1 | 3 |
| s_coauthors,2 | 5 |
| s_coauthors,3 | 8 |
| s_titles,1 | 3 |
| s_titles,2 | 5 |
| s_titles,3 | 8 |
| s_journals | 3 |
| s_conferences | 3 |
| s_years | 3 |
| s_references,1 | 2 |
| s_references,2 | 3 |
| s_references,3 | 5 |
| s_self-references | 8 |
| θ_matching | 10 |
| θ_blocking | 0.95 |
| p | 0.1 |
We rely on the traditional metrics of precision, recall, and accuracy for our evaluation.
3.5.3. Evaluation results
Due to blocking, the total number of pairwise comparisons was reduced from 2,431,938,411 to 1,475. Of these, 49 pairs were positive according to our ORCID labels (i.e., they refer to the same real-world person); the other 1,426 were negative. The full classification results can be found in Table 6. The evaluation set is heavily imbalanced, with the majority of pairings being negative. We were able to correctly classify almost all negative labels (1,424 out of 1,426). The considerable number of false negative classifications is immediately noticeable; it is due to the selected features lacking the distinguishing power to classify certain difficult pairings.
| | Positive label | Negative label | Total |
| --- | --- | --- | --- |
| Positive classification | 37 | 2 | 39 |
| Negative classification | 12 | 1,424 | 1,436 |
| Total | 49 | 1,426 | 1,475 |
We have therefore chosen to accept a high percentage of false negatives in order to minimize the number of false positive classifications, as those are far more damaging to an author name disambiguation result.
Table 7 showcases the average scores for each feature separated into each possible category of outcome. For example, the average score for the feature titles from all comparisons falling under the true positive class was 0.162, and the average score for the feature years for comparisons from the true negative class was 2.899. Based on these results, journals and references play a significant role in identifying duplicate author entities within the MAKG; that is, they contribute high scores for true positives and true negatives. Every single author pair from the true positive classification cluster shared a common journal value, whereas almost none from the true negative class did. Similar observations can be made for the feature references as well.
| | TP | TN | FP | FN |
| --- | --- | --- | --- | --- |
| s_affiliation | 0.0 | 0.004 | 0.0 | 0.083 |
| s_coauthors | 0.0 | 0.0 | 0.0 | 0.0 |
| s_titles | 0.162 | 0.0 | 0.0 | 0.25 |
| s_years | 3.0 | 2.89 | 3.0 | 3.0 |
| s_journals | 3.0 | 0.034 | 3.0 | 1.75 |
| s_conferences | 3.0 | 2.823 | 3.0 | 3.0 |
| s_self-reference | 0.0 | 0.0 | 0.0 | 0.0 |
| s_references | 2.027 | 0.023 | 2.0 | 0.167 |
Our current setup results in a precision of 0.949, a recall of 0.755, and an accuracy of 0.991.
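These values follow directly from the confusion matrix in Table 6:

$$ \mathrm{Precision} = \frac{TP}{TP + FP} = \frac{37}{39} \approx 0.949, \quad \mathrm{Recall} = \frac{TP}{TP + FN} = \frac{37}{49} \approx 0.755, \quad \mathrm{Accuracy} = \frac{TP + TN}{1{,}475} = \frac{1{,}461}{1{,}475} \approx 0.991 $$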
By varying the scores assigned by each feature-level distance function, we can shift the focus of the entire system from achieving a high level of precision to achieving a high level of recall.
To improve our relatively poor recall value, we have experimented with different setups for distance scores. At high performance levels, a tradeoff persists between precision and recall. By applying changes to score assignment as seen in Table 8, we arrive at the results in Table 9.
| | High precision | High recall |
| --- | --- | --- |
| s_affiliation | 1 | 5 |
| s_coauthors,1 | 3 | 3 |
| s_coauthors,2 | 5 | 5 |
| s_coauthors,3 | 8 | 8 |
| s_titles,1 | 3 | 3 |
| s_titles,2 | 5 | 5 |
| s_titles,3 | 8 | 8 |
| s_years | 3 | 3 |
| s_journals | 3 | 4 |
| s_conferences | 3 | 4 |
| s_self-references | 8 | 8 |
| s_references,1 | 2 | 2 |
| s_references,2 | 3 | 3 |
| s_references,3 | 5 | 5 |
| | Positive label | Negative label | Total |
| --- | --- | --- | --- |
| Positive classification | 45 | 13 | 58 |
| Negative classification | 4 | 1,413 | 1,417 |
| Total | 49 | 1,426 | 1,475 |
We were able to increase the recall from 0.755 to 0.918. At the same time, our precision dropped from the original 0.949 to 0.776. As a result, the accuracy stayed at a similar level of 0.988. The exact confusion matrix can be found in Table 9. With our new setup, we were able to identify the majority of all duplicates (45 out of 49), though at the cost of a significant increase in the number of false positives (from 2 to 13). By further analyzing the exact reasoning behind each type of classification through an analysis of the individual feature scores in Table 10, we can see that the true positive and false positive classifications result from the same feature similarities, which creates a theoretical upper limit on the performance of our specific approach and data set. We hypothesize that additional external data may be necessary to exceed this upper limit.
| | TP | TN | FP | FN |
| --- | --- | --- | --- | --- |
| s_affiliation | 0.111 | 0.004 | 1.538 | 0.0 |
| s_coauthors | 0.0 | 0.0 | 0.0 | 0.0 |
| s_titles | 0.133 | 0.0 | 0.0 | 0.75 |
| s_years | 3.0 | 2.89 | 3.0 | 3.0 |
| s_journals | 3.911 | 0.023 | 3.077 | 0.0 |
| s_conferences | 4.0 | 3.762 | 4.0 | 4.0 |
| s_self-reference | 0.0 | 0.0 | 0.0 | 0.0 |
| s_references | 1.667 | 0.023 | 0.308 | 0.5 |
We must consider the heavily imbalanced nature of our classification labels when evaluating the results to avoid falling into the trap of the “high accuracy paradox”: On highly imbalanced data sets, where negative labels significantly outnumber positive labels, a model can achieve a high accuracy score simply because its ability to predict the many true negatives outweighs its shortcomings in identifying the few positive labels.
Ultimately, we decided to use the high-precision setup to create the final knowledge graph, as precision is a much more meaningful metric for author name disambiguation than recall. It is often preferable to avoid erroneously merging nonduplicate entities rather than to identify all duplicates at the cost of false positives.
We also analyzed the average feature density per author in the MAKG and the ORCID evaluation data set to gain deeper insight into the validity of our results. Feature density here refers to the average number of data entries within an individual feature, such as the number of papers for the feature “published papers.” The results can be found in Table 11.
| | MAKG | Evaluation |
| --- | --- | --- |
| AuthorID | 1.0 | 1.0 |
| Rank | 1.0 | 1.0 |
| NormalizedName | 1.0 | 1.0 |
| DisplayName | 1.003 | 1.0 |
| LastKnownAffiliationID | 0.172 | 0.530 |
| PaperCount | 1.0 | 1.0 |
| CitationCount | 1.0 | 1.0 |
| CreateDate | 1.0 | 1.0 |
| PaperID | 2.612 | 1.196 |
| DOI | 1.240 | 1.0 |
| Coauthors | 11.187 | 4.992 |
| Titles | 2.620 | 1.198 |
| Year | 1.528 | 1.107 |
| Journal | 0.698 | 0.819 |
| Conference | 0.041 | 0.025 |
| References | 20.530 | 26.590 |
| ORCID | 0.0003 | 1.0 |
As we can observe, there is a variation in “feature richness” between the evaluation set and the overall data set. However, for the most important features used for disambiguation—namely journals, conferences, and references—the difference is not as pronounced. Therefore, we can assume that the disambiguation results will not be strongly affected by this variation.
Performing our author name disambiguation approach on the whole MAKG containing 243,042,675 authors (MAKG version from June 2020) resulted in a reduced set of 151,355,324 authors. This is a reduction by 37.7% and shows that applying author name disambiguation is highly beneficial.
Importantly, we introduced a maximum block size of 500 in our final approach. Without it, the number of authors grouped into the same block would theoretically be unlimited. Introducing a limit to the block size improves performance significantly, reducing the runtime from over a week to about 48 hours on an Intel Xeon E5-2660 v4 processor with 128 GB of RAM. We have therefore opted to keep the limit, as the accompanying loss in detected duplicates is manageable and as we aimed to provide an approach for real application rather than a proof of concept. However, the limit can be easily removed or adjusted.
3.6. Discussion
Due to the high number of authors with identical names within the MAG and, thus, the MAKG, our blocking algorithm sometimes still generates large blocks with more than 20,000 authors. The number of pairwise classifications necessary equates to the number of combinations, namely $\binom{n}{2} = \frac{n(n-1)}{2}$ for a block of n authors, leading to high computational complexity for larger block sizes. One way of dealing with this issue is to manually limit the maximum number of entities within one block, as we have done. Doing so splits potential duplicate entities into distinct blocks, meaning they will never be subject to comparison by the binary classifier, although the entire process may be sped up significantly depending on the exact size limit selected. To highlight the challenge, Table 12 showcases the author names with the largest block sizes created by our blocking algorithm (i.e., the author names generating the most complexity). For the name block of “Wang Wei,” $\binom{20{,}235}{2} = 204{,}717{,}495$ comparisons would be necessary with no block size limit, compared to $40 \cdot \binom{500}{2} + \binom{235}{2} = 5{,}017{,}495$ comparisons with a block size limit of 500 authors. We have found the difference in disambiguation results to be negligible: The number of detected duplicate authors differs by less than 2 million, compared to the almost 100 million duplicate authors found in total.
| Author name | Block size |
| --- | --- |
| Wang Wei | 20,235 |
| Zhang Wei | 19,944 |
| Li Li | 19,049 |
| Wang Jun | 16,598 |
| Li Jun | 15,975 |
| Li Wei | 15,474 |
| Wei Wang | 14,020 |
| Liu Wei | 13,580 |
| Zhang Jun | 13,553 |
| Wei Zhang | 13,366 |
Wei Zhang | 13,366 |
Author name . | Block size . |
---|---|
Wang Wei | 20,235 |
Zhang Wei | 19,944 |
Li Li | 19,049 |
Wang Jun | 16,598 |
Li Jun | 15,975 |
Li Wei | 15,474 |
Wei Wang | 14,020 |
Liu Wei | 13,580 |
Zhang Jun | 13,553 |
Wei Zhang | 13,366 |
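The comparison counts quoted above can be checked with a few lines of Python:

```python
# Check of the comparison counts for the "Wang Wei" block (20,235 = 40 * 500 + 235).
from math import comb

print(comb(20235, 2))                      # 204717495 comparisons without a block size limit
print(40 * comb(500, 2) + comb(235, 2))    # 5017495 comparisons with a limit of 500
```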
Our approach could be further optimized through hand-crafted rules for dealing with certain author names. Names of certain origins, such as Chinese or Korean names, possess particular nuances: While the romanized forms of two Chinese names may be similar or identical, the original-language text often shows a distinct difference. Furthermore, understanding the composition of surnames and given names may also help to further reduce the complexity. As an example, the names “Zhang Lei” and “Zhang Wei” differ by only a single character in their romanized forms and would be classified as potential duplicates or typos due to their similarity, even though for native Chinese speakers they signify two distinctly separate names, especially when written in the original Chinese characters. Chinese research publications have risen in number in recent years (Johnson et al., 2018). Given their susceptibility to creating duplicate entries, as well as their already significant presence in the MAKG, future researchers would be well advised to treat this problem as a focal point.
Additionally, there is the possibility of applying multiple classifiers and combining their results in a hybrid approach. If we were able to generate training data of sufficient volume and quality, we could apply supervised learning approaches, such as neural networks or support vector machines, using our generated feature vectors as input.
4. FIELD OF STUDY CLASSIFICATION
4.1. Motivation
Publications modeled in the MAKG are assigned to specific fields of study. Additionally, the fields of study are organized in a hierarchy. In the MAKG as of June 2020, 709,940 fields of study are organized in a multilevel hierarchical system (see Table 13). Both the field of study assignments of papers and the field of study hierarchy in the MAKG originate from the MAG data provided by Microsoft Research. The entire classification scheme is highly comprehensive and covers a huge variety of research areas, but the labeling of papers contains many shortcomings. Thus, the second task in this article for improving the MAKG is the revision of the field of study assignments of individual papers.
| Level | # of fields of study |
| --- | --- |
| 0 | 19 |
| 1 | 292 |
| 2 | 138,192 |
| 3 | 208,368 |
| 4 | 135,913 |
| 5 | 167,676 |
Many of the higher-level fields of study in the hierarchical system are highly specific, and therefore lead to many misclassifications purely based on certain matching keywords in the paper’s textual information. For instance, papers on the topic of machine learning architecture are sometimes classified as “Architecture.” Because the MAG does not contain any full texts of papers, but is limited to the titles and abstracts only, we do not believe that the information provided in the MAG is comprehensive enough for effective classification on such a sophisticated level.
On top of that, such a hierarchical structure is highly rigid and difficult to change. When introducing a previously unincorporated field of study, not only does the entire classification scheme have to be modified, but ideally all papers also have to be relabeled in case some fall under the new label.
We believe the underlying problem to be the complexity of the entire classification scheme. We aim to create a simpler structure that is extendable. Our idea is not aimed at replacing the existing structure and field of study labels, but rather enhancing and extending the current system. Instead of limiting each paper to being part of a comprehensive structured system, we (1) merely assign a single field of study label at the top level (also called “discipline” in the following, level 0 in the MAKG), such as computer science, physics, or mathematics. We then (2) assign to each publication a list of keywords (i.e., tags), which are used to describe the publication in further detail. Our system is therefore essentially descriptive in nature rather than restrictive.
Compared to the classification scheme of the original MAKG and the MAG so far, our proposed system is more fluid and extendable as its labels or tags are not constrained to a rigid hierarchy. New concepts are freely introduced without affecting existing labels.
Our idea therefore is to classify papers on a basic level, then extract keywords in the form of tags for each paper. These can be used to describe the content of a specific work, while leaving the structuring of concepts to domain experts in each field. We classify papers into their respective fields of study using a transformer-based classifier and generate tags for papers using keyword extraction from the publications’ abstracts.
In Section 4.2, we introduce related work concerning text classification and tagging. We describe our approach in Section 4.3. In Section 4.4, we present our evaluation of existing field of study labels, the MAKG field of study hierarchy, and the newly created field of study labels. Finally, we discuss our findings in Section 4.5.
4.2. Related Work
4.2.1. Text classification
The tagging of papers based on their abstracts can be regarded as a text classification task. Text classification aims to categorize given texts into distinct subgroups according to predefined characteristics. As with any classification task, text classification can be separated into binary, multilabel, and multiclass classification.
Kowsari, Meimandi et al. (2019) provide a recent survey of text classification approaches. Traditional approaches include techniques such as the Rocchio algorithm (Rocchio, 1971), boosting (Schapire, 1990), bagging (Breiman, 1996), and logistic regression (Cox & Snell, 1989), as well as naïve Bayes. Other classical approaches include k-nearest neighbors and support vector machines (Vapnik & Chervonenkis, 1964). More recent approaches mostly utilize deep learning. Recurrent neural networks (Rumelhart, Hinton, & Williams, 1986) and long short-term memory networks (LSTMs) (Hochreiter & Schmidhuber, 1997) had been the predominant approaches for representing language and solving language-related tasks until the rise of transformer-based models.
Transformer-based models can be generally separated into autoregressive and autoencoding models. Autoregressive models such as Transformer-XL (Dai, Yang et al., 2019) learn representations for individual word tokens sequentially, whereas autoencoding models such as BERT (Devlin, Chang et al., 2019) are able to learn representations in parallel using the entirety of the document, even words found after the word token. Newer autoregressive models such as XLNet (Yang, Dai et al., 2019) combine features from both categories and are able to achieve state-of-the-art performance. Additionally, other variants of the BERT model exist, such as ALBERT (Lan, Chen et al., 2020) and RoBERTa (Liu, Ott et al., 2019). Furthermore, specialized BERT variants have been created. One such variant is SciBERT (Beltagy, Lo, & Cohan, 2019), which specializes in academic texts.
4.2.2. Tagging
Tagging—based on extracting the tags from a text—can be considered synonymous with keyword extraction. To extract keywords from publications’ full texts, several approaches and challenges have been proposed (Alzaidy, Caragea, & Giles, 2019; Florescu & Caragea, 2017; Kim, Medelyan et al., 2013), exploiting publications’ structures, such as citation networks (Caragea, Bulgarov et al., 2014). In our scenario, we use publications’ abstracts, as the full texts are not available in the MAKG. Furthermore, we focus on keyphrase extraction methods requiring no additional background information and not designed for specific tasks, such as text summarization.
TextRank (Mihalcea & Tarau, 2004) is a graph-based ranking model for text processing. It performs well for tasks such as keyword extraction as it does not rely on local context to determine the importance of a word, but rather uses the entire context through a graph. For every input text, the algorithm splits the input into fundamental units (words or phrases depending on the task) and structures them into a graph. Afterwards, an algorithm similar to PageRank determines the relevance of each word or phrase to extract the most important ones.
Another popular algorithm for keyword extraction is RAKE, which stands for rapid automatic keyword extraction (Rose, Engel et al., 2010). In RAKE, the text is split into candidate phrases using a predefined list of stop words and phrase delimiters; a less comprehensive stop word list therefore leads to longer phrases. In contrast, TextRank splits the text into individual words first and combines words that benefit from each other’s context at a later stage of the algorithm. Overall, RAKE is more suitable for text summarization tasks due to its longer extracted key phrases, whereas TextRank is suitable for extracting the shorter keywords used for tagging, in line with our task. Moreover, in their original publication, the authors of TextRank applied their algorithm to keyword extraction from publications’ abstracts. For these reasons, we use TextRank for publication tagging.
4.3. Approach
Our approach is to fine-tune a state-of-the-art transformer model for the task of text classification. We use the given publications’ abstracts as input to classify each paper into one of 19 top-level field of study labels (i.e., level 0) predefined by the MAG (see Table 11). After that, we apply TextRank to extract keyphrases and assign them to papers.
4.4. Evaluation
4.4.1. Evaluation data
For the evaluation, we produce three labeled data sets in an automatic fashion. Two of the data sets are used to evaluate the current field of study labels in the MAKG (and MAG) and the given MAKG field of study hierarchy, while the last data set acts as our source for training and evaluating our approach for the field of study classification.
In the following, we describe our approaches for generating our three data sets.
For our first data set, we select field of study labels directly from the MAKG. As mentioned previously, the MAKG’s fields of study are provided in a hierarchical structure (i.e., fields of study, such as research topics, can have several fields of study below them). We filter the field of study labels associated with papers for level-0 labels only; that is, we consider only the 19 top-level labels and their assignments to papers. Table 14 lists all 19 level-0 fields of study in the MAKG; these, associated with the papers, are also our 19 target labels for our classifier. This data set is representative of the overall field of study assignment quality of the MAKG, as we compare its field of study labels with our ground truth (see Section 4.4).
For our second data set, we extrapolate field of study labels from the MAKG/MAG using the field of study hierarchy—that is, we relabel the papers using their associated top-level fields of study on level 0. For example, if a paper is currently labeled as “neural network,” we identify its associated level-0 field of study (the top-level field of study in the MAKG). In this case, the paper would be assigned the field of study of “computer science.”
We prepare this data set by first replacing all field of study labels with their respective top-level fields of study. Each field of study assignment in the MAKG has a corresponding confidence score. We thus group all labels by their corresponding level-0 fields of study and determine the final field of study of a given paper by summing the individual scores. For example, consider a paper that originally has the field of study labels “neural network” with a confidence score of 0.6, “convolutional neural network” with a confidence score of 0.5, and “graph theory” with a confidence score of 0.8. The labels “neural network” and “convolutional neural network” are mapped back to the top-level field of study “computer science,” whereas “graph theory” is mapped back to “mathematics.” To calculate the final score for each discipline, we total the confidence scores of every occurrence of a given label. In our example, “computer science” would have a score of 0.6 + 0.5 = 1.1 and “mathematics” a score of 0.8, resulting in the paper being labeled as “computer science.”
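For illustration, the aggregation described above can be expressed as a minimal Python sketch. The mapping from fields of study to their level-0 ancestors (`fos_to_level0`) and the per-paper label scores are hypothetical placeholders standing in for the corresponding MAKG data.

```python
from collections import defaultdict

# Hypothetical mapping from a field of study to its level-0 ancestor,
# as derived from the MAKG field of study hierarchy.
fos_to_level0 = {
    "neural network": "computer science",
    "convolutional neural network": "computer science",
    "graph theory": "mathematics",
}

def infer_discipline(paper_labels):
    """paper_labels: list of (field_of_study, confidence) pairs for one paper."""
    scores = defaultdict(float)
    for fos, confidence in paper_labels:
        level0 = fos_to_level0.get(fos)
        if level0 is not None:
            scores[level0] += confidence  # sum confidence scores per discipline
    return max(scores, key=scores.get) if scores else None

# Example from the text: the paper is assigned "computer science" (1.1 > 0.8).
print(infer_discipline([("neural network", 0.6),
                        ("convolutional neural network", 0.5),
                        ("graph theory", 0.8)]))
```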
This approach can be interpreted as a weighted variant of the direct labeling used for our first data set. By analyzing the differences in results between these two data sets, we aim to gain insights into the validity of the hierarchical structure of the fields of study found in the MAG.
Our third data set is created by utilizing the papers’ journal information. We first select a specific set of journals from the MAKG for which the journal papers’ fields of study can easily be identified. This is achieved through simple string matching between the names of top-level fields of study and the names of journals. For instance, if the phrase “computer science” occurs in the name of a journal, we assume it publishes papers in the field of computer science.
We expect the data generated by this approach to be highly accurate, as the journal is an identifying factor of the field of study. However, this approach cannot label all papers in the MAKG: The majority of papers were published in journals whose main disciplines cannot be discerned directly from their names, and a portion of papers have no associated journal entry in the MAKG at all.
We are able to label 2,553 journals in this fashion. We then label all 2,863,258 papers from these given journals using their journal-level field of study labels. We use the resulting data set to evaluate the fields of study in the MAKG as well as to generate training data for the classifier.
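A minimal sketch of this journal-name matching is shown below. The paper only states that simple string matching is used; the word-boundary matching and the example journal titles are our own illustrative assumptions.

```python
import re

# The 19 level-0 disciplines (see Table 14), lowercased for matching.
DISCIPLINES = [
    "computer science", "biology", "political science", "materials science",
    "geography", "chemistry", "economics", "mathematics", "geology",
    "engineering", "physics", "sociology", "business", "medicine",
    "psychology", "art", "history", "philosophy", "environmental science",
]

def label_journal(journal_name):
    """Return the discipline whose name occurs in the journal name, if any."""
    name = journal_name.lower()
    # Match longer discipline names first and use word boundaries so that,
    # e.g., "art" does not match inside "particle".
    for discipline in sorted(DISCIPLINES, key=len, reverse=True):
        if re.search(r"\b" + re.escape(discipline) + r"\b", name):
            return discipline
    return None

print(label_journal("Journal of Computer Science and Technology"))  # computer science
print(label_journal("Nature"))                                      # None
```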
In the latter case, we randomly selected 20,000 abstracts per field of study label, resulting in 333,455 training samples (i.e., paper–field-of-study assignment pairs). The mismatch compared to the theoretical training data size of 380,000 comes from the fact that some labels had fewer than 20,000 papers available to select from.
Table 14. The 19 level-0 fields of study (disciplines) in the MAKG

| MAG ID | Field of study |
|---|---|
| 41008148 | Computer Science |
| 86803240 | Biology |
| 17744445 | Political Science |
| 192562407 | Materials Science |
| 205649164 | Geography |
| 185592680 | Chemistry |
| 162324750 | Economics |
| 33923547 | Mathematics |
| 127313418 | Geology |
| 127413603 | Engineering |
| 121332964 | Physics |
| 144024400 | Sociology |
| 144133560 | Business |
| 71924100 | Medicine |
| 15744967 | Psychology |
| 142362112 | Art |
| 95457728 | History |
| 138885662 | Philosophy |
| 39432304 | Environmental Science |
Our data for evaluating the classifier comes from our third approach, namely the field of study assignment based on journal names. We randomly drew 2,000 samples for each label from the labeled set to form our test data set. Note that the test set does not overlap in any way with the training data set generated through the same approach, as both consist of distinctly separate samples (covering all scientific disciplines). In total, the evaluation set consists of 38,000 samples spread over the 19 disciplines.
4.4.2. Evaluation setup
All our implementations use the Python module Simple Transformers (https://github.com/ThilinaRajapakse/simpletransformers; based on Transformers, https://github.com/huggingface/transformers), which provides a ready-made implementation of transformer-based models for the task of multiclass classification. We set the number of output classes to 19, corresponding to the number of top-level fields of study we are trying to label. As mentioned in Section 4.4.1, we prepare our evaluation data set based on labels generated via journal names. We also prepare our training set from the same data set.
We choose the following model variants for each architecture:
bert-large-uncased for BERT,
scibert_scivocab_uncased for SciBERT,
albert-base-v2 for ALBERT,
roberta-large for RoBERTa, and
xlnet-large-cased for XLNet.
All transformer models were trained on the bwUnicluster using GPU nodes containing four Nvidia Tesla V100 GPUs and an Intel Xeon Gold 6230 processor.
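A condensed sketch of this setup with Simple Transformers is given below. The DataFrame columns and the returned evaluation dictionary follow the library's documented interface; the file names, the Hugging Face model identifier for SciBERT, and all training arguments other than the 19 output classes and two epochs are illustrative assumptions.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Abstracts with integer labels in [0, 18] for the 19 level-0 fields of study
# (file names are illustrative; the required columns are "text" and "labels").
train_df = pd.read_csv("train_abstracts.csv")
eval_df = pd.read_csv("eval_abstracts.csv")

model = ClassificationModel(
    "bert",                              # model type
    "allenai/scibert_scivocab_uncased",  # SciBERT weights (assumed identifier)
    num_labels=19,
    args={"num_train_epochs": 2, "overwrite_output_dir": True},
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)  # includes the Matthews correlation coefficient ("mcc")
```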
4.4.3. Evaluation metrics
We evaluate our model performances using two metrics: the micro-F1 score and the Matthews correlation coefficient (MCC).
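Both metrics can be computed with scikit-learn, as in the following brief sketch; the label arrays are toy examples, not data from our evaluation.

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Toy predictions over the 19 discipline labels (encoded as integers).
y_true = [0, 3, 3, 7, 18, 0]
y_pred = [0, 3, 5, 7, 18, 1]

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("MCC:     ", matthews_corrcoef(y_true, y_pred))
```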
4.4.4. Evaluation results
4.4.4.1. Evaluation of existing field of study labels
In the following, we outline our evaluation concerning the validity of the existing MAG field of study labels. We take our two labeled sets generated by our direct labeling (first data set; 2,863,258 papers) as well as labeling through journal names (third data set) and compare the associated labels on level 0.
As we can see from the results in Table 15, the quality of top-level labels in the MAG can be improved. Out of the 2,863,258 papers, 1,595,579 matching labels were found, corresponding to a 55.73% match, meaning 55.73% of fields of study were labeled correctly according to our ground truth. Table 15 also showcases an in-depth view of the quality of labels for each discipline. We show the total number of papers for each field of study and the number of papers that are correctly classified according to our ground truth, followed by the percentage.
Table 15. Agreement between the existing MAKG level-0 field of study labels (first data set) and the journal-based ground truth (third data set)

| Label | # labels | # matching | % matching |
|---|---|---|---|
| Computer Science | 21,157 | 15,056 | 71.163 |
| Biology | 212,356 | 132,203 | 62.255 |
| Political Science | 12,043 | 4,083 | 33.904 |
| Materials Science | 23,561 | 18,475 | 78.413 |
| Geography | 4,286 | 575 | 13.416 |
| Chemistry | 339,501 | 285,569 | 84.114 |
| Economics | 91,411 | 62,482 | 68.353 |
| Mathematics | 109,797 | 92,519 | 84.264 |
| Geology | 22,600 | 18,377 | 81.314 |
| Engineering | 731,505 | 187,807 | 25.674 |
| Physics | 694,631 | 500,723 | 72.085 |
| Sociology | 10,725 | 9,245 | 86.200 |
| Business | 141,498 | 33,641 | 23.775 |
| Medicine | 311,197 | 186,194 | 59.832 |
| Psychology | 36,080 | 31,834 | 88.232 |
| Art | 23,728 | 4,336 | 18.274 |
| History | 39,938 | 5,161 | 12.923 |
| Philosophy | 19,517 | 6,363 | 32.602 |
| Environm. Science | 17,727 | 936 | 5.280 |
| Total | 2,863,258 | 1,595,579 | 55.726 |
4.4.4.2. Evaluation of MAKG field of study hierarchy
To determine the validity of the existing field of study hierarchy, we compare the indirectly labeled data set (second data set) with our ground truth based on journal names (third data set). The indirectly labeled data set is labeled using inferred information based on the overall MAKG field of study hierarchy (see Section 4.4.1). Here, we want to examine the effect the hierarchical structure would have on the truthfulness of field of study labels. The results can be found in Table 16.
Table 16. Agreement between the level-0 labels inferred via the MAKG field of study hierarchy (second data set) and the journal-based ground truth (third data set)

| Label | # labels | # matching | % matching |
|---|---|---|---|
| Computer Science | 21,157 | 13,055 | 61.705 |
| Biology | 212,356 | 145,671 | 68.598 |
| Political Science | 12,043 | 8,035 | 66.719 |
| Materials Science | 23,561 | 13,618 | 57.799 |
| Geography | 4,286 | 285 | 6.650 |
| Chemistry | 339,501 | 239,576 | 70.567 |
| Economics | 91,411 | 62,025 | 67.853 |
| Mathematics | 109,797 | 79,959 | 72.824 |
| Geology | 22,600 | 15,777 | 69.810 |
| Engineering | 731,505 | 207,063 | 28.306 |
| Physics | 694,631 | 464,083 | 66.810 |
| Sociology | 10,725 | 4,418 | 41.193 |
| Business | 141,498 | 26,095 | 18.442 |
| Medicine | 311,197 | 192,397 | 61.825 |
| Psychology | 36,080 | 25,548 | 70.809 |
| Art | 23,728 | 4,901 | 20.655 |
| History | 39,938 | 3,391 | 8.491 |
| Philosophy | 19,517 | 8,641 | 44.274 |
| Environm. Science | 17,727 | 302 | 1.704 |
| Total | 2,863,258 | 1,514,840 | 52.906 |
The result of this approach is very similar to the previous evaluation. Out of the 2,863,258 papers, we found 1,514,840 labels matching those based on journal names, resulting in a 52.91% match (compared to 55.73% in the previous evaluation). Including the MAKG field of study hierarchy did not improve the quality of labels. For many disciplines, the number of mislabelings increased significantly, further calling into question the quality of the existing MAG labels.
4.4.4.3. Evaluation of classification
In the following, we evaluate the newly created field of study labels for papers determined by our transformer-based classifiers.
We first analyze the effect of training size on the overall results. Although we observe a steady increase in performance with each increase in the size of our training set, the marginal improvement diminishes after a certain point. Therefore, with training time in mind, we decided to limit the training input size to 20,000 samples per label, leading to a theoretical training data size of 380,000 samples. The actual number is slightly smaller, however, because certain labels have fewer than 20,000 training samples in total.
We then compared the performances of various transformer-based models for our task. Table 17 shows performances of our models trained on the same training set after one epoch. As we can see, SciBERT and BERTbase outperform other models significantly, with SciBERT slightly edging ahead in comparison. Surprisingly, the larger BERT variant performs significantly worse than its smaller counterpart.
Table 17. Performance of the transformer models after one training epoch

| Model | MCC | F1-score |
|---|---|---|
| BERTbase | 0.7452 | 0.7584 |
| BERTlarge | 0.6853 | 0.7014 |
| SciBERT | 0.7552 | 0.7678 |
| ALBERT | 0.7037 | 0.7188 |
| RoBERTa | 0.7170 | 0.7316 |
| XLNet | 0.6755 | 0.6920 |
We then compare the effect of training epochs on performance. We limit this comparison to the SciBERT model, as it achieves the best performance after one epoch of training. We fine-tune the same SciBERT model using an identical training set (20,000 samples per label) as well as the same evaluation set. We observe a peak in performance after two epochs (see Table 18). Although performance for certain individual labels keeps improving afterward, the overall performance starts to deteriorate. Therefore, training was stopped after two epochs for our final classifier. Note that we performed similar analyses with other models in a limited fashion; the best performance was generally achieved after two or three epochs, depending on the model.
Table 18. SciBERT performance after different numbers of training epochs

| # of epochs | MCC | F1-score |
|---|---|---|
| 1 | 0.7552 | 0.7678 |
| 2 | 0.7708 | 0.7826 |
| 3 | 0.7665 | 0.7787 |
| 4 | 0.7615 | 0.7739 |
| 5 | 0.7558 | 0.7685 |
Table 19 shows the per-label performance of our SciBERT model after two training epochs on the evaluation set. On average, the classifier achieves a macro-average F1-score of 0.78. In the detailed results for each label, we highlight labels that achieve scores more than one standard deviation above or below the average.
Table 19. Per-label performance of the SciBERT classifier after two training epochs

| Label | Precision | Recall | F1 | # samples |
|---|---|---|---|---|
| Computer Science | 0.77 | 0.83 | 0.80 | 2,000 |
| Biology | 0.83 | 0.84 | 0.84 | 2,000 |
| Political Science | 0.83 | 0.81 | 0.82 | 2,000 |
| Materials Science | 0.78 | 0.83 | 0.80 | 2,000 |
| Geography | 0.96 | 0.67 | 0.79 | 2,000 |
| Chemistry | 0.79 | 0.80 | 0.80 | 2,000 |
| Economics | 0.66 | 0.68 | 0.67 | 2,000 |
| Mathematics | 0.79 | 0.81 | 0.80 | 2,000 |
| Geology | 0.90 | 0.94 | 0.92 | 2,000 |
| Engineering | 0.58 | 0.49 | 0.53 | 2,000 |
| Physics | 0.84 | 0.81 | 0.83 | 2,000 |
| Sociology | 0.81 | 0.70 | 0.75 | 2,000 |
| Business | 0.65 | 0.69 | 0.67 | 2,000 |
| Medicine | 0.84 | 0.84 | 0.84 | 2,000 |
| Psychology | 0.85 | 0.89 | 0.87 | 2,000 |
| Art | 0.68 | 0.76 | 0.72 | 2,000 |
| History | 0.70 | 0.75 | 0.72 | 2,000 |
| Philosophy | 0.81 | 0.81 | 0.81 | 2,000 |
| Environm. Science | 0.79 | 0.86 | 0.82 | 2,000 |
| Macro average | 0.78 | 0.78 | 0.78 | 38,000 |
Classification performances for the majority of labels are similar to the overall average, though some outliers can be found.
Overall, the setup is especially adept at classifying papers from the fields of geology (0.94), psychology (0.87), medicine (0.84), and biology (0.84); whereas it performs the worst for engineering (0.53), economics (0.67), and business (0.67). The values in parentheses are the respective F1-scores achieved during classification.
We suspect the performance differences to be a result of the breadth of vocabulary used in each discipline. Disciplines for which the classifier performs well usually use highly specific and technical vocabularies. Engineering illustrates the opposite case: It is an agglomeration of multiple disciplines, such as physics, chemistry, and biology, and therefore encompasses their respective vocabularies as well.
4.4.5. Keyword extraction
As outlined in Section 4.3, we apply TextRank to extract keywords from text and assign them to publications. We use “pytextrank” (https://github.com/DerwenAI/pytextrank/), a Python implementation of the TextRank algorithm, as our keyword extractor. Due to the generally short length of an abstract, we limit the number of keywords/key phrases to five. A greater number of keywords would inevitably introduce additional “filler phrases,” which are not conducive to representing the content of a given abstract. Further statistics about the keywords are given in Section 6.
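The keyword extraction step can be sketched as follows, assuming pytextrank 3.x registered as a spaCy pipeline component; the example abstract text is illustrative.

```python
import spacy
import pytextrank  # registers the "textrank" spaCy pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

abstract = ("We present methods for enhancing a large-scale scholarly "
            "knowledge graph, including author name disambiguation, "
            "field of study classification, and entity embeddings.")

doc = nlp(abstract)
# Keep the top five ranked phrases as tags for the publication.
tags = [phrase.text for phrase in doc._.phrases[:5]]
print(tags)
```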
4.5. Discussion
In the following, we discuss certain challenges faced, lessons learned, and future outlooks.
Our classification approach relies on the existing top-level (level-0) fields of study found in the MAKG. Instead, we could have established an entirely new selection of disciplines as our label set. It would also be possible to adopt an established classification scheme, such as the ACM Computing Classification System (https://dl.acm.org/ccs) or the Computer Science Ontology (Salatino, Thanapalasingam et al., 2018). However, to the best of our knowledge, there is no equivalent classification scheme covering the entirety of research topics found in the MAKG, which was a major factor leading us to retain the field of study system.
Regarding keyword extraction, grouping the extracted keywords and key phrases and building a taxonomy or ontology are natural continuations of this work. We suggest that categories be constructed at the level of individual disciplines, rather than having a fixed category scheme for all possible fields of study. For instance, within the discipline of computer science, one could try to categorize tasks, data sets, approaches, and so forth from the list of extracted keywords. Brack, D’Souza et al. (2020) and Färber et al. (2021) recently published such entity recognition approaches; both adapted the SciBERT architecture to extract scientific concepts from paper abstracts.
Future researchers can expand our extracted tags by enriching them with additional relationships to recreate a similar structure to the current MAKG field of study hierarchy. Approaches such as the Scientific Information Extractor (Luan, He et al., 2018) could be applied to categorize or to establish relationships between keywords, building an ontology or rich knowledge graph.
5. KNOWLEDGE GRAPH EMBEDDINGS
5.1. Motivation
Embeddings provide an implicit knowledge representation for otherwise symbolic information. They are often used to represent concepts in a fixed low-dimensional space. Traditionally, embeddings are used in the field of natural language processing to represent vocabularies, allowing computer models to capture the context of words and, thus, the contextual meaning.
Knowledge graph embeddings follow a similar principle, in which the vocabulary consists of entities and relation types. The final embedding encompasses the relationships between specific entities but also generalizes relations for entities of similar types. The embeddings retain the structure and relationships of information from the original knowledge graph and facilitate a series of tasks, such as knowledge graph completion, relation extraction, entity classification, question answering, and entity resolution (Wang, Mao et al., 2017).
Färber (2019) published pretrained embeddings for MAKG publications using RDF2Vec (Ristoski, 2017) as an “add-on” to the MAKG. Here, we provide an updated version of embeddings for a newer version of the MAG data set and for a variety of entity types instead of papers alone. We experiment with various types of embeddings and provide evaluation results for each approach. Finally, we provide embeddings for millions of papers and thousands of journals and conferences, as well as millions of disambiguated authors.
In the following, we introduce related work in Section 5.2. Section 5.3 describes our approach to knowledge graph embedding computation, followed by our evaluation in Section 5.4. We conclude in Section 5.5.
5.2. Related Work
Generally, knowledge graphs are described using triplets of the form (h, r, t), referring to the head entity h ∈ E, the relation r ∈ R, and the tail entity t ∈ E, where E denotes the set of entities and R the set of relation types. Nguyen (2017) and Wang et al. (2017) provide overviews of existing approaches for creating knowledge graph embeddings, as well as differences in complexity and performance.
Within the existing literature, there have been numerous approaches to train embeddings for knowledge graphs. Generally speaking, the main difference between the approaches lies in the scoring function used to calculate the similarity or distance between two triplets. Overall, two major families of algorithms exist: ones using translational distance models and ones using semantic matching models.
Translational distance models use distance-based scoring functions to determine the plausibility of specific triplets within a given knowledge graph (Wang et al., 2017). More specifically, the head entity of a triplet is projected as a point in a fixed-dimensional space, and the relation is represented, for example, as a translation vector originating from the head entity. The distance between the translated head entity and the tail entity in this space then measures the plausibility of the triplet. One such example is the TransE algorithm (Bordes, Usunier et al., 2013). The standard TransE model does not perform well on knowledge graphs with one-to-many, many-to-one, or many-to-many relationships (Wang, Zhang et al., 2014) because the tail entities’ embeddings are heavily influenced by the relations: Two tail entities that share the same head entity and relation are similar in the embedding space created by TransE, even if they are entirely different concepts in the real world. To overcome these deficits of TransE, TransH (Wang et al., 2014) was introduced, which models relations on relation-specific hyperplanes and can thus distinguish tail entities sharing a common head entity and relation. Later, TransR (Lin, Liu et al., 2015) was introduced, which models relations in a separate relation space rather than on hyperplanes as in TransH. Training efficiency was later improved with the TransD model (Ji, He et al., 2015).
Semantic matching models compare similarity scores to determine the plausibility of a given triplet. Here, relations are not modeled as vectors similar to entities, but rather as matrices describing interactions between entities. Such approaches include RESCAL (Nickel, Tresp, & Kriegel, 2011), DistMult (Yang, Yih et al., 2015), HolE (Nickel, Rosasco, & Poggio, 2016), ComplEx (Trouillon et al., 2016), and others.
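To make the distinction between the two families concrete, the following sketch contrasts the TransE translational distance score with the DistMult semantic matching score for a single triplet; random vectors stand in for trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=100) for _ in range(3))  # head, relation, tail embeddings

# TransE: a triplet is plausible if the translated head lies close to the tail,
# i.e., the (negative) distance ||h + r - t|| is small.
transe_score = -np.linalg.norm(h + r - t, ord=1)

# DistMult: a bilinear (semantic matching) score with a diagonal relation matrix,
# i.e., the sum of element-wise products of head, relation, and tail.
distmult_score = np.sum(h * r * t)

print(transe_score, distmult_score)
```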
More recent approaches use neural network architectures to represent relation embeddings. ConvE, for instance, represents head entity and relations as input and tail entity as output of a convolutional neural network (Dettmers, Minervini et al., 2018). ParamE extends the approach by representing relations as parameters of a neural network used to “translate” the input of head entity into the corresponding output of tail entity (Che, Zhang et al., 2020).
In addition, there are newer variations of knowledge graph embeddings, for example using textual information (Lu, Cong, & Huang, 2020) and literals (Gesese, Biswas et al., 2019; Kristiadi, Khan et al., 2019). Overall, we decided to use established methods to generate our embeddings for stability in results, performance during training, and compatibility with file formats and graph structure.
5.3. Approach
We experiment with various embedding types and compare their performance on our data set. We include both translational distance models and semantic matching models of the following types: TransE (Bordes et al., 2013), TransR (Lin, Liu et al., 2015), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), and RESCAL (Nickel et al., 2011) (see Section 5.2 for an overview of how these approaches differ from each other). The reasoning behind these choices is as follows: The embedding types need to be state-of-the-art and widespread, thereby acting as a basis for comparison. In addition, there needs to be an efficient implementation to train each embedding type, as runtime is a limiting factor. For example, the paper embeddings by Färber (2019) were trained using RDF2Vec (Ristoski, 2017) and took 2 weeks to complete. RDF2Vec did not scale well enough to cover all authors and other entities in the MAKG. Moreover, current implementations of RDF2Vec, such as pyRDF2Vec, are not designed for such a large scale: “Loading large RDF files into memory will cause memory issues as the code is not optimized for larger files” (https://github.com/IBCNServices/pyRDF2Vec). This turned out to be true when running RDF2Vec on the MAKG. For the differences between RDF2Vec and other algorithms, such as TransE, we refer to Portisch, Heist, and Paulheim (2021).
5.4. Evaluation
5.4.1. Evaluation data
Our aim is to generate knowledge graph embeddings for the entities of type papers, journals, conferences, and authors to solve machine learning-based tasks, such as search and recommendation tasks. The RDF representations can be downloaded from the MAKG website (https://makg.org/).
We first select the required data files containing the entities of our chosen entity types and combine them into a single input. Ideally, we would train paper and author embeddings simultaneously, such that they benefit from each other’s context. However, the required memory space proved to be a limiting factor given the more than 200 million authors and more than 200 million papers. Ultimately, we train embeddings for papers, journals, and conferences together; we train the embeddings for authors separately.
Due to the large number of entities within the knowledge graph, we try to minimize the overall input size and thereby the memory requirements for training. We first restrict the input to the relations we aim to model. To further reduce memory consumption, we “abbreviate” relations by removing their URI prefixes.
Furthermore, we use a mapping for entities and relations to further reduce memory consumption. All entities and relations are mapped to a specific index in the form of an integer. In this way, all statements within the knowledge graph are reduced to a triple of integers and used as input for training together with the mapping files.
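A minimal sketch of this preprocessing step is shown below: RDF terms are mapped to consecutive integer IDs and the triples are written as a compact tab-separated file. The property names, prefixes, and file name are illustrative assumptions, not the actual MAKG vocabulary.

```python
import csv

def build_id_maps(triples):
    """Map every entity and relation to a consecutive integer index."""
    entity_ids, relation_ids = {}, {}
    for head, relation, tail in triples:
        for entity in (head, tail):
            entity_ids.setdefault(entity, len(entity_ids))
        relation_ids.setdefault(relation, len(relation_ids))
    return entity_ids, relation_ids

triples = [
    ("makg:Paper1", "cites", "makg:Paper2"),          # illustrative triples with
    ("makg:Paper1", "hasDiscipline", "makg:FoS1"),    # prefix-stripped relations
]

entity_ids, relation_ids = build_id_maps(triples)
with open("train.tsv", "w", newline="") as f:      # illustrative file name
    writer = csv.writer(f, delimiter="\t")
    for h, r, t in triples:
        writer.writerow([entity_ids[h], relation_ids[r], entity_ids[t]])
```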
5.4.2. Evaluation setup
We use the Python package DGL-KE (Zheng et al., 2020) for our implementation of the knowledge graph embedding algorithms. DGL-KE is a recently published package optimized for training knowledge graph embeddings at a large scale. It outperforms other state-of-the-art packages while achieving linear scaling with machine resources as well as high model accuracies. We set the dimension size of our output embeddings to 100; we chose this limit due to the memory constraints of training higher-dimensional embeddings. We experimented with a dimension size of 150 but did not observe any improvement in our metrics, and even higher embedding sizes resulted in out-of-memory errors on our setup. The exact hyperparameter choices are listed in Table 20. We perform the evaluation by randomly masking entities and relations and predicting the missing part (i.e., link prediction).
Table 20. Hyperparameters used for training the knowledge graph embeddings

| Hyperparameter | Value |
|---|---|
| Embedding size | 100 |
| Maximum training step | 1,000,000 |
| Batch size | 1,000 |
| Negative sampling size | 1,000 |
We perform training on the bwUnicluster using GPU nodes with eight Nvidia Tesla V100 GPUs and 752 GB of RAM. We use the standard ranking metrics Hits@k, mean rank (MR), and mean reciprocal rank (MRR).
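For reference, these ranking metrics can be computed from the per-test-triple ranks as in the following sketch; the rank values are toy examples.

```python
import numpy as np

def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR, and Hits@k from the rank of each true entity."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MR": ranks.mean(), "MRR": (1.0 / ranks).mean()}
    for k in ks:
        metrics[f"Hits@{k}"] = (ranks <= k).mean()
    return metrics

print(ranking_metrics([1, 2, 1, 5, 12, 1]))  # toy ranks of the masked entities
```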
5.4.3. Evaluation results
Our evaluation results can be found in Table 21. Note that performing a full-scale analysis of the effects of the hyperparameters on the embedding quality was out of the scope of this paper. Results are based on embeddings trained on paper, journal, and conference entities. We observed an average mean rank of 1.301 and a mean reciprocal rank of 0.958 for the best-performing embedding type.
Table 21. Link prediction results for the paper/journal/conference embeddings (*TransR was trained with a maximum of 250,000 training steps instead of 1,000,000)

| Metric | TransR* | TransE | RESCAL | ComplEx | DistMult |
|---|---|---|---|---|---|
| Average MR | 105.598 | 15.224 | 4.912 | 1.301 | 2.094 |
| Average MRR | 0.388 | 0.640 | 0.803 | 0.958 | 0.923 |
| Average HITS@1 | 0.338 | 0.578 | 0.734 | 0.937 | 0.893 |
| Average HITS@3 | 0.403 | 0.659 | 0.851 | 0.975 | 0.945 |
| Average HITS@10 | 0.474 | 0.769 | 0.920 | 0.992 | 0.977 |
| Training time | 10 hours | 8 hours | 18 hours | 8 hours | 8 hours |
Interestingly, TransE and TransR greatly outperform the other algorithms when trained with few training steps (1,000). With more training steps, the more modern models, such as ComplEx and DistMult, achieve state-of-the-art performance. Across all metrics, ComplEx, which is based on complex-valued instead of real-valued embeddings, achieves the best results (e.g., MRR of 0.958 and HITS@1 of 0.937) while having training times competitive with the other methods. A direct comparison of these evaluation results with evaluation results for link prediction with embeddings in the general domain is not possible, in our view, because the performance depends heavily on the training and test data used. However, it is remarkable that embedding methods that perform quite well on our task (e.g., RESCAL) do not perform as well in the general domain (e.g., on the data sets WN18 and FB15K) (Dai, Wang et al., 2020), while the embedding method that performs best in our case, namely ComplEx, also counts as state-of-the-art in the general domain (Dai et al., 2020).
It is important to note that we trained the TransR embedding type with a maximum of 250,000 training steps, compared to 1,000,000 for all other embedding types. This is due to TransR’s extremely long training time; we were unable to finish training within 48 hours and therefore had to reduce the number of training steps manually. The effect is visible in its performance, although with few training steps TransR performed similarly to TransE.
Table 22 shows the quality of our final embeddings, which we published at https://makg.org/.
Table 22. Quality of the final published embeddings

| Metric | Author | Paper/Journal/Conference |
|---|---|---|
| Average MR | 2.644 | 1.301 |
| Average MRR | 0.896 | 0.958 |
| Average HITS@1 | 0.862 | 0.937 |
| Average HITS@3 | 0.918 | 0.975 |
| Average HITS@10 | 0.960 | 0.992 |
5.5. Discussion
The main challenge of this task lies in the hardware requirements for training embeddings at such a large scale. Even after the measures we took to reduce memory consumption, training still required a significant amount of memory; for example, we were not able to train publication and author embeddings simultaneously given 750 GB of memory. Given additional resources, future researchers could increase the dimensionality of the embeddings, which might increase performance.
Other embedding approaches may be suitable for our case as well, though the limiting factor here is the large size of the input graph: Any approach needs to be scalable and perform efficiently on such large data sets. One of the limiting factors for choosing embedding types (e.g., TransE) is the availability of an efficient implementation. DGL-KE provides such implementations, but only for a select number of embedding types. In the future, as other implementations become publicly available, further evaluations may be performed. Alternatively, custom implementations could be developed, though such tasks are not the subject of this paper.
Future researchers might further experiment with various combinations of hyperparameters. We noticed a strong effect of the number of training steps on the embedding quality of the various models. Other effects might be uncovered with additional experimentation.
6. KNOWLEDGE GRAPH PROVISIONING AND STATISTICAL ANALYSIS
In this section, we outline how we provide the enhanced MAKG. Furthermore, we show the results of a statistical analysis on various aspects of the MAKG.
6.1. Knowledge Graph Provisioning
For creating the enhanced MAKG, we followed the initial schema and data model of Färber (2019). However, we introduced new properties to model novel relationships and data attributes. A list of all new properties to the MAKG ontology can be found in Table 23. An updated schema for the MAKG is in Figure 7 and on the MAKG homepage, together with the updated ontology.
Besides the MAKG, Wikidata models millions of scientific publications. Thus, similar to the initial MAKG (Färber, 2019), we created mappings between the MAKG and Wikidata in the form of owl:sameAs statements. Using the DOI as a unique identifier for publications, we were able to create 20,872,925 links between the MAKG and Wikidata.
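A simplified sketch of this DOI-based linking is shown below; the DOI-to-URI dictionaries stand in for the full MAKG and Wikidata data sets, and the example DOIs and entity URIs are illustrative.

```python
def link_by_doi(makg_papers, wikidata_items):
    """Yield owl:sameAs triples for papers that share a (normalized) DOI."""
    wikidata_by_doi = {doi.lower(): uri for doi, uri in wikidata_items.items()}
    for doi, makg_uri in makg_papers.items():
        wikidata_uri = wikidata_by_doi.get(doi.lower())
        if wikidata_uri is not None:
            yield f"<{makg_uri}> <http://www.w3.org/2002/07/owl#sameAs> <{wikidata_uri}> ."

makg_papers = {"10.1000/example.1": "https://makg.org/entity/123"}          # illustrative
wikidata_items = {"10.1000/EXAMPLE.1": "http://www.wikidata.org/entity/Q1"}  # illustrative
for triple in link_by_doi(makg_papers, wikidata_items):
    print(triple)
```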
The MAKG RDF files—containing 8.7 billion RDF triples as the core part—are available at https://doi.org/10.5281/zenodo.4617285. The updated SPARQL endpoint is available at https://makg.org/sparql.
6.2. General Statistics
Similar to analyses performed by Herrmannova and Knoth (2016) and Färber (2019), we aim to provide some general data set statistics regarding the content of the MAKG. Since the last publication, the MAG has received many updates in the form of additional data entries, as well as some small to moderate data schema changes. Therefore, we aim to provide some up-to-date statistics of the MAKG and further detailed analyses of other areas.
We carried out all analyses using the MAKG based on the MAG data as of June 2020 and our modified variants (i.e., custom fields of study and the enhanced author set). Table 24 shows general statistics of the enhanced MAKG. In the following, we describe key statistics in more detail.
Table 24. General statistics of the original and the enhanced MAKG

| Entity type | # in MAG/MAKG | # in enhanced MAKG |
|---|---|---|
| Papers | 238,670,900 | 238,670,900 |
| Paper abstracts | 139,227,097 | 139,227,097 |
| Authors | 243,042,675 | 151,355,324 |
| Affiliations | 25,767 | 25,767 |
| Journals | 48,942 | 48,942 |
| Conference series | 4,468 | 4,468 |
| Conference instances | 16,142 | 16,142 |
| Unique fields of study | 740,460 | 740,460 |
| ORCID iDs | – | 34,863 |
6.2.1. Authors
The original MAKG encompasses 243,042,675 authors, of which 43,514,250 had an affiliation given in the MAG. Our disambiguation approach reduced this set to 151,355,324 authors.
Table 25 shows author statistics with respect to publication and cooperation. The average paper in the MAKG has 2.70 authors, and the paper with the most authors has 7,545 authors. On average, an author published 2.65 papers according to the MAKG; the author with the highest number of papers published 8,551 papers. The average author cooperated with 10.69 other authors across their combined work, with the most “connected” author having 65,793 coauthors overall, which might be plausible but is likely inflated to some extent by unclean data.
Table 25. Author statistics with respect to publication and cooperation

| Metric | Value |
|---|---|
| Average authors per paper | 2.6994 |
| Maximum authors per paper | 7,545 |
| Average papers per author | 2.6504 |
| Maximum papers per author | 8,551 |
| Average coauthors per author | 10.6882 |
| Maximum coauthors per author | 65,793 |
6.2.2. Papers
We first analyze the composition of paper entities by their associated type (see Table 2). The most frequently found document type is journal articles, followed by patents. A huge proportion of paper entities in the MAKG do not have a document type.
In the following, we analyze the number of citations and references for papers within the MAKG. The results can be found in Table 26.
Table 26. Reference and citation statistics for papers in the MAKG

| Key statistic | Value |
|---|---|
| Average references | 6.8511 |
| At least one reference | 78,684,683 |
| Average references (filtered) | 20.7813 |
| Median references (filtered) | 12 |
| Most references | 26,690 |
| Average citations | 6.8511 |
| At least one citation | 90,887,343 |
| Average citations (filtered) | 17.9912 |
| Median citations (filtered) | 4 |
| Most citations | 252,077 |
The average paper in the MAKG references 6.85 papers and received 6.85 citations. The exact match in numbers here seems too unlikely to be coincidental. Therefore, we suspect these numbers to be a result of a closed referencing system of the original MAG, meaning references for a paper are only counted if they reference another paper within the MAG; and citations are only counted if a paper is cited by another paper found in the MAKG. When we remove papers with zero references, we are left with a set of 78,684,683 papers. The average references per paper from the filtered paper set is now 20.78. In the MAKG, 90,887,343 papers are cited at least once, with the average among this new set being 17.99. As averages are highly susceptible to outliers, which were frequent in our data set due to unclean data and the power law distribution of scientific output, we also calculated the median of references and citations. These values should give us a more representative picture of reality. The paper with the most references from the MAG has 26,690 references, whereas the paper with the most citations received 252,077 citations as of June 2020.
Table 27 shows detailed reference and citation statistics for each document type found in our (enhanced) MAKG. Unsurprisingly, books have the highest number of references on average due to their length, followed by journal papers (and book sections). However, the median value for books is lower than for journals, likely due to outliers. In terms of citations, books and journal papers again are the most cited document types on average; again, journal papers have fewer citations on average than books but a higher median value.
Table 27. Reference and citation statistics per document type

| Statistic | Journal | Conference | Patent | Book | BookSection | Repository | Data Set | No data |
|---|---|---|---|---|---|---|---|---|
| Average references | 13.089 | 10.309 | 3.470 | 2.460 | 3.286 | 11.649 | 0.063 | 2.782 |
| At least one reference | 42,660,071 | 3,913,744 | 19,023,288 | 93,644 | 339,439 | 1,305,000 | 130 | 11,349,367 |
| Average references (filtered) | 26.313 | 12.400 | 9.643 | 56.315 | 26.268 | 14.988 | 18.969 | 21.758 |
| Median references (filtered) | 20 | 10 | 5 | 15 | 6 | 7 | 7 | 10 |
| Most references | 13,220 | 4,156 | 19,352 | 5,296 | 7,747 | 2,092 | 196 | 26,690 |
| Average citations | 14.729 | 9.024 | 3.225 | 29.206 | 0.813 | 2.251 | 0.188 | 1.019 |
| At least one citation | 50,599,935 | 3,063,123 | 22,591,991 | 1,299,728 | 351,448 | 549,526 | 1,187 | 12,430,405 |
| Average citations (filtered) | 24.963 | 13.869 | 7.547 | 48.177 | 6.277 | 6.878 | 6.240 | 7.274 |
| Median citations (filtered) | 8 | 4 | 3 | 7 | 2 | 2 | 1 | 2 |
| Most citations | 252,077 | 34,134 | 32,096 | 137,596 | 4,119 | 20,503 | 633 | 103,540 |
Figure 8 shows the number of papers published each year in the time span recorded by the MAKG (1800–present). The number of publications has been on a steady exponential trajectory. This is, of course, partly due to advances in the digitalization of libraries and journals, as well as the increasing ease of accessing new research papers. However, we can certainly attribute a large part of the growth to the increasing number of publications every year (Johnson et al., 2018).
Interestingly, the average number of references per paper has been steadily increasing (see Figure 9 and Johnson et al. (2018)). This could be due to several reasons. First, as scientific fields develop and grow, entirely novel work becomes increasingly rare. Rather, researchers publish work built on top of previous research (“on the shoulders of giants”), leading to a growing number of references in new publications; the increasing number of research papers further contributes to more works being available for referencing. Second, developments in technology, such as digital libraries, enable the spread of research and ease the sharing of ideas and communication between researchers (see, for example, the open access efforts (Piwowar, Priem et al., 2018)). A researcher today therefore has a huge advantage in accessing other papers and publications, and this ease of access could contribute to more works being referenced. Third, as the MAKG is (most likely) a closed reference system, meaning referenced papers are only included if they are part of the MAKG, and as modern publications are more likely to be included in the MAKG, newer papers will automatically have a higher number of recorded references in the MAKG. Although this is a possibility, we do not suspect it to be the main reason behind the rising number of references. Most likely, the cause is a combination of several factors.
Surprisingly, the average number of citations a paper receives has also increased, as shown in Figure 10. Intuitively, one would assume older papers to receive more citations on average purely due to longevity. However, as our graph shows, the number of citations an average paper receives has increased since the turn of the last century. We observe a peak around 1996, after which the age of a paper appears to show its effect: Coupled with the exponential growth in the number of publications, the average number of citations per paper drops for more recent papers.
Figure 11 shows the average number of authors per paper per year and publication type, using the MAKG paper’s publication year. As we can observe, there has been a clear upward trend for the average number of authors per paper specifically concerning journal articles, conference papers, and patents since the 1970s. The level of cooperation within the scientific community has grown, partly led by the technological developments that enable researchers to easily connect and cooperate. This finding reconfirms the results from the STM report 2018 (Johnson et al., 2018).
6.2.3. Fields of study
In the following, we analyze the development of fields of study over time. First, Figure 12 showcases the current number of publications per top-level field of study within the MAKG. Each field of study here has two distinct values. The blue bars represent the field of study as labeled by the MAKG, whereas the red bars are labels as generated by our custom classifier. Importantly, there is a discrepancy between the total number of paper labels between the original MAKG field of study labels and our custom labels. The original MAG hierarchy includes labels for 199,846,956 papers. Our custom labels are created through classification of paper abstracts and are therefore limited by the number of abstracts available in the data set; thus, we only generated labels for 139,227,097 papers. Rather surprisingly, the disciplines of medicine and materials science are the most common fields of study within the MAG, according to the original MAG field of study labels. According to our classification, engineering and medicine are the most represented disciplines.
Evaluating the cumulative number of papers associated with the different fields of study over the years, we can confirm the exponential growth of scientific output shown by Larsen and von Ins (2010). In many areas, our data show greater rates of growth than previously anticipated.
Figure 13 shows the interdisciplinary works of authors. Here, we modeled the relationships between fields of study in a chord graph. Each chord between two fields of study represents authors who have published papers in both disciplines. The thickness of each chord is representative of the number of authors who have done so. We observe strong relationships between the disciplines of biology and medicine, materials science and engineering, and computer science and engineering. Furthermore, there is a moderately strong relationship between the disciplines of chemistry and medicine, biology and engineering, and chemistry and biology. The multitude of links between engineering and other disciplines could be due to mislabeling of engineering papers, as our classifier is not adept at classifying papers from engineering in comparison to other fields of study, as shown in Table 19.
7. CONCLUSION AND OUTLOOK
In this paper, we developed and applied several methods for enhancing the MAKG, a large-scale scholarly knowledge graph. First, we performed author name disambiguation on the set of 243 million authors using background information, such as the metadata of 239 million publications. Our classifier achieved a precision of 0.949, a recall of 0.755, and an accuracy of 0.991. We managed to reduce the number of total author entities from 243 million to 151 million.
Second, we reclassified existing papers from the MAKG into a distinct set of 19 disciplines (i.e., level-0 fields of study). We performed an evaluation of existing labels and determined 55% of the existing labels to be accurate, whereas our newly generated labels achieved an accuracy of approximately 78%. We then assigned tags to papers based on the papers’ abstracts to create a more suitable description of paper content in comparison to the preexisting rigid field of study hierarchy in the MAKG.
Third, we generated entity embeddings for all paper, journal, conference, and author entities. Our evaluation showed that ComplEx was the best performing large-scale entity embedding method that we could apply to the MAKG.
Finally, we performed a statistical analysis on key features of the enhanced MAKG. We updated the MAKG based on our results and provided all data sets, as well as the updated MAKG, online at https://makg.org and https://doi.org/10.5281/zenodo.4617285.
Future researchers could further improve upon our results. For author name disambiguation, we believe the results could be further improved by incorporating additional author information from other sources. For field of study classification, future approaches could develop ways to organize our generated paper tags into a more hierarchical system. For the trained entity embeddings, future research could generate embeddings at a higher dimensionality. This was not possible because of the lack of existing efficient scalable implementations of most algorithms. Beyond these enhancements, the MAKG should be enriched with the key content of scientific publications, such as research data sets (Färber & Lamprecht, 2022), scientific methods (Färber et al., 2021), and research contributions (Jaradeh et al., 2019b).
AUTHOR CONTRIBUTIONS
Michael Färber: Conceptualization, Data curation, Investigation, Methodology, Resources, Supervision, Visualization, Writing—review & editing. Lin Ao: Conceptualization, Data curation, Investigation, Methodology, Resources, Software, Visualization, Writing—original draft.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
The authors did not receive any funding for this research.
DATA AVAILABILITY
We provide all generated data online to the public at https://makg.org and https://doi.org/10.5281/zenodo.4617285 under the ODC-BY license (https://opendatacommons.org/licenses/by/1-0/). Our code is available online at https://github.com/lin-ao/enhancing_the_makg.