Abstract
Knowledge bases play an important role in machine understanding and are widely used in applications such as search engines, recommendation systems and question answering. However, most knowledge bases are incomplete, which causes many downstream applications to perform poorly because they cannot find the facts they need. In this paper, we propose an extraction-and-verification framework to enrich knowledge bases. Specifically, based on an existing knowledge base, we first extract new facts from the description texts of entities. However, not all newly extracted facts can be added directly to the knowledge base, because extraction may introduce errors. We therefore propose a novel crowdsourcing-based verification step to verify the candidate facts. Finally, we apply this framework to the existing knowledge base CN-DBpedia and construct a new version, CN-DBpedia2, which additionally contains the high-confidence facts extracted from the description texts of entities.
1. INTRODUCTION
In recent years, there has been a great amount of effort to harvest knowledge from the Web, and a variety of knowledge graphs (KGs) or knowledge bases (KBs) have been constructed, such as YAGO [1], DBpedia [2], Freebase [3] and CN-DBpedia [4]. These knowledge bases play important roles in many applications, such as search engines [5], recommendation systems [6] and question answering [7].
However, knowledge bases are generally incomplete. The facts in current KBs (e.g., DBpedia [8], YAGO [1], Freebase [3] and CN-DBpedia [4]) are mainly obtained from the carefully edited structured text (e.g., infobox and category information) of Web pages on online encyclopedia websites (e.g., Wikipedia and Baidu Baike①). Since knowledge is rich but editors have limited editing capacity, much structured text is incomplete, so the facts extracted directly from it are likewise incomplete. According to Catriple [9], only 44.2% of the articles in Wikipedia have infobox information. In Baidu Baike, the largest Chinese online encyclopedia website, almost 32% of entities (nearly 3 million) lack infobox and category information altogether [10].
An incomplete knowledge base leads to poor performance in many downstream applications, since they cannot find the corresponding facts. For example, if a knowledge graph lacks the fact of Donald Trump's birthday, it cannot answer the question “when was Donald Trump born?”.
To address this challenge, we propose an extraction-and-verification framework to enrich knowledge bases. Based on the existing knowledge base, we first extract new facts from the description texts of entities. However, not all newly extracted facts can be added directly to the knowledge base, because extraction may introduce errors [11, 12, 13]. Table 1 reports the F1-scores of state-of-the-art text-based extractors on the TAC slot filling benchmark, including a pattern-based method (PATdist [11]), traditional machine learning methods (Mintz++ [14], SVMskip [11]), a graphical model (MIMLRE [15]) and a neural network based method (CNNcontext [11]); the facts produced by these extractors still contain a lot of noise. This motivates us to employ a novel crowdsourcing method to verify the extracted facts. Considering the human cost, we only verify the low-confidence facts. In the end, only two kinds of extracted facts are added to the knowledge base: facts with high confidence, and facts with low confidence that humans have verified as correct.
Table 1. F1-scores of state-of-the-art extractors on the TAC slot filling benchmark (dev/eval).

| Slot | Mintz++ dev | Mintz++ eval | MIMLRE dev | MIMLRE eval | PATdist dev | PATdist eval | SVMskip dev | SVMskip eval | CNNcontext dev | CNNcontext eval |
|---|---|---|---|---|---|---|---|---|---|---|
| per:age | .84 | .71 | .83 | .73 | .69 | .80 | .86 | .74 | .83 | .76 |
| per:alternate names | .29 | .03 | .29 | .03 | .50 | .50 | .35 | .02 | .32 | .04 |
| per:children | .76 | .43 | .77 | .48 | .10 | .07 | .81 | .68 | .82 | .61 |
| per:cause of death | .76 | .42 | .75 | .36 | .44 | .11 | .82 | .32 | .77 | .52 |
| per:date of birth | 1.0 | .60 | .99 | .60 | .67 | .57 | 1.0 | .67 | 1.0 | .77 |
| per:date of death | .67 | .45 | .67 | .45 | .30 | .32 | .79 | .54 | .72 | .48 |
| per:empl memb of | .38 | .36 | .41 | .37 | .24 | .22 | .42 | .36 | .41 | .37 |
| per:location of birth | .56 | .22 | .56 | .22 | .30 | .30 | .59 | .27 | .59 | .23 |
| per:loc of death | .65 | .41 | .66 | .43 | .13 | .00 | .64 | .34 | .63 | .28 |
| per:loc of residence | .14 | .11 | .15 | .18 | .10 | .03 | .31 | .33 | .20 | .23 |
| average | .53 | .41 | .54 | .42 | .35 | .36 | .62 | .48 | .60 | .46 |
The facts missing from a knowledge base mainly involve two kinds of relationships: between entities and other entities, and between entities and concepts. In this paper, we use the description texts of entities to enrich the knowledge base through two subtasks, entity typing and slot filling. Our contributions are as follows:
First, for the entity typing subtask, we propose multi-instance learning models that exploit textual information as well as heterogeneous information.
Second, for the slot filling subtask, we use a transfer learning strategy to extract the values of long-tailed predicates.
Third, we propose a novel implicit crowdsourcing approach to verify low-confidence new facts.
Finally, we apply this framework to the existing knowledge base CN-DBpedia and release a new version, CN-DBpedia2, which additionally contains the facts extracted from the description texts of entities. As of April 2019, CN-DBpedia2 contains 16,024,656 entities and 228,499,155 facts.
The rest of this paper is organized as follows. Section 2 introduces the system architecture of CN-DBpedia2. Section 3 and Section 4 detail the methods of entity typing and slot filling. Section 5 introduces how to verify those low-confidence new facts. Section 6 presents the statistics of our new system. Finally, Section 7 concludes the paper.
2. SYSTEM ARCHITECTURE
The system architecture of CN-DBpedia2, an extension of CN-DBpedia, is shown in Figure 1. CN-DBpedia2 uses Baidu Baike, Hudong Baike, Chinese Wikipedia and other domain encyclopedia websites as data sources, and its pipeline consists of five components:
The extraction component extracts the raw facts from the articles of the encyclopedia websites: it crawls the Web pages of all the entities in the data sources and extracts the raw facts from the structured text of the pages.
The normalization component normalizes the raw facts, including the attributes/predicates and the values of the facts.
The enrichment component extracts new facts that cannot be obtained directly from the structured text of Web pages.
The correction component corrects erroneous facts in the knowledge base, including error detection and crowdsourced error correction.
The update component keeps the knowledge base fresh through periodic updates and active updates.
CN-DBpedia2 differs from CN-DBpedia in the enrichment component. We propose an extraction-and-verification framework to enrich the knowledge base, which includes three new features: entity typing, slot filling and fact verification. As shown in Figure 2, we use both entity typing and slot filling to extract new facts from the description texts of entities, and low-confidence facts must be verified before they are added to the knowledge base.
3. ENTITY TYPING
The entity typing task is to find a set of types/concepts for each entity in the knowledge base. An entity has both structured and unstructured features. In CN-DBpedia, we used structured features to type entities [10]. In CN-DBpedia2, we first use unstructured features alone to type entities, and then use structured and unstructured features together.
3.1 Entity Typing from Unstructured Features
We propose a multi-instance method, METIC [13], to type entities with unstructured features alone. An entity may have multiple mentions in a corpus; we take each mention of an entity as an instance of that entity and learn the entity's types from its multiple instances. Specifically, we first use an end-to-end neural network model to type each instance of an entity (mention typing), and then use an integer linear programming (ILP) method to aggregate the predicted types from the multiple instances (type fusion). The framework of our solution is shown in Figure 3.
In the offline phase, we train models for the two subtasks, mention typing and type fusion, separately. We model mention typing as a multi-label classification problem; we use distant supervision to construct the training data automatically and build a supervised learning model (specifically, a neural network model). We model type fusion as a constrained optimization problem and propose an integer linear programming model to solve it, with constraints derived from the semantic relationships between types.
In the online phase, we use the models built in the offline phase to enrich the types of each entity in the KB. For each entity e, we first employ existing entity linking systems [16] to discover entity mentions in its corpus. Each mention and its corresponding context <mi, ci> are fed into the mention typing model, which derives a set of types, each associated with a probability P(t|mi). The types and their probabilities derived from each mention are then fed into the integer linear programming model, with constraints specifying exclusiveness or compatibility among types. The model finally selects a subset of the candidate types as the output.
In the mention typing step, we propose the neural network model shown in Figure 4. We first divide the sentence into three parts: the left context of the mention, the mention itself and the right context of the mention. Each part is fed into a parallel neural network with the same structure: a word embedding layer followed by a BiLSTM (bidirectional long short-term memory) layer [17]. We then concatenate the outputs of the three BiLSTM layers to generate the final output.
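To make this architecture concrete, here is a minimal PyTorch sketch of the mention typing network. The dimensions, the mean-pooling of each segment and the sigmoid multi-label output are illustrative assumptions; the paper does not publish its exact hyperparameters.

```python
import torch
import torch.nn as nn

class MentionTypingModel(nn.Module):
    def __init__(self, vocab_size, num_types, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # One BiLSTM per segment: left context, mention, right context.
        self.lstms = nn.ModuleList([
            nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            for _ in range(3)
        ])
        self.classifier = nn.Linear(3 * 2 * hidden, num_types)

    def forward(self, left, mention, right):
        pooled = []
        for segment, lstm in zip((left, mention, right), self.lstms):
            out, _ = lstm(self.embed(segment))   # (batch, seq_len, 2*hidden)
            pooled.append(out.mean(dim=1))       # pool each segment to a vector
        # Concatenate the three BiLSTM outputs and score every candidate type.
        logits = self.classifier(torch.cat(pooled, dim=-1))
        return torch.sigmoid(logits)             # multi-label: P(t | m_i) per type
```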
In the type fusion step, we propose an integer linear programming (ILP) model to aggregate all the types derived from the mentions of an entity and reduce the noise. ILP is an optimization model with constraints in which all variables are required to be non-negative integers [18]. For each entity e, we first define a binary decision variable $x_{e,t}$ for every candidate type t, indicating whether entity e belongs to type t. Our ILP model is as follows:

$$\max \sum_{t} \left( \max_{m \in M_e} P(t \mid m) - \theta \right) x_{e,t} \qquad \text{s.t.} \quad x_{e,t} \in \{0, 1\},$$

where $\max_{m \in M_e} P(t \mid m)$ represents the maximum probability that any mention of entity e belongs to type t, and θ is a threshold (set to 0.5 in our experiments).
We add two constraints to the ILP model: a type disjointness constraint and a type hierarchy constraint. The type disjointness constraint states that an entity cannot simultaneously belong to two semantically mutually exclusive types, such as Person and Organization; types whose entity sets have no overlap, or an insignificant overlap below a specified threshold, are considered disjoint [19]. The type hierarchy constraint states that if an entity does not belong to a type t, then it does not belong to any of t's subtypes; for example, an entity that does not belong to Artist should not be classified as Actor.
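As an illustration, the fusion ILP with both constraints can be solved with an off-the-shelf solver such as PuLP. In this sketch, the candidate scores, the disjoint type pairs and the subtype map are assumed inputs; the function and variable names are ours, not the paper's.

```python
import pulp

def fuse_types(scores, disjoint_pairs, parent_of, theta=0.5):
    """scores: {type: max_m P(t|m)}; parent_of: {subtype: supertype}."""
    prob = pulp.LpProblem("type_fusion", pulp.LpMaximize)
    x = {t: pulp.LpVariable(f"x_{t}", cat="Binary") for t in scores}
    # Objective: reward selected types whose best mention score exceeds theta.
    prob += pulp.lpSum((scores[t] - theta) * x[t] for t in scores)
    # Type disjointness: mutually exclusive types cannot both be selected.
    for t1, t2 in disjoint_pairs:
        if t1 in x and t2 in x:
            prob += x[t1] + x[t2] <= 1
    # Type hierarchy: a subtype can be selected only if its supertype is.
    for sub, sup in parent_of.items():
        if sub in x and sup in x:
            prob += x[sub] <= x[sup]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [t for t, var in x.items() if var.value() == 1]
```

For instance, given scores {'Person': 0.9, 'Organization': 0.6, 'Actor': 0.7}, the disjoint pair ('Person', 'Organization') and Actor as a subtype of Person, the solver selects Person and Actor and rejects Organization.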
3.2 Entity Typing from Heterogeneous Features
In order to exploit both the structured and unstructured features of an entity, we propose a new framework, METIH (Multi-instance Entity TypIng from Heterogeneous features), a modified version of METIC. As shown in Figure 5, most components are the same as in METIC, except those marked by the dotted line. Specifically, in the type fusion step, we treat the prediction result from structured features as an additional instance of the entity and use a new ILP model to aggregate the prediction results:

$$\max \sum_{t} \left( \max\!\left( \max_{m \in M_e} P(t \mid m),\ \delta(t \in C_e) \right) - \theta \right) x_{e,t} \qquad \text{s.t.} \quad x_{e,t} \in \{0, 1\},$$

where the function $\delta(t \in C_e)$ is defined as

$$\delta(t \in C_e) = \begin{cases} 1, & t \in C_e, \\ 0, & \text{otherwise,} \end{cases}$$

and $C_e$ is the set of types predicted from structured features. Here $\max(\max_{m \in M_e} P(t \mid m), \delta(t \in C_e))$ represents the maximum probability that entity e belongs to type t: $\max_{m \in M_e} P(t \mid m)$ is the maximum probability from unstructured features, while $\delta(t \in C_e)$ is the probability from structured features.
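For illustration, the merge of the two evidence sources can be written as a small helper whose output feeds the same fusion ILP as in METIC; the input formats below are assumptions.

```python
def metih_scores(mention_probs, structured_types):
    """mention_probs: {type: [P(t|m) for each mention m]};
    structured_types: C_e, the types predicted from structured features."""
    candidates = set(mention_probs) | set(structured_types)
    return {
        t: max(max(mention_probs.get(t, [0.0])),        # best mention evidence
               1.0 if t in structured_types else 0.0)   # delta(t in C_e)
        for t in candidates
    }
```

The resulting scores can then be passed to the same solver as in the `fuse_types` sketch above.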
4. SLOT FILLING
Some entities’ attribute values cannot be extracted from structured text because the information is missing, so we extract them from the text corpus. We cast this as a slot filling task: given an entity and an attribute, the goal is to extract the attribute value from the description text of the entity. State-of-the-art methods usually treat this as a sequence tagging problem and use supervised learning to build an extractor for each predicate.
However, the training samples are unbalanced across predicates: head predicates usually have a large number of training samples, while long-tailed ones have only a few. In CN-DBpedia, we extracted the values for head predicates; in CN-DBpedia2, we focus on extracting the values for long-tailed predicates.
A naive solution for slot filling is to train a separate extractor for each predicate, on that predicate's data alone. For long-tailed predicates with insufficient training data, such single-predicate extractors may not be fully trained, which impairs performance. Therefore, we propose a Multiple Predicate Extractor with Prior Knowledge (MPK) model, which takes the predicate itself as an input. We first pre-train the model on all the head predicates, and then fine-tune it for each long-tailed predicate by transfer learning. Since most of the model's parameters are shared across predicates, a long-tailed predicate can benefit from the abundant training data of head predicates through transfer learning.
Naturally, we expect to utilize the training samples of other predicates, which motivates the multiple-predicate extractor structure shown in Figure 6. The extractor is divided into five parts: 1) text embedding layer, 2) knowledge embedding layer, 3) text-knowledge attention layer, 4) encoder layer and 5) output layer.
4.1 Text Embedding Layer
We first pre-process the sentences. The text embedding layer captures the text features before extraction, including both word and phrase information. We then encode the concatenation of these embedded vectors with a BiLSTM layer.
4.2 Knowledge Embedding Layer
In this layer, we embed the prior knowledge as guidance for the subsequent information extraction, i.e., it decides what kind of information should be extracted. The prior knowledge includes the type of the entity (the subject) and the predicate specifying the information to extract.
Type Encoder Layer. One entity may belong to multiple types, which reflect different aspects of the entity. We use a self-attention layer to embed all the types of an entity, following the multi-head attention mechanism proposed in [20]. Specifically, each type (the query) computes a weighted sum over all types (the keys) in the input, with weights based on the dot-product similarity between the query and each key.
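A minimal sketch of this type encoder, assuming PyTorch's built-in multi-head attention and illustrative dimensions:

```python
import torch.nn as nn

class TypeEncoder(nn.Module):
    def __init__(self, num_types, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(num_types, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, type_ids):          # type_ids: (batch, m) type indices
        t = self.embed(type_ids)          # (batch, m, dim)
        # Self-attention: every type queries all types of the same entity,
        # weighted by dot-product similarity across several heads.
        encoded, _ = self.attn(t, t, t)
        return encoded                    # encoded type sequence T
```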
Predicate-Type Attention Layer. We combine the type and predicate information in this layer: together, the predicate and the type determine what is to be extracted. For example, given the entity type person and the predicate birthplace, we can determine that the task is to extract where the person was born, much like a query. However, an entity often has many noisy types (e.g., actor and producer in the figure). We therefore use predicate-to-type attention to select suitable types to form the query. Specifically, let T and P denote the encoded types and the predicate, respectively. The predicate-to-type similarity is computed as $S_0 \in \mathbb{R}^{1 \times m}$. We then normalize the single row of $S_0$ with the softmax function, deriving a matrix $\bar{S}_0$, whose entries form the predicate-to-type attention $A_0 \in \mathbb{R}^{1 \times m}$. The similarity function used here is the trilinear function [21]:

$$f(t, p) = W_0\,[\,t;\ p;\ t \odot p\,],$$

where $\odot$ is element-wise multiplication and $W_0$ is a trainable variable.
The output of this module is an encoded knowledge sequence U, in which each unit is $u = W_1[t;\ p;\ t \odot p;\ a_0 t]$, where t and p denote an encoded type and the predicate embedding, respectively, and $a_0$ is the corresponding attention value in $A_0$. Each unit u specifies one aspect of the extraction task, such as the birthPlace of a person.
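The following PyTorch sketch puts the trilinear similarity and the knowledge-unit construction together; the shapes and parameter layout are our assumptions based on the description above.

```python
import torch
import torch.nn as nn

class PredicateTypeAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w0 = nn.Linear(3 * dim, 1, bias=False)  # trilinear similarity W_0
        self.w1 = nn.Linear(4 * dim, dim)            # knowledge-unit fusion W_1

    def forward(self, T, p):
        # T: (batch, m, d) encoded types; p: (batch, d) predicate embedding.
        P = p.unsqueeze(1).expand_as(T)                             # (batch, m, d)
        # Trilinear similarity f(t, p) = W_0 [t; p; t*p], giving S_0.
        s0 = self.w0(torch.cat([T, P, T * P], dim=-1)).squeeze(-1)  # (batch, m)
        a0 = torch.softmax(s0, dim=-1)                              # attention A_0
        # Each knowledge unit u = W_1 [t; p; t*p; a0 * t].
        U = self.w1(torch.cat([T, P, T * P, a0.unsqueeze(-1) * T], dim=-1))
        return U                                                    # (batch, m, d)
```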
4.3 Text-Knowledge Attention Layer
This module uses the knowledge as a query to guide the extractor. Let H and U denote the encoded text and the encoded knowledge. We adopt an attention mechanism similar to BiDAF [21] to model the interaction between the text and the knowledge, comprising text-to-knowledge attention and knowledge-to-text attention.
The text-to-knowledge attention is constructed as follows. We first use the trilinear function defined above to compute the similarity between every pair of encoded text and knowledge units, yielding a similarity matrix $S \in \mathbb{R}^{m \times m}$. We then normalize each row of S with the softmax function, obtaining a matrix $\bar{S}$. The text-to-knowledge attention is then computed as $A = \bar{S} \cdot U$.
We additionally use a form of knowledge-to-text attention. Specifically, we take the maximum over each column of S and apply the softmax function to obtain the knowledge-to-text attention $B = \mathrm{softmax}(\max(S, 2))$, where $\max(S, 2)$ denotes the maximum reduction along the second dimension.
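A hedged sketch of both attention directions, reusing a shared trilinear similarity; the shapes (text length n, knowledge length m) and the broadcasting of the knowledge-to-text vector are our reading of the description.

```python
import torch

def text_knowledge_attention(H, U, w0):
    """H: (batch, n, d) encoded text; U: (batch, m, d) encoded knowledge;
    w0: an nn.Linear(3*d, 1, bias=False), the shared trilinear similarity."""
    batch, n, d = H.shape
    m = U.size(1)
    He = H.unsqueeze(2).expand(batch, n, m, d)
    Ue = U.unsqueeze(1).expand(batch, n, m, d)
    S = w0(torch.cat([He, Ue, He * Ue], dim=-1)).squeeze(-1)  # (batch, n, m)
    # Text-to-knowledge: every text position attends over knowledge units.
    A = torch.softmax(S, dim=-1) @ U                          # (batch, n, d)
    # Knowledge-to-text: max over knowledge units, then attend over text.
    b = torch.softmax(S.max(dim=2).values, dim=-1)            # (batch, n)
    B = (b.unsqueeze(-1) * H).sum(dim=1, keepdim=True).expand(batch, n, d)
    return A, B
```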
4.4 Model Encoder Layer
The input to this layer at each position is $[h;\ a;\ h \odot a;\ h \odot b]$, where a and b are the corresponding rows of the attention matrices A and B, respectively. This module contains two BiLSTM layers.
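A sketch of this layer under the same assumed shapes:

```python
import torch
import torch.nn as nn

class ModelEncoder(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.lstm1 = nn.LSTM(4 * d, hidden, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, H, A, B):
        # Per-position input [h; a; h*a; h*b], as described above.
        G = torch.cat([H, A, H * A, H * B], dim=-1)   # (batch, n, 4*d)
        out, _ = self.lstm1(G)
        out, _ = self.lstm2(out)
        return out                                    # (batch, n, 2*hidden)
```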
4.5 Output Layer
In the output layer, we adopt a Conditional Random Field (CRF) layer, which has proven effective for sequence tagging tasks [22].
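As a sketch, the output layer can be implemented with the pytorch-crf package; the choice of library and the BIO tagging scheme are our assumptions, since the paper only states that a CRF layer is used.

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (an assumed library choice)

class CRFOutputLayer(nn.Module):
    def __init__(self, hidden, num_tags):  # num_tags: e.g. a BIO tag set
        super().__init__()
        self.proj = nn.Linear(hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, features, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self.proj(features), tags, mask=mask)

    def decode(self, features, mask):
        # Viterbi decoding of the most likely tag sequence per sentence.
        return self.crf.decode(self.proj(features), mask=mask)
```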
4.6 Training
To tackle the insufficient training data for the long-tailed predicates, we want the model to benefit from the abundant training samples of the head predicates. To achieve this, we let the model first learn how to extract the values of the head predicates, and then learn the extraction of the long-tailed predicates through transfer learning.
Specifically, we build an extractor for each long-tailed predicate. The training process includes two steps: we first pre-train the MPK model on the training data of the Top-K head predicates, and then fine-tune it on the training data of the long-tailed predicate.
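Schematically, the two-step training can be expressed as follows, where `train_fn` stands for an ordinary supervised training loop and all names are placeholders:

```python
import copy

def train_long_tail_extractors(mpk_model, head_data, tail_data, train_fn):
    """head_data: pooled samples of the Top-K head predicates;
    tail_data: {predicate: samples}."""
    # Step 1: pre-train one shared MPK model on the head predicates.
    train_fn(mpk_model, head_data)
    extractors = {}
    # Step 2: fine-tune a copy for each long-tailed predicate; the shared
    # parameters carry over what was learned from the abundant head data.
    for predicate, samples in tail_data.items():
        model = copy.deepcopy(mpk_model)
        train_fn(model, samples)
        extractors[predicate] = model
    return extractors
```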
5. FACT VERIFICATION
Through the two steps of entity typing and slot filling, we obtain many facts from text. Some of them are wrong, however, and adding these errors to an existing knowledge base would reduce its quality. Hence we need humans to verify the low-confidence facts before adding them to the current knowledge base. The verification task is as follows: given a piece of text and a fact extracted from it, the goal is to let people judge whether the fact can be inferred from the text.
In general, it is very costly to have experts verify all the facts. To solve this problem, we propose a novel implicit crowdsourcing approach that asks users to verify the extracted facts. Our idea is inspired by CAPTCHA [23], a program that can generate and grade tests that most humans can pass but current computer programs cannot, and that can thus be used to differentiate humans from computers.
As shown in Figure 7, a CAPTCHA asks users to type the distorted text seen in an image; if the typed characters match those in the image, the user is considered verified.
Specifically, we propose a different type of CAPTCHA, rcCAPTCHA, which is a reading comprehension test. Recall that our goal is to let people verify whether a fact can be inferred from a piece of text. For each test, the reading passage is the piece of text, the question is generated from the fact, and the answer options are generated from both the text and the fact.
For different types of facts, we generate questions and answer options differently. Given a fact derived from entity typing, such as (e, isA, t), where e ∊ E is an entity and t ∊ T is a type, we generate a question such as “which type does <e> belong to?”; the answer options are t and all of its siblings in the type taxonomy, and the correct answer is t. Since one entity may belong to many types, e may also belong to some sibling types of t according to other extracted facts; we exclude such types from the answer options. Given a fact derived from slot filling, such as (e, a, v), where a ∊ A is an attribute and v is the attribute value, we generate a question such as “what is the <a> of <e>?”; the answer options are all the words in the text (including v), and the correct answer is v. For each extracted fact, we generate several reading comprehension tests. When multiple users click on the correct answer and pass the test, the extracted fact is considered correct.
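As an illustration, test generation for the two fact types might look like the following sketch; the helper inputs (sibling types from the taxonomy, the set of already-extracted types) are assumptions.

```python
import random

def typing_test(entity, t, sibling_types, known_types):
    """sibling_types: siblings of t in the type taxonomy;
    known_types: all types already extracted for the entity."""
    # Distractors are siblings the entity is NOT known to belong to.
    options = [s for s in sibling_types if s not in known_types] + [t]
    random.shuffle(options)
    return {"question": f"Which type does {entity} belong to?",
            "options": options, "answer": t}

def slot_filling_test(entity, attribute, value, text):
    # Every word of the passage is a clickable option; the value is correct.
    return {"question": f"What is the {attribute} of {entity}?",
            "options": text.split(), "answer": value}
```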
To verify the accuracy of the extracted facts, we released a free rcCAPTCHA API③ and deployed it on multiple websites. The CN-DBpedia search engine④ is one of them: our rcCAPTCHA system is triggered when the number of user searches exceeds a certain threshold, and the search can only continue once the user correctly answers the question from the rcCAPTCHA system. Figure 8 shows an instance of our system, used to verify whether the fact (Galare Thong Tower, star rating, 3-star) is correct. Based on this fact, the system generates a question (What is the star rating of Galare Thong Tower?) and a text containing the answer (Galare Thong Tower is a 3-star hotel located in Chiang Mai… It is a 9-minute drive from Wat Chiang Man …). If the majority of users click on “3-star” in the text, the triple is considered correct.
Our rcCAPTCHA system is triggered thousands of times per day on average, and each fact is verified by an average of 20 users. We randomly sampled 100 extracted facts that had been verified as correct; their accuracy was 100%. Although the current verification speed is slow, it will accelerate as more people use our system.
6. STATISTICS FOR CN-DBPEDIA2
We present the statistics of CN-DBpedia2 in this section. CN-DBpedia2 contains far more entities and facts than CN-DBpedia: as of April 2019, CN-DBpedia2 contains 16,024,656 entities and 228,499,155 facts, while CN-DBpedia contains only 10,341,196 entities and 88,454,264 facts. Table 2 shows the fact types in CN-DBpedia2, as well as the changes compared to CN-DBpedia. The increase mainly comes from three aspects: new data sources were added, textual information was used, and updating was implemented.
Table 2. Fact types in CN-DBpedia2 and changes compared to CN-DBpedia.

| Rank | Rank Change | Fact Type | Fact Quantity | Quantity Change |
|---|---|---|---|---|
| 1 | 0 | Entity Infobox | 149,350,983 | +108,210,921 |
| 2 | +1 | Entity Types | 45,065,595 | +25,219,295 |
| 3 | -1 | Entity Tags | 25,923,300 | +6,057,489 |
| 4 | 0 | Entity Information | 8,016,829 | +4,012,928 |
| 5 | 0 | Entity SameAs | 142,448 | 0 |
By using the new entity typing method, we found more types for the entities in CN-DBpedia2. Table 3 shows the top 20 concepts and the number of entities they contain in CN-DBpedia2, as well as the changes compared to CN-DBpedia. Because textual information is now taken into account, entities have gained more concepts and the sizes of the concepts have increased significantly (the exceptions are Company and Single, whose growth is mainly due to extraction from the domain encyclopedia websites).
Table 3. Top 20 concepts in CN-DBpedia2 and changes compared to CN-DBpedia.

| Rank | Rank Change | Type | Entity Quantity | Quantity Change |
|---|---|---|---|---|
| 1 | +1 | Agent | 8,771,817 | +6,766,894 |
| 2 | +7 | Organisation | 7,019,610 | +6,228,636 |
| 3 | +9 | Company | 6,545,250 | +6,128,240 |
| 4 | -3 | Work | 4,786,508 | +2,257,454 |
| 5 | 0 | WrittenWork | 2,918,856 | +1,820,837 |
| 6 | 0 | Book | 2,742,963 | +1,686,857 |
| 7 | +9 | Novel | 1,504,497 | +1,358,730 |
| 8 | -5 | Person | 1,488,264 | +270,276 |
| 9 | -5 | Place | 1,401,132 | +203,869 |
| 10 | +7 | MusicalWork | 950,072 | +699,196 |
| 11 | +7 | Single | 897,631 | +691,346 |
| 12 | -4 | PopulatedPlace | 546,813 | -69,209 |
| 13 | -3 | Settlement | 404,672 | -57,410 |
| 14 | -5 | ArchitecturalStructure | 250,638 | -241,942 |
| 15 | -2 | Species | 247,723 | +36,187 |
| 16 | -2 | Eukaryote | 237,418 | +29,647 |
| 17 | +2 | Software | 230,900 | +80,114 |
| 18 | +2 | Device | 199,341 | +75,976 |
| 19 | +2 | Athlete | 167,293 | +45,960 |
| 20 | -5 | Food | 163,823 | -14,866 |
Based on CN-DBpedia2, we also publish new APIs⑤ for natural language understanding, including entity linking and question answering. By April 2019, these APIs had been called 950 million times since their release in December 2015.
7. CONCLUSION
In this paper, we released a new version of the knowledge base CN-DBpedia, named CN-DBpedia2. Based on the existing knowledge base, we exploit both structured and unstructured features to type entities and propose a transfer learning strategy to extract the values of long-tailed predicates. We then propose a novel implicit crowdsourcing approach to verify low-confidence new facts, and the facts verified as correct are added to the knowledge base. As of April 2019, CN-DBpedia2 contains 16,024,656 entities and 228,499,155 facts, and its APIs have been called 950 million times.
AUTHOR CONTRIBUTIONS
This work was a collaboration among all of the authors. Y. Xiao ([email protected], corresponding author) is the leader of the CN-DBpedia project. B. Xu ([email protected]) led the work and summarized the entity typing part. C. Xie ([email protected]) and L. Chen ([email protected]) summarized the slot filling part. J. Liang ([email protected]) summarized the fact verification part. B. Liang ([email protected]) summarized the statistics part. All the authors made meaningful and valuable contributions in revising and proofreading the resulting manuscript.
ACKNOWLEDGEMENTS
This work was supported by the National Key R&D Program of China under Grant No. 2017YFC1201203, by the Shanghai Sailing Program under Grant No. 19YF1402300, and by the Initial Research Funds for Young Teachers of Donghua University under Grant No. 112-07-0053019.
Notes
REFERENCES
Author notes
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai 200433, China