Abstract
Knowledge graphs (KGs) have played an important role in enhancing the performance of many intelligent systems. In this paper, we introduce the solution used at Sogou Inc. for building a large-scale multi-source knowledge graph from scratch, including its architecture, technical implementation and applications. Unlike previous works that build knowledge graphs with graph databases, we build the knowledge graph on top of SogouQdb, a distributed search engine developed by Sogou Web Search Department, which can be easily scaled to support petabytes of data. As a supplement to the search engine, we also introduce a series of models to support inference and graph-based querying. Currently, the Sogou knowledge graph, whose data are collected from 136 different websites and constantly updated, consists of 54 million entities and over 600 million entity links. We also introduce three applications of the knowledge graph in Sogou Inc.: entity detection and linking, knowledge-based question answering and knowledge-based dialog system. These applications have been used in Web search products to help users acquire information more efficiently.
1. Introduction
A knowledge graph (KG) is a special kind of database which integrates information into an ontology. As an effective way to store and search knowledge, knowledge graphs have been applied in many intelligent systems and have drawn a lot of research interest. While many knowledge graphs have been constructed and published, such as Freebase [1], Wikidata [2], DBpedia [3] and YAGO [4], none of these works could completely fulfill the application requirements of Sogou Inc. The main challenges are listed below:
Lack of data: Though the biggest published knowledge graph (Wikidata) is reported to contain millions of entities and billions of triples, most of the data in these knowledge graphs are extracted from Wikipedia and still fall far short of the requirements of Web search applications such as general-purpose question answering and recommendation. For example, none of the existing knowledge graphs contains information on the latest Chinese songs, which can only be obtained from specific websites.
Uncertainty of scalability: None of the existing works explicitly reports its system's capability to deal with large-scale data or discusses how the knowledge graph could be expanded on a server cluster. This problem might not be very important for academic research, since even the biggest knowledge graph's data can still be held by a single server with a large hard disk drive. In the case of a search engine, the potential data requirement of the knowledge graph is much larger and using distributed storage is unavoidable.
To solve these challenges, we propose a novel solution for building a large-scale knowledge graph. We use SogouQdb, a distributed search engine developed by Sogou Web Search Department for internal use, as the core storage engine to obtain scalability, and develop a series of models that provide inference and graph-based querying functions, which make the system compatible with other knowledge graph applications. The inference is conducted on HDFS with Spark, which makes the inference procedure capable of dealing with big data. The Sogou knowledge graph is built with this solution and has been released to support online products. Currently, the Sogou knowledge graph consists of 54 million entities and over 600 million entity links. The data are extracted from 136 different websites and constantly updated.
We also introduce three applications of knowledge graph in Sogou Inc.: entity detection and linking, knowledge-based question answering and knowledge-based dialog system. These applications have been used as an infrastructural service in Web search products to help users find the information they want more efficiently.
The rest of this paper is organized as follows: In Section 2, we introduce related work on widely known published knowledge graphs. In Section 3, we elaborate on our solution for constructing a knowledge graph from scratch. Section 4 presents the applications of the knowledge graph in Sogou Inc. Finally, we draw a conclusion in Section 5.
2. Related Work
While many works on building domain-specific knowledge graphs have been published, we focus on works that build large-scale multi-domain knowledge graphs and list the most widely known ones in this section.
Freebase [1] was published as an open shared database in 2007 and was shut down in 2016 after all of its data were transferred to Wikidata. The data of Freebase were collected from Wikipedia①, NNDB②, Fashion Model Directory③ and MusicBrainz④, and were also contributed by its users⑤. Freebase has more than 1.9 billion triples⑥, 4,000 types and 7,000 properties [1].
Wikidata [2] was first published by Wikimedia in 2012 and has been publicly maintained until now. The data of Wikidata⑦, which contain more than 55 million entities, mainly come from its Wikimedia sister projects, including Wikipedia, Wikivoyage and Wikisource, and from other websites.
DBpedia [3] is a large-scale multilingual knowledge graph whose data were extracted from Wikipedia and are collaboratively edited by the community. The English version of DBpedia⑧ contains more than 4.58 million entities, and the versions of DBpedia in 125 languages together contain 38.3 million entities [5, 6, 7].
YAGO [4] is an open-source semantic knowledge graph derived from Wikipedia, WordNet and GeoNames. YAGO has more than 10 million entities and 120 million facts about these entities⑨.
ConceptNet [5], which originated from the Open Mind Common Sense project launched in 1999, has grown to be an open multilingual knowledge graph. ConceptNet contains more than 8 million entities and 21 million entity links.
3. Construction
An overview of the construction framework of the Sogou knowledge graph is shown in Figure 1. The data of the Sogou knowledge graph are collected from various websites which allow their data to be downloaded or crawled, e.g., Wikipedia and SogouBaike. The extracted data are stored in a distributed database in the form of JSON-LD (JavaScript Object Notation for Linked Data), which is a commonly used concrete RDF syntax. As an additional way to supply data, we introduce an inference model which infers new relationships between entities. To search and browse the knowledge graph, we develop a SPARQL query engine that provides RESTful API services. To support a search engine's products such as question answering and recommendation, the knowledge graph data are processed to fit the data form of specific tasks. In this section, we introduce each part of the construction framework.
3.1 Data Extraction
The role of data extraction is to extract data from various inputs into a pre-defined form. Specifically, the input and output of data extraction are defined as follows:
Input: Data downloaded or crawled from the Internet, e.g., Web pages, XML data or JSON data downloaded through APIs. While the input data consist mostly of free text, many of them contain structured information such as images, geo-coordinates, links to external Web pages and disambiguation pages.
Output: Structured data in the form of JSON-LD that record the knowledge information extracted from the input data.
Data extraction operations can be classified into two categories:
Structured data extraction only deals with input data that carry structured information, specifically, data that contain recognizable markup.
Free text extraction detects entities and extracts the property information of specific entities from free text.
3.1.1 Structured Data Extraction
As the structured information has recognizable markup, we use a rule-based method to build the extractors. The extractors first parse the Web page into a unified DOM tree, then find the target information according to manually written rules and save the extracted data in JSON-LD form. For each website, we build specialized extractors for its data, so that the data of different websites can be updated independently. As of March 2019, the Sogou knowledge graph system has 45 websites as data sources and 77 rule-based extractors.
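As an illustration of the rule-based approach described above, the following minimal sketch parses a page and emits a JSON-LD-like record. The single rule used here (an h1 title plus th/td pairs in a table) and the sample page are assumptions made for the example; the actual extraction rules are written per website and are not described in this paper.

```python
import json
from html.parser import HTMLParser


class InfoboxExtractor(HTMLParser):
    """Collects the <h1> title and <th>/<td> pairs (an assumed page layout)."""

    def __init__(self):
        super().__init__()
        self._tag = None   # tag whose text is currently being read
        self._key = None   # text of the most recent <th>
        self.title = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "th", "td"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        if self._tag == "h1":
            self.title = text
        elif self._tag == "th":
            self._key = text
        elif self._tag == "td" and self._key is not None:
            self.fields[self._key] = text
            self._key = None


def extract(html, entity_type):
    """Apply the rule to one page and return a JSON-LD-like dict."""
    parser = InfoboxExtractor()
    parser.feed(html)
    record = {"@type": [entity_type], "name": parser.title}
    record.update(parser.fields)
    return record


# Hypothetical input page, for illustration only.
PAGE = """
<html><body>
  <h1>Dehua Liu</h1>
  <table class="infobox">
    <tr><th>Born</th><td>1961-09-27</td></tr>
    <tr><th>Occupation</th><td>Singer, Actor</td></tr>
  </table>
</body></html>
"""

record = extract(PAGE, "Person")
print(json.dumps(record, indent=2))
```

The property names in the output are still site-specific at this stage; mapping them to ontology terms is the job of the normalization step.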
3.1.2 Free Text Extraction
The task of free text extraction consists of a series of sub-tasks, including extracting named entity mentions from plain text, linking the mentions to entities in the knowledge graph, and extracting entities’ properties or the relationships between extracted entities. Since training a model that could deal with all entity types is quite time consuming, we currently focus on a limited set of entity types: Person (PER), Geo-political Entity (GPE), Organization (ORG), Facility (FAC) and Location (LOC). For the named entity recognition and linking tasks, we train a Bi-LSTM-CRF model whose feature and parameter selection follows the work of [7], which achieved the best performance in the TAC KBP 2017 competition [8]. The training data are constructed from SogouBaike and Wikipedia Web pages that contain anchor markup. More details of the model and the training data can be found in Section 4.1.
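One way to picture how anchor markup can be turned into sequence-labeling training data is the sketch below, which converts anchors into BIO tags. The simplified anchor syntax [[id|surface]], the whitespace tokenization, the entity ids and the type lookup table are all assumptions made for the example, not the format used in the actual pipeline.

```python
import re

# Simplified, assumed anchor syntax: [[entity_id|surface text]]
ANCHOR = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")

# Hypothetical lookup from a linked entity id to one of the five types.
ENTITY_TYPES = {"e1": "PER", "e2": "GPE"}


def anchors_to_bio(marked_text):
    """Return (tokens, tags) with BIO tags derived from the anchors."""
    tokens, tags = [], []
    pos = 0
    for match in ANCHOR.finditer(marked_text):
        # Text before the anchor lies outside any entity mention.
        for tok in marked_text[pos:match.start()].split():
            tokens.append(tok)
            tags.append("O")
        etype = ENTITY_TYPES.get(match.group(1))
        for i, tok in enumerate(match.group(2).split()):
            tokens.append(tok)
            if etype is None:
                tags.append("O")  # anchor to an entity of an unsupported type
            else:
                tags.append(("B-" if i == 0 else "I-") + etype)
        pos = match.end()
    # Remaining text after the last anchor.
    for tok in marked_text[pos:].split():
        tokens.append(tok)
        tags.append("O")
    return tokens, tags


tokens, tags = anchors_to_bio("[[e1|Li Na]] was born in [[e2|Wuhan]] .")
for tok, tag in zip(tokens, tags):
    print(tok, tag)
```

For Chinese text the tokens would more likely be characters or segmented words than whitespace-separated strings.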
3.2 Normalization
This part normalizes the property values of extracted entities and maps entities’ classes and properties to terms in the Sogou knowledge graph's ontology. In addition, the data type of each property is specified, which helps ensure the quality of the processed data. The input and output of this part are defined as follows:
Input: Output of data extraction: Structured data in the form of JSON-LD.
Output: Structured data in the form of JSON-LD with normalized property name and property value. The type of property value follows the definition of Sogou knowledge graph schema. A simplified example is given below:
{
  "@context": {
    "@vocab": "http://schema.sogou.com",
    "kg": "http://kg.sogou.com"
  },
  "@id": "4962641",
  "@type": ["Person"],
  "name": "Dehua Liu",
  "birthDate": "1961-09-27",
  "hasOccupation": ["Singer", "Actor"],
  "sogouBaikeUrl": "https://baike.sogou.com/v4962641.htm"
}
The schema http://schema.sogou.com used in the Sogou knowledge graph is compatible with http://schema.org. Currently, we maintain only one knowledge graph, whose KG value is http://kg.sogou.com, while the framework could support any number of KGs by setting different KG values.
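The normalization step can be sketched as follows, assuming a hand-written mapping from site-specific property names to schema terms and a small set of accepted date formats. The mapping table, the date formats and the raw input record are hypothetical; only the output property names and the @context values are taken from the example above.

```python
from datetime import datetime

# Hypothetical mapping from site-specific property names to schema terms.
PROPERTY_MAP = {"name": "name", "Born": "birthDate", "Occupation": "hasOccupation"}

# Assumed input date formats; the output is always ISO 8601 (YYYY-MM-DD).
DATE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%Y/%m/%d")


def normalize_date(value):
    """Parse a date in any accepted format and return it as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def normalize(raw):
    """Map property names to schema terms and normalize their values."""
    out = {
        "@context": {"@vocab": "http://schema.sogou.com", "kg": "http://kg.sogou.com"},
        "@type": raw["@type"],
    }
    for key, value in raw.items():
        term = PROPERTY_MAP.get(key)
        if term is None:
            continue  # properties without a mapping are dropped in this sketch
        if term == "birthDate":
            value = normalize_date(value)
        elif term == "hasOccupation":
            value = [v.strip() for v in value.split(",")]
        out[term] = value
    return out


normalized = normalize({
    "@type": ["Person"],
    "name": "Dehua Liu",
    "Born": "27 September 1961",
    "Occupation": "Singer, Actor",
})
print(normalized)
```

Typing each property (a date string, a list of strings) at this point is what lets later stages compare values from different websites.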
3.3 Merging
The merging section is the entrance to the KG storage, a distributed database that stores the whole knowledge graph. Any operation that changes the KG database, including adding, updating or deleting data, has to be transformed into unit operations that follow a pre-defined interface (“add”, “update” and “delete”) in the merging section. All unit operations are logged, and the logs can be used to roll back to any historical version.
When adding an entity, the merging section checks whether the entity already exists in the KG database. If it does, the property values of the existing entity are updated with the values of the same properties of the added entity. Otherwise, the entity is added to the database as a new entity. To distinguish entities with the same name, we develop a heuristic model that also compares the entities’ property values. For updating and deleting data, the @id property is required and the operation is executed on the entities with the given ids.
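The add, update and delete unit operations with logging and rollback can be sketched with an in-memory store. The class name, the matching heuristic (same name and no conflicting birthDate) and the sample records are assumptions for illustration; the paper does not specify the heuristic model or the log format.

```python
import copy


class MergeStore:
    """In-memory stand-in for the KG database with logged unit operations."""

    def __init__(self):
        self.entities = {}  # @id -> entity dict
        self.log = []       # (operation, @id, previous state or None)

    def _find_existing(self, entity):
        """Assumed heuristic: same name and no conflicting birthDate."""
        for eid, old in self.entities.items():
            if old.get("name") != entity.get("name"):
                continue
            a, b = old.get("birthDate"), entity.get("birthDate")
            if a is None or b is None or a == b:
                return eid
        return None

    def add(self, entity):
        """Update the matching entity if one exists, else insert a new one."""
        eid = self._find_existing(entity)
        if eid is not None:
            props = {k: v for k, v in entity.items() if k != "@id"}
            return self.update(eid, props)
        eid = entity["@id"]  # new entities must carry an @id in this sketch
        self.log.append(("add", eid, None))
        self.entities[eid] = dict(entity)
        return eid

    def update(self, eid, props):
        self.log.append(("update", eid, copy.deepcopy(self.entities[eid])))
        self.entities[eid].update(props)
        return eid

    def delete(self, eid):
        self.log.append(("delete", eid, self.entities.pop(eid)))

    def rollback(self, steps=1):
        """Undo the most recent unit operations using the log."""
        for _ in range(steps):
            _, eid, previous = self.log.pop()
            if previous is None:
                del self.entities[eid]
            else:
                self.entities[eid] = previous


store = MergeStore()
store.add({"@id": "4962641", "name": "Dehua Liu", "birthDate": "1961-09-27"})
# Same name, no conflicting value: merged into the existing entity.
store.add({"@id": "tmp-1", "name": "Dehua Liu", "hasOccupation": ["Singer", "Actor"]})
# Same name but a conflicting (made-up) birth date: kept as a separate entity.
store.add({"@id": "tmp-2", "name": "Dehua Liu", "birthDate": "1990-01-01"})
print(sorted(store.entities))
```

Because every operation records the previous state, calling store.rollback() repeatedly walks the store back through its history one unit operation at a time.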
3.4 Inference
As an additional way to supply data, the inference section infers new relationships between entities based on the existing relations. For example, when we know that A is B's son, we can infer the new relation that B is A's father. In the construction framework, the inference is conducted on the whole data set dumped from the KG database, and the inference results are added back to the KG through the merging part. Currently, all of our inference models are rule-based. While neural network based inference methods (such as TransE and TransR) could infer more potential relations, the accuracy of their results is not yet high enough for use in products.
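A rule of the kind in the example above can be written as an inverse-relation lookup over the dumped triples. The sketch below is plain Python rather than a Spark job, and the predicate names and the rule table are hypothetical; a production rule for this example would also need to check the parent's gender.

```python
# Inverse-relation rules. The single entry mirrors the example in the text;
# the predicate names are hypothetical.
INVERSE_RULES = {"isSonOf": "isFatherOf"}


def infer_inverse(triples):
    """Return the triples implied by INVERSE_RULES that are not yet known."""
    known = set(triples)
    inferred = set()
    for subj, rel, obj in triples:
        inverse = INVERSE_RULES.get(rel)
        if inverse is None:
            continue
        candidate = (obj, inverse, subj)
        if candidate not in known:
            inferred.add(candidate)
    return inferred


triples = [
    ("A", "isSonOf", "B"),
    ("C", "isSonOf", "D"),
    ("D", "isFatherOf", "C"),  # already present, so it is not inferred again
]
new_triples = infer_inverse(triples)
print(new_triples)
```

The newly inferred triples would then be fed back through the merging interface as ordinary add operations.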
3.5 Knowledge Graph Storage
The Sogou knowledge graph storage is developed on top of SogouQdb, the distributed search engine developed by Sogou Web Search Department. Figure 2 gives an overview of the architecture of the KG storage. SogouQdb is used as a distributed database to store data and provide search services. The KG storage service wraps SogouQdb to provide storing and querying APIs that are better suited to knowledge graph applications. In practice, we find that querying requests far outnumber storing requests and consume more computation resources. To reduce cost and improve querying speed, a cache layer is added between the querying API and the KG storage service.
Compared with graph databases such as Neo4j and OrientDB, which are commonly used for knowledge graph storage, SogouQdb has advantages in querying speed, scalability and engineering optimizations. One disadvantage of SogouQdb is that it does not natively support knowledge graph query languages such as SPARQL. To solve this problem, we introduce the KG storage service, which translates SPARQL queries into calls to SogouQdb's APIs. Another disadvantage is that SogouQdb is relatively inefficient at data inference. To solve this problem, we separate the inference part from the KG storage and conduct the inference on HDFS using Spark. The data used for inference are dumped from SogouQdb using Qdb-Hadoop tools.
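The idea of translating a graph query into a search-engine style filter, with a cache in front, can be sketched as follows. SogouQdb's API is not described in this paper, so the filter format, the in-memory document list and all function names are stand-ins; the translator handles only one simple query shape and is not a SPARQL parser.

```python
import re
from functools import lru_cache

# In-memory stand-in for the documents stored in the search engine.
DOCS = [
    {"@id": "4962641", "name": "Dehua Liu", "birthDate": "1961-09-27"},
]

# Matches triple patterns of the form: ?s :property "literal"  or  ?s :property ?var
TRIPLE = re.compile(r'\?\w+\s+:(\w+)\s+("([^"]*)"|\?(\w+))')


def translate(sparql):
    """Turn a simple SELECT query into a filter plus the field to return."""
    select_var = re.search(r"SELECT\s+\?(\w+)", sparql, re.IGNORECASE).group(1)
    body = sparql[sparql.index("{") + 1 : sparql.rindex("}")]
    filters, field = {}, None
    for prop, _, literal, var in TRIPLE.findall(body):
        if var == select_var:
            field = prop              # the property whose value is requested
        elif not var:
            filters[prop] = literal   # a literal object becomes an exact-match filter
    return {"filter": filters, "field": field}


@lru_cache(maxsize=1024)
def query(sparql):
    """Cache layer: repeated queries are answered without touching DOCS."""
    plan = translate(sparql)
    return tuple(
        doc[plan["field"]]
        for doc in DOCS
        if plan["field"] in doc
        and all(doc.get(k) == v for k, v in plan["filter"].items())
    )


Q = 'SELECT ?x WHERE { ?s :name "Dehua Liu" . ?s :birthDate ?x }'
print(translate(Q))
print(query(Q))
print(query(Q))            # served from the cache
print(query.cache_info())
```

A real cache layer would also need an invalidation policy tied to the merging section, since cached results become stale when entities are updated.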
4. Application
4.1 Entity Linking
The entity linking task identifies the character string that represents an entity in natural language text and maps it to a specific entity in the knowledge base. For example, Wiki editors manually add hyperlinks from phrases representing entities in the text to the corresponding Wikipedia pages. A phrase with such a Wiki internal hyperlink is called Anchor Text. Traditional entity linking methods are based on feature engineering. These methods calculate the link matching degree from features of the candidate entity and its context. Features usually include prior information of entities, contextual semantic features, and features associated with entities. Commonly used models include Ranking SVM [9], CRF [10] and S-MART [11]. With the development of neural networks, feature learning is gradually replacing the original methods based on feature engineering. These methods calculate a representation of the context of the entity phrase and a representation of the candidate entity through a specific neural network. The matching score is defined as the similarity between the two vectors. Entity linking models based on deep learning include [12, 13, 14, 15]. In addition, knowledge graph embedding is also applied to entity linking tasks. The vector representation of each entity is learned from a large number of knowledge base triplets as training data, so that similar entities have similar vector representations. Methods of vector learning based on knowledge bases include [16, 17, 12, 18, 19].
The focus of entity linking is to find the correct entity among multiple candidates and eliminate ambiguity. For example, “Li Na” has multiple possible candidate entities, which may represent a tennis star, a pop singer, a football baby of Sogou, or even a movie with the same name. In the absence of context information, it is difficult to link entities accurately. A well-designed entity linking service needs to consider many factors, including the prior knowledge of the entity itself, the matching degree between the entity and the phrase, and the degree of fit between the entity and the context in which the phrase is located.
In Sogou, the entity linking problem is treated as a ranking problem. We take into consideration the entity prior, the similarity between the entity description and the context, and the coherence among the entities in the same paragraph. Based on the Sogou knowledge graph, we have developed a set of entity linking APIs, which provide a short text linking service, a long text linking service and a table linking service. These services link the entities contained in the text to the Sogou knowledge graph.
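The three ranking signals named above (entity prior, description-context similarity and coherence with other entities) can be combined as in the following sketch. The linear weighting, the Jaccard similarity, the weights and the candidate data are all assumptions chosen for illustration; the paper does not state how the signals are combined or learned.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    entity_id: str
    prior: float                                   # popularity prior in [0, 1]
    description: set = field(default_factory=set)  # words describing the entity
    related: set = field(default_factory=set)      # ids of neighbouring entities


def jaccard(a, b):
    """Word-overlap similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(cand, context_words, other_entities, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of prior, context similarity and coherence (assumed form)."""
    w_prior, w_ctx, w_coh = weights
    similarity = jaccard(cand.description, context_words)
    coherence = (
        len(cand.related & other_entities) / len(other_entities)
        if other_entities
        else 0.0
    )
    return w_prior * cand.prior + w_ctx * similarity + w_coh * coherence


def link(candidates, context_words, other_entities):
    """Return the highest-scoring candidate for a mention."""
    return max(candidates, key=lambda c: score(c, context_words, other_entities))


# Made-up candidates for the mention "Li Na".
candidates = [
    Candidate("li_na_tennis", 0.6, {"tennis", "player", "grand", "slam"}, {"french_open"}),
    Candidate("li_na_singer", 0.3, {"pop", "singer", "album"}, {"music_label"}),
]

with_context = link(candidates, {"album", "singer", "released"}, set())
without_context = link(candidates, set(), set())
print(with_context.entity_id, without_context.entity_id)
```

With no context the prior decides, which reflects the point above that ambiguous mentions are hard to link accurately without contextual information.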
The short text entity linking service is mainly used for entity linking of query text in our search engine. After entity linking, the structural information in the knowledge graph related to the entities is shown to the user, along with illustrations and pictures (Figure 3). At the same time, based on the type of entities and the relationship between entities in the knowledge graph, recommendations of relevant entities are given (Figure 4). These richer results make it quicker for users to get what they want and what they are interested in. Also, entity linking is the basis for automatic question answering, especially for the task of knowledge-based question answering. The existing entities in the question need to be accurately linked, to limit the scope of semantic search.
The long text entity linking service is mainly used for Anchor Text generation in Web pages, as shown in Figure 5. To help readers quickly access the introductions of entities in Sogou Encyclopedia's pages, the mentions of these entities carry hyperlinks to the entities' own pages, i.e., Anchor Text. Our automated entity linking service greatly improves the efficiency of manual editing. In addition, the long text entity linking service is also applied to Sogou's personalized news feed recommendation. In combination with the entity linking process, similar or related entities are expanded in the knowledge graph. Thus we can provide personalized recommendations along with highly interpretable reasons.
Table entity linking is also used to generate entity Anchor Text in tables online, such as entities in tables of Sogou Encyclopedia. Meanwhile, tables provide rich entity type information, entity relationship information, etc. After entity linking, these tables can also supply a large amount of high confidence triplet information to our knowledge graph.
4.2 Knowledge-based Question Answering
A knowledge graph usually comes with a descriptive query language, such as MQL provided by Freebase, SPARQL formulated by W3C, and CycL provided by Cyc. However, for ordinary users, such structured query syntax has a high barrier to use. A knowledge-based question answering system uses natural language as its interface to provide a friendlier way of querying knowledge. On the one hand, natural language has very strong expressive power. On the other hand, this method does not require users to receive any professional training. Due to its broad application prospects, knowledge-based question answering (KBQA) has become a research hot-spot in both academia and industry.
For question understanding, we focus on the automatic question answering task based on a knowledge graph. The task is to find one or more answer entities in the knowledge graph for questions describing objective facts. For a question that contains only simple semantics, the process of automatic question answering is equivalent to converting the question into a fact triplet on the knowledge base. However, the questions raised by human beings are not always presented in simple forms, and more restrictions are often added to them. For example, a question may mention multiple entities and types related to the answer. In complex semantic scenarios, KBQA faces the following challenges: 1) how to find multiple relationships in a question and combine them into a candidate semantic structure; 2) how to calculate the matching degree between natural language questions and complex semantic structures.
Commonly used methods are based on semantic parsing or ranking. The method based on semantic parsing converts the question into a formal query statement of a certain standard knowledge base, i.e., it finds the optimal (question, semantic query) pair instead of a simple answer entity. Related work includes the generation of semantic parsing trees using Combinatory Categorial Grammar (CCG) [20, 21, 22] and λ-DCS [23, 24, 25]. Typical application projects include ATIS [26] for air travel information question answering, CLANG [27] for the robot soccer game, GeoQuery [28] for US geographic knowledge question answering, and the open source question answering system SEMPRE [29]. The ranking method does not need a formal representation of questions, but directly ranks candidate entities or answers in the knowledge base. This kind of method follows the representation-comparison framework, in which traditional feature engineering based methods include [30] and deep learning based methods include [31, 32, 33].
We have implemented a KBQA system and integrated it into Sogou Search Engine and Sogou's dialog service. Sogou's KBQA relies mainly on a combination of manual templates and models. With templates, the user's query is directly converted into a structured KB query (Figure 6). In the model approach, the entities in the query are first linked to the knowledge graph, and then a subgraph is constructed with each linked entity as the center. The nodes and edges in the subgraph are used as candidate paths and answers, and the final answer is obtained by ranking these candidates.
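The template approach can be illustrated with a small sketch in which each template pairs a question pattern with a knowledge base property. The templates, the property names and the one-entry knowledge base are hypothetical; the height value is the one quoted for this question in Section 4.3.

```python
import re

# One-entry stand-in for the knowledge base.
KB = {("Andy Lau", "height"): "174 cm"}

# Hypothetical manual templates: question pattern -> KB property.
TEMPLATES = [
    (re.compile(r"^How tall is (?P<entity>.+?)\?$", re.IGNORECASE), "height"),
    (re.compile(r"^When was (?P<entity>.+?) born\?$", re.IGNORECASE), "birthDate"),
]


def to_query(question):
    """Convert a question into a structured (entity, property) query, or None."""
    for pattern, prop in TEMPLATES:
        match = pattern.match(question.strip())
        if match:
            return {"entity": match.group("entity"), "property": prop}
    return None


def answer(question):
    """Look up the structured query in the KB; None if nothing matches."""
    query = to_query(question)
    if query is None:
        return None
    return KB.get((query["entity"], query["property"]))


print(to_query("How tall is Andy Lau?"))
print(answer("How tall is Andy Lau?"))
```

In this sketch the entity string is matched literally; in a full system it would first pass through entity linking so that different surface forms resolve to the same entity, and questions that match no template would fall through to the model approach.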
4.3 Knowledge-based Dialog System
Knowledge-based dialog is a more natural and friendly knowledge service, which can satisfy users’ needs and complete specific knowledge acquisition tasks through multiple rounds of human-agent interaction. The latest developments in dialog systems come from deep learning techniques, which use the encoder-decoder model to train the entire system. Related work includes [34, 35, 36, 37]. Combining an external knowledge base is a way to bridge the gap between the dialog system and humans. Using memory networks, [38, 39] have achieved good results in open domain dialog. By combining words in the generation process with common words in the knowledge base, [40] produces natural and correct answers. [41] uses the Twitter LDA model to get the topic of the input, and adds the topic information and the input representation to a joint attention module to generate a topic-related response. [42] classifies each utterance in the conversation into a domain and uses it to generate the domain and content of the next utterance. A dialog system also needs personality and emotion to appear more human. [43] applies emotion embedding to the generative model. [44, 45] both consider the user's information in creating a more realistic chat bot.
With the large-scale growth of knowledge graph resources and the rapid development of machine learning models, dialog systems are gradually moving from limited areas to open areas. Sogou Wang Zai Robot is an automatic question-answering robot developed by Sogou, as shown in Figure 7. It combines Sogou's knowledge graph, Sogou's dialog technology and Sogou's intelligent voice technology to provide accurate answers in daily conversations.
Dialog generation based on a knowledge graph is a key technology in knowledge-based dialog. Traditional KBQA returns only the exact answer to each question. For example, when asked “How tall is Andy Lau?”, the system only returns “174 cm”. However, merely providing this kind of answer is not a friendly way of interacting. Users prefer to receive “The height of Hong Kong actor Andy Lau is 174 cm”. This form provides more background information related to the answer (for example, Hong Kong actor). In addition, a complete natural language sentence can better support follow-up tasks such as answer verification and speech synthesis. To generate natural language answers, we use the encoder-decoder framework. A copy and retrieval mechanism is also introduced for complex questions that require facts from the knowledge graph. Different types of words are obtained from different sources by using different semantic unit acquisition methods such as copy, retrieval or prediction. Thus natural answers are generated for complex questions.
Another problem that needs to be solved in the dialog agent is the consistency of the dialog, i.e., the stability of the agent's persona. It also requires the integration of external knowledge, e.g., the personal information in Table 1. Although the agent is a robot, it should have a unified personality. Its gender, age, native place and hobbies should always be the same. When asked “Where were you born?” or “Are you from Beijing?”, its answers should always be consistent. We model Sogou Wang Zai's information and feed it into an encoder-decoder model as embeddings. Thus, when a question is related to personal information, the model generates responses from the vectors of the identity information, which achieves good consistency.
5. Conclusion
In this paper, we propose a novel solution, used in Sogou Inc., for building knowledge graphs on top of a distributed search engine, specifically SogouQdb. Our solution supplements SogouQdb with data inference and a graph-based query engine, which make the solution compatible with commonly used knowledge graph applications. In addition, thanks to SogouQdb, the Sogou knowledge graph can be easily scaled to store petabytes of data. We also introduce three applications of the knowledge graph in Sogou Inc.: entity detection and linking, knowledge-based question answering and knowledge-based dialog system, which have been used in Web search products to make knowledge acquisition more efficient.
Author Contributions
All of the authors contributed equally to the work. J. Xu is the leader of Sogou Knowledge Graph, who drew the blueprint of the whole system. Q. Zhang brought valuable insights and information to the construction and applications of the knowledge graph. P. Wang and H. Jiang mainly drafted the paper; P. Wang summarized the construction part and H. Jiang summarized the application part. All the authors have made meaningful and valuable contributions in revising and proofreading the resulting manuscript.
Author notes
Sogou Inc., Beijing 100084, China