Knowledge Graph Construction and Applications for Web Search and Beyond

Knowledge graphs (KGs) play an important role in enhancing the performance of many intelligent systems. In this paper, we introduce Sogou Inc.'s solution for building a large-scale multi-source knowledge graph from scratch, including its architecture, technical implementation and applications. Unlike previous works that build knowledge graphs on graph databases, we build ours on top of SogouQdb, a distributed search engine developed by the Sogou Web Search Department, which can easily scale to support petabytes of data. To supplement the search engine, we also introduce a series of models that support inference and graph-based querying. Currently, the Sogou knowledge graph consists of 54 million entities and over 600 million entity links, collected from 136 different websites and constantly updated. We also introduce three applications of the knowledge graph at Sogou Inc.: entity detection and linking, knowledge-based question answering and knowledge-based dialogue systems. These applications have been deployed in Web search products to help users acquire information more efficiently.


INTRODUCTION
A knowledge graph (KG) is a special kind of database that integrates information into an ontology. As an effective way to store and search knowledge, knowledge graphs have been applied in many intelligent systems and have attracted considerable research interest. While many knowledge graphs have been constructed and published, such as Freebase [1], Wikidata [2], DBpedia [3] and YAGO [4], none of these works could completely fulfill the application requirements of Sogou Inc. The main challenges are listed below: Lack of data: Though the biggest published knowledge graph (Wikidata) is reported to contain millions of entities and billions of triples, most of its data are extracted from Wikipedia and still fall far short of the requirements of Web search applications such as general-purpose question answering and recommendation. For example, none of the existing knowledge graphs contains information on the latest Chinese songs, which can only be obtained from specific websites.
Uncertainty of scalability: None of the existing works explicitly reports its system's capability to deal with large-scale data or discusses how the knowledge graph could be expanded across a server cluster. This problem might not be very important for academic research, since even the biggest knowledge graph's data can still be held by a single server with a large hard disk drive. In the case of search engines, however, the potential data requirement of a knowledge graph is much larger and distributed storage is unavoidable.
To solve these challenges, we propose a novel solution for building a large-scale knowledge graph. We use SogouQdb, a distributed search engine developed by the Sogou Web Search Department for internal use, as the core storage engine to obtain scalability, and develop a series of models that provide inference and graph-based querying functions, making the system compatible with common knowledge graph applications. Inference is conducted on HDFS with Spark, which makes the inference procedure capable of dealing with big data. The Sogou knowledge graph is built with this solution and has been deployed to support online products. Currently, it consists of 54 million entities and over 600 million entity links. The data are extracted from 136 different websites and constantly updated.
We also introduce three applications of knowledge graphs in Sogou Inc.: entity detection and linking, knowledge-based question answering and knowledge-based dialogue systems. These applications have been used as an infrastructural service in Web search products to help users find the information they want more efficiently.
The rest of this paper is organized as follows: In Section 2, we review related work on widely known published knowledge graphs. In Section 3, we elaborate our solution for constructing a knowledge graph from scratch. Section 4 presents the applications of the knowledge graph, especially at Sogou Inc. Finally, we draw a conclusion in Section 5.

RELATED WORK
While many works on building domain-specific knowledge graphs have been published, we focus on the construction of large-scale multi-domain knowledge graphs and list the most widely known works in this section.

Data Extraction
The role of data extraction is to transform heterogeneous input data into a pre-defined form. Specifically, the input and output of data extraction are defined as follows: Input: Data downloaded or crawled from the Internet, e.g., Web pages, or XML and JSON data obtained via APIs. While the input data consist mostly of free text, many pages contain structured information such as images, geo-coordinates, links to external Web pages and disambiguation pages. Output: Structured data in the form of JSON-LD that record the knowledge extracted from the input data.
Data extraction operations can be classified into two categories: structured data extraction deals only with input data containing structured information, specifically data with recognizable markup; free text extraction detects entities and extracts the property information of specific entities from free text.

Figure 1: Overview of the Sogou knowledge graph construction framework. The framework can be divided into three parts: Data Preparation contains operations for collecting data from various sources, extracting data from both structured sources and free text, and normalizing data; Knowledge Graph Construction contains all models used to build a knowledge graph based on the extracted and normalized data; Application comprises the applications and services of the knowledge graph. A box with a solid line represents an operation or model that processes data, while a box with a dashed line represents intermediate data.

Structured Data Extraction
As structured information has recognizable markup, we use rule-based methods to build the extractors. An extractor first parses the Web page into a unified DOM tree, then finds the target information according to manually written rules and saves the extracted data in JSON-LD form. We build specialized extractors for each website so that the data of different websites can be updated independently. As of March 2019, the Sogou knowledge graph system has 45 websites as data sources and 77 rule-based extractors.
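As a minimal illustration of this pipeline, the sketch below parses a toy song page with hand-written rules and emits a JSON-LD record. The CSS class names, property names and MusicRecording type are hypothetical stand-ins, not Sogou's actual extraction rules, and a real extractor operates on a full DOM tree rather than a flat tag stream.

```python
import json
from html.parser import HTMLParser

class SongExtractor(HTMLParser):
    # Hand-written rule table: CSS class on the page -> JSON-LD property.
    RULES = {"song-title": "name", "song-artist": "byArtist"}

    def __init__(self):
        super().__init__()
        self.record = {"@context": "http://schema.org", "@type": "MusicRecording"}
        self._field = None  # property the next text node should fill

    def handle_starttag(self, tag, attrs):
        # Remember which property (if any) the current element maps to.
        self._field = self.RULES.get(dict(attrs).get("class"))

    def handle_data(self, data):
        if self._field:
            self.record[self._field] = data.strip()
            self._field = None

def extract(html):
    """Run the rule-based extractor and serialize the result as JSON-LD."""
    parser = SongExtractor()
    parser.feed(html)
    return json.dumps(parser.record, ensure_ascii=False)
```

Keeping one extractor per website, as described above, means a layout change on one source only requires updating that site's rule table.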

Free Text Extraction
The free text extraction task comprises a series of sub-tasks, including extracting named entity mentions from plain text, linking the mentions to entities in the knowledge graph, and extracting entities' properties or the relationships between extracted entities. Since training a model that could deal with all entity types is quite time consuming, we currently focus on a limited set of entity types: Person (PER), Geo-political Entity (GPE), Organization (ORG), Facility (FAC) and Location (LOC). For the named entity recognition and linking tasks, we train a Bi-LSTM-CRF model whose feature and parameter selection follows the work of [7], which achieved the best performance in the TAC KBP 2017 competition [8]. The training data are constructed from SogouBaike and Wikipedia Web pages that contain anchor markups. More details of the model and the training data can be found in Section 4.1.
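The Bi-LSTM-CRF details follow [7]; as a self-contained illustration of just the CRF layer's decoding step, the sketch below runs Viterbi search over toy emission and transition scores. The tag set and numbers are illustrative, not the trained model's parameters.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the best-scoring tag sequence for one sentence.

    emissions[t][tag]: score from the encoder (e.g. a Bi-LSTM) for tag at step t.
    transitions[(prev, cur)]: learned score for the tag bigram prev -> cur;
    missing bigrams are treated as forbidden.
    """
    # best[tag] = (score of best path ending in tag, the path itself)
    best = {tag: (emissions[0][tag], [tag]) for tag in tags}
    for t in range(1, len(emissions)):
        new_best = {}
        for cur in tags:
            score, path = max(
                (best[prev][0] + transitions.get((prev, cur), -1e9) + emissions[t][cur],
                 best[prev][1])
                for prev in tags
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values())[1]
```

The forbidden-bigram trick is what lets a CRF layer rule out invalid label sequences such as an I-PER tag directly following O.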

Merging
The merging section is the entrance to the KG storage, a distributed database storing the whole knowledge graph. Any operation that changes the KG database, including adding, updating or deleting data, has to be transformed into unit operations following a pre-defined interface (comprising "add", "update" and "delete") in the merging section. All unit operations are logged, and the logs can be used to roll back to any historical version.
For adding entities, the merging section checks whether the entity already exists in the KG database. If the entity to be added is found in the database, the old entity's property values are updated with the values of the added entity's matching properties. Otherwise, the entity is added to the database as a new entity. To distinguish entities with the same name, we developed a heuristic model that also compares the entities' property values. For updating and deleting data, the @id property is required and the operation is executed on the entities with the given ids.
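The add path described above can be sketched as follows over an in-memory store. The `same_entity` heuristic and its 0.5 agreement threshold are illustrative assumptions, not the production model, and the real store is the distributed KG database rather than a dict.

```python
def same_entity(a, b, threshold=0.5):
    """Heuristic: same name, and most shared properties agree."""
    if a.get("name") != b.get("name"):
        return False
    shared = [k for k in a if k in b and k not in ("@id", "name")]
    if not shared:
        return True  # nothing to contradict the name match
    agree = sum(a[k] == b[k] for k in shared)
    return agree / len(shared) >= threshold

def merge_add(store, entity, log):
    """Unit 'add' operation: update a matching entity or insert a new one.

    Every executed operation is appended to `log` so that the store can be
    rolled back to a historical version.
    """
    for existing in store.values():
        if same_entity(existing, entity):
            log.append(("update", existing["@id"], dict(existing)))
            existing.update({k: v for k, v in entity.items() if k != "@id"})
            return existing["@id"]
    new_id = f"E{len(store) + 1}"
    store[new_id] = {**entity, "@id": new_id}
    log.append(("add", new_id, None))
    return new_id
```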

Inference
As an additional way to supply data, the inference section infers new relationships between entities based on existing relations. For example, when we know A is B's son, we can infer the new relation that B is A's father. In the construction framework, inference is conducted on the whole dataset dumped from the KG database, and the inference results are added back to the KG through the merging section. Currently, all of our inference models are rule-based. While neural network based inference methods (such as TransE and TransR) can infer more potential relations, the accuracy of their results is not yet good enough to be applied in products.
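A rule of this kind can be sketched as an inverse-relation table applied over the triples. The relation names are hypothetical (we use `parent_of` as the strictly sound inverse of `son_of`, since the child's gender alone does not determine the parent's), and the production version runs over the full dump on HDFS with Spark rather than in memory.

```python
# Inverse-relation rule table: relation -> its inverse (hypothetical names).
INVERSE_RULES = {
    "son_of": "parent_of",
    "capital_of": "has_capital",
    "member_of": "has_member",
}

def infer_inverses(triples):
    """triples: iterable of (subject, relation, object).

    Returns only the inverse triples that are not already present, so the
    result can be fed back through the merging section without duplicates.
    """
    existing = set(triples)
    inferred = set()
    for s, r, o in existing:
        inv = INVERSE_RULES.get(r)
        if inv and (o, inv, s) not in existing:
            inferred.add((o, inv, s))
    return inferred
```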

Knowledge Graph Storage
The Sogou knowledge graph storage is developed on top of SogouQdb, an open source search engine. Figure 2 gives an overview of the architecture of the KG storage. SogouQdb is used as a distributed database to store data and provide search services. The KG Storage Service wraps SogouQdb to provide storing and querying APIs better suited to knowledge graph applications. In practice, we find that querying requests far outnumber storing requests and consume more computational resources. To reduce cost and improve querying speed, a cache layer is added between the querying API and the KG storage service.
Compared with graph databases such as Neo4j and OrientDB, which are commonly used for knowledge graph storage, SogouQdb offers advantages in querying speed, scalability and engineering optimization. One disadvantage of SogouQdb is that it does not natively support knowledge graph query languages such as SPARQL. To solve this problem, we introduce the KG storage service, which parses SPARQL into SogouQdb API calls. Another disadvantage is that SogouQdb is relatively inefficient at data inference. To solve this problem, we separate the inference part from the KG storage and conduct inference on HDFS using Spark. The data to be processed are dumped from SogouQdb using Qdb-Hadoop tools.
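The cache layer can be sketched as a thin wrapper around the querying API. The interface, the FIFO eviction policy and the cache size are illustrative choices, not the production design; the real backend is SogouQdb behind the KG Storage Service.

```python
import hashlib

class CachedKGClient:
    """Cache layer between the querying API and the KG storage service."""

    def __init__(self, backend, max_entries=10000):
        self.backend = backend          # callable: query string -> result
        self.cache = {}                 # query digest -> cached result
        self.max_entries = max_entries
        self.hits = 0

    def query(self, sparql):
        key = hashlib.sha1(sparql.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        result = self.backend(sparql)
        if len(self.cache) >= self.max_entries:
            # Evict the oldest entry (simple FIFO; production would likely
            # use LRU or TTL-based eviction).
            self.cache.pop(next(iter(self.cache)))
        self.cache[key] = result
        return result
```

Because queries far outnumber writes, even a simple layer like this removes most repeated backend round trips.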

Entity Linking
The entity linking task identifies character strings representing entities in natural language text and maps them to specific entities in the knowledge base. For example, Wiki editors manually add hyperlinks from phrases representing entities in the text to the corresponding Wikipedia pages; such a phrase with a Wiki-internal hyperlink is called Anchor Text. Traditional entity linking methods are based on feature engineering: they calculate the degree of match using features derived from the candidate entity and its context, typically including prior information about entities, contextual semantic features, and features of associated entities. Commonly used models include Ranking SVM [9], CRF [10] and S-MART [11]. With the development of neural networks, feature learning has gradually replaced feature engineering. These methods compute a representation of the entity phrase's context and a representation of the candidate entity with a neural network, and define the matching score as the similarity between the two vectors. Entity linking models based on deep learning include [12,13,14,15]. In addition, knowledge graph embedding has also been applied to entity linking: the vector representation of each entity is learned from a large number of knowledge base triplets as training data, so that similar entities have similar vector representations. Methods of vector learning based on knowledge bases include [12,16,17,18,19].
The focus of entity linking is to find the correct entity among multiple candidates and eliminate ambiguity. For example, "Li Na" has multiple possible candidate entities: it may refer to a tennis star, a pop singer, a football baby of Sogou, or even a movie with the same name. In the absence of context information, it is difficult to link entities accurately. A well-designed entity linking service needs to consider many factors, including the entity's prior, the degree of match between the entity and the phrase, and the coherence between the contexts in which the entity and the phrase appear.

At Sogou, the entity linking problem is treated as a ranking problem. We take into account the entity prior, the similarity between the entity description and the context, and the coherence between entities in the same paragraph. Based on the Sogou knowledge graph, we have developed a set of entity linking APIs that provide short text linking, long text linking and table linking services. These services link the entities contained in text to the Sogou knowledge graph.
The short text entity linking service is mainly used for entity linking of query text in our search engine. After entity linking, structural information in the knowledge graph related to the entities is shown to the user, along with illustrations and pictures (Figure 3). At the same time, based on the types of entities and the relationships between entities in the knowledge graph, recommendations of relevant entities are given (Figure 4). These richer results help users obtain what they want, and what they are interested in, more quickly. Entity linking is also the basis for automatic question answering, especially knowledge-based question answering: the entities in a question need to be accurately linked to narrow the scope of the semantic search.
The long text entity linking service is mainly used for Anchor Text generation in Web pages, as shown in Figure 5. To help readers quickly access introductory information about entities mentioned in Sogou Encyclopedia pages, these entity mentions carry hyperlinks to the entities' own pages, i.e., Anchor Text. Our automated entity linking service greatly improves editing efficiency compared with manual annotation. In addition, the long text entity linking service is applied to Sogou's personalized news feed: in combination with the entity linking process, similar or related entities are expanded through the knowledge graph, so we can provide personalized recommendations along with highly interpretable reasons. Table entity linking is also used to generate entity Anchor Text in online tables, such as the tables in Sogou Encyclopedia. Meanwhile, tables provide rich entity type information, entity relationship information, etc. After entity linking, these tables can also supply a large amount of high-confidence triplet information to our knowledge graph.

Knowledge-Based Question Answering
Knowledge graphs usually come with a descriptive query language, such as MQL provided by Freebase, SPARQL standardized by the W3C, and CycL provided by Cyc. However, for ordinary users, such structured query syntax has a high barrier to use. A knowledge-based question answering system uses natural language as the interface to provide a friendlier way of querying knowledge. On the one hand, natural language has very strong expressive power; on the other hand, it does not require users to receive any professional training. Due to its broad application prospects, knowledge-based question answering (KBQA) has become a research hotspot in both academia and industry.
For question understanding, we focus on automatic question answering over a knowledge graph. The task is to find one or more answer entities in the knowledge graph for questions describing objective facts. For a question containing only simple semantics, automatic question answering is equivalent to converting the question into a fact triplet over the knowledge base. However, questions posed by humans are not always this simple; they often carry additional constraints, for example multiple entities and types related to the answer. In complex semantic scenarios, KBQA faces the following challenges: 1) how to find multiple relationships in a question and combine them into a candidate semantic structure; 2) how to calculate the degree of match between a natural language question and a complex semantic structure.
Commonly used methods are based on semantic parsing or ranking. Semantic parsing based methods convert the question into a formal query statement over a standard knowledge base, i.e., they find the optimal (question, semantic query) pair instead of a simple answer entity. Related work includes the generation of semantic parse trees using Combinatory Categorial Grammar (CCG) [20,21,22] and λ-DCS [23,24,25]. Typical applications include ATIS [26] for air travel information, CLANG [27] for robot soccer games, GeoQuery [28] for US geographic knowledge, and the open source question answering system SEMPRE [23]. The ranking method does not need a formal representation of the question, but directly ranks candidate entities or answers in the knowledge base. This kind of method follows the representation-comparison framework; traditional feature engineering methods include [29], and deep learning based methods include [30,31,32].
We have implemented a KBQA system and integrated it into the Sogou search engine (Figure 6) and Sogou's dialogue service. Sogou's KBQA relies mainly on a combination of manual templates and models. With templates, the user's query is directly converted into a structured KB query. In the model-based approach, the entities in the query are first linked to the knowledge graph, and a subgraph is then constructed with each entity as the center. The final answer is obtained by ranking candidate paths and answers formed from the nodes and edges of the subgraph.
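The template path can be sketched as pattern matching against a toy KB. The regexes and KB contents here are hypothetical, and the production system converts matches into structured KB queries rather than dictionary lookups; unmatched questions fall through to the model-based subgraph ranking.

```python
import re

# Hand-written question templates: regex -> KB predicate (hypothetical).
TEMPLATES = [
    (re.compile(r"^How tall is (?P<entity>.+)\?$"), "height"),
    (re.compile(r"^Who is the spouse of (?P<entity>.+)\?$"), "spouse"),
]

# Toy knowledge base keyed by (subject, predicate).
KB = {("Andy Lau", "height"): "174 cm"}

def answer(question):
    """Template-based KBQA: match a template, then look up the fact."""
    for pattern, predicate in TEMPLATES:
        m = pattern.match(question)
        if m:
            return KB.get((m.group("entity"), predicate))
    return None  # no template matched: fall back to the model-based path
```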

Knowledge-Based Dialogue System
Knowledge-based dialogue is a more natural and friendly knowledge service, which can satisfy users' needs and complete specific knowledge acquisition tasks through multiple rounds of human-agent interaction. The latest developments in dialogue systems are based on deep learning techniques, using the encoder-decoder model to train the entire system; related work includes [33,34,35,36]. Incorporating an external knowledge base is one way to bridge the gap between dialogue systems and humans. Using memory networks, [37,38] have achieved good results in open domain dialogue. By combining words in the generation process with common words in the knowledge base, [39] produces natural and correct answers. [40] uses Twitter's LDA model to obtain the input topic, and adds the topic information and input representation to a joint attention module to generate a topic-related response. [41] classifies each utterance in the conversation into a domain and uses it to determine the domain and content of the next utterance. Dialogue systems also need personality and emotion to appear more human: [42] applies emotion embeddings in the generative model, and [43,44] both consider the user's information to create a more realistic chatbot.
With the large-scale growth of knowledge graph resources and the rapid development of machine learning models, dialogue systems are gradually moving from restricted domains to open domains. Sogou Wang Zai Robot is an automatic question-answering robot developed by Sogou, as shown in Figure 7. It combines Sogou's knowledge graph, dialogue technology and intelligent voice technology to provide accurate answers in daily conversations.
Dialogue generation based on knowledge graphs is a key technology in knowledge-based dialogue. Traditional KBQA provides only the bare answer to a question. For example, when asked "How tall is Andy Lau?", the system returns only "174 cm". Merely providing this kind of answer is not a friendly way to interact; users prefer to hear "The height of Andy Lau, actor of Hong Kong, China, is 174 cm", which provides more background information related to the answer (here, actor of Hong Kong, China). In addition, a complete natural language sentence better supports follow-up tasks such as answer verification and speech synthesis. To generate natural language answers, we use the encoder-decoder framework, and introduce copy and retrieval mechanisms for complex questions that require facts from the knowledge graph. Different types of words are obtained from different sources by different semantic unit acquisition methods, namely copy, retrieval and prediction. Thus natural answers are generated even for complex questions.
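The copy/retrieval split can be illustrated with a template-only sketch of how the Andy Lau answer above is assembled; the real system is an encoder-decoder that predicts the connective words, and the KG field names here are hypothetical.

```python
# Toy KG record (field names are illustrative assumptions).
KG = {"Andy Lau": {"occupation": "actor",
                   "region": "Hong Kong, China",
                   "height": "174 cm"}}

def natural_answer(entity, attribute):
    """Compose a natural-language answer from three word sources:
    the entity name is copied from the question, the background and
    answer value are retrieved from the KG, and the connective words
    (here a fixed template) would be predicted by the decoder."""
    facts = KG[entity]
    background = f"{facts['occupation']} of {facts['region']}"   # retrieved
    return (f"The {attribute} of {entity}, "                     # copied
            f"{background}, is {facts[attribute]}.")             # retrieved
```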

Another problem that needs to be solved in a dialogue agent is consistency, i.e., the stability of the agent's persona, which also requires the integration of external knowledge, e.g., the personal information in Table 1. Although the agent is a robot, it needs to have a unified personality: its gender, age, birthplace and hobbies should always be the same, so that when asked "Where were you born?" or "Are you from Beijing?", the answers are always consistent. We model Sogou Wang Zai's information and import it into an encoder-decoder model as embeddings. When a question is related to personal information, responses are generated from the vectors of the identity information, which achieves good consistency.

CONCLUSION
In this paper, we propose a novel solution, used at Sogou Inc., for building knowledge graphs on top of a distributed search engine, specifically SogouQdb. Our solution supplements SogouQdb with data inference and a graph-based query engine, which makes it compatible with commonly used knowledge graph applications. Moreover, benefiting from SogouQdb, the Sogou knowledge graph can be easily scaled to store petabytes of data. We also introduce three applications of the knowledge graph at Sogou Inc.: entity detection and linking, knowledge-based question answering and a knowledge-based dialogue system, which have been used in Web search products to make knowledge acquisition more efficient.

AUTHOR CONTRIBUTIONS
All of the authors contributed equally to the work. J. Xu (xujingfang@sogou-inc.com) is the leader of the Sogou Knowledge Graph, who drew the blueprint of the whole system. Q. Zhang (qizhang@sogou-inc.com) brought valuable insights and information to the construction and applications of the knowledge graph. P. Wang (wangpeilu@sogou-inc.com) and H. Jiang (jianghao216568@sogou-inc.com) mainly drafted the paper; P. Wang summarized the construction part and H. Jiang summarized the application part. All of the authors made meaningful and valuable contributions in revising and proofreading the manuscript.