KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation

Pre-trained language representation models (PLMs) learn effective language representations from large-scale unlabeled corpora. Knowledge embedding (KE) algorithms encode the entities and relations in knowledge graphs into informative embeddings for knowledge graph completion and provide external knowledge for various NLP applications. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which not only better integrates factual knowledge into PLMs but also effectively learns knowledge graph embeddings. KEPLER uses a PLM to encode the textual descriptions of entities as their entity embeddings, and then jointly learns the knowledge embeddings and the language representations. Experimental results on various NLP tasks such as relation extraction and entity typing show that KEPLER achieves results comparable to state-of-the-art knowledge-enhanced PLMs without any additional inference overhead. Furthermore, we construct Wikidata5m, a new large-scale knowledge graph dataset with aligned text descriptions, to evaluate KE methods in both the traditional transductive setting and the challenging inductive setting, which requires models to predict embeddings for unseen entities. Experiments demonstrate that KEPLER achieves good results in both settings.


Introduction
Pre-trained language representation models (PLMs) such as ELMo (Peters et al., 2018a), BERT (Devlin et al., 2019a) and XLNet learn effective language representations from large-scale unstructured and unlabelled corpora and achieve strong performance on various NLP tasks. However, they typically lack factual world knowledge (Petroni et al., 2019). Recent works (Zhang et al., 2019; Peters et al., 2017; Liu et al., 2019a) utilize entity embeddings of large-scale knowledge bases to provide external knowledge for PLMs and improve their performance on various NLP tasks. However, they have some issues: (1) They use fixed entity embeddings learned by a separate knowledge embedding (KE) algorithm, which cannot easily be aligned with the language representations because the two are essentially in different vector spaces. (2) They require an entity linker to link words in context to the corresponding entities before they can benefit from the entity embeddings, which makes them suffer from error propagation. (3) Their sophisticated mechanisms for retrieving and using entity embeddings add inference overhead compared with vanilla PLMs.
Knowledge embedding methods, in fact, have a strong connection with NLP models. Many works integrate knowledge embeddings into NLP models to improve applications such as machine translation (Zaremoodi et al., 2018), reading comprehension (Mihaylov and Frank, 2018; Zhong et al., 2019) and dialogue systems (Madotto et al., 2018), while some earlier works use text as additional information for KE (Xie et al., 2016; An et al., 2018) or jointly train knowledge and text embeddings in the same space (Wang et al., 2014; Toutanova et al., 2015; Han et al., 2016; Cao et al., 2017, 2018).
In this paper, we propose to learn knowledge embeddings and language representations with a unified model and to encode them into the same semantic space, which can not only better integrate knowledge into PLMs but also help to learn more informative knowledge embeddings with effective language representations. We propose KEPLER, which is short for "a unified model for Knowledge Embedding and Pre-trained LanguagE Representation". We collect informative textual descriptions for the entities in the knowledge graph and use a typical PLM to encode the descriptions as text embeddings; we then treat the description embeddings as entity embeddings and optimize a KE objective on top of them. The key idea is to encode structural knowledge into the textual representations of entities using a PLM, which can generalize to entities unobserved in the knowledge graph.
Our KEPLER enjoys the following advantages: (1) We integrate world knowledge into PLMs under the supervision of the KE objective, which is more flexible than injecting fixed embeddings, and we encode entities and text into the same space, which closes the gap between the language representations and separately learned entity embeddings.
(2) We need neither an entity linker nor additional mechanisms to retrieve entity embeddings, which avoids error propagation and extra overhead. During inference, KEPLER behaves exactly like a standard PLM and can thus be adopted in a wide range of NLP applications. (3) Different from conventional KE methods, KEPLER encodes textual entity descriptions as entity embeddings, which enables it to infer knowledge embeddings in the inductive setting (i.e., to obtain embeddings for unseen entities). This is especially useful for deployment, where the model may encounter unseen entities.
Existing KE datasets are relatively small-scale, which is not sufficient for pre-training a large model, and they typically lack description data and a data split for the inductive setting. Therefore, we construct Wikidata5m, a new large-scale knowledge graph dataset with an aligned text description for each entity. Wikidata5m is a subset of Wikidata (Vrandečić and Krötzsch, 2014), a free knowledge base with about sixty million entities. To ensure each entity is informative and the knowledge base is as clean as possible, we only keep entities with corresponding Wikipedia pages. Wikidata5m contains about five million entities and twenty million triplets. We also benchmark several classical KE methods on Wikidata5m to facilitate future research. To our knowledge, this is the first million-scale general knowledge graph dataset.
To summarize, our contribution is three-fold: (1) We propose to encode entities and texts into the same space and to jointly train the KE and language modeling objectives, obtaining a knowledge-enhanced PLM that avoids error propagation and additional inference overhead. Experimental results on various NLP tasks demonstrate the effectiveness of KEPLER. (2) We encode textual descriptions as entity embeddings, which improves KE with textual information and enables inductive KE. (3) We introduce a new large-scale knowledge graph dataset, Wikidata5m, which may promote research on large-scale knowledge graphs, inductive knowledge embedding, and the interactions between knowledge graphs and NLP.

Related Work
Pre-trained Language Model There has been a long history of pre-training in NLP. Early works focus on distributed word representations (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014), many of which are still often adopted in current models as word embeddings for their ability to capture syntactic and semantic information from large-scale corpora. Peters et al. (2018b) push this trend a step further by using a bidirectional LSTM to produce contextualized word embeddings (ELMo) that capture richer semantic meaning under different contexts.
Apart from those methods using pre-trained word embeddings as input features, there is another trend exploring pre-trained encoders. Dai and Le (2015) first propose to train an auto-encoder on unlabeled data and then fine-tune it on downstream tasks. Howard and Ruder (2018) propose a universal language model (ULMFiT) based on AWD-LSTM (Merity et al., 2018). With the powerful Transformer (Vaswani et al., 2017) as its encoder, Radford et al. (2018) present a pre-trained generative model (GPT), while Devlin et al. (2019b) release a pre-trained deep Bidirectional Encoder Representation from Transformers (BERT), achieving state-of-the-art results on dozens of benchmarks.
After Devlin et al. (2019b), similar pre-trained encoders have sprung up. Yang et al. (2019) propose a permutation language model (XLNet) based on Transformer-XL. Later, Liu et al. (2019c) show that more data and more sophisticated parameter tuning benefit pre-trained encoders considerably, and release a new state-of-the-art model (RoBERTa). Other works explore how to add more tasks (Liu et al., 2019b) and more data to pre-trained encoders.

[Figure 1: The KEPLER framework, illustrated with aligned entity descriptions such as "… are three scientific laws describing the motion of planets around the Sun, published by Johannes Kepler", "NASA … is an independent agency … for the civilian space program …", and "Kepler space telescope … is a retired space telescope launched by NASA … named after astronomer Johannes Kepler".]

Knowledge Graph Embeddings In recent years, knowledge embeddings have been extensively studied through the task of predicting missing links in graphs. Conventional models define score functions for relation triples (h, r, t) and predict head or tail entities with the scores of candidate entities. For example, TransE (Bordes et al., 2013) treats tail entities as translations of head entities, DistMult (Yang et al., 2015) uses matrix multiplications as its score function, ComplEx (Trouillon et al., 2016) extends it with complex-valued operations, and RotatE (Sun et al., 2019a) combines the advantages of both.
Among these works, Xie et al. (2016) propose to utilize entity descriptions as an external information source and introduce an entity description encoder to enhance the TransE score function. Though similar to our method, Xie et al. (2016) aim to use entity descriptions to help knowledge representation learning, while we take entity descriptions as a bridge to incorporate external knowledge into a pre-trained language model.

KEPLER Model
In this section, we introduce the structure of our KEPLER model, and how we combine two training goals of masked language modeling and knowledge representation learning.

Training Objectives
To incorporate world knowledge into our pre-trained language representation models (PLMs), we design a multi-task loss as shown in Figure 1 and Equation 1:

    L = L_KE + L_LM,    (1)

where L_KE is the knowledge embedding loss and L_LM is the language modeling loss. Since our PLM is involved in both tasks, jointly optimizing the two objectives can implicitly integrate knowledge from external graphs into the text encoder while keeping the strong abilities of PLMs for syntactic and semantic understanding.
More specifically, we adopt a general L_KE formulation with negative sampling, in which (h, r, t) is a correct triple from the knowledge graph, (h_i, r, t_i) are negatively sampled triples, and d_r is the score function, for which there are many choices. Different from conventional knowledge embedding methods, we do not look up the entity embeddings h and t in embedding tables; instead, we use the PLM as a text encoder to extract entity representations from their descriptions.
For L_LM, many pre-training objectives for language representation can be used, e.g., the masked language model (Devlin et al., 2019b). Note that the two tasks share only the text encoder, and that for each mini-batch the text sampled for L_KE and for L_LM is not (necessarily) the same.
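As a minimal sketch, the joint objective in Equation 1 is a plain sum of the two losses; no weighting term is described in the text. The function name here is illustrative, not from the released code:

```python
def kepler_loss(l_ke: float, l_mlm: float) -> float:
    """Equation 1 (sketch): the overall objective is the plain sum of the
    knowledge-embedding loss and the language-modeling loss. Both terms are
    computed with the same shared text encoder, so gradients from the KE
    task shape the language representations and vice versa."""
    return l_ke + l_mlm
```

In an actual training loop the two terms would be computed on separate mini-batches (entity descriptions for L_KE, plain text for L_LM) and summed before the backward pass.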

Model Details
Though there are many alternative model structures and training objectives to choose from under the KEPLER framework, for clarity we introduce here the specific configuration used in our experiments.

Model Structure
We use the Transformer architecture (Vaswani et al., 2017) as in Devlin et al. (2019b) and Liu et al. (2019c), which we will not describe in detail. To be specific, we use the RoBERTa-BASE code and checkpoints (https://github.com/pytorch/fairseq) in all our experiments, since it is one of the state-of-the-art pre-trained models with acceptable computing requirements. Besides the training data and hyperparameters, one of the major differences between RoBERTa and BERT is that RoBERTa uses Byte-Pair Encoding (BPE) (Sennrich et al., 2016) to better tokenize rare words.
MLM Objective
Inspired by BERT (Devlin et al., 2019b), given a sequence of input tokens x_1, x_2, ..., x_N, MLM randomly selects 15% of them, among which 80% are masked with the special token [MASK], 10% are replaced by another random token, and the rest remain unchanged. The model tries to predict the original tokens at the selected positions, and a cross-entropy loss is calculated over them.
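The 80/10/10 corruption scheme can be sketched as follows; this is a simplified token-level illustration, not the fairseq implementation, which operates on BPE subwords:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", seed=0):
    """BERT-style masking (sketch): select ~15% of positions; of the
    selected ones, 80% become [MASK], 10% become a random vocabulary
    token, and 10% stay unchanged (but are still predicted).
    Returns the corrupted sequence and the positions the loss covers."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    selected = []
    for i in range(len(tokens)):
        if rng.random() < 0.15:
            selected.append(i)
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_token          # 80%: mask
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)   # 10%: random token
            # else: 10%: leave unchanged
    return corrupted, selected
```

The cross-entropy loss would then be computed only over the returned `selected` positions.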
We adopt the pre-trained RoBERTa-BASE checkpoint to initialize our model. However, we keep MLM as one of our objectives to avoid catastrophic forgetting (McCloskey and Cohen, 1989) while training towards the KE loss. Note that our experiments show that simply continuing pre-training from the RoBERTa-BASE checkpoint does not bring improvement, suggesting that the combination of the two tasks contributes most to the performance.

KE Objective
We use the loss formula from Sun et al. (2019b) as our KE objective, which adopts negative sampling (Mikolov et al., 2013) for efficient optimization:

    L_KE = -log σ(γ - d_r(h, t)) - Σ_{i=1}^{n} (1/n) log σ(d_r(h_i, t_i) - γ),    (2)

where (h, r, t) is the correct triple, (h_i, r, t_i) are negatively sampled triples, γ is the margin, σ is the sigmoid function, and d_r is the score function, for which we follow TransE (Bordes et al., 2013) for its simplicity and efficiency:

    d_r(h, t) = ||h + r - t||_p,

where we take the norm p as 1. Due to limited computing resources, we set the negative sampling size n to 1. The negative sampling policy is to fix the head entity and randomly sample a tail entity, and vice versa.
Different from conventional KE methods, we do not use an entity embedding lookup table. Instead, we use KEPLER to encode the corresponding entity descriptions and take the [CLS] outputs as the entity embeddings.
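The KE loss with a TransE score (p = 1) can be sketched as follows; the vectors passed in stand for the [CLS] description embeddings that the PLM would produce, and the margin value is illustrative since none is given above:

```python
import math

def transe_distance(h, r, t, p=1):
    """TransE score d_r(h, t) = ||h + r - t||_p; the paper takes p = 1."""
    return sum(abs(hi + ri - ti) ** p for hi, ri, ti in zip(h, r, t)) ** (1.0 / p)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ke_loss(pos, negs, gamma=4.0):
    """Negative-sampling loss from Sun et al. (2019b):
       L_KE = -log sig(gamma - d_r(h, t))
              - sum_i (1/n) log sig(d_r(h_i, t_i) - gamma).
    `pos` is a correct (h, r, t) triple; `negs` are corrupted triples.
    gamma=4.0 is an illustrative margin, not a value from the paper.
    In KEPLER, h and t come from the PLM's [CLS] outputs on the entity
    descriptions rather than from an embedding table."""
    h, r, t = pos
    loss = -math.log(sigmoid(gamma - transe_distance(h, r, t)))
    for h_n, r_n, t_n in negs:
        loss -= math.log(sigmoid(transe_distance(h_n, r_n, t_n) - gamma)) / len(negs)
    return loss
```

A correct triple with a small distance yields a small loss; swapping the positive and negative triples makes the loss grow, which is what drives the embeddings apart.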

Downstream Tasks
As with other BERT-like models, we fine-tune KEPLER on downstream tasks, using the [CLS] output for sentence-level prediction and the outputs of all tokens for sequence labelling tasks (Devlin et al., 2019b). For supervised relation extraction and few-shot relation extraction, we follow the approaches of Baldini Soares et al. (2019) and Gao et al. (2019), respectively.

Wikidata5m
We construct a new large-scale knowledge graph dataset with aligned text descriptions. Our dataset is built by integrating Wikidata (Vrandečić and Krötzsch, 2014), a large-scale open knowledge base, with Wikipedia. Each entity in the knowledge graph is aligned with its text description in Wikipedia pages. In the following sections, we will first introduce the data collection steps, and then give the benchmarks of popular KE methods on this dataset.

Data Collection
We pull the latest dumps of Wikidata and Wikipedia from their respective websites, and remove pages whose first paragraph contains fewer than five words. For each entity, we align it to a Wikipedia page with the MediaWiki wbgetentities action API, and extract the first section of the page as the entity's description. Entities without corresponding Wikipedia pages are discarded.
To construct the knowledge graph, we retrieve all the statements in entity pages and map the entities and relations in the statements to their canonical IDs in Wikidata. A statement is considered a valid triplet if both of its entities can be aligned with Wikipedia pages and its relation has a non-empty page in Wikidata. The final knowledge graph dataset contains 4,813,455 entities, 822 relations and 21,344,269 triplets, where each entity has a text description. Statistics of our Wikidata5m dataset and four widely used datasets are shown in Table 1, and the top-5 entity categories are listed in Table 3. We can see that Wikidata5m is much larger than existing knowledge graph datasets, covering all sorts of domains.
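The filtering rule above can be sketched as a simple predicate over raw statements; the identifiers used here are illustrative:

```python
def build_triplets(statements, aligned_entities, relation_pages):
    """Wikidata5m filtering rule (sketch): a statement (h, r, t) becomes a
    valid triplet only if both entities are aligned with Wikipedia pages
    and the relation has a non-empty page in Wikidata.
    `statements` is an iterable of (head_id, relation_id, tail_id) tuples;
    `aligned_entities` and `relation_pages` are sets of IDs that passed
    the respective checks."""
    return [
        (h, r, t)
        for (h, r, t) in statements
        if h in aligned_entities and t in aligned_entities and r in relation_pages
    ]
```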

Data Split
The data split statistics for the conventional transductive setting are also shown in Table 1.
In this work, we also evaluate models in the challenging inductive setting, which requires them to produce embeddings for entities not seen at training time and to perform link prediction for these unseen entities. We therefore provide a data split for inductive evaluation; its statistics are shown in Table 2. In the inductive setting, the entities and triplets of the training, validation and test sets are mutually disjoint, while in the transductive setting only the triplet sets are mutually disjoint.
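The disjointness requirement distinguishing the two splits can be checked with a few lines (a sketch over triples as plain tuples):

```python
def entities_of(triplets):
    """Collect the set of head and tail entities appearing in a triplet set."""
    return {e for (h, _, t) in triplets for e in (h, t)}

def is_inductive_split(train, valid, test):
    """An inductive split requires the entity sets of train/valid/test to be
    mutually disjoint (which implies the triplet sets are disjoint too);
    a transductive split only requires disjoint triplet sets."""
    a, b, c = entities_of(train), entities_of(valid), entities_of(test)
    return not (a & b) and not (a & c) and not (b & c)
```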

Benchmarks
To assess the challenges of Wikidata5m, we benchmark several popular knowledge graph embedding models on the dataset. Since conventional knowledge graph embedding models are inherently transductive, we split the triplets of the knowledge graph into training, validation and test sets. Each model is trained on the training set and evaluated on the link prediction task. We benchmark five knowledge graph embedding models: TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), SimplE (Kazemi and Poole, 2018) and RotatE (Sun et al., 2019b). Because their original implementations do not scale to Wikidata5m, we run these methods with the multi-GPU implementation in GraphVite. Link prediction is evaluated in the filtered setting, where test triplets are ranked against all candidate triplets not observed in the knowledge graph. We report the standard metrics of Mean Rank (MR), Mean Reciprocal Rank (MRR) and Hits at N (HITS@N). Table 4 shows the benchmarks of these methods on Wikidata5m.
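The reported metrics can be computed from the per-test-triplet ranks as follows; the ranks themselves would come from scoring all filtered candidate entities:

```python
def link_prediction_metrics(ranks, ns=(1, 3, 10)):
    """Compute MR, MRR and HITS@N from per-triplet ranks. In the filtered
    setting, each rank is the position of the correct entity among the
    candidates, after removing candidates that would form a triplet
    already observed in the knowledge graph."""
    total = len(ranks)
    mr = sum(ranks) / total                                   # Mean Rank
    mrr = sum(1.0 / r for r in ranks) / total                 # Mean Reciprocal Rank
    hits = {n: sum(1 for r in ranks if r <= n) / total for n in ns}
    return mr, mrr, hits
```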

Experiments
In this section, we introduce the experiment settings and experimental results of KEPLER on various NLP and KE tasks.

Pre-training settings
In experiments, we choose RoBERTa (Liu et al., 2019c) as our base model and implement our methods in the fairseq framework (Ott et al., 2019) for pre-training. Due to limited computing resources, we choose the RoBERTa-BASE architecture and use the released roberta.base parameters to initialize our model.
In our pre-training procedure, we only use the English Wikipedia corpus, both to save time and for a fair comparison with previous knowledge-enhanced PLMs (Zhang et al., 2019).

NLP Tasks
In this section, we introduce how our KEPLER can be used as a knowledge-enhanced PLM on various NLP tasks and its performance compared with state-of-the-art models.

Relation Classification
Relation classification is an important NLP task that requires models to classify the relation type between two given entities based on text. We evaluate our model and baselines on two commonly used datasets: TACRED (Zhang et al., 2017) and FewRel (Han et al., 2018). TACRED covers 42 relation types and contains 106,264 sentences. FewRel is a few-shot relation classification dataset with 100 relations and 700 instances per relation.
Here we follow the relation extraction fine-tuning procedure of Zhang et al. (2019), where four special tokens are added before and after the two entity mentions in the sentence to mark where the entities are; we then take the [CLS] output as the sentence representation for classification. Table 5 shows the results of various models on TACRED, from which we can see that our model achieves state-of-the-art results on this benchmark. Note that some baselines use the LARGE version of pre-trained language models while we use the BASE architecture; we still obtain a large improvement over our base model (RoBERTa-BASE) and remain slightly ahead of other competitive methods, even those with a LARGE architecture. Our model also shows strength on the FewRel dataset. We use Prototypical Networks (Snell et al., 2017) and PAIR (Gao et al., 2019) as the base frameworks and try different pre-trained models as encoders. As shown in Table 6, which reports accuracies (%) under the 5-way 1-shot, 5-way 5-shot, 10-way 1-shot and 10-way 5-shot settings, our models outperform the other encoders in both frameworks. We also compare with the current state of the art, MTB (Baldini Soares et al., 2019), which outperforms ours slightly. Note, however, that MTB uses the LARGE version of BERT while we use the BASE version, and that it introduces a pre-training task specifically targeting relation extraction, while ours is a general way to combine knowledge and natural language that should benefit all knowledge-related tasks.
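The entity-marking step can be sketched as follows; the marker strings and the span convention (end index exclusive, closing marker inserted after the mention) are illustrative choices, not necessarily those of the cited work:

```python
def add_entity_markers(tokens, head_span, tail_span,
                       markers=("[E1]", "[/E1]", "[E2]", "[/E2]")):
    """Insert four special tokens around the two entity mentions (sketch).
    `head_span` and `tail_span` are (start, end) token indices with `end`
    exclusive. The marked sentence is then encoded and the [CLS] output
    used as the representation for classification."""
    (h_start, h_end), (t_start, t_end) = head_span, tail_span
    m1s, m1e, m2s, m2e = markers
    # Insert from the rightmost position first so earlier indices stay valid.
    inserts = sorted(
        [(h_start, m1s), (h_end, m1e), (t_start, m2s), (t_end, m2e)],
        key=lambda x: x[0], reverse=True)
    out = list(tokens)
    for pos, marker in inserts:
        out.insert(pos, marker)
    return out
```

In practice the marker strings would be added to the model vocabulary as special tokens so they receive their own embeddings.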

Entity Typing
Entity typing requires models to classify given entity mentions into pre-defined entity types. For this task, we evaluate all models on OpenEntity (Choi et al., 2018) following the setting of Zhang et al. (2019), which focuses on nine general entity types. Evaluation results are shown in Table 7. We currently achieve better results than RoBERTa, while ERNIE and KnowBERT show slightly better results than ours. This is mainly because we use a different way of extracting entity representations: KnowBERT adds special tokens before and after the mention and uses the output of the token before the mention as the representation for typing, while we currently use the [CLS] output directly. We will try this better way of extracting entity representations in the future.

Knowledge Embedding
In this section, we show how KEPLER works as a KE model and evaluate it on our Wikidata5m dataset in the inductive setting.
We do not use the existing KE benchmarks because they lack high-quality text descriptions for their entities and a reasonable data split for the inductive setting.

Inductive Setting
We evaluate the generalization ability of KEPLER by testing it in the inductive setting of Wikidata5m (as described in Section 4.2), which requires it to produce effective entity embeddings for unseen entities. The results are shown in Table 8.

Conclusion and Future Work
In this paper, we propose KEPLER, a unified model for knowledge embedding and pre-trained language representation. We jointly train the knowledge embedding and language representation objectives on top of the language representation model. Experimental results on extensive tasks demonstrate the effectiveness of our model.
In the future, we will: (1) Evaluate whether our model can recall factual knowledge on more tasks.
(2) Try variations of the model, such as highlighting entity mentions in descriptions or changing the form of the knowledge embedding, to better understand how KEPLER works and to bring further improvements on downstream tasks.