Abstract
We present a memory-augmented approach to condition an autoregressive language model on a knowledge graph. We represent the graph as a collection of relation triples and retrieve relevant relations for a given context to improve text generation. Experiments on WikiText-103, WMT19, and enwik8 English datasets demonstrate that our approach produces a better language model in terms of perplexity and bits per character. We also show that relational memory improves coherence, is complementary to token-based memory, and enables causal interventions. Our model provides a simple yet effective way to combine an autoregressive language model and a knowledge graph for more coherent and logical generation.
1 Introduction
A core function of language is to communicate propositions (e.g., who did what to whom). As such, language models need to be able to generate this information reliably and coherently. Existing language models (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020) do not have explicit representations for such information and rely on it being implicitly encoded in their parameters (Liu et al., 2019; Petroni et al., 2019; Wang et al., 2020). This encoding mechanism makes it difficult to interpret what language models know and often leads to generating illogical and contradictory content. For example, Logan et al. (2019) observe that existing language models rely heavily on word correlations and fall short of logical reasoning. This causes the model to hallucinate—for example, that Barack Obama’s wife is Hillary Clinton based on the high co-occurrence of the two entities. In another example, Lake and Murphy (2020) notice that GPT-2 (Radford et al., 2019) states that unicorns have four horns directly after stating that unicorns have one horn.
In this work, we explore ways to combine an autoregressive language model with a knowledge graph. We design a memory-augmented architecture that stores relations from a knowledge graph and investigate the effect of conditioning on this relational memory in an autoregressive language model. In contrast to existing token-based memory-augmented language models that store context-target pairs (Khandelwal et al., 2020b; Yogatama et al., 2021), our memory stores relation triples (head entity, relation, tail entity). Relation triples form the basis of knowledge bases, empowering a wide range of applications such as question answering (Yasunaga et al., 2021), machine reading (Yang and Mitchell, 2019), and reasoning (Minervini et al., 2020). From a cognitive science perspective, we can consider the neural language model to be an instance of System 1, which performs fast inference, and the symbolic relational memory to be a world model that supports the slow and logical reasoning of System 2 (Kahneman, 2011).1 We hypothesize that relational memory can improve the performance and coherence of an autoregressive language model.
Given an observed context, we first run an entity tagger to identify entities in the context. We then use tf-idf (Ramos et al., 2003) to select salient entities. We retrieve relations (from a knowledge base) for the selected entities and design a gating function that allows the language model to adaptively combine information from extracted relations and observed textual context to predict the next token. Existing knowledge bases such as Freebase and Wikidata can be used as a source of information from which to retrieve relations. However, they are often incomplete and do not contain relations that are suitable for the particular dataset that we want to work with. Instead of using these predefined knowledge bases, we choose to perform open information extraction (OpenIE) on each language modeling dataset to get relations. As a result, our model is able to move beyond simple co-occurrence statistics and generate text that is more grounded on real-world relations observed in a particular corpus.
Our main contributions are as follows:
We evaluate the model on three English language modeling datasets. We show that our model outperforms a strong transformer-XL baseline (Dai et al., 2019) on both word-level (WikiText-103 and WMT19) and character-level (enwik8) language modeling, in terms of perplexity and bits per character respectively (§3.3).
We conduct comprehensive ablation and design choice studies to understand contributions of different components of our models (§4.1).
We measure coherence with human evaluation and two automatic metrics (knowledge perplexity and knowledge F1) and demonstrate that relational memory improves coherence (§4.2).
We study the relationship between our method and a typical memory-augmented language model that stores word tokens in its memory (Yogatama et al., 2021). We show that relational memory is complementary to token-based memory and combining them improves performance further (§3.3).
We perform qualitative analysis by examining gate values and retrieved relations. In line with our main motivation, we find that the relational memory is particularly useful for predicting entities. Further, we demonstrate that such explicit propositional representations allow causal interventions and increase interpretability of language models (§4.3).
2 Model
2.1 Transformer-XL
We use transformer-XL (Dai et al., 2019)—which is based on the transformer (Vaswani et al., 2017)—to parametrize the conditional probabilities in Eq. 1. The transformer stacks multiple self-attention layers to obtain contextualized representations.
Language modeling datasets usually consist of articles of different lengths. It is impractical to apply a transformer to encode long articles, as its computational complexity is quadratic in the sequence length. In practice, each article is usually truncated into fixed-length text segments {xt−N+1,…,xt} of length N to train and evaluate the model. However, this approximation prevents the transformer from capturing long-term dependencies beyond a single text segment. Transformer-XL reuses hidden states from previous text segments to extend the context window.
More specifically, denote the hidden state of xt at layer ℓ as h_t^ℓ. Given a text segment {xt−N+1,…,xt} and its extended context {xt−N−M+1,…,xt−N} of length M, both the hidden states of the text segment and the hidden states of the extended context are used. When performing self-attention, each token in the text segment can attend to the preceding tokens in the text segment and to all the tokens in the extended context, enabling longer-term dependencies than a vanilla transformer. Importantly, transformer-XL does not backpropagate through the hidden states of the extended context during training (stop-gradient operators are applied to all the hidden states in the extended context).
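To make this concrete, the sketch below shows single-head attention over the current segment concatenated with cached states from the previous segment, with gradients stopped through the cache. It is a minimal illustration of segment-level recurrence only; the actual model uses multi-head attention and relative positional encodings, and the function and parameter names here are ours.

```python
import jax
import jax.numpy as jnp

def attend_with_memory(h_segment, h_memory, w_q, w_k, w_v):
    """Single-head self-attention over the current segment plus cached hidden
    states from the previous segment.

    h_segment: [N, d] hidden states of the current text segment
    h_memory:  [M, d] cached hidden states of the extended context
    w_q, w_k, w_v: [d, d] projection matrices
    """
    # Transformer-XL does not backpropagate into the cached states.
    h_memory = jax.lax.stop_gradient(h_memory)
    n, m = h_segment.shape[0], h_memory.shape[0]
    context = jnp.concatenate([h_memory, h_segment], axis=0)   # [M+N, d]

    q = h_segment @ w_q                                        # [N, d]
    k = context @ w_k                                          # [M+N, d]
    v = context @ w_v                                          # [M+N, d]
    scores = q @ k.T / jnp.sqrt(q.shape[-1])                   # [N, M+N]

    # Causal mask: position i attends to all M memory slots and to
    # segment positions <= i.
    causal = jnp.tril(jnp.ones((n, n), dtype=bool))
    mask = jnp.concatenate([jnp.ones((n, m), dtype=bool), causal], axis=1)
    scores = jnp.where(mask, scores, -1e30)
    return jax.nn.softmax(scores, axis=-1) @ v                 # [N, d]
```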
2.2 Relational Memory
In this section, we first introduce how we obtain relation triples using OpenIE (§2.2.1). We then use tf-idf to score entities in the observed context and retrieve relation triples related to these entities (§2.2.2) to construct the relational memory. Finally, we show an integrated architecture that allows transformer-XL to incorporate the relational memory when predicting the next token (§2.2.3). We show our architecture in Figure 1. The pseudocode for training and evaluating with the relational memory is given in Algorithm 1. In the pseudocode, we use TRAIN(xc, ℳ) and EVAL(xc, ℳ) to refer to training with the cross-entropy loss and evaluating (e.g., calculating perplexity) on the text segment xc conditioned on the relational memory ℳ, respectively.
We identify salient entities in the previous text segment and extract relations to build our relational memory. We encode each relation with an LSTM encoder, aggregate the resulting representations into a vector, and use a gate mechanism that allows our language model to adaptively take advantage of relational information for predicting the next token.
2.2.1 Open Information Extraction
A key challenge in utilizing relational information for language modeling is obtaining high-quality relation triples. There are several well-established knowledge bases, such as Freebase (Bollacker et al., 2007) and YAGO (Rebele et al., 2016). However, existing knowledge bases suffer from missing relations and often do not contain relation triples related to the observed contexts in a target corpus, even though research on knowledge base completion has resulted in significant advances (Bordes et al., 2013; Trouillon et al., 2016; Zhang et al., 2019).
In this work, we use OpenIE (Angeli et al., 2015; Etzioni et al., 2008) to obtain relation triples. Since OpenIE directly extracts relation triples from each dataset D, it provides a structured way to represent the knowledge in D.2 Specifically, we perform OpenIE on the training set of D. Given an entity e, we retrieve the set of relation triples in which e is either the head entity or the tail entity. Conceptually, this set consists of all the relation triples from the one-hop subgraph centred at the entity e in the knowledge graph constructed from D. It can therefore provide “global” information about the entity.
Dynamic OpenIE.
Dynamic OpenIE takes advantage of the autoregressive nature of language modeling, where text segments are processed sequentially. In addition to extracting relations from the training set of D, we can also extract relations from previously seen text segments of the evaluation set. We refer to this extraction mechanism as dynamic OpenIE. After a text segment {xt−N+1,…,xt} has been evaluated, for example, after calculating perplexity on it, we perform OpenIE on the segment to obtain new relation triples to be added to our knowledge graph. Note that we only perform OpenIE on previously seen text segments and never use unseen text, so this mechanism does not violate the autoregressive nature of language modeling; metrics such as perplexity and bits per character are calculated as usual. We expect the relation triples extracted from seen text segments to be useful for predicting upcoming tokens. The idea of using seen text segments during evaluation to improve language modeling is related to dynamic evaluation (Krause et al., 2018, 2019), where the model is adapted to recent history during evaluation via gradient descent so that it can assign higher probabilities to re-occurring patterns. In contrast to dynamic evaluation, we do not update model parameters and only extract new relations from seen text segments to enrich our corpus-specific knowledge graph.
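To make the procedure concrete, here is a sketch of evaluation with dynamic OpenIE in the spirit of Algorithm 1 (which is not reproduced here). The `model`, `knowledge_graph`, `salient_entities`, and `run_openie` interfaces are placeholders we introduce for illustration, not the paper's code; training follows the same loop without the `run_openie` call and with a gradient step instead of scoring.

```python
import math
from collections import deque

def evaluate_article(segments, model, knowledge_graph, salient_entities, run_openie,
                     memory_capacity, top_k=5):
    """Evaluate one article segment by segment, conditioning each segment on a
    relational memory built only from previously seen text."""
    memory = deque(maxlen=memory_capacity)   # FIFO relational memory of capacity P
    total_log_prob, total_tokens = 0.0, 0
    previous_segment = None
    for segment in segments:
        if previous_segment is not None:
            # Retrieve one-hop relations for salient entities of the previous segment.
            for entity in salient_entities(previous_segment, top_k):
                memory.extend(knowledge_graph.relations(entity))
        # EVAL(x_c, M): score the segment conditioned on the relational memory.
        log_prob, num_tokens = model.eval_step(segment, list(memory))
        total_log_prob += log_prob
        total_tokens += num_tokens
        # Dynamic OpenIE: extract triples from the segment we have just scored
        # and add them to the corpus-specific knowledge graph.
        knowledge_graph.add(run_openie(segment))
        previous_segment = segment
    return math.exp(-total_log_prob / total_tokens)  # perplexity
```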
Mismatch between Training and Evaluation.
As shown in Algorithm 1, we do not use dynamic OpenIE during training because of its additional computational overhead (see the speed comparison in §4.1), which results in a mismatch between training and evaluation. We extract all relation triples from the training set of each dataset before training on D. As a result, during training we may retrieve relation triples extracted from not-yet-seen text of the training set when performing relation retrieval (§2.2.2). We do not suffer from this issue during evaluation, as we extract relations only from previously seen text of the evaluation set. We believe this mismatch is minor given the superior performance of our model in the experiments.
2.2.2 Relation Retrieval
Given a knowledge graph (represented as a collection of triples), an ideal relational memory consists of a set of triples that are relevant to the observed context. There are many choices to measure the relatedness between the observed context and relation triples in our knowledge graph—for example, based on keyword search or dense retrieval (Karpukhin et al., 2020; Guu et al., 2020; Yogatama et al., 2021).
In this work, we use keyword search because of its simplicity and leave methods based on dense retrieval to future work. Specifically, given the observed context, we perform entity recognition (Ratinov and Roth, 2009; Nadeau and Sekine, 2007) on the context and score the tagged entities with tf-idf (Ramos et al., 2003). The top-K scored entities (K is set to 5 in our experiments) are used to retrieve relations from the knowledge graph. These retrieved relations are used to construct the relational memory ℳ. Note that the entities are selected from the observed context, so unseen text is not utilized. We limit the capacity of ℳ to P. If the number of newly retrieved triples is larger than P, we randomly select P of them to insert into ℳ and drop the rest. Otherwise, the relational memory operates on a first-in-first-out basis: when ℳ is full, older relations are overwritten by newly retrieved ones. The relational memory is re-initialized to empty when an article ends.
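A sketch of the retrieval and memory-update step just described; the entity tagger, the tf-idf scores, and the one-hop lookup are assumed to be given, and all names here are ours rather than the paper's.

```python
import random
from collections import deque

def update_relational_memory(memory, context_entities, tfidf_score, one_hop_relations,
                             top_k=5):
    """Insert relations for the top-K tf-idf entities of the observed context
    into the FIFO relational memory.

    memory:            deque with maxlen P (the memory capacity)
    context_entities:  entities tagged in the observed context
    tfidf_score:       callable mapping an entity to its tf-idf score
    one_hop_relations: callable mapping an entity to its (head, relation, tail) triples
    """
    top_entities = sorted(context_entities, key=tfidf_score, reverse=True)[:top_k]
    retrieved = [triple for e in top_entities for triple in one_hop_relations(e)]
    if len(retrieved) > memory.maxlen:
        # More new triples than the memory can hold: keep a random subset of P.
        retrieved = random.sample(retrieved, memory.maxlen)
    # Otherwise first-in-first-out: appending to a full deque drops the oldest triples.
    memory.extend(retrieved)
    return memory

# The memory is re-initialised (memory.clear()) when an article ends.
```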
As shown in Algorithm 1, since we update ℳ only after processing an entire text segment, all tokens in the same text segment are conditioned on the same relational memory. This is more efficient than updating ℳ each time a new entity is encountered and is more amenable to batch training.
2.2.3 Integration with Transformer-XL
We now show how we can integrate relational memory with transformer-XL. We refer to our model as RelationLM.
Relation Triple Encoding.
We first discuss how we encode the relation triples in the relational memory ℳ. We treat relation triples as text and serialize each triple into a sequence; for example, (Barack Obama, president of, United States) is converted into the sequence “Barack Obama, president of, United States”. This sequential representation captures the order of head and tail entities well and is also adopted by KG-BERT (Yao et al., 2019) and Kepler (Wang et al., 2021b). Because each example in a batch corresponds to P retrieved relations, we obtain B ⋅ P relation sequences per batch, where B and P denote the batch size and relational memory length, respectively. Since P is on the order of hundreds, memory constraints prevent us from using large models (e.g., a multi-layer transformer) to encode these sequences. In our preliminary experiments, we compare LSTM (Hochreiter and Schmidhuber, 1997), GRU (Cho et al., 2014), and a one-layer transformer and find that the LSTM performs marginally better. Therefore, for each relation triple rp, we reuse the transformer-XL word embedding matrix We to map each token in the sequence to its embedding vector. We then run an LSTM over the sequence and use the hidden representation of the last token as the relation representation rp.
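As a concrete illustration, a minimal relation encoder in Haiku (the library used in §3.2) might look as follows; it assumes the embedding matrix is passed in explicitly, is unbatched for clarity, and must be called inside an `hk.transform`-ed function, so treat the interface as a sketch rather than the paper's implementation.

```python
import haiku as hk
import jax.numpy as jnp

def encode_relation(token_ids: jnp.ndarray, embedding_matrix: jnp.ndarray) -> jnp.ndarray:
    """Encode one serialized relation triple, e.g. the token ids of
    "Barack Obama, president of, United States".

    token_ids:        [T]    integer ids of the serialized triple
    embedding_matrix: [V, d] word embeddings W_e shared with transformer-XL
    """
    embeddings = embedding_matrix[token_ids]                   # [T, d]
    lstm = hk.LSTM(hidden_size=embedding_matrix.shape[-1])
    # Unroll the LSTM over the T tokens of the serialized triple.
    outputs, _ = hk.dynamic_unroll(lstm, embeddings,
                                   lstm.initial_state(batch_size=None))
    return outputs[-1]                                         # relation vector r_p
```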
Integration.
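The details of this step are summarized by the gating function described in §1 and analysed in §4.3: the transformer-XL output is adaptively combined with the relational memory to produce zt (cf. §3.3), which is then fed to the output softmax. The sketch below shows one plausible form; the attention-based aggregation over relation vectors and the weight shapes are our assumptions, not the paper's exact parametrization.

```python
import jax
import jax.numpy as jnp

def integrate(h_t: jnp.ndarray, relation_vectors: jnp.ndarray,
              w_gate: jnp.ndarray, w_rel: jnp.ndarray) -> jnp.ndarray:
    """Gated combination of textual and relational information.

    h_t:              [d]    transformer-XL output at the current position
    relation_vectors: [P, d] encoded relation triples in the memory
    w_gate:           [2d, d] and w_rel: [d, d] are illustrative parameters
    """
    # Aggregate the memory into a single vector by attending over relations.
    scores = relation_vectors @ h_t / jnp.sqrt(h_t.shape[-1])   # [P]
    r_t = jax.nn.softmax(scores) @ relation_vectors             # [d]
    # Per-dimension gate; values near 1 favour the observed context,
    # values near 0 favour relational information (cf. the analysis in §4.3).
    g_t = jax.nn.sigmoid(jnp.concatenate([h_t, r_t]) @ w_gate)  # [d]
    z_t = g_t * h_t + (1.0 - g_t) * jnp.tanh(r_t @ w_rel)       # [d]
    return z_t                                                  # fed to the softmax layer
```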
3 Experiments
3.1 Datasets and OpenIE
We use three English language modeling datasets: WikiText-103 (Merity et al., 2017), WMT19 (Barrault et al., 2019), and enwik8 (Hutter, 2012). Descriptive statistics of these datasets are shown in Table 1. WikiText-103 and WMT19 are (sub)word-level datasets, while enwik8 is a character-level dataset.
Statistics of datasets used in our experiments. For each subset, we show the number of (sub)words for WikiText-103 and WMT19 or the number of characters for enwik8.
| Dataset | # Train | # Valid | # Test | # Articles | # Vocab | # Entities | # Relations | # Relations/Entity |
|---|---|---|---|---|---|---|---|---|
| WikiText | 103M | 0.2M | 0.2M | 28,595 | 267,735 | 980K | 8.9M | 9.03 |
| WMT19 | 151M | 0.3M | 0.3M | 169,180 | 50,259 | 976K | 7.8M | 7.97 |
| enwik8 | 94M | 5M | 5M | 12,350 | 256 | 361K | 2.4M | 6.66 |
WikiText-103 is a knowledge-driven dataset consisting of featured articles from English Wikipedia. WMT19 contains English news from the WMT19 workshop.3 The news is segmented by month. We use the news from January to October for training and the news from November and December for development and testing, respectively. Compared to Wikipedia articles, news contains more dynamic and temporal information, which poses new challenges for utilizing relational information. We reuse the vocabulary of GPT-2 (Radford et al., 2019) with 50,259 tokens to tokenize this dataset. enwik8 contains more than 100M bytes of Wikipedia text. Character-level language modeling has a much smaller vocabulary than (sub)word-level language modeling.
We perform OpenIE on each dataset. For enwik8, OpenIE is performed after detokenizing its text into words. Statistics of the extracted relations are also included in Table 1. On average, each entity has 9.03, 7.97, and 6.66 relation triples in WikiText-103, WMT19, and enwik8, respectively.
3.2 Implementation Details
All models are implemented with JAX4 (Bradbury et al., 2018) and Haiku5 (Hennigan et al., 2020). We set the hidden size to 512 and the number of layers to 16 for all models. For (sub)word-level language modeling, we use adaptive softmax (Grave et al., 2017) for efficiency. We use GELU (Hendrycks and Gimpel, 2016) as the activation function and Adam (Kingma and Ba, 2015) as the optimizer. We train with batch size 128 on 64 16GB TPUs. We use 4,000 warmup steps, after which the learning rate is decayed with cosine annealing. Dropout (Srivastava et al., 2014) is applied during training with a rate of 0.25.
We set the lengths of text segment N, extended context M, and the relational memory P to (512, 512, 300), (384, 384, 800), and (768, 1536, 400) for WikiText-103, WMT19, and enwik8, respectively. These are determined by grid searches on development sets.
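For reference, these lengths can be gathered as a small configuration table (the dataclass and names below are illustrative, not taken from the paper's code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LengthConfig:
    segment_len: int          # N, length of the text segment
    extended_len: int         # M, length of the extended (transformer-XL) context
    relation_memory_len: int  # P, number of relation triples in the memory

LENGTHS = {
    "wikitext-103": LengthConfig(512, 512, 300),
    "wmt19":        LengthConfig(384, 384, 800),
    "enwik8":       LengthConfig(768, 1536, 400),
}
```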
3.3 Main Results
We compare with a strong transformer-XL baseline trained under the same setting as our model. Our main results are shown in Table 2. We obtain three observations comparing transformer-XL and RelationLM. First, RelationLM consistently outperforms transformer-XL on all three datasets, demonstrating the effectiveness of relational memory. Note that a decrease of 0.01 is considerable on enwik8 with the bits per character metric. Second, relational memory not only improves language modeling on knowledge-driven articles (WikiText-103), but also generalizes to the challenging news domain (WMT19), where information is more dynamic and temporal. Last, the results indicate that relational memory improves both (sub)word-level and character-level language modeling.
We use perplexity (↓) on WikiText-103 and WMT19 and bits per character (↓) on enwik8 for evaluation.
| Dataset | Model | # Params | Dev | Test |
|---|---|---|---|---|
| WikiText | Transformer-XL | 122M | 19.0 | 19.9 |
| | RelationLM | 124M | 18.5 | 19.2 |
| | Spalm | 122M | 18.1 | 19.0 |
| | ↪ + RelationLM | 124M | 17.7 | 18.6 |
| WMT19 | Transformer-XL | 114M | 21.7 | 21.5 |
| | RelationLM | 116M | 21.0 | 20.7 |
| | Spalm | 114M | 20.4 | 20.3 |
| | ↪ + RelationLM | 116M | 19.8 | 19.6 |
| enwik8 | Transformer-XL | 93M | 1.05 | 1.03 |
| | RelationLM | 95M | 1.04 | 1.02 |
| | Spalm | 93M | 1.04 | 1.02 |
| | ↪ + RelationLM | 95M | 1.03 | 1.01 |
Complementarity to Spalm.
Spalm (Yogatama et al., 2021) is a state-of-the-art memory-augmented language model. Instead of retrieving relation triples, it retrieves a set of related tokens at each timestep. Specifically, it first stores (context, the next token) pairs from training data. It then uses a pre-trained transformer language model to measure the similarities between the stored contexts and the observed context during training/evaluation. The next tokens of similar contexts are retrieved and are integrated with the observed context via a gating mechanism for generation.
We investigate whether RelationLM is complementary to Spalm. Because Spalm also uses a gating mechanism to integrate the retrieved tokens, we first apply RelationLM to combine the transformer-XL output with relational information to obtain zt (as shown in §2.2.3), and then use Spalm to integrate zt with the retrieved tokens. The results are shown in Table 2. Spalm outperforms transformer-XL and performs comparably to or better than RelationLM on all three datasets, demonstrating the effectiveness of retrieving related tokens. However, integrating RelationLM and Spalm further improves performance, indicating that the two models are not mutually exclusive. Retrieving relation triples therefore brings benefits that are complementary to retrieving tokens.
4 Analysis
In this section, we study several design choices of the relational memory, including its knowledge source, input components, capacity, dynamic OpenIE, and entity scoring method, and we compare training and evaluation speed. We then present quantitative and qualitative analyses to better understand our model.
4.1 Ablations and Design Choice Studies
For the ablation studies, we use the development set of WikiText-103.
Source of Relation Triples.
We compare relation triples extracted from Freebase with those obtained using OpenIE. In the Freebase case, we use the Freebase API6 to obtain relation triples for each entity. For WikiText-103, there are 10.74 relations per entity on average, which is comparable to OpenIE relations (9.03 relations/entity). The results are shown in Table 3. Although Freebase relations have been observed to improve performance on smaller datasets (e.g., WikiText-2; Logan et al., 2019) and particular domains (e.g., movies and actors; Ahn et al., 2016), we find that RelationLM with Freebase relations does not improve over transformer-XL on the much larger WikiText-103 dataset. We observe that a large portion of Freebase relations comes from the infoboxes of Wikipedia pages, which only cover information such as occupation, birth place, and religion. We believe these triples are too general to be useful for most contexts. The result of RelationLM with OpenIE shows the advantage of extracting relations from each dataset over using Freebase relations.
Ablating Relation Triples.
We ablate the relation and/or tail entity from each relation triple (head entity, relation, tail entity) to study the contribution of each component. The results are shown in Table 4. We find that ablating both the relation and the tail entity performs comparably to transformer-XL. As head entities are extracted from the observed context, we believe the extended memory of transformer-XL can offset the effect of conditioning on head entities. Ablating only the relation performs better than transformer-XL, which shows the advantage of introducing tail entities. Using complete relation triples performs best, demonstrating the effectiveness of the triple representation of knowledge.
Length of Relational Memory.
We study how many relation triples need to be stored in the relational memory. As shown in Figure 2, perplexity improves with more relation triples, but the curve flattens beyond 300 triples.
Perplexity on WikiText-103 with different numbers of relation triples.
Length of Transformer-XL Memory.
Since increasing the length of the context window captures longer dependencies, we study whether increasing the length of the extended (transformer-XL) memory removes the performance gap between RelationLM and transformer-XL. As shown in Figure 3, the performance of both RelationLM and transformer-XL improves with a larger extended memory. However, RelationLM still outperforms transformer-XL even with an extended memory length of 3072. We conclude that relational memory brings benefits that are complementary to simply expanding the extended memory, since it provides global information about entities in each dataset.
Dynamic OpenIE.
All our main results use dynamic OpenIE. We show results without dynamic OpenIE in Table 5, covering all three datasets for comparison. RelationLM with dynamic OpenIE performs comparably to RelationLM without it on WikiText-103 and enwik8, while larger improvements are obtained on WMT19. This indicates that dynamic OpenIE is more helpful for the news domain, which is more dynamic and temporal than knowledge-driven articles.
Perplexity with and without dynamic OpenIE.
| Model | Wiki | WMT | ew8 |
|---|---|---|---|
| Transformer-XL | 19.0 | 21.7 | 1.05 |
| w/o Dynamic OpenIE | 18.6 | 21.4 | 1.04 |
| w/ Dynamic OpenIE | 18.5 | 21.0 | 1.04 |
Entity Scoring.
We study different entity scoring mechanisms for relation retrieval. We consider random selection (where entities extracted from the observed context are randomly selected), frequency-based scoring, and tf-idf scoring. As shown in Table 6, tf-idf performs the best.
Speed Comparison.
The wall-clock time for both training and evaluation is shown in Table 7. RelationLM is 1.5 and 2.1 times slower during training and evaluation, respectively. Evaluation is slowed further by dynamic OpenIE, as shown in Algorithm 1.
4.2 Does Relational Memory Improve Coherence?
To evaluate coherence, we use two automatic metrics—knowledge perplexity and knowledge F1—to investigate whether the models can use entities faithfully. We further perform a human evaluation to study whether the language models generate coherent and knowledgeable sequences. We consider human evaluation a reliable way of assessing coherence, a view also advocated by Barzilay and Lapata (2005). We note that question answering is also often used to evaluate coherence (Guu et al., 2020; Lin et al., 2021); we leave this to future work.
Knowledge Perplexity.
While vanilla perplexity considers all words in an evaluation set, knowledge perplexity considers only entities when calculating perplexity. We use it to evaluate whether the model can assign higher probabilities to the correct entities under different contexts. Table 8 shows the numbers of entity words and non-entity words in our corpora, and Table 9 shows the results. We observe that the gap between RelationLM and transformer-XL is larger on knowledge perplexity, while RelationLM performs only comparably to or slightly better than transformer-XL on non-entity perplexity. This shows that relational memory is particularly helpful for predicting entity words. Note that knowledge perplexity tends to be much higher than perplexity on non-entity words, indicating the difficulty of predicting entity words. Together, these results indicate that relational memory helps the model use entities coherently and consistently under different contexts.
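Concretely, given per-token log-probabilities from a model and an entity mask from the tagger, knowledge perplexity exponentiates the average negative log-probability over entity tokens only, and non-entity perplexity does the same over the remaining tokens; a minimal sketch:

```python
import math
from typing import Sequence

def masked_perplexity(log_probs: Sequence[float], mask: Sequence[bool]) -> float:
    """Perplexity computed only over the tokens selected by `mask`."""
    selected = [lp for lp, m in zip(log_probs, mask) if m]
    return math.exp(-sum(selected) / len(selected))

# knowledge_ppl  = masked_perplexity(log_probs, is_entity)
# non_entity_ppl = masked_perplexity(log_probs, [not e for e in is_entity])
```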
Statistics of entity and non-entity tokens.
| Dataset | Subset | # Entity | # Non-Entity |
|---|---|---|---|
| WikiText | Dev | 61.6K | 155.9K |
| | Test | 65.8K | 179.7K |
| WMT | Dev | 84.9K | 262.2K |
| | Test | 81.0K | 256.6K |
| enwik8 | Dev | 1.7M | 3.3M |
| | Test | 1.7M | 3.3M |
Knowledge perplexity (↓) and non-entity perplexity (↓).
| Metric | Dataset | Model | Dev | Test |
|---|---|---|---|---|
| Knowledge PPX | WikiText | Transformer-XL | 47.3 | 52.3 |
| | | RelationLM | 45.6 | 50.9 |
| | WMT | Transformer-XL | 77.2 | 77.0 |
| | | RelationLM | 73.2 | 73.1 |
| | enwik8 | Transformer-XL | 2.25 | 2.21 |
| | | RelationLM | 2.22 | 2.19 |
| Non-entity PPX | WikiText | Transformer-XL | 13.3 | 13.8 |
| | | RelationLM | 13.0 | 13.4 |
| | WMT | Transformer-XL | 14.4 | 14.4 |
| | | RelationLM | 14.2 | 14.3 |
| | enwik8 | Transformer-XL | 1.98 | 1.95 |
| | | RelationLM | 1.98 | 1.95 |
Knowledge F1.
We use knowledge F1 to explore whether our model generates tokens that are grounded in its context. Given a context as input, we sequentially generate 32 words (for word-level language modeling) or 128 characters (for character-level language modeling) by sampling from the distribution over the next word (or character). To reduce variance, we generate 100 continuations for each context. We then perform entity recognition on both the generated sequences and their corresponding ground-truth sequences and calculate an F1 score between the two sets of entities. For example, given the context “...Ayola was nominated and shortlisted for the ‘Female Performance in TV’ award”, we compare the generated text and the ground truth “in the 2006 Screen Nation Awards, for her role as Kyla Tyson in Holby City...” to calculate F1. The results are shown in Table 10. RelationLM performs better than transformer-XL, so we conclude that models with relational memory generate more coherent and logical text.
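A minimal sketch of the metric as described above: entity recognition is run on a generated continuation and on its ground-truth continuation, and F1 is computed between the two entity sets (we assume the per-continuation scores are then averaged over the 100 samples and over contexts):

```python
def knowledge_f1(generated_entities, reference_entities) -> float:
    """F1 between the entity set of a generated continuation and the entity
    set of its ground-truth continuation."""
    generated, reference = set(generated_entities), set(reference_entities)
    overlap = len(generated & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```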
Knowledge F1 (↑).
| Dataset | Model | Dev | Test |
|---|---|---|---|
| WikiText | Transformer-XL | 9.9 | 9.4 |
| | RelationLM | 11.4 | 11.2 |
| WMT | Transformer-XL | 11.4 | 11.0 |
| | RelationLM | 12.6 | 12.3 |
| enwik8 | Transformer-XL | 16.0 | 18.9 |
| | RelationLM | 16.6 | 19.4 |
Human Evaluation.
We conduct a human evaluation to study whether language models can generate coherent and knowledgeable sequences. We take 1,000 contexts from the test set of WikiText-103. We show the contexts, ground-truth sequences, and continuations generated by RelationLM and transformer-XL to five annotators. We use greedy decoding for both models. We shuffle the order of the continuations generated by RelationLM and transformer-XL so that the annotators are unaware of which model produced which sequence. We then pose the following questions to the annotators:
Coherent. Given the context and its ground-truth continuation for reference, which generated sequence is more logical and coherent?
Knowledgeable. Given the context and its ground-truth continuation, which generated sequence provides more insights and is more knowledgeable?
We show the results in Table 11. We find that RelationLM outperforms transformer-XL in the human evaluation. These results are consistent with the two automatic metrics, knowledge perplexity and knowledge F1. This corroborates our claim that relational memory improves coherence in language modeling.
We show the number of contexts in which a continuation from a particular model is chosen by human evaluators for each evaluation criterion. Recall that the total number of contexts used for human evaluation is 1,000. Because we have five annotators, we use majority voting to decide the favored model for each continuation. We use the Kappa statistic to measure inter-annotator agreement. The statistic is 0.64, which shows substantial agreement among the annotators.
| Model | Coherent | Knowledgeable |
|---|---|---|
| Transformer-XL | 388 | 416 |
| RelationLM | 612 | 584 |
4.3 Qualitative Analysis
Gate Values.
As we use a gating function to integrate transformer-XL with relational information, we study the gate values in this section. The histogram of gate values is shown in Figure 5; it concentrates around 0.9. This is expected because non-entity words, which account for a large portion of the text (see Table 8), benefit less from the relational memory and mainly rely on the observed context for prediction, as shown in §4.2. We further calculate the average gate values for entity words and non-entity words: 0.87 for entity words and 0.92 for non-entity words. This confirms that entity words rely more on relational information for prediction than non-entity words. We also plot a heatmap of gate values; a cherry-picked example is shown in Figure 4, where we randomly select 100 of the 512 dimensions for readability. We notice that the entities Aberdeen and Alec Flett use more relational information than other positions (as shown by the horizontal blue lines). These results demonstrate that RelationLM adaptively incorporates relational information for prediction.
Example.
We show three cherry-picked examples in Table 12. We take the first for illustration, which shows a text segment from the article, Joe Biden 2008 presidential campaign7 and some retrieved relations. We find that the first two relations, (Joe Biden, senior Senator, Delaware) and (Joe Biden presidential campaign, began, January 7 2007), are extracted from previous text segments, while (Joe Biden, was nominated, vice president) and (Biden, withdrew nomination, 1987) are extracted from the other articles, Joe Biden8 and Joe Biden 1988 presidential campaign,9 respectively. We notice that the relation (Joe Biden, was nominated, vice president) is highly predictive of the sequence, “Biden was selected to be Democratic presidential nominee Barack Obama’s vice presidential running mate”. From the observed context, the model also identifies a closely related entity, Barack Obama, and retrieves the relation (Barack Obama, president of, United States). Therefore, we conclude that the relational memory can give a global picture of related entities and provide relevant information for language modeling.
Causal Intervention.
We use causal intervention to study whether changing the contents of the relational memory affects the language model's predictions. Given the relation (Obama, born in, Hawaii) along with other relations about Barack Obama, we let the model complete the sequence “Obama was born in”. RelationLM outputs “Obama was born in and raised in Hawaii.” with greedy decoding. However, after modifying the relation to (Obama, born in, Kenya), we obtain “Obama was born in Kenya and was the first African-American president.” We further change the relation to (Obama, born in, Paris) and the model outputs “Obama was born in Paris, France.” This indicates that RelationLM takes advantage of relation triples when making predictions. While prompts can also be used as interventions for vanilla language models, selecting appropriate prompts for different applications remains challenging (Liu et al., 2021a).
5 Related Work
Knowledge-enhanced Architectures.
Injecting symbolic knowledge into machine learning models is widely adopted to improve performance in natural language understanding (Annervaz et al., 2018; Ostendorff et al., 2019), question answering (Zhang et al., 2018; Huang et al., 2019; Hixon et al., 2015), dialogue systems (Zhou et al., 2018; Moon et al., 2019; Guo et al., 2018; Liu et al., 2021b), and recommendation systems (Zhang et al., 2016; Wang et al., 2018a, 2019). Different from these models, we focus on using symbolic knowledge for language modeling. Existing language models are prone to generating illogical and contradictory content. We believe that connecting language modeling and knowledge graphs is a promising direction for overcoming this problem. Next, we review previous knowledge-enhanced language models.
Knowledge-enhanced Language Models.
Our model is closely related to previous work on grounding autoregressive language models with knowledge graphs (Ahn et al., 2016; Logan et al., 2019; Hayashi et al., 2020; Wang et al., 2021a). However, these models rely on complex and ad hoc preprocessing or rules to link text with knowledge bases (e.g., Freebase and Wikidata). As a result, previous work is more aligned with conditional language modeling, for example, graph-to-text generation in Wang et al. (2021a), which contrasts with the unconditional language modeling p(x) considered in this work. Because the graph is constructed from the unseen text x, predicting x given the graph is easier for Wang et al. (2021a) due to this information leakage. Similarly, Hayashi et al. (2020) require topic entities for language modeling, which may not be available in most datasets, for example, in the news domain. We do not compare with these previous models due to the different settings. In contrast, we adopt OpenIE relations and use tf-idf search to retrieve relation triples, connecting language models and knowledge graphs. In the experiments, we demonstrate the effectiveness of our approach on three datasets: WikiText-103, WMT19, and enwik8.
There are language models incorporating entity information, such as entity coreference annotations (Ji et al., 2017; Clark et al., 2018), surface forms of entities (Kiddon et al., 2016; Yang et al., 2017; Cao et al., 2021), entity types (Parvez et al., 2018; Wang et al., 2018b), and entity descriptions (Bahdanau et al., 2017). Different from these models, we augment language models with a relational memory consisting of relation triples. We demonstrate the effectiveness of using relation triples by ablating tail entities and relations in §4.1.
Knowledge-enhanced Pretraining.
Using knowledge information for pretraining language models (Peters et al., 2019; Sun et al., 2019; Liu et al., 2020; Guu et al., 2020; Wang et al., 2021b; Agarwal et al., 2021; Verga et al., 2021) has recently grown in popularity and has achieved substantial improvements on knowledge-driven tasks such as question answering and named entity recognition. Instead of using knowledge information for improving downstream knowledge-driven tasks, we focus on using knowledge information for improving the generation capability of the language model itself.
Retrieval-augmented Models.
Retrieval-augmented models are now widely adopted in open-domain question answering (Chen et al., 2017; Lewis et al., 2020; de Masson d’Autume et al., 2019; Izacard and Grave, 2021), dialogue (Dinan et al., 2019; Fan et al., 2021; Thulke et al., 2021), and machine translation (Bapna and Firat, 2019; Khandelwal et al., 2020a). We focus on retrieval augmentation for language modeling (Merity et al., 2017; Grave et al., 2016; Khandelwal et al., 2020b; Yogatama et al., 2021). These algorithms are specifically tailored for language modeling, where related tokens are retrieved to help predict the next token. In this work, we move beyond token augmentation and show the benefits of retrieving relation triples. We also demonstrate that our model is complementary to a token augmentation model, Spalm (Yogatama et al., 2021), in the experiments.
6 Conclusion
We presented RelationLM, a language model that is augmented with relational memory. We showed how to obtain relevant knowledge graphs for a given corpus and how to combine them with a state-of-the-art language model such as transformer-XL. We demonstrated that our model improves performance and coherence on WikiText-103, WMT19, and enwik8. We also performed a comprehensive analysis to better understand how our model works. Our model provides a way to combine an autoregressive language model with general knowledge graphs.
Acknowledgments
We would like to thank our action editor (Xavier Carreras) and three anonymous reviewers for their insightful comments. We also thank Angeliki Lazaridou, Cyprien de Masson d’Autume, Lingpeng Kong, Laura Rimell, Aida Nematzadeh, and the DeepMind language team for their helpful discussions.
Notes
1. This view is also advocated in a parallel work by Nye et al. (2021), which presents a model for story generation and instruction following.
2. We provide a comparison of using relations extracted from OpenIE and Freebase in §4.1.
Author notes
Work completed during an internship at DeepMind.