Abstract
Story generation, namely, generating a reasonable story from a leading context, is an important but challenging task. In spite of their success in modeling fluency and local coherence, existing neural language generation models (e.g., GPT-2) still suffer from repetition, logic conflicts, and lack of long-range coherence in generated stories. We conjecture that this is because of the difficulty of associating relevant commonsense knowledge, understanding causal relationships, and planning entities and events with proper temporal order. In this paper, we devise a knowledge-enhanced pretraining model for commonsense story generation. We propose to utilize commonsense knowledge from external knowledge bases to generate reasonable stories. To further capture the causal and temporal dependencies between the sentences in a reasonable story, we use multi-task learning, which combines the generation objective with a discriminative objective to distinguish true and fake stories during fine-tuning. Automatic and manual evaluation shows that our model can generate more reasonable stories than state-of-the-art baselines, particularly in terms of logic and global coherence.
1 Introduction
Story generation is a strong indicator of machine understanding of natural language. It is often approached as selecting a sequence of events to form a story with a reasonable logic or plot. Although existing generative models (Roemmele, 2016; Fan et al., 2018; Fan et al., 2019) can generate stories with good local coherence, they still struggle to plan a coherent plot and maintain a reasonable event sequence throughout the story, or they are biased towards generating a limited set of stories with generic plots (See et al., 2019) (e.g., I have a great time), even when using OpenAI's powerful generative model GPT-2 (Radford et al., 2019), as shown in Table 1.
Pretrained GPT-2 has been shown to capture useful semantic and syntactic features (Alt et al., 2019), as demonstrated by state-of-the-art performance on some generation tasks such as machine translation and text summarization (Radford et al., 2019). However, compared with such tasks, whose source inputs contain sufficient information to generate the desired target texts, story generation is a typical open-ended generation task, where only very limited information is given in the input. As shown in this paper, we observe some severe issues when applying GPT-2 to generate reasonable stories, particularly commonsense stories from a limited beginning. These issues include repetition, logic conflicts, and lack of long-range coherence (See et al., 2019; Holtzman et al., 2019), as exemplified in Table 1. Specifically, although GPT-2 performs reasonably well at generating concepts related to bus (e.g., driver, and the probable destinations home or party), it completely ignores the other key entity, accident, in the leading context, which could be caused by its lower frequency in GPT-2's initial training corpus (less than 7% of that of bus). Besides, even though the generated concepts are relevant, they are usually generic, and used repeatedly and illogically in the generated stories. Therefore, given only limited information as input, subsequent generation is extremely challenging without any external guidance, for instance, commonsense knowledge. The difficulties lie in associating interdependent commonsense knowledge for expanding a reasonable story, handling the causal relationships, and deciding the temporal order of entities and events in context.
Table 1: Stories generated from the same leading context by pretrained and fine-tuned GPT-2, compared with the human-written continuation.

| Context: | The first time I saw the results of an accident, I was in a bus. |
|---|---|
| Golden Continuation by Human: | We were stopped and a friend pointed out the accident on the road. A man was under a car after having been run over. People tried to move the car off him but only made it worse. After seeing that, I am much more careful when crossing the street. |
| Pretrained GPT-2: | She was trying to get home but the other passengers would not let her go. He thought she was going to die because of her weight. She was trying to get home but the other passengers would not let her go. The bus driver would not let her go. |
| Fine-tuned GPT-2: | I was on my way to a party. I was on my way to a party and I ’d gotten out of my seat, and started driving. I got a flat tire, so I stopped driving. I drove to the party and had a great time. |
Explicitly introducing external commonsense knowledge has been shown to improve language understanding and the long-range coherence of generated texts (Zhou et al., 2018; Guan et al., 2019; Yang et al., 2019b). For example, for the entities in the given context of Table 1, many potentially related concepts (e.g., run over, cross street) can be inferred and predicted based on external commonsense knowledge bases such as ConceptNet (Speer and Havasi, 2012) and ATOMIC (Sap et al., 2019). These knowledge bases contain abundant semantic knowledge of concepts and inferential knowledge for commonsense reasoning. We enhance GPT-2 with such knowledge by post-training the model on knowledge examples constructed from these knowledge bases, which can provide additional crucial information for story generation. Empirical experiments demonstrate that training with millions of such examples helps improve the coherence and logicality of generated stories. Meanwhile, we adopt multi-task learning to address the problem of handling causal and temporal dependencies. We combine the generation objective with an auxiliary multi-label classification objective, which requires distinguishing true stories from fake stories that are constructed by randomly shuffling the sentences, replacing a sentence with a negatively sampled one, or repeating a sentence in an original story. The additional classification task empowers our model to better capture the logicality of a story implicitly, namely, modeling the causal and temporal dependencies and inter-sentence coherence, and avoiding repetition.
The main contributions of this paper are summarized as follows:
- We propose a knowledge-enhanced pretraining model for commonsense story generation by extending GPT-2 with external commonsense knowledge. The model is post-trained on knowledge examples constructed from ConceptNet and ATOMIC, thereby improving the long-range coherence of generated stories.
- To generate reasonable stories, we adopt a classification task to distinguish true stories from auto-constructed fake stories. The auxiliary task makes the model implicitly capture the causal and temporal dependencies between sentences and inter-sentence coherence, and leads to less repetition.
- We conduct extensive experiments with automatic and manual evaluation. Results show that our model can generate more reasonable stories than strong baselines, particularly in terms of logicality and global coherence.1
2 Related Work
2.1 Neural Story Generation
Many existing neural story generation models generated stories by conditioning upon various contents such as images (Huang et al., 2016) and short text descriptions (Jain et al., 2017). Different from these studies, we consider the setting of open-ended story generation from only a limited leading context in this paper. For this task, prior studies have attempted to build specific sentence representations by modeling story entities and events to simplify the dependencies between sentences (Ji et al., 2017; Clark et al., 2018). Another line of work decomposes story generation into separate steps (Martin et al., 2018; Fan et al., 2018; Wang et al., 2016; Xu et al., 2018; Yao et al., 2019; Fan et al., 2019). These models usually focused on first planning story sketches and then generating sentences from the sketches. However, improving pretrained models to generate commonsense stories is yet to be well investigated.
2.2 Pretraining
Recently, large-scale pretraining models have been widely developed for various NLP tasks. Some work leveraged pretraining to provide better language representations at the word level (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018) or sentence level (Le and Mikolov, 2014; Kiros et al., 2015) for various downstream task-specific architectures. However, Radford et al. (2018) and Devlin et al. (2018) suggest that these complex task-specific architectures are no longer necessary, and that it is sufficient to merely fine-tune pretrained task-independent transformer language models for downstream tasks. Mehri et al. (2019) explored different pretraining methods based on language models for dialogue context representation learning. Furthermore, Radford et al. (2019) demonstrated that pretrained language models (i.e., GPT-2) can perform downstream tasks better than state-of-the-art models even in a zero-shot setting (i.e., without any fine-tuning on task-specific data). Wolf et al. (2019) fine-tuned GPT-2 for personalized conversation generation, obtaining very competitive results in the ConvAI2 challenge. However, as previous studies (See et al., 2019; Holtzman et al., 2019) observed, transferring GPT-2 directly to open-ended text generation still suffers from several issues, such as repetition and lack of knowledge and inter-sentence coherence, under different decoding algorithms. Besides, although Song et al. (2019) and Dong et al. (2019) extended the language model to support an encoder-decoder framework (Sutskever et al., 2014), we build our model on GPT-2 because of its simplicity and broad applicability.
2.3 Commonsense Knowledge
Incorporating commonsense knowledge is necessary and beneficial for language inference (LoBue and Yates, 2011; Bowman et al., 2015; Rashkin et al., 2018b), reading comprehension (Mihaylov and Frank, 2018; Rashkin et al., 2018a), and particularly for open-ended language generation, which usually requires external knowledge to enrich the limited source information. Commonsense knowledge has been demonstrated to significantly improve dialogue generation (Zhou et al., 2018), story ending generation (Guan et al., 2019), and essay generation from given topics (Yang et al., 2019b). Recently, some work also attempted to integrate external commonsense knowledge into pretrained models such as BERT (Devlin et al., 2018) to enhance language representation for reading comprehension (Yang et al., 2019a) and other knowledge-driven NLP tasks like entity typing and relation classification (Zhang et al., 2019). Besides, Sun et al. (2019) improved BERT on Chinese NLP tasks with a multi-stage knowledge masking strategy that integrates phrase- and entity-level knowledge into the language representation. Moreover, Bosselut et al. (2019) transferred the implicit knowledge of GPT-2 by fine-tuning the model to generate an object given the subject and a relation as input in commonsense knowledge graphs, that is, automatic knowledge base construction. However, the low novelty of the generated objects showed that it could still be difficult for GPT-2 to generate commonsense texts solely based on its implicit knowledge. Therefore, we target integrating external knowledge into GPT-2 for generating more reasonable commonsense stories.
2.4 Multi-Task Learning
Incorporating other auxiliary task objectives to complement the primary goal has been shown to improve the performance in many NLP tasks such as sentiment classification (Yu and Jiang, 2016) and conversation generation (Zhao et al., 2017). Recently, multi-task learning was also used to pretrain language models to capture dependencies in context (Devlin et al., 2018; Mehri et al., 2019) and further improve pretrained models’ representation power during fine-tuning (Wolf et al., 2019).
3 Methodology
The task in this work can be defined as follows: Given a one-sentence story beginning X as the leading context, the model should continue to complete a K-sentence story Y with a reasonable plot. The sentences in a generated story should have reasonable logical connections, causal relationships, and temporal dependencies with each other and with the given beginning. To this end, we devise a novel framework to leverage knowledge and handle the causal and temporal dependencies, as Figure 1 shows.
3.1 Pretrained Transformer Language Model
The GPT-2 network is pretrained on a large-scale corpus but, as noted above, still suffers from issues such as the lack of knowledge necessary for commonsense story generation. Therefore, in this work we improve GPT-2 for generating more reasonable stories with external commonsense knowledge.
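For reference, GPT-2 is trained with the standard left-to-right language-modeling objective over a token sequence x = (x1, ..., xn); a sketch of the loss in our notation (not quoted from the paper):

```latex
% Auto-regressive language-modeling loss (our notation):
% minimize the negative log-likelihood of each token given its left context.
\mathcal{L}_{\mathrm{LM}} = -\sum_{t=1}^{n} \log P\left(x_t \mid x_{<t}; \theta\right)
```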
3.2 Training with Commonsense Knowledge
Commonsense knowledge can facilitate language comprehension and generation, as reported in a notable work for dialog generation (Zhou et al., 2018). To leverage commonsense knowledge in pretrained language models, we resort to existing large-scale knowledge bases ConceptNet (Li et al., 2016b) and ATOMIC (Sap et al., 2019).
The ConceptNet dataset2 consists of triples obtained from the Open Mind Common Sense entries in ConceptNet 5 (Speer and Havasi, 2012). It contains 34 relations in total and represents each knowledge triple as R = (h, r, t), meaning that head concept h has the relation r with tail concept t, for example, (cross street, Causes, accident). The ATOMIC dataset3 is an atlas of everyday commonsense reasoning containing a large number of textual descriptions of inferential knowledge, organized as typed if-then triples. For example, a typical if-then triple is (PersonX pays PersonY a compliment, xIntent, to be nice), where xIntent is the relation between the head and tail events, standing for If-Event-Then-Mental-State. We implicitly introduce this knowledge to the pretrained language model by post-training on knowledge-augmented data. Some work has attempted to explicitly incorporate commonsense knowledge into language generation (Zhou et al., 2018; Guan et al., 2019; Yang et al., 2019b). However, all these studies assume that there is an alignment between the training data and the knowledge bases. Therefore, they suffer from the following issues: (1) It is difficult to match the events extracted from the training data with those stored in the KB. (2) Learning and utilizing multi-hop triples in knowledge graphs is costly in time because of the scale of the graphs. (3) Most KB triples do not appear in the task-specific training data, so those absent triples are not fully utilized by existing models. Fortunately, our model is trained on the knowledge bases directly, which effectively eases these limitations.
Table 2: Examples of knowledge triples from ConceptNet and ATOMIC and the sentences they are transformed into for post-training.

| Knowledge Bases | Original Triples | Examples of Transformed Sentences |
|---|---|---|
| ConceptNet | (eiffel tower, AtLocation, paris) | eiffel tower is at paris. |
| ConceptNet | (telephone, UsedFor, communication) | telephone is used for communication. |
| ATOMIC | (PersonX dates for years, oEffect, continue dating) | PersonX dates for years. PersonY will continue dating. |
| ATOMIC | (PersonX cooks spaghetti, xIntent, to eat) | PersonX cooks spaghetti. PersonX wants to eat. |
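To make the transformation in Table 2 concrete, below is a minimal sketch of template-based triple-to-sentence conversion. The template strings and function names are illustrative assumptions covering only the relations shown above; the paper defines one template per relation across all 34 ConceptNet relations and the ATOMIC relation types.

```python
# Minimal sketch (our own code): turn (head, relation, tail) knowledge triples
# into natural-language sentences for post-training, as in Table 2.
CONCEPTNET_TEMPLATES = {
    "AtLocation": "{h} is at {t}.",
    "UsedFor": "{h} is used for {t}.",
    "Causes": "{h} causes {t}.",
}
ATOMIC_TEMPLATES = {
    "xIntent": "{h}. PersonX wants {t}.",   # If-Event-Then-Mental-State
    "oEffect": "{h}. PersonY will {t}.",    # effect on others
}

def triple_to_sentence(h: str, r: str, t: str) -> str:
    """Render a knowledge triple as a plain training sentence."""
    templates = {**CONCEPTNET_TEMPLATES, **ATOMIC_TEMPLATES}
    return templates[r].format(h=h, t=t)

print(triple_to_sentence("eiffel tower", "AtLocation", "paris"))
# eiffel tower is at paris.
print(triple_to_sentence("PersonX cooks spaghetti", "xIntent", "to eat"))
# PersonX cooks spaghetti. PersonX wants to eat.
```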
3.3 Multi-Task Learning
To encourage our model to generate logically reasonable stories, we add an auxiliary classification task to the generation task during fine-tuning on the ROCStories corpus. The task requires distinguishing true stories from fake stories. We first construct three additional sets of fake stories by shuffling the sentences, replacing a sentence with a negatively sampled one, and randomly repeating a sentence in an original story. Notably, these operations are performed only on the following K sentences of a story (i.e., not including the leading context, the beginning). For simplicity, we denote the true story set and the three manually constructed fake story sets as D1, D2, D3, and D4, respectively, as illustrated in Figure 2.
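A minimal sketch of how the three fake-story sets can be constructed (the function and variable names are our own; the paper specifies only the three operations). During fine-tuning, the classification loss is presumably combined with the generation loss as L = L_LM + λ·L_cls, with the scale factor λ given in Section 4.3.

```python
import random

def make_fake_stories(sentences, corpus):
    """Construct the three fake variants (D2-D4) from the K continuation
    sentences of a true story (D1). `corpus` is a pool of sentences from
    other stories used for negative sampling."""
    # D2: shuffle the sentence order (breaks temporal/causal structure).
    shuffled = sentences[:]
    random.shuffle(shuffled)
    # D3: replace one sentence with a negatively sampled one (breaks topic).
    replaced = sentences[:]
    replaced[random.randrange(len(replaced))] = random.choice(corpus)
    # D4: repeat one sentence in place of another (introduces redundancy).
    repeated = sentences[:]
    i, j = random.sample(range(len(repeated)), 2)
    repeated[j] = repeated[i]
    return {"D2": shuffled, "D3": replaced, "D4": repeated}
```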
4 Experiments
4.1 Dataset
We evaluated our model on the ROCStories corpus (Mostafazadeh et al., 2016a). The corpus contains 98,162 five-sentence stories for evaluating story understanding. The original task is designed to select a correct story ending from two candidates, whereas our task is to generate a reasonable story given the first sentence of a story (i.e., K, the number of generated sentences, is four in our setting). Following Radford et al. (2019), the stories are tokenized using byte pair encoding (BPE) with a vocabulary of 50,257 items. The average number of BPE tokens in X/Y (i.e., the beginning/the following K sentences of a story) is 13.39/50.00, while the model uses pretrained positional embeddings with a maximal sequence length of 1024 tokens.
As for the knowledge bases, we used the 605k version of ConceptNet. The second KB we used contains 709k records obtained from the 877k tuples of ATOMIC after transformation and deduplication. We randomly selected stories and knowledge sentences for training/validation/test respectively, as shown in Table 3. Because the ROCStories dataset is rather small for generation, we delexicalized the stories by replacing all names with the special placeholders “[MALE]”, “[FEMALE]”, and “[NEUTRAL]” for male, female, and unknown names, respectively. Additionally, “PersonX” and “PersonY” in ATOMIC are replaced by “[MALE]” and “[FEMALE]” as well.
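A sketch of the delexicalization step, assuming a precompiled name-to-placeholder lexicon; the paper does not specify how names are detected, so the NAME_GENDER lookup below is hypothetical:

```python
import re

# Hypothetical lexicon mapping first names to placeholder tags; in practice
# this could be built from a name-gender list. Unknown names map to [NEUTRAL].
NAME_GENDER = {"John": "[MALE]", "Mary": "[FEMALE]", "Sam": "[NEUTRAL]"}

def delexicalize(story: str) -> str:
    """Replace person names with [MALE]/[FEMALE]/[NEUTRAL] placeholders."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, NAME_GENDER)) + r")\b")
    return pattern.sub(lambda m: NAME_GENDER[m.group(1)], story)

print(delexicalize("John was driving around in the snow."))
# [MALE] was driving around in the snow.
```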
4.2 Baselines
We compared our models with the following state-of-the-art baselines:
Convolutional Seq2Seq (ConvS2S):
It directly generates a story conditioned upon the beginning based on a convolutional seq2seq model (Gehring et al., 2017) with decoder self-attention.
Fusion Convolutional Seq2Seq Model (Fusion):
It generates a story by first pretraining a convolutional seq2seq model, and then fixing the model and providing it to the second clone model with fusion mechanism (Fan et al., 2018).
Plan&Write:
It first plans a storyline represented as a sequence of keywords and then generates a story conditioned upon the storyline (Yao et al., 2019).
Skeleton-based Model with Reinforcement Learning (SKRL):
The model first generates a compressed story including the most critical phrases, called skeleton, and then generates a story conditioned upon the skeleton. The skeleton is automatically learned by reinforcement learning (Xu et al., 2018).
Decomposed Model with Semantic Role Labeling (DSRL):
It first generates a predicate-argument structure conditioned upon the beginning and then generates a story by surface realization on top of the structure. The structures are identified by semantic role labelling (Fan et al., 2019).
We also made comparisons with GPT-2 in different settings as follows:
GPT-2 (Scratch):
The network architecture is the same as GPT-2, but the model is only trained on ROCStories without any pretrained parameters.
GPT-2 (Pretrain):
This model directly uses the public checkpoint of pretrained parameters4 for story generation. Following Radford et al. (2019), stories are generated in a zero-shot setting. To induce story generation behavior, we conditioned the language model on a context of example stories, and then sampled sentences from the model after a final prompt of a story beginning. We used the first K generated sentences as the generated story.
GPT-2 (Fine-tuning):
This model is fine-tuned on the ROCStories corpus from the public checkpoint of pretrained parameters.
Furthermore, we also conducted ablation tests by removing the proposed components respectively to investigate the influence of each component with the same network structure.
4.3 Experiment Settings
We set the parameters by following the small version of Radford et al. (2019)’s design: the language model is equipped with 12 layers, 768-dimensional hidden states, and 12 attention heads. The batch size is 10 during training on the ROCStories corpus using the Adam optimizer with an initial learning rate of 1e-4. The scale factor λ is set to 0.05. We generated stories using a top-k sampling scheme (Fan et al., 2018) with k = 40 and a softmax temperature of 0.7 (Goodfellow et al., 2016) to balance the trade-off between diversity and fluency. We applied these settings to all the baselines.
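For illustration, a minimal NumPy sketch of top-k sampling with temperature as used for decoding (k = 40, temperature 0.7); `logits` is the model's next-token distribution:

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 40, temperature: float = 0.7) -> int:
    """Sample the next token id from the k most likely tokens, after
    sharpening the distribution with a softmax temperature < 1."""
    logits = logits / temperature
    top_ids = np.argpartition(logits, -k)[-k:]            # indices of top-k logits
    top_logits = logits[top_ids] - logits[top_ids].max()  # for numerical stability
    probs = np.exp(top_logits) / np.exp(top_logits).sum()
    return int(np.random.choice(top_ids, p=probs))
```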
4.4 Automatic Evaluation
Evaluation Metrics
We adopted the following automatic metrics to evaluate generation performance on the entire test set. (1) Perplexity (PPL). Smaller perplexity scores indicate better fluency in general. (2) BLEU. BLEU (Papineni et al., 2002) evaluates n-gram overlap between a generated story and a human-written story. However, BLEU is usually inappropriate for open-ended text generation (Fan et al., 2018) because there are multiple plausible stories for the same input but only one story is given in the dataset, and BLEU scores become extremely low for large n. We thus experimented with n = 1, 2. (3) Coverage. To assess the effect of incorporating commonsense knowledge, we calculated the coverage score as the average number of commonsense triples matched in each generated story, which requires both the head and tail entities/events to appear in the same story. (4) Repetition. We measured the redundancy of stories by computing repetition-4, the percentage of generated stories that repeat at least one 4-gram (Shao et al., 2019). (5) Distinct. To measure generation diversity, we adopted distinct-4 (Li et al., 2016a), the ratio of distinct 4-grams to all generated 4-grams.
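For concreteness, a sketch of how repetition-4 and distinct-4 can be computed from tokenized stories (our own implementation of the definitions above):

```python
def four_grams(tokens):
    return [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]

def repetition_4(stories):
    """Percentage of stories that repeat at least one 4-gram internally."""
    def repeats(tokens):
        grams = four_grams(tokens)
        return len(set(grams)) < len(grams)
    return 100.0 * sum(repeats(s) for s in stories) / len(stories)

def distinct_4(stories):
    """Ratio of distinct 4-grams to all generated 4-grams over the corpus."""
    all_grams = [g for s in stories for g in four_grams(s)]
    return 100.0 * len(set(all_grams)) / len(all_grams)

stories = [["i", "had", "a", "great", "time", "i", "had", "a", "great", "time"]]
print(repetition_4(stories))  # 100.0 (the story repeats "i had a great")
print(distinct_4(stories))    # ~71.4 (5 distinct of 7 total 4-grams)
```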
Results
The results of automatic evaluation are shown in Table 4. Note that the perplexity scores of some baselines are not comparable with ours because they tokenize stories by words rather than by byte pair encodings as used in GPT-2; thus, we did not provide these scores. Our model outperforms the variants of GPT-2 in terms of perplexity, and has higher BLEU scores than all the baselines, indicating better fluency and more overlap with the reference stories. Our model also has higher knowledge coverage and distinct-4 scores, showing that it can generate more diverse stories with more abundant knowledge. However, by comparing the three variants of GPT-2, we observed that pretraining might lead to more severe repetition. Our model effectively alleviates this issue but still performs worse than the baselines with task-specific architectures, for instance, the planning-based models (e.g., DSRL). Fortunately, See et al. (2019) showed that increasing k for top-k sampling could alleviate the repetition issue. Additionally, compared with training from scratch, fine-tuned GPT-2 performs much better in fluency (lower perplexity scores) but suffers from worse repetition, and improves only slightly in coverage and diversity. Furthermore, pretrained GPT-2 has the lowest coverage and distinct-4, which further verifies our hypothesis that GPT-2 lacks the necessary knowledge to expand a story plot.
Table 4: Automatic evaluation results.

| Models | PPL | BLEU-1 | BLEU-2 | Coverage | Repetition-4 (%) | Distinct-4 (%) |
|---|---|---|---|---|---|---|
| ConvS2S | N/A | 0.312 | 0.132 | 13.64 | 22.87 | 72.78 |
| Fusion | N/A | 0.322 | 0.137 | 12.02 | 24.23 | 72.82 |
| Plan&Write | N/A | 0.308 | 0.126 | 13.38 | 17.06 | 67.20 |
| SKRL | N/A | 0.267 | 0.088 | 10.82 | 18.34 | 69.42 |
| DSRL | N/A | 0.293 | 0.117 | 10.38 | 15.36 | 73.08 |
| GPT-2 (Scratch) | 11.82 | 0.311 | 0.134 | 10.76 | 22.87 | 73.33 |
| GPT-2 (Pretrain) | 33.50 | 0.257 | 0.085 | 8.04 | 39.22 | 64.99 |
| GPT-2 (Fine-tune) | 7.96 | 0.322 | 0.141 | 12.40 | 29.41 | 73.85 |
| Ours | 7.85 | 0.326 | 0.143 | 18.48 | 21.93 | 78.96 |
| w/o Pretrain | 11.04 | 0.316 | 0.134 | 16.33 | 21.52 | 77.17 |
| w/o Knowledge | 7.70 | 0.314 | 0.136 | 13.95 | 25.08 | 73.24 |
| w/o Multi-task | 8.04 | 0.324 | 0.140 | 17.19 | 24.40 | 79.43 |
| Golden Story | N/A | N/A | N/A | 19.28 | 7.64 | 89.51 |
Table 5: Manual evaluation results of pairwise comparisons; the left and right halves report grammaticality and logicality, respectively, and κ denotes Fleiss' kappa.

| Models | Grammaticality: Win (%) | Lose (%) | Tie (%) | κ | Logicality: Win (%) | Lose (%) | Tie (%) | κ |
|---|---|---|---|---|---|---|---|---|
| Ours vs. Fusion | 50.0** | 27.0 | 23.0 | 0.421 | 57.0** | 28.0 | 15.0 | 0.455 |
| Ours vs. DSRL | 58.0** | 24.0 | 18.0 | 0.441 | 58.0** | 29.0 | 12.0 | 0.475 |
| Ours vs. GPT-2 (Scratch) | 54.0** | 24.5 | 21.5 | 0.385 | 54.0** | 26.0 | 20.0 | 0.304 |
| Ours vs. GPT-2 (Pretrain) | 52.0** | 31.5 | 16.5 | 0.483 | 56.5** | 32.5 | 11.0 | 0.493 |
| Ours vs. GPT-2 (Fine-tune) | 42.0** | 28.0 | 30.0 | 0.344 | 51.0** | 27.5 | 21.5 | 0.371 |
| Ours vs. Ours w/o Pretrain | 51.0** | 31.0 | 18.0 | 0.378 | 56.0** | 28.0 | 16.0 | 0.375 |
| Ours vs. Ours w/o Knowledge | 46.0** | 23.0 | 21.0 | 0.289 | 48.0** | 29.0 | 23.0 | 0.314 |
| Ours vs. Ours w/o Multi-task | 37.5 | 31.0 | 31.5 | 0.313 | 48.5** | 25.5 | 26.0 | 0.297 |
As for the ablation test, our model without pretraining has significantly higher perplexity, indicating that pretraining contributes to story fluency. When removing external knowledge, coverage and distinct-4 drop while repetition-4 rises substantially, suggesting that post-training on millions of knowledge sentences can effectively enhance the language model’s ability to generate stories with more commonsense knowledge, although we do not explicitly utilize knowledge during fine-tuning on ROCStories. Removing multi-task learning also leads to slightly better distinct-4 but causes much higher repetition-4, indicating that the classification loss is of great help for reducing redundancy.
We also examined the performance of our model on the auxiliary story classification task, and used the auxiliary classifier to predict the proportional distribution of the stories generated by different models over the four story types, as shown in Table 6. Both metrics are computed on 1,000 samples from the test set. We can observe that it is relatively easier to detect fake stories with repeated plots (D4) than those with disordered logic (D2) or unrelated topics (D3). When using the auxiliary story classifier to classify the generated stories, pretrained GPT-2 is considered to generate more fake stories, with only 15.83% of its stories classified as type D1, which agrees with the previous automatic evaluation, especially in terms of repetition. Besides, our model performs better than the baselines, indicating that the external knowledge and the auxiliary task encourage our model to generate more reasonable stories.
Table 6: F1 scores of the auxiliary story classifier on the four story types (top row), and the proportional distribution (%) over the four types predicted by the classifier for stories generated by different models (bottom rows).

| Story types | D1 | D2 | D3 | D4 |
|---|---|---|---|---|
| F1 score | 0.80 | 0.81 | 0.88 | 0.98 |
| GPT-2 (Pretrain) | 15.83 | 40.8 | 39.36 | 4.01 |
| GPT-2 (Fine-tune) | 86.94 | 9.98 | 2.93 | 0.15 |
| Ours | 90.12 | 7.98 | 1.86 | 0.04 |
| w/o Knowledge | 87.76 | 9.51 | 2.67 | 0.06 |
| w/o Multi-task | 88.69 | 9.07 | 2.02 | 0.22 |
Following Fan et al. (2018) and See et al. (2019), we computed beginning ranking accuracy (BR) to measure how strongly the output of a model coheres with the beginning, and logic ranking accuracy (LR) to measure the ability to capture the causal and temporal dependencies in context. For BR, we first sampled 9 negative beginnings (first sentences) for a true story, and then calculated the perplexity of the 10 resulting stories. If the true story has the lowest perplexity under our model, it is regarded as a correct prediction. As for LR, since each story in ROCStories consists of five sentences, we produced four shuffled versions by switching each pair of adjacent sentences. We then used our model to score the five stories with perplexity. A prediction is regarded as correct if the true story has the lowest score. We randomly sampled 1,000 human-written stories from the test set for this evaluation. As shown in Table 7, the external knowledge and multi-task learning effectively promote coherence and help capture inter-sentence dependencies in context.
Table 7: Beginning ranking (BR) and logic ranking (LR) accuracy.

| Models | BR (%) | LR (%) |
|---|---|---|
| GPT-2 (Pretrain) | 59.3 | 44.8 |
| GPT-2 (Fine-tune) | 73.4 | 69.6 |
| Ours | 76.2 | 72.7 |
| w/o Knowledge | 74.9 | 71.5 |
| w/o Multi-task | 75.7 | 70.4 |
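A sketch of the LR computation under the definition above; `perplexity` is assumed to be a function returning a model's perplexity of a story string, and BR can be computed analogously by swapping in sampled beginnings:

```python
def logic_ranking_accuracy(stories, perplexity):
    """LR: fraction of stories where the true sentence order gets the lowest
    perplexity among itself and its four adjacent-swap shuffles."""
    correct = 0
    for sents in stories:  # each story is a list of 5 sentences
        candidates = [sents]
        for i in range(len(sents) - 1):   # swap each adjacent pair once
            shuffled = sents[:]
            shuffled[i], shuffled[i + 1] = shuffled[i + 1], shuffled[i]
            candidates.append(shuffled)
        scores = [perplexity(" ".join(c)) for c in candidates]
        correct += int(scores.index(min(scores)) == 0)  # true story wins
    return 100.0 * correct / len(stories)
```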
4.5 Manual Evaluation
To evaluate the fluency and logic of generated stories, we conducted pairwise comparisons with the two strong baseline models (Fusion and DSRL) that performed best in automatic evaluation, three variants of GPT-2, and three ablated versions of our model. For manual evaluation, we randomly sampled 200 stories from the test set and obtained 1,800 stories from the nine models. For each pair of stories (one by our model and the other by a baseline, along with the beginning), three annotators were hired to give a preference (win, lose, or tie) in terms of two metrics, respectively. We resorted to the crowdsourcing service Amazon Mechanical Turk (AMT) for annotation, and adopted majority voting to make final decisions among the three annotators.
Evaluation Metrics
We evaluated the models from the following two perspectives: grammaticality to indicate whether a story is natural and fluent, and logicality to indicate whether a story is coherent to the given beginning and reasonable in terms of causal and temporal dependencies in the context. Note that the two aspects are independently evaluated. We show a screenshot of the annotation on AMT in Figure 4.
Results
The manual evaluation results are shown in Table 5. To measure inter-annotator agreement, we calculated Fleiss' kappa (Fleiss, 1971) for each pairwise comparison; all the results show fair agreement (0.2 < κ ≤ 0.4) or moderate agreement (0.4 < κ ≤ 0.6). We also conducted a sign test to check the significance of the differences. The results indicate that our model performs significantly better than the other baselines on both metrics. More specifically, post-training on knowledge bases leads to significant improvements in grammar and logic by offering more knowledge for expanding the story plots. Multi-task learning further enhances the performance in logic and does not affect the fluency of generated stories.
4.6 Relation Understanding
It is still necessary to further investigate whether our model really understands the relations between head and tail entities/events. For example, when our model learns car accident causes injury from ConceptNet, it should agree with car accident leads to injury and deny car accident is driven by injury if it can identify the specific relation between the head (car accident) and tail (injury). By contrast, the model will not distinguish the three statements if it only learns simple relevance (or co-occurrence) between car accident and injury instead of the specific causal relation.
Therefore, we constructed two sets of sentences containing correct and wrong knowledge, respectively, based on the test set of ConceptNet. Specifically, the correct sentences are produced with a synonymous template whose relation tokens are replaced by synonyms (e.g., causes can also be translated to leads to), while the wrong sentences are produced with a random template whose relation tokens are randomly replaced by those of another relation. Besides, training template refers to the templates used during post-training on the knowledge bases. We regard the sentence with lower perplexity as more reasonable, and calculate the accuracy of relation ranking as the percentage of cases where the sentence with the wrong template has the highest perplexity compared with the sentences with the correct and training templates. Furthermore, we also conducted an automatic pairwise comparison to distinguish reasonable sentences from unreasonable ones based on the perplexity scores of different models.
As shown in Table 8, the external knowledge can help our language model distinguish false sentences from true ones with higher accuracy than GPT-2 (Random chance scores 33.3%). Furthermore, our model prefers the correct template compared with the wrong one (winning rate of 71.91%), and has a close preference between the training and correct templates (winning rate of 55.76%). By contrast, GPT-2 without any external knowledge relies more on frequency to score relations, and thus can hardly tell the differences among different types of sentences. The results indicate that our model can capture not only simple relevance between head and tail entities/events, but also the specific causal relations.
Table 8: Accuracy (%) of relation ranking and winning rates (%) in pairwise template comparisons, where C, T, and W denote the sentences built with the correct, training, and wrong templates, respectively.

| Models | Acc | C vs. W | T vs. C | T vs. W |
|---|---|---|---|---|
| GPT-2 (Pretrain) | 39.28 | 53.83 | 44.74 | 49.87 |
| GPT-2 (Fine-tune) | 47.48 | 60.31 | 39.57 | 56.01 |
| Ours | 67.07 | 71.91 | 55.76 | 79.89 |
| w/o Knowledge | 48.07 | 62.07 | 42.43 | 55.64 |
4.7 Case Study
We present some generated examples in Table 9. Our model can generate more natural and reasonable stories than the baselines.
Table 9: Stories generated from the same leading context by all the models.

| Context: | [MALE] was driving around in the snow. |
|---|---|
| Golden Story: | He needed to get home from work. He was driving slowly to avoid accidents. Unfortunately the roads were too slick and [MALE] lost control. His tires lost traction and he hit a tree. |
| ConvS2S: | He did n’t feel like driving on his way out of the street. He put on his gloves and drove away. [MALE] made a lot of paperwork and took a taxi to the mechanic. He gave [MALE] a few thousand dollars to help him get up. |
| Fusion: | He was driving for a bit but was getting icy. [MALE] ended up driving over a cliff. [MALE] was in the middle of the road. He was out of the snow by the time he got home. |
| Plan&Write: | He noticed a man sitting on the side of the road. He went to the nearest gas station. The man was n’t sure if he could have the money. The man was left a big deal. |
| SKRL: | A drop of battery stacked in the quarter. [MALE] went to check the phone and turned back. He finally got a car hit off from his way to play again. [MALE] had forgotten his hair. |
| DSRL: | A man came in a cop. He called for an ambulance. He said he might go to a park. He finally showed up to post what was wrong. |
| GPT-2 (Scratch): | He noticed a large snow pile. He got out and began driving. Then he sped off and went home. [MALE] was happy that he had saved his money. |
| GPT-2 (Pretrain): | The doctors took him to a hospital. They told him he was going to be fine. He got all the doctors and nurses. The [MALE] parents said they have no hope. |
| GPT-2 (Fine-tune): | He noticed a snowglobe blocking his road. [NEUTRAL] he backed up to the car and got a snow shovel. [MALE] used to shovel the snow and shovel the snow. Now the snow is gone but he has snow boots and snow gloves. |
| Ours: | Suddenly his car broke down on the side of the road. [MALE] had to call a tow truck. The tow truck came and took [MALE] home. [MALE] was happy he was able to get home. |
| w/o Pretrain: | He tried to stop to see how he could get out of his car. His car started to sputter down on him. [MALE] pulled over and kept driving. He was able to make up and the car sped away. |
| w/o Knowledge: | He noticed a young lady running off. He stopped and picked her up. When he checked his id it was lost. [MALE] realized he had forgotten his id. |
| w/o Multi-task: | He noticed a car in the road. He decided to stop. He got out of his car. He drove for half an hour. |
As illustrated, the baselines (from ConvS2S to DSRL) predict wrong entities and events that are irrelevant to the leading context (e.g., paperwork), leading to poor overall coherence in the generated stories. Pretrained GPT-2 without any fine-tuning generates an entirely irrelevant, unreasonable story (e.g., hospital, doctor) due to the lack of knowledge. GPT-2 trained from scratch and fine-tuned GPT-2 suffer from conflicting logic (e.g., first got out and then began driving, and backed up to the car while driving), repetition (e.g., shovel the snow), and poor coherence with some irrelevant keywords (e.g., save money). In comparison, the story by our model is coherent in logic and fluent in grammar. Furthermore, without pretraining, our model can still incorporate external knowledge to generate a story with an understandable main idea, though not always locally reasonable (e.g., pulled over and kept driving). When removing knowledge from our full model, some confusing entities (e.g., id) are generated. Additionally, removing multi-task learning significantly affects the logic of generated stories (e.g., first got out and then drove) due to the inability to capture the causal and temporal dependencies in context.
To verify the ability of our model to incorporate external knowledge when generating stories, we show the commonsense knowledge utilized in this example in Figure 5. We can observe that the external knowledge is useful for expanding a reasonable story plot, as in driving, broke down, call, came and took home, and get home.
5 Error Analysis
Although the proposed model outperforms the state-of-the-art baselines, it should be noted that many generated stories are still unreasonable and lose to other models in manual evaluation. Therefore, we analyzed error types by manually checking all the lost stories in the pairwise comparisons between our model and two strong baselines, Fusion and GPT-2 (Fine-tune), to reveal the factors that affect performance. Of the 200 comparisons each, 114 and 102 stories generated by Fusion and GPT-2 (Fine-tune), respectively, lost to our model in logic, while 111 of the 400 stories generated by our model lost to these two baselines in logic.
We manually annotated four types of error in the lost stories: repetition (repeating the same scenes), unrelated entities or events (some wrong keywords but a reasonable main plot), conflicting logic (wrong causal relations or temporal order), and chaotic scenes (difficult to understand). The distribution of the error types is shown in Table 10. We can observe that unrelated entities/events and conflicting logic make up most of the errors for all the models. Compared with Fusion, GPT-2 (Fine-tune) reduces chaotic scenes effectively but still suffers from severe repetition. Equipped with external knowledge and multi-task learning, our model further reduces conflicting logic and chaotic scenes while avoiding repetition. However, the analysis illustrates that generating a coherent and reasonable story remains challenging.
Table 10: Distribution of error types among stories that lost the pairwise comparison in logicality.

| Error Type | Ours | Fusion | GPT-2 (Fine-tune) |
|---|---|---|---|
| Repetition (%) | 1.75 | 5.50 | 6.50 |
| Unrelated (%) | 11.25 | 16.00 | 15.50 |
| Conflicting (%) | 13.75 | 22.00 | 24.50 |
| Chaotic (%) | 1.00 | 13.50 | 4.50 |
We also present some typical cases of each error type by our model in Table 11. These cases show that our model still does not completely prevent logical errors, including sentence-level repetition (get into the army), entities unrelated to the context (test is obviously unrelated to surgery and stomach ache), conflicting events (first done but then washed the clothes), and chaotic logic (due to the lack of knowledge about on thin ice). These errors also indicate that external knowledge, causal relationships, and temporal dependencies play a central role in commonsense story generation.
Table 11: Typical error cases generated by our model.

| Error Type | Cases |
|---|---|
| Repetition | [MALE] made up his mind to join the army. He was determined to get into the army. He had never been away from home. He was determined to get into the army. He was sent out to Afghanistan. |
| Unrelated | [MALE] felt he was getting sick. He had to go to an emergency room. It was his first major surgery. He had a terrible stomach ache. He was nervous about a test in an hour. |
| Conflicting | [FEMALE] swept and mopped the floor. She put her clothes in the washing machine. She was ready to go to bed. When she was done, she washed the clothes. She went to bed. |
| Chaotic | [MALE] was on thin ice with his job. He had a friend over to help him. [MALE] was able to hold his breath the entire time. he was so cold that he froze in his tracks. [MALE] finally felt good about himself. |
6 Conclusions and Future Work
We present a knowledge-enhanced pretraining model with multi-task learning for commonsense story generation. The proposed framework leverages the implicit knowledge from deep pretrained language models as well as the explicit knowledge by post-training on external commonsense knowledge bases, which leads to better performance for commonsense story generation. Besides, in order to further capture the causal and temporal dependencies between the sentences in a story, we employ an auxiliary classification task to distinguish true and auto-constructed fake stories. Extensive experiments show that the proposed method can outperform strong baselines. Further analysis demonstrates that the generated stories are more coherent and reasonable thanks to the use of commonsense knowledge and multi-task learning.
As future work, it would be very interesting to equip generative pretraining models with commonsense knowledge without any fine-tuning, namely, integrating the knowledge at the pretraining stage.
Acknowledgments
This work was supported by the National Science Foundation of China (grant no. 61936010/61876096) and the National Key R&D Program of China (grant no. 2018YFC0830200). We would like to thank the THUNUS NExT Joint-Lab for the support. We would also like to thank our action editor, Noah Smith, and the anonymous reviewers for their invaluable suggestions and feedback.
Notes
Our implementation is available at https://github.com/thu-coai/CommonsenseStoryGen, and a demo is available at http://coai.cs.tsinghua.edu.cn/static/CommonsenseStoryGen.
The pretrained model is available at https://github.com/openai/gpt-2.