Standard multi-task benchmarks are essential for developing pretraining models that can generalize to various downstream tasks. Existing benchmarks for natural language processing (NLP) usually focus only on understanding or generating short texts. However, long text modeling requires many distinct abilities compared to short texts, such as the modeling of long-range discourse and commonsense relations, and the coherence and controllability of generation. The lack of standardized benchmarks makes it difficult to assess these abilities of a model and fairly compare different models, especially for Chinese models. Therefore, we propose a story-centric benchmark named LOT for evaluating Chinese long text modeling, which aggregates two understanding tasks and two generation tasks. We construct new datasets for these tasks based on human-written Chinese stories of hundreds of words. Furthermore, we release an encoder-decoder-based Chinese long text pretraining model named LongLM with up to 1 billion parameters. We pretrain LongLM on 120G Chinese novels with two generative tasks: text infilling and conditional continuation. Extensive experiments show that LongLM substantially outperforms similar-sized pretraining models on both the understanding and generation tasks in LOT.

Pretrained language models have achieved significant advances in various natural language understanding (NLU) and generation (NLG) tasks (Devlin et al., 2019; Radford et al., 2019). Standard benchmarks such as GLUE (Wang et al., 2019) further accelerate the improvement and fast iteration of pretrained models. Popular benchmarks usually aggregate multiple tasks to spur the progress of generalizable models, but they focus mainly on understanding or generating short texts. For example, the GLUE tasks take at most two sentences as input, and most tasks in NLG benchmarks such as GLGE (Liu et al., 2020) and GEM (Gehrmann et al., 2021) require generating only several words (e.g., dialogue generation). Although there have been many models pretrained on long texts, such as GPT3 (Brown et al., 2020) and CPM (Zhang et al., 2020), the lack of benchmark datasets makes it difficult to fully assess and compare their long text modeling abilities.

In this paper, we present LOT, a benchmark for evaluating Chinese LOng Text understanding and generation. As shown in Table 1, modeling long texts requires many distinct abilities compared to short texts, including (1) commonsense reasoning regarding characters’ reaction and intention, and knowledge about physical objects (e.g., “river”) and abstract concepts (e.g., “irony”); (2) modeling discourse-level features such as inter-sentence relations (e.g., causality) and global discourse structures (e.g., the order of events); and (3) the generation coherence and controllability, which require both maintaining a coherent plot and adhering to controllable attributes (e.g., topics). Accordingly, LOT contains two understanding tasks and two generation tasks regarding the above abilities. We construct new datasets for these tasks based on various kinds of stories such as fables and fairy tales collected from public web resources, considering that stories usually contain abundant commonsense and discourse relations. All these tasks require processing stories with hundreds of words. Note that LOT does not involve extra-long texts with thousands of words since the complicated linguistic phenomena in these texts make it hard to test individual abilities and guide the improvement of generation models.

Table 1: 

A long text example. The concepts and events concerning commonsense and discourse relations are highlighted in bold.

Effendi’s son is eccentric, always behaving opposed to what Effendi has ordered him to do. Familiar with his son’s temper, Effendi usually communicates using irony. One day, the father and son were blocked by a river after purchasing flour from a mill. While they were crossing the river, one bag on the donkey’s back lost its weight and leaned. Effendi told his son with irony: “My boy! Drop the sack into the river!” The son heard the words and thought: “I have been opposed to my father for so many years. For this only time, I have to obey him.” Therefore, he followed Effendi’s words and indeed pushed the sack into the river. “My boy! What are you doing?” Effendi shouted in anger. ...

Furthermore, we release LongLM, a Chinese Long text pretraining Language Model. LongLM is a Transformer-based model with an encoder-decoder architecture, and comes in three versions ranging from 60 million to 1 billion parameters. We pretrain LongLM on 120G Chinese novels with two generative tasks, including text infilling (Lewis et al., 2020) and conditional continuation (Radford et al., 2018). The pretraining data do not include other types of texts (e.g., news, Wiki-texts) since we mainly focus on commonsense and discourse relations within general long texts instead of factual and technical knowledge. To the best of our knowledge, LongLM is the first pretraining model at this scale that focuses on modeling long-form stories. Extensive experiments on LOT show that LongLM outperforms strong baselines substantially on both the understanding and generation tasks. However, we also observe that LongLM still lags far behind human performance; closing this gap requires better semantic representations of events and deeper modeling of the commonsense and discourse relations between them. We summarize the main contributions of this paper as follows:

I. We propose a new story-centric benchmark LOT for evaluating Chinese long text understanding and generation. LOT consists of four tasks for testing the fundamental abilities to model long texts. We also present new datasets for these tasks.

II. We release a new Chinese pretraining model named LongLM. Experiment results demonstrate the strong performance of LongLM on LOT, but there still exists considerable room for improvement.1

NLP Benchmarks

Recently, many multi-task benchmarks have been proposed to drive the progress of generalizable models. These benchmarks usually aggregate multiple model-agnostic tasks under a unified framework, enabling researchers to fairly compare different models. SentEval (Conneau and Kiela, 2018) gathered multiple classification tasks involving either one or two sentences as inputs to evaluate sentence representations. DiscoEval (Chen et al., 2019) extended these tasks to the discourse level regarding inter-sentence relations. GLUE (Wang et al., 2019) included more diverse tasks such as natural language inference (Rocktäschel et al., 2016). Wang et al. (2019) also proposed SuperGLUE as a more challenging counterpart of GLUE by introducing multi-sentence tasks, but the additional tasks are limited to the formats of coreference resolution and question answering. In addition to these English benchmarks, many benchmarks have been proposed to evaluate NLU for other languages, such as CLUE (Xu et al., 2020a) for Chinese. Moreover, GLGE (Liu et al., 2020) and GEM (Gehrmann et al., 2021) were proposed for evaluating NLG models across diversified generation tasks such as text summarization and personalized dialogue generation. However, there is no benchmark designed specifically for long text modeling, especially for Chinese. Additionally, the above benchmarks were originally designed to cover as diverse task formats as possible. In contrast, we design the LOT tasks with the guidance of the abilities necessary for long text modeling, as suggested by Ribeiro et al. (2020), making it easier to figure out where models are failing and how to improve them.

Long Text Datasets

Previous studies in the field of long text modeling have frequently focused on the ROCStories (Mostafazadeh et al., 2016) and WritingPrompts (Fan et al., 2018) datasets. ROCStories contains 100k artificial five-sentence stories, while WritingPrompts consists of 300K pairs of prompts and stories with hundreds of words. Recent works collected stories with thousands of words to model longer-range dependencies, such as WikiText-103 (Merity et al., 2016), roleplayerguild (Louis and Sutton, 2018), PG-19 (Rae et al., 2020), STORIUM (Akoury et al., 2020), and Long-Range Arena (Tay et al., 2020). However, these datasets are written in English. LOT will drive the development of Chinese language models.

Moreover, LOT does not include datasets of extra-long texts, like PG-19, for the following two reasons: (1) Extra-long texts are far beyond the scope of current machine learning models because the discourse-level linguistic phenomena in these texts are entangled and complicated. Therefore, extra-long texts usually serve to compute the perplexity of language models (Dai et al., 2019) but hardly provide fine-grained guidance for improving model designs. (2) LOT aims not to spur research on building fuller connections across tokens within an extra-long sequence, but to drive the progress of machines in the aforementioned fundamental abilities for long text modeling.

Story Understanding and Generation

LOT is centered on fundamental abilities for long text modeling and thus includes four story understanding and generation tasks concerning commonsense and discourse relations. Recent studies have proposed various tasks to evaluate story understanding and generation. First, story ending selection (Mostafazadeh et al., 2016), story ending generation (Guan et al., 2019), and story completion (Wang and Wan, 2019) focused on the ability to reason about inter-event causal and temporal relations based on common sense. Second, Chen et al. (2019) evaluated the ability to model discourse relations by predicting the position of a sentence or a paragraph in a text. Third, some works focused on the coherence of story generation conditioned on short prompts (Fan et al., 2018), titles (Yao et al., 2019), and beginnings (Guan et al., 2020). Fourth, some studies centered on controllability, that is, imposing controllable attributes on story generation, such as keywords (Xu et al., 2020b), emotional trajectories (Brahman and Chaturvedi, 2020), outlines (Rashkin et al., 2020), and styles (Kong et al., 2021). LOT is a comprehensive benchmark to test the above abilities for Chinese long text modeling.

On the other hand, LOT does not involve those tasks that require learning more particular features of stories, such as event chains (Chambers and Jurafsky, 2008), character types (Bamman et al., 2013), inter-character relations (Chaturvedi et al., 2016, 2017), social networks (Agarwal et al., 2013), and abstractive structures (Finlayson, 2012). Non-neural story generation models usually retrieved events from a knowledge base with pre-specified semantic relations based on handcrafted rules (Li et al., 2013), which are costly and lack generalization. In this paper, we focus mainly on evaluating neural models for story understanding and generation.

We design LOT as an aggregation of two understanding tasks including Cloze Test (ClozeT) and Sentence Position Prediction (SenPos), and two generation tasks including Plot Completion (PlotCom) and Outline-conditioned Generation (OutGen). We show the task descriptions and data statistics in Tables 2 and 3, respectively. We use the jieba tokenizer2 for word tokenization.

Table 2: 

Overview of the tasks in LOT for the abilities they test, inputs and outputs, and the evaluation metrics. Dist and Cover refer to Distinct and Coverage (Section 5.3), respectively.

| Tasks | Abilities | Inputs | Outputs | Metrics |
| --- | --- | --- | --- | --- |
| ClozeT | Commonsense Reasoning | A text with a sentence removed (the position specified); two candidate sentences. | Choosing the correct sentence from the two candidates. | Accuracy |
| SenPos | Inter-sentence Relationship | A text with a sentence removed (the position unspecified); the removed sentence. | Choosing the correct position for the removed sentence. | Accuracy |
| PlotCom | Commonsense Reasoning; Inter-sentence Relationship | A text with a sentence removed (the position specified). | Generating a sentence to complete the text. | BLEU; Dist |
| OutGen | Discourse Structure; Coherence; Controllability | A title; an outline as an out-of-order set of phrases about characters and events. | Generating a coherent text adhering to the title and outline. | BLEU; Dist; Cover; Order |
Table 3: 

Data statistics of LOT tasks. The abbreviation char/sent/len is short for character/sentence/length, respectively.

| Datasets | Train | Val | Test |
| --- | --- | --- | --- |
| Task: ClozeT | | | |
| # Examples | 644 | 294 | 294 |
| Vocabulary Size | 9k | 7k | 7k |
| Avg. # Char in Input Text | 139.07 | 138.95 | 141.15 |
| Avg. # Word in Input Text | 89.28 | 89.03 | 90.20 |
| Avg. # Sent in Input Text | 5.95 | 5.94 | 5.95 |
| Avg. # Word in Candidate | 15.60 | 16.38 | 15.75 |
| Task: SenPos | | | |
| # Examples | 20,000 | 800 | 863 |
| Vocabulary Size | 147k | 10k | 22k |
| Avg. # Char in Input Text | 289.59 | 258.48 | 258.52 |
| Avg. # Word in Input Text | 254.11 | 224.20 | 223.25 |
| Avg. # Sent in Input Text | 9.61 | 8.43 | 8.44 |
| Avg. # Word in Removed Sent | 30.48 | 29.28 | 30.26 |
| Avg. # Candidate Positions | 8.05 | 6.91 | 6.91 |
| Task: PlotCom | | | |
| # Examples | 13,099 | 465 | 464 |
| Vocabulary Size | 22k | 8k | 8k |
| Avg. # Char in Input Text | 164.35 | 137.67 | 133.26 |
| Avg. # Word in Input Text | 105.48 | 87.56 | 84.98 |
| Avg. # Sent in Input Text | 7.17 | 5.59 | 5.48 |
| Avg. # Word in Output Sent | 15.08 | 15.96 | 16.15 |
| Task: OutGen | | | |
| # Examples | 1,456 | 242 | 729 |
| Vocabulary Size | 19k | 6k | 12k |
| Avg. # Word in Input Title | 4.64 | 4.89 | 4.64 |
| Avg. # Word in Input Outline | 19.20 | 19.05 | 19.47 |
| Avg. # Phrase in Input Outline | 8.00 | 8.00 | 8.00 |
| Avg. # Char in Output Text | 169.94 | 169.80 | 170.49 |
| Avg. # Word in Output Text | 108.91 | 108.68 | 109.04 |
| Avg. # Sent in Output Text | 7.20 | 7.11 | 7.15 |

We design LOT based on the following principles: (1) Task Diversity: The tasks vary in task formats, types and lengths of inputs and outputs, and focused abilities, making LOT a comprehensive framework for evaluating the generalization of models. (2) Task Difficulty: The tasks take hundreds of words as inputs or outputs, and do not involve domain-specific knowledge about science, films, and so forth. Therefore, they are beyond the scope of current state-of-the-art models, but are solvable by most Chinese native speakers. (3) Task Formulation: The tasks have been well formulated in prior studies and are agreed to be challenging but meaningful. We introduce new Chinese datasets for these tasks, which are constructed to focus more specifically on testing a certain ability than the original datasets. (4) Automatic Evaluation: The tasks have reliable automatic metrics to evaluate the focused abilities. We exclude open-ended generation tasks such as story generation from titles, which are difficult to evaluate automatically (Guan et al., 2021) because they suffer from the notorious one-to-many issue: There are many plausible outputs for the same input (Zhao et al., 2017).

We constructed datasets for LOT through automatic and manual annotation. First, we crawled human-written stories from public web pages as the data source. These stories are under licenses that allow use and redistribution for research purposes. Then, we hired a commercial team to create the LOT examples. The team is led by a professional screenwriter and has taken on hundreds of NLP annotation projects. All annotators are native Chinese speakers and well-trained for the annotation tasks. We show the full list of the source web pages and the annotation details in the appendix.

3.1 Cloze Test

Mostafazadeh et al. (2016) introduced the Story Cloze Test (SCT) task for evaluating story comprehension, which requires selecting the right ending from two candidates for a four-sentence leading context. However, SCT suffers from the following issues: (1) Its dataset is artificial and contains innate biases between right and wrong endings in some features such as lengths (Schwartz et al., 2017; Sharma et al., 2018). Such biases may leak information about the target labels. (2) SCT focuses on reasoning about only endings but neglects other types of reasoning, such as abductive reasoning (Bhagavatula et al., 2019), which requires reasoning about what happens between observed beginnings and endings. (3) SCT limits the scope of commonsense reasoning to realistic events. This limitation may be neither necessary nor sufficient. For example, “Cupid can fly” can be reasoned based on common sense although it is not realistic, while some story settings may be realistic but cannot be reasoned only based on the context and common sense, as shown in Table 4. Therefore, when constructing our ClozeT dataset, we adopt the following approaches to alleviate the above issues: (1) All examples are derived from existing human-written stories. (2) We allow annotators to create examples where the removed sentence is initially in the middle of the story. (3) We change the scope of commonsense reasoning to all events that embody characters’ reactions and intentions, or the nature of physical objects and concepts. Table 6 shows two ClozeT examples. Furthermore, we also conducted experiments to investigate the potential biases of our dataset in Section 5.5.

Table 4: 

An example for selecting a sentence that can be reasoned based on the context and common sense (in red). We also highlight a sentence that does not satisfy the requirement in green, which introduces a new character “the Devil King”.

Story Filtering

To ensure the quality of LOT examples, we asked annotators to judge whether each crawled story meets the following definition: “anything which is told in the form of a coherent event sequence involving several specific and related characters” (Mostafazadeh et al., 2016). We provided detailed cases to instruct annotators about this definition. Annotators then needed to refine the stories that do not meet the definition by rewriting the plots. They also cleaned up the stories using the following heuristics: (1) rejecting examples that may violate ethical principles (e.g., discrimination); (2) deleting noisy words (e.g., links); (3) changing slang and informal words into standard modern Chinese; and (4) rewriting all dialogues as objective event descriptions. Finally, we collected 2,427 high-quality Chinese stories, which we use to construct the datasets for the ClozeT, PlotCom, and OutGen tasks.

Dataset Construction

We presented the stories to another group of annotators to construct the ClozeT dataset. For each story, they selected a sentence that can be reasoned based on the context and common sense as the right candidate. Table 4 shows an example presented to the annotators to illustrate how to judge whether a sentence satisfies this requirement. Then, the annotators rewrote the sentence into another one as the wrong candidate, which maintains good topical relatedness with the context but violates common sense. The wrong candidates should either embody unreasonable reactions or intentions, or violate the nature of physical objects or concepts. We also required annotators not to select the first sentence, which usually introduces story settings instead of narrating an event. We browsed through the annotation results and gave the annotators detailed feedback before approving their submissions. Finally, we collected 1,232 examples in total and split them into training, validation, and test sets.

3.2 Sentence Position Prediction

We use the sentence position prediction task (Chen et al., 2019) to evaluate the ability to capture inter-sentence relations (e.g., causality). We formulate the task as follows: Given a text with a sentence removed, models should choose the correct position of the sentence in the text from multiple candidates. Chen et al. (2019) constructed an English dataset for this task by randomly removing sentences from existing texts. However, such examples may be invalid since a sentence may have multiple plausible positions in a text, as illustrated in Table 5. Therefore, we construct the dataset for our task based on the following pipeline: (1) extracting paragraphs with fewer than 500 words from crawled stories; (2) randomly selecting a sentence to remove from each paragraph, and regarding all positions between two adjacent sentences as candidates3; and (3) asking annotators to refine part of the auto-constructed examples to form the validation and test sets, while the remaining examples form the training set. Table 7 shows two SenPos examples.

Table 5: 

A poor example for the SenPos task. The removed sentence has multiple reasonable positions including [2] and [3] in the original text.

Text: I couldn’t control my anger very well.[1]My parents would yell at me, and I ran to my room.[2]I buried my head in a pillow and screamed.[3]I threw my pillow and hit it hard. 
Removed Sentence: I tried to express my anger. 
Table 6: 

Two ClozeT examples. The right candidates are extracted from the original stories (at the position of “[MASK]”) while the wrong candidates are written by crowd-sourced annotators. The first example focuses on common sense regarding the fox’s reaction to the silly wolf’s behavior, while the second example focuses on common sense regarding the relations between palace and prince. We highlight the entities and events related to the commonsense relations in red, and those which violate common sense in the wrong candidates in green.

Table 7: 

Two SenPos examples. The special tokens from [1] to [9] refer to the candidate positions. The first/second example focuses on testing the ability to capture the inter-sentence causal/temporal relations, respectively. We highlight the entities and events implying the relations in red.

Dataset Construction

We asked annotators to refine each example so that the removed sentence has only one reasonable position in the text. We did not allow annotators to select the first or last sentence of the original text as the removed sentence because these sentences usually contain obvious wording features (e.g., “once upon a time,” “they lived happily together”), which may make the task trivial. Unlike ClozeT, we allowed the texts for SenPos to be incomplete or to include dialogues, which also embody rich inter-sentence relations. Finally, we collected 1,663 examples for validation and testing through human annotation, and constructed 20,000 examples automatically for training.

3.3 Plot Completion

We use the Plot Completion task (Wang and Wan, 2019) to test the ability to make inferences based on common sense. We formulate this task as follows: Given a story with a sentence removed, models should generate a sentence to complete the story and make it reasonable and coherent.

Dataset Construction

Prior studies (Wang and Wan, 2019; Paul and Frank, 2021) automatically constructed datasets for this task based on existing datasets by randomly removing one sentence from a story. However, as shown in Table 4, not all sentences in a story can be reasoned only based on the context and common sense. Therefore, we only used the above automatic method to construct the training data. And we adapted the ClozeT data to this task for validation and testing, since annotators have marked out the qualified sentences. Specifically, we randomly sampled some ClozeT examples and took the incomplete story of each example as input, and the right candidate as the target sentence to be generated.

3.4 Outline-conditioned Generation

Prior work tended to test the ability of long text generation through story generation conditioned on inputs with limited information, such as titles (Yao et al., 2019). However, these tasks are so open-ended that it is difficult to reliably measure the generation quality using automatic metrics (Guan and Huang, 2020). To alleviate this issue, we introduce the Outline-conditioned Generation task (Rashkin et al., 2020), which requires generating a coherent long-form story conditioned on an outline of characters and events. We formulate the outline as a set of out-of-order phrases, which not only narrows down the set of plausible stories but also serves to test the controllability and planning ability of models to arrange the given events reasonably at the discourse level.

Dataset Construction

We built the dataset for this task automatically based on the filtered stories. We followed Rashkin et al. (2020) to extract the outline of a story using the RAKE algorithm (Rose et al., 2010). We extracted at most eight phrases for each story, and each phrase contains no more than eight words. For example, the outline for the story in Table 1 is {“told his son with irony,” “purchasing flour from a mill,” “crossing the river,” “drop the sack into the river,” “indeed pushed the sack,” “familiar with his son’s temper,” “shouted,” “one bag”}. The outline serves as discourse-level guidance for generation models, which should rearrange the events reasonably and generate a story with a good global discourse structure, rather than focus only on modeling local coherence.

3.5 Overall Score

Existing benchmarks usually summarize the performance of a model as a single score by averaging all metric scores without considering task difficulties. To encourage models to progress on those tasks where there is a more significant gap between machines and humans, we propose to average metric scores with different weights. Suppose that there are a total of M metrics for all tasks; we derive the overall score as follows:

$$ w_i = \frac{H_i / B_i}{\sum_{j=1}^{M} H_j / B_j} \quad (1) $$

$$ \mathrm{Overall} = \sum_{i=1}^{M} w_i S_i \quad (2) $$
where Hi, Bi, and Si are the score of humans, a pre-selected baseline, and the evaluated model for the i-th metric, respectively, and wi is the weight for this metric. Intuitively, the metric scores where the baseline model has a larger gap with humans will have a larger weight when computing the overall score. We use BERT and GPT2 as the baseline models for the understanding and generation tasks in LOT, respectively.
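As a sanity check on this weighting scheme, the following minimal sketch (assuming the formulation in Eqs. 1 and 2 above) computes the weights and the overall score from per-metric human, baseline, and model scores; the function name and the example numbers (taken from Table 10) are only for illustration.

```python
def overall_score(human, baseline, model):
    """human, baseline, model: per-metric score lists in the same order.

    Metrics where the baseline lags humans more (larger H_i / B_i) get larger weights.
    """
    ratios = [h / b for h, b in zip(human, baseline)]        # H_i / B_i
    total = sum(ratios)
    weights = [r / total for r in ratios]                    # w_i, sums to 1
    overall = sum(w * s for w, s in zip(weights, model))     # weighted metric average
    return overall, weights

# Understanding tasks on the test set (Table 10): humans vs. BERT (baseline) vs. LongLM-base.
overall, w = overall_score(human=[100.00, 98.00], baseline=[69.39, 43.68], model=[77.55, 62.34])
print(w)        # roughly [0.39, 0.61], matching the reported weights
print(overall)  # roughly 68.3
```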

To provide more flexibility on both understanding and generation tasks, we build LongLM following the original encoder-decoder design of Transformer (Vaswani et al., 2017) with three different sizes, as shown in Table 8. We follow Cui et al. (2020) to use a sentencepiece vocabulary of 32,000 wordpieces (Kudo and Richardson, 2018). And we set the maximum sequence length to 512 for both the encoder and decoder.

Table 8: 

Hyper-parameter settings for different versions of LongLM. dm, dff, and dkv are the dimension of hidden states, the feed forward layers, and the keys/values in the self-attention layers, respectively. nh is the number of attention heads. ne and nd denote the number of hidden layers for the encoder and decoder, respectively. # P is the number of parameters.

| Versions | dm | dff | dkv | nh | ne/nd | # P |
| --- | --- | --- | --- | --- | --- | --- |
| Small | 512 | 2,048 | 64 | | 6/6 | 60M |
| Base | 768 | 3,072 | 64 | 12 | 12/12 | 223M |
| Large | 1,536 | 3,072 | 64 | 12 | 24/32 | 1B |
Pretraining Data

We collect 120G novels as the pretraining data for LongLM, which cover various topics such as romance, military, and so on. Since a novel is usually much longer than the maximum input and output length of LongLM, we split a novel into multiple segments for pretraining.

Pretraining Tasks

Encoder-decoder models are typically trained by maximizing the likelihood of the target output given an input. To improve the capacities of both the encoder and decoder, we propose to train LongLM with two pretraining tasks: text infilling (Raffel et al., 2020) and conditional continuation (Radford et al., 2019). For the first task, the input is a text in which a number of spans are sampled and replaced by special tokens with unique IDs, while the output is the spans delimited by the special tokens used in the input. The lengths of the masked spans are drawn from a Poisson distribution with λ=3, and the masked tokens make up 15% of the original texts. For the second task, the text is split into two parts at a random position; the input and output are the front and back halves, respectively. We show an example of the pretraining tasks in Figure 1.
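To make the two objectives concrete, here is a minimal sketch of how such training pairs could be constructed; the span sampling follows the description above (Poisson λ=3, roughly 15% mask budget), while the T5-style sentinel format (<extra_id_k>) and the non-overlapping-span simplification are assumptions rather than LongLM's exact preprocessing.

```python
import numpy as np

def text_infilling_example(tokens, mask_ratio=0.15, poisson_lambda=3, max_tries=100):
    """Build a (source, target) pair for the text-infilling objective.

    Span lengths are drawn from Poisson(3); masked tokens cover roughly 15% of
    the text. The sentinel format (<extra_id_k>) is a T5-style assumption.
    """
    budget = max(1, int(len(tokens) * mask_ratio))
    masked = [False] * len(tokens)
    spans, tries = [], 0
    while budget > 0 and tries < max_tries:
        tries += 1
        length = int(min(max(1, np.random.poisson(poisson_lambda)), budget))
        start = int(np.random.randint(0, len(tokens) - length + 1))
        if any(masked[start:start + length]):
            continue  # keep spans non-overlapping (a simplification)
        masked[start:start + length] = [True] * length
        spans.append((start, length))
        budget -= length

    source, target, prev_end = [], [], 0
    for k, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{k}>"
        source += tokens[prev_end:start] + [sentinel]          # masked input
        target += [sentinel] + tokens[start:start + length]    # spans to reconstruct
        prev_end = start + length
    source += tokens[prev_end:]
    return source, target

def conditional_continuation_example(tokens):
    """Split a text at a random position: the front half is the input, the back half the target."""
    split = int(np.random.randint(1, len(tokens)))
    return tokens[:split], tokens[split:]
```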

Figure 1: 

Schematic of the pretraining tasks. <X> and <Y> are the special tokens used for masking spans, and <Z> is the “end of sequence” token.

Pretraining Details

We set the learning rate to 1e-4 with the Adam optimizer and the batch size to 1,000. We pretrained LongLM for 2.5M steps. It took about two months to train the largest model using eight NVIDIA V100 GPUs.

Model Performance

To assess the performance of LongLM on the pretraining tasks, we randomly held out 1,000 texts from the pretraining data for testing; these texts were never seen during the pretraining phase. We used perplexity and BLEU-n (n = 3, 4) to evaluate both pretraining tasks. We generated outputs using greedy decoding for the text infilling task, and top-k sampling (Fan et al., 2018) with k = 40 and a softmax temperature of 0.7 (Goodfellow et al., 2014) for the conditional continuation task. As shown in Table 9, the performance improves substantially as the number of parameters increases.

Table 9: 

Perplexity (PPL) and BLEU scores of LongLM for text infilling (TextInfill) and conditional continuation (CondCont). The best performance is in bold and the second best is underlined.

| Models | TextInfill PPL | TextInfill BLEU-3/4 | CondCont PPL | CondCont BLEU-3/4 |
| --- | --- | --- | --- | --- |
| LongLMsmall | 11.61 | 73.80/68.96 | 22.91 | 5.30/2.43 |
| LongLMbase | 8.24 | 75.65/71.05 | 17.03 | 5.73/2.64 |
| LongLMlarge | 6.50 | 77.08/72.65 | 14.08 | 8.91/5.97 |

In this section, we tested LongLM and existing models on LOT with automatic and manual evaluation. Furthermore, we conducted extensive experiments to investigate the potential biases of the ClozeT and SenPos datasets (Section 5.5), and measure the overlap between training and testing data (Section 5.6).

5.1 Evaluated Models

We evaluated the following models, which are implemented with HuggingFace Transformers: (1) Vanilla Transformer: It has the same architecture as BERTbase except that the number of layers is set to 3 (Vaswani et al., 2017). (2) BERT: It is initialized from the bert-base-chinese model (Devlin et al., 2019). (3) RoBERTa: It is initialized from the hfl/chinese-roberta-wwm-ext model (Cui et al., 2020). (4) GPT2: It is initialized from the uer/gpt2-chinese-cluecorpussmall model (Zhao et al., 2019). (5) mT5: It is initialized from the google/mt5-base model (Xue et al., 2021). We set all the baseline models to the base version due to limited computational resources.

To show the generic benefits of the pretraining data of LongLM for long text modeling, we pretrained a left-to-right language model from scratch on the data with the standard language modeling objective. This model has the same architecture as GPT2base and is denoted as GPT2base† to distinguish it from the GPT2base baseline above. Moreover, we evaluated two task-specific pretraining models, PlotMachines (PM) (Rashkin et al., 2020) and Plan&Write (PW) (Yao et al., 2019), and two typical non-pretrained models, ConvS2S (Gehring et al., 2017) and Fusion (Fan et al., 2018), on the generation tasks in LOT. We used GPT2base as the backbone model of PM and PW. For PM, we regard input sentences (for PlotCom) or input phrases (for OutGen) as the plot elements used in the memory network, and update the memory representations at each step of decoding. For PW, we take a keyword extracted from the target sentence using the RAKE algorithm (for PlotCom) or the sorted input phrases in order (for OutGen) as the intermediate representations for planning. We implemented these models based on the code provided by the original papers.

5.2 Experiment Settings

Understanding Tasks

For both tasks, we encode the input of each example and then predict a distribution over all candidates by normalizing the dot-product values between the representations of each candidate and the context. We use the candidate with the maximum probability as the prediction result. For ClozeT, we represent a candidate using the hidden state at the end of it, and we regard the hidden state at the position of the removed sentence appearing in the original text as the context representation. And for SenPos, we take the hidden state at each candidate position as the candidate representation and the hidden state at the end of the removed sentence as the context representation. When evaluating mT5 and LongLM, we feed the same input into the encoder and decoder (Lewis et al., 2020) and use the hidden states of the decoder for prediction in the above way.
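As a rough illustration of this scoring scheme (not the authors' exact code), the sketch below scores the candidates of a ClozeT-style example by dot products between the context representation and each candidate representation, normalized with a softmax; the tensor names and the 768-dimensional hidden size are assumptions.

```python
import torch
import torch.nn.functional as F

def choose_candidate(context_repr, candidate_reprs):
    """context_repr: [hidden]; candidate_reprs: [num_candidates, hidden].

    Returns the index of the predicted candidate and the distribution over candidates.
    """
    logits = candidate_reprs @ context_repr   # dot product between each candidate and the context
    probs = F.softmax(logits, dim=-1)         # normalize into a probability distribution
    return int(torch.argmax(probs)), probs

# Stand-in hidden states: the encoder state at the removed-sentence position (context)
# and the states at the end of each of the two candidate sentences.
context = torch.randn(768)
candidates = torch.randn(2, 768)
pred, probs = choose_candidate(context, candidates)
```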

Generation Tasks

For PlotCom, we take the incomplete story of an example as input to generate the missing sentence. And for OutGen, we concatenate all phrases in an outline with special tokens as input to generate a story.

Hyper-Parameters

For all models, we set the batch size to 12, the maximum sequence length to 512, and the learning rate to 3e-5. We decode outputs using top-k sampling with k = 40 and a softmax temperature of 0.7 for the generation tasks.
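For reference, a minimal sketch of top-k sampling with temperature as used for decoding (k = 40, temperature 0.7); this is the generic procedure applied to a model's next-token logits, not the authors' decoding script.

```python
import torch
import torch.nn.functional as F

def top_k_sample(logits, k=40, temperature=0.7):
    """Sample the next token id from the k most likely tokens after temperature scaling.

    logits: 1-D tensor of unnormalized scores over the vocabulary for the next position.
    """
    scaled = logits / temperature
    topk_logits, topk_ids = torch.topk(scaled, k)    # keep only the k largest logits
    probs = F.softmax(topk_logits, dim=-1)           # renormalize over the top k
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_ids[choice])

# Usage: next_token_id = top_k_sample(next_token_logits)
```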

5.3 Automatic Evaluation

Metrics

We use accuracy to evaluate the understanding tasks. For the generation tasks, we use BLEU-n (B-n) and Distinct-n (D-n) to evaluate the n-gram overlap with ground-truth texts (Papineni et al., 2002) and the n-gram generation diversity (Li et al., 2016), respectively. We set n = 1, 2 for both generation tasks. Additionally, we use the following two metrics to evaluate OutGen: (1) Coverage (Cover): It evaluates the generation controllability and is computed as the average Rouge-L recall score (Lin, 2004) between the generated text and each input phrase. A higher coverage score indicates that the generated text covers more input phrases. (2) Order: It measures the gap between the positional orders of the input phrases in the generated texts and in the ground-truth texts. Specifically, we compute the order score as the average ratio of the number of inversions in the generated story to the number of all position pairs of any two phrases. An inversion refers to a position pair that is out of the ground-truth order. We use the position of the longest common subsequence between a story and a phrase as the position of the phrase in the story. Because an input phrase does not always appear in the generated story, we regard all position pairs involving such a phrase as inversions.
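A simplified sketch of the Coverage and Order computations is shown below. It assumes the phrase positions have already been located in the generated and ground-truth stories (the paper locates them via the longest common subsequence); for readability the sketch reports the complementary fraction of correctly ordered pairs, so a perfectly ordered story scores 1.0, which appears consistent with the ground-truth Order of 100 in Table 11. Function names are illustrative.

```python
from itertools import combinations

def lcs_len(a, b):
    """Length of the longest common subsequence of two strings (character level)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def coverage(generated, phrases):
    """Average Rouge-L recall of each input phrase against the generated text."""
    return sum(lcs_len(generated, p) / len(p) for p in phrases) / len(phrases)

def order_score(gen_pos, ref_pos):
    """gen_pos / ref_pos: dicts mapping each phrase to its position in the generated /
    ground-truth story (None if the phrase does not appear in the generated story).

    Returns the fraction of phrase pairs whose relative order matches the reference;
    pairs involving a missing phrase count as inversions.
    """
    pairs = list(combinations(ref_pos.keys(), 2))
    correct = 0
    for p, q in pairs:
        if gen_pos.get(p) is None or gen_pos.get(q) is None:
            continue  # missing phrase: the pair is treated as an inversion
        if (gen_pos[p] < gen_pos[q]) == (ref_pos[p] < ref_pos[q]):
            correct += 1
    return correct / len(pairs)
```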

Results

Tables 10 and 11 show the results on the understanding and generation tasks, respectively. To obtain the human performance on the understanding tasks, we randomly sampled 100 examples from the validation set or test set and hired three crowd-sourced annotators (native Chinese speakers) to do these tasks. We made final decisions among them through majority voting. All results show an almost perfect inter-annotator agreement with Fleiss’s κ > 0.85 (Fleiss, 1971). For the generation tasks, we regard the scores of the ground-truth texts as human performance.

Table 10: 

Accuracy (%) on the understanding tasks in LOT. # P means the number of parameters. The best performance is in bold and the second best is underlined. wi is the metric weight with BERT as the baseline model when computing the overall score.

| Models | # P | ClozeT | SenPos | Overall |
| --- | --- | --- | --- | --- |
| Validation Set | | | | |
| Transformer | 38M | 55.78 | 17.38 | 31.46 |
| BERTbase | 102M | 70.75 | 40.13 | 51.36 |
| RoBERTabase | 102M | 72.11 | 51.63 | 59.14 |
| GPT2base | 102M | 70.07 | 37.78 | 49.62 |
| GPT2base† | 102M | 74.49 | 39.25 | 52.17 |
| mT5base | 582M | 72.45 | 63.25 | 66.62 |
| LongLMsmall | 60M | 73.81 | 48.75 | 57.94 |
| LongLMbase | 223M | 75.17 | 64.38 | 68.34 |
| LongLMlarge | 1B | 79.93 | 70.00 | 73.64 |
| Humans | N/A | 99.00 | 97.00 | 97.73 |
| wi | N/A | 0.37 | 0.63 | 1.00 |
| Test Set | | | | |
| Transformer | 38M | 54.42 | 16.34 | 31.23 |
| BERTbase | 102M | 69.39 | 43.68 | 53.74 |
| RoBERTabase | 102M | 67.69 | 51.35 | 57.74 |
| GPT2base | 102M | 73.13 | 37.25 | 51.28 |
| GPT2base† | 102M | 76.87 | 39.28 | 53.98 |
| mT5base | 582M | 75.17 | 61.41 | 66.79 |
| LongLMsmall | 60M | 77.21 | 53.07 | 62.51 |
| LongLMbase | 223M | 77.55 | 62.34 | 68.29 |
| LongLMlarge | 1B | 80.61 | 69.41 | 73.39 |
| Humans | N/A | 100.00 | 98.00 | 98.78 |
| wi | N/A | 0.39 | 0.61 | 1.00 |
Table 11: 

Evaluation results on the generation tasks in LOT. # P means the number of parameters. The best performance is in bold and the second best is underlined. wi is the metric weight with GPT2base as the baseline model when computing the overall score.

| Models | # P | PlotCom B-1 | PlotCom B-2 | PlotCom D-1 | PlotCom D-2 | OutGen B-1 | OutGen B-2 | OutGen D-1 | OutGen D-2 | OutGen Cover | OutGen Order | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Validation Set | | | | | | | | | | | | |
| ConvS2S | 58M | 18.92 | 4.18 | 6.31 | 32.18 | 29.23 | 10.38 | 3.45 | 21.79 | 14.81 | 25.34 | 11.85 |
| Fusion | 109M | 20.56 | 4.69 | 8.63 | 35.73 | 29.22 | 10.34 | 3.39 | 22.67 | 17.41 | 26.55 | 12.61 |
| GPT2base | 102M | 22.67 | 6.22 | 24.75 | 70.57 | 30.43 | 14.87 | 10.95 | 44.38 | 60.90 | 55.52 | 20.24 |
| GPT2base† | 102M | 22.49 | 5.43 | 26.88 | 74.87 | 35.29 | 18.31 | 13.89 | 51.36 | 64.01 | 57.64 | 21.73 |
| PM | 102M | 22.11 | 5.49 | 23.89 | 69.74 | 31.81 | 14.94 | 12.99 | 50.56 | 62.98 | 56.75 | 20.45 |
| PW | 102M | 22.45 | 5.57 | 25.64 | 71.54 | 35.84 | 18.47 | 11.86 | 47.62 | 64.93 | 57.30 | 21.48 |
| mT5base | 582M | 22.56 | 6.46 | 24.44 | 71.31 | 36.71 | 22.25 | 14.52 | 50.01 | 77.98 | 63.15 | 23.53 |
| LongLMsmall | 60M | 21.78 | 7.11 | 20.17 | 59.63 | 35.03 | 19.17 | 10.80 | 39.70 | 62.53 | 56.53 | 21.02 |
| LongLMbase | 223M | 22.91 | 8.28 | 22.16 | 63.54 | 40.33 | 24.29 | 14.66 | 51.82 | 79.60 | 62.78 | 24.75 |
| LongLMlarge | 1B | 23.76 | 8.70 | 25.93 | 72.18 | 42.79 | 24.91 | 16.13 | 57.71 | 80.46 | 64.36 | 26.12 |
| Truth | N/A | 100.00 | 100.00 | 35.32 | 84.33 | 100.00 | 100.00 | 21.66 | 71.43 | 100.00 | 100.00 | 92.23 |
| wi | N/A | 0.11 | 0.40 | 0.04 | 0.03 | 0.08 | 0.17 | 0.05 | 0.04 | 0.04 | 0.04 | 1.00 |
| Test Set | | | | | | | | | | | | |
| ConvS2S | 58M | 19.60 | 4.20 | 6.00 | 32.42 | 29.00 | 10.14 | 1.60 | 13.95 | 15.45 | 25.77 | 11.27 |
| Fusion | 109M | 20.52 | 4.90 | 8.43 | 35.09 | 28.77 | 10.22 | 1.47 | 14.12 | 17.10 | 26.36 | 11.91 |
| GPT2base | 102M | 22.94 | 5.76 | 24.69 | 70.30 | 30.17 | 14.91 | 7.62 | 36.87 | 60.87 | 55.90 | 19.21 |
| GPT2base† | 102M | 22.45 | 5.38 | 26.08 | 73.26 | 35.79 | 18.68 | 9.89 | 43.52 | 64.43 | 56.96 | 20.76 |
| PM | 102M | 22.87 | 5.75 | 24.08 | 71.19 | 31.85 | 15.24 | 8.62 | 41.32 | 63.15 | 57.21 | 19.77 |
| PW | 102M | 22.76 | 6.07 | 25.55 | 70.72 | 35.12 | 17.96 | 8.68 | 40.17 | 63.70 | 55.17 | 20.52 |
| mT5base | 582M | 22.52 | 6.48 | 24.33 | 70.53 | 36.33 | 22.07 | 10.90 | 43.65 | 78.66 | 63.79 | 22.59 |
| LongLMsmall | 60M | 22.05 | 7.45 | 19.93 | 59.79 | 34.48 | 19.17 | 7.93 | 34.25 | 63.75 | 57.64 | 20.48 |
| LongLMbase | 223M | 23.28 | 8.58 | 21.37 | 62.43 | 40.25 | 24.15 | 10.75 | 44.40 | 79.88 | 63.67 | 23.93 |
| LongLMlarge | 1B | 24.20 | 9.06 | 25.75 | 71.08 | 42.10 | 24.77 | 12.04 | 50.29 | 81.48 | 64.82 | 25.29 |
| Truth | N/A | 100.00 | 100.00 | 35.01 | 84.56 | 100.00 | 100.00 | 15.71 | 63.46 | 100.00 | 100.00 | 91.64 |
| wi | N/A | 0.10 | 0.42 | 0.03 | 0.03 | 0.08 | 0.16 | 0.05 | 0.04 | 0.04 | 0.04 | 1.00 |

We summarize the evaluation results as follows: (1) Pretrained models have significantly better performance than non-pretrained models. (2) LongLMlarge outperforms the other baselines substantially on both the understanding and generation tasks. LongLMbase/LongLMsmall achieves better overall scores than mT5/GPT2 with about half as many parameters. (3) By comparing GPT2base and GPT2base†, we can see that our pretraining data effectively improve the ability to model long texts. (4) LongLMsmall performs better than GPT2base† on the understanding tasks and is comparable with GPT2base† on the generation tasks, suggesting the benefits of the encoder-decoder framework and the text infilling task. (5) It is still extremely challenging for all models to capture the commonsense and inter-sentence discourse relations between events in long texts, as required by the ClozeT and SenPos tasks. Furthermore, we investigate how the size of the training data influences the accuracy of BERT for SenPos. The result in Figure 2 indicates the necessity of developing better representations of discourse relations instead of relying only on increasing the data size. (6) The results on the generation tasks show that LongLM generates more word overlap with references than similar-sized baselines for both tasks, and covers more input phrases and arranges them in a more correct order for OutGen. However, LongLM underperforms GPT2-based models in terms of diversity on PlotCom. (7) Dynamically tracking plot states (i.e., PM) does not bring significant improvement on the generation tasks compared with GPT2base, suggesting that tackling the generation tasks may require modeling the discourse structure explicitly. The superiority of PW over GPT2base on OutGen further indicates the benefit of modeling discourse-level features. In summary, we believe LOT will serve as an effective evaluation for capturing the commonsense and discourse relations of long texts beyond the surface events, and for generating coherent and controllable long-form texts.

Figure 2: 

Accuracy of BERT for SenPos as the size of training data increases.


5.4 Manual Evaluation

Because automatic metrics may be unreliable for evaluating NLG (Guan and Huang, 2020), we conducted a point-wise manual evaluation to measure the disparity between machines and humans for the generation tasks in LOT. For each task, we randomly sampled 100 examples from the test set and obtained 100 ground-truth texts and 300 generated texts from three typical models including GPT2base, mT5base and LongLMlarge. For each text along with the input, we hired three crowd-sourced workers to judge its quality with a binary score (1 for good, and 0 otherwise) in terms of three aspects: (1) grammaticality (intra-sentence grammar quality of generated texts), (2) coherence (causal and temporal dependencies within generated texts), and (3) relatedness to inputs (reasonable logical connections to the input context for PlotCom; and reasonable utilization of input phrases for OutGen). These aspects are independently evaluated. We made final decisions among three annotators through majority voting. We show the annotation instructions in the appendix.

Table 12 shows the evaluation results. For both tasks, LongLM outperforms GPT2 and mT5 significantly in all aspects (p < 0.05, sign test). However, it is difficult for all models to generate a logical completion for PlotCom (relatedness score < 0.1), showing their poor ability to capture commonsense and inter-sentence relations. The large gap between LongLM and humans also shows that both tasks remain challenging for existing generation models. We also observe a positive correlation between the manual and automatic evaluation results (Table 11), suggesting that it may be acceptable to use automatic evaluation to compare and improve models on the generation tasks in LOT.

Table 12: 

Manual evaluation results for PlotCom and OutGen in terms of grammaticality (Gram), coherence (Cohe), and relatedness (Relat). The best performance is highlighted in bold. All results show a fair inter-annotator agreement with Fleiss’ κ > 0.2.

| Models | Gram (κ) | Cohe (κ) | Relat (κ) |
| --- | --- | --- | --- |
| Task: PlotCom | | | |
| GPT2base | 0.84 (0.49) | 0.41 (0.71) | 0.01 (0.50) |
| mT5base | 0.85 (0.24) | 0.53 (0.65) | 0.01 (0.50) |
| LongLMlarge | 0.95 (0.48) | 0.82 (0.64) | 0.09 (0.69) |
| Truth | 1.00 (1.00) | 1.00 (1.00) | 0.99 (0.49) |
| Task: OutGen | | | |
| GPT2base | 0.54 (0.52) | 0.18 (0.52) | 0.39 (0.43) |
| mT5base | 0.53 (0.26) | 0.08 (0.46) | 0.49 (0.38) |
| LongLMlarge | 0.81 (0.23) | 0.37 (0.43) | 0.62 (0.45) |
| Truth | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |

5.5 Bias Investigation

It is essential to investigate potential biases of a dataset, which may leak information about target labels and enable models to use shortcuts to handle complex inputs without actually mastering the focused abilities (Ribeiro et al., 2020). Therefore, we experimented with the following baselines to inspect the ClozeT and SenPos datasets: (1) Random: It chooses a candidate randomly. (2) Majority: It chooses the candidate with the index that is most frequently selected in the training set. (3) Length: For ClozeT, it chooses the candidate that contains more words; for SenPos, it chooses the position whose adjacent sentences have the closest number of words to the removed sentence. (4) BLEU-n: For ClozeT, it chooses the candidate with the higher BLEU-n score (Papineni et al., 2002) with the context; for SenPos, it chooses the position whose adjacent sentences have the largest average BLEU-n score with the removed sentence (n = 1, 2). (5) Sentiment: For ClozeT, it chooses the candidate with the higher sentiment score computed by an off-the-shelf Chinese sentiment analyzer; for SenPos, it chooses the position where the average sentiment score of the adjacent two sentences is the closest to the score of the removed sentence. (6) Discourse Markers: For ClozeT, it chooses the candidate whose adjacent sentences contain a discourse marker matching it. For example, if “because” occurs in the last sentence before the position of the candidates, this baseline will choose the candidate that contains “so”. If there are no such paired markers in an example or there are multiple eligible candidates, this baseline chooses randomly. The setting of this baseline for SenPos is similar to ClozeT. We manually define 24 marker pairs for this baseline. (7) BERT w/o Context: We fine-tuned BERT to choose a candidate directly without taking the context as input (Schwartz et al., 2017). (8) BERT w/o Long: It is used to study whether solving these tasks requires modeling long-range dependencies. For ClozeT, we fine-tuned BERT to choose with only the adjacent sentences of the removed sentence as input; for SenPos, we encoded each position and its adjacent sentences separately using BERT and then took the hidden states at these positions for prediction. These baselines cover different levels of features, ranging from the token level (e.g., Length) and the sentence level (e.g., Sentiment) to the discourse level (e.g., Discourse Markers, BERT w/o Context). We believe these baselines provide a comprehensive inspection of the potential biases of our datasets.
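To make a couple of these heuristics concrete, here is a rough sketch of the Length and BLEU-1 baselines for ClozeT; unigram precision stands in for BLEU-1, and the function names are illustrative rather than the scoring scripts actually used.

```python
from collections import Counter

def length_baseline(candidates):
    """Choose the candidate (a list of words) that contains more words."""
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))

def bleu1_baseline(context_tokens, candidates):
    """Choose the candidate with the higher unigram overlap with the context (a BLEU-1 stand-in)."""
    context_counts = Counter(context_tokens)

    def score(candidate):
        clipped = sum(min(count, context_counts[word])
                      for word, count in Counter(candidate).items())
        return clipped / max(1, len(candidate))

    return max(range(len(candidates)), key=lambda i: score(candidates[i]))
```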

As shown in Table 13, neither task can be trivially solved by these baselines, suggesting that the datasets may be free of biases in terms of the above features. Therefore, we believe that the tasks can focus on testing the ability of models to capture long-range commonsense and discourse relations.

Table 13: 

Accuracy (%) of different baselines on the test sets of ClozeT and SenPos for bias investigation. We use the results of BERT as a reference.

| Baselines | ClozeT | SenPos |
| --- | --- | --- |
| Random | 50.00 | 16.03 |
| Majority | 52.72 | 16.24 |
| Length | 52.72 | 16.45 |
| BLEU-1/2 | 46.94/48.98 | 14.14/14.95 |
| Sentiment | 50.34 | 16.49 |
| Discourse Markers | 45.92 | 9.15 |
| BERT w/o Context | 57.82 | 18.08 |
| BERT w/o Long | 62.24 | 19.00 |
| BERT | 69.39 | 43.68 |

5.6 Memorization Investigation

Overlap between training and test data may result in over-reporting of the generalization performance of machines. Therefore, it is necessary to investigate how much of the test data also shows up in the training data. To this end, we follow Radford et al. (2019) and measure the overlap between two datasets by calculating the percentage of 8-grams from one that are also in the other. We use the jieba tokenizer for tokenization.
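A minimal sketch of this 8-gram overlap measurement is shown below (word-level n-grams after jieba tokenization); the exact counting and de-duplication choices of the original analysis are not specified, so treat this as an approximation.

```python
import jieba

def eight_grams(text, n=8):
    """Set of word-level n-grams of a text after jieba tokenization."""
    tokens = list(jieba.cut(text))
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_percent(test_texts, train_texts, n=8):
    """Percentage of 8-grams from the test texts that also appear in the training texts."""
    train_grams = set()
    for text in train_texts:
        train_grams |= eight_grams(text, n)
    test_grams = [g for text in test_texts for g in eight_grams(text, n)]
    if not test_grams:
        return 0.0
    hits = sum(g in train_grams for g in test_grams)
    return 100.0 * hits / len(test_grams)
```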

Table 14 shows the overlap analysis for the test sets of the four tasks in LOT. All test sets have less than 1% overlap with their own training sets. Notably, there are 17 test examples of SenPos that contain more than 10% overlapped 8-grams with the training set. This is because a training example and a test example may come from the same story and thus share similar information (e.g., characters, locations). A test example contains at most 60.98% overlapped 8-grams, suggesting that the training set and test set do not include exactly the same example. As for the pretraining data of LongLM, the test sets of ClozeT and PlotCom still have less than 1% overlap. However, there are dozens of test examples in SenPos and OutGen that contain more than 10% overlapped 8-grams. Through manual inspection, we found that these overlaps mainly come from idioms, proverbs, and classic fairy tales, which may be part of some novels in the pretraining data.

Table 14: 

Overlapping analysis for the test sets of the four tasks with respect to their own training sets or the pretraining data of LongLM. We compute the following statistics: (1) Percent: the percentage of 8-grams from the test set that are also in the training sets or the pretraining data; (2) # 8-grams: the number of overlapped 8-grams; (3) # Exam: the number of examples that contain at least one overlapped 8-gram; (4) # Exam >10%: the number of examples that have more than 10% overlapped 8-grams; (5) Max Percent: the maximum percentage of overlapped 8-grams from an example.

| Tasks | ClozeT | SenPos | PlotCom | OutGen |
| --- | --- | --- | --- | --- |
| Overlap with the Training Sets | | | | |
| Percent | 0.00% | 0.62% | 0.02% | 0.00% |
| # 8-grams | | 1,040 | | |
| # Exam | | 45 | | |
| # Exam >10% | | 17 | | |
| Max Percent | 0.00% | 60.98% | 2.53% | 1.00% |
| Overlap with the Pretraining Data | | | | |
| Percent | 0.67% | 4.68% | 0.38% | 1.22% |
| # 8-grams | 172 | 7,844 | 151 | 1,212 |
| # Exam | 83 | 486 | 88 | 161 |
| # Exam >10% | | 71 | | 26 |
| Max Percent | 47.22% | 60.96% | 30.77% | 41.18% |
TasksClozeTSenPosPlotComOutGen
Overlap with the Training Sets 
Percent 0.00% 0.62% 0.02% 0.00% 
 
# 8-grams 1,040 
# Exam 45 
# Exam >10% 17 
Max Percent 0.00% 60.98% 2.53% 1.00% 
 
Overlap with the Pretraining Data 
Percent 0.67% 4.68% 0.38% 1.22% 
 
# 8-grams 172 7,844 151 1,212 
# Exam 83 486 88 161 
# Exam >10% 71 26 
Max Percent 47.22% 60.96% 30.77% 41.18% 

To investigate how the overlapping data influence the measurement of model performance, we re-evaluated LongLMlarge on the test sets of SenPos and OutGen after excluding the examples that have more than 10% overlapped 8-grams with the training sets or the pretraining data. We also evaluated mT5base as a baseline under the same setting. The results for SenPos and OutGen are shown in Tables 15 and 16, respectively. The change in accuracy or BLEU-1 score is marginal for both mT5 and LongLM when excluding the overlapping data, suggesting that the superior performance of LongLM is hardly attributable to memorization of the training data. Therefore, we believe it is fair to compare LongLM with other models on these tasks.
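A filter along the following lines could approximately reproduce the “w/o Overlap” subsets; the 10% threshold follows the text, while the plain-string data format and all names are our assumptions.

```python
# Sketch of building a "w/o Overlap" test subset; the 10% threshold follows
# the text above, everything else (names, plain-string format) is assumed.
import jieba

def ngrams(text, n=8):
    tokens = list(jieba.cut(text))
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def without_overlap(test_texts, reference_texts, threshold=0.10, n=8):
    """Drop test examples whose share of 8-grams also seen in the reference
    data (training set or pretraining data) exceeds the threshold."""
    ref_grams = set()
    for text in reference_texts:
        ref_grams |= ngrams(text, n)
    kept = []
    for text in test_texts:
        grams = ngrams(text, n)
        ratio = sum(g in ref_grams for g in grams) / len(grams) if grams else 0.0
        if ratio <= threshold:
            kept.append(text)
    return kept
```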

Table 15: Accuracy on the test set of SenPos. Total means using the whole test set, while w/o Overlap means excluding from the test set the examples that have more than 10% overlapped 8-grams with the training set or the pretraining data. # Exam is the number of examples. Δ denotes the change of accuracy when excluding the overlapping data compared with using the total test set.

SenPos         Total     w/o Overlap (Training Set)      Δ
# Exam         863       846                             N/A
mT5base        61.41%    61.82%                          +0.41%
LongLMlarge    69.41%    69.50%                          +0.09%

SenPos         Total     w/o Overlap (Pretraining Data)  Δ
# Exam         863       792                             N/A
mT5base        61.41%    61.24%                          –0.17%
LongLMlarge    69.41%    69.32%                          –0.09%
Table 16: BLEU-1 score on the test set of OutGen. Other notations are the same as in Table 15.

OutGen         Total     w/o Overlap (Pretraining Data)  Δ
# Exam         729       703                             N/A
mT5base        36.33     36.45                           +0.12
LongLMlarge    42.10     42.22                           +0.12

6 Conclusion

We present LOT, a story-centric benchmark for Chinese long text understanding and generation. LOT includes two story understanding tasks and two story generation tasks, which together examine the abilities of commonsense reasoning, controllable generation, and modeling of inter-sentence relations and global discourse structures. We provide standard datasets for the four tasks, constructed from human-written stories through automatic and manual annotation. Furthermore, we release a new Chinese long text pretraining model, LongLM, which outperforms strong baseline models substantially on both the understanding and generation tasks in LOT. We expect the LOT benchmark and the pretraining model to encourage further research on Chinese long text modeling.

Acknowledgments

This work was supported by the National Science Foundation for Distinguished Young Scholars (no. 62125604) and the NSFC projects (key project no. 61936010 and regular project no. 61876096). This work was also supported by the Guoqiang Institute of Tsinghua University, with grant nos. 2019GQG1 and 2020GQG0005. We would also like to thank our action editor, Dipanjan Das, and the anonymous reviewers for their invaluable suggestions and feedback.

Notes

1. The LOT benchmark, the pretraining resources, and the appendix are available at https://github.com/thu-coai/LOT-LongLM.

3. We set the minimum length of the removed sentence to 10 Chinese characters, and we merge a sentence in a story with its neighbors if it contains fewer than 10 characters.

6. Different from English, paired discourse markers like “because”-“so” should be used together in Chinese.

References

Apoorv Agarwal, Anup Kotalwar, and Owen Rambow. 2013. Automatic extraction of social networks from literary text: A case study on Alice in Wonderland. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1202–1208.
Nader Akoury, Shufan Wang, Josh Whiting, Stephen Hood, Nanyun Peng, and Mohit Iyyer. 2020. STORIUM: A dataset and evaluation platform for machine-in-the-loop story generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6470–6484, Online. Association for Computational Linguistics.
David Bamman, Brendan O’Connor, and Noah A. Smith. 2013. Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 352–361.
Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2019. Abductive commonsense reasoning. In International Conference on Learning Representations.
Faeze Brahman and Snigdha Chaturvedi. 2020. Modeling protagonist emotions for emotion-aware storytelling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5277–5294, Online. Association for Computational Linguistics.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797.
Snigdha Chaturvedi, Mohit Iyyer, and Hal Daume III. 2017. Unsupervised learning of evolving relationships between literary characters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
Snigdha Chaturvedi, Shashank Srivastava, Hal Daume III, and Chris Dyer. 2016. Modeling evolving relationships between characters in literary novels. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
Mingda Chen, Zewei Chu, and Kevin Gimpel. 2019. Evaluation benchmarks and learning criteria for discourse-aware sentence representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 649–662.
Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 657–668, Online. Association for Computational Linguistics.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.
Mark Alan Finlayson. 2012. Learning narrative structure from annotated folktales. Ph.D. thesis, Massachusetts Institute of Technology.
Joseph Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In International Conference on Machine Learning, pages 1243–1252. PMLR.
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur Parikh, Laura Perez Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.
Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108.
Jian Guan and Minlie Huang. 2020. UNION: An unreferenced metric for evaluating open-ended story generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 9157–9166. Association for Computational Linguistics.
Jian Guan, Yansen Wang, and Minlie Huang. 2019. Story ending generation with incremental encoding and commonsense knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6473–6480.
Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu, Wenbiao Ding, Xiaoxi Mao, Changjie Fan, and Minlie Huang. 2021. OpenMEVA: A benchmark for evaluating open-ended story generation metrics. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6394–6407, Online. Association for Computational Linguistics.
Xiangzhe Kong, Jialiang Huang, Ziquan Tung, Jian Guan, and Minlie Huang. 2021. Stylized story generation with style-guided planning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2430–2436, Online. Association for Computational Linguistics.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.
Boyang Li, Stephen Lee-Urban, George Johnston, and Mark Riedl. 2013. Story generation with crowdsourced plot graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 27.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B. Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, Pengcheng Wang, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, Ruofei Zhang, Winnie Wu, Ming Zhou, and Nan Duan. 2020. GLGE: A new general language generation evaluation benchmark. arXiv preprint arXiv:2011.11928.
Annie Louis and Charles Sutton. 2018. Deep Dungeons and Dragons: Learning character-action interactions from role-playing game transcripts. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 708–713.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of NAACL-HLT, pages 839–849.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Debjit Paul and Anette Frank. 2021. COINS: Dynamically generating COntextualized inference rules for narrative story completion. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5086–5099, Online. Association for Computational Linguistics.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.
Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao. 2020. PlotMachines: Outline-conditioned generation with dynamic plot state tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4274–4295.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomás Kociský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. http://arxiv.org/abs/1509.06664.
Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1:1–20.
Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947.
Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 15–25.
Rishi Sharma, James Allen, Omid Bakhshandeh, and Nasrin Mostafazadeh. 2018. Tackling the story ending biases in the story cloze test. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 752–757.
Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2020. Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
Tianming Wang and Xiaojun Wan. 2019. T-CVAE: Transformer-based conditioned variational autoencoder for story completion. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 5233–5239. ijcai.org.
Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, et al. 2020a. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4762–4772.
Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Anima Anandkumar, and Bryan Catanzaro. 2020b. MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 2831–2845. Association for Computational Linguistics.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498.
Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7378–7385.
Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, and Maosong Sun. 2020. CPM: A large-scale generative Chinese pre-trained language model. arXiv preprint arXiv:2012.00413.
Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664.
Zhe Zhao, Hui Chen, Jinbin Zhang, Xin Zhao, Tao Liu, Wei Lu, Xi Chen, Haotang Deng, Qi Ju, and Xiaoyong Du. 2019. UER: An open-source toolkit for pre-training models. EMNLP-IJCNLP 2019, page 241.
