Abstract
Standard multi-task benchmarks are essential for developing pretraining models that can generalize to various downstream tasks. Existing benchmarks for natural language processing (NLP) usually focus only on understanding or generating short texts. However, long text modeling requires many distinct abilities compared with short text modeling, such as modeling long-range discourse and commonsense relations, and maintaining the coherence and controllability of generation. The lack of standardized benchmarks makes it difficult to assess these abilities of a model and fairly compare different models, especially Chinese models. Therefore, we propose a story-centric benchmark named LOT for evaluating Chinese long text modeling, which aggregates two understanding tasks and two generation tasks. We construct new datasets for these tasks based on human-written Chinese stories of hundreds of words. Furthermore, we release an encoder-decoder-based Chinese long text pretraining model named LongLM with up to 1 billion parameters. We pretrain LongLM on 120G of Chinese novels with two generative tasks: text infilling and conditional continuation. Extensive experiments show that LongLM substantially outperforms similar-sized pretraining models on both the understanding and generation tasks in LOT.
1 Introduction
Pretrained language models have achieved significant advances in various natural language understanding (NLU) and generation (NLG) tasks (Devlin et al., 2019; Radford et al., 2019). Standard benchmarks such as GLUE (Wang et al., 2019) further accelerate the improvement and iteration of pretrained models. Popular benchmarks usually aggregate multiple tasks to spur the progress of generalizable models, but they focus mainly on understanding or generating short texts. For example, the GLUE tasks take at most two sentences as input, and most tasks in NLG benchmarks such as GLGE (Liu et al., 2020) and GEM (Gehrmann et al., 2021) require generating only several words (e.g., dialogue generation). Although many models have been pretrained on long texts, such as GPT3 (Brown et al., 2020) and CPM (Zhang et al., 2020), the lack of benchmark datasets makes it difficult to fully assess and compare their long text modeling abilities.
In this paper, we present LOT, a benchmark for evaluating Chinese LOng Text understanding and generation. As shown in Table 1, modeling long texts requires many distinct abilities compared to short texts, including (1) commonsense reasoning regarding characters’ reactions and intentions, and knowledge about physical objects (e.g., “river”) and abstract concepts (e.g., “irony”); (2) modeling discourse-level features such as inter-sentence relations (e.g., causality) and global discourse structures (e.g., the order of events); and (3) generation coherence and controllability, which require both maintaining a coherent plot and adhering to controllable attributes (e.g., topics). Accordingly, LOT contains two understanding tasks and two generation tasks targeting the above abilities. We construct new datasets for these tasks based on various kinds of stories, such as fables and fairy tales collected from public web resources, considering that stories usually contain abundant commonsense and discourse relations. All these tasks require processing stories with hundreds of words. Note that LOT does not involve extra-long texts with thousands of words, since the complicated linguistic phenomena in such texts make it hard to test individual abilities and guide the improvement of generation models.
Table 1 (example story): Effendi’s son is eccentric, always behaving opposite to what Effendi has ordered him to do. Familiar with his son’s temper, Effendi usually communicates using irony. One day, the father and son were blocked by a river after purchasing flour from a mill. While they were crossing the river, one bag on the donkey’s back lost its weight and leaned. Effendi told his son with irony: “My boy! Drop the sack into the river!” The son heard the words and thought: “I have been opposed to my father for so many years. For this only time, I have to obey him.” Therefore, he followed Effendi’s words and indeed pushed the sack into the river. “My boy! What are you doing?” Effendi shouted in anger. ...
Furthermore, we release LongLM, a Chinese Long text pretraining Language Model. LongLM is a Transformer-based model with an encoder-decoder architecture. LongLM has three different versions ranging from 60 million to 1 billion parameters. We pretrain LongLM on 120G Chinese novels with two generative tasks, including text infilling (Lewis et al., 2020) and conditional continuation (Radford et al., 2018). The pretraining data do not include other types of texts (e.g., news, Wiki-texts) since we mainly focus on commonsense and discourse relations within general long texts instead of factual and technical knowledge. To the best of our knowledge, LongLM is the first pretraining model of the same size scale that focuses on modeling long-form stories. Extensive experiments on LOT show that LongLM outperforms strong baselines substantially on both the understanding and generation tasks. However, we also observe that LongLM is still far behind human performance, which requires better semantic representations of events and deeper modeling of the commonsense and discourse relations between them. We summarize the main contributions of this paper as follows:
I. We propose a new story-centric benchmark LOT for evaluating Chinese long text understanding and generation. LOT consists of four tasks for testing the fundamental abilities to model long texts. We also present new datasets for these tasks.
II. We release a new Chinese pretraining model named LongLM. Experiment results demonstrate the strong performance of LongLM on LOT, but there still exists considerable room for improvement.1
2 Related Work
NLP Benchmarks
Recently, many multi-task benchmarks have been proposed to drive the progress of generalizable models. These benchmarks usually aggregate multiple model-agnostic tasks under a unified framework, enabling researchers to fairly compare different models. SentEval (Conneau and Kiela, 2018) gathered multiple classification tasks involving either one or two sentences as inputs to evaluate sentence representations. DiscoEval (Chen et al., 2019) extended these tasks to the discourse level regarding inter-sentence relations. GLUE (Wang et al., 2019) included more diverse tasks such as natural language inference (Rocktäschel et al., 2016). SuperGLUE (Wang et al., 2019) was proposed as a more challenging counterpart of GLUE by introducing multi-sentence tasks, but the additional tasks are limited to the formats of coreference resolution and question answering. In addition to these English benchmarks, many benchmarks have been proposed to evaluate NLU in other languages, such as CLUE (Xu et al., 2020a) for Chinese. Moreover, GLGE (Liu et al., 2020) and GEM (Gehrmann et al., 2021) were proposed for evaluating NLG models across diversified generation tasks such as text summarization and personalized dialogue generation. However, there is no benchmark designed specifically for long text modeling, especially for Chinese. Additionally, the above benchmarks were originally designed to cover task formats as diverse as possible. In contrast, we design the LOT tasks guided by the abilities necessary for long text modeling, as suggested by Ribeiro et al. (2020), making it easier to figure out where models are failing and how to improve them.
Long Text Datasets
Previous studies on long text modeling have frequently used the ROCStories (Mostafazadeh et al., 2016) and WritingPrompts (Fan et al., 2018) datasets. ROCStories contains 100K artificial five-sentence stories, while WritingPrompts consists of 300K pairs of prompts and stories with hundreds of words. More recent works collected stories with thousands of words to model longer-range dependencies, such as WikiText-103 (Merity et al., 2016), roleplayerguild (Louis and Sutton, 2018), PG-19 (Rae et al., 2020), STORIUM (Akoury et al., 2020), and Long-Range Arena (Tay et al., 2020). However, these datasets are in English; LOT instead aims to drive the development of Chinese long text models.
Moreover, LOT does not include datasets of extra-long texts, like PG-19, for the following two reasons: (1) Extra-long texts are far beyond the scope of current machine learning models because the discourse-level linguistic phenomena in these texts are entangled and complicated. Therefore, extra-long texts usually serve for computing the perplexity of language models (Dai et al., 2019) but hardly provide fine-grained guidance for improving model designs. (2) LOT aims not to spur research on building fuller connections across tokens within an extra-long sequence, but to drive the progress of models on the aforementioned fundamental abilities for long text modeling.
Story Understanding and Generation
LOT is centered on fundamental abilities for long text modeling and thus includes four story understanding and generation tasks concerning commonsense and discourse relations. Recent studies have proposed various tasks to evaluate story understanding and generation. First, story ending selection (Mostafazadeh et al., 2016), story ending generation (Guan et al., 2019), and story completion (Wang and Wan, 2019) focused on the commonsense reasoning ability over inter-event causal and temporal relations. Second, Chen et al. (2019) evaluated the ability to model discourse relations by predicting the position of a sentence or a paragraph in a text. Third, some works focused on the coherence of story generation conditioned on short prompts (Fan et al., 2018), titles (Yao et al., 2019), and beginnings (Guan et al., 2020). Fourth, some studies centered on controllability, that is, imposing controllable attributes on story generation, such as keywords (Xu et al., 2020b), emotional trajectories (Brahman and Chaturvedi, 2020), outlines (Rashkin et al., 2020), and styles (Kong et al., 2021). LOT is a comprehensive benchmark that tests the above abilities for Chinese long text modeling.
On the other hand, LOT does not involve those tasks that require learning more particular features of stories, such as event chains (Chambers and Jurafsky, 2008), character types (Bamman et al., 2013), inter-character relations (Chaturvedi et al., 2016, 2017), social networks (Agarwal et al., 2013), and abstractive structures (Finlayson, 2012). Non-neural story generation models usually retrieved events from a knowledge base with pre-specified semantic relations based on handcrafted rules (Li et al., 2013), which are costly and lack generalization. In this paper, we focus mainly on evaluating neural models for story understanding and generation.
3 LOT Benchmark
We design LOT as an aggregation of two understanding tasks including Cloze Test (ClozeT) and Sentence Position Prediction (SenPos), and two generation tasks including Plot Completion (PlotCom) and Outline-conditioned Generation (OutGen). We show the task descriptions and data statistics in Tables 2 and 3, respectively. We use the jieba tokenizer2 for word tokenization.
Table 2: Descriptions of the four LOT tasks.

| Tasks | Abilities | Inputs | Outputs | Metrics |
| --- | --- | --- | --- | --- |
| ClozeT | Commonsense Reasoning | A text with a sentence removed (the position specified); two candidate sentences. | Choosing the correct sentence from the two candidates. | Accuracy |
| SenPos | Inter-sentence Relationship | A text with a sentence removed (the position unspecified); the removed sentence. | Choosing the correct position for the removed sentence. | Accuracy |
| PlotCom | Commonsense Reasoning; Inter-sentence Relationship | A text with a sentence removed (the position specified). | Generating a sentence to complete the text. | BLEU; Dist |
| OutGen | Discourse Structure; Coherence; Controllability | A title; an outline as an out-of-order set of phrases about characters and events. | Generating a coherent text adhering to the title and outline. | BLEU; Dist; Cover; Order |
Table 3: Statistics of the LOT datasets.

| Datasets | Train | Val | Test |
| --- | --- | --- | --- |
| Task: ClozeT | | | |
| # Examples | 644 | 294 | 294 |
| Vocabulary Size | 9k | 7k | 7k |
| Avg. # Char in Input Text | 139.07 | 138.95 | 141.15 |
| Avg. # Word in Input Text | 89.28 | 89.03 | 90.20 |
| Avg. # Sent in Input Text | 5.95 | 5.94 | 5.95 |
| Avg. # Word in Candidate | 15.60 | 16.38 | 15.75 |
| Task: SenPos | | | |
| # Examples | 20,000 | 800 | 863 |
| Vocabulary Size | 147k | 10k | 22k |
| Avg. # Char in Input Text | 289.59 | 258.48 | 258.52 |
| Avg. # Word in Input Text | 254.11 | 224.20 | 223.25 |
| Avg. # Sent in Input Text | 9.61 | 8.43 | 8.44 |
| Avg. # Word in Removed Sent | 30.48 | 29.28 | 30.26 |
| Avg. # Candidate Positions | 8.05 | 6.91 | 6.91 |
| Task: PlotCom | | | |
| # Examples | 13,099 | 465 | 464 |
| Vocabulary Size | 22k | 8k | 8k |
| Avg. # Char in Input Text | 164.35 | 137.67 | 133.26 |
| Avg. # Word in Input Text | 105.48 | 87.56 | 84.98 |
| Avg. # Sent in Input Text | 7.17 | 5.59 | 5.48 |
| Avg. # Word in Output Sent | 15.08 | 15.96 | 16.15 |
| Task: OutGen | | | |
| # Examples | 1,456 | 242 | 729 |
| Vocabulary Size | 19k | 6k | 12k |
| Avg. # Word in Input Title | 4.64 | 4.89 | 4.64 |
| Avg. # Word in Input Outline | 19.20 | 19.05 | 19.47 |
| Avg. # Phrase in Input Outline | 8.00 | 8.00 | 8.00 |
| Avg. # Char in Output Text | 169.94 | 169.80 | 170.49 |
| Avg. # Word in Output Text | 108.91 | 108.68 | 109.04 |
| Avg. # Sent in Output Text | 7.20 | 7.11 | 7.15 |
We design LOT based on the following principles: (1) Task Diversity: The tasks vary in task formats, in the types and lengths of inputs and outputs, and in the abilities they focus on, making LOT a comprehensive framework for evaluating the generalization of models. (2) Task Difficulty: The tasks take hundreds of words as inputs or outputs, and do not involve domain-specific knowledge about science, films, and so forth. They are thus beyond the reach of current state-of-the-art models but solvable by most Chinese native speakers. (3) Task Formulation: The tasks have been well formulated in prior studies and are agreed to be challenging yet meaningful. We introduce new Chinese datasets for these tasks, constructed to focus more specifically on testing a certain ability than the original datasets. (4) Automatic Evaluation: The tasks have reliable automatic metrics for evaluating the focused abilities. We exclude open-ended generation tasks such as story generation from titles, which are difficult to evaluate automatically (Guan et al., 2021) because such tasks suffer from the notorious one-to-many issue: there are many plausible outputs for the same input (Zhao et al., 2017).
We constructed datasets for LOT through automatic and manual annotation. First, we crawled human-written stories from public web pages as the data source. These stories are under licenses that allow use and redistribution for research purposes. Then, we hired a commercial team to create the LOT examples. The team is led by a professional screenwriter and has taken on hundreds of NLP annotation projects. All annotators are native Chinese speakers and well-trained for the annotation tasks. We show the full list of the source web pages and the annotation details in the appendix.
3.1 Cloze Test
Mostafazadeh et al. (2016) introduced the Story Cloze Test (SCT) task for evaluating story comprehension, which requires selecting the right ending from two candidates for a four-sentence leading context. However, SCT suffers from the following issues: (1) Its dataset is artificial and contains innate biases between right and wrong endings in some features such as length (Schwartz et al., 2017; Sharma et al., 2018). Such biases may leak information about the target labels. (2) SCT focuses on reasoning about only endings and neglects other types of reasoning, such as abductive reasoning (Bhagavatula et al., 2019), which requires reasoning about what happens between observed beginnings and endings. (3) SCT limits the scope of commonsense reasoning to realistic events. This limitation may be neither necessary nor sufficient. For example, “Cupid can fly” can be reasoned based on common sense although it is not realistic, while some story settings may be realistic but cannot be reasoned only from the context and common sense, as shown in Table 4. Therefore, when constructing our ClozeT dataset, we adopt the following approaches to alleviate the above issues: (1) All examples are derived from existing human-written stories. (2) We allow annotators to create examples where the removed sentence is initially in the middle of the story. (3) We change the scope of commonsense reasoning to all events that embody characters’ reactions and intentions, or the nature of physical objects and concepts. Table 6 shows two ClozeT examples. Furthermore, we also conducted experiments to investigate the potential biases of our dataset in Section 5.5.
Story Filtering
To ensure the quality of LOT examples, we asked annotators to judge whether each crawled story meets the following definition: “anything which is told in the form of a coherent event sequence involving several specific and related characters” (Mostafazadeh et al., 2016). We provided detailed cases to instruct annotators about this definition. Annotators then refined the stories that do not meet the definition by rewriting the plots. They also cleaned up the stories with the following heuristics: (1) discarding examples that may violate ethical principles (e.g., discrimination); (2) deleting noisy words (e.g., links); (3) changing slang and informal words into standard modern Chinese; and (4) rewriting all dialogues into objective event descriptions. Finally, we collected 2,427 high-quality Chinese stories, which we use to construct the datasets for the ClozeT, PlotCom, and OutGen tasks.
Dataset Construction
We presented the stories to another group of annotators to construct the ClozeT dataset. For each story, they selected a sentence that can be reasoned based on the context and common sense as the right candidate. Table 4 shows an example presented to the annotators to illustrate how to judge whether a sentence satisfies this requirement. Then, the annotators rewrote the selected sentence into a wrong candidate that maintains good topical relatedness with the context but violates common sense. The wrong candidates should either embody unreasonable reactions or intentions, or violate the nature of physical objects or concepts. We required annotators not to select the first sentence, which usually introduces the story setting instead of narrating an event. We browsed through the annotation results and gave the annotators detailed feedback before approving their submissions. Finally, we collected 1,232 examples in total and split them for training, validation, and testing.
3.2 Sentence Position Prediction
We use the sentence position prediction task (Chen et al., 2019) to evaluate the ability to capture inter-sentence relations (e.g., causality). We formulate the task as follows: given a text with a sentence removed, models should choose the correct position of the sentence in the text from multiple candidates. Chen et al. (2019) constructed an English dataset for this task by randomly removing sentences from existing texts. However, such examples may be invalid since a sentence may have multiple plausible positions in a text, as illustrated in Table 5. Therefore, we construct the dataset for our task with the following pipeline: (1) extracting paragraphs with fewer than 500 words from the crawled stories; (2) randomly selecting a sentence to remove from each paragraph, and regarding all positions between two adjacent sentences as candidates3; and (3) asking annotators to refine part of the auto-constructed examples as the validation and test sets, keeping the remainder as the training set. Table 7 shows two SenPos examples.
Table 5 (example):
Text: I couldn’t control my anger very well. [1] My parents would yell at me, and I ran to my room. [2] I buried my head in a pillow and screamed. [3] I threw my pillow and hit it hard.
Removed Sentence: I tried to express my anger.
Dataset Construction
We asked annotators to refine each example so that the removed sentence has only one reasonable position in the text. We did not allow annotators to select the first or last sentence of the original text as the removed sentence because they usually contain obvious wording features (e.g., “once upon a time,” “they lived happily together”), which may make this task trivial. Unlike ClozeT, we allowed the texts for SenPos to be incomplete or include dialogues that also embody rich inter-sentence relations. Finally, we collected 1,663 examples for validation and testing through human annotation. And we constructed 20,000 examples automatically for training.
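The automatic construction of training examples follows directly from the pipeline above. The sketch below illustrates one way to build an example from a paragraph pre-split into sentences; the function name, the exclusion of the first and last sentence (mirroring the restriction applied to the refined validation and test sets), and the output format are illustrative assumptions rather than the authors' code.

```python
import random

def make_senpos_example(sentences):
    """Build one SenPos example from a paragraph split into sentences.

    A sketch: pick a sentence to remove (excluding the first and last,
    as in the refined validation and test sets) and treat every gap
    between two adjacent remaining sentences as a candidate position.
    """
    assert len(sentences) >= 3
    idx = random.randrange(1, len(sentences) - 1)   # sentence to remove
    removed = sentences[idx]
    context = sentences[:idx] + sentences[idx + 1:]
    # Gap g lies between context[g-1] and context[g]; the gold gap is idx.
    candidate_positions = list(range(1, len(context)))
    return {
        "context": context,
        "removed_sentence": removed,
        "candidate_positions": candidate_positions,
        "label": idx,
    }
```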
3.3 Plot Completion
We use the Plot Completion task (Wang and Wan, 2019) to test the ability to make inferences based on common sense. We formulate this task as follows: Given a story with a sentence removed, models should generate a sentence to complete the story and make it reasonable and coherent.
Dataset Construction
Prior studies (Wang and Wan, 2019; Paul and Frank, 2021) automatically constructed datasets for this task by randomly removing one sentence from each story in existing datasets. However, as shown in Table 4, not all sentences in a story can be reasoned only from the context and common sense. Therefore, we used this automatic method only to construct the training data, and adapted the ClozeT data for validation and testing, since annotators had already marked out the qualified sentences. Specifically, we randomly sampled some ClozeT examples and took the incomplete story of each example as input and the right candidate as the target sentence to be generated.
3.4 Outline-conditioned Generation
Prior work tended to test the ability of long text generation through story generation conditioned on inputs with limited information, such as titles (Yao et al., 2019). However, these tasks are so open-ended that it is difficult to reliably measure generation quality using automatic metrics (Guan and Huang, 2020). To alleviate this issue, we introduce the Outline-conditioned Generation task (Rashkin et al., 2020), which requires generating a coherent long-form story conditioned on an outline of characters and events. We formulate the outline as an out-of-order set of phrases, which not only narrows down the set of plausible stories but also tests the controllability and planning ability of models to arrange the given events reasonably at the discourse level.
Dataset Construction
We built the dataset for this task automatically based on the filtered stories. Following Rashkin et al. (2020), we extracted the outline of each story using the RAKE algorithm (Rose et al., 2010), keeping at most eight phrases per story, with each phrase containing no more than eight words. For example, the outline for the story in Table 1 is {“told his son with irony,” “purchasing flour from a mill,” “crossing the river,” “drop the sack into the river,” “indeed pushed the sack,” “familiar to his son’s temper,” “shouted,” “one bag”}. The outline serves as discourse-level guidance for generation models, which should rearrange the events reasonably and generate a story with a good global discourse structure, rather than focus only on modeling local coherence. A minimal sketch of such RAKE-style outline extraction is shown below.
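The following sketch implements the basic RAKE scoring (word degree divided by word frequency, summed over a phrase) on top of jieba tokenization. The stopword set is a tiny placeholder; the actual pipeline would use a full Chinese stopword and punctuation lexicon, and the function name is illustrative.

```python
from collections import defaultdict
import jieba  # word tokenizer used throughout the paper

# Tiny placeholder stopword/punctuation set for illustration only.
STOPWORDS = set("的 了 在 是 和 就 也 而 又 与 着 把 被 对 从 到 ， 。 ！ ？ 、 ： “ ” ； （ ）".split())

def rake_outline(story, max_phrases=8, max_words=8):
    """Extract an outline of at most eight phrases (each at most eight
    words) with a minimal RAKE-style scorer (Rose et al., 2010)."""
    words = jieba.lcut(story)
    # Candidate phrases are maximal runs of tokens between stopwords.
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS or not w.strip():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    phrases = [p for p in phrases if len(p) <= max_words]

    # Word score = degree / frequency; phrase score = sum of word scores.
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)
    score = {w: degree[w] / freq[w] for w in freq}

    ranked = sorted(phrases, key=lambda p: sum(score[w] for w in p), reverse=True)
    return ["".join(p) for p in ranked[:max_phrases]]
```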
3.5 Overall Score
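Judging from the weight rows (wi) and the Overall columns reported in Tables 10 and 11, the overall score of a model is a weighted combination of its per-metric scores; the formula below states only this general form and does not assume how the weights themselves are chosen:

$$\mathrm{Overall} = \sum_{i} w_i \, s_i, \qquad \sum_{i} w_i = 1,$$

where $s_i$ is the model's score on the $i$-th metric (accuracy for the understanding tasks; BLEU, Distinct, Cover, and Order for the generation tasks) and $w_i$ is the corresponding weight listed in the wi rows of Tables 10 and 11.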
4 Long Text Pretraining Model
To provide more flexibility on both understanding and generation tasks, we build LongLM following the original encoder-decoder design of Transformer (Vaswani et al., 2017) with three different sizes, as shown in Table 8. We follow Cui et al. (2020) to use a sentencepiece vocabulary of 32,000 wordpieces (Kudo and Richardson, 2018). And we set the maximum sequence length to 512 for both the encoder and decoder.
Pretraining Data
We collect 120G novels as the pretraining data for LongLM, which cover various topics such as romance, military, and so on. Since a novel is usually much longer than the maximum input and output length of LongLM, we split a novel into multiple segments for pretraining.
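A possible greedy segmentation is sketched below; the function and its sentence-packing strategy are illustrative assumptions, since the paper only specifies the 512-token length limit.

```python
def split_into_segments(sentences, tokenize, max_len=512):
    """Greedily pack consecutive sentences of a novel into segments of
    at most `max_len` tokens. `tokenize` is any callable returning the
    token list of a string (e.g., a sentencepiece encoder)."""
    segments, current, current_len = [], [], 0
    for sent in sentences:
        n = len(tokenize(sent))
        if current and current_len + n > max_len:
            segments.append("".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        segments.append("".join(current))
    return segments
```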
Pretraining Tasks
Encoder-decoder models are typically trained by maximizing the likelihood of the target output given an input. To improve the capacities of both the encoder and decoder, we propose to pretrain LongLM with two tasks: text infilling (Raffel et al., 2020) and conditional continuation (Radford et al., 2019). For the first task, the input is a text in which a number of spans are sampled and replaced by special tokens with unique IDs, while the output consists of the spans delimited by the special tokens used in the input. The lengths of masked spans are drawn from a Poisson distribution with λ=3, and the masked tokens comprise 15% of the original text. For the second task, a text is randomly split into two parts, and the input and output are the front and back halves, respectively. We show an example of the pretraining tasks in Figure 1.
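The sketch below constructs one text-infilling example following the description above. The sentinel token names follow the T5 convention (<extra_id_k>) as an assumption; the actual special tokens used by LongLM may differ.

```python
import random
import numpy as np

def make_infilling_example(tokens, mask_ratio=0.15, mean_span_len=3, max_tries=1000):
    """Construct one text-infilling (span corruption) example.

    Span lengths are drawn from a Poisson distribution (lambda = 3)
    until about 15% of the tokens are masked. Each masked span is
    replaced in the source by a unique sentinel token, and the target
    lists the spans preceded by their sentinels. A sketch of the
    procedure described above, not the authors' exact code.
    """
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    masked = [False] * len(tokens)
    spans, n_masked = [], 0
    for _ in range(max_tries):
        if n_masked >= n_to_mask:
            break
        span_len = max(1, int(np.random.poisson(mean_span_len)))
        start = random.randrange(len(tokens))
        end = min(len(tokens), start + span_len)
        if any(masked[start:end]):
            continue  # avoid overlapping spans
        spans.append((start, end))
        for i in range(start, end):
            masked[i] = True
        n_masked += end - start
    spans.sort()

    source, target, prev = [], [], 0
    for sid, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sid}>"
        source += tokens[prev:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        prev = end
    source += tokens[prev:]
    return source, target
```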
Pretraining Details
We set the learning rate to 1e-4 with the Adam optimizer and the batch size to 1,000. We pretrained LongLM for 2.5M steps. It took about two months to train the largest model using eight NVIDIA V100 GPUs.
Model Performance
To assess the performance of LongLM on the pretraining tasks, we randomly held out 1,000 texts from the pretraining data for testing; these texts were never seen during pretraining. We used perplexity and BLEU-n (n = 3, 4) to evaluate both pretraining tasks. We generated outputs using greedy decoding for the text infilling task, and top-k sampling (Fan et al., 2018) with k = 40 and a softmax temperature of 0.7 (Goodfellow et al., 2014) for the conditional continuation task. As shown in Table 9, the performance improves substantially as the number of parameters increases.
Table 9: Performance of LongLM on the pretraining tasks (TextInfill: text infilling; CondCont: conditional continuation).

| Models | TextInfill PPL | TextInfill BLEU-3/4 | CondCont PPL | CondCont BLEU-3/4 |
| --- | --- | --- | --- | --- |
| LongLMsmall | 11.61 | 73.80/68.96 | 22.91 | 5.30/2.43 |
| LongLMbase | 8.24 | 75.65/71.05 | 17.03 | 5.73/2.64 |
| LongLMlarge | 6.50 | 77.08/72.65 | 14.08 | 8.91/5.97 |
5 Experiments
In this section, we tested LongLM and existing models on LOT with automatic and manual evaluation. Furthermore, we conducted extensive experiments to investigate the potential biases of the ClozeT and SenPos datasets (Section 5.5), and measure the overlap between training and testing data (Section 5.6).
5.1 Evaluated Models
We evaluated the following models, all implemented based on checkpoints registered in HuggingFace Transformers:4 (1) Vanilla Transformer: the same architecture as BERTbase except that the number of layers is set to 3 (Vaswani et al., 2017). (2) BERT: based on the bert-base-chinese checkpoint (Devlin et al., 2019). (3) RoBERTa: based on the hfl/chinese-roberta-wwm-ext checkpoint (Cui et al., 2020). (4) GPT2: based on the uer/gpt2-chinese-cluecorpussmall checkpoint (Zhao et al., 2019). (5) mT5: based on the google/mt5-base checkpoint (Xue et al., 2021). We set all baseline models to the base version due to limited computational resources.
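For reference, these checkpoints can be loaded directly from the HuggingFace hub; the snippet below only loads the backbones, and the task-specific heads described in Section 5.2 would be added on top.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForCausalLM, AutoModelForSeq2SeqLM)

# Backbone checkpoints evaluated in this paper (loading code is illustrative).
bert    = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
roberta = AutoModelForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext")
gpt2    = AutoModelForCausalLM.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
mt5     = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
```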
To show the generic benefits of the pretraining data of LongLM for long text modeling, we pretrained a left-to-right language model from scratch on the data with the standard language modeling objective. This model has the same architecture as GPT2base and is denoted as GPT2†. Moreover, we evaluated two task-specific pretraining models, PlotMachines (PM) (Rashkin et al., 2020) and Plan&Write (PW) (Yao et al., 2019), and two typical non-pretrained models, ConvS2S (Gehring et al., 2017) and Fusion (Fan et al., 2018), on the generation tasks in LOT. We used GPT2base as the backbone model of PM and PW. For PM, we regard the input sentences (for PlotCom) or input phrases (for OutGen) as the plot elements used in the memory network, and update the memory representations at each decoding step. For PW, we take a keyword extracted from the target sentence using the RAKE algorithm (for PlotCom) or the input phrases sorted in order (for OutGen) as the intermediate representations for planning. We implemented these models based on the code provided by the original papers.
5.2 Experiment Settings
Understanding Tasks
For both tasks, we encode the input of each example and then predict a distribution over all candidates by normalizing the dot-product values between the representations of each candidate and the context. We use the candidate with the maximum probability as the prediction result. For ClozeT, we represent a candidate using the hidden state at the end of it, and we regard the hidden state at the position of the removed sentence appearing in the original text as the context representation. And for SenPos, we take the hidden state at each candidate position as the candidate representation and the hidden state at the end of the removed sentence as the context representation. When evaluating mT5 and LongLM, we feed the same input into the encoder and decoder (Lewis et al., 2020) and use the hidden states of the decoder for prediction in the above way.
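A minimal sketch of this scoring head is shown below (PyTorch, with illustrative tensor names); it assumes the relevant hidden states have already been gathered from the encoder or decoder outputs as described above.

```python
import torch
import torch.nn.functional as F

def choose_candidate(candidate_states, context_state):
    """Score candidates by the dot product between each candidate
    representation and the context representation, then normalize.

    candidate_states: [num_candidates, hidden]  e.g., the hidden state
        at the end of each candidate (ClozeT) or at each candidate
        position (SenPos).
    context_state: [hidden]  e.g., the hidden state at the removed
        sentence's position (ClozeT) or at its end (SenPos).
    """
    logits = candidate_states @ context_state       # [num_candidates]
    probs = F.softmax(logits, dim=-1)
    return probs, int(torch.argmax(probs))
```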
Generation Tasks
For PlotCom, we take the incomplete story of an example as input to generate the missing sentence. And for OutGen, we concatenate all phrases in an outline with special tokens as input to generate a story.
Hyper-Parameters
For all models, we set the batch size to 12, the maximum sequence length to 512, and the learning rate to 3e-5. For the generation tasks, we decode outputs using top-k sampling with k = 40 and a softmax temperature of 0.7.
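The decoding scheme is standard top-k sampling with temperature; a minimal per-step sketch (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def top_k_sample(logits, k=40, temperature=0.7):
    """Sample one next-token id from the model's next-token logits
    using top-k sampling (k = 40) with a softmax temperature of 0.7."""
    logits = logits / temperature
    topk_logits, topk_ids = torch.topk(logits, k)   # keep the k largest logits
    probs = F.softmax(topk_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_ids[choice])
```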
5.3 Automatic Evaluation
Metrics
We use accuracy to evaluate the understanding tasks. For the generation tasks, we use BLEU-n (B-n) and Distinct-n (D-n) to evaluate n-gram overlap with the ground-truth texts (Papineni et al., 2002) and n-gram generation diversity (Li et al., 2016), respectively, with n = 1, 2 for both tasks. Additionally, we use the following two metrics for OutGen: (1) Coverage (Cover): It evaluates generation controllability and is computed as the average Rouge-L recall score (Lin, 2004) between the generated text and each input phrase; a higher coverage score indicates that the generated text covers more input phrases. (2) Order: It measures the gap between the positional orders of the input phrases in the generated texts and in the ground-truth texts. Specifically, we compute the order score based on the ratio of the number of inversions in the generated story to the number of all position pairs of any two phrases, where an inversion is a position pair that is out of the ground-truth order. We use the position of the longest common subsequence between a story and a phrase as the position of the phrase in the story. Because an input phrase does not always appear in the generated story, we regard all position pairs involving such a phrase as inversions.
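A sketch of the two OutGen-specific metrics is given below. The coverage computation operates at the character level for simplicity, and the order function returns the fraction of correctly ordered phrase pairs (i.e., one minus the inversion ratio), matching the convention in Table 11 where the ground truth scores 100; these choices are assumptions about the exact implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def coverage(story, phrases):
    """Average Rouge-L recall of each input phrase against the story
    (LCS length divided by phrase length, at the character level)."""
    return sum(lcs_length(p, story) / len(p) for p in phrases) / len(phrases)

def order_score(gen_positions, ref_positions):
    """Fraction of phrase pairs whose relative order in the generated
    story agrees with the ground truth. A position of None means the
    phrase was not found; all pairs involving it count as inversions."""
    n = len(ref_positions)
    correct = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            if gen_positions[i] is None or gen_positions[j] is None:
                continue  # counted as an inversion
            if (gen_positions[i] < gen_positions[j]) == (ref_positions[i] < ref_positions[j]):
                correct += 1
    return correct / total if total else 1.0
```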
Results
Tables 10 and 11 show the results on the understanding and generation tasks, respectively. To obtain human performance on the understanding tasks, we randomly sampled 100 examples from the validation set or test set and hired three crowd-sourced annotators (native Chinese speakers) to do these tasks. We made final decisions through majority voting. All results show an almost perfect inter-annotator agreement with Fleiss's κ > 0.85 (Fleiss, 1971). For the generation tasks, we regard the scores of the ground-truth texts as human performance.
Table 10: Accuracy (%) on the understanding tasks. # P: number of parameters.

| Models | # P | ClozeT | SenPos | Overall |
| --- | --- | --- | --- | --- |
| Validation Set | | | | |
| Transformer | 38M | 55.78 | 17.38 | 31.46 |
| BERTbase | 102M | 70.75 | 40.13 | 51.36 |
| RoBERTabase | 102M | 72.11 | 51.63 | 59.14 |
| GPT2base | 102M | 70.07 | 37.78 | 49.62 |
| GPT2base† | 102M | 74.49 | 39.25 | 52.17 |
| mT5base | 582M | 72.45 | 63.25 | 66.62 |
| LongLMsmall | 60M | 73.81 | 48.75 | 57.94 |
| LongLMbase | 223M | 75.17 | 64.38 | 68.34 |
| LongLMlarge | 1B | 79.93 | 70.00 | 73.64 |
| Humans | N/A | 99.00 | 97.00 | 97.73 |
| wi | N/A | 0.37 | 0.63 | 1.00 |
| Test Set | | | | |
| Transformer | 38M | 54.42 | 16.34 | 31.23 |
| BERTbase | 102M | 69.39 | 43.68 | 53.74 |
| RoBERTabase | 102M | 67.69 | 51.35 | 57.74 |
| GPT2base | 102M | 73.13 | 37.25 | 51.28 |
| GPT2base† | 102M | 76.87 | 39.28 | 53.98 |
| mT5base | 582M | 75.17 | 61.41 | 66.79 |
| LongLMsmall | 60M | 77.21 | 53.07 | 62.51 |
| LongLMbase | 223M | 77.55 | 62.34 | 68.29 |
| LongLMlarge | 1B | 80.61 | 69.41 | 73.39 |
| Humans | N/A | 100.00 | 98.00 | 98.78 |
| wi | N/A | 0.39 | 0.61 | 1.00 |
Table 11: Results on the generation tasks (B-n: BLEU-n; D-n: Distinct-n). # P: number of parameters.

| Models | # P | PlotCom B-1 | PlotCom B-2 | PlotCom D-1 | PlotCom D-2 | OutGen B-1 | OutGen B-2 | OutGen D-1 | OutGen D-2 | OutGen Cover | OutGen Order | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Validation Set | | | | | | | | | | | | |
| ConvS2S | 58M | 18.92 | 4.18 | 6.31 | 32.18 | 29.23 | 10.38 | 3.45 | 21.79 | 14.81 | 25.34 | 11.85 |
| Fusion | 109M | 20.56 | 4.69 | 8.63 | 35.73 | 29.22 | 10.34 | 3.39 | 22.67 | 17.41 | 26.55 | 12.61 |
| GPT2base | 102M | 22.67 | 6.22 | 24.75 | 70.57 | 30.43 | 14.87 | 10.95 | 44.38 | 60.90 | 55.52 | 20.24 |
| GPT2base† | 102M | 22.49 | 5.43 | 26.88 | 74.87 | 35.29 | 18.31 | 13.89 | 51.36 | 64.01 | 57.64 | 21.73 |
| PM | 102M | 22.11 | 5.49 | 23.89 | 69.74 | 31.81 | 14.94 | 12.99 | 50.56 | 62.98 | 56.75 | 20.45 |
| PW | 102M | 22.45 | 5.57 | 25.64 | 71.54 | 35.84 | 18.47 | 11.86 | 47.62 | 64.93 | 57.30 | 21.48 |
| mT5base | 582M | 22.56 | 6.46 | 24.44 | 71.31 | 36.71 | 22.25 | 14.52 | 50.01 | 77.98 | 63.15 | 23.53 |
| LongLMsmall | 60M | 21.78 | 7.11 | 20.17 | 59.63 | 35.03 | 19.17 | 10.80 | 39.70 | 62.53 | 56.53 | 21.02 |
| LongLMbase | 223M | 22.91 | 8.28 | 22.16 | 63.54 | 40.33 | 24.29 | 14.66 | 51.82 | 79.60 | 62.78 | 24.75 |
| LongLMlarge | 1B | 23.76 | 8.70 | 25.93 | 72.18 | 42.79 | 24.91 | 16.13 | 57.71 | 80.46 | 64.36 | 26.12 |
| Truth | N/A | 100.00 | 100.00 | 35.32 | 84.33 | 100.00 | 100.00 | 21.66 | 71.43 | 100.00 | 100.00 | 92.23 |
| wi | N/A | 0.11 | 0.40 | 0.04 | 0.03 | 0.08 | 0.17 | 0.05 | 0.04 | 0.04 | 0.04 | 1.00 |
| Test Set | | | | | | | | | | | | |
| ConvS2S | 58M | 19.60 | 4.20 | 6.00 | 32.42 | 29.00 | 10.14 | 1.60 | 13.95 | 15.45 | 25.77 | 11.27 |
| Fusion | 109M | 20.52 | 4.90 | 8.43 | 35.09 | 28.77 | 10.22 | 1.47 | 14.12 | 17.10 | 26.36 | 11.91 |
| GPT2base | 102M | 22.94 | 5.76 | 24.69 | 70.30 | 30.17 | 14.91 | 7.62 | 36.87 | 60.87 | 55.90 | 19.21 |
| GPT2base† | 102M | 22.45 | 5.38 | 26.08 | 73.26 | 35.79 | 18.68 | 9.89 | 43.52 | 64.43 | 56.96 | 20.76 |
| PM | 102M | 22.87 | 5.75 | 24.08 | 71.19 | 31.85 | 15.24 | 8.62 | 41.32 | 63.15 | 57.21 | 19.77 |
| PW | 102M | 22.76 | 6.07 | 25.55 | 70.72 | 35.12 | 17.96 | 8.68 | 40.17 | 63.70 | 55.17 | 20.52 |
| mT5base | 582M | 22.52 | 6.48 | 24.33 | 70.53 | 36.33 | 22.07 | 10.90 | 43.65 | 78.66 | 63.79 | 22.59 |
| LongLMsmall | 60M | 22.05 | 7.45 | 19.93 | 59.79 | 34.48 | 19.17 | 7.93 | 34.25 | 63.75 | 57.64 | 20.48 |
| LongLMbase | 223M | 23.28 | 8.58 | 21.37 | 62.43 | 40.25 | 24.15 | 10.75 | 44.40 | 79.88 | 63.67 | 23.93 |
| LongLMlarge | 1B | 24.20 | 9.06 | 25.75 | 71.08 | 42.10 | 24.77 | 12.04 | 50.29 | 81.48 | 64.82 | 25.29 |
| Truth | N/A | 100.00 | 100.00 | 35.01 | 84.56 | 100.00 | 100.00 | 15.71 | 63.46 | 100.00 | 100.00 | 91.64 |
| wi | N/A | 0.10 | 0.42 | 0.03 | 0.03 | 0.08 | 0.16 | 0.05 | 0.04 | 0.04 | 0.04 | 1.00 |
We summarize the evaluation results as follows: (1) Pretrained models perform significantly better than non-pretrained models. (2) LongLMlarge outperforms the other baselines substantially on both the understanding and generation tasks, and LongLMbase/LongLMsmall achieves better overall scores than mT5/GPT2 with far fewer parameters. (3) Comparing GPT2† with GPT2 shows that our pretraining data can effectively improve the ability to model long texts. (4) LongLMsmall performs better than GPT2† on the understanding tasks and is comparable with GPT2† on the generation tasks, suggesting the benefits of the encoder-decoder framework and the text infilling task. (5) It is still extremely challenging for all models to capture the commonsense and inter-sentence discourse relations between events in long texts, as required by the ClozeT and SenPos tasks. Furthermore, we investigate how the size of the training data influences the accuracy of BERT on SenPos; the result in Figure 2 indicates the necessity of developing better representations of discourse relations instead of relying only on increasing the data size. (6) The results on the generation tasks show that LongLM generates outputs with more word overlap with the references than similar-sized baselines on both tasks, and covers more input phrases and arranges them in the correct order for OutGen. However, LongLM underperforms the GPT2-based models in terms of diversity on PlotCom. (7) Dynamically tracking plot states (i.e., PM) does not bring significant improvement on the generation tasks compared with GPT2, suggesting that tackling these tasks may require modeling the discourse structure explicitly. The superiority of PW over GPT2 on OutGen further indicates the benefit of modeling discourse-level features. In summary, we believe LOT will serve as an effective benchmark for evaluating the ability to capture the commonsense and discourse relations of long texts beyond the surface events, and to generate coherent and controllable long-form texts.
5.4 Manual Evaluation
Because automatic metrics may be unreliable for evaluating NLG (Guan and Huang, 2020), we conducted a point-wise manual evaluation to measure the disparity between machines and humans for the generation tasks in LOT. For each task, we randomly sampled 100 examples from the test set and obtained 100 ground-truth texts and 300 generated texts from three typical models including GPT2base, mT5base and LongLMlarge. For each text along with the input, we hired three crowd-sourced workers to judge its quality with a binary score (1 for good, and 0 otherwise) in terms of three aspects: (1) grammaticality (intra-sentence grammar quality of generated texts), (2) coherence (causal and temporal dependencies within generated texts), and (3) relatedness to inputs (reasonable logical connections to the input context for PlotCom; and reasonable utilization of input phrases for OutGen). These aspects are independently evaluated. We made final decisions among three annotators through majority voting. We show the annotation instructions in the appendix.
Table 12 shows the evaluation results. For both tasks, LongLM outperforms GPT2 and mT5 significantly in all aspects (p < 0.05, sign test). However, it is difficult for all models to generate a logical completion for PlotCom (relatedness score < 0.1), showing their poor ability to capture commonsense and inter-sentence relations. The large gap between LongLM and humans also shows that both tasks remain challenging for existing generation models. We also observe a positive correlation between the manual and automatic evaluation results (Table 11), suggesting that it may be acceptable to use automatic evaluation to compare and improve models on the generation tasks in LOT.
Table 12: Manual evaluation results for grammaticality (Gram), coherence (Cohe), and relatedness (Relat), with inter-annotator agreement (κ) in parentheses.

| Models | Gram (κ) | Cohe (κ) | Relat (κ) |
| --- | --- | --- | --- |
| Task: PlotCom | | | |
| GPT2base | 0.84 (0.49) | 0.41 (0.71) | 0.01 (0.50) |
| mT5base | 0.85 (0.24) | 0.53 (0.65) | 0.01 (0.50) |
| LongLMlarge | 0.95 (0.48) | 0.82 (0.64) | 0.09 (0.69) |
| Truth | 1.00 (1.00) | 1.00 (1.00) | 0.99 (0.49) |
| Task: OutGen | | | |
| GPT2base | 0.54 (0.52) | 0.18 (0.52) | 0.39 (0.43) |
| mT5base | 0.53 (0.26) | 0.08 (0.46) | 0.49 (0.38) |
| LongLMlarge | 0.81 (0.23) | 0.37 (0.43) | 0.62 (0.45) |
| Truth | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
5.5 Bias Investigation
It is essential to investigate the potential biases of a dataset, which may leak information about the target labels and enable models to use shortcuts to handle complex inputs without actually mastering the focused abilities (Ribeiro et al., 2020). Therefore, we experimented with the following baselines to inspect the ClozeT and SenPos datasets: (1) Random: It chooses a candidate randomly. (2) Majority: It chooses the candidate with the index most frequently selected in the training set. (3) Length: For ClozeT, it chooses the candidate that contains more words; for SenPos, it chooses the position whose adjacent sentences have the number of words closest to that of the removed sentence. (4) BLEU-n: For ClozeT, it chooses the candidate with a higher BLEU-n score (Papineni et al., 2002) with the context; for SenPos, it chooses the position whose adjacent sentences have the largest average BLEU-n score with the removed sentence (n = 1, 2). (5) Sentiment: For ClozeT, it chooses the candidate with a higher sentiment score computed by an off-the-shelf Chinese sentiment analyzer;5 for SenPos, it chooses the position where the average sentiment score of its two adjacent sentences is closest to the score of the removed sentence. (6) Discourse Markers: For ClozeT, it chooses the candidate whose adjacent sentences contain a discourse marker matching it. For example, if “because” occurs in the last sentence before the position of the candidates, this baseline will choose the candidate that contains “so”.6 If there are no such paired markers in an example or there are multiple eligible candidates, this baseline chooses randomly. The setting of this baseline for SenPos is similar to ClozeT. We manually define 24 marker pairs for this baseline. (7) BERT w/o Context: We fine-tuned BERT to choose among the candidates directly, without taking the context as input (Schwartz et al., 2017). (8) BERT w/o Long: It is used to study whether solving these tasks requires modeling long-range dependencies. For ClozeT, we fine-tuned BERT to choose with only the sentences adjacent to the removed sentence as input; for SenPos, we encoded each position and its adjacent sentences separately using BERT and then took the hidden states at these positions for prediction. These baselines cover different levels of features ranging from the token level (e.g., Length) and the sentence level (e.g., Sentiment) to the discourse level (e.g., Discourse Markers, BERT w/o Context). We believe that they provide a comprehensive inspection of the potential biases of our datasets.
As shown in Table 13, neither task can be trivially solved by these baselines, suggesting that the datasets may be free of biases in terms of the above features. Therefore, we believe that the tasks can focus on testing the ability of models to capture long-range commonsense and discourse relations.
Table 13: Accuracy (%) of the bias-inspection baselines.

| Baselines | ClozeT | SenPos |
| --- | --- | --- |
| Random | 50.00 | 16.03 |
| Majority | 52.72 | 16.24 |
| Length | 52.72 | 16.45 |
| BLEU-1/2 | 46.94/48.98 | 14.14/14.95 |
| Sentiment | 50.34 | 16.49 |
| Discourse Markers | 45.92 | 9.15 |
| BERT w/o Context | 57.82 | 18.08 |
| BERT w/o Long | 62.24 | 19.00 |
| BERT | 69.39 | 43.68 |
5.6 Memorization Investigation
Overlap between training and test data may result in over-reporting of the generalization performance of models. Therefore, it is necessary to investigate how much of the test data also shows up in the training data. To this end, we follow Radford et al. (2019) and measure the overlap between two datasets by calculating the percentage of 8-grams from one that also occur in the other. We use the jieba tokenizer for tokenization.
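A minimal sketch of this measurement is shown below; the function names and the set-based bookkeeping are illustrative.

```python
import jieba

def eight_gram_set(text, n=8):
    """All 8-grams over the jieba word tokens of a text."""
    words = jieba.lcut(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_percentage(test_texts, train_texts, n=8):
    """Percentage of 8-grams in the test texts that also occur in the
    training texts."""
    train_grams = set()
    for t in train_texts:
        train_grams |= eight_gram_set(t, n)
    test_grams = [g for t in test_texts for g in eight_gram_set(t, n)]
    if not test_grams:
        return 0.0
    hits = sum(g in train_grams for g in test_grams)
    return 100.0 * hits / len(test_grams)
```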
Table 14 shows the overlap analysis for the test sets of the four tasks in LOT. All test sets have less than 1% overlap with their own training sets. Notably, 17 test examples of SenPos contain more than 10% overlapped 8-grams with the training set. This is because a training example and a test example may come from the same story and thus share similar information (e.g., characters, locations). A test example contains at most 60.98% overlapped 8-grams, indicating that the training set and test set do not include exactly the same example. As for the pretraining data of LongLM, the test sets of ClozeT and PlotCom still have less than 1% overlap. However, dozens of test examples in SenPos and OutGen contain more than 10% overlapped 8-grams. Through manual inspection, we found that these overlaps mainly come from idioms, proverbs, and classic fairy tales, which may be part of some novels in the pretraining data.
Table 14: Overlap between the test sets and the training sets / pretraining data.

| Tasks | ClozeT | SenPos | PlotCom | OutGen |
| --- | --- | --- | --- | --- |
| Overlap with the Training Sets | | | | |
| Percent | 0.00% | 0.62% | 0.02% | 0.00% |
| # 8-grams | 0 | 1,040 | 6 | 2 |
| # Exam | 0 | 45 | 3 | 2 |
| # Exam >10% | 0 | 17 | 0 | 0 |
| Max Percent | 0.00% | 60.98% | 2.53% | 1.00% |
| Overlap with the Pretraining Data | | | | |
| Percent | 0.67% | 4.68% | 0.38% | 1.22% |
| # 8-grams | 172 | 7,844 | 151 | 1,212 |
| # Exam | 83 | 486 | 88 | 161 |
| # Exam >10% | 4 | 71 | 1 | 26 |
| Max Percent | 47.22% | 60.96% | 30.77% | 41.18% |
To investigate how the overlapping data influence the measurement of model performance, we re-evaluated LongLMlarge on the test sets of SenPos and OutGen, excluding the examples that have more than 10% overlapped 8-grams with the training sets or the pretraining data. We also evaluated mT5base as a baseline under the same setting. The results for SenPos and OutGen are shown in Tables 15 and 16, respectively. The change in accuracy or BLEU-1 score is marginal for both mT5 and LongLM when the overlapping data are excluded, suggesting that the superior performance of LongLM is hardly attributable to memorization of the training data. Therefore, we believe that it is fair to compare LongLM and other models on these tasks.
Table 15: Accuracy on SenPos with and without the overlapping test examples.

| SenPos | Total | w/o Overlap (Training Set) | Δ |
| --- | --- | --- | --- |
| # Exam | 863 | 846 | N/A |
| mT5base | 61.41% | 61.82% | +0.41% |
| LongLMlarge | 69.41% | 69.50% | +0.09% |
| SenPos | Total | w/o Overlap (Pretraining Data) | Δ |
| # Exam | 863 | 792 | N/A |
| mT5base | 61.41% | 61.24% | –0.17% |
| LongLMlarge | 69.41% | 69.32% | –0.09% |
6 Conclusions
We present LOT, a story-centric benchmark for Chinese long text understanding and generation. LOT includes two story understanding tasks and two story generation tasks, which comprehensively investigate the abilities of commonsense reasoning, controllable generation, and modeling inter-sentence relations and the global discourse structures. We provide standard datasets for the four tasks, which are constructed based on human-written stories processed by automatic and manual annotation. Furthermore, we release a new Chinese long text pretraining model LongLM, which outperforms strong baseline models substantially on both the understanding and generation tasks in LOT. The LOT benchmark and the pretraining model will encourage further research on Chinese long text modeling.
Acknowledgments
This work was supported by the National Science Foundation for Distinguished Young Scholars (no. 62125604) and the NSFC projects (Key project no. 61936010 and regular project no. 61876096). This work was also supported by the Guoqiang Institute of Tsinghua University, with grant nos. 2019GQG1 and 2020GQG0005. We would also like to thank our action editor, Dipanjan Das, and the anonymous reviewers for their invaluable suggestions and feedback.
Notes
1. The LOT benchmark, the pretraining resources, and the appendix are available at https://github.com/thu-coai/LOT-LongLM.
3. We set the minimum length of the removed sentence to 10 Chinese characters, and we merge a sentence in a story with its neighbors if it contains fewer than 10 characters.
6. Different from English, paired discourse markers like “because”-“so” should be used together in Chinese.