LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text Understanding and Generation

Standard multi-task benchmarks are essential for developing pretraining models that can generalize to various downstream tasks. Existing benchmarks for natural language processing (NLP) usually focus only on understanding or generating short texts. However, long text modeling requires many abilities that short text modeling does not, such as modeling long-range discourse and commonsense relations, and maintaining coherence and controllability during generation. The lack of standardized benchmarks makes it difficult to assess these abilities of a model and fairly compare different models, especially for Chinese. Therefore, we propose a story-centric benchmark named LOT for evaluating Chinese long text modeling, which aggregates two understanding tasks and two generation tasks. We construct new datasets for these tasks based on human-written Chinese stories with hundreds of words. Furthermore, we release an encoder-decoder-based Chinese long text pretraining model named LongLM with up to 1 billion parameters. We pretrain LongLM on 120G of Chinese novels with two generative tasks, including text infilling and conditional continuation. Extensive experiments show that LongLM substantially outperforms similar-sized pretraining models on both the understanding and generation tasks in LOT.


Introduction
Pretrained language models have achieved significant advances in various natural language understanding (NLU) and generation (NLG) tasks (Devlin et al., 2019; Radford et al., 2019). Standard benchmarks such as GLUE (Wang et al., 2019) further boost the improvement and fast iteration of pretrained models. Popular benchmarks usually aggregate multiple tasks to spur the progress of generalizable models. But these benchmarks focus mainly on understanding or generating short texts. For example, the GLUE tasks take at most two sentences as input. And most tasks in NLG benchmarks such as GLGE (Liu et al., 2020) and GEM (Gehrmann et al., 2021) require generating only several words (e.g., dialogue generation). Although there have been many models pretrained on long texts, such as GPT3 (Brown et al., 2020) and CPM (Zhang et al., 2020), the lack of benchmark datasets makes it difficult to fully assess and compare their abilities of long text modeling.
In this paper, we present LOT, a benchmark for evaluating Chinese LOng Text understanding and generation. As shown in Table 1, modeling long texts requires many distinct abilities compared to short texts, including (1) commonsense reasoning regarding characters' reactions and intentions, and knowledge about physical objects (e.g., "river") and abstract concepts (e.g., "irony"); (2) modeling discourse-level features such as inter-sentence relations (e.g., causality) and global discourse structures (e.g., the order of events); and (3) generation coherence and controllability, which require both maintaining a coherent plot and adhering to controllable attributes (e.g., topics). Accordingly, LOT contains two understanding tasks and two generation tasks regarding the above abilities. We construct new datasets for these tasks based on various kinds of stories, such as fables and fairy tales, collected from public web resources, considering that stories usually contain abundant commonsense and discourse relations. All these tasks require processing stories with hundreds of words. Note that LOT does not involve extra-long texts with thousands of words since the complicated linguistic phenomena in such texts make it hard to test individual abilities and guide the improvement of generation models.
Furthermore, we release LongLM, a Chinese Long text pretraining Language Model. LongLM is a Transformer-based model with an encoder-decoder architecture. LongLM has three different versions ranging from 60 million to 1 billion parameters. We pretrain LongLM on 120G of Chinese novels with two generative tasks, including text infilling (Lewis et al., 2020) and conditional continuation (Radford et al., 2018). The pretraining data do not include other types of texts (e.g., news, Wiki-texts) since we mainly focus on commonsense and discourse relations within general long texts instead of factual and technical knowledge. To the best of our knowledge, LongLM is the first pretraining model of this size scale that focuses on modeling long-form stories. Extensive experiments on LOT show that LongLM outperforms strong baselines substantially on both the understanding and generation tasks. However, we also observe that LongLM still lags far behind human performance, which calls for better semantic representations of events and deeper modeling of the commonsense and discourse relations between them. We summarize the main contributions of this paper as follows: I. We propose a new story-centric benchmark LOT for evaluating Chinese long text understanding and generation. LOT consists of four tasks for testing the fundamental abilities to model long texts. We also present new datasets for these tasks. II. We release a new Chinese pretraining model named LongLM. Experiment results demonstrate the strong performance of LongLM on LOT, but there still exists huge room for improvement.


Related Work

NLP Benchmarks
Recently, many multi-task benchmarks have been proposed to drive the progress of generalizable models. These benchmarks usually aggregate multiple model-agnostic tasks under a unified framework, enabling researchers to fairly compare different models. SentEval (Conneau and Kiela, 2018) gathered multiple classification tasks involving either one or two sentences as inputs to evaluate sentence representations. DiscoEval (Chen et al., 2019) extended these tasks to the discourse level regarding inter-sentence relations. GLUE (Wang et al., 2019) included more diverse tasks such as natural language inference (Rocktäschel et al., 2016). Wang et al. (2019) further proposed SuperGLUE as a more challenging counterpart of GLUE by introducing multi-sentence tasks. But the additional tasks are limited to the formats of coreference resolution and question answering. In addition to these English benchmarks, many benchmarks were proposed to evaluate NLU for other languages, such as CLUE (Xu et al., 2020a) for Chinese. Moreover, GLGE (Liu et al., 2020) and GEM (Gehrmann et al., 2021) were proposed for evaluating NLG models across diversified generation tasks such as text summarization and personalized dialogue. However, there is no benchmark designed specifically for long text modeling, especially for Chinese. Additionally, the above benchmarks were originally designed to cover task formats as diverse as possible. In contrast, we design the LOT tasks with the guidance of the abilities necessary for long text modeling, as suggested by Ribeiro et al. (2020), making it easier to figure out where models are failing and how to improve them.
Long Text Datasets
Previous studies in the field of long text modeling have frequently focused on the ROCStories (Mostafazadeh et al., 2016) and WritingPrompts (Fan et al., 2018) datasets. ROCStories contains 100K artificial five-sentence stories, while WritingPrompts consists of 300K pairs of prompts and stories with hundreds of words. Recent works collected stories with thousands of words to model longer-range dependencies, such as WikiText-103 (Merity et al., 2016), roleplayerguild (Louis and Sutton, 2018), PG-19 (Rae et al., 2020), STORIUM (Akoury et al., 2020) and Long-Range Arena (Tay et al., 2020). However, these datasets are written in English. LOT will drive the development of Chinese language models. Moreover, LOT does not include datasets of extra-long texts like PG-19 for the following two reasons: (1) Extra-long texts are far beyond the scope of current machine learning models because the discourse-level linguistic phenomena in these texts are entangled and complicated. Therefore, extra-long texts usually serve for computing the perplexity of language models (Dai et al., 2019) but hardly provide fine-grained guidance for improving model designs. (2) LOT aims not to spur research on building fuller connections across tokens within an extra-long sequence, but to drive the progress of machines in the aforementioned fundamental abilities for long text modeling.
Story Understanding and Generation
LOT is centered on fundamental abilities for long text modeling and thus includes four story understanding and generation tasks concerning commonsense and discourse relations. Recent studies have proposed various tasks to evaluate story understanding and generation. Firstly, story ending selection (Mostafazadeh et al., 2016), story ending generation (Guan et al., 2019) and story completion (Wang and Wan, 2019) focused on the ability of commonsense reasoning over inter-event causal and temporal relations. Secondly, Chen et al. (2019) evaluated the ability to model discourse relations by predicting the position of a sentence or a paragraph in a text. Thirdly, some works focused on the coherence of story generation conditioned on short prompts (Fan et al., 2018), titles (Yao et al., 2019) and beginnings (Guan et al., 2020). Fourthly, some studies centered on controllability, i.e., imposing controllable attributes on story generation such as keywords (Xu et al., 2020b), emotional trajectories (Brahman and Chaturvedi, 2020), outlines (Rashkin et al., 2020) and styles (Kong et al., 2021). LOT is a comprehensive benchmark to test the above abilities for Chinese long text modeling.
On the other hand, LOT does not involve tasks that require learning more particular features of stories, such as event chains (Chambers and Jurafsky, 2008), character types (Bamman et al., 2013), inter-character relations (Chaturvedi et al., 2016, 2017), social networks (Agarwal et al., 2013) and abstractive structures (Finlayson, 2012). Non-neural story generation models usually retrieved events from a knowledge base with pre-specified semantic relations based on handcrafted rules (Li et al., 2013), which are costly and lack generalization. In this paper, we focus mainly on evaluating neural models for story understanding and generation.

LOT Benchmark
We design LOT as an aggregation of two understanding tasks, Cloze Test (ClozeT) and Sentence Position Prediction (SenPos), and two generation tasks, Plot Completion (PlotCom) and Outline-conditioned Generation (OutGen). We show the task descriptions and data statistics in Tables 2 and 3, respectively. We use the jieba tokenizer for word tokenization.
We design LOT based on the following principles: (1) Task Diversity: The tasks vary in task format, type and length of inputs and outputs, and focused abilities, making LOT a comprehensive framework for evaluating the generalization of models. (2) Task Difficulty: The tasks take hundreds of words as inputs or outputs, and do not involve domain-specific knowledge about science, films, etc. Therefore, they are beyond the scope of current state-of-the-art models, but are solvable by most Chinese native speakers.
(3) Task Formulation: The tasks have been well formulated in prior studies and are agreed to be challenging but meaningful. We introduce new Chinese datasets for these tasks, which are constructed to focus more specifically on testing a certain ability than the original datasets. (4) Automatic Evaluation: The tasks have reliable automatic metrics to evaluate the focused abilities. We exclude open-ended generation tasks such as story generation from titles, which are difficult to evaluate automatically (Guan et al., 2021) since they suffer from the notorious one-to-many issue: there are many plausible outputs for the same input (Zhao et al., 2017).
We constructed the datasets for LOT through automatic and manual annotation. Firstly, we crawled human-written stories from public web pages as the data source. These stories are under licenses that allow use and redistribution for research purposes. Then, we hired a commercial team to create the LOT examples. The team is led by a professional screenwriter and has taken on hundreds of NLP annotation projects. All annotators are native Chinese speakers and well trained for the annotation tasks. We show the full list of the source web pages and the annotation details in the appendix.

Cloze Test

Mostafazadeh et al. (2016) introduced the Story Cloze Test (SCT) task for evaluating story comprehension, which requires selecting the right ending from two candidates for a four-sentence leading context. However, SCT suffers from the following issues: (1) Its dataset is artificial and contains innate biases between right and wrong endings in some features such as length (Schwartz et al., 2017; Sharma et al., 2018). Such biases may leak information about the target labels. (2) SCT focuses on reasoning about only endings but neglects other types of reasoning, such as abductive reasoning (Bhagavatula et al., 2019), which requires reasoning about what happens between observed beginnings and endings. (3) SCT limits the scope of commonsense reasoning to realistic events. This limitation may be neither necessary nor sufficient. For example, "Cupid can fly" can be reasoned based on common sense although it is not realistic, while some story settings may be realistic but cannot be reasoned only from the context and common sense, as shown in Table 4. Therefore, when constructing our ClozeT dataset, we adopt the following approaches to alleviate the above issues: (1) All examples are derived from existing human-written stories. (2) We allow annotators to create examples where the removed sentence is initially in the middle of the story. (3) We change the scope of commonsense reasoning to all events that embody characters' reactions and intentions, or the nature of physical objects and concepts. Table 6 shows two ClozeT examples. Furthermore, we also conducted experiments to investigate the potential biases of our dataset in Section 5.5.

Story Filtering
To ensure the quality of LOT examples, we asked annotators to judge whether each crawled story meets the following definition: "anything which is told in the form of a coherent event sequence involving several specific and related characters" (Mostafazadeh et al., 2016). We provided detailed cases to instruct annotators about this definition. Then, annotators needed to refine those stories which did not meet the definition by rewriting the plots. They should also clean up the stories by the following heuristics: (1) removing examples which may violate ethical principles (e.g., discrimination); (2) deleting noisy words (e.g., links); (3) changing slang and informal words into standard modern Chinese; (4) rewriting all dialogues into objective events. Finally, we collected 2,427 high-quality Chinese stories, which are used to construct the datasets for the ClozeT, PlotCom and OutGen tasks.

Table 4: An example for selecting a sentence that can be reasoned based on the context and common sense (in red). We also highlight a sentence that does not satisfy the requirement in green, which introduces a new character "the Devil King".

Dataset Construction
We presented the stories to another group of annotators to construct the ClozeT dataset. For each story, they should select a sentence as the right candidate, which can be reasoned based on the context and common sense. Table 4 shows an example presented to the annotators to illustrate how to judge whether a sentence satisfies this requirement. Then, the annotators rewrite the selected sentence into another one as the wrong candidate, which maintains good topical relatedness with the context but violates common sense. The wrong candidates should either embody unreasonable reactions or intentions, or violate the nature of physical objects or concepts. We also require annotators not to select the first sentence, which usually aims to introduce story settings instead of narrating an event. We browsed through the annotation results and gave the annotators detailed feedback before approving their submissions. Finally, we collected 1,232 examples in total and split them for training, validation and testing.

Sentence Position Prediction
We use the sentence position prediction task (Chen et al., 2019) to evaluate the ability to capture inter-sentence relations (e.g., causality). We formulate the task as follows: given a text with a sentence removed, models should choose the correct position of the sentence in the text from multiple candidates. Chen et al. (2019) constructed an English dataset for this task by randomly removing sentences from existing texts. However, such examples may be invalid since a sentence may have multiple plausible positions in a text, as illustrated in Table 5. Therefore, we construct the dataset for our task based on the following pipeline: (1) extracting paragraphs with fewer than 500 words from crawled stories; (2) randomly selecting a sentence to remove from each paragraph, and regarding all positions between two adjacent sentences as candidates; and (3) asking annotators to refine part of the auto-constructed examples as the validation and test sets, with the remainder used as the training set.

Dataset Construction
We asked annotators to refine each example so that the removed sentence has only one reasonable position in the text. We did not allow annotators to select the first or last sentence of the original text as the removed sentence since these sentences usually contain obvious wording features (e.g., "once upon a time," "they lived happily together"), which may make the task trivial. Unlike ClozeT, we allowed the texts for SenPos to be incomplete or to include dialogues, which also embody rich inter-sentence relations. Finally, we collected 1,663 examples for validation and testing through human annotation. And we constructed 20,000 examples automatically for training.
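The automatic construction step of the pipeline above can be sketched as follows; the function name and the exact formatting of the candidate markers are illustrative assumptions, not the released implementation.

```python
import random

def make_senpos_example(sentences, rng=None):
    """Build one auto-constructed SenPos training example by removing a
    random non-boundary sentence; the gaps between the remaining adjacent
    sentences become the candidate positions. Hypothetical helper."""
    rng = rng or random.Random(0)
    assert len(sentences) >= 4, "need several sentences to get multiple candidates"
    # Never remove the first or last sentence (obvious wording cues).
    idx = rng.randrange(1, len(sentences) - 1)
    removed = sentences[idx]
    context = sentences[:idx] + sentences[idx + 1:]
    # Mark every gap between adjacent remaining sentences as [1], [2], ...
    parts = []
    for i, sent in enumerate(context):
        parts.append(sent)
        if i < len(context) - 1:
            parts.append(f"[{i + 1}]")
    text = " ".join(parts)
    return text, removed, idx  # `idx` is the number of the correct gap
```

Annotators would then refine a subset of such examples so that the removed sentence fits exactly one gap.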

Plot Completion
We use the Plot Completion task (Wang and Wan, 2019) to test the ability to make inferences based on common sense. We formulate this task as follows: given a story with a sentence removed, models should generate a sentence to complete the story and make it reasonable and coherent.

Example 1
Context: A silly wolf and a fox stole a jar of honey and then hid it in a tree hole. They agreed that neither of them was allowed to eat the honey alone. However, the fox sneaked back to eat up all the honey the next day. Afterwards, whenever the wolf asked the fox to eat the honey together, the fox always refused its request. Finally the wolf could not help coming back to the tree hole and found that the jar had been empty. The wolf felt very regretful that the honey became dry because it had been too long. It had no doubts about the fox at all. [MASK]
Wrong candidate: When hearing this, the fox became very angry and decided no longer to look for food together with the wolf.
Right candidate: After hearing this, the fox became more active to look for food together with the wolf.

Example 2
Context: Once upon a time, there lived a mother and her son at the foot of a mountain. After her son grew up, he went out to learn skills and never came back. Therefore, the mother went to the nearby city to look for him. However, her son became an official and disowned his mother. The mother sat by the roadside and cried sadly. A young man passed by and knew the cause. Then the man took her home. [MASK] He decreed to remove the position of the disobedient son. And the mother lived happily in the palace.
Wrong candidate: Actually the man was also an official.
Right candidate: Actually the man was the prince of the city.

Table 6: Two ClozeT examples. The right candidates are extracted from the original stories (at the position of "[MASK]") while the wrong candidates are written by crowd-sourced annotators. The first example focuses on common sense regarding the fox's reaction to the silly wolf's behaviour, while the second example focuses on common sense regarding the relation between "palace" and "prince". In the original table, the entities and events related to the commonsense relations are highlighted in red, and those which violate common sense in the wrong candidates in green.
Example 1
Text: There was a man named Jiang, whose grandfather and father were killed by snakes when catching them. But he still made his living by catching snakes. [1] When Liu advised him no longer to catch snakes, the man cried and said that he would rather be killed by snakes than give up catching snakes. [2] Actually some villagers had already lost everything and had nothing to eat. [3] They could do nothing but tremble with fear when the officers went into their houses to collect taxes and struck out violently. [4] Even dogs and chickens couldn't get any peace in such a scenario, let alone humans!
Removed sentence: This was because he was able to pay taxes to the government only by catching snakes.
Label: [2]

Example 2
Text: A wolf went out to look for food. It happened to pass by a house. It heard a child crying and then an old woman scared the child by saying: "Do not cry! If you cry again, I will fling you out to feed wolves right away. [1]" Hearing this, the wolf was overjoyed and then squatted down and waited. However, the child was not flung out even when it was dark. [2] "If the wolf comes, let's kill and eat it." [3] The wolf was so frightened that he ran back to its lair. [4] When its friends asked it what happened, it said in dismay: "Don't mention it."
Removed sentence: After sunset, the wolf was getting impatient and planned to break into the house.
Label: [2]

Table 7: Two SenPos examples. The special tokens from [1] to [9] refer to the candidate positions. The first and second examples focus on testing the ability to capture inter-sentence causal and temporal relations, respectively. In the original table, the entities and events implying the relations are highlighted in red.
Dataset Construction
Prior studies (Wang and Wan, 2019; Paul and Frank, 2021) automatically constructed datasets for this task based on existing datasets by randomly removing one sentence from a story. However, as shown in Table 4, not all sentences in a story can be reasoned only based on the context and common sense. Therefore, we only used the above automatic method to construct the training data. And we adapted the ClozeT data to this task for validation and testing, since annotators have marked out the qualified sentences. Specifically, we randomly sampled some ClozeT examples and took the incomplete story of each example as input, and the right candidate as the target sentence to be generated.

Outline-conditioned Generation
Prior works tended to test the ability of long text generation through story generation conditioned on inputs with limited information such as titles (Yao et al., 2019). However, these tasks are so open-ended that it is difficult to reliably measure the generation quality using automatic metrics (Guan and Huang, 2020). To alleviate this issue, we introduce the Outline-conditioned Generation task (Rashkin et al., 2020), which requires generating a coherent long-form story conditioned on an outline of characters and events. We formulate the outline as a set of out-of-order phrases, which not only narrows down the set of plausible stories but also serves to test the controllability and planning ability of models to arrange the given events reasonably at the discourse level.

Dataset Construction
We built the dataset for this task automatically based on the filtered stories. We followed Rashkin et al. (2020) to extract the outline of a story using the RAKE algorithm (Rose et al., 2010). We extract at most eight phrases for each story, and each phrase contains no more than eight words. For example, the outline for the story in Table 1 is {"told his son with irony," "purchasing flour from a mill," "crossing the river," "drop the sack into the river," "indeed pushed the sack," "familiar to his son's temper," "shouted," "one bag"}. The outline can serve as discourse-level guidance for generation models, which should rearrange the events reasonably and generate a story with a good global discourse structure, rather than focus only on modeling local coherence.
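As an illustration of the extraction step, the following is a minimal RAKE-style scorer (candidate phrases are maximal runs between stopwords; word score = degree/frequency; phrase score = sum of word scores). The authors' exact tokenization and stopword list for Chinese are not specified here, so every detail below is an assumption for illustration.

```python
import re
from collections import defaultdict

def rake_outline(text, stopwords, max_phrases=8, max_words=8):
    """Minimal RAKE-style keyphrase extractor (after Rose et al., 2010)."""
    # Candidate phrases are maximal runs of non-stopword tokens.
    tokens = re.findall(r"\w+", text.lower())
    phrases, current = [], []
    for tok in tokens:
        if tok in stopwords:
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)
    # Keep phrases within the length limit (here: at most 8 words).
    phrases = [p for p in phrases if len(p) <= max_words]
    # Word score = degree / frequency; phrase score = sum of word scores.
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)
    def score(p):
        return sum(degree[w] / freq[w] for w in p)
    ranked = sorted(phrases, key=score, reverse=True)
    return [" ".join(p) for p in ranked[:max_phrases]]
```

For Chinese, the tokens would come from a word segmenter such as jieba rather than the `\w+` regex used here.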

Overall Score
Existing benchmarks usually summarize the performance of a model as a single score by averaging all metric scores without considering task difficulties. To encourage models to progress on those tasks where there is a more significant gap between machines and humans, we propose to average the metric scores with different weights. Suppose that there are a total of M metrics for all tasks; we derive the overall score as follows:

Score = Σ_{i=1}^{M} w_i · (S_i / H_i),  with  w_i = (H_i − B_i) / Σ_{j=1}^{M} (H_j − B_j),

where H_i, B_i and S_i are the scores of humans, a pre-selected baseline and the evaluated model on the i-th metric, respectively, and w_i is the weight for this metric. Intuitively, the metrics on which the baseline model has a larger gap with humans receive a larger weight when computing the overall score. We use BERT and GPT2 as the baseline models for the understanding and generation tasks in LOT, respectively.
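Under the weighting just described (weights proportional to the human-baseline gap, per-metric scores normalized by human performance), the overall score can be computed as in this sketch; the paper's exact normalization may differ, so treat it as an illustration.

```python
def overall_score(human, baseline, model):
    """Difficulty-weighted overall score. `human`, `baseline`, `model`
    are per-metric score lists (H_i, B_i, S_i) on comparable scales."""
    gaps = [max(h - b, 0.0) for h, b in zip(human, baseline)]
    total = sum(gaps)  # assumes the baseline trails humans on some metric
    weights = [g / total for g in gaps]
    # Metrics with a larger human-baseline gap contribute more.
    return sum(w * s / h for w, s, h in zip(weights, model, human))
```

A model matching human performance on every metric scores 1.0, and metrics the baseline already solves contribute nothing.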

Long Text Pretraining Model
To provide more flexibility on both understanding and generation tasks, we build LongLM following the original encoder-decoder design of the Transformer (Vaswani et al., 2017) with three different sizes, as shown in Table 8. We follow Cui et al. (2020) to use a sentencepiece vocabulary of 32,000 wordpieces (Kudo and Richardson, 2018). And we set the maximum sequence length to 512 for both the encoder and decoder.

Pretraining Data
We collect 120G of novels as the pretraining data for LongLM, which cover various topics such as romance, military, etc. Since a novel is usually much longer than the maximum input and output length of LongLM, we split each novel into multiple segments for pretraining.
Pretraining Tasks
Encoder-decoder models are typically trained by maximizing the likelihood of the target output given an input. To improve the capacities of both the encoder and decoder, we train LongLM with two pretraining tasks: text infilling (Raffel et al., 2020) and conditional continuation (Radford et al., 2019).
For the first task, the input is a text in which a number of spans are sampled and replaced by special tokens with unique IDs, while the output is the spans delimited by the special tokens used in the input. The lengths of masked spans are drawn from a Poisson distribution with λ=3, and all masked tokens comprise 15% of the original text. As for the second task, the input and output are respectively the front and back halves of a text, which is split into two parts at a random position. We show an example of the pretraining tasks in Figure 1.

Pretraining Details
We set the learning rate to 1e-4 with the Adam optimizer and the batch size to 1,000. We pretrained LongLM for 2.5M steps. It took about two months to train the largest model using eight NVIDIA V100 GPUs.
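The two pretraining objectives can be illustrated with the following self-contained sketch; the sentinel format (`<extra_id_n>`), the Poisson sampler, and the span-overlap handling are assumptions for illustration, not the authors' preprocessing code.

```python
import random

def _poisson(lam, rng):
    # Knuth's algorithm; adequate for small lambda.
    l, k, p = 2.718281828459045 ** (-lam), 0, 1.0
    while p > l:
        k += 1
        p *= rng.random()
    return k - 1

def infilling_example(tokens, mask_ratio=0.15, lam=3, rng=None):
    """Text infilling: sample span lengths from Poisson(lam) until about
    `mask_ratio` of the tokens are masked; the input keeps one sentinel
    per masked span and the target lists each span after its sentinel."""
    rng = rng or random.Random(0)
    n = len(tokens)
    budget, masked, tries = max(1, int(n * mask_ratio)), [False] * n, 0
    while budget > 0 and tries < 1000:
        tries += 1
        length = max(1, min(budget, _poisson(lam, rng)))
        start = rng.randrange(0, n - length + 1)
        if any(masked[start:start + length]):
            continue  # avoid overlapping spans
        masked[start:start + length] = [True] * length
        budget -= length
    src, tgt, sid, i = [], [], 0, 0
    while i < n:
        if masked[i]:
            sentinel = f"<extra_id_{sid}>"
            src.append(sentinel)
            tgt.append(sentinel)
            while i < n and masked[i]:
                tgt.append(tokens[i])
                i += 1
            sid += 1
        else:
            src.append(tokens[i])
            i += 1
    return src, tgt

def continuation_example(tokens, rng=None):
    """Conditional continuation: split at a random point; the front half
    is the input and the back half is the target."""
    rng = rng or random.Random(0)
    split = rng.randrange(1, len(tokens))
    return tokens[:split], tokens[split:]
```

Concatenating each sentinel's span from the target back into the input reconstructs the original text, which is a convenient sanity check for the masking code.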
Model Performance
To assess the performance of LongLM on the pretraining tasks, we randomly held out 1,000 texts from the initial pretraining data for testing; these texts were never seen during pretraining. We used perplexity and BLEU-n (n=3, 4) to evaluate both pretraining tasks. We generated outputs using the greedy decoding algorithm for the text infilling task, and top-k sampling (Fan et al., 2018) with k = 40 and a softmax temperature of 0.7 (Goodfellow et al., 2014) for the conditional continuation task. As shown in Table 9, the performance improves substantially as the number of parameters increases.
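The top-k sampling step used for conditional continuation (k = 40, temperature 0.7) can be sketched as follows on raw logits; this is a generic implementation, not the authors' decoding code.

```python
import math
import random

def top_k_sample(logits, k=40, temperature=0.7, rng=None):
    """Sample one token id: keep the k highest-scoring tokens,
    rescale by the temperature, softmax, and draw from the result."""
    rng = rng or random.Random(0)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract max for numerical stability
    probs = [math.exp(s - m) for s in scaled]
    z = sum(probs)
    probs = [p / z for p in probs]
    r, acc = rng.random(), 0.0
    for i, p in zip(top, probs):
        acc += p
        if r <= acc:
            return i
    return top[-1]
```

A temperature below 1 sharpens the distribution, trading diversity for coherence; greedy decoding (used here for infilling) is the k = 1, deterministic limit.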

Experiments
In this section, we tested LongLM and existing models on LOT with automatic and manual evaluation. Furthermore, we conducted extensive experiments to investigate the potential biases of the ClozeT and SenPos datasets (Section 5.5) and to measure the overlap between training and testing data (Section 5.6).

Evaluated Models
We evaluated the following models, which are implemented based on the registered models of HuggingFace Transformers: (1) Vanilla Transformer: It has the same architecture as BERT base except that the number of layers is set to 3 (Vaswani et al., 2017).
(5) mT5: It is implemented based on the google/mt5-base registered model (Xue et al., 2021). We set all the baseline models to the base version due to limited computational resources.
To show the generic benefits of the pretraining data of LongLM for long text modeling, we pretrained a left-to-right language model from scratch on the data with the standard language modeling objective. This model has the same architecture as GPT2 base and is denoted as GPT2†base. Moreover, we evaluated two task-specific pretraining models, PlotMachines (PM) (Rashkin et al., 2020) and Plan&Write (PW) (Yao et al., 2019), and two typical non-pretrained models, ConvS2S (Gehring et al., 2017) and Fusion (Fan et al., 2018), on the generation tasks in LOT. We used GPT2 base as the backbone model of PM and PW. For PM, we regard the input sentences (for PlotCom) or input phrases (for OutGen) as the plot elements used in the memory network, and update the memory representations at each step of decoding. As for PW, we take a keyword extracted from the target sentence using the RAKE algorithm (for PlotCom) or the sorted input phrases (for OutGen) as the intermediate representations for planning. We implemented these models based on the code provided by the original papers.

Experiment Settings
Understanding Tasks
For both tasks, we encode the input of each example and then predict a distribution over all candidates by normalizing the dot-product values between the representations of each candidate and the context. We use the candidate with the maximum probability as the prediction result. For ClozeT, we represent a candidate using the hidden state at its end, and we regard the hidden state at the position where the removed sentence appears in the original text as the context representation. For SenPos, we take the hidden state at each candidate position as the candidate representation and the hidden state at the end of the removed sentence as the context representation. When evaluating mT5 and LongLM, we feed the same input into the encoder and decoder (Lewis et al., 2020) and use the hidden states of the decoder for prediction in the above way.
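The prediction rule described above (softmax-normalized dot products between candidate and context representations) amounts to the following sketch; in practice the vectors would be hidden states from the evaluated model rather than plain lists.

```python
import math

def pick_candidate(context_vec, candidate_vecs):
    """Score each candidate by its dot product with the context
    representation, softmax-normalize, and return (argmax, probs)."""
    scores = [sum(c * x for c, x in zip(cand, context_vec))
              for cand in candidate_vecs]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return probs.index(max(probs)), probs
```

For ClozeT the candidates are the two ending representations; for SenPos they are the hidden states at each candidate gap.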
Generation Tasks
For PlotCom, we take the incomplete story of an example as input to generate the missing sentence. And for OutGen, we concatenate all phrases in an outline with special tokens as input to generate a story.
Hyper-Parameters
For all models, we set the batch size to 12, the maximum sequence length to 512, and the learning rate to 3e-5. We decode outputs using top-k sampling with k = 40 and a softmax temperature of 0.7 for the generation tasks.

Automatic Evaluation
Metrics
We use accuracy to evaluate the understanding tasks. As for the generation tasks, we use BLEU-n (B-n) and Distinct-n (D-n) to evaluate the n-gram overlap with ground-truth texts (Papineni et al., 2002) and the n-gram generation diversity (Li et al., 2016), respectively. We set n = 1, 2 for both generation tasks. Additionally, we use the following two metrics to evaluate OutGen: (1) Coverage (Cover): It evaluates the generation controllability, and is computed as the average Rouge-L recall score (Lin, 2004) between the generated text and each input phrase. A higher coverage score indicates that the generated text covers more input phrases.
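Distinct-n as used here is the standard formulation; a minimal version over tokenized outputs might look like this (the paper's exact tokenization is an assumption):

```python
def distinct_n(texts, n):
    """Distinct-n (Li et al., 2016): the ratio of unique n-grams to
    total n-grams over a set of generated texts (lists of tokens)."""
    total, unique = 0, set()
    for toks in texts:
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

Higher values indicate more diverse generations; a model that repeats the same phrase scores near zero.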
(2) Order: It measures the gap between the positional orders of the input phrases appearing in the generated texts and in the ground-truth texts. Specifically, we compute the order score as the average ratio of the number of inversions in the generated story to the number of all position pairs of any two phrases. An inversion refers to a position pair that is out of the ground-truth order. We use the position of the longest common subsequence between a story and a phrase as the position of the phrase in the story.
Because an input phrase does not always appear in the generated story, we regard all position pairs of such a phrase and others as inversions.
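Putting the Order definition together (inversions over phrase-position pairs, with missing phrases counting all their pairs as inversions), a sketch of the metric is below; locating each phrase via longest-common-subsequence matching is abstracted into the position maps, and the exact handling of ties is an assumption.

```python
def order_score(gold_pos, gen_pos):
    """Order metric sketch: ratio of inverted phrase pairs to all pairs
    (lower is better). `gold_pos`/`gen_pos` map each outline phrase to
    its position in the ground-truth/generated story; phrases missing
    from the generation are simply absent from `gen_pos`."""
    phrases = sorted(gold_pos, key=gold_pos.get)  # ground-truth order
    pairs = inversions = 0
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            pairs += 1
            pa, pb = gen_pos.get(phrases[i]), gen_pos.get(phrases[j])
            # A missing phrase makes all of its pairs count as inversions.
            if pa is None or pb is None or not pa < pb:
                inversions += 1
    return inversions / pairs if pairs else 0.0
```

A generation that realizes every phrase in the ground-truth order scores 0, and a fully reversed one scores 1.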
Results
Tables 10 and 11 show the results on the understanding and generation tasks, respectively.
To obtain the human performance on the understanding tasks, we randomly sampled 100 examples from the validation or test set and hired three crowd-sourced annotators (native Chinese speakers) to do these tasks. We made final decisions among them through majority voting. All results show an almost perfect inter-annotator agreement with Fleiss's κ > 0.85 (Fleiss, 1971). For the generation tasks, we regard the scores of ground-truth texts as human performance.
We summarize the evaluation results as follows: (1) Pretrained models have significantly bet-

Manual Evaluation
Since automatic metrics may be unreliable for evaluating NLG (Guan and Huang, 2020), we conducted a point-wise manual evaluation to measure the disparity between machines and humans on the generation tasks in LOT. For each task, we randomly sampled 100 examples from the test set and obtained 100 ground-truth texts and 300 texts generated by three typical models: GPT2-base, mT5-base, and LongLM-large. For each text along with its input, we hired three crowd-sourced workers to judge its quality with a binary score (1 for good, 0 otherwise) in terms of three aspects: (1) grammaticality (intra-sentence grammar quality of generated texts), (2) coherence (causal and temporal dependencies within generated texts), and (3) relatedness to inputs (reasonable logical connections to the input context for PlotCom, and reasonable utilization of the input phrases for OutGen). These aspects are evaluated independently. We made final decisions among the three annotators through majority voting. We show the annotation instructions in the appendix.
Table 12 shows the evaluation results. For both tasks, LongLM outperforms GPT2 and mT5 significantly in all aspects (p < 0.05, sign test). However, all models struggle to generate a logical completion for PlotCom (relatedness score < 0.1), showing their poor ability to capture commonsense and inter-sentence relations. The big gap between LongLM and humans also shows that both tasks remain challenging for existing generation models. We also observe a positive correlation between the manual evaluation and the automatic evaluation (Table 11), suggesting that it may be acceptable to use automatic evaluation to compare and improve models on the generation tasks in LOT.

Bias Investigation
It is essential to investigate the potential biases of a dataset, which may leak information about target labels and enable models to use shortcuts to handle complex inputs without actually mastering the focused abilities (Ribeiro et al., 2020). Therefore, we experimented with the following baselines to inspect the ClozeT and SenPos datasets: (1) Random: It chooses a candidate randomly.
(2) Majority: It chooses the candidate with the index that is most frequently selected in the training set. (3) Length: For ClozeT, it chooses the candidate that contains more words; and for SenPos, it chooses the position whose adjacent sentences have the closest number of words to the removed sentence. (4) BLEU-n: For ClozeT, it chooses the candidate with the higher BLEU-n score (Papineni et al., 2002) with the context; and for SenPos, it chooses the position whose adjacent sentences have the largest average BLEU-n score with the removed sentence (n = 1, 2). (5) Sentiment: For ClozeT, it chooses the candidate with the higher sentiment score computed by an off-the-shelf Chinese sentiment analyzer; and for SenPos, it chooses the position where the average sentiment score of the two adjacent sentences is closest to the score of the removed sentence. (6) Discourse Markers: For ClozeT, it chooses the candidate whose adjacent sentences contain a discourse marker matching with it. For example, if "because" occurs in the last sentence before the position of the candidates, this baseline will choose the candidate that contains "so". If no such paired markers exist in an example, or there are multiple eligible candidates, this baseline chooses randomly. The setting of this baseline for SenPos is similar to that for ClozeT. We manually define 24 marker pairs for this baseline. (7) BERT w/o Context: We fine-tuned BERT to choose directly without taking the context as input (Schwartz et al., 2017). (8) BERT w/o Long: It is used to study whether solving these tasks requires modeling long-range dependencies. For ClozeT, we fine-tuned BERT to choose with only the adjacent sentences of the removed sentence as input; and for SenPos, we encoded each position and its adjacent sentences respectively using BERT and then took the hidden states at these positions for prediction. These baselines cover different levels of features, ranging from the token level (e.g., Length) and the sentence level (e.g., Sentiment) to the discourse level (e.g., Discourse Markers, BERT w/o Context). We believe that these baselines provide a comprehensive inspection of the potential biases of our datasets.
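As an illustration of the marker-matching heuristic for ClozeT, a toy version might look like this; the English markers and substring matching here are stand-ins for the paper's 24 Chinese marker pairs and real tokenization:

```python
import random

# Hypothetical marker pairs; the paper manually defines 24 Chinese pairs.
MARKER_PAIRS = [("because", "so"), ("although", "but"), ("if", "then")]

def discourse_marker_baseline(context_before, candidates):
    """Pick the candidate whose marker pairs with a marker in the
    preceding context; fall back to a random choice when no candidate
    or more than one candidate matches."""
    eligible = []
    for idx, cand in enumerate(candidates):
        for first, second in MARKER_PAIRS:
            if first in context_before and second in cand:
                eligible.append(idx)
                break
    if len(eligible) == 1:
        return eligible[0]
    return random.choice(range(len(candidates)))
```

A real implementation would match markers on token boundaries rather than raw substrings; the sketch only conveys the shape of the heuristic.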
As shown in Table 13, neither task can be trivially solved by these baselines, suggesting that the datasets may be free of biases in terms of the above features. Therefore, we believe that the tasks can focus on testing the ability of models to capture long-range commonsense and discourse relations.

Memorization Investigation
Overlap between training and test data may result in over-reporting of the generalization performance of machines. Therefore, it is necessary to investigate how many test examples also show up in the training data. To this end, we follow Radford et al. (2019) and measure the overlap between two datasets by calculating the percentage of 8-grams from one that also appear in the other. We use the jieba tokenizer for tokenization.
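This overlap statistic can be sketched as follows; we expose the tokenizer as a parameter and default to whitespace splitting for illustration, whereas the paper uses the jieba tokenizer for Chinese:

```python
def ngram_overlap_percent(test_text, train_texts, n=8, tokenize=str.split):
    """Percentage of n-grams in test_text that also occur in any train text."""
    def ngrams(text):
        toks = tokenize(text)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    test_grams = ngrams(test_text)
    if not test_grams:
        return 0.0  # text shorter than n tokens has no n-grams
    train_grams = set()
    for t in train_texts:
        train_grams |= ngrams(t)
    return 100.0 * len(test_grams & train_grams) / len(test_grams)
```

For Chinese, one would pass `tokenize=lambda s: list(jieba.cut(s))` to reproduce the paper's setting.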
Table 14 shows the overlap analysis for the test sets of the four tasks in LOT. All test sets have less than 1% overlap with their own training sets. Notably, 17 test examples of SenPos contain more than 10% overlapped 8-grams with the training set. This is because a training example and a test example may come from the same story and thus share similar information (e.g., characters, locations). A test example contains at most 60.98% overlapped 8-grams, suggesting that the training set and test set do not include exactly the same example. As for the pretraining data of LongLM, the test sets of ClozeT and PlotCom still have less than 1% overlap. However, dozens of test examples in SenPos and OutGen contain more than 10% overlapped 8-grams. Through manual inspection, we found that these overlaps mainly come from idioms, proverbs, and classic fairy tales, which may be part of some novels in the pretraining data.
To investigate how the overlapping data influence the measurement of model performance, we re-evaluated LongLM-large on the test sets of SenPos and OutGen, excluding the examples that have more than 10% overlapped 8-grams with the training sets or the pretraining data. We also used mT5-base as a baseline under the same setting. The results for SenPos and OutGen are shown in Table 15 and Table 16, respectively. The change in accuracy or BLEU-1 score is marginal for both mT5 and LongLM when excluding the overlapping data, suggesting that the superior performance of LongLM is hardly attributable to memorization of the training data. Therefore, we believe that it is fair to compare LongLM with other models on these tasks.

Conclusions
We present LOT, a story-centric benchmark for Chinese long text understanding and generation. LOT includes two story understanding tasks and two story generation tasks, which comprehensively test the abilities of commonsense reasoning, controllable generation, and modeling inter-sentence relations and global discourse structures. We provide standard datasets for the four tasks, constructed from human-written stories through automatic and manual annotation. Furthermore, we release a new Chinese long text pretraining model, LongLM, which outperforms strong baseline models substantially on both the understanding and generation tasks in LOT. The LOT benchmark, the pretraining model, and the evaluation platform will encourage further research on Chinese long text modeling.

Acknowledgement
This work was supported by the National Science Foundation for Distinguished Young Scholars (No. 62125604) and the NSFC projects (key project No. 61936010 and regular project No. 61876096). This work was also supported by the Guoqiang Institute of Tsinghua University, under Grant No. 2019GQG1 and No. 2020GQG0005. We would also like to thank our action editor, Dipanjan Das, and the anonymous reviewers for their invaluable suggestions and feedback.

Figure 1 :
Figure 1: Schematic of the pretraining tasks. <X> and <Y> are special tokens used for masking spans. <Z> is the "end of sequence" token.

Figure 2 :
Figure 2: Accuracy of BERT for SenPos as the size of training data increases.

Table 1 :
A long text example. The concepts and events concerning commonsense and discourse relations are highlighted in bold.

Table 2 :
Overview of the tasks in LOT: the abilities they test, inputs and outputs, and the evaluation metrics. Dist and Cover refer to Distinct and Coverage (Section 5.3), respectively.

Table 3 :
Data statistics of LOT tasks. The abbreviation char/sent/len is short for character/sentence/length, respectively.
Table 7 shows two SenPos examples.
Text: I couldn't control my anger very well. [1] My parents would yell at me, and I ran to my room. [2] I buried my head in a pillow and screamed. [3] I threw my pillow and hit it hard.
Removed Sentence: I tried to express my anger.

Table 5 :
A poor example for the SenPos task. The removed sentence has multiple reasonable positions in the original text, including [2] and [3].

Table 8 :
Hyper-parameter settings for different versions of LongLM. d_m, d_ff, and d_kv are the dimensions of the hidden states, the feed-forward layers, and the keys/values in the self-attention layers, respectively. n_h is the number of attention heads. n_e and n_d denote the number of hidden layers in the encoder and decoder, respectively. # P is the number of parameters.

Table 10 :
Accuracy (%) on the understanding tasks in LOT. # P means the number of parameters. The best performance is in bold and the second best is underlined. w_i is the metric weight with BERT as the baseline model when computing the overall score.

Table 11 :
Evaluation results on the generation tasks in LOT. # P means the number of parameters. The best performance is in bold and the second best is underlined. w_i is the metric weight with GPT2-base as the baseline model when computing the overall score.

Table 12 :
Manual evaluation results for PlotCom and OutGen in terms of grammaticality (Gram), coherence (Cohe), and relatedness (Relat). The best performance is highlighted in bold. All results show a fair inter-annotator agreement with Fleiss' κ > 0.2.

Table 13 :
Accuracy (%) of different baselines on the test sets of ClozeT and SenPos for bias investigation. We use the results of BERT as a reference.

Table 14 :
Overlap analysis for the test sets of the four tasks with respect to their own training sets or the pretraining data of LongLM. We compute the following statistics: (1) Percent: the percentage of 8-grams from the test set that are also in the training sets or the pretraining data; (2) # 8-grams: the number of overlapped 8-grams; (3) # Exam: the number of examples that contain at least one overlapped 8-gram; (4) # Exam >10%: the number of examples that have more than 10% overlapped 8-grams; (5) Max Percent: the maximum percentage of overlapped 8-grams in an example.

Table 15 :
Accuracy on the test set of SenPos. Total means using the whole test set, while w/o Overlap means excluding from the test set the examples that have more than 10% overlapped 8-grams with the training set or pretraining data. # Exam is the number of examples. ∆ denotes the change in accuracy when excluding the overlapping data compared with using the total test set.

Table 16 :
BLEU-1 score on the test set of OutGen. Other notations are the same as in Table 15.