Evaluating Document Coherence Modelling

While pretrained language models ("LM") have driven impressive gains over morpho-syntactic and semantic tasks, their ability to model discourse and pragmatic phenomena is less clear. As a step towards a better understanding of their discourse modelling capabilities, we propose a sentence intrusion detection task. We examine the performance of a broad range of pretrained LMs on this detection task for English. Lacking a dataset for the task, we introduce INSteD, a novel intruder sentence detection dataset, containing 170,000+ documents constructed from English Wikipedia and CNN news articles. Our experiments show that pretrained LMs perform impressively in in-domain evaluation, but experience a substantial drop in the cross-domain setting, indicating limited generalisation capacity. Further results over a novel linguistic probe dataset show that there is substantial room for improvement, especially in the cross-domain setting.


Introduction
Rhetorical relations refer to the transition of one sentence to the next in a span of text (Mann and Thompson, 1988; Asher and Lascarides, 2003). They are important as a discourse device that contributes to the overall coherence, understanding, and flow of the text. These relations span a tremendous breadth of types, including contrast, elaboration, narration, and justification. These connections allow us to communicate cooperatively in understanding one another (Grice, 2002; Wilson and Sperber, 2004). The ability to understand such coherence (and conversely detect incoherence) is potentially beneficial for downstream tasks, such as storytelling (Fan et al., 2019; Hu et al., 2020b), recipe generation (Chandu et al., 2019), document-level text generation (Park and Kim, 2015; Holtzman et al., 2018), and essay scoring (Tay et al., 2018). However, there is little work on document coherence understanding, especially examining the capacity of pretrained LMs to model the coherence of longer documents. To address this gap, we examine the capacity of pretrained language models to capture document coherence, focused around two research questions: (1) do models truly capture the intrinsic properties of document coherence? and (2) what types of document incoherence can/can't these models detect?
We propose the sentence intrusion detection task: (1) to determine whether a document contains an intruder sentence (coarse-grained level); and (2) to identify the span of any intruder sentence (fine-grained level). We restrict the scope of the intruder text to a single sentence, noting that in practice, the incoherent text could span multiple sentences, or alternatively be sub-sentential.
Existing datasets in document coherence measurement (Clercq et al., 2014; Lai and Tetreault, 2018; Mim et al., 2019; Pitler and Nenkova, 2008; Tien Nguyen and Joty, 2017) are unsuitable for our task: they are either prohibitively small, or do not specify the span of incoherent text. For example, in the dataset of Lai and Tetreault (2018), each document is assigned a coherence score, but the span of incoherent text is not specified. There is thus a need for a large-scale dataset which includes annotation of the position of intruder text. Identifying the span of incoherent text can benefit tasks where explainability and immediate feedback are important, such as essay scoring (Tay et al., 2018). In this work, we introduce a dataset consisting of English documents from two domains: Wikipedia articles (106K) and CNN news articles (72K). This dataset fills a gap in research pertaining to document coherence: our dataset is large in scale, includes both coherent and incoherent documents, and has mark-up of the position of any intruder sentence. Figure 1 is an example document with an intruder sentence. Here, the highlighted sentence reads as though it should be an elaboration of the previous sentence, but clearly exhibits an abrupt change of topic and the pronoun it cannot be readily resolved.
[Figure 1 (excerpt): An example document with an intruder sentence. (1) Mark Ferguson (born 21 May 1990) is an Irish handballer, currently playing in Dublin, Ireland. (2) It is a twelve time Asian Champion, the tournament has been won by any other nation only twice.]
This paper makes the following contributions: (1) we propose the sentence intrusion detection task, and examine how pretrained LMs perform over the task and hence at document coherence understanding; (2) we construct a large-scale dataset from two domains (Wikipedia and CNN news articles) that consists of coherent and incoherent documents, and is accompanied by the positions of intruder sentences, to evaluate in both in-domain and cross-domain settings; (3) we examine the behaviour of models and humans, to better understand the ability of models to model the intrinsic properties of document coherence; and (4) we further hand-craft adversarial test instances across a variety of linguistic phenomena to better understand the types of incoherence that a given model can detect.

Related Work
We first review tasks relevant to our proposed task, then describe existing datasets used in coherence measurement, and finally discuss work on dataset artefacts and linguistic probes.

Document Coherence Measurement
Coherence measurement has been studied across various tasks, such as the document discrimination task (Barzilay and Lapata, 2005; Elsner et al., 2007; Barzilay and Lapata, 2008; Elsner and Charniak, 2011; Li and Jurafsky, 2017; Putra and Tokunaga, 2017), sentence insertion (Elsner and Charniak, 2011; Putra and Tokunaga, 2017; Xu et al., 2019), paragraph reconstruction (Lapata, 2003; Elsner et al., 2007; Li and Jurafsky, 2017; Xu et al., 2019; Prabhumoye et al., 2020), summary coherence rating (Barzilay and Lapata, 2005; Pitler et al., 2010; Guinaudeau and Strube, 2013; Tien Nguyen and Joty, 2017), readability assessment (Guinaudeau and Strube, 2013; Mesgar and Strube, 2016, 2018), and essay scoring (Mesgar and Strube, 2018; Somasundaran et al., 2014; Tay et al., 2018). These tasks differ from our task of intruder sentence detection as follows. First, the document discrimination task assigns coherence scores to a document and its sentence-permuted versions, where the original document is considered to be well-written and coherent and permuted versions incoherent. Incoherence is introduced by shuffling sentences, while our intruder sentences are selected from a second document, and there is only ever a single intruder sentence per document. Second, sentence insertion aims to find the correct position to insert a removed sentence back into a document. Paragraph reconstruction aims to recover the original sentence order of a shuffled paragraph given its first sentence. These two tasks do not consider sentences from outside of the document of interest. Third, the aforementioned three tasks are artificial, and have very limited utility in terms of real-world tasks, while our task can provide direct benefit in applications such as essay scoring, in identifying incoherent (intruder) sentences as a means of providing user feedback and explainability of essay scores. Lastly, in summary coherence rating, readability assessment, and essay scoring, coherence is just one dimension of the overall document quality measurement.
Various methods have been proposed to capture local and global coherence, while our work aims to examine the performance of existing pretrained LMs in document coherence understanding. To assess local coherence, traditional studies have used entity matrices, e.g. to represent entity transitions across sentences (Barzilay and Lapata, 2005, 2008). Guinaudeau and Strube (2013) and Mesgar and Strube (2016) use a graph to model entity transition sequences. Sentences in a document are represented by nodes in the graph, and two nodes are connected if they share the same or similar entities. Neural models have also been proposed (Ji and Smith, 2017; Li and Jurafsky, 2017; Mesgar and Strube, 2018; Mim et al., 2019; Tien Nguyen and Joty, 2017). For example, Tay et al. (2018) capture local coherence by computing the similarity of the output of two LSTMs (Hochreiter and Schmidhuber, 1997), which they concatenate with essay representations to score essays. Multi-headed self-attention has also been used to capture long-distance relationships between words, with the resulting representations passed to an LSTM layer to estimate essay coherence scores. Xu et al. (2019) use the average of local coherence scores between consecutive pairs of sentences as the document coherence score.
Another relevant task is disfluency detection in spontaneous speech transcription (Johnson and Charniak, 2004; Jamshid Lou et al., 2018). This task detects the reparandum and repair in spontaneous speech transcriptions to make the text fluent by replacing the reparandum with the repair. Also relevant is language identification in code-switched text (Adouane et al., 2018a,b; Mave et al., 2018; Yirmibeşoglu and Eryigit, 2018), where disfluency is defined at the language level (e.g. for a monolingual speaker). Lau et al. (2015) and Warstadt et al. (2019) predict sentence-level acceptability (how natural a sentence is). However, none of these tasks are designed to measure document coherence, although sentence-level phenomena can certainly impact on document coherence.

Document Coherence Datasets
There exist a number of datasets targeted at discourse understanding.
For example, Alikhani et al. (2019) construct a multi-modal dataset for understanding discourse relations between text and imagery, such as elaboration and exemplification. In contrast, we focus on discourse relations in a document at the intersentential level. The Penn Discourse Treebank (Miltsakaki et al., 2004;Prasad et al., 2008) is a corpus of coherent documents with annotations of discourse connectives and their arguments, noting that inter-sentential discourse relations are not always lexically marked (Webber, 2009).
The most relevant work to ours is the discourse coherence dataset of , which was proposed to evaluate the capabilities of pretrained LMs in capturing discourse context. This dataset contains documents (18K Wikipedia articles and 10K documents from the Ubuntu IRC channel) with fixed sentence length, and labels documents only in terms of whether they are incoherent, without considering the position of the incoherent sentence. In contrast, our dataset: (1) provides more fine-grained information (i.e. the sentence position); (2) is larger in scale (over 170K documents); (3) contains documents of varying length; (4) incorporates adversarial filtering to reduce dataset artefacts (see Section 3); and (5) is accompanied with human annotation over the Wikipedia subset, allowing us to understand behaviour patterns of machines and humans.

Dataset Artefacts
Also relevant to this research is work on removing artefacts in datasets (Zellers et al., 2019;McCoy et al., 2019;Zellers et al., 2018). For example, based on analysis of the SWAG dataset (Zellers et al., 2018), Zellers et al. (2019) find artefacts such as stylistic biases, which correlate with the document labelling and mean that naive models are able to achieve abnormally high results. Similarly, McCoy et al. (2019) examine artefacts in an NLI dataset, and find that naive heuristics which are not directly related to the task can perform remarkably well. We incorporate the findings of such work in the construction of our dataset.

Linguistic Probes
Adversarial training has been used to craft adversarial examples to obtain more robust models, either by manipulating model parameters (white-box attacks) or minimally editing text at the character/word/phrase level (black-box attacks). For example, Papernot et al. (2018) provide a reference library of adversarial example construction techniques and adversarial training methods.
As we aim to understand the linguistic properties that each model has captured, we focus on black-box attacks (Sato et al., 2018;Cheng et al., 2020;Liang et al., 2018;Yang et al., 2020;Samanta and Mehta, 2017). For example, Samanta and Mehta (2017) construct adversarial examples for sentiment classification and gender detection by deleting, replacing, or inserting words in the text. For a comprehensive review of such studies, see Belinkov and Glass (2019).
There is also a rich literature on exploring what kinds of linguistic phenomena a model has learned (Hu et al., 2020a; Hewitt and Liang, 2019; Hewitt and Manning, 2019; McCoy et al., 2019; Conneau et al., 2018; Gulordava et al., 2018; Peters et al., 2018; Tang et al., 2018; Blevins et al., 2018; Wilcox et al., 2018; Kuncoro et al., 2018; Tran et al., 2018; Belinkov et al., 2017). The basic idea is to use learned representations to predict linguistic properties of interest. Example linguistic properties are subject-verb agreement or syntactic structure, while representations can be word or sentence embeddings. For example, Marvin and Linzen (2018) construct minimal sentence pairs, consisting of a grammatical and an ungrammatical sentence, to explore the capacity of LMs in capturing phenomena such as subject-verb agreement, reflexive anaphora, and negative polarity items. In our work, we hand-construct intruder sentences which result in incoherent documents, based on a broad range of linguistic phenomena.

Dataset Desiderata
To construct a large-scale, low-noise dataset that truly tests the ability of systems to detect intruder sentences, we posit five desiderata:
1. Multiple sources: The dataset should not be too homogeneous in terms of genre or domain, and should ideally test the ability of models to generalise across domains.
2. Defences against hacking: Human annotators and machines should not be able to hack the task and reverse-engineer the labels by sourcing the original documents.
3. Free of artefacts: The dataset should be free of artefacts that allow naive heuristics to perform well.
4. Topic consistency: The intruder sentence, which is used to replace a sentence from a coherent document to obtain an incoherent document, should be relevant to the topic of the document, to focus the task on coherence and not simple topic detection.
5. KB-free: Our goal is NOT to construct a fact-checking dataset; the intruder sentence should be determinable based on the content of the document, without reliance on external knowledge bases or fact-checking.

Data Sources
We construct a dataset from two sources -Wikipedia and CNN -which differ in style and genre, satisfying the first desideratum. Similar to WikiQA (Yang et al., 2015) and HotpotQA (Yang et al., 2018), we represent a Wikipedia document by its summary section (i.e., the opening paragraph), constraining the length to be between 3 and 8 sentences. For CNN, we adopt the dataset of Hermann et al. (2015) and Nallapati et al. (2016), which consists of over 100,000 news articles. To obtain documents with sentence length similar to those from Wikipedia, we randomly select the first 3-8 sentences from each article.
To defend against dataset hacks which could expose the labels of the test data (desideratum 2), the Wikipedia test set is randomly sampled from 37 historical dumps of Wikipedia, where the selected article has a cosine similarity less than the historical average of 0.72 with its online version. For the training set, we remove this requirement and randomly select articles from different Wikipedia dumps, i.e., the articles in the training set might be the same as their current online version. For CNN, we impose no such limitations.

Generating Candidate Positive Samples
We consider the original documents to be coherent. We construct incoherent documents from half of our sampled documents as follows (satisfying desiderata 3-5):
1. Given a document D, use bigram hashing and TF-IDF matching (Chen et al., 2017) to retrieve the top-10 most similar documents from a collection of documents from the same domain, where D is the query text. Let the set of retrieved documents be R_D.
2. Randomly choose a non-opening sentence S from document D, to be replaced by a sentence candidate generated later. We do not replace the opening sentence as it is needed to establish document context.
3. For each document D′ ∈ R_D, randomly select one non-opening sentence S′ ∈ D′ as an intruder sentence candidate.
4. Calculate the TF-IDF-weighted cosine similarity between sentence S and each candidate S′. Remove any candidates with similarity scores ≥ 0.6, to attempt to generate a KB-free incoherence.
5. Replace sentence S with each low-similarity candidate S′, and use a fine-tuned XLNet-Large model (Yang et al., 2019) to check whether it is easy for XLNet-Large to detect (see Section 5). For documents with both easy and difficult sentence candidates, we randomly sample from the difficult sentence candidates; otherwise, we randomly choose from all the sentence candidates.
The decision to filter out sentence candidates with similarity ≥ 0.6 was based on the observation that more similar sentences often led to the need for world knowledge to identify the intruder sentence (violating the fifth desideratum). For example, given It is the second novel in the first of three trilogies about Bernard Samson, ..., an intruder sentence candidate with high similarity is It is the first novel in the first of three trilogies about Bernard Samson ....
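Steps 4 and 5 of the procedure can be sketched roughly as follows. This is a minimal illustration rather than the pipeline used to build the dataset: the whitespace tokeniser, the smoothed IDF formula, and the function names are assumptions, and `is_hard` is a hypothetical callback standing in for the check with the fine-tuned XLNet-Large model.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenisation (not specified in the paper).
    return text.lower().split()


def idf_weights(corpus):
    # Smoothed inverse document frequency over a list of sentences (assumed formula).
    n = len(corpus)
    df = Counter()
    for sent in corpus:
        df.update(set(tokenize(sent)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def tfidf_vector(sentence, idf):
    # Sparse TF-IDF vector as a dict; unseen words get a default weight of 1.0.
    tf = Counter(tokenize(sentence))
    return {w: c * idf.get(w, 1.0) for w, c in tf.items()}


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_candidates(replaced, candidates, idf, threshold=0.6):
    # Step 4: drop candidates whose similarity to the replaced sentence S is >= threshold.
    target = tfidf_vector(replaced, idf)
    return [c for c in candidates if cosine(target, tfidf_vector(c, idf)) < threshold]


def pick_intruder(kept, is_hard, rng=random):
    # Step 5: prefer candidates judged hard to detect; otherwise sample from all of them.
    hard = [c for c in kept if is_hard(c)]
    pool = hard if hard else kept
    return rng.choice(pool) if pool else None


S = "It is the second novel in the first of three trilogies about Bernard Samson"
near_duplicate = "It is the first novel in the first of three trilogies about Bernard Samson"
unrelated = "The band released a live album in 1999"
idf = idf_weights([S, near_duplicate, unrelated])
kept = filter_candidates(S, [near_duplicate, unrelated], idf)
print(kept)  # the near-duplicate is filtered out, leaving only the unrelated sentence
```

Returning `None` when no candidate survives mirrors the case where no intruder sentence can be generated for a given S, which is why slightly fewer than half of the documents end up as positive samples.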
We also trialled other ways of generating incoherent samples, such as using sentence S from document D as the query text to retrieve documents, and adopting a 2-hop process to retrieve relevant documents. We found that these methods resulted in documents which can be identified by the pretrained models easily.

Statistics of the Dataset
The process described in Section 3 resulted in 106,352 Wikipedia documents and 72,670 CNN documents, with an average length of 5 sentences in both cases (see Table 1). The percentages of positive samples (46% and 49%, respectively) are slightly less than 50% due to our data generation constraints (detailed in Section 3.3), which can lead to no candidate intruder sentence S′ being generated for original sentence S. We set aside 8% of Wikipedia (which we manually tag, as detailed in Section 4.5) and 20% of CNN for testing.

Types of Incoherence
To better understand the different types of issues resulting from our automatic method, we sampled 100 (synthesised) incoherent documents from Wikipedia and manually classified the causes of incoherence according to three overlapping categories (ranked in terms of expected ease of detection): (1) information structure inconsistency (a break in information flow); (2) logical inconsistency (a logically inconsistent world state is generated, such as someone attending school before they were born); and (3) factual inconsistency (where the intruder sentence is factually incorrect). See Table 2 for a breakdown across the categories, noting that a single document can be incoherent across multiple categories. Information structure inconsistency is the most common form of incoherence, followed by factual inconsistency. The 35% of documents with factual inconsistencies break down into 8% (overall) that have other types of incoherence, and 27% that only have a factual inconsistency. This is an issue for the fifth desideratum for our dataset (see Section 3.1), motivating the need for manual checking of the dataset to determine how readily the intruder sentence can be detected.

Evaluation Metrics
We evaluate intruder sentence detection at both the document and sentence levels:
• document level: Does the document contain an intruder sentence? This is measured based on classification accuracy (Acc), noting that the dataset is relatively balanced at the document level (see Table 1). A prediction is "correct" if at least one sentence is predicted to be an intruder for an incoherent document, or none of the sentences is predicted to be an intruder for a coherent document.
• sentence level: Is a given (non-opening) sentence an intruder sentence? This is measured based on F1, noting that most (roughly 88%) sentences are non-intruder sentences.
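The two metrics can be computed along the following lines. This is an illustrative sketch under the assumption that each document is represented as a list of 0/1 labels over its non-opening sentences (1 = intruder); the function names are ours, not from any released code.

```python
def document_accuracy(gold, pred):
    # A document counts as correct when "contains an intruder" matches:
    # at least one predicted intruder for an incoherent document, none for a coherent one.
    correct = sum(any(g) == any(p) for g, p in zip(gold, pred))
    return correct / len(gold)


def sentence_f1(gold, pred):
    # F1 over the intruder class, pooled across all non-opening sentences.
    tp = fp = fn = 0
    for g_doc, p_doc in zip(gold, pred):
        for g, p in zip(g_doc, p_doc):
            if p == 1 and g == 1:
                tp += 1
            elif p == 1 and g == 0:
                fp += 1
            elif p == 0 and g == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [[0, 1, 0], [0, 0], [0, 0, 1]]   # documents 1 and 3 are incoherent
pred = [[0, 1, 0], [0, 1], [0, 0, 0]]   # hypothetical model predictions
print(document_accuracy(gold, pred))    # 1 of 3 documents judged correctly
print(sentence_f1(gold, pred))          # tp=1, fp=1, fn=1, so F1 = 0.5
```

Note that under this definition a model can be credited at the document level while flagging the wrong sentence, which is why the sentence-level F1 is reported alongside Acc.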

Testing for Dataset Artefacts
To test for artefacts, we use XLNet-Large (Yang et al., 2019) to predict whether each non-opening sentence is an intruder sentence, in complete isolation from its containing document (i.e., as a standalone sentence classification task). We compare the performance of XLNet-Large with a majority-class baseline ("Majority-class") which predicts all sentences to be non-intruder sentences (i.e., from the original document), where XLNet-Large is fine-tuned over the Wikipedia/CNN training set, and tested over the corresponding test set. For Wikipedia, XLNet-Large obtains an Acc of 55.4% (vs. 55.1% for Majority-class) and F1 of 3.4% (vs. 0.0% for Majority-class). For CNN, the results are 50.8% and 1.2%, respectively (vs. 51.0% and 0.0% resp. for Majority-class). These results suggest that the dataset does not contain obvious artefacts, at least for XLNet-Large. We also experiment with a TF-IDF weighted bag-of-words logistic regression model, achieving slightly worse results than XLNet-Large (Acc = 55.1%, F1 = 0.05% for Wikipedia, and Acc = 50.6%, F1 = 0.3% for CNN).

Human Verification
We performed crowdsourcing via Amazon Mechanical Turk over the Wikipedia test data to examine how humans perform over this task. Each Human Intelligence Task (HIT) contained 5 documents and was assigned to 5 workers. For each document, the task was to identify a single sentence which "creates an incoherence or break in the content flow", or in the case of no such sentence, "None of the above", indicating a coherent document. In the task instructions, workers were informed that there is at most one intruder sentence per document, and were not able to select the opening sentence. Among the 5 documents for each HIT, there was one incoherent document from the training set, which was pre-identified as being easily detectable by an author of the paper, and acts as a quality control item. We retain as our test dataset the documents where at least 3 humans assign the same label (90.3% of the Wikipedia test dataset); unless otherwise specified, all results are reported over these documents. Payment was calibrated to be above the Australian minimum wage. Figure 2 shows the distribution of instances where different numbers of workers produced the correct answer (the red bars). For example, for 6.2% of instances, 2/5 workers annotated correctly. The blue bars indicate the proportion of incoherent documents where the intruder sentence was correctly detected by the given number of annotators (e.g., for 9.3% of incoherent documents, only 2/5 workers were able to identify the intruder sentence correctly). Humans tend to agree with each other over coherent documents, as indicated by the increasing percentages for red bars but decreasing percentages for blue bars across the x-axis. Intruder sentences in incoherent documents, however, are harder to detect. One possible explanation is that the identification of intruder sentences requires fact-checking, which workers were instructed not to do (they were asked to base their judgement only on the information in the provided document); another reason is that intruder sentences disrupt local coherence with neighbouring sentences, creating confusion as to which is the intruder sentence (with many of the sentence-level mis-annotations being off-by-one errors).
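The agreement-based filtering of test documents can be illustrated as below. This is a sketch under our reading of the rule (a document is kept when at least 3 of its 5 workers choose the same label); the label encoding (a sentence index, or the string "none" for "None of the above") and the function name are assumptions made for the example.

```python
from collections import Counter


def majority_label(worker_labels, min_votes=3):
    # Return the label chosen by at least `min_votes` workers, or None if there is no such label.
    label, votes = Counter(worker_labels).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical annotations: each list holds the 5 workers' choices for one document.
annotations = {
    "doc_a": [2, 2, 2, "none", 3],            # 3 workers pick sentence 2: kept
    "doc_b": ["none"] * 5,                     # unanimous "coherent": kept
    "doc_c": [1, 2, "none", "none", 3],       # no label reaches 3 votes: dropped
}
kept = {doc: majority_label(labels) for doc, labels in annotations.items()
        if majority_label(labels) is not None}
print(kept)
```

With 5 workers and a threshold of 3, at most one label can reach the threshold, so no tie-breaking is needed. The majority label is not required to match the automatically generated gold standard.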

Models
We model intruder sentence detection as a binary classification task: each non-opening sentence in a document is concatenated with the document, and a model is asked to predict whether the sentence is an intruder sentence to the document. Our focus is on the task, dataset, and how existing models perform at document coherence prediction rather than modelling novelty, and we thus experiment with pre-existing pre-trained models. The models are as follows, each of which is fed into an MLP layer with a softmax output.
BoW: Average the word embeddings for the combined document (sentence + sequence of sentences in the document), based on pretrained 300D GloVe embeddings trained on an 840B-token corpus (Pennington et al., 2014).
Bi-LSTM: Feed the sequence of words in the combined document into a single-layer 512D Bi-LSTM with average-pooling; word embeddings are initialised as with BoW.
InferSent: Generate representations for the sentence and document with InferSent (Conneau et al., 2017), and concatenate the two; InferSent is based on a Bi-LSTM with a maxpooling layer, trained on SNLI (Bowman et al., 2015).
Skip-Thought: Generate representations for the sentence and document with Skip-Thought , and concatenate the two; Skip-Thought is an encoder-decoder model where the encoder extracts generic sentence embeddings and the decoder reconstructs surrounding sentences of the encoded sentence.
BERT: Generate representations for the concatenated sentence and document with BERT (Devlin et al., 2019), which was pretrained on the tasks of masked language modelling and next sentence prediction over Wikipedia and BooksCorpus ; we experiment with both BERT-Large and BERT-Base (the cased versions).
RoBERTa: Generate representations for the concatenated sentence and document with RoBERTa (Liu et al., 2019), which was pretrained on the task of masked language modelling (with dynamic masking), with each input consisting of contiguous sentences from the same document or multiple documents (providing broader context), over CC-News, OpenWebTextCorpus, and STORIES (Trinh and Le, 2018), in addition to the same data BERT was pretrained on; we experiment with both RoBERTa-Large and RoBERTa-Base.
ALBERT: Generate representations for the concatenated sentence and document with ALBERT (Lan et al., 2020), which was pretrained over the same dataset as BERT but replaces the next sentence prediction objective with a sentence-order prediction objective, to model document coherence; we experiment with both ALBERT-Large and ALBERT-xxLarge.
XLNet: Generate representations for the concatenated sentence and document with XLNet (Yang et al., 2019), which was pretrained using a permutation language modelling objective over datasets including Wikipedia, BooksCorpus, Giga5 (Parker et al., 2011), and ClueWeb 2012-B (Callan et al., 2009). Although XLNet-Large is used in removing data artefacts when selecting the intruder sentences, our experiments suggest that the comparative results across models (with or without artefact filtering) are robust.
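The binary classification formulation described at the start of this section, where each non-opening sentence is paired with its document, can be sketched as follows. This is illustrative only: the dictionary layout and function name are assumptions, and the way each model encodes the pair (e.g. separator tokens for the pretrained LMs) is not shown.

```python
def build_instances(sentences, intruder_index=None):
    # One classification instance per non-opening sentence: (sentence, document) -> 0/1 label.
    document = " ".join(sentences)
    instances = []
    for i, sentence in enumerate(sentences):
        if i == 0:
            continue  # the opening sentence is never a candidate intruder
        instances.append({
            "sentence": sentence,
            "document": document,
            "label": int(i == intruder_index),
        })
    return instances


# Hypothetical 3-sentence document whose last sentence (index 2) is the intruder.
doc = ["Sentence one.", "Sentence two.", "An unrelated sentence."]
instances = build_instances(doc, intruder_index=2)
for inst in instances:
    print(inst["label"], inst["sentence"])
```

A coherent document (`intruder_index=None`) yields only negative instances, which is consistent with the large majority of sentence-level instances being non-intruders.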

Preliminary Results
In our first experiments, we train the various models across both Wikipedia and CNN, and evaluate them in-domain and cross-domain. We are particularly interested in the cross-domain setting, to test the true ability of the model to detect document incoherence, as distinct from overfitting to domain-specific idiosyncrasies. It is also worth mentioning that BERT, RoBERTa, ALBERT, and XLNet are pretrained on multi-sentence Wikipedia data, and have potentially memorised sentence pairs, making in-domain experiments problematic for Wikipedia in particular. Also of concern in applying models to the automatically-generated data is that it is entirely possible that an intruder sentence is undetectable to a human, because no incoherence results from the sentence substitution (bearing in mind that only 58% of documents in Table 2 contained information structure inconsistencies). From Table 3, we can see that the simpler models (BoW, Bi-LSTM, InferSent, and Skip-Thought) perform only at the level of Majority-class at the document level, for both Wikipedia and CNN. At the sentence level (F1), once again the models perform largely at the level of Majority-class (F1 = 0.0), other than Bi-LSTM in-domain for Wikipedia and CNN. In the final row of the table, we also see that humans are much better at detecting whether documents are incoherent (at the document level) than identifying the position of intruder sentences (at the sentence level), and that in general, human performance is low. This is likely because only 58% of the documents in Table 2 contain information structure inconsistencies. We only conducted crowdsourcing over Wikipedia due to budget limitations and the fact that the CNN documents are available online, making dataset hacks possible. Among the pretrained LMs, ALBERT-xxLarge achieves the best performance over Wikipedia and CNN, at both the document and sentence levels.
Looking closer at the Wikipedia results, we find that BERT-Large achieves a higher precision than XLNet-Large (71.0% vs. 60.3%), while XLNet-Large achieves a higher recall (51.3% vs. 27.4%). ALBERT-xxLarge achieves a precision higher than BERT-Large (79.7%) and a recall higher than XLNet-Large (64.9%), leading to the overall best performance. Over CNN, ALBERT-xxLarge, RoBERTa-Large, and XLNet-Large achieve high precision and recall (roughly 93.0% to 97%; see footnote 7). The competitive results for ALBERT-xxLarge over Wikipedia and CNN result from the pretraining strategies, especially the sentence-order prediction loss capturing document coherence in isolation, different from the next sentence prediction loss which conflates topic prediction and coherence prediction in a lower-difficulty single task. The performance gap for ALBERT, RoBERTa, and XLNet between the base and large models is bigger than that of BERT, suggesting that they benefit from greater model capacity (see footnote 8). We also examine how pretrained LMs perform with only the classifier parameters being updated during training. Here, we focus exclusively on ALBERT-xxLarge, given its superiority. As shown in Figure 3, the pretrained LM ALBERT-xxLarge is unable to differentiate coherent documents from incoherent ones, resulting in random guessing, although it considers document coherence during pretraining. This indicates the necessity of fine-tuning LMs for document coherence understanding.
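As a worked check on how the precision and recall figures quoted for Wikipedia combine, F1 is their harmonic mean. The snippet below simply applies that standard formula to the three precision/recall pairs stated in the text; the derived values are our own arithmetic, not figures copied from Table 3.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall (both given in percent here).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs quoted in the text for the Wikipedia in-domain setting.
quoted = {
    "BERT-Large": (71.0, 27.4),
    "XLNet-Large": (60.3, 51.3),
    "ALBERT-xxLarge": (79.7, 64.9),
}
for name, (p, r) in quoted.items():
    print(f"{name}: F1 = {f1(p, r):.1f}")
# BERT-Large: F1 = 39.5
# XLNet-Large: F1 = 55.4
# ALBERT-xxLarge: F1 = 71.5
```

The value for XLNet-Large agrees with the 55.4% F1 quoted for it elsewhere in the paper, and the comparison makes concrete why BERT-Large's higher precision does not compensate for its much lower recall.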
Looking to the cross-domain results, again, ALBERT-xxLarge achieves the best performance over both Wikipedia and CNN. The lower results for RoBERTa-Large and XLNet-Large over Wikipedia may be due to both RoBERTa and XLNet being pretrained over newswire documents, and fine-tuning over CNN reducing the capacity of the model to generalise. ALBERT and BERT do not suffer from this as they are not pretrained over newswire documents. The substantial drop between the in- and cross-domain settings for ALBERT, RoBERTa, XLNet, and BERT indicates that the models have limited capacity to learn a generalised representation of document coherence, in addition to the style differences between Wikipedia and CNN.
Footnote 7: The higher performance for all models/humans over the CNN dataset indicates that it is easier for models/humans to identify the presence of intruder sentences. This can be explained by the fact that a large proportion of documents include named entities, making it easier to detect the intruder sentences. In addition, the database used to retrieve candidate intruder sentences is smaller compared to that of Wikipedia.
Footnote 8: We also performed experiments where the models were allowed to predict the first sentence as the intruder sentence. As expected, model performance drops, e.g., F1 of XLNet-Large drops from 55.4% to 47.9%, reflecting both the increased complexity of the task and the lack of (at least) one previous sentence to provide document context.

Results over the Existing Dataset
We also examine how ALBERT-xxLarge performs over the coarse-grained dataset of , where 50 documents from each domain were annotated by a native English speaker. Performance is measured at the document level only, as the dataset does not indicate which sentence is the intruder sentence. As shown in Table 4, ALBERT-xxLarge achieves an Acc of 96.8% over the Wikipedia subset, demonstrating that our Wikipedia dataset is more challenging (Acc of 81.7%) and also underlining the utility of adversarial filtering in dataset construction. Given the considerably lower results over Ubuntu, one could conclude that it is a good source for a dataset. However, when one of the authors attempted to perform the task manually, they found the document-level task to be extremely difficult, as it relied heavily on expert knowledge of Ubuntu packages, much more so than document coherence understanding.
In the cross-domain setting, there is a substantial drop over the Wikipedia dataset. This can be explained by ALBERT-Large failing to generate a representation of document coherence from the Ubuntu dataset, due to the high dependence on domain knowledge described above, resulting in near-random results. The cross-domain results for ALBERT-xxLarge over Ubuntu are actually marginally higher than the in-domain results but still close to random, suggesting that the in-domain model is not able to capture either document coherence or domain knowledge, and underlining the relatively minor role of coherence for the Ubuntu dataset.

Performance on Documents of Different Difficulty Levels
One concern with our preliminary experiments was whether the intruder sentences generate genuine incoherence in the information structure of the documents. We investigate this question by breaking down the results for the best-performing model (ALBERT-xxLarge) based on the level of agreement between the human annotations and the generated gold-standard, for Wikipedia. The results are in Figure 3, where the x-axis denotes the number of annotators who agree with the gold-standard: for example, "2" indicates that 2/5 annotators were able to assign the gold-standard labels to the documents.
Our assumption is that the incoherent documents which humans fail to detect are actually not perceptibly incoherent, 9 and that any advantage for the models over humans for documents with low agreement (with respect to the gold-standard) is actually due to dataset artefacts. At the document level (Acc), there is reasonable correlation between model and human performance (i.e., the model struggles on the same documents as the humans). At the sentence level (F1), there is less discernible difference in model performance over documents of varying human difficulty.
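The breakdown by agreement level described above can be illustrated with a minimal sketch. It assumes each document is reduced to a pair of (number of annotators agreeing with the gold-standard, whether the model's document-level prediction is correct); the function name `accuracy_by_agreement` and the toy records are hypothetical and are not taken from the paper's data or code.

```python
from collections import defaultdict

def accuracy_by_agreement(records):
    """Group documents by annotator agreement and compute model Acc (%) per group.

    records: iterable of (n_agree, model_correct) pairs, where n_agree is the
    number of annotators (0-5) matching the gold-standard and model_correct is
    a bool for the model's document-level prediction. (Assumed input format.)
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for n_agree, model_correct in records:
        totals[n_agree] += 1
        correct[n_agree] += int(model_correct)
    # Sort keys so the output follows the x-axis ordering of an agreement plot.
    return {k: 100.0 * correct[k] / totals[k] for k in sorted(totals)}

# Toy records for illustration only (not real results).
toy_records = [(0, False), (0, True), (2, True), (5, True),
               (5, True), (5, False), (2, False)]
acc_per_level = accuracy_by_agreement(toy_records)
```

If model accuracy rises with the agreement level in such a table, the model tends to struggle on the same documents as the annotators.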

Analysis over Documents with High Human Agreement
To understand the relationship between human-assigned labels and the gold-standard, we further examine documents where all 5 annotators agree, noting that human-assigned labels can potentially be different from the gold-standard here. Table 5 shows the human annotation statistics over these documents, with regard to whether there is an intruder sentence in the documents. Encouragingly, we can see that humans tend to agree more over coherent documents (documents without any intruder sentences) than incoherent documents (documents with an intruder sentence). Examining the 11 original coherent documents which were annotated as incoherent by all annotators, we find that there is a break in information flow due to references or URLs, even though there is no intruder sentence. For documents with an intruder sentence (+intruder) where humans disagree with the gold-standard (i.e., humans perceive the document as coherent, or identify a sentence other than the actual intruder sentence), we find that 98% of the documents are considered to be coherent. We randomly sampled 100 documents from these documents and examined whether the intruder sentence results in a break in information flow. We find that fact-checking is needed to identify the intruder sentence for 93% of the documents. 10

Table 6 shows the performance over the Wikipedia documents that are annotated consistently by all 5 annotators (from Table 5). Consistent with the results from Table 3, ALBERT-xxLarge achieves the best performance both in- and cross-domain. To understand the different behaviours of humans and ALBERT-xxLarge, we analyse documents which only humans got correct, only ALBERT-xxLarge got correct, or neither humans nor ALBERT-xxLarge got correct, as follows:

1. Humans only: 7 incoherent (+intruder) and 73 coherent (−intruder) documents.
2. ALBERT-xxLarge only: 181 incoherent (+intruder) documents (of which we found 97% to require fact-checking 11) and 9 coherent (−intruder) documents (of which 8 contain URLs/references, which confused humans).
3. Neither humans nor models: 223 incoherent (+intruder) documents (of which 98.2% and 77.1% were predicted to be coherent by humans and ALBERT-xxLarge, respectively, and for the remainder, the wrong intruder sentence was identified) and 2 coherent (−intruder) documents (both of which were poorly organised, confusing all-comers).

Looking over the incoherent documents which require fact-checking, no obvious differences are discernible between the documents that ALBERT-xxLarge predicts correctly and those it misses.
Our assumption here is that ALBERT-xxLarge is biased by the pretraining dataset, and that many of the cases where it makes the wrong prediction are attributable to mismatches between the text in our dataset and the Wikipedia version used in pretraining the model.
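The three-way breakdown into "humans only", "ALBERT-xxLarge only", and "neither" amounts to simple set operations over document identifiers. The following is a minimal sketch under that assumption; the function name `partition_by_correctness` and the document ids are hypothetical.

```python
def partition_by_correctness(doc_ids, human_correct, model_correct):
    """Split documents by who labelled them correctly.

    doc_ids: set of all document ids under analysis.
    human_correct / model_correct: subsets of doc_ids that humans / the model
    got right. (Assumed representation, for illustration only.)
    """
    return {
        "humans_only": human_correct - model_correct,
        "model_only": model_correct - human_correct,
        "neither": doc_ids - human_correct - model_correct,
    }

# Toy ids for illustration only.
docs = {"d1", "d2", "d3", "d4", "d5"}
groups = partition_by_correctness(docs,
                                  human_correct={"d1", "d2"},
                                  model_correct={"d2", "d3"})
```

Documents that both humans and the model label correctly (here "d2") fall outside all three groups, which matches the focus of the analysis on disagreements and joint failures.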

Question Revisited
Q1: Do models truly capture the intrinsic properties of document coherence?
A: It is certainly true that models which incorporate a more explicit notion of document coherence into pretraining (e.g. ALBERT) tend to perform better. In addition, larger-context models (RoBERTa) and robust training strategies (XLNet) during pretraining are also beneficial for document coherence understanding. This suggests a tentative yes, but there were equally instances of strong disagreement between human intuitions and model predictions for the better-performing models, and evidence to suggest that the models were performing fact-checking at the same time as coherence modelling.

All the probes generate syntactically correct sentences, and the first four generally lead to sentences which are also semantically felicitous, with the incoherence being at the document level. For example, in "He was never convicted and was out on parole within a few years", if we replace "he" with "she", the sentence is felicitous, but if the focus entity in the preceding and subsequent sentences is a male, the information flow will be disrupted.
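To make the gender example concrete, the following is a minimal sketch of a pronoun swap. It is an illustration only and not necessarily how the probe dataset was constructed: the mapping `MASC_TO_FEM` and the function `swap_gender` are hypothetical, and only the masculine-to-feminine direction is covered, since the reverse direction is ambiguous ("her" can correspond to either "him" or "his").

```python
import re

# Hypothetical mapping, for illustration; a real probe may use a richer lexicon.
MASC_TO_FEM = {"he": "she", "him": "her", "his": "her", "himself": "herself"}
_PATTERN = re.compile(r"\b(" + "|".join(MASC_TO_FEM) + r")\b", re.IGNORECASE)

def swap_gender(sentence: str) -> str:
    """Replace masculine pronouns with feminine ones, keeping capitalisation."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = MASC_TO_FEM[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(repl, sentence)

probe_sentence = swap_gender(
    "He was never convicted and was out on parole within a few years."
)
```

The swapped sentence remains well-formed in isolation; the mismatch only becomes visible against surrounding sentences that refer to a male entity.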
The last four linguistic probes are crafted to explore the capacity of a model to capture commonsense reasoning, in terms of discourse relationships, tense and polarity awareness, and understanding of numbers. For Conjunction, we only focus on explicit connectives within a sentence. For Past to Future, there can be intra-sentence inconsistency if there are time-specific signals, failing which broader document context is needed to pick up on the tense flip. Similarly for Negation and Number, the change can lead to inconsistency either intra- or inter-sententially. For example, "He did not appear in more than 400 films between 1914 and 1941 ..." is intra-sententially incoherent. Table 7 lists the performance of pretrained LMs at recognising intruder sentences within incoherent documents, with and without the addition of the respective linguistic probes. 13

Experimental Results
For a given model, we break down the results across probes into two columns: the first column ("F1") shows the sentence-level performance over the original intruder sentence (without the addition of the linguistic probe), and the second column ("∆F1") shows the absolute difference in performance with the addition of the linguistic probe. Our expectation is that results should improve on average with the inclusion of the linguistic probe (i.e. ∆F1 values should be positive), given that we have reinforced the incoherence generated by the intruder sentence.
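The relationship between the two columns can be sketched as follows. The exact definition of sentence-level F1 is not restated in this section, so the sketch assumes one plausible reading: a prediction counts as a true positive only if the predicted intruder position matches the gold position, a prediction of a wrong position counts as a false positive, and a missed or misplaced gold intruder counts as a false negative. The function name `sentence_f1` and the toy predictions are hypothetical.

```python
from typing import Optional, Sequence

def sentence_f1(gold: Sequence[Optional[int]],
                pred: Sequence[Optional[int]]) -> float:
    """Sentence-level F1 (%) over intruder positions (assumed definition).

    gold[i] / pred[i] is the intruder position in document i, or None if the
    document is (predicted to be) coherent.
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g is not None and p == g)
    fp = sum(1 for g, p in pairs if p is not None and p != g)
    fn = sum(1 for g, p in pairs if g is not None and p != g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)

# Toy example: four incoherent documents, before and after adding a probe.
gold_positions = [2, 4, 1, 3]
pred_original = [2, None, 1, 5]    # second document missed, fourth misplaced
pred_with_probe = [2, 4, 1, 5]     # the probe makes the second intruder detectable

f1_original = sentence_f1(gold_positions, pred_original)
f1_with_probe = sentence_f1(gold_positions, pred_with_probe)
delta_f1 = f1_with_probe - f1_original  # positive if the probe helps
```

Under this reading, the score with the probe is the sum of the two columns, which is why a sum close to 100 corresponds to near-perfect detection.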
All models achieve near-perfect results with Gender linguistic probes (i.e. the sum of F1 and ∆F1 is close to 100), and are also highly successful at detecting Animacy mismatches and Past to Future (the top half of Table 7). For the probes in the bottom half of the table, none of the models except ALBERT-xxLarge performs particularly well, especially for Demonstrative. For each linguistic probe, we observe that the pretrained LMs can more easily detect incoherent text with the addition of these lexical/grammatical inconsistencies (except for XLNet-Large and ALBERT-xxLarge over Demonstrative and ALBERT-xxLarge over Conjunction).
In the cross-domain setting, the overall performance of XLNet-Large CNN and ALBERT-xxLarge CNN drops across all linguistic probes, but the absolute gain through the inclusion of the linguistic probe is almost universally larger, suggesting that while domain differences hurt the models, they are attuned to the impact of linguistic probes on document coherence and thus learning some more general properties of document (in)coherence. On the other hand, BERT-Large CNN (over Gender, Animacy↓, and Animacy↑) and RoBERTa-Large CNN (over Gender and Animacy↑) actually perform better than in-domain. RoBERTa-Large CNN achieves the best overall performance over Gender, Animacy↑, and Number, while ALBERT-xxLarge CNN achieves the best overall performance over Past to Future, Conjunction, Demonstrative, and Negation. The reason that the models tend to struggle with Demonstrative and Conjunction is not immediately clear, and will be explored in future work.

13 Results for coherent documents are omitted due to space.

We also conducted human evaluations on this dataset via Amazon Mechanical Turk, based on the same methodology as described in Section 4.5 (without explicit instruction to look out for linguistic artefacts, and with a mixture of coherent and incoherent documents, as per the original annotation task). As detailed in Table 7, humans generally benefit from the inclusion of the linguistic probes. Largely consistent with the results for the models, humans are highly sensitised to the effects of Gender, Animacy, Past to Future, and Negation, but largely oblivious to the effects of Demonstrative and Conjunction. Remarkably, the best models (ALBERT-xxLarge and RoBERTa-Large) perform on par with humans in the in-domain setting, but are generally well below humans in the cross-domain setting.

Conclusion
We propose the new task of detecting whether there is an intruder sentence in a document, generated by replacing an original sentence with a similar sentence from a second document. To benchmark model performance over this task, we construct a large-scale dataset consisting of documents from English Wikipedia and CNN news articles. Experimental results show that pretrained LMs which incorporate larger document contexts in pretraining perform remarkably well in-domain, but experience a substantial drop cross-domain. In follow-up analysis based on human annotations, substantial divergences from human intuitions were observed, pointing to limitations in the ability of the models to capture document coherence. Further results over a linguistic probe dataset show that pretrained models fail to identify some linguistic characteristics that affect document coherence, suggesting that there is room for improvement before they can truly capture document coherence, and motivating the construction of a dataset with intruder text at the intra-sentential level.