Efficient Long-Text Understanding with Short-Text Models

Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles, and long documents due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.


Introduction
Transformer-based pretrained language models (Vaswani et al., 2017; Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020b; Brown et al., 2020) have been widely successful across all areas of natural language understanding (NLU). However, applying them over long texts (such as stories, scripts, or scientific articles) is prohibitive due to their quadratic complexity in the input length. To bridge this gap, recent work has developed more efficient transformer variants (Kitaev et al., 2020a; Beltagy et al., 2020; Zaheer et al., 2020a; Guo et al., 2022) and applied them over long inputs.

Figure 1: SCROLLS scores (Shaham et al., 2022) as a function of parameter count. Plugging existing pretrained LMs into the SLED framework dramatically improves their SCROLLS score (arrows from blue circles to orange stars). Gray triangles indicate models with dedicated pretraining for capturing long-range dependencies. BART-large-SLED is competitive with LongT5-base (Guo et al., 2022) and UL2 (Tay et al., 2022b) (which has 50x more parameters), and slightly lags behind larger LongT5 models.
However, most efficient transformers use specialized architectures with custom implementations that are not guaranteed to scale as well as vanilla transformers (Tay et al., 2022a). Moreover, they require an expensive pretraining step and do not exploit off-the-shelf pretrained LMs that were trained for short texts. To date, their performance on long texts has not matched the success of their short-range counterparts.
In this work, we present SLED: SLiding-Encoder and Decoder, a simple yet powerful method for applying off-the-shelf pretrained encoder-decoder models to long text problems, with a linear time and space dependency. Specifically (see Fig. 2), we partition long documents into overlapping chunks of tokens of constant length and encode each chunk independently with an already-pretrained encoder. Then, a pretrained decoder attends to all contextualized input representations to generate the output. Our main assumption is that input tokens can be contextualized through their local surroundings (using a short-text LM), and any global cross-chunk reasoning can be handled by the decoder, similar to fusion-in-decoder (FiD) (Izacard and Grave, 2021). Our approach can be readily applied to any pretrained encoder-decoder LM such as T5 (Raffel et al., 2020b) and BART (Lewis et al., 2020) (but is not applicable to decoder-only (Brown et al., 2020) or encoder-only models (Liu et al., 2019; Conneau et al., 2020)).
We evaluate SLED on a wide range of language understanding tasks. To substantiate SLED's adequacy for text processing, we perform controlled experiments over modified versions of SQuAD 1.1 (Rajpurkar et al., 2016) and HotpotQA (Yang et al., 2018) to show that SLED can (a) find relevant information that is embedded within a long text sequence and (b) fuse information from chunks that were encoded separately.
Our main evaluation is over SCROLLS, a recently-released benchmark that includes 7 long-range tasks across Question Answering (QA), Summarization, and Natural Language Inference (NLI). We show (Fig. 1) that taking a pretrained encoder-decoder model, such as BART (Lewis et al., 2020) or T5 (Raffel et al., 2020b), and embedding it into SLED's framework results in a dramatic improvement in performance (6 points on average across models). Moreover, BART-large-SLED's performance is comparable to LongT5-base (Guo et al., 2022), a model that was specifically pretrained to handle long-range dependencies, and surpasses UL2 (Tay et al., 2022b), which contains 50x more parameters. Importantly, SLED-based models can use any future pretrained LM out-of-the-box without requiring additional pretraining to further improve performance.
Due to its simplicity, SLED can also be used as a diagnostic tool for analyzing long-range benchmarks. We analyze the seven datasets in SCROLLS through the lens of SLED and show which datasets require the input to be contextualized with remote tokens. Specifically, we find that in QA and NLI tasks, relatively local contextualization is sufficient for high performance.
While SLED is similar to FiD from a technical standpoint, past usage of FiD has centered around open-domain question answering (Izacard and Grave, 2021), where unrelated passages are naturally encoded independently. Here, we test fusion-in-decoder on long documents, where local encoding of chunks is a modeling assumption that needs testing. In recent work, Vig et al. (2022) proposed a similar architecture to tackle long inputs from QMSum (Zhong et al., 2021), but did not systematically analyze it. We standardize this methodology for the first time, and extensively analyze the effectiveness of FiD for encoding long documents across multiple tasks.
To summarize, our main contributions are:
1. We present SLED, a simple and effective approach for processing long texts that leverages off-the-shelf encoder-decoder LMs based on fusion-in-decoder.
2. We demonstrate SLED's efficacy both in controlled experiments and on the SCROLLS benchmark, where it achieves results competitive with specialized models that have up to 50x more parameters.
3. We use SLED as a diagnostic tool for analyzing the long-range properties of datasets in the SCROLLS benchmark.
4. We provide an open-source implementation of SLED, seamlessly integrated into the Transformers library (Wolf et al., 2020).

Background
Recent advances in natural language processing have been by and large fueled by the transformer architecture (Vaswani et al., 2017). A core component of the transformer is the self-attention layer, where every input token "attends" to every other token to produce its contextualized representation. This results in quadratic time and space dependency w.r.t. the length of the input, limiting the ability of transformers to process long sequences. This long-text limitation has sparked ample interest in developing efficient transformer variants. One prominent family of methods is based on sparse attention, where each token attends to a constant number of other tokens, overcoming the quadratic dependency. Tokens typically attend either to their local surroundings (Zaheer et al., 2020a; Beltagy et al., 2020; Ainslie et al., 2020; Gupta and Berant, 2020) or to tokens that are semantically similar (Kitaev et al., 2020b; Roy et al., 2021). Moreover, a constant number of global tokens that attend to, and are attended by, all input tokens are often added to each attention sub-layer. Recent analyses (Xiong et al., 2022a) have shown that sparse transformers with local attention are competitive with other variants on multiple language understanding tasks.
Our method, SLED, falls into the family of local attention variants. However, unlike prior work, SLED re-uses and extends existing short-range encoder-decoder models, and does not require specialized pretraining or dedicated CUDA implementations.
In most local attention variants, e.g., LED (Beltagy et al., 2020), attention is local per layer, but the receptive field of tokens grows across layers. In SLED, which we describe next, tokens have access to the same number of tokens independent of a layer's depth, which enables better parallelization. For a survey of the families of efficient transformers, see Tay et al. (2020). For an in-depth comparison of SLED and LED, we refer to Appendix B.
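The contrast can be made concrete with a back-of-the-envelope calculation (the window and layer counts below are illustrative, not LED's actual configuration):

```python
def local_attention_receptive_field(window: int, layers: int) -> int:
    # With per-layer local attention (as in LED), each layer widens a
    # token's receptive field by roughly one window, so it grows with depth.
    return layers * window + 1

def sled_receptive_field(chunk_size: int, layers: int) -> int:
    # With chunking, a token only ever attends within its own chunk,
    # regardless of how deep the encoder is.
    return chunk_size

# A 12-layer encoder with a 256-token window vs. 256-token chunks:
print(local_attention_receptive_field(256, 12))  # 3073 tokens
print(sled_receptive_field(256, 12))             # 256 tokens
```

Because the per-chunk computation is independent of depth and of other chunks, all chunks can be encoded in parallel.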

Method
In this work, we propose a simple approach for avoiding the transformer's quadratic complexity, motivated by the Locality of information assumption: in an encoder-decoder architecture, the encoder can effectively contextualize input tokens with local context only, leaving long-range dependencies to be handled by the decoder. SLED relies on this modeling assumption to encode shorter chunks independently and to fuse information in the decoder (Izacard and Grave, 2021). We now describe the SLED model in detail.
Input SLED uses a pretrained encoder-decoder model M as a backbone. SLED receives a tokenized document of length n (blue squares in Fig. 2), and an optional short tokenized prefix of length m ≪ n, typically representing a question about the document, an instruction to perform some generation task, or a hypothesis (orange squares in Fig. 2). Unlike static task-specific prefixes (e.g., "summarize"), SLED also supports sample-specific prefixes that are part of the input (e.g., the question in QA datasets).
Steps SLED proceeds in the following steps: (a) Document tokens are split into C chunks of length c (in Fig. 2, c = 4). The middle (1 − ρ) × c tokens in each chunk are contextualized from both the left and right by p := ρ · c / 2 tokens, where ρ ∈ [0, 0.5] (ρ = 0.5 in Fig. 2). We call these middle tokens the effective chunk, since they will constitute the output of the encoder, and term the tokens on each side context padding. (b) The prefix is prepended to each chunk, contextualizing the chunk tokens with it. (c) Each chunk is encoded independently with the backbone encoder, M_enc. (d) The representations of the effective chunk tokens are kept and concatenated, yielding contextualized representations for all n document tokens. (e) To give the decoder access to prefix tokens, we encode the prefix tokens with M_enc, and prepend the result to the contextualized representation (leftmost chunk in Fig. 2(a)-(d)).
(f) Finally, we generate the output with the backbone decoder, M_dec, which uses standard cross-attention over the m + n encoded tokens (Fig. 2(e)).
SLED requires handling a few edge cases, namely the first and last chunks, which do not have bidirectional context. We refer to Appendix A for these details.
SLED's Complexity SLED divides an input of length n into C chunks of size c. Since ρ ∈ [0, 0.5], it follows that C ∈ [n/c, 2n/c]. While the complexity of encoding each chunk is quadratic in c due to self-attention, c ≪ n is a constant, and thus the memory and compute dependency is linear in n. In particular, the complexity of encoding the input with a model of l attention layers is O(l · C · c²) = O(l · n · c). Decoding is done as proposed by Vaswani et al. (2017), thus requiring O(nk + k²) memory. Assuming a constant output sequence length k ≪ n, this remains linear in n.

Efficacy of Fusion in Decoder
As mentioned (§3), SLED relies on the assumption that chunks can be encoded independently and that fusion across them can be delegated to the decoder (the Locality of information assumption). This is similar to the Fusion-in-Decoder approach, introduced by Izacard and Grave (2021) for open-domain question answering (ODQA). However, there, the encoder-decoder receives a set of independent passages and needs to generate an answer that can typically be extracted from a single passage. Here, we extend the scope of FiD by applying it over a single, long, and coherent input that potentially requires global contextualization.
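Mechanically, delegating fusion to the decoder only changes what the decoder cross-attends to: each chunk is encoded independently (prefixed by the question), only the effective tokens' representations are kept, and everything is concatenated behind a single encoding of the prefix. A toy sketch with a stub in place of the pretrained encoder M_enc (this simplified version ignores the last-chunk alignment of Appendix A):

```python
def stub_encoder(tokens):
    # Stand-in for a pretrained encoder M_enc: maps each token to a
    # "hidden state" (here, a trivial 1-dimensional representation).
    return [[float(t)] for t in tokens]

def encode_long_input(prefix, document, c=4, rho=0.5):
    """Encode the prefix plus overlapping document chunks, then concatenate.

    Returns one hidden state per prefix token followed by one per document
    token; the decoder would cross-attend over all m + n of them.
    """
    p = int(rho * c / 2)              # context padding per side
    step = c - 2 * p                  # effective tokens per chunk
    states = stub_encoder(prefix)     # step (e): encode the prefix once
    for eff_start in range(0, len(document), step):
        start = max(eff_start - p, 0)
        chunk = document[start:start + c]
        # Prepend the prefix and encode the chunk independently.
        hidden = stub_encoder(list(prefix) + chunk)
        # Keep only the effective tokens' representations.
        lo = len(prefix) + (eff_start - start)
        take = min(step, len(document) - eff_start)
        states.extend(hidden[lo:lo + take])
    return states

out = encode_long_input(prefix=[101, 102], document=list(range(10)))
assert len(out) == 2 + 10             # m + n encoded states
```

In SLED proper, the decoder then generates the output with standard cross-attention over these m + n states.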
To demonstrate the viability of FiD for long-text language tasks, we design two controlled experiments that quantify the extent to which FiD can perform two operations at the heart of long-text processing. First, can FiD find a "needle in a haystack", i.e., locate a short piece of information embedded in long text while disregarding irrelevant information? Second, can FiD "piece the puzzle together" and fuse two pieces of information that are encoded independently when generating an output?

Needle in a haystack
To check whether SLED can ignore irrelevant text and locate a single piece of information, we cast SQuAD 1.1 (Rajpurkar et al., 2016) as a sequence-to-sequence task with long input. SQuAD is a question answering dataset where, given a question-paragraph pair, the goal is to generate the answer (which lies within the paragraph). For each question-paragraph pair, we randomly sample 9 other paragraphs from the dataset and concatenate them to form a long document. We then finetune and evaluate our models in two settings: (a) Ordered Distractors: the gold paragraph is the first one, and all other distractors are concatenated after it; (b) Shuffled Distractors: we randomly shuffle the order of all paragraphs, so the answer can be anywhere in the input document. Since this is a QA task, the prefix is the question.
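The construction of the two settings can be sketched as follows (the paragraph strings are placeholders; the actual preprocessing operates on tokenized SQuAD examples):

```python
import random

def build_long_example(gold, distractors, shuffle, seed=0):
    """Concatenate the gold paragraph with 9 sampled distractor paragraphs.

    gold: the paragraph containing the answer; distractors: a pool of
    unrelated paragraphs. In the Ordered setting the gold paragraph comes
    first; in the Shuffled setting its position is random.
    """
    rng = random.Random(seed)
    paragraphs = [gold] + rng.sample(distractors, 9)
    if shuffle:
        rng.shuffle(paragraphs)
    return " ".join(paragraphs)

pool = [f"distractor paragraph {i}." for i in range(50)]
ordered = build_long_example("gold paragraph.", pool, shuffle=False)
assert ordered.startswith("gold paragraph.")
```

The question itself is passed separately as the prefix, so the model must locate the gold paragraph inside the long concatenation.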
We use BART-base (Lewis et al., 2020) as our backbone model, M, throughout §4, and compare SLED to an oracle BART-base that is given the gold paragraph only, with no distractor paragraphs. This is an oracle setup since BART-base can take 1,024 tokens as input and all gold paragraphs are shorter. If SLED can match the oracle performance, we can infer that the decoder can indeed find a needle in a haystack. In addition, we compare SLED to a BART-base that is given only the first 1K tokens, and to LED (Beltagy et al., 2020), which uses local sparse attention, similar to SLED (LED has the same BART-base backbone). However, as explained in §2, the receptive field of LED layers grows linearly with the number of layers, and thus information can be fused in the encoder, unlike SLED, where cross-chunk fusion must be delegated to the decoder. Last, for QA tasks, LED defines the question tokens as global tokens, and as an additional sanity test we evaluate LED-L, i.e., a local LED model where no global tokens are used. For both LED and SLED we use a chunk size of c = 256.

Results Fig. 3(a) shows the results of our evaluation on the development set. SLED almost matches the performance of an oracle BART-base that is not given any distractor paragraphs, reaching an F1 score of 87.6 compared to the oracle F1 of 88.1 (horizontal line in the figure). LED also achieves high performance (though lower than SLED in the shuffled setup), showing that both models learn to ignore distracting information and find a needle in a haystack. As expected, both LED-L and BART suffer a significant drop in performance when the passages are shuffled, as the gold paragraph is not contextualized with the question.

Piecing a puzzle
We now verify that SLED can fuse pieces of information from different chunks. To this end, we modify HotpotQA (Yang et al., 2018), a multi-hop question answering dataset, in which every question relies on two pieces of information (located in different paragraphs). While in the original setting each input in HotpotQA has two gold paragraphs and 8 distractor paragraphs, we include only the two gold paragraphs in our experiments.

Figure 4: F1 results on our HotpotQA development set (Yang et al., 2018). (a) SLED reaches an F1 that is close to the oracle BART-base (horizontal line), outperforming a model with access to the paragraph that contains the answer ("second paragraph"). This shows that SLED effectively fuses information from two chunks. See text for further explanation on each model. (b) Ablations on SLED's architecture; see §4.3 for details.
To ensure SLED and LED encode the two relevant pieces of information in separate chunks, we set the chunk size to c = 128. Similar to §4.1, we compare SLED to an oracle BART-base with full attention over 1,024 tokens, to LED, and to LED-L. Finally, past work has shown that many examples in HotpotQA can be answered with access only to the "second" gold paragraph, which contains the answer (Jiang and Bansal, 2019). Thus, we also evaluate a BART model that is given the second passage only.
Results Fig. 4(a) shows that SLED's decoder can indeed effectively fuse information from two separately encoded chunks, reaching an F1 of 76.5, slightly lower than the oracle F1 of 78.6. Notably, SLED substantially outperforms a BART model with access to the entire second paragraph, showing that information is fused by the decoder. LED slightly outperforms SLED, but when denied access to global tokens (LED-L) its performance drops sharply. This shows that the large receptive field of deep LED layers does not suffice for information fusion, and that interaction between the question and text is crucial for the decoder.
To summarize, our two controlled experiments show that SLED can perform the operations of retrieving and fusing information, which are fundamental for long-text language tasks.

Ablations of design choices
We leverage our controlled experimental setup to further investigate the components of SLED.
Efficacy of the encoder While §4.2 shows that SLED can fuse separate pieces of information in the decoder, it is not clear to what extent local contextualization is necessary. To check whether it is possible for all fusion to occur in the decoder, we finetune SLED with a chunk size of c = 1, such that input tokens do not observe any context in the encoder. As can be seen in the leftmost bar(s) in Fig. 3(b) and Fig. 4(b), performance drops substantially, showing that local contextualization in the encoder is indeed necessary.

Contextualizing chunks with a prefix As explained, SLED does not use global tokens, but instead contextualizes each chunk with a prepended prefix. To verify its necessity, we finetune a SLED model that treats the prefix as another chunk and does not prepend it to document chunks. The second bar(s) in Fig. 3(b) and Fig. 4(b) shows a significant drop in performance for all settings, suggesting the prefix is needed during encoding.
As expected, there is practically no difference between the Ordered and Shuffled settings in Fig. 3(b). In contrast, LED-L, which is similar in concept (due to the lack of global tokens), shows a significant drop when paragraphs are shuffled. This shows the possible effectiveness of the increased receptive field in LED, but only when the gold paragraph is relatively close to the prefix.
Encoding the prefix After showing the prefix is crucial for the encoder, we ask whether the decoder needs direct access to the prefix, or whether relevant information from the prefix can be infused into the chunk representations. To test this, we finetune SLED as usual, but remove the prefix tokens from the final representation given to the decoder. The rightmost bar(s) in Fig. 3(b) and Fig. 4(b) shows that providing the decoder with prefix representations makes little difference, if any at all, suggesting that the encoder can indeed infuse the important information from the prefix into the encoded document tokens.

Experiments
We evaluate SLED on SCROLLS (Shaham et al., 2022), a recently-proposed benchmark for evaluating long text understanding. SCROLLS contains seven datasets that span three different language understanding tasks:
1. Summarization: GovReport (Huang et al., 2021), SummScreenFD, and QMSum (Zhong et al., 2021).
2. Question answering: Qasper, NarrativeQA, and QuALITY (Pang et al., 2022). For all QA datasets, we set the question as the prefix. For QuALITY, we consider the four answer options part of the question.
3. Natural language inference: ContractNLI (Koreeda and Manning, 2021) contains short legal hypotheses (set as the prefix) and legal documents as the premise. Models are tasked to predict whether the premise entails, contradicts, or is neutral w.r.t. the hypothesis.
For each task, we use the official evaluation metrics defined in SCROLLS, which are based on the metrics from the original datasets.

Settings
We evaluate SLED with both BART (Lewis et al., 2020) and T5 (Raffel et al., 2020b) as backbone models. For each backbone model, we compare the performance of SLED, which can consume long sequences, against the backbone model alone, fed with the first 1,024 tokens. For comparison, we also finetune LED-base. In all SLED and LED experiments, we use a maximal sequence length of 16K tokens and a chunk size of 256 to allow for a fair evaluation.
For each model-dataset pair, we run hyperparameter tuning (detailed in Appendix C) based on the development set. Additionally, we submit generated predictions over the test set to the SCROLLS leaderboard (https://www.scrolls-benchmark.com/leaderboard), and compare to the reported performance of other models at the time of submission.

Results
Tab. 1 reports results over the SCROLLS development and test sets. Taking short-range pretrained LMs like BART and T5 and casting them into SLED's framework allows them to process long documents effectively, improving the average SCROLLS score by 4.8-7 points. Examining BART-base-SLED, we see a large improvement compared to LED-base (33.6→35.4), and competitive performance on multiple tasks compared to LongT5-base and UL2. Moreover, adding SLED to BART-large results in a high-performing model with results that are comparable to LongT5-base and outperform UL2, despite UL2's large parameter count (50x larger), and with no need for expensive pretraining geared towards long-range tasks. BART-large-SLED's performance is moderately lower than the larger LongT5 models.
Barring QuALITY, SLED significantly improves performance across all tasks compared to the corresponding backbone models. All summarization datasets (GovReport, SummScreenFD, and QMSum) show impressive gains of up to 35% compared to their baseline scores, across all metrics (Rouge-1/Rouge-2/Rouge-L (Lin, 2004)) and for all three backbone models. Similarly, on ContractNLI (Koreeda and Manning, 2021) we see large relative improvements. As the performance of the baseline models was already high, this boost in performance is even more significant. Finally, the QA datasets Qasper and NarrativeQA show the largest gains, improving by an average of 60%.
QuALITY In stark contrast to the other datasets lies the multi-choice QA dataset QuALITY (Pang et al., 2022). While the performance of BART-large-SLED is above chance, it barely improves over its backbone model (BART-large), which observes only the first 1K tokens, with a similar trend for other backbone models. Analyzing test scores in Tab. 1, we see that increasing model size consistently improves performance (up to 46% exact match), but increasing input length has a negligible effect. Since reported human accuracy on QuALITY is high (93.5%), this hints that QuALITY might require commonsense reasoning and knowledge that are absent from models with a lower parameter count.
Summary We have shown that taking off-the-shelf pretrained LMs and embedding them into SLED leads to competitive performance on SCROLLS. Importantly, any future pretrained LM can be easily plugged into SLED, without the need for an expensive pretraining step.

Datasets analysis
SLED's simplicity and modularity allow it to be used as a tool for dataset analysis. Specifically, we can vary the chunk size, c, and the number of tokens, n, across datasets to analyze (a) how local the individual pieces of relevant information are, and (b) how far into the document they are located.
Locality of information SLED relies on the assumption that information can be contextualized locally at encoding time. To analyze locality, we vary the chunk size, c, which defines the attention window, and measure the effect on SCROLLS datasets with input length 16K. Fig. 5 shows the results of this experiment, where the y-axis shows the relative improvement compared to BART-base on a target metric as a function of the chunk size c for all datasets. We observe that in all datasets the best-performing chunk size is relatively small (up to 256), and further increasing c even hurts performance in some cases. However, the summarization datasets show a much larger gain in performance when increasing c up to that threshold. This coincides with a common hypothesis that QA and NLI require relatively local context, and thus increasing c can add noise and hurt optimization, while summarization may require a more high-level view of the information.
Distance from start of document We now analyze whether the entire document is indeed required for tasks in SCROLLS by varying the maximum document length, n. Fig. 6 shows the results of this experiment, where the y-axis shows the relative improvement of BART-base-SLED compared to BART-base as a function of the first n tokens from the document (chunk size c = 256). As expected, all datasets (except QuALITY) show a roughly monotonic improvement in performance with n. This shows that (a) SLED is able to effectively use all of the information in a long sequence (up to 16K tokens; for ContractNLI, over 95% of the tokenized examples are shorter than 8K tokens), and that (b) observing the entire inputs from SCROLLS improves performance.

Table 1 caption (excerpt): LED results are taken from Shaham et al. (2022) and are lower than our LED-base implementation, presumably since our implementation uses all question tokens for global attention rather than just the first one. The results for LongT5 and UL2 were submitted to the SCROLLS leaderboard by their authors.

Effect of context padding
In all experiments thus far, we used a conservative padding value of ρ = 0.5, resulting in an effective chunk size of c/2, with c/4 context padding tokens on each side. Since both memory and, more importantly, the number of forward passes through the encoder are linear in the number of chunks, a natural question is how much padding and overlap are necessary to achieve satisfactory results.
To explore this, we finetune BART-base-SLED on all six datasets where SLED showed gains over its baseline model (i.e., all datasets except QuALITY), varying the value of ρ and fixing c = 256. Tab. 2 shows the results of this experiment, where we compare the relative gain over BART-base across different ρ values.
As expected, decreasing the padding factor, and consequently the number of chunks, reduces training time. When ρ = 0.05, training can be up to 2x faster than with ρ = 0.5, as the number of chunks drops to almost half. Moreover, the relative gain (i.e., improvement relative to the baseline) is often close to, or even higher than, that obtained with less padding (perhaps due to better encoding or more stable optimization). Nevertheless, there is no single ρ value that consistently beats the conservative choice of ρ = 0.5. In particular, in all six datasets, setting ρ = 0.5 results in top-2 performance, often by a large margin and never considerably worse than the best result. Thus, we conclude that one may improve the efficiency and performance of SLED by tuning the hyperparameter ρ for optimal behavior w.r.t. a specific task, and we fix ρ = 0.5 in our experiments. Moreover, Tab. 2 demonstrates the importance of having chunks at least partially overlap. In all six datasets, using non-overlapping chunks (ρ = 0) results in a drop of at least 10% gain compared to the best setting, and in some cases this gap grows to over 50%. This supports our hypothesis that chunking inputs with no overlap may lead to a crucial loss of information.
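The training-time effect follows directly from the chunk count: with an effective chunk size of (1 − ρ) · c, roughly C ≈ n / ((1 − ρ) · c) encoder forward passes cover the input, so shrinking ρ from 0.5 to 0.05 cuts the number of passes by a factor of 0.95 / 0.5 = 1.9:

```python
def approx_num_chunks(n: int, c: int, rho: float) -> float:
    # Each chunk contributes (1 - rho) * c effective tokens, so roughly
    # n / ((1 - rho) * c) chunks (and encoder forward passes) cover the input.
    return n / ((1 - rho) * c)

n, c = 16384, 256
conservative = approx_num_chunks(n, c, 0.5)   # 128.0 encoder passes
aggressive = approx_num_chunks(n, c, 0.05)    # ~67.4 encoder passes
print(conservative / aggressive)              # 1.9x fewer passes
```

This arithmetic ignores the first/last-chunk edge cases, so it is an approximation, but it matches the observed up-to-2x speedup.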

Related Work
Efficient transformers Many efficient attention variants have been proposed in recent years to alleviate the quadratic complexity of dense attention (Tay et al., 2020; Fournier et al., 2021). Among these are clustering vectors into distinct buckets and calculating attention only within each one (Kitaev et al., 2020a), attending only to a fixed number of hidden vectors (Ma et al., 2021), using random features to approximate the attention matrix (Choromanski et al., 2021; Peng et al., 2021), and using low-rank factorizations (Wang et al., 2020). Despite achieving respectable performance when finetuned on the Long Range Arena benchmark (Tay et al., 2021), many of these models have not yet been proven to work well as backbones for pretrained language models. In fact, recent work (Xiong et al., 2022b) on encoder-only models found that many do not outperform a simple local-attention sliding window on downstream language tasks. We discuss such methods next.

Sparse attention variants A popular and simple solution for allowing attention-based models to process long sequences is to use local attention, where each token attends to a local window around it. Longformer (Beltagy et al., 2020), GMAT (Gupta and Berant, 2020), and ETC (Ainslie et al., 2020) use short windows of full attention, combined with full attention to a small number of predefined global input tokens. BigBird (Zaheer et al., 2020b) shares the local and global features, and additionally randomly samples tokens to attend to. Finally, the recently proposed LongT5 (Guo et al., 2022) combines local attention with global tokens obtained by aggregating blocks of the input.

State space models The structured state space (S4) architecture has shown dramatic gains over transformers on the LRA benchmark (Tay et al., 2021). State space models are now an active research field (Gupta, 2022; Mehta et al., 2022), but their efficacy on long-range language understanding tasks has not yet been tested.
Fusion-in-Decoder Izacard and Grave (2021) proposed to encode multiple independent passages separately, and to concatenate the encodings prior to the decoding phase. Despite encouraging empirical evidence (Amouyal et al., 2022; Yavuz et al., 2022), we are, to our knowledge, the first to analyze FiD's feasibility and limitations in a controlled setting. Importantly, we test FiD on long-range tasks over a single long document, rather than a collection of independent passages.

Pretrained models with sliding windows
Wrapping a BERT encoder within a sliding window was proposed by Cui and Hu (2021) in the context of a specialized architecture for summarization. Wang et al. (2019) showed that sliding BERT across text improves performance on several QA datasets. In this work, we propose a sliding window approach that can be easily plugged into any existing encoder-decoder model without additional parameters or task-specific training, and show its efficacy for long-range text understanding. Most similar to SLED is the SEGENC approach proposed by Vig et al. (2022). By dividing inputs from QMSum into overlapping chunks, encoding them separately, and then performing FiD (using two representations for every input token), the authors were able to achieve state-of-the-art results. However, Vig et al. (2022) focused on summarization and did not perform a systematic analysis of this type of architecture.

Limitations
We present SLED as a simple and effective method for extending the capabilities of pretrained short-text models to long-text tasks. Despite its impressive empirical performance on SCROLLS, SLED suffers from two disadvantages which may limit its applicability to some long-range tasks.
Long output To obtain linear complexity, SLED assumes the output length k is constant. This is because the decoder uses quadratic self-attention over the output, on top of O(nk) cross-attention between the output and the input. While most current long-text tasks follow this assumption, future tasks, such as academic reports or script writing, may require long text generation. This limitation is not unique to SLED and affects other long-range transformers, including LongT5 and LED. Aside from finetuning, it also affects pretraining models on long inputs with self-supervised losses such as span corruption (Raffel et al., 2020b) or denoising (Lewis et al., 2020), which require the decoder to process an output that is linear in the length of the input.
Co-reference resolution and fact retention An assumption at the heart of SLED is the Locality of information assumption. When the input text is long, this assumption may break if distant entity resolution or factual knowledge is required. For example, a chapter in a book may mention "they were walking into the room" when knowledge of which room, or who walked, is located a few chapters back. In such cases, the encoder used by SLED will not be able to access this information, moving more responsibility to the decoder and reducing the effectiveness of the contextual encoding. Similarly, in multi-hop questions (Yang et al., 2018), attending to one part of the context is necessary in order to fully understand the question and encode a second piece of information correctly. As the encoder will not have access to the first context, which leads to better question understanding, here as well more responsibility is delegated to the decoder.

Conclusions
In this work we present SLED, a simple approach for modeling long texts which slides a pretrained short-range encoder over a long input document and then generates an output by attending to the encoded tokens. We show that SLED can perform core operations that are important for long text understanding, such as finding relevant pieces of information and fusing them at decoding time, and demonstrate competitive performance on the SCROLLS benchmark compared to larger models and models that employ a dedicated and expensive pretraining step. One of SLED's most attractive features is that it can be readily used with any short-range pretrained LM. Thus, any future encoder-decoder model can be flexibly plugged into it to achieve further gains in performance on SCROLLS, some of its tasks, or any other long-range task.
We open-source SLED and hope it encourages the research community to easily extend models to longer inputs and push the boundaries of NLU models' applicability in real-world use cases.
tokens are considered the effective chunk tokens in the first chunk. To account for the final tokens, the last chunk always starts at token t n−c+1 , so that it contains exactly c tokens, and its effective chunk tokens are defined as all tokens that were not part of any previous effective chunk.
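The chunk layout described above can be sketched in a few lines. This is an illustrative reimplementation rather than the released SLED code, and `chunk_boundaries` is a hypothetical helper name; each returned tuple holds a chunk's start offset together with the span of its effective tokens.

```python
def chunk_boundaries(n, c, rho=0.5):
    """Return (chunk_start, effective_start, effective_end) triples.

    Each chunk spans c tokens with p = rho*c/2 context-padding tokens on
    each side; the middle (1 - rho)*c tokens are its effective tokens.
    The first chunk also claims its left padding, and the last chunk is
    shifted to end exactly at n, claiming every still-uncovered token.
    (Hypothetical helper; the actual implementation may differ.)
    """
    p = int(rho * c / 2)
    step = c - 2 * p              # effective tokens per interior chunk
    bounds, covered, start = [], 0, 0
    while True:
        if start + c >= n:        # last chunk: exactly c tokens
            start = max(n - c, 0)
            bounds.append((start, covered, n))
            return bounds
        eff_end = c - p if start == 0 else start + p + step
        bounds.append((start, covered, eff_end))
        covered, start = eff_end, start + step
```

Note that the effective spans tile the input exactly: each chunk's effective region begins where the previous one ended, so every token is contextualized by exactly one chunk.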

B Chunking vs. local-attention
Both LED and SLED are long-range models built on top of the same short-text model (BART), and both employ local attention. However, SLED relies on chunking, while LED uses per-layer local attention. In this section, we discuss the relation between the two approaches in more detail.
Implementation One of SLED's biggest advantages is that it is agnostic to the backbone encoder-decoder model, and can extend any existing model without additional implementation overhead. In contrast, the attention mechanism in Longformer, and subsequently LED, was implemented by Beltagy et al. (2020) with a specialized CUDA kernel that is coupled to the architecture and implementation of BART. This makes LED more efficient, but extending it to new architectures incurs significant engineering overhead, because LED uses a "diagonal" local-window attention across layers, for which a naïve implementation is inefficient. Conversely, SLED uses chunking, which allows it to simply wrap an existing encoder-decoder model.
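As a concrete illustration of this wrapping, the following sketch slides a stand-in `encode_chunk` callable (standing in for any pretrained encoder) over the input and gathers only each chunk's effective representations. Function and argument names are hypothetical, and the real model operates on tensors rather than Python lists.

```python
def sled_encode(tokens, encode_chunk, c=4, rho=0.5, prefix=()):
    """Encode a long sequence with a short-text encoder, SLED-style:
    split into overlapping chunks, prepend the (optional) prefix to each,
    encode each chunk independently, and keep only the effective (middle)
    representations. Illustrative sketch with hypothetical names."""
    n = len(tokens)
    p = int(rho * c / 2)
    step = c - 2 * p                           # effective tokens per chunk
    out, covered, start = [], 0, 0
    while covered < n:
        if start + c >= n:                     # final chunk: exactly c tokens
            start = max(n - c, 0)
            enc = encode_chunk(list(prefix) + tokens[start:start + c])
            enc = enc[len(prefix):]            # drop prefix representations
            out.extend(enc[covered - start:])  # keep uncovered tokens only
            break
        eff_end = c - p if start == 0 else start + p + step
        enc = encode_chunk(list(prefix) + tokens[start:start + c])
        enc = enc[len(prefix):]
        out.extend(enc[covered - start:eff_end - start])
        covered, start = eff_end, start + step
    return out
```

Because the backbone encoder is only ever called on sequences of length c (plus the prefix), any short-text model can be dropped in without touching its attention implementation; the concatenated output is then passed to the unmodified decoder for fusion-in-decoder.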

Contextualization
The most significant difference between LED and SLED from a conceptual point of view is their contextualization mechanism. While SLED splits the input into (overlapping) chunks and encodes each of them independently, LED performs local attention per layer. This results in an effective receptive field that grows linearly with the encoder depth, potentially allowing it to perform more "global" contextualization. Our results in §4 suggest that such global contextualization is beneficial, and a similar conclusion can be reached by observing that LED base , which uses all prefix tokens as global tokens, outperforms LED SCROLLS base , which uses only a single token for global contextualization.
Positional information SLED's chunking mechanism means that it utilizes the positional encoding of the underlying model independently in each chunk, and is thus agnostic to the positional embedding technique used by the backbone model. Moreover, this potentially allows SLED to generalize to arbitrary input lengths. In contrast, LED utilizes BART's absolute embeddings, duplicating them 16 times to support 16K-long sequences. This limits its ability to generalize to longer inputs, and potentially requires a significant number of long-input samples to properly tune those new parameters (Shaham et al., 2022). This is evident in Tab. 1 when comparing the test scores of LED base against BART base -SLED while considering the number of training samples. On NarrativeQA and GovReport, which contain ∼71K and ∼19K samples respectively, LED is comparable to SLED and even slightly outperforms it on some metrics. On ContractNLI (∼10K examples), it does slightly worse. On all other datasets, where the training data is small, LED is significantly worse than SLED.
Complexity We analyzed the complexity of SLED's encoder in §3, which is O(l × c × n). A similar analysis of LED yields that in each layer, LED considers O(n) windows of length c, where in each window only the middle token attends to its local neighborhood, resulting in O(l × c × n) memory complexity as well. However, due to SLED's use of overlap and full self-attention within each chunk, SLED's encoding may require up to 2x more memory compared to LED when ρ = 0.5.
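The 2x factor can be made concrete with a back-of-the-envelope count of the token positions the encoder processes (ignoring the final chunk's shift; the function name is ours, not SLED's):

```python
import math

def sled_encoded_positions(n, c, rho=0.5):
    """Approximate number of token positions SLED's encoder processes:
    ceil(n / step) chunks of c tokens each, where step = (1 - rho) * c
    is the number of effective tokens per chunk. With rho = 0.5 each
    token is encoded roughly twice, matching the ~2x memory overhead
    relative to LED noted above (rough estimate)."""
    step = c - 2 * int(rho * c / 2)
    return math.ceil(n / step) * c
```

For instance, with n = 16K and c = 256, setting ρ = 0.5 roughly doubles the encoded positions relative to ρ = 0.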

C Experimental details
Our experimental setup is based on the official SCROLLS repository.8 The dataset inputs and splits remained as suggested by the authors of SCROLLS, as did the suggested number of epochs per dataset.
To perform model selection, for each model-dataset pair we finetuned 9 models with linear learning-rate scheduling and the AdamW optimizer with its default settings, setting the learning rate to one of {2e−5, 5e−5, 1e−4} and the effective batch size to one of {8, 16, 32}. Warmup was fixed at 10% and weight decay at 0.01. All code, data, Python environment requirements, hyperparameters, and scripts required to reproduce our results will be made public upon publication.
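The resulting 3 × 3 search grid (9 finetuning runs per model-dataset pair) can be enumerated directly; the dictionary keys below are illustrative, not the exact flag names used by our training scripts.

```python
from itertools import product

learning_rates = [2e-5, 5e-5, 1e-4]
effective_batch_sizes = [8, 16, 32]

# One config per finetuning run; warmup and weight decay are fixed.
configs = [
    {"lr": lr, "batch_size": bs, "warmup_ratio": 0.1,
     "weight_decay": 0.01, "lr_scheduler": "linear", "optimizer": "adamw"}
    for lr, bs in product(learning_rates, effective_batch_sizes)
]
```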

Figure 1 :
Figure 1: Models' SCROLLS score (Shaham et al., 2022) as a function of parameter count. Plugging existing pretrained LMs into the SLED framework dramatically improves their SCROLLS score (arrows from blue circles to orange stars). Gray triangles indicate models with dedicated pretraining for capturing long-range dependencies. BART large -SLED is competitive with LongT5 base (Guo et al., 2022) and UL2 (Tay et al., 2022b) (which has 50x more parameters), and slightly lags behind larger LongT5 models.

Figure 2 :
Figure 2: Overview of SLED. (a) Input tokens (t 1 , . . . , t n ) are chunked into C overlapping chunks of length c (here, c = 4). Each chunk is made of P := ρ×c/2 context padding tokens at the right and left edges of the chunk, and (1 − ρ) × c effective chunk tokens in the middle (here, ρ = 0.5, P = 1). (b) We prepend the prefix tokens (p 1 , . . . , p m ) to each chunk (m ≪ n). (c) Each chunk is encoded independently using the already-pretrained backbone encoder M enc . (d) We gather the encoded effective chunk tokens (yellow) and discard the context padding tokens (pink). (e) We pass the encoded input to the decoder to generate the final output sequence (o 1 , . . . , o k ).
(b) Each chunk is prepended by (optional) prefix tokens (Fig. 2(b)). (c) Each chunk is encoded independently, using the backbone encoder M enc (see Fig. 2(c)). (d) To create a contextualized representation for each token, we keep from each chunk only the tokens from the effective chunk, and concatenate them (Fig. 2(d)).

Figure 3 :
Figure 3: F 1 results on our modified SQuAD 1.1 (Rajpurkar et al., 2016) development set: (a) the horizontal line gives the performance of an oracle BART base given the gold paragraph only. SLED matches oracle performance in both the ordered and shuffled settings (see text). LED slightly underperforms SLED in the shuffled setup. Both BART (given only the first 1K tokens) and LED with no global tokens (LED L ) perform poorly in the shuffled setup. (b) Ablations on SLED's architecture; see §4.3 for details.
(b) and Fig. 4(b), removing local contextualization results in poor performance, illustrating its importance.

Figure 5 :
Figure 5: BART base -SLED relative improvement compared to BART base , when varying SLED's chunk size (i.e., c), fixing the maximum input length to 16K. Top: Summarization datasets. The y-axis measures relative improvement of Rouge-2. Bottom: QA and NLI datasets. The y-axis measures relative improvement of exact match for QuALITY and ContractNLI, and of F 1 for NarrativeQA and Qasper.

Figure 6 :
Figure 6: BART base -SLED relative improvement compared to BART base , when varying the input length fed to SLED, fixing c = 256. Top: Summarization datasets, compared w.r.t. Rouge-2. Bottom: QA and NLI datasets. Relative improvement is measured w.r.t. exact match for QuALITY and ContractNLI, and F 1 for NarrativeQA and Qasper.

Table 1 :
Results on the Shaham et al. (2022) SCROLLS benchmark. Chunk/Input refers to the chunk size used (c) and to the maximal input length (n). Avg is the average SCROLLS score as described in Shaham et al. (2022). Development scores for QuALITY are only for the full set (T). † indicates reported results from the SCROLLS public leaderboard.
6 LED SCROLLS base scores were reported by

Table 2 :
BART base -SLED relative improvement compared to BART base when varying the padding percentage (ρ). In all cases, the maximum input length is 16K and c = 256. Relative gain is measured w.r.t. Rouge-2 for GovReport, SummScreenFD and QMSum, F 1 for Qasper and NarrativeQA, and exact match for ContractNLI. In each column, boldface marks the top-performing value and underline the second-best.