Abstract
Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles, and long documents due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
1 Introduction
Transformer-based pretrained language models (Vaswani et al., 2017; Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020b; Brown et al., 2020) have been widely successful across all areas of natural language understanding. However, applying them over long texts (such as stories, scripts, or scientific articles) is prohibitive due to their quadratic complexity in the input length. To bridge this gap, recent work has developed more efficient transformer variants (Kitaev et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020a; Guo et al., 2022) and applied them over long-range language understanding tasks (Mehta et al., 2022; Shaham et al., 2022).
However, most efficient transformers use specialized architectures with custom implementations that are not guaranteed to scale as well as vanilla transformers (Tay et al., 2022a). Moreover, they require an expensive pretraining step and do not exploit off-the-shelf pretrained LMs that were trained for short texts. To date, their performance on long texts has not matched the success of their short-range counterparts.
In this work, we present SLED: SLiding-Encoder and Decoder, a simple yet powerful method for applying off-the-shelf pretrained encoder-decoder models on long text problems, with a linear time and space dependency. Specifically (see Figure 2), we partition long documents into overlapping chunks of tokens of constant length and encode each chunk independently with an already-pretrained encoder. Then, a pretrained decoder attends to all contextualized input representations to generate the output. Our main assumption is that input tokens can be contextualized through their local surrounding (using a short-text LM), and any global cross-chunk reasoning can be handled by the decoder, similar to fusion-in-decoder (FiD) (Izacard and Grave, 2021). Our approach can be readily applied to any pretrained encoder-decoder LM such as T5 (Raffel et al., 2020b) and BART (Lewis et al., 2020) (but is not applicable to decoder-only [Brown et al., 2020] or encoder-only models [Liu et al., 2019; Conneau et al., 2020]).
We evaluate SLED on a wide range of language understanding tasks. To substantiate SLED’s adequacy for text processing, we perform controlled experiments over modified versions of SQuAD 1.1 (Rajpurkar et al., 2016) and HotpotQA (Yang et al., 2018) to show that SLED can (a) find relevant information that is embedded within a long text sequence and (b) fuse information from chunks that were encoded separately.
Our main evaluation is over SCROLLS, a recently-released benchmark that includes 7 long-range tasks across Question Answering (QA), Summarization, and Natural Language Inference (NLI). We show (Figure 1) that taking a pre-trained encoder-decoder model, such as BART (Lewis et al., 2020) or T5 (Raffel et al., 2020b), and embedding it into SLED’s framework results in dramatic improvement in performance (6 points on average across models). Moreover, BARTlarge-SLED’s performance is comparable to LongT5base (Guo et al., 2022), a model that was specifically pretrained to handle long-range dependencies, and surpasses UL2 (Tay et al., 2022b), which contains 50x more parameters. Importantly, SLED-based models can use any future pretrained LM out-of-the-box without requiring additional pretraining to further improve performance.
Due to its simplicity, SLED can also be used as a diagnostic tool for analyzing long-range benchmarks. We analyze the seven datasets in SCROLLS through the lens of SLED and show which datasets require the input to be contextualized with remote tokens. Specifically, we find that in QA and NLI tasks, relatively local contextualization is sufficient for high performance.
While SLED is similar to FiD from a technical standpoint, past usage of FiD has centered around open-domain question answering (Izacard and Grave, 2021), where unrelated passages are naturally encoded independently. Here, we test fusion-in-decoder on long documents, where local encoding of chunks is a modeling assumption that needs testing. In recent work, Vig et al. (2022) proposed a similar architecture to tackle long inputs from QMSum (Zhong et al., 2021), but did not systematically analyze it. We standardize this methodology for the first time, and extensively analyze the effectiveness of FiD for encoding long documents across multiple tasks.
To summarize, our main contributions are:
We present SLED, a simple and effective approach for processing long texts that leverages off-the-shelf encoder-decoder LMs based on fusion-in-decoder.
We demonstrate SLED’s efficacy both in controlled experiments and on the SCROLLS benchmark, where it achieves competitive results compared to specialized models with up to 50x more parameters.
We use SLED as a diagnostic tool for analyzing the long-range properties of datasets in the SCROLLS benchmark.
We provide an open-source implementation of SLED,1 seamlessly integrated into the Transformers library (Wolf et al., 2020).
2 Background
Recent advances in natural language processing have been by and large fueled by the transformer architecture (Vaswani et al., 2017). A core component of the transformer is the self-attention layer where every input token “attends” to every other token to produce its contextualized representation. This results in quadratic time and space dependency w.r.t. the length of the input, limiting the ability of transformers to process long sequences.
This long-text limitation has sparked ample interest in developing efficient transformer variants. One prominent family of methods is based on sparse attention, where each token attends to a constant number of other tokens, overcoming the quadratic dependency. Tokens typically attend either to their local surrounding (Zaheer et al., 2020a; Beltagy et al., 2020; Ainslie et al., 2020; Gupta and Berant, 2020) or to tokens that are semantically similar (Kitaev et al., 2020; Roy et al., 2021). Moreover, a constant number of global tokens that attend to and are attended by all input tokens are often added to each attention sub-layer. Recent analyses (Xiong et al., 2022a) have shown that sparse transformers with local attention are competitive with other variants on multiple language understanding tasks.
Our method, SLED, falls into the family of local attention variants. However, unlike prior work, SLED re-uses and extends existing short-range encoder-decoder models, and does not require specialized pretraining or dedicated CUDA implementations.
In most local attention variants, for example, LED (Beltagy et al., 2020), attention is local per-layer, but the receptive field of tokens grows across layers. In SLED, which we describe next, tokens have access to the same number of tokens, independent of a layer’s depth, which enables better parallelization. For a survey on the families of efficient transformers, see (Tay et al., 2020). For an in-depth comparison of SLED and LED, we refer to Appendix B.
3 Method
In this work, we propose a simple approach for avoiding transformer’s quadratic complexity, motivated by the Locality of information assumption:
In an encoder-decoder architecture, the encoder can effectively contextualize input tokens with local context only, leaving long-range dependencies to be handled by the decoder.
SLED relies on said modeling assumption to encode shorter chunks independently and perform fusion of information in the decoder (Izacard and Grave, 2021). We now describe the SLED model in detail.
Input
SLED uses a pretrained encoder-decoder model M as a backbone. SLED receives a tokenized document of length n (blue squares in Figure 2), and an optional short tokenized prefix of length m ≪ n, typically representing a question about the document, an instruction to perform some generation task, or a hypothesis (orange squares in Figure 2). Unlike static task-specific prefixes (e.g., “summarize”), SLED also supports sample-specific prefixes that are part of the input (e.g., the question in QA datasets).
Steps
SLED proceeds in the following steps:
- (a)
Document tokens are split into C chunks of length c (in Figure 2, c = 4). The middle (1 − ρ) · c tokens in each chunk are contextualized from both the left and right by ρ · c / 2 tokens, where ρ ∈ [0, 0.5] (ρ = 0.5 in Figure 2). We call these middle tokens the effective chunk, since they will constitute the output of the encoder, and term the ρ · c / 2 tokens on each side the context padding.
- (b)
Each chunk is prepended by (optional) prefix tokens (Figure 2(b)).
- (c)
Each chunk is encoded independently, using the backbone encoder Menc (see Figure 2(c)).
- (d)
To create a contextualized representation for each token, we keep from each chunk only the tokens from the effective chunk, and concatenate them (Figure 2(d)).
- (e)
To give the decoder access to prefix tokens, we encode the prefix tokens with Menc, and prepend the result to the contextualized representation (leftmost chunk in Figure 2(a)–(d)).
- (f)
Finally, we generate the output with the backbone decoder, Mdec, which uses standard cross-attention over the m + n encoded tokens (Figure 2(e)).
SLED requires handling a few edge cases, namely, dealing with the first and last chunk that do not have bidirectional context. We refer to Appendix A for these details.
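The procedure above can be summarized in the following minimal sketch, assuming a Hugging Face BART backbone; the helper `sled_encode` and its arguments are illustrative, and the boundary handling is simplified relative to the released implementation (see Appendix A).

```python
# Minimal sketch of SLED's chunked encoding and fusion-in-decoder, assuming a
# Hugging Face BART backbone. Names and boundary handling are illustrative and
# simplified relative to the released implementation (see Appendix A).
import torch
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers.modeling_outputs import BaseModelOutput

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

def sled_encode(document_ids, prefix_ids, chunk_size=256, rho=0.5):
    """Encode a long document in overlapping chunks and fuse the effective-chunk states."""
    encoder = model.get_encoder()
    pad = int(rho * chunk_size / 2)          # context padding on each side
    effective = chunk_size - 2 * pad         # (1 - rho) * c effective tokens per chunk
    n, states, start = document_ids.size(0), [], 0
    while True:
        chunk = document_ids[start:start + chunk_size]
        is_last = start + chunk_size >= n
        # (b) prepend the prefix (e.g., the question) so it contextualizes the chunk.
        ids = torch.cat([prefix_ids, chunk]).unsqueeze(0)
        # (c) encode the chunk independently with the backbone encoder.
        hidden = encoder(input_ids=ids).last_hidden_state.squeeze(0)[prefix_ids.size(0):]
        # (d) keep only the effective (middle) tokens; edge chunks also keep their borders.
        left = 0 if start == 0 else pad
        right = chunk.size(0) if is_last else pad + effective
        states.append(hidden[left:right])
        if is_last:
            break
        start += effective
    # (e) encode the prefix on its own and prepend it for the decoder.
    prefix_hidden = encoder(input_ids=prefix_ids.unsqueeze(0)).last_hidden_state.squeeze(0)
    fused = torch.cat([prefix_hidden] + states, dim=0).unsqueeze(0)
    return BaseModelOutput(last_hidden_state=fused)

# (f) generate with the backbone decoder, cross-attending over all fused states.
doc = tokenizer("a very long document ...", return_tensors="pt").input_ids[0]
question = tokenizer("What is the document about?", return_tensors="pt").input_ids[0]
answer_ids = model.generate(encoder_outputs=sled_encode(doc, question), max_length=64)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```

Because each chunk is encoded in isolation, the encoder calls can also be batched or parallelized across chunks.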
SLED’s Complexity
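A brief sketch of the bound, treating the chunk size c, the overlap factor ρ, and the output length k as constants (and neglecting the prefix length m ≪ n, as noted in the Notes):

$$\underbrace{\frac{n}{(1-\rho)\,c}}_{\#\text{chunks}} \cdot \underbrace{O(c^2)}_{\text{self-attention per chunk}} \;=\; O\!\left(\frac{n\,c}{1-\rho}\right) \;=\; O(n \cdot c)$$

for the encoder, while the decoder adds $O(n \cdot k)$ cross-attention and $O(k^2)$ self-attention, so both time and memory scale linearly with the input length n.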
4 Efficacy of Fusion in Decoder
As mentioned (§3), SLED relies on the assumption that chunks can be encoded independently and fusion across them can be delegated to the decoder (Locality of information assumption). This is similar to the Fusion-in-Decoder approach, introduced by Izacard and Grave (2021) for open-domain question answering (ODQA). However, there, the encoder-decoder receives a set of independent passages and needs to generate an answer that can typically be extracted from a single passage. Here, we extend the scope of FiD by applying it over a single, long, and coherent input that potentially requires global contextualization.
To demonstrate the viability of FiD for long text language tasks, we design two controlled experiments that quantify the extent to which FiD can perform two operations at the heart of long-text processing. First, can FiD find a “needle-in-a-haystack”, that is, locate a piece of short information embedded in long text, disregarding irrelevant information. Second, can FiD “piece the puzzle” and fuse two pieces of information that are encoded independently when generating an output.
4.1 Needle in a Haystack
To check if SLED can ignore irrelevant text and locate a single piece of information, we cast SQuAD 1.1 (Rajpurkar et al., 2016) as a sequence-to-sequence task with long input. SQuAD is a question answering dataset, where given a question-paragraph pair the goal is to generate the answer (which lies within the paragraph). For each question-paragraph pair, we randomly sample 9 other paragraphs from the dataset and concatenate them to form a long document.3 We then finetune and evaluate our models in two settings: a) Ordered Distractors: the gold paragraph is the first one, and all other distractors are concatenated after it. b) Shuffled Distractors: we randomly shuffle the order of all paragraphs so the answer can be anywhere in the input document. Since this is a QA task, the prefix is the question.
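As a concrete illustration, the construction can be sketched as follows, assuming the Hugging Face `datasets` copy of SQuAD; the field names follow that version, and the sampling is simplified relative to the setup above (the paper additionally excludes all paragraphs of the gold document, cf. the footnote).

```python
# Sketch of the "needle in a haystack" construction over SQuAD 1.1, assuming the
# Hugging Face `datasets` copy of SQuAD; sampling details are simplified.
import random
from datasets import load_dataset

squad = load_dataset("squad", split="train")
paragraphs = list(dict.fromkeys(squad["context"]))  # de-duplicated paragraphs

def make_long_example(example, num_distractors=9, shuffled=False, seed=0):
    rng = random.Random(seed)
    answer = example["answers"]["text"][0]
    # Distractors are other paragraphs that do not contain the gold answer.
    pool = [p for p in paragraphs if p != example["context"] and answer not in p]
    docs = [example["context"]] + rng.sample(pool, num_distractors)
    if shuffled:
        rng.shuffle(docs)  # "Shuffled Distractors"; otherwise the gold paragraph stays first
    return {"prefix": example["question"], "input": "\n\n".join(docs), "target": answer}

long_example = make_long_example(squad[0], shuffled=True)
```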
We use BARTbase (Lewis et al., 2020) as our backbone model, M, throughout §4, and compare SLED to an oracle BARTbase that is given the gold paragraph only, with no distractor paragraphs. This is an oracle setup since BARTbase can take 1,024 tokens as input and all gold paragraphs are shorter. If SLED matches the oracle performance, we can infer that the decoder can indeed find a needle in a haystack. In addition, we compare SLED to BARTbase, which is given only the first 1K tokens, and to LED (Beltagy et al., 2020), which uses local sparse attention, similar to SLED (LED has the same BARTbase backbone). However, as explained in §2, the receptive field of LED layers grows linearly with the number of layers, and thus information can be fused in the encoder, unlike SLED, where cross-chunk fusion must be delegated to the decoder. Last, for QA tasks, LED defines the question tokens as global tokens; as an additional sanity test we also evaluate LED without global tokens (denoted LED w/o global), that is, a local LED model where no global tokens are used. For both LED and SLED we use a chunk size of c = 256.
Results
Figure 3(a) shows the results of our evaluation on the development set. SLED almost matches the performance of an oracle BARTbase that is not given any distractor paragraphs, reaching an F1 score of 87.6 compared to the oracle F1 of 88.1 (horizontal line in the figure). LED also achieves high performance (but lower than SLED in the shuffled setup), showing that both models learn to ignore distracting information and find a needle in a haystack. As expected, both LED w/o global and BART suffer a significant drop in performance when the passages are shuffled, as the gold paragraph is not contextualized with the question.
4.2 Piecing a Puzzle
We now verify that SLED can fuse pieces of information from different chunks. To this end, we modify HotpotQA (Yang et al., 2018), a multi-hop question answering dataset, in which every question relies on two pieces of information (located in different paragraphs). While in the original setting, each input in HotpotQA has two gold paragraphs and 8 distractor paragraphs, we include only the two gold paragraphs in our experiments. To ensure that SLED and LED encode the relevant two pieces of information in separate chunks, we set the chunk size to c = 128.
Similar to §4.1, we compare SLED to an oracle BARTbase with full attention over 1,024 tokens,4 to LED, and to LED w/o global. Finally, past work has shown that many examples in HotpotQA can be answered with access to the “second” gold paragraph only, which contains the answer (Jiang and Bansal, 2019). Thus, we also evaluate a BART model that is given the second passage only.
Results
Figure 4(a) shows that indeed, SLED’s decoder can effectively fuse information from two separately encoded chunks, reaching an F1 of 76.5, slightly lower than the oracle F1 of 78.6. Notably, SLED substantially outperforms a BART model with access to the entire second paragraph, showing that information is fused by the decoder. LED slightly outperforms SLED, but when denied access to global tokens (LED w/o global), its performance drops sharply. This shows that the large receptive field of deep LED layers does not suffice for information fusion, and that interaction between the question and the text is crucial for the decoder.
To summarize, our two controlled experiments show that SLED can perform the operations of retrieving and fusing information, which are fundamental for long text language tasks.
4.3 Ablations of Design Choices
We leverage our controlled experimental setup to further investigate the components of SLED.
Efficacy of the Encoder
While §4.2 shows that SLED can fuse separate pieces of information in the decoder, it is not clear to what extent local contextualization is necessary. To check whether it is possible for all fusion to occur in the decoder, we finetune SLED with a chunk size of c = 1, such that input tokens do not observe any context in the encoder. As can be seen in the leftmost bar(s) in Figure 3(b) and Figure 4(b), removing local contextualization results in poor performance, illustrating the importance of local contextualization.
Contextualizing Chunks with a Prefix
As explained, SLED does not use global tokens, but instead contextualizes each chunk with a prepended prefix. To verify its necessity, we finetune a SLED model that treats the prefix as another chunk and does not prepend it to document chunks.5 The second bar(s) in Figure 3(b) and Figure 4(b) shows a significant drop in performance for all settings, suggesting the prefix is needed during encoding.
As expected, there is practically no difference between the Ordered and Shuffled settings in Figure 3(b). In contrast, LED w/o global, which is similar in concept (due to the lack of global tokens), shows a significant drop when paragraphs are shuffled. This shows the possible effectiveness of the increased receptive field in LED, but only when the gold paragraph is relatively close to the prefix.
Encoding the Prefix
After showing that the prefix is crucial for the encoder, we ask whether the decoder needs direct access to the prefix or whether relevant information from the prefix can be infused into the chunk representations. To test that, we finetune SLED as usual, but remove the prefix tokens from the final representation given to the decoder. The rightmost bar(s) in Figure 3(b) and Figure 4(b) shows that providing the decoder with prefix representations makes little if any difference, suggesting that indeed the encoder can infuse the important information from the prefix into the encoded document tokens.
5 Experiments
We evaluate SLED on SCROLLS (Shaham et al., 2022), a recently proposed benchmark for evaluating long text understanding. SCROLLS contains seven datasets that span three different language understanding tasks:
Summarization: GovReport (Huang et al., 2021) is a summarization task over reports from the Congressional Research Service; SummScreenFD (Chen et al., 2022) is a summarization dataset over TV scripts; QMSum (Zhong et al., 2021) is a query-based summarization dataset over meeting transcripts from various domains. While GovReport and SummScreenFD do not contain a prefix, for QMSum we consider the query as the prefix.
Question answering (QA): Qasper (Dasigi et al., 2021) is a QA benchmark that contains questions over NLP papers; NarrativeQA (Kočiský et al., 2018) contains questions over entire books and movie scripts; QuALITY (Pang et al., 2022) is a multiple-choice QA dataset over books and articles. For all QA datasets, we set the question as the prefix. For QuALITY, we consider the four answer options part of the question.
Natural language inference: ContractNLI (Koreeda and Manning, 2021) contains short legal hypotheses (set as the prefix) and legal documents as the premise. Models are tasked to predict whether the premise entails, contradicts, or is neutral w.r.t. the hypothesis.
For each task, we use the official evaluation metrics defined in SCROLLS, which are based on the metrics from the original datasets.
5.1 Settings
We evaluate SLED with both BART (Lewis et al., 2020) and T5 (Raffel et al., 2020b) as backbone models. For each backbone model, we compare performance with SLED, which can consume long sequences, vs. the backbone models alone that are fed with the first 1,024 tokens. For comparison, we also finetune LEDbase. In all SLED and LED experiments, we use a maximal sequence length of 16K tokens and chunk size of 256 to allow for a fair evaluation.
For each model-dataset pair, we run hyperparameter tuning (detailed in Appendix C) based on the development set. Additionally, we submit generated predictions over the test set to SCROLLS leaderboard,6 and compare to the reported performance of other models at the time of submission.
5.2 Results
Table 1 reports results over the SCROLLS development and test sets. Taking short-range pretrained LMs like BART and T5 and casting them into SLED’s framework allows them to process long documents effectively, improving the average SCROLLS score by 4.8–7 points. Examining BARTbase-SLED, we see a large improvement compared to LEDbase (33.6 → 35.4), and competitive performance on multiple tasks compared to LongT5base and UL2. Moreover, adding SLED to BARTlarge results in a high-performing model that is comparable to LongT5base and outperforms UL2, despite UL2’s 50x larger parameter count, with no need for expensive pretraining geared towards long-range tasks. BARTlarge-SLED’s performance is moderately lower than that of the larger LongT5 models.
| Model | (Chunk/Input) | #Params | Avg | GovRep (ROUGE-1/2/L) | SumScr (ROUGE-1/2/L) | QMSum (ROUGE-1/2/L) | Qspr (F1) | Nrtv (F1) | QALT (EM-T/H) | CNLI (EM) |
|---|---|---|---|---|---|---|---|---|---|---|
| Development Scores | | | | | | | | | | |
| LEDbase | (256/16K) | 162M | – | 57.3/27.9/30.0 | 30.7/6.3/17.9 | 32.5/9.0/21.1 | 30.4 | 20.2 | 30.9 | 82.3 |
| T5base | (1K/1K) | 220M | – | 32.8/11.7/20.2 | 22.2/3.7/15.3 | 26.1/6.6/19.8 | 13.2 | 14.9 | 35.1 | 76.8 |
| T5base-SLED | (256/16K) | 220M | – | 47.0/20.2/25.2 | 25.3/5.0/16.6 | 29.9/8.7/21.4 | 38.2 | 18.2 | 34.6 | 82.4 |
| BARTbase | (1K/1K) | 139M | – | 47.7/18.5/22.3 | 30.1/7.0/18.3 | 32.2/9.3/21.1 | 23.3 | 15.9 | 33.8 | 78.4 |
| BARTbase-SLED | (256/16K) | 139M | – | 55.7/24.8/25.8 | 33.6/8.5/19.2 | 34.4/11.5/22.7 | 35.8 | 21.3 | 33.7 | 85.3 |
| BARTlarge | (1K/1K) | 406M | – | 50.6/19.8/23.5 | 32.1/7.4/18.7 | 33.3/9.4/21.6 | 24.5 | 17.9 | 36.1 | 79.3 |
| BARTlarge-SLED | (256/16K) | 406M | – | 57.4/26.3/27.5 | 35.3/8.8/19.5 | 36.3/12.2/23.3 | 42.5 | 23.6 | 37.2 | 85.3 |
| Test Scores | | | | | | | | | | |
| LEDbase | (256/16K) | 162M | 33.6 | 56.8/27.3/29.2 | 30.0/6.0/17.5 | 31.3/8.6/20.5 | 34.8 | 21.0 | 28.5/28.3 | 82.9 |
| T5base | (1K/1K) | 220M | 26.3 | 33.2/12.1/20.4 | 21.4/3.6/15.0 | 24.2/5.9/18.6 | 16.3 | 15.0 | 31.9/28.6 | 76.3 |
| T5base-SLED | (256/16K) | 220M | 33.3 | 46.6/20.1/25.1 | 24.5/4.6/16.5 | 28.4/8.7/20.5 | 43.0 | 18.9 | 31.2/29.4 | 81.4 |
| BARTbase | (1K/1K) | 139M | 30.6 | 48.0/19.1/22.7 | 30.1/6.6/18.1 | 31.2/9.1/20.3 | 27.6 | 16.0 | 32.5/31.6 | 77.1 |
| BARTbase-SLED | (256/16K) | 139M | 35.4 | 54.7/24.4/25.4 | 32.7/7.9/19.1 | 33.8/11.7/22.6 | 41.1 | 21.5 | 29.7/30.4 | 85.6 |
| BARTlarge | (1K/1K) | 406M | 32.1 | 50.7/20.1/23.5 | 31.6/6.8/18.5 | 32.0/9.1/20.8 | 29.2 | 18.3 | 34.8/33.9 | 79.7 |
| BARTlarge-SLED | (256/16K) | 406M | 38.0 | 57.5/26.3/27.4 | 35.2/8.7/19.4 | 34.2/11.0/22.0 | 46.9 | 24.1 | 34.8/34.8 | 87.3 |
| LEDbaseSCROLLS† | (1K/16K) | 162M | 29.2 | 56.2/26.6/28.8 | 24.2/4.5/15.4 | 25.1/6.7/18.8 | 26.6 | 18.5 | 25.8/25.4 | 71.5 |
| LongT5base† | (255/16K) | 220M | 38.2 | 53.5/27.3/29.3 | 34.8/9.6/21.1 | 33.9/11.0/22.8 | 46.6 | 23.0 | 37.9/36.6 | 85.6 |
| LongT5large† | (255/16K) | 770M | 40.5 | 54.2/27.8/29.8 | 35.6/9.2/21.2 | 35.1/12.0/23.3 | 52.3 | 27.2 | 40.6/38.6 | 87.3 |
| LongT5XL† | (255/16K) | 3B | 41.9 | 54.7/28.2/30.2 | 35.8/9.6/21.1 | 34.9/11.8/23.5 | 53.1 | 29.3 | 46.0/42.1 | 88.2 |
| UL2† | (2K/2K) | 20B | 37.9 | 53.6/26.1/28.8 | 32.9/7.8/19.4 | 31.1/8.5/20.4 | 37.6 | 24.2 | 45.8/40.7 | 88.7 |
Barring QuALITY, SLED significantly improves performance across all tasks compared to the corresponding backbone models. All summarization datasets (GovReport, SummScreenFD and QMSum) show impressive gains of up to 35% compared to their baseline scores, across all metrics (ROUGE-1/ROUGE-2/ROUGE-L [Lin, 2004]) and for all three backbone models. Similarly, on ContractNLI (Koreeda and Manning, 2021) we see large relative improvements. As the performance of the baseline models was already high, this boost in performance is even more significant. Finally, the QA datasets Qasper and NarrativeQA show the largest gains, improving by an average of 60%.
QuALITY
In stark contrast to other datasets lies the multi-choice QA dataset QuALITY (Pang et al., 2022). While the performance of BARTlarge-SLED is above chance, it barely improves the performance of its backbone model (BARTlarge), which observes only the first 1K tokens, with a similar trend in other backbone models. Analyzing test scores in Table 1, we see that increasing model size consistently improves performance (up to 46% exact match), but increasing input length has a negligible effect. Since reported human accuracy on QuALITY is high (93.5%), this hints that QuALITY might require commonsense reasoning and knowledge that are absent from models with a lower parameter count.
Summary
We have shown that taking off-the-shelf pretrained LMs and embedding them into SLED leads to competitive performance on SCROLLS. Importantly, any future pretrained LM can be easily plugged into SLED, without the need for an expensive pretraining step.
5.3 Dataset Analysis
SLED’s simplicity and modularity make it a useful tool for dataset analysis. Specifically, we can vary the chunk size, c, and the number of tokens, n, across datasets to analyze a) how local the individual pieces of relevant information are, and b) how far into the document they are located.
Locality of Information
SLED relies on an assumption that information can be contextualized locally at encoding time. To analyze locality, we vary the chunk size, c, which defines the attention window, and measure the effect on SCROLLS datasets with input length 16K. Figure 5 shows the results of this experiment, where the y-axis shows the relative improvement compared to BARTbase on a target metric as a function of the chunk size c for all datasets. We observe that in all datasets the best-performing chunk size is relatively small (up to 256), and further increasing c even hurts performance in some cases. However, the summarization datasets show a much larger gain in performance when increasing c up to that threshold. This coincides with a common hypothesis that QA and NLI require relatively local context, and thus increasing c can add noise and hurt optimization, while summarization may require a more high-level view of the information.
Distance from Start of Document
We now analyze whether the entire document is indeed required for the tasks in SCROLLS by varying the maximum document length, n. Figure 6 shows the results of this experiment, where the y-axis shows the relative improvement of BARTbase-SLED compared to BARTbase as a function of the first n tokens of the document (chunk size c = 256). As expected, all datasets (except QuALITY) show a roughly monotonic improvement in performance with n. This shows that (a) SLED is able to effectively use all of the information in a long sequence (up to 16K tokens),7 and (b) observing the entire input from SCROLLS improves performance.
5.4 Effect of Context Padding
In all experiments thus far, we used a conservative padding value ρ = 0.5, resulting in an effective chunk size of (1 − ρ) · c = 128 and ρ · c / 2 = 64 context padding tokens on each side. Since both memory and, more importantly, the number of forward passes through the encoder are linear in the number of chunks, a natural question is how much padding and overlap are necessary to achieve satisfactory results.
To explore this, we finetune BARTbase-SLED on all six datasets where SLED showed gains over its baseline model (i.e., all datasets except for QuALITY), varying the value of ρ and fixing c = 256. Table 2 shows the results of this experiment, where we report the relative gain over BARTbase for different ρ values.
| ρ | GovRep | SumScr | QMSum | Qspr | Nrtv | CNLI |
|---|---|---|---|---|---|---|
| 50% | 34.1% | 21.0% | 22.8% | 53.7% | 34.2% | 8.9% |
| 25% | 28.5% | 19.0% | 17.9% | 54.7% | 29.4% | 10.1% |
| 5% | 18.7% | 15.9% | 23.5% | 52.0% | 31.9% | 7.4% |
| 0% | 27.1% | 9.5% | 11.5% | 46.1% | 29.2% | 6.9% |
As expected, decreasing the padding factor, and consequently the number of chunks, reduces training time. With ρ = 0.05, training can be up to 2x faster than with ρ = 0.5, as the number of chunks drops to roughly half. Moreover, the relative gain (i.e., improvement relative to the baseline) is often close to or even higher with less padding (perhaps due to better encoding or more stable optimization). Nevertheless, no single ρ value consistently beats the conservative choice of ρ = 0.5. In particular, in all six datasets, setting ρ = 0.5 results in top-2 performance, often by a large margin and never considerably worse than the best result. Thus, we conclude that one may improve the efficiency and performance of SLED by tuning the hyperparameter ρ for a specific task, and we fix ρ = 0.5 in our experiments.
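For intuition on the training-time effect, the number of encoder forward passes per 16K-token input can be approximated as follows; this is a back-of-the-envelope sketch that ignores the prefix and edge-chunk handling.

```python
# Approximate number of chunks (encoder forward passes) per input as a function
# of the padding factor rho, ignoring the prefix and edge-chunk handling.
import math

def num_chunks(n=16384, c=256, rho=0.5):
    effective = (1 - rho) * c          # tokens newly covered by each chunk
    return math.ceil(n / effective)

for rho in (0.5, 0.25, 0.05, 0.0):
    print(f"rho={rho:.2f}: ~{num_chunks(rho=rho)} chunks")
# rho=0.50: ~128 chunks vs. rho=0.05: ~68 chunks -> roughly half the forward passes
```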
Moreover, Table 2 demonstrates the importance of having at least partially overlapping chunks. In all six datasets, using non-overlapping chunks (ρ = 0) results in a drop of at least 10% in gain compared to the best setting, and in some cases this gap grows to over 50%. This supports our hypothesis that chunking inputs with no overlap may lead to a crucial loss of information.
6 Related Work
Efficient Transformers
Many efficient attention variants have been proposed in recent years to alleviate the quadratic complexity of dense attention (Tay et al., 2020; Fournier et al., 2021). These include clustering vectors into distinct buckets and computing attention only within each bucket (Kitaev et al., 2020), attending only to a fixed number of hidden vectors (Ma et al., 2021), using random features to approximate the attention matrix (Choromanski et al., 2021; Peng et al., 2021), and using low-rank factorizations (Wang et al., 2020). Despite achieving respectable performance when finetuned on the Long Range Arena benchmark (Tay et al., 2021), many of these variants have not yet been proven to work well as a backbone for pretrained language models. In fact, recent work (Xiong et al., 2022b) on encoder-only models found that many do not outperform a simple local-attention sliding window on downstream language tasks. We discuss such methods next.
Sparse Attention Variants
A popular and simple solution for allowing attention-based models to process long sequences is local attention, where each token attends to a local window around it. Longformer (Beltagy et al., 2020), GMAT (Gupta and Berant, 2020), and ETC (Ainslie et al., 2020) use short windows of full attention, combined with full attention to a small number of predefined global input tokens. BigBird (Zaheer et al., 2020b) shares the local and global features and, additionally, randomly samples tokens to attend to. Finally, the recently proposed LongT5 (Guo et al., 2022) extends T5 (Raffel et al., 2020a) with local and global attention components based on ETC, obviating the need to manually specify global tokens. In this work, we demonstrate that a simple sliding window over off-the-shelf models, without any modifications, is a strong alternative for multiple generative tasks that require processing long documents.
Beyond Transformers
As an alternative to transformers for processing long sequences, Gu et al. (2021) proposed the Structured State Space (S4) architecture showing dramatic gains over transformers on the LRA benchmark (Tay et al., 2021). State space models are now an active research field (Gupta, 2022; Mehta et al., 2022), but their efficacy on long-range language understanding tasks has not been tested yet.
Fusion-in-Decoder
Izacard and Grave (2021) proposed to encode multiple independent passages separately, and concatenate the encodings prior to the decoding phase. Despite encouraging empirical evidence (Amouyal et al., 2022; Yavuz et al., 2022), we are the first (to our knowledge) to analyze FiD’s feasibility and limitations in a controlled setting. Importantly, we test FiD on long-range tasks over a single long document, rather than a collection of independent passages.
Pretrained Models with Sliding Windows
Wrapping a BERT encoder within a sliding window was proposed by Cui and Hu (2021) in the context of a specialized architecture for summarization. Wang et al. (2019) showed that sliding BERT across text improves performance on several QA datasets. In this work, we propose a sliding-window approach that can be easily plugged into any existing encoder-decoder model without additional parameters or task-specific training, and show its efficacy for long-range text understanding. Most similar to SLED is the SegEnc approach proposed by Vig et al. (2022). By dividing inputs from QMSum into overlapping chunks, encoding them separately, and then performing FiD (using two representations for every input token), the authors were able to achieve state-of-the-art results. However, Vig et al. (2022) focused on summarization and did not perform a systematic analysis of this type of architecture.
7 Limitations
We present SLED as a simple and effective method to extend the capabilities of pretrained short-text models to long-text tasks. Despite its impressive empirical performance on SCROLLS, SLED suffers from two disadvantages which may limit its applicability to some long-range tasks.
Long Output
To obtain linear complexity, SLED assumes the output length k is constant. This is because the decoder uses quadratic self-attention over the output, on top of cross-attention between the output and input. While most current long-text tasks follow this assumption, future tasks, such as academic reports or script writing, may require long text generation. This limitation is not unique to SLED and affects other long-range transformers including LongT5 and LED. Aside from finetuning, this also affects pretraining models on long inputs with self-supervised losses such as span-corruption (Raffel et al., 2020b) or denoising (Lewis et al., 2020), which require the decoder to process an output that is linear in the length of the input.
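To make the assumption explicit, the decoder's per-example cost decomposes (under the notation of §3, with output length k) roughly as

$$\underbrace{O(k^2)}_{\text{output self-attention}} \;+\; \underbrace{O\big((m+n)\cdot k\big)}_{\text{cross-attention}},$$

which is linear in the input length n only when the output length k is treated as a constant.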
Co-reference Resolution and Fact Retention
An assumption at the heart of SLED is the Locality of information assumption. When the input text is long, this assumption may break if distant entity resolution or factual knowledge is required. For example, a chapter in a book may mention “they were walking into the room” when the knowledge of which room, or who was walking, appears a few chapters back. In such cases, the encoder used by SLED will not be able to access this information, moving more responsibility to the decoder and reducing the effectiveness of the contextual encoding. Similarly, in multi-hop questions (Yang et al., 2018), attending to one part of the context is necessary in order to fully understand the question and correctly encode a second piece of information. As the encoder will not have access to the first context that leads to better question understanding, here as well more responsibility is delegated to the decoder.
8 Conclusions
In this work we present SLED, a simple approach for modeling long texts that slides a pretrained short-range encoder over a long input document and then generates an output by attending to the encoded tokens. We show SLED can perform core operations that are important for long text understanding, such as finding relevant pieces of information and fusing them at decoding time, and demonstrate competitive performance on the SCROLLS benchmark compared to larger models and models that employ a dedicated and expensive pretraining step.
One of SLED’s most attractive features is that it can be readily used with any short-range pretrained LM. Thus, any future encoder-decoder model can be flexibly plugged into it to achieve further gains in performance on SCROLLS, some of its tasks, or any other long-range task.
We open-source SLED and hope it encourages the research community to easily extend existing models to longer inputs and push the boundaries of natural language understanding models’ applicability in real-world use cases.
Acknowledgments
This research was partially supported by The Yandex Initiative for Machine Learning, the Shashua Fellowship, the Len Blavatnik and the Blavatnik Family Foundation, and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC DELPHI 802800). We would also like to thank our action editor and the anonymous reviewers for their insightful suggestions and feedback. This work was completed in partial fulfillment of the Ph.D. degree requirements of the first author.
A SLED Implementation Details
While §3 details SLED’s method, it leaves out the treatment of edge tokens for brevity. Encoding the first and last input tokens requires special attention, as they lack bidirectional context. To preserve as much commonality between chunks as possible, all of the first chunk’s tokens are considered its effective chunk tokens. To account for the final tokens, the last chunk always starts at token tn−c+1, so that it contains exactly c tokens, and its effective chunk tokens are defined as all tokens that were not part of any previous effective chunk.
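The boundary rule can be sketched as follows (0-based indices, whereas the text above uses 1-based indexing; the released implementation may differ in minor details).

```python
# Sketch of SLED's chunk boundaries with edge handling (0-based indices): the
# first chunk's effective tokens include its left border, and the last chunk is
# shifted back so it spans a full window of c tokens.
def chunk_boundaries(n, c=256, rho=0.5):
    pad = int(rho * c / 2)
    effective = c - 2 * pad
    bounds, covered, start = [], 0, 0
    while True:
        if start + c >= n:                        # last chunk: force a full window
            start = max(0, n - c)
            bounds.append((start, n, covered, n))  # effective = all uncovered tokens
            return bounds                          # (chunk_start, chunk_end, eff_start, eff_end)
        eff_end = start + pad + effective
        bounds.append((start, start + c, covered, eff_end))
        covered = eff_end
        start += effective

# e.g., chunk_boundaries(10, c=4, rho=0.5) -> [(0, 4, 0, 3), (2, 6, 3, 5), (4, 8, 5, 7), (6, 10, 7, 10)]
```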
B Chunking vs. Local-attention
Both LED and SLED are long-range models built on top of the same short-text model (BART), and both employ local attention. However, SLED relies on chunking, while LED uses per-layer local attention. In this section, we discuss the relation between the two approaches in more detail.
Implementation
One of SLED’s biggest advantages is that it is agnostic to the backbone encoder-decoder model, and can extend any existing model without additional implementation overhead. In contrast, the attention mechanism in Longformer, and subsequently LED, was implemented by Beltagy et al. (2020) with a specialized CUDA kernel that is coupled to the architecture and implementation of BART. This makes LED more efficient, but extending it to new architectures incurs significant engineering overhead, because LED uses a “diagonal” local-window attention across layers, for which a naïve implementation is inefficient. Conversely, SLED uses chunking, which allows it to simply wrap an existing encoder-decoder model.
Contextualization
The most significant difference between LED and SLED from a conceptual point of view is their contextualization mechanism. While SLED splits the input into (overlapping) chunks and encodes each of them independently, LED performs local attention per-layer. This results in an effective receptive field that grows linearly with the encoder depth, potentially allowing it to perform more “global” contextualization. Our results in §4 suggest that such global contextualization is beneficial, and a similar conclusion can be reached when observing that LEDbase, which uses all prefix tokens as global tokens, outperforms LEDbaseSCROLLS, which uses only a single token for global contextualization.
Positional Information
SLED’s chunking mechanism means that it utilizes the positional encoding of the underlying model independently in each chunk, and is thus agnostic to the positional embedding technique used by the backbone model. Moreover, it potentially allows SLED to generalize to arbitrary input lengths. In contrast, LED utilizes BART’s absolute embeddings, duplicating them 16 times to support 16K-long sequences. This limits its ability to generalize to longer inputs, and potentially requires significant amounts of long-input samples to properly tune those new parameters (Shaham et al., 2022). This is evident in Table 1 when comparing the test scores of LEDbase against BARTbase-SLED while considering the number of training samples. In NarrativeQA and GovReport, which contain ∼71K and ∼19K samples respectively, LED is comparable to SLED and even slightly outperforms it on some metrics. In ContractNLI (∼10K examples), it does slightly worse. In all other datasets, where the training data is small, LED is significantly worse than SLED.
Complexity
We analyzed the complexity of SLED’s encoder in §3, which is O(n · c). A similar analysis of LED yields that in each layer, LED considers windows of length c, where in each window only the middle token attends to its local neighborhood, resulting in O(n · c) memory complexity as well. However, due to SLED’s use of overlap and full self-attention within each chunk, SLED’s encoding may require up to 2x more memory than LED when ρ = 0.5.
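A rough per-layer memory comparison makes the 2x factor explicit; this is a sketch under the notation above, ignoring the prefix and edge chunks:

$$\text{SLED: } \frac{n}{(1-\rho)\,c}\cdot O(c^2) = O\!\left(\frac{n\,c}{1-\rho}\right), \qquad \text{LED: } O(n\cdot c),$$

so with $\rho = 0.5$ the overlapping chunks roughly double the number of attended positions per token compared to LED's sliding window.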
C Experimental Details
Our experimental setup is based on the official SCROLLS repository.8 The dataset inputs and splits remained as suggested by the authors of SCROLLS, as did the suggested number of epochs per dataset. To perform model selection, for each model-dataset pair we finetuned 9 models with linear learning rate scheduling and the AdamW optimizer with default settings, setting the learning rate to one of {2e−5, 5e−5, 1e−4} and the effective batch size to one of {8, 16, 32}. Warmup was fixed at 10% and weight decay at 0.01. All code, data, Python environment requirements, hyperparameters, and scripts required to reproduce our results are available at https://github.com/Mivg/SLED.
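For reference, one point of this grid can be expressed with the Transformers library as follows; the argument names are standard Seq2SeqTrainingArguments, while the output path and the specific values chosen per dataset are illustrative.

```python
# One point of the hyperparameter grid described above, expressed with the
# Transformers Seq2SeqTrainingArguments; values are illustrative.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="outputs/bart-base-sled",     # illustrative path
    learning_rate=5e-5,                      # one of {2e-5, 5e-5, 1e-4}
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,           # effective batch size in {8, 16, 32}
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.01,                       # AdamW with default settings
    predict_with_generate=True,
)
```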
Notes
We assume the prefix length m ≪ n, and thus its effect on the asymptotic complexity is negligible.
We only consider paragraphs that are not within the gold document and do not contain the gold answer.
All examples have ≤1,024 tokens, including the prefix.
We add masked padding after the prefix to ensure chunking of the document remains identical.
For ContractNLI, the length of over 95% of the tokenized examples is less than 8K.